learnCustomEntities()

learnCustomEntities( patterns, config ) → {count of patterns}

This method defines the custom entities to be extracted from a winkNLP doc by pattern matching. It returns the count of patterns learned from its first argument, patterns. For example,

const patterns = [
  { name: 'nounPhrase', patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ] }
];
const count = nlp.learnCustomEntities( patterns );
console.log( count );
// -> 1

The learnCustomEntities() method must be called before readDoc(), so that custom entities are detected in the resulting document.

patterns

A pattern consists of any combination of literals and token annotations. The annotations are either system entities or POS tags. An annotation is always written in UPPERCASE.

A pattern can contain token annotations such as the POS tags NOUN, VERB and ADJ, or entity types such as URL. For example, using the POS tag ADJ in a pattern will match any adjective, such as blue, green, large, medium or fast.

A shorthand pattern allows a custom entity to be defined in a single text string. For example, a location can be defined as new [york|delhi|orleans|brunswick].

Shorthand patterns can also include token annotations, that is POS tags and entity types. The resulting entity will contain one or more grouped tokens.

In a pattern, a single token or part-of-speech tag matches with a single token, whereas an entity may match with one or more tokens.

1 large fries or One small fries is matched using the pattern sequence CARDINAL (entity), ADJ (POS tag), fries (literal).
Pickup at 5:30PM is matched using Pickup (literal), at (literal), TIME (entity).

During learning, one can also use the mark property, written as mark: [ begin index, end index ], to extract only a part of a matched pattern as the entity. Let’s take the example text: “My beautiful friend lives in a small cottage with her fluffy cats and playful dogs.” To extract all adjective-noun pairs, the pattern will be defined as follows:

const patterns = [
  { name: 'adjectiveNounPair', patterns: [ 'ADJ NOUN' ] }
];
nlp.learnCustomEntities( patterns );

This detects the pairs: 'beautiful friend', 'small cottage', 'fluffy cats', and 'playful dogs'. Now, if you want to extract only the adjectives used for cats or dogs, you can do that using mark:

const patterns = [
  { name: 'adjectiveAnimalPair', patterns: [ '[ADJ] [cats|dogs]' ], mark: [0, 0] }
];
nlp.learnCustomEntities( patterns );

This extracts: 'fluffy', 'playful'.

Mark also allows marking relative to the last element of the pattern by using negative indices. For example, if a pattern matches a fluffy cat, then mark: [-2, -1] will extract fluffy cat. This is especially useful when the match length is unknown.
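The index arithmetic behind mark can be sketched in plain JavaScript. The helper applyMark() below is hypothetical and only illustrates the two examples above; it is not a winkNLP function and makes no claim about how winkNLP resolves mark internally. It assumes that both indices are inclusive and that negative indices count back from the last matched token.

```javascript
// Illustrative helper, not a winkNLP API: resolves a mark pair against
// the tokens of one matched pattern and returns the marked tokens.
function applyMark(matchedTokens, mark) {
  const n = matchedTokens.length;
  // Assumption: negative indices count back from the end (-1 is the last token).
  const resolve = (i) => (i < 0 ? n + i : i);
  const begin = resolve(mark[0]);
  const end = resolve(mark[1]);
  // Both indices are treated as inclusive, so slice up to end + 1.
  return matchedTokens.slice(begin, end + 1);
}

console.log(applyMark(['fluffy', 'cats'], [0, 0]).join(' '));
// -> fluffy
console.log(applyMark(['a', 'fluffy', 'cat'], [-2, -1]).join(' '));
// -> fluffy cat
```

With a negative pair such as [-2, -1], the same mark keeps selecting the last two tokens even when optional elements make the match longer or shorter.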

Shorthand patterns

Let us say we wish to extract noun phrases from the following text:

const text = 'Each time we gather to inaugurate a President we bear witness to the enduring strength of our Constitution.';

To keep things simple, we assume that a noun phrase could be simply composed as:

const patterns = [
  {
    name: 'nounPhrase',
    patterns: [
      'NOUN',
      'PROPN',
      'DET NOUN',
      'DET PROPN',
      'ADJ NOUN',
      'ADJ PROPN',
      'DET ADJ NOUN',
      'DET ADJ PROPN'
    ]
  }
];

The above list could be transformed into the following shorthand pattern:

const patterns = [
  { name: 'nounPhrase', patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ] }
];

In shorthand patterns:

Options are listed between opening and closing square brackets
Each option is separated by a vertical pipe character, as in [NOUN|PROPN]
There cannot be any space character within the square brackets
An option may be empty, as in the case of the first two sets of options: [|DET] and [|ADJ]
Between successive option lists, there should be one or more spaces, as in [|DET] [|ADJ] [NOUN|PROPN]
All patterns are generated automatically by finding all possible combinations. Here there are 2 options in every list, resulting in 2 x 2 x 2 = 8 combinations, as shown in the previous list of patterns.
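The rules above can be illustrated with a small standalone sketch that expands a shorthand string into its combinations. The function expandShorthand() is a hypothetical helper written for this illustration only; it is not part of winkNLP and may differ from how winkNLP expands patterns internally.

```javascript
// Illustrative helper, not a winkNLP API: expands a shorthand pattern
// into the list of plain patterns it stands for.
function expandShorthand(pattern) {
  // Option lists are separated by one or more spaces.
  const optionLists = pattern.trim().split(/\s+/).map((part) => (
    (part.startsWith('[') && part.endsWith(']'))
      ? part.slice(1, -1).split('|') // '[|DET]' becomes [ '', 'DET' ]
      : [part]                       // a bare element is its own only option
  ));

  // Build the cartesian product, one option list at a time.
  let combos = [[]];
  for (const options of optionLists) {
    const next = [];
    for (const combo of combos) {
      for (const option of options) {
        // An empty option contributes nothing to the combination.
        next.push(option === '' ? combo : [...combo, option]);
      }
    }
    combos = next;
  }
  return combos.map((combo) => combo.join(' '));
}

const expanded = expandShorthand('[|DET] [|ADJ] [NOUN|PROPN]');
console.log(expanded.length);
// -> 8
console.log(expanded.includes('DET ADJ PROPN'));
// -> true
```

The eight strings produced are the same eight patterns listed in the longhand nounPhrase definition.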

Given below is the complete code along with the output:

const text = 'Each time we gather to inaugurate a President we bear witness to the enduring strength of our Constitution.';
const patterns = [
  { name: 'nounPhrase', patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ] }
];
nlp.learnCustomEntities( patterns );
const doc = nlp.readDoc( text );

doc.customEntities().out();
// -> [ 'Each time',
//      'a President',
//      'witness',
//      'the enduring strength',
//      'Constitution' ]

Escaping

If you need to match with literal ‘in the month of January’ instead of entity type DATE, then it must be escaped using a caret sign. This means ^January will match with literal token value January and not as a DATE entity. Similarly, in order to match literally with a part-of-speech tag like NOUN or VERB, you need to escape it too.

Note: If you want to match literally with a caret sign, you need to escape it too — ^^ will match literally with the token having a value ‘^’.
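As a rough illustration of the escaping rule, the sketch below classifies a single pattern element as either a literal or an annotation. Everything in it is an assumption made for the example: readPatternElement() is a hypothetical helper, not a winkNLP function, and the ANNOTATIONS set is a small made-up sample rather than the list of tags and entity types of any language model.

```javascript
// Hypothetical sample of annotation names, chosen only for this illustration.
const ANNOTATIONS = new Set(['DATE', 'TIME', 'CARDINAL', 'NOUN', 'VERB', 'ADJ']);

// Illustrative helper, not a winkNLP API: decides how one pattern
// element would be read under the caret escaping rule.
function readPatternElement(element) {
  if (element.startsWith('^')) {
    // A leading caret escapes the rest, which is then taken literally.
    return { kind: 'literal', value: element.slice(1) };
  }
  if (ANNOTATIONS.has(element)) {
    return { kind: 'annotation', value: element };
  }
  return { kind: 'literal', value: element };
}

console.log(readPatternElement('NOUN'));
// -> { kind: 'annotation', value: 'NOUN' }
console.log(readPatternElement('^NOUN'));
// -> { kind: 'literal', value: 'NOUN' }
console.log(readPatternElement('^^'));
// -> { kind: 'literal', value: '^' }
```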

config

The config object has three boolean parameters that control how a match is performed: matchValue, usePOS and useEntity. The default configuration is matchValue: false, useEntity: true, usePOS: true.

During the detection phase, the match is performed in the following sequence:

1. Entity types are matched first
2. In case of no match at step 1, token value match is attempted
3. If no match is found at step 2, then at last the token’s part-of-speech is matched
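The fallback order of these three steps can be sketched as a small standalone function. This is only an illustration under stated assumptions: matchStep() and the token objects with entityType, normal and pos fields are hypothetical, the literal comparison against a lower-cased normal is an assumption, and the config flags are left out to keep the sketch short. It is not winkNLP code.

```javascript
// Illustrative helper, not a winkNLP API: reports which of the three
// steps, if any, matches one pattern element against one token.
function matchStep(token, element) {
  // Step 1: entity type.
  if (token.entityType !== undefined && token.entityType === element) {
    return 'entity';
  }
  // Step 2: token value (assumption: compared via a lower-cased normal).
  if (token.normal === element.toLowerCase()) {
    return 'value';
  }
  // Step 3: part-of-speech.
  if (token.pos === element) {
    return 'pos';
  }
  return null;
}

// Hypothetical tokens for the text "Pickup at 5:30PM".
const pickup = { normal: 'pickup', pos: 'NOUN' };
const time = { normal: '5:30pm', pos: 'NUM', entityType: 'TIME' };

console.log(matchStep(time, 'TIME'));
// -> entity
console.log(matchStep(pickup, 'Pickup'));
// -> value
console.log(matchStep(pickup, 'NOUN'));
// -> pos
```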

Parameter	Value	Purpose
`matchValue`	`true`	Matches with the token’s value exactly as it appears in the document. Set `matchValue` to `true` only when this is required; by default the match is made with the normal of a token.
	`false` (default)	Value based match is ignored and match is performed on other two parameters if they are set to `true`.
`usePOS`	`true` (default)	Matches POS tags in the patterns with the POS tags of the tokens in the document being processed.
	`false`	Ignores the POS tags in the patterns being matched and attempts matching based on the other two parameters.
`useEntity`	`true` (default)	Matches the pattern using named entities found in the document. The annotation of system entities depends on the language model being used. If named entities are not matched, then the token value is matched. Entity types must be written in ALL CAPS in the patterns; otherwise every word in the pattern is taken as a literal value.
	`false`	Ignores the named entities in the document and treats them as literal values of words to be matched.

To understand how the patterns are defined, we take an example of pizza ordering intent:

I wish to order 1 small classic with corn topping and 2 large supreme with Olives, Onion topping.

The table below shows examples of patterns that can be made to extract custom entities from this text. These patterns consist of either a single token, a phrase having multiple words (i.e. tokens), their entity type, part-of-speech, or any combination of these. Each pattern is matched at the token level.

The construct allows you to define more than one value of a custom entity using a | sign.

Pattern combinations	Pattern examples
Words as literals	Pizza categories: `[Classic|Supreme|Extravaganza|Margherita]` → Classic, Supreme
[UPOS] [Words]	`[ADJ] [Classic|Extravaganza|Margherita]` → Large Classic, Medium Extravaganza, Small Margherita
[Named Entity] [UPOS] [Words]	`[CARDINAL] [Large|Small] [Classic|Margherita]` → 1 Large Classic, 2 Small Margherita
[Words] [Named Entity]	`[Delivery|Pickup] [at] [TIME]` → Delivery at 6 pm, Pickup at 5:30 pm
[Named Entity] [UPOS] [Words]	`[CARDINAL] [ADJ] [Fries|Coke]` → 1 Large Fries, 2 Medium Coke
[UPOS] [Words] [Named Entity]	`[ADJ] [Fries|Coke] [CARDINAL]` → Large Fries 1, Medium Coke 2

These combinations serve as a starting point to learn the required patterns for custom entity detection. Let’s look at the following pizza intent example to understand the learning process. Custom entities can be defined as an array of objects with patterns for learning.

Example:

const text = 'I wish to order 1 small classic with corn topping and 2 large supreme with Olives, Onion topping.';
const pizza = [
  { name: 'Category', patterns: [ '[Classic|Supreme|Extravaganza|Favorite]' ] },
  { name: 'Qty', patterns: [ 'CARDINAL' ] },
  { name: 'Topping', patterns: [ '[Corn|Capsicum|Onion|Peppers|Cheese|Jalapenos|Olives]' ] },
  { name: 'Size', patterns: [ '[Small|Medium|Large|Chairman|Wedge]' ] }
];

nlp.learnCustomEntities( pizza, {
  matchValue: false,
  usePOS: true,
  useEntity: true
} );