Custom Entities

Custom entities are supported by two methods: learnCustomEntities(), which defines application-specific custom entities, and customEntities(), which gives access to the custom entities detected during readDoc() execution.

The customEntities() method offers the same API methods that entities() does.

With learnCustomEntities() you can define your own custom entities in terms of patterns. A pattern may consist of a single token, a phrase of multiple words (i.e. multiple tokens), an entity type, a part-of-speech tag, or any combination of these. In a pattern, a literal token or a part-of-speech tag matches exactly one token, whereas an entity type may match one or more tokens.

The learnCustomEntities() method must be called before readDoc().

The following example illustrates how a single token and a phrase are matched:

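// Setup, assuming the wink-eng-lite-web-model language model is installed.
const winkNLP = require('wink-nlp');
const model = require('wink-eng-lite-web-model');
const nlp = winkNLP(model);
const its = nlp.its;

const text = 'Manchester United is a football club based in Manchester.';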
const patterns = [
  { name: 'club', patterns: [ 'manchester united' ] },
  { name: 'city', patterns: [ 'manchester' ] }
];
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out(its.detail);
// -> [ { value: 'Manchester United', type: 'club' },
//      { value: 'Manchester', type: 'city' } ]

Matching is greedy: when more than one pattern matches, the longest match is preferred.
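The longest-match preference can be pictured with a small stand-alone sketch. It is not winkNLP's internal code: the function name findLongestMatches and the hand-tokenized input are assumptions made for illustration, and only literal (lowercase) patterns are handled.

```javascript
// Illustrative sketch only: scans tokens left to right and, at each position,
// keeps the longest literal pattern that matches. Not winkNLP's implementation.
function findLongestMatches(tokens, definitions) {
  const matches = [];
  let i = 0;
  while (i < tokens.length) {
    let best = null;
    for (const { name, patterns } of definitions) {
      for (const pattern of patterns) {
        const words = pattern.split(' ');
        const candidate = tokens.slice(i, i + words.length);
        const isMatch = candidate.length === words.length &&
          words.every((word, k) => word === candidate[k].toLowerCase());
        // Prefer the pattern that covers the most tokens.
        if (isMatch && (best === null || words.length > best.length)) {
          best = { name, length: words.length };
        }
      }
    }
    if (best === null) {
      i += 1; // nothing matched here; move to the next token
    } else {
      const value = tokens.slice(i, i + best.length).join(' ');
      matches.push({ value, type: best.name });
      i += best.length; // skip past the tokens just consumed
    }
  }
  return matches;
}

// Hand-tokenized version of the example sentence (an assumption of this sketch).
const tokens = ['Manchester', 'United', 'is', 'a', 'football', 'club',
  'based', 'in', 'Manchester', '.'];
const definitions = [
  { name: 'club', patterns: ['manchester united'] },
  { name: 'city', patterns: ['manchester'] }
];
const longestMatches = findLongestMatches(tokens, definitions);
console.log(longestMatches);
// -> [ { value: 'Manchester United', type: 'club' },
//      { value: 'Manchester', type: 'city' } ]
```

The first 'Manchester' is consumed by the two-token club pattern, so the shorter city pattern only gets to match the second occurrence.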

Here is another example to extract adjective-noun pairs from a text:

const text = 'The quick brown fox jumps over the lazy dog.';
const patterns = [
  { name: 'adjectiveNounPair', patterns: [ 'ADJ NOUN' ] }
];
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out();
// -> [ 'brown fox', 'lazy dog' ]

Note that the part-of-speech tags in patterns follow the Universal POS tags standard and are therefore always in UPPERCASE. Entity type names are also in UPPERCASE; examples include DATE, DURATION, and EMAIL. For a complete list of entity types and POS tags, refer to the documentation of the language model in use.

Shorthand patterns

Let us say we wish to extract noun phrases from the following text:

const text = `Each time we gather to inaugurate a President we bear witness to the enduring strength of our Constitution.`;

To keep things simple, assume that a noun phrase takes one of the following forms:

const patterns = [
  {
    name: 'nounPhrase',
    patterns: [
      'NOUN',
      'PROPN',
      'DET NOUN',
      'DET PROPN',
      'ADJ NOUN',
      'ADJ PROPN',
      'DET ADJ NOUN',
      'DET ADJ PROPN'
    ]
  }
];

The above list could be transformed into the following shorthand pattern:

const patterns = [
  {
    name: 'nounPhrase',
    patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ]
  }
];

In shorthand patterns:

Options are listed between opening and closing square brackets
Options are separated by a vertical pipe character, as in [NOUN|PROPN]
There cannot be any space character within the square brackets
An option may be empty as in the case of the first two sets of options — [|DET] and [|ADJ]
Between successive options lists, there should be one or more spaces — [|DET] [|ADJ] [NOUN|PROPN]
All patterns are generated automatically by taking every possible combination of options; here each list has 2 options, giving 2 x 2 x 2 = 8 combinations, as shown in the previous pattern list.
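The expansion rules above can be sketched as a small stand-alone function. This is an illustration of the combinatorics, not winkNLP's internal code; the name expandShorthand is hypothetical, and the sketch does not handle caret escapes.

```javascript
// Illustrative sketch only: expands a shorthand pattern into the full list
// of patterns it stands for. Not winkNLP's implementation.
function expandShorthand(shorthand) {
  // Successive option lists are separated by one or more spaces.
  const elements = shorthand.trim().split(/\s+/);
  // Each element becomes a list of options: '[|DET]' -> ['', 'DET'].
  const optionLists = elements.map((element) => {
    const isOptionList = element.startsWith('[') && element.endsWith(']');
    return isOptionList ? element.slice(1, -1).split('|') : [element];
  });
  // Build the Cartesian product, one option list at a time.
  let combinations = [[]];
  for (const options of optionLists) {
    const next = [];
    for (const combo of combinations) {
      for (const option of options) {
        // An empty option contributes nothing to the pattern.
        next.push(option === '' ? combo : [...combo, option]);
      }
    }
    combinations = next;
  }
  return combinations.map((combo) => combo.join(' '));
}

const expanded = expandShorthand('[|DET] [|ADJ] [NOUN|PROPN]');
console.log(expanded.length); // -> 8
console.log(expanded);
// -> [ 'NOUN', 'PROPN', 'ADJ NOUN', 'ADJ PROPN',
//      'DET NOUN', 'DET PROPN', 'DET ADJ NOUN', 'DET ADJ PROPN' ]
```

The result contains the same eight patterns as the longhand list, in a different order.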

Given below is the complete code along with the output:

const text = `Each time we gather to inaugurate
a President we bear witness to the enduring strength
of our Constitution.`;
const patterns = [
  {
    name: 'nounPhrase',
    patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ]
  }
];
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out();
// -> [ 'Each time',
//      'a President',
//      'witness',
//      'the enduring strength',
//      'Constitution' ]

Escaping

To match literally against a word that is also an entity or part-of-speech type name, such as DATE (entity) or NOUN (part-of-speech), prefix the literal with a caret sign. For example, the pattern 'DATE' matches a sequence of tokens representing a date (e.g. August 29, 1961), whereas the pattern '^DATE' matches a token whose value is ‘DATE’. The caret plays the same role as the backslash escape in JavaScript strings. Likewise, to match a literal caret you must escape it too: '^^' matches a token whose value is ‘^’.
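One way to picture the escaping rule is a helper that decides whether a pattern element is a type name or a literal. The sketch below is hypothetical: interpretElement and the small KNOWN_TYPES set are invented for illustration and do not reflect how winkNLP parses patterns internally.

```javascript
// Hypothetical, deliberately small subset of entity and POS type names.
const KNOWN_TYPES = new Set(['DATE', 'DURATION', 'EMAIL', 'NOUN', 'ADJ']);

// Illustrative sketch only: classifies a single pattern element.
function interpretElement(element) {
  if (element.startsWith('^')) {
    // A leading caret escapes whatever follows, so the rest is a literal.
    return { kind: 'literal', value: element.slice(1) };
  }
  if (KNOWN_TYPES.has(element)) {
    // An unescaped type name refers to the entity or POS type.
    return { kind: 'type', value: element };
  }
  // Anything else is an ordinary literal token value.
  return { kind: 'literal', value: element };
}

console.log(interpretElement('DATE'));  // -> { kind: 'type', value: 'DATE' }
console.log(interpretElement('^DATE')); // -> { kind: 'literal', value: 'DATE' }
console.log(interpretElement('^^'));    // -> { kind: 'literal', value: '^' }
```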

Match sequence

During the detection phase, matching is attempted in the following sequence:

1. Entity types are matched first;
2. If there is no match at step #1, a token value match is attempted;
3. If no match is found at step #2, the token’s part-of-speech is matched last.
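The order of these checks can be sketched for a single pattern element and a single token. This is a simplification under stated assumptions, not winkNLP's internal code: the token shape { value, pos, entityType } and the function matchElement are hypothetical, the tokens are hand-written, token values are compared in lowercase, and an entity is treated as spanning just one token.

```javascript
// Illustrative sketch of the match order only; not winkNLP's implementation.
// The token shape { value, pos, entityType } is hypothetical.
function matchElement(element, token) {
  // Step 1: entity type.
  if (token.entityType === element) return 'entity';
  // Step 2: token value, compared in lowercase (an assumption of this sketch).
  if (token.value.toLowerCase() === element) return 'value';
  // Step 3: part-of-speech tag.
  if (token.pos === element) return 'pos';
  return null;
}

// Hand-written tokens for illustration.
const fox = { value: 'fox', pos: 'NOUN', entityType: null };
const today = { value: 'today', pos: 'NOUN', entityType: 'DATE' };

console.log(matchElement('DATE', today)); // -> entity
console.log(matchElement('fox', fox));    // -> value
console.log(matchElement('NOUN', fox));   // -> pos
console.log(matchElement('VERB', fox));   // -> null
```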

Custom entities can also serve as a fast way to search for patterns in a corpus.
