Custom Entities
Custom entities are supported by two methods: learnCustomEntities(), which defines application-specific custom entities, and customEntities(), which accesses the custom entities detected during readDoc() execution.
The customEntities() method offers the same API methods that entities() does. With learnCustomEntities() you can define your own custom entities in terms of patterns. A pattern consists of literal words (a single token, or a phrase spanning multiple tokens), entity types, part-of-speech tags, or any combination of these. In a pattern, a literal word or a part-of-speech tag matches exactly one token, whereas an entity type may match one or more tokens.
The learnCustomEntities() method must be called before readDoc(). The following example illustrates how a single token and a phrase are matched:
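// Setup assumed by the examples on this page: winkNLP loaded with an English model.
const winkNLP = require('wink-nlp');
const model = require('wink-eng-lite-web-model');
const nlp = winkNLP(model);
const its = nlp.its;

const text = 'Manchester United is a football club based in Manchester.';
const patterns = [
  { name: 'club', patterns: [ 'manchester united' ] },
  { name: 'city', patterns: [ 'manchester' ] }
];
// Patterns must be learned before the document is read.
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out(its.detail);
// -> [ { value: 'Manchester United', type: 'club' },
//      { value: 'Manchester', type: 'city' } ]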
Here is another example to extract adjective-noun pairs from a text:
const text = 'The quick brown fox jumps over the lazy dog.';
const patterns = [
  { name: 'adjectiveNounPair', patterns: [ 'ADJ NOUN' ] }
];
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out();
// -> [ 'brown fox', 'lazy dog' ]
Shorthand patterns
Let us say we wish to extract noun phrases from the following text:
const text = `Each time we gather to inaugurate a President we bear witness to the enduring strength of our Constitution.`;
To keep things simple, we assume that a noun phrase takes one of the following forms:
const patterns = [
  {
    name: 'nounPhrase',
    patterns: [
      'NOUN',
      'PROPN',
      'DET NOUN',
      'DET PROPN',
      'ADJ NOUN',
      'ADJ PROPN',
      'DET ADJ NOUN',
      'DET ADJ PROPN'
    ]
  }
];
The above list could be transformed into the following shorthand pattern:
const patterns = [
  {
    name: 'nounPhrase',
    patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ]
  }
];
In shorthand patterns:
- Options are listed between opening and closing square brackets.
- Options are separated by a vertical pipe character, as in [NOUN|PROPN].
- There cannot be any space character within the square brackets.
- An option may be empty, as in the first two option lists: [|DET] and [|ADJ].
- Successive option lists must be separated by one or more spaces, as in [|DET] [|ADJ] [NOUN|PROPN].
- All patterns are generated automatically by forming every possible combination of options. Here each list has two options, giving 2 x 2 x 2 = 8 combinations, which are exactly the eight patterns of the longer list shown earlier.
Given below is the complete code along with the output:
const text = `Each time we gather to inaugurate
a President we bear witness to the enduring strength
of our Constitution.`;
const patterns = [
  {
    name: 'nounPhrase',
    patterns: [ '[|DET] [|ADJ] [NOUN|PROPN]' ]
  }
];
nlp.learnCustomEntities(patterns);
const doc = nlp.readDoc(text);
doc.customEntities().out();
// -> [ 'Each time',
// 'a President',
// 'witness',
// 'the enduring strength',
// 'Constitution' ]
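To make the expansion rule concrete, here is a minimal sketch in plain JavaScript that turns a shorthand pattern into its full list of patterns. The function name expandShorthand is hypothetical and is not part of winkNLP; the library performs this expansion internally, and its actual implementation may differ.

```javascript
// Hypothetical helper (not part of winkNLP): expands a shorthand pattern
// into the plain patterns it stands for.
function expandShorthand(shorthand) {
  // Option lists are separated by one or more spaces.
  const parts = shorthand.trim().split(/\s+/);
  // Start with a single empty combination and extend it list by list.
  let combinations = [[]];
  for (const part of parts) {
    const isOptionList = part.startsWith('[') && part.endsWith(']');
    // "[|DET]" yields ['', 'DET']; a plain element yields itself.
    const options = isOptionList ? part.slice(1, -1).split('|') : [part];
    const next = [];
    for (const prefix of combinations) {
      for (const option of options) {
        // An empty option adds nothing to the combination.
        next.push(option === '' ? prefix : [...prefix, option]);
      }
    }
    combinations = next;
  }
  return combinations.map((elements) => elements.join(' '));
}

const expanded = expandShorthand('[|DET] [|ADJ] [NOUN|PROPN]');
console.log(expanded);
// -> [ 'NOUN', 'PROPN', 'ADJ NOUN', 'ADJ PROPN',
//      'DET NOUN', 'DET PROPN', 'DET ADJ NOUN', 'DET ADJ PROPN' ]
```

The result contains the same eight patterns as the explicit list, although in a different order.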
Escaping
To match a token whose literal value coincides with an entity type or part-of-speech name, such as DATE (entity) or NOUN (part-of-speech), prefix the literal with a caret sign. For example, the pattern 'DATE' matches a sequence of tokens representing a date (e.g. August 29, 1961), whereas the pattern '^DATE' matches a token whose value is ‘DATE’. The caret plays the same role as the backslash in JavaScript strings. Similarly, to match a literal caret sign you need to escape it too: '^^' matches a token whose value is ‘^’.
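When patterns are built from arbitrary words, the escaping rule can be applied programmatically. The sketch below is only an illustration: escapeLiteral and the RESERVED set are hypothetical names, and the set lists just a few type names for the demo rather than every entity and part-of-speech type the library recognizes.

```javascript
// Assumption: a small, non-exhaustive sample of names that a pattern would
// otherwise interpret as entity or part-of-speech types.
const RESERVED = new Set(['DATE', 'TIME', 'MONEY', 'NOUN', 'PROPN', 'ADJ', 'DET', 'VERB']);

// Hypothetical helper (not part of winkNLP): returns the pattern element
// that matches `token` literally, adding a caret only where it is needed.
function escapeLiteral(token) {
  const needsEscape = RESERVED.has(token) || token === '^';
  return needsEscape ? '^' + token : token;
}

// Build a pattern that should match the literal words "DATE of birth".
const literalWords = ['DATE', 'of', 'birth'];
const escapedPattern = literalWords.map((word) => escapeLiteral(word)).join(' ');

const escapedPatterns = [
  { name: 'dateOfBirthLabel', patterns: [ escapedPattern ] }
];
console.log(escapedPatterns[0].patterns[0]);
// -> '^DATE of birth'
```

The resulting array has the shape expected by learnCustomEntities(); the entity name dateOfBirthLabel is made up for this example.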
Match sequence
During the detection phase, matching is attempted in the following sequence:
1. Entity types are matched first.
2. If there is no match at step 1, a token value match is attempted.
3. If there is no match at step 2, the token’s part-of-speech is matched last.
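The order of precedence described above can be sketched as a small function. This is only an illustration of the sequence, not winkNLP's implementation: the function matchElement and the token shape { normal, pos, entityType } are assumptions made for the example, and caret escaping is ignored.

```javascript
// Hypothetical token shape: { normal, pos, entityType }, where `normal` is
// the lower-cased token value and `entityType` is undefined for tokens
// that are not part of a detected entity.
function matchElement(element, token) {
  // Step 1: entity types are matched first.
  if (token.entityType !== undefined && token.entityType === element) {
    return 'entity';
  }
  // Step 2: otherwise, try the token value.
  if (token.normal === element) {
    return 'value';
  }
  // Step 3: as a last resort, try the part-of-speech.
  if (token.pos === element) {
    return 'pos';
  }
  return null;
}

const august = { normal: 'august', pos: 'PROPN', entityType: 'DATE' };
const manchester = { normal: 'manchester', pos: 'PROPN', entityType: undefined };

console.log(matchElement('DATE', august));           // -> 'entity'
console.log(matchElement('manchester', manchester)); // -> 'value'
console.log(matchElement('PROPN', manchester));      // -> 'pos'
console.log(matchElement('VERB', manchester));       // -> null
```

Returning the kind of match makes the precedence visible: a part-of-speech match is reported only when neither the entity type nor the token value matched.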