NER

NER class

Methods

defineConfig

defineConfig(config) → {object}

Defines the criteria for ignoring one or more tokens during entity detection. The criteria are specified as arrays of tags and/or values to ignore: if a token's tag or value matches any listed entry, the token is skipped and its value is not considered during entity recognition.

For example, including punctuation in the array of tags to ignore causes punctuation tokens such as - or . to be skipped. As a result, kg and k.g. are both recognized as kg (kilogram symbol), and Guinea-Bissau and Guinea Bissau are both recognized as Guinea-Bissau (a country in West Africa).
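The effect of these two arrays can be pictured with a small stand-alone filter. This is only a sketch of the idea, not wink-ner's internal code: the function name dropIgnoredTokens is hypothetical, and the tokens are assumed to have the { value, tag } shape produced by wink-tokenizer.

```javascript
// Illustrative only: a stand-alone filter, not wink-ner's internal code.
// Tokens are assumed to look like { value: '...', tag: '...' }.
function dropIgnoredTokens( tokens, config ) {
  const tags = new Set( config.tagsToIgnore || [] );
  const values = new Set( config.valuesToIgnore || [] );
  // A token is skipped if either its tag or its value is listed.
  return tokens.filter( ( t ) => !tags.has( t.tag ) && !values.has( t.value ) );
}

const tokens = [
  { value: 'Guinea', tag: 'word' },
  { value: '-', tag: 'punctuation' },
  { value: 'Bissau', tag: 'word' }
];

const kept = dropIgnoredTokens( tokens, { tagsToIgnore: [ 'punctuation' ] } );
console.log( kept.map( ( t ) => t.value ).join( ' ' ) );
// -> Guinea Bissau
```

With the hyphen dropped, Guinea-Bissau and Guinea Bissau reduce to the same sequence of words, which is why both forms can match a single learned entity.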

Example
// Do not ignore anything!
myNER.defineConfig( { tagsToIgnore: [], ignoreDiacritics: false } );
// -> { tagsToIgnore: [], valuesToIgnore: [], ignoreDiacritics: false }

// Ignore only '-' and '.'
myNER.defineConfig( {
  tagsToIgnore: [],
  valuesToIgnore: [ '-', '.' ],
  ignoreDiacritics: false
} );
// -> {
//      tagsToIgnore: [],
//      valuesToIgnore: [ '-', '.' ],
//      ignoreDiacritics: false
//    }
Parameters
Name Type Description
config object

Defines the values and/or tags to be ignored during entity detection. Note that a token is ignored if it matches an entry in either array.

An empty config object is equivalent to setting the default configuration.

The table below details the properties of the config object:

Properties
Name Type Attributes Default Description
valuesToIgnore Array.<string> <optional>

contains values to be ignored.

tagsToIgnore Array.<string> <optional>
[ 'punctuation' ]

contains tags to be ignored. Duplicate and invalid tags, if any, are ignored. Note: number and word tags can never be ignored.

ignoreDiacritics boolean <optional>
true

true ensures that diacritic marks are ignored, whereas false ensures that they are preserved.

Throws
  • if valuesToIgnore is not an array of strings.

    Type
    error
  • if tagsToIgnore is not an array of strings.

    Type
    error
Returns

a copy of the defined configuration.

Type
object

exportJSON

exportJSON() → {json}

Exports the learnings generated by learn() as JSON. The result may be saved to a file and loaded later, via importJSON(), for entity recognition.

Example
var learnings = myNER.exportJSON();
Returns

the learnings in JSON format.

Type
json
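Saving the exported learnings to disk could look like the following sketch. It assumes the exported value is a JSON string; the helper names saveLearnings and loadLearnings and the file name are hypothetical, and a placeholder string stands in for the real output of exportJSON() so that the snippet runs on its own.

```javascript
const fs = require( 'fs' );
const os = require( 'os' );
const path = require( 'path' );

// Hypothetical helpers; these names are not part of wink-ner.
function saveLearnings( filePath, json ) {
  fs.writeFileSync( filePath, json, 'utf8' );
}

function loadLearnings( filePath ) {
  return fs.readFileSync( filePath, 'utf8' );
}

// Stand-in for the value returned by myNER.exportJSON(); the real content
// is produced by the library and is assumed here to be a JSON string.
const learnings = JSON.stringify( { placeholder: true } );

const file = path.join( os.tmpdir(), 'ner-learnings.json' );
saveLearnings( file, learnings );
const restored = loadLearnings( file );
// `restored` could then be passed to myNER.importJSON( restored ).
```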

importJSON

importJSON(json) → {boolean}

Imports NER learnings that were previously exported via exportJSON().

Example
var ner = require( 'wink-ner' );
var myNER = ner();
// Assuming that `json` has valid learnings.
myNER.importJSON( json );
Parameters
Name Type Description
json json

contains learnings exported earlier in JSON format.

Throws

if invalid JSON is encountered.

Type
error
Returns

always true.

Type
boolean
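Because importJSON() throws on invalid JSON, callers that read learnings from an untrusted file may want to guard the call. The wrapper below is a hypothetical sketch: tryImport is not part of wink-ner, and a stub object stands in for a real recognizer so that the snippet is self-contained.

```javascript
// Hypothetical wrapper; `recognizer` is any object exposing importJSON().
function tryImport( recognizer, json ) {
  try {
    return recognizer.importJSON( json );
  } catch ( err ) {
    console.error( 'Could not import learnings: ' + err.message );
    return false;
  }
}

// Stub standing in for a real recognizer so the sketch runs on its own.
// It is NOT wink-ner; it only mimics "throws on invalid JSON, else true".
const stubNER = {
  importJSON: function ( json ) {
    JSON.parse( json );
    return true;
  }
};

const ok = tryImport( stubNER, '{"a":1}' );
const failed = tryImport( stubNER, 'not json' );
```

Here ok is true and failed is false; with a real recognizer the same wrapper would turn the thrown error into a false return value.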

learn

learn(entities) → {number}

Learns the entities that must be detected by recognize()/predict() API calls in a sentence that has already been tokenized, either using wink-tokenizer or in a form that follows its token format.

It can be used to learn or update learnings incrementally, but it cannot be used to unlearn or delete entities.

If duplicate entity definitions are encountered, all the entries except the last one are ignored.
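The "last definition wins" rule can be sketched as follows. This is an illustration only: it assumes duplicates are identified by identical text, which may be simpler than what the library actually does, and dedupeKeepLast is a hypothetical name.

```javascript
// Illustrative only: keeps the last definition seen for each text.
// Assumes duplicates are identified by an identical `text` property.
function dedupeKeepLast( entities ) {
  const byText = new Map();
  // A later set() on the same key overwrites the earlier definition.
  entities.forEach( ( e ) => byText.set( e.text, e ) );
  return Array.from( byText.values() );
}

const deduped = dedupeKeepLast( [
  { text: 'manchester', entityType: 'club' },
  { text: 'manchester', entityType: 'city' }
] );
console.log( deduped.length, deduped[ 0 ].entityType );
// -> 1 city
```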

Acronyms must be added with a space between each character; for example, USA should be added as 'u s a'. This ensures correct detection of U S A, U. S. A. or U.S.A. as USA [Refer to the example below].
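When acronyms come from an existing list, the spaced form can be produced with a small helper. The function below is a hypothetical convenience, not part of wink-ner, and it only handles ASCII letters.

```javascript
// Hypothetical helper, not part of wink-ner: turns an acronym into the
// space-separated, lower-case form suggested above.
function spaceOutAcronym( acronym ) {
  return acronym
    .replace( /[^A-Za-z]/g, '' ) // drop dots, spaces and other non-letters
    .toLowerCase()
    .split( '' )
    .join( ' ' );
}

console.log( spaceOutAcronym( 'USA' ) );    // -> u s a
console.log( spaceOutAcronym( 'U.S.A.' ) ); // -> u s a
```

The result can be used as the text property of an entity definition, for example { text: spaceOutAcronym( 'USA' ), entityType: 'country' }.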

Example
var trainingData = [
  { text: 'manchester united', entityType: 'club', uid: 'manu' },
  { text: 'manchester', entityType: 'city' },
  { text: 'U K', entityType: 'country', uid: 'uk' }
];
myNER.learn( trainingData );
// -> 3
Parameters
Name Type Description
entities Array.<object>

Each element defines an entity via two mandatory properties, viz. text and entityType, as described later. Note that an element that is not an object, or that does not contain the mandatory properties, is ignored.

In addition to these two properties, you may optionally define two more properties viz. uid and value, as described in the table below.

Note: Apart from the above-mentioned properties, you may also define additional properties. Such properties, along with their values, are copied to the output token as-is for consumption by any downstream code in the NLP pipeline. An example use case is POS tagging: you can define a pos property in an entity definition, as in { text: 'manchester united', entityType: 'club', pos: 'NNP' }. The wink-pos-tagger will automatically use the pos property (if available) to ensure correct tagging in your context by overriding its algorithm.

Properties
Name Type Attributes Description
text string

text that must be detected as an entity; it may consist of more than one word, for example, India or United Kingdom.

entityType string

type of the entity; for example country

uid string <optional>

unique id for the entity; an example use case is using the uid to look up more properties of the entity in a database. If it is undefined, it is generated automatically by joining the key words of the detected entity with an underscore (_). For example, 'india' or 'united_kingdom'.

value string <optional>

assigned to the value property of the token; if undefined, then for single-word entities it is equal to the value of the token, and for multi-word entities it is generated automatically by joining the key words of the entity with a space character. For example, 'india' or 'united kingdom'.
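The default uid and the default multi-word value described in the table can be mimicked with two one-line helpers. This is a sketch under assumptions: key words are taken to be the lower-cased, whitespace-separated words of text, and the helper names are hypothetical; the library's own normalization may differ in detail.

```javascript
// Illustrative only; wink-ner's own normalization may differ in detail.
// Assumption: "key words" are the lower-cased, whitespace-separated words.
function keyWords( text ) {
  return text.toLowerCase().split( /\s+/ ).filter( Boolean );
}

// Default uid: key words joined by underscore.
function defaultUid( text ) {
  return keyWords( text ).join( '_' );
}

// Default value for a multi-word entity: key words joined by space.
// (For a single-word entity the token's own value is kept instead.)
function defaultMultiWordValue( text ) {
  return keyWords( text ).join( ' ' );
}

console.log( defaultUid( 'India' ) );                     // -> india
console.log( defaultUid( 'United Kingdom' ) );            // -> united_kingdom
console.log( defaultMultiWordValue( 'United Kingdom' ) ); // -> united kingdom
```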

Returns

number of entities actually learned.

Type
number

recognize

recognize(tokens) → {Array.<object>}

Recognizes entities in the input tokens. Any token that is recognized as an entity automatically receives the properties that were defined for that entity using learn(). If a set of tokens is recognized together as a single entity, the tokens are merged into a single token; the merged token's value property becomes the concatenation of the values of the merged tokens, separated by a space.
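The merge step can be pictured with the following sketch. It is not wink-ner's code: mergeTokens is a hypothetical name, the entity object stands for whatever properties were supplied to learn(), and any normalization the library applies to the merged value (such as lower-casing) is deliberately left out.

```javascript
// Illustrative only, not wink-ner's code. `entity` carries the properties
// defined via learn(); `tokens` are the tokens matched as one entity.
function mergeTokens( tokens, entity ) {
  // Copy the entity's properties, then set the joined value.
  return Object.assign( {}, entity, {
    value: tokens.map( ( t ) => t.value ).join( ' ' )
  } );
}

const merged = mergeTokens(
  [ { value: 'Manchester', tag: 'word' }, { value: 'United', tag: 'word' } ],
  { entityType: 'club', uid: 'manu' }
);
console.log( merged.value );
// -> Manchester United
```

The merged token carries entityType and uid from the entity definition alongside the joined value.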

Example
// Use wink tokenizer.
var winkTokenizer = require( 'wink-tokenizer' );
// Instantiate it and use tokenize() api.
var tokenize = winkTokenizer().tokenize;
var tokens = tokenize( 'Manchester United is a professional football club based in Manchester, U. K.' );
// Detect entities.
myNER.recognize( tokens );
// -> [
//      { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ], value: 'manchester united', tag: 'word' },
//      { value: 'is', tag: 'word' },
//      { value: 'a', tag: 'word' },
//      { value: 'professional', tag: 'word' },
//      { value: 'football', tag: 'word' },
//      { value: 'club', tag: 'word' },
//      { value: 'based', tag: 'word' },
//      { value: 'in', tag: 'word' },
//      { value: 'Manchester', tag: 'word', originalSeq: [ 'Manchester' ], uid: 'manchester', entityType: 'city' },
//      { value: ',', tag: 'punctuation' },
//      { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ], value: 'u k', tag: 'word' },
//      { value: '.', tag: 'punctuation' }
//    ]
Parameters
Name Type Description
tokens Array.<object>

tokenized either using wink-tokenizer or in a form that follows its standards.

Returns

updated tokens with entities tagged.

Type
Array.<object>

reset

reset() → {boolean}

Resets the named entity recognizer by re-initializing all the learnings and by setting the configuration to default.

Example
myNER.reset( );
// -> true
Returns

always true.

Type
boolean