wink-ner - Wink JS

Methods

defineConfig

defineConfig(config) → {object}

Defines the criteria for ignoring one or more tokens during entity detection. The criteria are specified as arrays of specific tags and/or values to ignore: if a token matches any listed tag or value, it is ignored and its value is not considered during entity recognition.

For example, by including punctuation in the array of tags to ignore, tokens containing punctuation such as - or . are skipped. As a result, both kg and k.g. are recognized as kg (the kilogram symbol), and both Guinea-Bissau and Guinea Bissau are recognized as Guinea-Bissau (a country in West Africa).

Example

// Do not ignore anything!
myNER.defineConfig( { tagsToIgnore: [], ignoreDiacritics: false } );
// -> { tagsToIgnore: [], valuesToIgnore: [], ignoreDiacritics: false }

// Ignore only '-' and '.'
myNER.defineConfig( {
  tagsToIgnore: [],
  valuesToIgnore: [ '-', '.' ],
  ignoreDiacritics: false
} );
// -> {
//      tagsToIgnore: [],
//      valuesToIgnore: [ '-', '.' ],
//      ignoreDiacritics: false
//    }
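
The effect of the ignore criteria can be pictured as a filter applied to the tokens before matching. The following is a minimal sketch of that idea only; it is not wink-ner's implementation, and `dropIgnoredTokens` is a hypothetical helper introduced for illustration.

```javascript
// Illustration only: NOT wink-ner's internal code.
// `dropIgnoredTokens` is a hypothetical helper that mimics the documented rule:
// a token is skipped if its tag or its value appears in the ignore lists.
function dropIgnoredTokens(tokens, config) {
  // Per the properties table, `number` and `word` tags can never be ignored.
  const tags = new Set(
    (config.tagsToIgnore || []).filter((tag) => tag !== 'number' && tag !== 'word')
  );
  const values = new Set(config.valuesToIgnore || []);
  return tokens.filter((t) => !tags.has(t.tag) && !values.has(t.value));
}

const tokens = [
  { value: 'Guinea', tag: 'word' },
  { value: '-', tag: 'punctuation' },
  { value: 'Bissau', tag: 'word' }
];

// With punctuation ignored, 'Guinea-Bissau' reduces to the same key words
// as 'Guinea Bissau'.
const keyWords = dropIgnoredTokens(tokens, { tagsToIgnore: [ 'punctuation' ] })
  .map((t) => t.value);
console.log(keyWords);
// -> [ 'Guinea', 'Bissau' ]
```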

Parameters

Name	Type	Description
config	object	defines the values and/or tags to be ignored during entity detection. Note that if a match occurs in either array, the token is ignored.

An empty config object is equivalent to setting the default configuration.

The table below details the properties of the config object:

Properties

Name	Type	Attributes	Default	Description
valuesToIgnore	Array.<string>	<optional>		contains values to be ignored.
tagsToIgnore	Array.<string>	<optional>	[ 'punctuation' ]	contains tags to be ignored. Duplicate and invalid tags, if any, are ignored. Note: `number` and `word` tags can never be ignored.
ignoreDiacritics	boolean	<optional>	true	`true` ensures that diacritic marks are ignored, whereas `false` ensures that they are not.

Throws

if valuesToIgnore is not an array of strings.

Type: error

if tagsToIgnore is not an array of strings.

Type: error

Returns

a copy of the configuration defined.

Type: object

exportJSON

exportJSON() → {json}

Exports the learnings generated by learn() as JSON, which may be saved to a file and used later for NER.

Example

var learnings = myNER.exportJSON();

Returns

JSON of the learnings.

Type: json

importJSON

importJSON(json) → {boolean}

Imports NER learnings that were previously exported via exportJSON().

Example

var myNER = ner();
// Assuming that `json` has valid learnings.
myNER.importJSON( json );

Parameters

Name	Type	Description
json	json	containing earlier exported learnings in JSON format.

Throws

if invalid JSON is encountered.

Type: error

Returns

always true.

Type: boolean
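
Taken together, exportJSON() and importJSON() allow learnings to be persisted between runs. The sketch below shows one possible round trip through a file. It assumes that exportJSON() returns a JSON string; `saveLearnings`, `loadLearnings` and the stand-in object `fakeNER` are hypothetical names used only so that the snippet runs without the package installed.

```javascript
const fs = require( 'fs' );
const os = require( 'os' );
const path = require( 'path' );

// Hypothetical helper: writes the exported learnings to a file.
// Assumes exportJSON() returns a JSON string.
function saveLearnings( ner, filePath ) {
  fs.writeFileSync( filePath, ner.exportJSON(), 'utf8' );
}

// Hypothetical helper: reads the file back and hands it to importJSON().
function loadLearnings( ner, filePath ) {
  return ner.importJSON( fs.readFileSync( filePath, 'utf8' ) );
}

// Stand-in object with the same two method names, for demonstration only.
// It is NOT wink-ner; pass a real instance in actual use.
const fakeNER = {
  learnings: [ 'manchester united' ],
  exportJSON() { return JSON.stringify( this.learnings ); },
  importJSON( json ) { this.learnings = JSON.parse( json ); return true; }
};

const file = path.join( os.tmpdir(), 'ner-learnings.json' );
saveLearnings( fakeNER, file );
fakeNER.learnings = [];
const restored = loadLearnings( fakeNER, file );
console.log( restored, fakeNER.learnings );
// -> true [ 'manchester united' ]
```

Since importJSON() throws on invalid JSON, a caller reading from an untrusted file may want to wrap `loadLearnings` in a try/catch.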

learn

learn(entities) → {number}

Learns the entities that must be detected by recognize()/predict() API calls in a sentence that has already been tokenized, either using wink-tokenizer or in a way that follows its token format.

It can be used to learn or update learnings incrementally, but it cannot be used to unlearn or delete one or more entities.

If duplicate entity definitions are encountered, all the entries except the last one are ignored.

Acronyms must be added with a space between each character; for example, USA should be added as 'u s a'. This ensures correct detection of U S A or U. S. A. or U.S.A. as USA (refer to the example below).

Example

var trainingData = [
  { text: 'manchester united', entityType: 'club', uid: 'manu' },
  { text: 'manchester', entityType: 'city' },
  { text: 'U K', entityType: 'country', uid: 'uk' }
];
myNER.learn( trainingData );
// -> 3
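
When many acronyms need to be added, the spacing rule above can be applied programmatically. The helper below is a small sketch; `spaceOutAcronym` is a hypothetical name and is not part of wink-ner. Lowercasing mirrors the 'u s a' form used in the text and is a choice made here, not a documented requirement.

```javascript
// Hypothetical helper: turns 'USA' or 'U.K.' into the spaced form
// ('u s a', 'u k') that the learn() notes ask for.
function spaceOutAcronym( acronym ) {
  return acronym
    .replace( /[^A-Za-z]/g, '' ) // drop dots, spaces and other non-letters
    .toLowerCase()
    .split( '' )
    .join( ' ' );
}

const acronymData = [
  { text: spaceOutAcronym( 'USA' ), entityType: 'country', uid: 'usa' },
  { text: spaceOutAcronym( 'U.K.' ), entityType: 'country', uid: 'uk' }
];
console.log( acronymData.map( ( e ) => e.text ) );
// -> [ 'u s a', 'u k' ]
```

The resulting array has the shape expected by learn() and could be passed to it directly.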

Parameters

Name	Type	Description
entities	Array.<object>	where each element defines an entity via two mandatory properties, text and entityType, as described later. Note that an element that is not an object or does not contain the mandatory properties is ignored.

In addition to these two properties, you may optionally define two more properties viz. uid and value, as described in the table below.

Note: Apart from the properties mentioned above, you may also define additional properties. Such properties, along with their values, are copied to the output token as-is for consumption by any downstream code in the NLP pipe. An example use case is POS tagging: you can define a pos property in an entity definition as { text: 'manchester united', entityType: 'club', pos: 'NNP' }. The wink-pos-tagger will automatically use the pos property (if available) to ensure correct tagging in your context by overriding its algorithm.

Properties

Name	Type	Attributes	Description
text	string		text that must be detected as an entity; it may consist of more than one word, for example `India` or `United Kingdom`.
entityType	string		type of the entity; for example `country`.
uid	string	<optional>	unique id for the entity; an example use case of `uid` is using it to access more properties of the entity from a database. If it is `undefined`, it is generated automatically by joining the key words of the detected entity with an underscore (_). For example, `'india'` or `'united_kingdom'`.
value	string	<optional>	value that is assigned to the value property of the token. If it is `undefined`, it equals the value of the token for uni-word entities; for multi-word entities, it is generated automatically by joining the key words of the entity with a space character. For example, `'india'` or `'united kingdom'`.
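
The defaulting rules for uid and value can be summarized in a few lines of code. This is only a sketch of the rules as described in the table, not wink-ner's code: `defaultUidAndValue` is a hypothetical helper, and the lowercasing of key words is an assumption inferred from the examples shown on this page.

```javascript
// Hypothetical helper illustrating the documented defaults.
// Assumption: key words are lowercased, as the examples suggest.
function defaultUidAndValue( words ) {
  const keyWords = words.map( ( w ) => w.toLowerCase() );
  return {
    // uid: key words joined by underscore.
    uid: keyWords.join( '_' ),
    // value: the token's own value for a uni-word entity,
    // otherwise the key words joined by a space.
    value: ( words.length === 1 ) ? words[ 0 ] : keyWords.join( ' ' )
  };
}

console.log( defaultUidAndValue( [ 'United', 'Kingdom' ] ) );
// -> { uid: 'united_kingdom', value: 'united kingdom' }
console.log( defaultUidAndValue( [ 'Manchester' ] ) );
// -> { uid: 'manchester', value: 'Manchester' }
```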

Returns

number of entities actually learned.

Type: number

recognize

recognize(tokens) → {Array.<object>}

Recognizes entities in the input tokens. Any token that is recognized as an entity automatically receives the properties that have been defined for the detected entity using learn(). If a set of tokens is together recognized as a single entity, the tokens are merged into a single token; the merged token's value property becomes the concatenation of the values of the merged tokens, separated by a space.

Example

// Use wink tokenizer.
var winkTokenizer = require( 'wink-tokenizer' );
// Instantiate it and use tokenize() api.
var tokenize = winkTokenizer().tokenize;
var tokens = tokenize( 'Manchester United is a professional football club based in Manchester, U. K.' );
// Detect entities.
myNER.recognize( tokens );
// -> [
//      { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ], value: 'manchester united', tag: 'word' },
//      { value: 'is', tag: 'word' },
//      { value: 'a', tag: 'word' },
//      { value: 'professional', tag: 'word' },
//      { value: 'football', tag: 'word' },
//      { value: 'club', tag: 'word' },
//      { value: 'based', tag: 'word' },
//      { value: 'in', tag: 'word' },
//      { value: 'Manchester', tag: 'word', originalSeq: [ 'Manchester' ], uid: 'manchester', entityType: 'city' },
//      { value: ',', tag: 'punctuation' },
//      { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ], value: 'u k', tag: 'word' },
//      { value: '.', tag: 'punctuation' }
//    ]

Parameters

Name	Type	Description
tokens	Array.<object>	tokenized either using wink-tokenizer or in a way that follows its standards.

Returns

array of updated tokens with entities tagged.

Type: Array.<object>
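
Because recognized entities are returned inline with ordinary tokens, downstream code usually needs to pick them out. The sketch below does this by checking for the entityType property, which in the example output above appears only on entity tokens. `pickEntities` is a hypothetical helper, and the sample array is an abbreviated copy of that example output rather than a live call to recognize().

```javascript
// Hypothetical helper: keeps only the tokens that carry an entityType.
function pickEntities( tokens ) {
  return tokens
    .filter( ( t ) => t.entityType !== undefined )
    .map( ( t ) => ( { value: t.value, entityType: t.entityType, uid: t.uid } ) );
}

// Abbreviated copy of the recognize() output shown in the example.
const recognized = [
  { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ], value: 'manchester united', tag: 'word' },
  { value: 'is', tag: 'word' },
  { value: 'Manchester', tag: 'word', originalSeq: [ 'Manchester' ], uid: 'manchester', entityType: 'city' },
  { value: ',', tag: 'punctuation' },
  { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ], value: 'u k', tag: 'word' }
];

const entities = pickEntities( recognized );
console.log( entities.map( ( e ) => e.uid ) );
// -> [ 'manu', 'manchester', 'uk' ]
```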

reset

reset() → {boolean}

Resets the named entity recognizer by re-initializing all the learnings and by setting the configuration to default.

Example

myNER.reset( );
// -> true

Returns

always true.

Type: boolean