Introduction

wink-ner

Language agnostic named entity recognizer

Recognize named entities in a sentence using wink-ner. It is part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

Installation

Use npm to install:

npm install wink-ner --save

Getting Started

Simple Named Entity Recognition

// Load wink ner.
var ner = require( 'wink-ner' );
// Create your instance of wink ner & use default config.
var myNER = ner();
// Define training data.
var trainingData = [
  { text: 'manchester united', entityType: 'club', uid: 'manu' },
  { text: 'manchester', entityType: 'city' },
  { text: 'U K', entityType: 'country', uid: 'uk' }
];
// Learn from the training data.
myNER.learn( trainingData );
// Since recognize() requires tokens, use wink-tokenizer.
var winkTokenizer = require( 'wink-tokenizer' );
// Instantiate it and extract tokenize() api.
var tokenize = winkTokenizer().tokenize;
// Tokenize the sentence.
var tokens = tokenize( 'Manchester United is a football club based in Manchester, U. K.' );
// Simply Detect entities!
tokens = myNER.recognize( tokens );
console.log( tokens );
// -> [
//      { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ],
//        value: 'manchester united', tag: 'word' },
//      { value: 'is', tag: 'word' },
//      { value: 'a', tag: 'word' },
//      { value: 'football', tag: 'word' },
//      { value: 'club', tag: 'word' },
//      { value: 'based', tag: 'word' },
//      { value: 'in', tag: 'word' },
//      { entityType: 'city', value: 'Manchester', tag: 'word',
//        originalSeq: [ 'Manchester' ], uid: 'manchester' },
//      { value: ',', tag: 'punctuation' },
//      { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ],
//        value: 'u k', tag: 'word' },
//      { value: '.', tag: 'punctuation' }
//    ]

Integration with POS Tagging

The tokens returned from recognize() may be passed on to the tag() api of wink-pos-tagger for pos tagging.

If you need to assign a specific pos tag to an entity, include a pos property in the entity definition and assign it the desired pos tag (e.g. 'NNP'); wink-pos-tagger will automatically honor it. For details, please refer to the learn() api of wink-ner.

// Load pos tagger.
var tagger = require( 'wink-pos-tagger' );
// Instantiate it and extract tag api.
var tag = tagger().tag;
tokens = tag( tokens );
console.log( tokens );
// -> [ { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ],
//        value: 'manchester united', tag: 'word', normal: 'manchester united', pos: 'NNP' },
//      { value: 'is', tag: 'word', normal: 'is', pos: 'VBZ', lemma: 'be' },
//      { value: 'a', tag: 'word', normal: 'a', pos: 'DT' },
//      { value: 'football', tag: 'word', normal: 'football', pos: 'NN', lemma: 'football' },
//      { value: 'club', tag: 'word', normal: 'club', pos: 'NN', lemma: 'club' },
//      { value: 'based', tag: 'word', normal: 'based', pos: 'VBN', lemma: 'base' },
//      { value: 'in', tag: 'word', normal: 'in', pos: 'IN' },
//      { value: 'Manchester', tag: 'word', originalSeq: [ 'Manchester' ],
//        uid: 'manchester', entityType: 'city', normal: 'manchester', pos: 'NNP' },
//      { value: ',', tag: 'punctuation', normal: ',', pos: ',' },
//      { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ],
//        value: 'u k', tag: 'word', normal: 'u k', pos: 'NNP' },
//      { value: '.', tag: 'punctuation', normal: '.', pos: '.' }
//    ]

Documentation

Check out the named entity recognizer API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-ner is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

Creating an Instance

ner

Creates an instance of wink-ner.

ner(): methods
Returns
methods: object containing the set of API methods for named entity training, recognition, etc.
Example
// Load wink ner.
var ner = require( 'wink-ner' );
// Create your instance of wink ner.
var myNER = ner();

API Methods

defineConfig

Defines the criteria for ignoring one or more tokens during entity detection. The criteria are specified as arrays of tags and/or values to ignore: if a token's tag or value matches any listed entry, the token is skipped and its value is not considered during entity recognition.

For example, including punctuation in the array of tags to ignore causes tokens containing punctuation such as - or . to be skipped. As a result, kg and k.g. are both recognized as kg (the kilogram symbol), and Guinea-Bissau and Guinea Bissau are both recognized as Guinea-Bissau (a country in West Africa).

defineConfig(config: object): object
Parameters
config (object): defines the values and/or tags to be ignored during entity detection. Note that a token is ignored if it matches an entry in either array.

An empty config object is equivalent to setting default configuration.

The table below details the properties of config object:

Name Description
config.valuesToIgnore Array<string> (default undefined) contains values to be ignored.
config.tagsToIgnore Array<string> (default ['punctuation']) contains tags to be ignored. Duplicate and invalid tags, if any, are ignored. Note: number and word tags can never be ignored.
config.ignoreDiacritics boolean (default true) true ensures that diacritic marks are ignored, whereas false ensures that they are not.
Returns
object: a copy of configuration defined.
Throws
  • error: if valuesToIgnore is not an array of strings.
  • error: if tagsToIgnore is not an array of strings.
Example
// Do not ignore anything!
myNER.defineConfig( { tagsToIgnore: [], ignoreDiacritics: false } );
// -> { tagsToIgnore: [], valuesToIgnore: [], ignoreDiacritics: false }

// Ignore only '-' and '.'
myNER.defineConfig( {
  tagsToIgnore: [],
  valuesToIgnore: [ '-', '.' ],
  ignoreDiacritics: false
} );
// -> {
//      tagsToIgnore: [],
//      valuesToIgnore: [ '-', '.' ],
//      ignoreDiacritics: false
//    }
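
The ignore rule described above can be pictured with a small stand-alone sketch. It does not call wink-ner; shouldIgnore() is a hypothetical helper written only for illustration, and the real implementation may differ. It assumes tokens in the { value, tag } shape shown in this document.

```javascript
// Hypothetical helper: NOT wink-ner's internal code, only a sketch of the rule.
function shouldIgnore( token, config ) {
  var tags = config.tagsToIgnore || [];
  var values = config.valuesToIgnore || [];
  // 'word' and 'number' tokens are never ignored on the basis of their tag.
  var tagMatch = token.tag !== 'word' && token.tag !== 'number' &&
                 tags.indexOf( token.tag ) !== -1;
  // A match in either array is enough to skip the token.
  var valueMatch = values.indexOf( token.value ) !== -1;
  return tagMatch || valueMatch;
}

// Hand-written tokens for 'Guinea-Bissau'.
var tokens = [
  { value: 'Guinea', tag: 'word' },
  { value: '-', tag: 'punctuation' },
  { value: 'Bissau', tag: 'word' }
];

// Default-like config: skip every punctuation token.
var kept = tokens.filter( function ( t ) {
  return !shouldIgnore( t, { tagsToIgnore: [ 'punctuation' ] } );
} );
console.log( kept.map( function ( t ) { return t.value; } ) );
// -> [ 'Guinea', 'Bissau' ]
```

With the hyphen skipped, 'Guinea-Bissau' and 'Guinea Bissau' reduce to the same word sequence, which is why both can match a single entity definition.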

learn

Learns the entities that must be detected via recognize()/predict() API calls in a sentence that has already been tokenized, either using wink-tokenizer or in a way that follows its token format.

It can be used to learn or update learnings incrementally, but it cannot be used to unlearn or delete entities.

If duplicate entity definitions are encountered, all entries except the last one are ignored.

Acronyms must be added with a space between each character; for example, USA should be added as 'u s a'. This ensures correct detection of U S A, U. S. A. or U.S.A. as USA (see the 'U K' entry in the example below).
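
To see why the spaced form lines up with tokenized input, here is a minimal sketch. It assumes that each letter of the acronym arrives as its own token and that punctuation tokens are skipped (the default configuration). normalizeForMatching() is a hypothetical helper, not part of wink-ner, and the tokens are written by hand rather than produced by wink-tokenizer.

```javascript
// Hypothetical helper for illustration; wink-ner's real matching logic may differ.
function normalizeForMatching( tokens ) {
  return tokens
    .filter( function ( t ) { return t.tag !== 'punctuation'; } )
    .map( function ( t ) { return t.value.toLowerCase(); } )
    .join( ' ' );
}

// Hand-written tokens, in the { value, tag } shape used throughout this document.
var dotted = [
  { value: 'U', tag: 'word' }, { value: '.', tag: 'punctuation' },
  { value: 'S', tag: 'word' }, { value: '.', tag: 'punctuation' },
  { value: 'A', tag: 'word' }, { value: '.', tag: 'punctuation' }
];
var spaced = [
  { value: 'U', tag: 'word' },
  { value: 'S', tag: 'word' },
  { value: 'A', tag: 'word' }
];

console.log( normalizeForMatching( dotted ) );
// -> u s a
console.log( normalizeForMatching( spaced ) );
// -> u s a
```

Both spellings reduce to the same space-separated key, so a single definition with text 'u s a' can cover them.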

learn(entities: Array<object>): number
Parameters
entities (Array<object>): each element defines an entity via two mandatory properties, text and entityType, as described in the table below. Note that an element is ignored if it is not an object or does not contain the mandatory properties.

In addition to these two properties, you may optionally define two more, uid and value, as described in the table below.

Note: Apart from the properties mentioned above, you may also define additional properties. Such properties, along with their values, are copied to the output token as-is for consumption by any downstream code in the NLP pipeline. An example use case is pos tagging: you can define a pos property in an entity definition as { text: 'manchester united', entityType: 'club', pos: 'NNP' }. The wink-pos-tagger will automatically use the pos property (if available) instead of its own algorithm, ensuring correct tagging in your context.

Name Description
entities[].text string that must be detected as an entity; it may consist of more than one word, for example India or United Kingdom.
entities[].entityType string type of the entity, for example country.
entities[].uid string (default undefined) unique id for the entity; an example use case is using it to look up more properties of the entity in a database. If it is undefined, it is generated automatically by joining the key words of the detected entity with underscores (_). For example, 'india' or 'united_kingdom'.
entities[].value string (default undefined) that is assigned to the value property of the token. If undefined, it equals the value of the token for uni-word entities; for multi-word entities, it is generated automatically by joining the key words of the entity with a space character. For example, 'india' or 'united kingdom'.
Returns
number: of actual entities learned.
Example
var trainingData = [
  { text: 'manchester united', entityType: 'club', uid: 'manu' },
  { text: 'manchester', entityType: 'city' },
  { text: 'U K', entityType: 'country', uid: 'uk' }
];
myNER.learn( trainingData );
// -> 3
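
The rules for invalid elements, duplicates and the default uid can be sketched without wink-ner as follows. isValidEntity() and prepareEntities() are hypothetical helpers, and treating two definitions with the same text as duplicates is an assumption made for this sketch; the library's own bookkeeping may differ.

```javascript
// Hypothetical helpers; they mirror the rules described above, not wink-ner's source.
function isValidEntity( e ) {
  return Boolean( e ) && typeof e === 'object' &&
         typeof e.text === 'string' && typeof e.entityType === 'string';
}

function prepareEntities( entities ) {
  var byText = Object.create( null );
  entities.filter( isValidEntity ).forEach( function ( e ) {
    var words = e.text.toLowerCase().split( /\s+/ ).filter( Boolean );
    // A later definition of the same text replaces the earlier one.
    byText[ words.join( ' ' ) ] = {
      text: e.text,
      entityType: e.entityType,
      // Default uid: the key words joined by an underscore.
      uid: ( e.uid === undefined ) ? words.join( '_' ) : e.uid
    };
  } );
  return Object.keys( byText ).map( function ( k ) { return byText[ k ]; } );
}

var prepared = prepareEntities( [
  { text: 'United Kingdom', entityType: 'country' },
  { text: 'manchester', entityType: 'town' },
  'not an object',
  { text: 'no type given' },
  { text: 'manchester', entityType: 'city' }
] );
console.log( prepared );
// -> [ { text: 'United Kingdom', entityType: 'country', uid: 'united_kingdom' },
//      { text: 'manchester', entityType: 'city', uid: 'manchester' } ]
```

A check like this can be run over training data before calling learn() to spot definitions that would be silently skipped or overwritten.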

recognize

Recognizes entities in the input tokens. Any token that is recognized as an entity automatically receives the properties defined for that entity using learn(). If a set of tokens together is recognized as a single entity, they are merged into a single token; the merged token's value property becomes the concatenation of the values of the merged tokens, separated by spaces.

recognize(tokens: Array<object>): Array<object>
Parameters
tokens (Array<object>): tokens produced by wink-tokenizer or following its token format.
Returns
Array<object>: of updated tokens with entities tagged.
Example
// Use wink tokenizer.
var winkTokenizer = require( 'wink-tokenizer' );
// Instantiate it and use tokenize() api.
var tokenize = winkTokenizer().tokenize;
var tokens = tokenize( 'Manchester United is a professional football club based in Manchester, U. K.' );
// Detect entities.
myNER.recognize( tokens );
// -> [
//      { entityType: 'club', uid: 'manu', originalSeq: [ 'Manchester', 'United' ], value: 'manchester united', tag: 'word' },
//      { value: 'is', tag: 'word' },
//      { value: 'a', tag: 'word' },
//      { value: 'professional', tag: 'word' },
//      { value: 'football', tag: 'word' },
//      { value: 'club', tag: 'word' },
//      { value: 'based', tag: 'word' },
//      { value: 'in', tag: 'word' },
//      { value: 'Manchester', tag: 'word', originalSeq: [ 'Manchester' ], uid: 'manchester', entityType: 'city' },
//      { value: ',', tag: 'punctuation' },
//      { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ], value: 'u k', tag: 'word' },
//      { value: '.', tag: 'punctuation' }
//    ]
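
The merge step can be illustrated in isolation. The sketch below does not perform any matching; it only shows how a run of tokens already known to form an entity could collapse into one token. mergeAsEntity() is a hypothetical helper, and lower-casing the joined value and leaving punctuation out of it are assumptions based on the sample output above.

```javascript
// Hypothetical helper: a sketch of the merge step only, not wink-ner's implementation.
function mergeAsEntity( tokens, start, count, entity ) {
  var span = tokens.slice( start, start + count );
  var merged = Object.assign( {}, entity, {
    // Every original token value is kept, punctuation included.
    originalSeq: span.map( function ( t ) { return t.value; } ),
    // Punctuation is left out of the joined value (assumes the default config).
    value: span
      .filter( function ( t ) { return t.tag !== 'punctuation'; } )
      .map( function ( t ) { return t.value.toLowerCase(); } )
      .join( ' ' ),
    tag: 'word'
  } );
  return tokens.slice( 0, start ).concat( [ merged ], tokens.slice( start + count ) );
}

// Hand-written tokens for 'U. K.'
var tokens = [
  { value: 'U', tag: 'word' },
  { value: '.', tag: 'punctuation' },
  { value: 'K', tag: 'word' },
  { value: '.', tag: 'punctuation' }
];
var result = mergeAsEntity( tokens, 0, 3, { entityType: 'country', uid: 'uk' } );
console.log( result );
// -> [ { entityType: 'country', uid: 'uk', originalSeq: [ 'U', '.', 'K' ],
//        value: 'u k', tag: 'word' },
//      { value: '.', tag: 'punctuation' } ]
```

The originalSeq property preserves the surface form, so downstream code can still reconstruct the text as it appeared in the sentence.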

reset

Resets the named entity recognizer by re-initializing all the learnings and by setting the configuration to default.

reset(): boolean
Returns
boolean: always true.
Example
myNER.reset( );
// -> true

exportJSON

Exports the learnings generated by learn() as JSON, which can be saved to a file and loaded later for entity recognition.

exportJSON(): json
Returns
json: of the learnings.
Example
var learnings = myNER.exportJSON();

importJSON

Imports NER learnings that were previously exported via exportJSON().

importJSON(json: json): boolean
Parameters
json (json): contains earlier exported learnings in JSON format.
Returns
boolean: always true.
Throws
  • error: if invalid JSON is encountered.
Example
var myNER = ner();
// Assuming that `json` has valid learnings.
myNER.importJSON( json );
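
A save-and-restore round trip through a file might look like the sketch below. To keep it runnable without wink-ner installed, a placeholder string stands in for the value returned by exportJSON(); that the export is a JSON string is an assumption, and the file name is hypothetical.

```javascript
var fs = require( 'fs' );
var os = require( 'os' );
var path = require( 'path' );

// Stand-in for the value returned by myNER.exportJSON(); assumed to be a JSON string.
var learnings = JSON.stringify( { placeholder: 'exported learnings' } );

// Hypothetical file location, chosen only for this sketch.
var file = path.join( os.tmpdir(), 'wink-ner-learnings.json' );

// Save the learnings, then read them back.
fs.writeFileSync( file, learnings, 'utf8' );
var restored = fs.readFileSync( file, 'utf8' );

// JSON.parse() throws on corrupt content, so a damaged file is caught early.
JSON.parse( restored );
console.log( restored === learnings );
// -> true
```

With wink-ner loaded, restored would then be passed to myNER.importJSON() on a fresh instance.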