Processing Pipeline

The .readDoc() method, when used with the default instance of winkNLP, splits the text into tokens, entities, and sentences. It also determines a range of their properties. These are accessible via the .out() method, which selects what to return on the basis of its input parameter, its.property. Some examples of properties are value, stopWordFlag, pos, and lemma:

const text = 'cats are cool';
const doc = nlp.readDoc( text );
console.log( doc.tokens().out( its.value ) );
// -> ["cats", "are", "cool"]
console.log( doc.tokens().out( its.stopWordFlag ) );
// -> [false, true, false]
console.log( doc.tokens().out( its.pos ) );
// -> ["NOUN", "AUX", "ADJ"]
console.log( doc.tokens().out( its.lemma ) );
// -> ["cat", "be", "cool"]

The .readDoc() API processes the input text in several stages. Together, these stages form a processing pipeline, also referred to as the pipe. The first stage is tokenization, which is mandatory. The later stages, such as sentence boundary detection (SBD) or part-of-speech tagging (POS), are optional and user configurable. The following figure and table illustrate the pipe:

[Figure: Diagram showing the processing pipeline of winkNLP]

Common its.properties that become available at each stage are also highlighted. For more details, please refer to the sections "item and its properties" and "its & as helpers".

| Stage | Description |
| --- | --- |
| tokenization | Splits text into tokens. |
| sbd | Sentence boundary detection: determines the **span** of each sentence in terms of start and end token indexes. |
| negation | Negation handling: sets the negationFlag for every token whose meaning is negated due to a "not" word. |
| sentiment | Computes the sentiment score of each sentence and of the entire document. |
| ner | Named entity recognition: detects all named entities and determines their type and span. |
| pos | Performs part-of-speech (POS) tagging. |
| cer | Custom entity recognition: detects all custom entities along with their type and span. Detection is based on training carried out using the learnCustomEntities() method. |

The default instance of winkNLP is created using only the model as input parameter:

// Load wink-nlp package.
const winkNLP = require( 'wink-nlp' );
// Load English language model (light version).
const model = require( 'wink-eng-lite-model' );
// Instantiate winkNLP with the default pipe, which runs all the
// stages mentioned above.
const nlp = winkNLP( model );

It also accepts an additional parameter, pipe, that controls the processing pipeline. This parameter is an array containing the names of the stages that you wish to run. For example, the following will run only sentence boundary detection and POS tagging after tokenization:

const nlp = winkNLP( model, [ 'sbd', 'pos' ] );

The interplay between stages and properties is outlined below:
  1. While the sequence of stages in a pipe is not important as `winkNLP` handles it automatically, it is recommended to always provide names in the correct logical sequence.
  2. Without `sbd`, the entire text is treated as a single sentence.
  3. `sentiment` is dependent on `negation`; without negation, the accuracy of sentiment score may drop.
  4. Without `pos`, `its.lemma` accuracy drops drastically.
  5. Without `ner`, the count of named entities will always be zero i.e. `doc.entities().length()` will return a zero.
  6. Without `cer`, the count of custom entities will always be zero i.e. `doc.customEntities().length()` will return a zero.
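
The first and third points can be illustrated with a small sketch in plain JavaScript. It assumes that the "correct logical sequence" is the order in which the stages are listed earlier in this section. The names `STAGE_ORDER`, `orderPipe` and `pipeWarnings` are hypothetical helpers written for this illustration; they are not part of winkNLP.

```javascript
// Hypothetical helpers, not part of winkNLP.
// Assumption: the logical sequence is the order in which the optional
// stages are listed earlier in this section. Tokenization always runs,
// so it is not named in a pipe.
const STAGE_ORDER = [ 'sbd', 'negation', 'sentiment', 'ner', 'pos', 'cer' ];

// Validates stage names, drops duplicates and sorts by STAGE_ORDER.
function orderPipe( stages ) {
  const unknown = stages.filter( ( s ) => !STAGE_ORDER.includes( s ) );
  if ( unknown.length > 0 ) {
    throw new Error( 'Unknown stage(s): ' + unknown.join( ', ' ) );
  }
  return [ ...new Set( stages ) ]
    .sort( ( a, b ) => STAGE_ORDER.indexOf( a ) - STAGE_ORDER.indexOf( b ) );
}

// Flags the dependency noted above: sentiment relies on negation.
function pipeWarnings( stages ) {
  const warnings = [];
  if ( stages.includes( 'sentiment' ) && !stages.includes( 'negation' ) ) {
    warnings.push( 'sentiment without negation: score accuracy may drop' );
  }
  return warnings;
}

const orderedPipe = orderPipe( [ 'pos', 'sbd' ] );
console.log( orderedPipe );
// -> [ 'sbd', 'pos' ]
```

The ordered array could then be passed as the second argument when instantiating winkNLP, in the same way as the `[ 'sbd', 'pos' ]` example.
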

Running only the required stages can give a performance advantage. For example, tokenization alone runs at a speed of 2.1 million tokens per second, while tokenization plus sbd delivers about 1.5 million tokens per second. Contrast this with a speed of 0.5 million tokens per second when all the stages are active.
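
To put these figures in perspective, the following back-of-the-envelope sketch estimates processing time for a corpus. The corpus size of 10 million tokens and the helper `estimateSeconds` are assumptions made for this illustration; real throughput will vary with hardware and input text.

```javascript
// Throughput figures quoted in this section, in tokens per second.
const TOKENS_PER_SECOND = {
  'tokenization only': 2.1e6,
  'tokenization + sbd': 1.5e6,
  'all stages': 0.5e6
};

// Hypothetical helper: seconds needed to process `tokenCount` tokens.
function estimateSeconds( tokenCount, tokensPerSecond ) {
  return tokenCount / tokensPerSecond;
}

const corpusSize = 10e6; // assumed corpus of 10 million tokens
for ( const [ label, speed ] of Object.entries( TOKENS_PER_SECOND ) ) {
  const seconds = estimateSeconds( corpusSize, speed ).toFixed( 1 );
  console.log( label + ': ~' + seconds + ' s' );
}
// -> tokenization only: ~4.8 s
// -> tokenization + sbd: ~6.7 s
// -> all stages: ~20.0 s
```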
