.readDoc() method, when used with the default instance of winkNLP, splits the text into tokens, entities, and sentences. It also determines a range of their properties. These are accessible via the
.out() method on the basis of the input parameter —
its.property. Some examples of properties are
const text = 'cats are cool'; const doc = nlp.readDoc( text ); console.log( doc.tokens().out( its.value) ); // -> ["cats", "are", "cool"] console.log( doc.tokens().out( its.stopWordFlag ) ); // -> [false, true, false] console.log( doc.tokens().out( its.pos ) ); // -> ["NOUN", "AUX", "VERB"] console.log( doc.tokens().out( its.lemma ) ); // -> ["cat", "be", "cool"]
.readDoc() API processes the input text in many stages. All the stages together form a processing pipeline also referred as pipe. The first stage is tokenization, which is mandatory. The later stages such as sentence boundary detection (SBD) or part-of-speech tagging (POS) are optional. The optional stages are user configurable. The following figure and table illustrates the pipe:
| ||Splits text into tokens.|
| ||Sentence boundary detection — determines **span** of each sentence in terms of start & end token indexes.|
| ||Negation handling — sets the |
| ||Computes sentiment score of each sentence and the entire document.|
| ||Named entity recognition — detects all named entities and also determines their type & span.|
| ||Performs part-of-speech (pos) tagging.|
| ||Custom entity recognition — detects all custom entities and their type & span. The detection is carried out on the basis of training carried out using |
The default instance of
winkNLP is created using only the
model as input parameter:
// Load wink-nlp package. const winkNLP = require( 'wink-nlp' ); // Load english language model — light version. const model = require( 'wink-eng-lite-model' ); // Instantiate winkNLP — default — will run all the above mentioned // stages. const nlp = winkNLP( model );
It also accepts an additional parameter —
pipe that controls the processing pipeline. This parameter is an array that contains the names of the stages that you wish to run. For example, the following will only run sentence boundary detection and pos tagging after tokenization:
const nlp = winkNLP( model, [ 'sbd', 'pos' ] );
- While the sequence of stages in a pipe is not important as `winkNLP` handles it automatically, it is recommended to always provide names in the correct logical sequence.
- Without `sbd`, the entire text is treated as a single sentence.
- `sentiment` is dependent on `negation`; without negation, the accuracy of sentiment score may drop.
- Without `pos`, `its.lemma` accuracy drops drastically.
- Without `ner`, the count of named entities will always be zero i.e. `doc.entities().length()` will return a zero.
- Without `cer`, the count of custom entities will always be zero i.e. `doc.customEntities().length()` will return a zero.