How to tokenize a string?

To tokenize a string with winkNLP, first read the text using readDoc, which returns a document. Call the tokens method on that document to get a collection of its tokens, then call the out method to get this collection as a JavaScript array of strings. This is how you can tokenize a string:

// Load the wink-nlp package.
const winkNLP = require( 'wink-nlp' );
// Load the "its" helper to extract item properties.
const its = require( 'wink-nlp/src/its.js' );
// Load the English language model (light version).
const model = require( 'wink-eng-lite-model' );
// Instantiate winkNLP.
const nlp = winkNLP( model );

// Input string
const text = '#Breaking:D Can’t get over this #Oscars selfie from @TheEllenShow🤩https://pic.twitter.com/C9U5NOtGap';
// Read text
const doc = nlp.readDoc( text );
// Tokenize the string
const tokens = doc.tokens();
console.log( tokens.out() );

This returns an array of tokens:

[
  '#Breaking', ':D', 'Ca', 'n’t', 'get', 'over', 'this', '#Oscars',
  'selfie', 'from', '@TheEllenShow', '🤩',
  'https://pic.twitter.com/C9U5NOtGap'
]

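winkNLP has a lossless tokenizer that preserves and can reproduce the original text. The tokenizer handles hyphenation, contractions and abbreviations; in the output above, for example, Can’t is split into 'Ca' and 'n’t', while the hashtag, the emoticon, the mention, the emoji and the URL each stay a single token. It also detects token types like ‘word’, ‘number’, ‘punctuation’, ‘symbol’, etc.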
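
To make the idea of lossless tokenization concrete, here is a minimal sketch in plain JavaScript. It is not winkNLP's tokenizer and does not handle contractions, hashtags or URLs. The names tokenizeLossless and detokenize, the precedingSpaces and trailing fields, and the simple type rules are all assumptions made for illustration. It only shows how keeping the whitespace that precedes each token lets you rebuild the input exactly:

```javascript
// Toy illustration only: NOT winkNLP's implementation.
// tokenizeLossless and detokenize are hypothetical helper names.

// Assign a rough type to a token value (assumed, simplified rules).
function guessType(value) {
  if (/^\p{N}+$/u.test(value)) return 'number';
  if (/^[\p{L}\p{N}_]+$/u.test(value)) return 'word';
  if (/^\p{P}$/u.test(value)) return 'punctuation';
  return 'symbol';
}

// Split text into tokens while remembering the whitespace before each one.
function tokenizeLossless(text) {
  const tokens = [];
  // Group 1: whitespace before the token.
  // Group 2: a run of letters/digits/underscores, or one other character.
  const pattern = /(\s*)([\p{L}\p{N}_]+|[^\s])/gu;
  let consumed = 0;
  for (const match of text.matchAll(pattern)) {
    tokens.push({
      value: match[2],
      precedingSpaces: match[1],
      type: guessType(match[2])
    });
    consumed = match.index + match[0].length;
  }
  // Whitespace after the last token must be kept too.
  return { tokens, trailing: text.slice(consumed) };
}

// Rebuild the original text from the tokens and the stored whitespace.
function detokenize(result) {
  const body = result.tokens
    .map((t) => t.precedingSpaces + t.value)
    .join('');
  return body + result.trailing;
}

const sample = 'Hello,  world! 42';
const result = tokenizeLossless(sample);
console.log(result.tokens.map((t) => t.value));
console.log(result.tokens.map((t) => t.type));
// The round trip reproduces the input, including the double space.
console.log(detokenize(result) === sample);
```

Storing the whitespace alongside each token, rather than discarding it, is what makes the round trip possible; a tokenizer that only returned the token values could not tell one space from two.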

