Introduction

wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type


Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. It is a part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

Some of its top features are:

  1. Support for English, French, German, Hindi, Sanskrit, Marathi and many more.

  2. Intelligent tokenization of sentences containing words in more than one language.

  3. Automatic detection and tagging of each token's type:

    • These include word, punctuation, email, mention, hashtag, emoticon and emoji, among others.

Installation

Use npm to install:

npm install wink-tokenizer --save

Getting Started

// Load tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create its instance.
var myTokenizer = tokenizer();

// Tokenize a tweet.
var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );
// -> [ { value: '@superman', tag: 'mention' },
//      { value: ':', tag: 'punctuation' },
//      { value: 'hit', tag: 'word' },
//      { value: 'me', tag: 'word' },
//      { value: 'up', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'my', tag: 'word' },
//      { value: 'email', tag: 'word' },
//      { value: 'r2d2@gmail.com', tag: 'email' },
//      { value: ',', tag: 'punctuation' },
//      { value: '2', tag: 'number' },
//      { value: 'of', tag: 'word' },
//      { value: 'us', tag: 'word' },
//      { value: 'plan', tag: 'word' },
//      { value: 'party', tag: 'word' },
//      { value: '🎉', tag: 'emoji' },
//      { value: 'tom', tag: 'word' },
//      { value: 'at', tag: 'word' },
//      { value: '3pm', tag: 'time' },
//      { value: ':)', tag: 'emoticon' },
//      { value: '#fun', tag: 'hashtag' } ]

// Tokenize a French sentence.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );
// -> [ { value: 'Mieux', tag: 'word' },
//      { value: 'vaut', tag: 'word' },
//      { value: 'prévenir', tag: 'word' },
//      { value: 'que', tag: 'word' },
//      { value: 'guérir', tag: 'word' },
//      { value: ':-)', tag: 'emoticon' } ]

// Tokenize a sentence containing Hindi and English.
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );
// -> [ { value: 'द्रविड़', tag: 'word' },
//      { value: 'ने', tag: 'word' },
//      { value: 'टेस्ट', tag: 'word' },
//      { value: 'में', tag: 'word' },
//      { value: '३६', tag: 'number' },
//      { value: 'शतक', tag: 'word' },
//      { value: 'जमाए', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'उनमें', tag: 'word' },
//      { value: '21', tag: 'number' },
//      { value: 'विदेशी', tag: 'word' },
//      { value: 'playground', tag: 'word' },
//      { value: 'पर', tag: 'word' },
//      { value: 'हैं', tag: 'word' },
//      { value: '।', tag: 'punctuation' } ]
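
Because every token is a plain object with a value and a tag, the output can be post-processed with ordinary array methods. The sketch below groups token values by tag. The token list is a shortened, hand-copied version of the tweet example above rather than the result of a live call, and groupByTag is a hypothetical helper, not part of wink-tokenizer.

```javascript
// Tokens copied by hand from the tweet example (shortened); in real use
// they would come from myTokenizer.tokenize( s ).
var tokens = [
  { value: '@superman', tag: 'mention' },
  { value: ':', tag: 'punctuation' },
  { value: 'hit', tag: 'word' },
  { value: 'r2d2@gmail.com', tag: 'email' },
  { value: '🎉', tag: 'emoji' },
  { value: '3pm', tag: 'time' },
  { value: '#fun', tag: 'hashtag' }
];

// Hypothetical helper (not part of wink-tokenizer): collects token
// values under their tag.
function groupByTag( tokenList ) {
  return tokenList.reduce( function ( groups, token ) {
    if ( !groups[ token.tag ] ) groups[ token.tag ] = [];
    groups[ token.tag ].push( token.value );
    return groups;
  }, {} );
}

var grouped = groupByTag( tokens );
console.log( grouped.mention.join( ' ' ) ); // -> @superman
console.log( grouped.hashtag.join( ' ' ) ); // -> #fun
```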

Documentation

Check out the tokenizer API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-tokenizer is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

Creating an Instance

tokenizer

Creates an instance of wink-tokenizer.

tokenizer(): methods
Returns
methods: object containing the set of API methods for tokenizing a sentence, defining the configuration, plugins, etc.
Example
// Load wink tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create your instance of wink tokenizer.
var myTokenizer = tokenizer();

API Methods

defineConfig

Defines the configuration in terms of the types of tokens that will be extracted by the tokenize() method. Note that, by default, all types of tokens are detected and tagged automatically.

defineConfig(config: object): number
Parameters
config (object) — It defines 0 or more properties from the list of 14 properties. A true value for a property ensures tokenization for that type of text; whereas false value will mean that the tokenization of that type of text will not be attempted.

An empty config object is equivalent to splitting on spaces. Tokens created this way are tagged as alien, and the finger print code for this token type is z.

The table below gives the name of each property and its description, including examples. The character within parentheses is the finger print code for tokens of that type.

Name Description
config.currency boolean (default true) such as $ or £ symbols ( r )
config.email boolean (default true) for example john@acme.com or superman1@gmail.com ( e )
config.emoji boolean (default true) any standard unicode emojis e.g. 😊 or 😂 or 🎉 ( j )
config.emoticon boolean (default true) common emoticons such as :-) or :D ( c )
config.hashtag boolean (default true) hash tags such as #happy or #followme ( h )
config.number boolean (default true) any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing " , - / . ", for example 12-12-1924 ( n )
config.ordinal boolean (default true) ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st ( o )
config.punctuation boolean (default true) common punctuation such as ? or , ( token becomes fingerprint )
config.quoted_phrase boolean (default true) any "quoted text" in the sentence. ( q )
config.symbol boolean (default true) for example ~ or + or & or % ( token becomes fingerprint )
config.time boolean (default true) common representation of time such as 4pm or 16:00 hours ( t )
config.mention boolean (default true) @mention as in github or twitter ( m )
config.url boolean (default true) URL such as https://github.com ( u )
config.word boolean (default true) word such as faster or résumé or prévenir ( w )
Returns
number: number of properties set to true from the above list of 14.
Example
// Do not tokenize & tag @mentions.
myTokenizer.defineConfig( { mention: false } );
// -> 13
// An empty config splits on spaces and tags every token as alien.
myTokenizer.defineConfig( {} );
// -> 0
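
When only a few token types are needed, it can be clearer to build the config object programmatically than to list every property by hand. The following is a minimal sketch under stated assumptions: configFor and TOKEN_TYPES are hypothetical names, and the property list is copied from the table above. It sets all 14 properties explicitly, so the result does not depend on how omitted properties are treated.

```javascript
// Property names copied from the table above.
var TOKEN_TYPES = [
  'currency', 'email', 'emoji', 'emoticon', 'hashtag', 'number', 'ordinal',
  'punctuation', 'quoted_phrase', 'symbol', 'time', 'mention', 'url', 'word'
];

// Hypothetical helper (not part of wink-tokenizer): returns a config with
// every property set explicitly; true only for the listed types.
function configFor( enabledTypes ) {
  var config = {};
  TOKEN_TYPES.forEach( function ( type ) {
    config[ type ] = enabledTypes.indexOf( type ) !== -1;
  } );
  return config;
}

var config = configFor( [ 'word', 'number', 'punctuation' ] );
console.log( config.word );    // -> true
console.log( config.mention ); // -> false

// The config can then be passed to an existing instance:
// myTokenizer.defineConfig( config );
```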

tokenize

Tokenizes the input sentence using the configuration specified via defineConfig(). Common contractions and possessive nouns are split into 2 separate tokens; for example, I'll is split into 'I' and '\'ll', and won't into 'wo' and 'n\'t'.

tokenize(sentence: string): Array<object>
Parameters
sentence (string) — the input sentence.
Returns
Array<object>: array of tokens; each token is an object with 2 keys, viz. value and tag, where tag identifies the type of the token.
Example
var s = 'For detailed API docs, check out http://winkjs.org/wink-regression-tree/ URL!';
myTokenizer.tokenize( s );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'detailed', tag: 'word' },
//      { value: 'API', tag: 'word' },
//      { value: 'docs', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]
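
Since contractions are split into two tokens, downstream code that wants the original surface form has to join them again. The sketch below shows one possible way to do that. mergeContractions is a hypothetical helper, the token list is written by hand from the description above, and tagging both halves as word is an assumption rather than verified output.

```javascript
// Hand-written tokens for "I'll go but won't stay"; the split follows the
// description above, the 'word' tags are an assumption.
var tokens = [
  { value: 'I', tag: 'word' },
  { value: '\'ll', tag: 'word' },
  { value: 'go', tag: 'word' },
  { value: 'but', tag: 'word' },
  { value: 'wo', tag: 'word' },
  { value: 'n\'t', tag: 'word' },
  { value: 'stay', tag: 'word' }
];

// Hypothetical helper (not part of wink-tokenizer): appends a contraction
// suffix such as 'll or n't to the value of the preceding token.
function mergeContractions( tokenList ) {
  var merged = [];
  tokenList.forEach( function ( token ) {
    var v = token.value;
    // A lone apostrophe is not treated as a suffix.
    var isSuffix = ( v.length > 1 && v.charAt( 0 ) === '\'' ) || v === 'n\'t';
    if ( isSuffix && merged.length > 0 ) {
      merged[ merged.length - 1 ].value += v;
    } else {
      // Copy the token so the input list is left untouched.
      merged.push( { value: v, tag: token.tag } );
    }
  } );
  return merged;
}

var rejoined = mergeContractions( tokens )
  .map( function ( t ) { return t.value; } )
  .join( ' ' );
console.log( rejoined ); // -> I'll go but won't stay
```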

getTokensFP

Returns the finger print of the tokens generated by the last call to tokenize(). A finger print is a string created by sequentially joining the unique code of each token's type. Refer to the table given under defineConfig() for the values of these codes.

A finger print makes it easy to spot patterns in a sentence using regexes, which is otherwise a complex and time-consuming task.

getTokensFP(): string
Returns
string: finger print of tokens generated by the last call to tokenize().
Example
// Generate finger print of sentence given in the previous example
// under tokenize().
myTokenizer.getTokensFP();
// -> 'wwww,wwuw!'
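
As a rough illustration of pattern spotting, the sketch below runs a regex over the finger print from the example above and maps the match back to tokens. It rests on one assumption: each token contributes exactly one character to the finger print, so a match index is also a token index. That holds for this example, but is not verified here for every token type. The token list and finger print are copied by hand from the examples above rather than produced by a live call.

```javascript
// Copied by hand from the tokenize() and getTokensFP() examples.
var tokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];
var fp = 'wwww,wwuw!';

// Pattern: one or more words immediately followed by a URL.
var match = /w+u/.exec( fp );

// Assumption: one finger print character per token, so the match offsets
// can be used directly to slice the token list.
var matched = match ?
  tokens.slice( match.index, match.index + match[ 0 ].length ) :
  [];

var phrase = matched.map( function ( t ) { return t.value; } ).join( ' ' );
console.log( phrase ); // -> check out http://winkjs.org/wink-regression-tree/
```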