wink-tokenizer - Wink JS

Methods

addRegex

addRegex(regex, tag, fingerprintCodeopt) → {void}

Adds a regex for parsing a new type of token. The regex can either be mapped to an existing tag or be used to create a new tag along with its finger print code. The user must ensure that each finger print code is unique.

Added regexes take precedence over the internal parsing.

Example

// Adding a regex for an existing tag
myTokenizer.addRegex( /\(oo\)/gi, 'emoticon' );
myTokenizer.tokenize( '(oo) Hi!' );
// -> [ { value: '(oo)', tag: 'emoticon' },
//      { value: 'Hi', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]

// Adding a regex to parse a new token type
myTokenizer.addRegex( /hello/gi, 'greeting', 'g' );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'greeting' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'how', tag: 'word' },
//      { value: 'are', tag: 'word' },
//      { value: 'you', tag: 'word' },
//      { value: '?', tag: 'punctuation' } ]
// Notice how "hello" is now tagged as "greeting" and not as "word".

// Calling defineConfig() will reset the above!
myTokenizer.defineConfig( { word: true } );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'how', tag: 'word' },
//      { value: 'are', tag: 'word' },
//      { value: 'you', tag: 'word' },
//      { value: '?', tag: 'punctuation' } ]

Parameters

Name	Type	Attributes	Description
regex	RegExp		the new regular expression.
tag	string		tokens matching the `regex` will be assigned this tag.
fingerprintCode	string	<optional>	required if adding a new tag; ignored if using an existing tag.
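
Because uniqueness of finger print codes is left to the caller, it can help to track the codes already in use before calling addRegex(). The following is a minimal sketch under stated assumptions: `createCodeRegistry` and `claim` are hypothetical names that are not part of wink-tokenizer, and the built-in codes are simply copied from the properties table under defineConfig().

```javascript
// Finger print codes listed in the defineConfig() properties table,
// plus 'z', which the documentation assigns to alien tokens.
const builtInCodes = new Set(
  [ 'r', 'e', 'j', 'c', 'h', 'n', 'o', 'q', 't', 'm', 'u', 'w', 'z' ]
);

// Hypothetical helper (not part of wink-tokenizer): remembers which
// custom code belongs to which tag and rejects clashes.
function createCodeRegistry() {
  const customCodes = new Map(); // code -> tag
  return {
    claim( code, tag ) {
      if ( builtInCodes.has( code ) ) {
        throw new Error( `Code "${code}" is already used by a built-in tag.` );
      }
      const owner = customCodes.get( code );
      if ( owner !== undefined && owner !== tag ) {
        throw new Error( `Code "${code}" is already used by tag "${owner}".` );
      }
      customCodes.set( code, tag );
      return code;
    }
  };
}

const registry = createCodeRegistry();
const greetingCode = registry.claim( 'g', 'greeting' );
// greetingCode could then be passed as the third argument, for example:
// myTokenizer.addRegex( /hello/gi, 'greeting', greetingCode );
```

Since punctuation and symbol tokens act as their own finger print, a letter that does not appear in the table is probably the safest choice for a new code.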

Returns

nothing!

Type: void

defineConfig

defineConfig(config) → {number}

Defines the configuration in terms of the types of tokens that will be extracted by the tokenize() method. Note that by default, all types of tokens except quoted_phrase are detected and tagged automatically.

Example

// Do not tokenize & tag @mentions.
myTokenizer.defineConfig( { mention: false } );
// -> 13
// Empty config: only split on spaces (tokens are tagged as alien).
myTokenizer.defineConfig( {} );
// -> 0

Parameters

Name	Type	Description
config	object	It defines 0 or more properties from the list of 14 properties. A true value for a property enables tokenization for that type of text, whereas a false value means that tokenization of that type of text will not be attempted. It also resets the effect of any previous call(s) to the addRegex() API.

An empty config object is equivalent to splitting on spaces. Tokens created this way are tagged as alien, and z is the finger print code of this token type.

The table below gives the name of each property and its description, including examples. The character within parentheses is the finger print code for tokens of that type.

Properties

Name	Type	Attributes	Default	Description
currency	boolean	<optional>	true	such as $ or £ symbols (`r`)
email	boolean	<optional>	true	for example john@acme.com or superman1@gmail.com (`e`)
emoji	boolean	<optional>	true	any standard unicode emojis e.g. 😊 or 😂 or 🎉 (`j`)
emoticon	boolean	<optional>	true	common emoticons such as `:-)` or `:D` (`c`)
hashtag	boolean	<optional>	true	hash tags such as `#happy` or `#followme` (`h`)
number	boolean	<optional>	true	any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing "`, - / .`", for example 12-12-1924 (`n`)
ordinal	boolean	<optional>	true	ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (`o`)
punctuation	boolean	<optional>	true	common punctuation such as `?` or `,` ( token becomes fingerprint )
quoted_phrase	boolean	<optional>	false	any "quoted text" in the sentence. Note: its default value is false. (`q`)
symbol	boolean	<optional>	true	for example `~` or `+` or `&` or `%` or `/` ( token becomes fingerprint )
time	boolean	<optional>	true	common representation of time such as 4pm or 16:00 hours (`t`)
mention	boolean	<optional>	true	@mention as in github or twitter (`m`)
url	boolean	<optional>	true	URL such as https://github.com (`u`)
word	boolean	<optional>	true	word such as faster or résumé or prévenir (`w`)
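
The empty-config behaviour described above (splitting on spaces and tagging every token as alien) can be approximated in a few lines. This is only a sketch of the documented outcome, not the library's implementation; `tokenizeAsAlien` is a hypothetical name, and splitting on any run of whitespace is an assumption.

```javascript
// Hypothetical helper (not part of wink-tokenizer): mimics what the
// documentation describes for an empty config object.
function tokenizeAsAlien( sentence ) {
  return sentence
    .split( /\s+/ )                      // assumption: any whitespace run separates tokens
    .filter( ( value ) => value.length > 0 )
    .map( ( value ) => ( { value: value, tag: 'alien' } ) );
}

const alienTokens = tokenizeAsAlien( 'hello, how are you?' );
// Punctuation stays attached because nothing other than whitespace splits the text.
const alienValues = alienTokens.map( ( token ) => token.value );

// Every alien token contributes the code 'z' to the finger print.
const alienFingerprint = alienTokens.map( () => 'z' ).join( '' );
console.log( alienValues, alienFingerprint );
```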

Returns

number of properties set to true from the list of 14 above.

Type: number

getTokensFP

getTokensFP() → {string}

Returns the finger print of the tokens generated by the last call to tokenize(). A finger print is a string created by sequentially joining the unique code of each token's type. Refer to table given under defineConfig() for values of these codes.

A finger print is useful for spotting patterns in a sentence using regexes, which is otherwise a complex and time-consuming task.

Example

// Generate finger print of sentence given in the previous example
// under tokenize().
myTokenizer.getTokensFP();
// -> 'wwww,wwuw!'
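
As an illustration of spotting patterns with a regex, the sketch below searches the finger print for a word followed by a URL and maps each match back to its tokens. The token array and finger print are copied from the documented examples; `findTokenRuns` is a hypothetical helper, and it assumes that every token contributes exactly one character to the finger print.

```javascript
// Tokens and finger print copied from the documented tokenize() and
// getTokensFP() examples; in real use they would come from the tokenizer.
const tokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];
const fingerprint = 'wwww,wwuw!';

// Hypothetical helper (not part of wink-tokenizer). Assumes one finger
// print character per token, so a match index is also a token index.
function findTokenRuns( tokens, fingerprint, pattern ) {
  const runs = [];
  for ( const match of fingerprint.matchAll( pattern ) ) {
    runs.push( tokens.slice( match.index, match.index + match[ 0 ].length ) );
  }
  return runs;
}

// 'wu' means: a word immediately followed by a URL.
const runs = findTokenRuns( tokens, fingerprint, /wu/g );
console.log( runs.map( ( run ) => run.map( ( token ) => token.value ) ) );
```

The one-character-per-token assumption would not hold if a punctuation or symbol token were longer than one character, so the index mapping should be checked before relying on it.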

Returns

finger print of tokens generated by the last call to tokenize().

Type: string

tokenize

tokenize(sentence) → {Array.<object>}

Tokenizes the input sentence using the configuration specified via defineConfig(). Common contractions and possessive nouns are split into 2 separate tokens; for example, I'll is split into 'I' and '\'ll', and won't into 'wo' and 'n\'t'.

Example

var s = 'For detailed API docs, check out http://winkjs.org/wink-regression-tree/ URL!';
myTokenizer.tokenize( s );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'detailed', tag: 'word' },
//      { value: 'API', tag: 'word' },
//      { value: 'docs', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]
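
The contraction splitting described for tokenize() can be pictured with a small sketch. It is not the library's implementation: `splitContraction` is a hypothetical function, and it only handles the two suffixes used in the description ('ll and n't).

```javascript
// Hypothetical helper (not part of wink-tokenizer): splits a word that
// ends in "n't" or "'ll" into two parts, and leaves other words alone.
function splitContraction( word ) {
  // The lazy first group keeps the "n" of "n't" in the second part.
  const match = /^(.+?)(n't|'ll)$/i.exec( word );
  return match ? [ match[ 1 ], match[ 2 ] ] : [ word ];
}

const parts = [ "I'll", "won't", 'docs' ].map( splitContraction );
console.log( parts );
```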

Parameters

Name	Type	Description
sentence	string	the input sentence.

Returns

array of tokens; each one is an object with 2 keys, viz. value and tag, the latter identifying the type of the token.

Type: Array.<object>
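
Since every token is a plain object with value and tag keys, the returned array can be processed with ordinary array methods. The sketch below counts tokens per tag and collects URL values; the sample array is copied from the tokenize() example, and `countByTag` and `valuesWithTag` are hypothetical helper names.

```javascript
// Sample output copied from the documented tokenize() example.
const sampleTokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];

// Hypothetical helper: number of tokens seen for each tag.
function countByTag( tokens ) {
  const counts = {};
  for ( const token of tokens ) {
    counts[ token.tag ] = ( counts[ token.tag ] || 0 ) + 1;
  }
  return counts;
}

// Hypothetical helper: values of all tokens carrying the given tag.
function valuesWithTag( tokens, tag ) {
  return tokens.filter( ( token ) => token.tag === tag ).map( ( token ) => token.value );
}

const tagCounts = countByTag( sampleTokens );
const urls = valuesWithTag( sampleTokens, 'url' );
console.log( tagCounts, urls );
```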