Tokenizer

Tokenizer class

Methods

addRegex

addRegex(regex, tag, fingerprintCodeopt) → {void}

Adds a regex for parsing a new type of token. The regex can either be mapped to an existing tag or be used to create a new tag along with its fingerprint code. The user must ensure that fingerprint codes are unique.

Added regexes take precedence over the internal parsing.

Example
// Adding a regex for an existing tag
myTokenizer.addRegex( /\(oo\)/gi, 'emoticon' );
myTokenizer.tokenize( '(oo) Hi!' );
// -> [ { value: '(oo)', tag: 'emoticon' },
//      { value: 'Hi', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]

// Adding a regex to parse a new token type
myTokenizer.addRegex( /hello/gi, 'greeting', 'g' );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'greeting' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'how', tag: 'word' },
//      { value: 'are', tag: 'word' },
//      { value: 'you', tag: 'word' },
//      { value: '?', tag: 'punctuation' } ]
// Notice how "hello" is now tagged as "greeting" and not as "word".

// Using defineConfig will reset the above!
myTokenizer.defineConfig( { word: true } );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'how', tag: 'word' },
//      { value: 'are', tag: 'word' },
//      { value: 'you', tag: 'word' },
//      { value: '?', tag: 'punctuation' } ]
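
The precedence and reset behaviour in the example above can be pictured with a toy rule list. This is only a sketch under stated assumptions, not wink-tokenizer's implementation: `makeToyTokenizer`, its two built-in rules, and `resetRules` (a stand-in for the reset that defineConfig() performs) are hypothetical names introduced for illustration.

```javascript
// Toy illustration only; NOT wink-tokenizer's internals.
function makeToyTokenizer() {
  // Hypothetical built-in rules, tried in order.
  const builtin = [
    { regex: /[a-z]+/gi, tag: 'word' },
    { regex: /[.,!?]/g, tag: 'punctuation' }
  ];
  let custom = [];

  return {
    // Custom rules are consulted before the built-in ones.
    addRegex( regex, tag ) { custom.push( { regex, tag } ); },
    // Stand-in for the reset that defineConfig() performs.
    resetRules() { custom = []; },
    tokenize( sentence ) {
      const rules = custom.concat( builtin );
      const tokens = [];
      let pos = 0;
      while ( pos < sentence.length ) {
        if ( /\s/.test( sentence[ pos ] ) ) { pos += 1; continue; }
        let hit = null;
        for ( const { regex, tag } of rules ) {
          // Sticky copy so the rule can only match exactly at `pos`.
          const flags = regex.flags.replace( /[gy]/g, '' ) + 'y';
          const sticky = new RegExp( regex.source, flags );
          sticky.lastIndex = pos;
          const m = sticky.exec( sentence );
          if ( m && m[ 0 ].length > 0 ) { hit = { value: m[ 0 ], tag }; break; }
        }
        // Anything no rule recognizes becomes a one-character 'alien' token.
        if ( !hit ) hit = { value: sentence[ pos ], tag: 'alien' };
        tokens.push( hit );
        pos += hit.value.length;
      }
      return tokens;
    }
  };
}

const toy = makeToyTokenizer();
toy.addRegex( /hello/gi, 'greeting' );
const tagged = toy.tokenize( 'hello, how are you?' ).map( ( t ) => t.tag );
// tagged[ 0 ] is 'greeting' because the custom rule is tried first.
toy.resetRules();
const retagged = toy.tokenize( 'hello, how are you?' ).map( ( t ) => t.tag );
// retagged[ 0 ] is 'word' again once the custom rules are cleared.
```

Trying custom rules first is the simplest way to let a user-supplied pattern win over a built-in one that would also match the same text.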
Parameters
Name | Type | Attributes | Description
regex | RegExp | | the new regular expression.
tag | string | | tokens matching the regex will be assigned this tag.
fingerprintCode | string | <optional> | required if adding a new tag; ignored if using an existing tag.

Returns

nothing!

Type
void
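
Because uniqueness of fingerprint codes is left to the caller, it may help to check a candidate code before passing it to addRegex(). The sketch below is an assumption-laden illustration: `isCodeAvailable` is a hypothetical helper, not part of the library; the built-in codes are transcribed from the table under defineConfig(); and it assumes single-character codes, as all the built-in ones are.

```javascript
// Codes transcribed from the defineConfig() property table, plus 'z' for alien.
const BUILT_IN_CODES = new Set( 'rejchnoqtmuwz'.split( '' ) );

// Hypothetical helper; not part of the library's API.
// Assumes a fingerprint code is a single character.
function isCodeAvailable( code, customCodes = new Set() ) {
  return typeof code === 'string' &&
    code.length === 1 &&
    !BUILT_IN_CODES.has( code ) &&
    !customCodes.has( code );
}

const myCodes = new Set();
if ( isCodeAvailable( 'g', myCodes ) ) myCodes.add( 'g' ); // 'g' is free
const clashesWithWord = !isCodeAvailable( 'w', myCodes );  // 'w' already means word
const alreadyTaken = !isCodeAvailable( 'g', myCodes );     // 'g' was claimed above
```

Punctuation and symbol tokens use their own value as the fingerprint, so a stricter check might also reject such characters.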

defineConfig

defineConfig(config) → {number}

Defines which types of tokens the tokenize() method will extract. Note that by default, all types of tokens are detected and tagged automatically.

Example
// Do not tokenize & tag @mentions.
myTokenizer.defineConfig( { mention: false } );
// -> 13
// An empty config turns off every token type.
myTokenizer.defineConfig( {} );
// -> 0
Parameters
Name Type Description
config object

Defines zero or more of the 14 properties listed below. A true value for a property enables tokenization of that type of text, whereas a false value means that tokenization of that type of text will not be attempted. Calling defineConfig() also resets the effect of any previous calls to addRegex().

An empty config object is equivalent to splitting on spaces. The resulting tokens are tagged as alien, and z is the fingerprint code of this token type.

The table below gives the name of each property and its description, including examples. The character within parentheses is the fingerprint code for tokens of that type.

Properties
Name | Type | Attributes | Default | Description
currency | boolean | <optional> | true | such as $ or £ symbols (r)
email | boolean | <optional> | true | for example john@acme.com or superman1@gmail.com (e)
emoji | boolean | <optional> | true | any standard unicode emojis e.g. 😊 or 😂 or 🎉 (j)
emoticon | boolean | <optional> | true | common emoticons such as :-) or :D (c)
hashtag | boolean | <optional> | true | hash tags such as #happy or #followme (h)
number | boolean | <optional> | true | any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing ", - / .", for example 12-12-1924 (n)
ordinal | boolean | <optional> | true | ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (o)
punctuation | boolean | <optional> | true | common punctuation such as ? or , ( token becomes fingerprint )
quoted_phrase | boolean | <optional> | false | any "quoted text" in the sentence. Note: its default value is false. (q)
symbol | boolean | <optional> | true | for example ~ or + or & or % or / ( token becomes fingerprint )
time | boolean | <optional> | true | common representation of time such as 4pm or 16:00 hours (t)
mention | boolean | <optional> | true | @mention as in github or twitter (m)
url | boolean | <optional> | true | URL such as https://github.com (u)
word | boolean | <optional> | true | word such as faster or résumé or prévenir (w)

Returns

The number of properties set to true, out of the 14 listed above.

Type
number
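
The empty-config behaviour described above (splitting on spaces and tagging every piece as alien) can be approximated in a few lines. This is a rough sketch, not the library's code: `splitAsAlien` is a hypothetical name, and it assumes that "splitting on spaces" means splitting on runs of whitespace.

```javascript
// Rough equivalent of tokenizing with an empty config.
// Hypothetical helper; a sketch, not the library's implementation.
function splitAsAlien( sentence ) {
  return sentence
    .split( /\s+/ )
    .filter( ( piece ) => piece.length > 0 )
    .map( ( piece ) => ( { value: piece, tag: 'alien' } ) );
}

const aliens = splitAsAlien( 'hello, how are you?' );
// -> [ { value: 'hello,', tag: 'alien' },
//      { value: 'how', tag: 'alien' },
//      { value: 'are', tag: 'alien' },
//      { value: 'you?', tag: 'alien' } ]

// Every alien token contributes the code 'z' to the fingerprint.
const alienFingerprint = aliens.map( () => 'z' ).join( '' );
// -> 'zzzz'
```

Note that punctuation stays attached to its neighbouring text here, because no token type is enabled to separate it.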

getTokensFP

getTokensFP() → {string}

Returns the fingerprint of the tokens generated by the last call to tokenize(). A fingerprint is a string created by sequentially joining the unique code of each token's type. Refer to the table given under defineConfig() for the values of these codes.

A fingerprint makes it easy to spot patterns in a sentence using regexes, a task that is otherwise complex and time-consuming.

Example
// Generate the fingerprint of the sentence used in the example
// under tokenize().
myTokenizer.getTokensFP();
// -> 'wwww,wwuw!'
Returns

The fingerprint of the tokens generated by the last call to tokenize().

Type
string
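
To make the idea of pattern spotting concrete, the sketch below rebuilds the fingerprint from a token array and then uses a regex on it. It is an illustration rather than the library's code: `fingerprintOf` is a hypothetical helper, the code map is transcribed from the table under defineConfig(), and the token array is copied from the example under tokenize().

```javascript
// Codes transcribed from the defineConfig() table; punctuation and symbol
// tokens use their own value as the code.
const FP_CODES = {
  currency: 'r', email: 'e', emoji: 'j', emoticon: 'c', hashtag: 'h',
  number: 'n', ordinal: 'o', quoted_phrase: 'q', time: 't', mention: 'm',
  url: 'u', word: 'w', alien: 'z'
};

// Hypothetical helper mirroring the described behaviour of getTokensFP().
function fingerprintOf( tokens ) {
  return tokens
    .map( ( t ) => ( t.tag === 'punctuation' || t.tag === 'symbol' ) ? t.value : FP_CODES[ t.tag ] )
    .join( '' );
}

// Tokens copied from the tokenize() example.
const tokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];

const fingerprint = fingerprintOf( tokens );
// -> 'wwww,wwuw!'

// With one fingerprint character per token (true here, since each punctuation
// token is a single character), a regex match index is also a token index.
const match = /wu/.exec( fingerprint );
const wordBeforeUrl = match ? tokens[ match.index ].value : null;
// -> 'out'
```

The index mapping relies on every token contributing exactly one character; a multi-character punctuation or symbol token would break that correspondence.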

tokenize

tokenize(sentence) → {Array.<object>}

Tokenizes the input sentence using the configuration specified via defineConfig(). Common contractions and possessive nouns are split into two separate tokens; for example, I'll is split into 'I' and '\'ll', and won't into 'wo' and 'n\'t'.
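
The splitting of contractions and possessives can be illustrated with a small helper. This is a simplified sketch, not wink-tokenizer's actual rules: `splitContraction` is a hypothetical function, and the list of suffixes it recognizes is an assumption chosen to cover the examples above.

```javascript
// Hypothetical helper; a simplified sketch of contraction splitting.
function splitContraction( word ) {
  // Try "n't" first so that won't -> wo + n't (rather than won + 't).
  const match =
    /^(.+)(n't)$/i.exec( word ) ||
    /^(.+)('(?:ll|re|ve|s|d|m))$/i.exec( word );
  return match ? [ match[ 1 ], match[ 2 ] ] : [ word ];
}

const parts = [ "I'll", "won't", "John's", 'hello' ].map( splitContraction );
// -> [ [ 'I', "'ll" ], [ 'wo', "n't" ], [ 'John', "'s" ], [ 'hello' ] ]
```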

Example
var s = 'For detailed API docs, check out http://winkjs.org/wink-regression-tree/ URL!';
myTokenizer.tokenize( s );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'detailed', tag: 'word' },
//      { value: 'API', tag: 'word' },
//      { value: 'docs', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]
Parameters
Name | Type | Description
sentence | string | the input sentence.

Returns

An array of tokens; each token is an object with two keys: value, and tag, which identifies the type of the token.

Type
Array.<object>
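
Since the result is a plain array of { value, tag } objects, standard array methods are enough for most post-processing. The sketch below works on a literal copy of the example output above, so it runs without the library; the variable names are illustrative only.

```javascript
// Literal copy of the example output shown under tokenize().
const sampleTokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];

// Pull out the values of all tokens carrying a given tag.
const urls = sampleTokens
  .filter( ( t ) => t.tag === 'url' )
  .map( ( t ) => t.value );
// -> [ 'http://winkjs.org/wink-regression-tree/' ]

// Count how many tokens of each type were produced.
const countsByTag = sampleTokens.reduce( ( acc, t ) => {
  acc[ t.tag ] = ( acc[ t.tag ] || 0 ) + 1;
  return acc;
}, {} );
// -> { word: 7, punctuation: 2, url: 1 }
```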