Methods
addRegex
Adds a regex for parsing a new type of token. This regex can either be mapped to an existing tag or be used to create a new tag along with its finger print code. The user must ensure that finger print codes are unique.
The added regex(s) will supersede the internal parsing.
Example
// Adding a regex for an existing tag
myTokenizer.addRegex( /\(oo\)/gi, 'emoticon' );
myTokenizer.tokenize( '(oo) Hi!' )
// -> [ { value: '(oo)', tag: 'emoticon' },
// { value: 'Hi', tag: 'word' },
// { value: '!', tag: 'punctuation' } ]
// Adding a regex to parse a new token type
myTokenizer.addRegex( /hello/gi, 'greeting', 'g' );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'greeting' },
// { value: ',', tag: 'punctuation' },
// { value: 'how', tag: 'word' },
// { value: 'are', tag: 'word' },
// { value: 'you', tag: 'word' },
// { value: '?', tag: 'punctuation' } ]
// Notice how "hello" is now tagged as "greeting" and not as "word".
// Using defineConfig will reset the above!
myTokenizer.defineConfig( { word: true } );
myTokenizer.tokenize( 'hello, how are you?' );
// -> [ { value: 'hello', tag: 'word' },
// { value: ',', tag: 'punctuation' },
// { value: 'how', tag: 'word' },
// { value: 'are', tag: 'word' },
// { value: 'you', tag: 'word' },
// { value: '?', tag: 'punctuation' } ]
Parameters
Name | Type | Attributes | Description |
---|---|---|---|
regex | RegExp | | the new regular expression. |
tag | string | | tokens matching the regex are assigned this tag. |
fingerprintCode | string | <optional> | required if adding a new tag; ignored if using an existing tag. |
Returns
nothing!
- Type
- void
defineConfig
Defines the configuration in terms of the types of tokens that will be extracted by the tokenize() method. Note that by default, all types of tokens are detected and tagged automatically.
Example
// Do not tokenize & tag @mentions.
myTokenizer.defineConfig( { mention: false } );
// -> 13
// Only tokenize words as defined above.
myTokenizer.defineConfig( {} );
// -> 0
Parameters
Name | Type | Description |
---|---|---|
config | object | It defines 0 or more properties from the list of 14 properties. A true value for a property ensures tokenization for that type of text, whereas a false value means that tokenization of that type of text will not be attempted. It also resets the effect of any previous call(s) to addRegex(). An empty config object is equivalent to splitting on spaces; tokens created this way are tagged as alien. The table below gives the name of each property and its description, including examples. The character within parentheses is the finger print code for tokens of that type. |
Returns
number of properties set to true, out of the 14 supported properties.
- Type
- number
getTokensFP
Returns the finger print of the tokens generated by the last call to tokenize(). A finger print is a string created by sequentially joining the unique code of each token's type. Refer to the table given under defineConfig() for values of these codes.
A finger print is extremely useful for spotting patterns in a sentence using regexes, which is otherwise a complex and time-consuming task.
Example
// Generate the finger print of the sentence given in the example
// under tokenize().
myTokenizer.getTokensFP();
// -> 'wwww,wwuw!'
Returns
finger print of tokens generated by the last call to tokenize().
- Type
- string
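The pattern-spotting idea described above can be sketched without calling the library at all. The snippet below reuses the tokens and the finger print shown in the examples of this document, and assumes that every token contributes exactly one character to the finger print, so a match position in the finger print is also a token index. The helper name findPatterns is hypothetical and is not part of the tokenizer's API.

```javascript
// Tokens and finger print copied from the tokenize() and getTokensFP()
// examples in this document; no library call is made here.
const tokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];
const fp = 'wwww,wwuw!';

// Hypothetical helper (an assumption, not a library function): runs a
// global regex over the finger print and returns, for each match, the
// values of the tokens it covers. Assumes one character per token.
function findPatterns( tokenList, fingerPrint, pattern ) {
  const results = [];
  for ( const match of fingerPrint.matchAll( pattern ) ) {
    const start = match.index;
    const end = start + match[ 0 ].length;
    results.push( tokenList.slice( start, end ).map( ( t ) => t.value ) );
  }
  return results;
}

// One or more words immediately followed by a URL.
const hits = findPatterns( tokens, fp, /w+u/g );
console.log( hits );
// -> [ [ 'check', 'out', 'http://winkjs.org/wink-regression-tree/' ] ]
```

The regex must carry the g flag because matchAll() requires it. The same approach works for any pattern that can be written over the finger print codes.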
tokenize
Tokenizes the input sentence using the configuration specified via defineConfig().
Common contractions and possessive nouns are split into 2 separate tokens; for example, I'll splits as 'I' and '\'ll', and won't splits as 'wo' and 'n\'t'.
Example
var s = 'For detailed API docs, check out http://winkjs.org/wink-regression-tree/ URL!';
myTokenizer.tokenize( s );
// -> [ { value: 'For', tag: 'word' },
// { value: 'detailed', tag: 'word' },
// { value: 'API', tag: 'word' },
// { value: 'docs', tag: 'word' },
// { value: ',', tag: 'punctuation' },
// { value: 'check', tag: 'word' },
// { value: 'out', tag: 'word' },
// { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
// { value: 'URL', tag: 'word' },
// { value: '!', tag: 'punctuation' } ]
Parameters
Name | Type | Description |
---|---|---|
sentence | string | the input sentence. |
Returns
An array of tokens; each token is an object with 2 keys, viz. value and tag, where tag identifies the type of the token.
- Type
- Array.<object>
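Because each returned token is a plain object with a value and a tag, the result is easy to post-process. The following is a minimal sketch that groups token values by tag; it uses the output of the example above as literal data rather than calling the library, and the helper name valuesByTag is hypothetical.

```javascript
// Output of the tokenize() example above, used here as literal data.
const tokens = [
  { value: 'For', tag: 'word' },
  { value: 'detailed', tag: 'word' },
  { value: 'API', tag: 'word' },
  { value: 'docs', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'check', tag: 'word' },
  { value: 'out', tag: 'word' },
  { value: 'http://winkjs.org/wink-regression-tree/', tag: 'url' },
  { value: 'URL', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];

// Hypothetical helper (not part of the library): collects token values
// under their tag, preserving the order in which they appear.
function valuesByTag( tokenList ) {
  const groups = {};
  for ( const { value, tag } of tokenList ) {
    if ( !groups[ tag ] ) groups[ tag ] = [];
    groups[ tag ].push( value );
  }
  return groups;
}

const groups = valuesByTag( tokens );
console.log( groups.url );
// -> [ 'http://winkjs.org/wink-regression-tree/' ]
console.log( groups.punctuation );
// -> [ ',', '!' ]
```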