tokens

Tokens

Methods

appendBigrams

appendBigrams(tokens) → {Array.<string>}

Generates bigrams from the input tokens and appends them to the input tokens.

Example
appendBigrams( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'he',
//      'acted',
//      'decisively',
//      'today',
//      'he_acted',
//      'acted_decisively',
//      'decisively_today' ]
Parameters
Name Type Description
tokens Array.<string>

the input tokens.

Returns

the input tokens appended with their bigrams.

Type
Array.<string>
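
The behavior described above can be approximated in a few lines. This is a minimal sketch, not the library's source: the name appendBigramsSketch is hypothetical, the underscore joiner is inferred from the example output, and the sketch returns a new array rather than modifying the input (whether the library mutates its argument is not stated here).

```javascript
// Illustrative sketch only; appendBigramsSketch is a hypothetical name.
function appendBigramsSketch(tokens) {
  // Copy the input so the caller's array is left untouched (an assumption).
  const result = tokens.slice();
  // Join each adjacent pair with '_' and append it after the original tokens.
  for (let i = 0; i < tokens.length - 1; i += 1) {
    result.push(tokens[i] + '_' + tokens[i + 1]);
  }
  return result;
}

appendBigramsSketch(['he', 'acted', 'decisively', 'today']);
// -> [ 'he', 'acted', 'decisively', 'today',
//      'he_acted', 'acted_decisively', 'decisively_today' ]
```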

bagOfWords

bagOfWords(tokens, logCountsopt, ifnopt, idxopt) → {object}

Generates the bag of words from the input tokens. By default, each word's frequency is its count; if the logCounts parameter is set to true, the frequency is log2( word count + 1 ) instead. It also has an alias, bow().

Example
bagOfWords( [ 'rain', 'rain', 'go', 'away' ] );
// -> { rain: 2, go: 1, away: 1 }
bow( [ 'rain', 'rain', 'go', 'away' ], true );
// -> { rain: 1.584962500721156, go: 1, away: 1 }
Parameters
Name Type Attributes Default Description
tokens Array.<string>

the input tokens.

logCounts number <optional>
false

a true value flags the use of log2( word count + 1 ) instead of just word count as frequency.

ifn function <optional>

a function to build an index; it is called once for each unique word in tokens and receives the word and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, no index is built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

bag of words from tokens.

Type
object
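
To show how the counting, the log2 option, and the ifn callback fit together, here is a minimal sketch. It is not the library's implementation: bagOfWordsSketch, postings, and build are hypothetical names, and build only stands in for an indexer such as the one helper.returnIndexer provides.

```javascript
// Illustrative sketch only; bagOfWordsSketch is a hypothetical name.
function bagOfWordsSketch(tokens, logCounts = false, ifn, idx) {
  const counts = new Map();
  for (const word of tokens) {
    if (!counts.has(word)) {
      counts.set(word, 0);
      // Call the index builder once per unique word, passing idx through.
      if (typeof ifn === 'function') ifn(word, idx);
    }
    counts.set(word, counts.get(word) + 1);
  }
  if (logCounts) {
    // Replace each raw count with log2( count + 1 ).
    for (const [word, count] of counts) counts.set(word, Math.log2(count + 1));
  }
  return Object.fromEntries(counts);
}

// A hypothetical index builder: records which document ids contain each word.
const postings = {};
const build = (word, idx) => {
  (postings[word] = postings[word] || []).push(idx);
};

bagOfWordsSketch(['rain', 'rain', 'go', 'away'], false, build, 0);
// -> { rain: 2, go: 1, away: 1 }
// postings -> { rain: [ 0 ], go: [ 0 ], away: [ 0 ] }
```

Passing a different idx for each document lets the same callback accumulate a simple inverted index across several calls.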

bigrams

bigrams(tokens) → {Array.<Array.<string>>}

Generates bigrams from the input tokens.

Example
bigrams( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ [ 'he', 'acted' ],
//      [ 'acted', 'decisively' ],
//      [ 'decisively', 'today' ] ]
Parameters
Name Type Description
tokens Array.<string>

the input tokens.

Returns

the bigrams.

Type
Array.<Array.<string>>
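
As the example shows, each bigram is returned as a two-element array of adjacent tokens. A minimal sketch of that pairing, using the hypothetical name bigramsSketch (not the library's source):

```javascript
// Illustrative sketch only; bigramsSketch is a hypothetical name.
function bigramsSketch(tokens) {
  const pairs = [];
  // Slide a window of width two over the tokens.
  for (let i = 0; i < tokens.length - 1; i += 1) {
    pairs.push([tokens[i], tokens[i + 1]]);
  }
  return pairs;
}

bigramsSketch(['he', 'acted', 'decisively', 'today']);
// -> [ [ 'he', 'acted' ], [ 'acted', 'decisively' ], [ 'decisively', 'today' ] ]
```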

phonetize

phonetize(tokens) → {Array.<string>}

Phonetizes input tokens using an algorithmic adaptation of Metaphone.

Example
phonetize( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'h', 'aktd', 'dssvl', 'td' ]
Parameters
Name Type Description
tokens Array.<string>

the input tokens.

Returns

phonetized tokens.

Type
Array.<string>

propagateNegations

propagateNegations(tokens, uptoopt) → {Array.<string>}

Looks for negation tokens in the input array of tokens and propagates the negation to the next upto tokens by prefixing each of them with a !. This is useful for handling text containing negations in tasks such as similarity detection, classification, or search.

Example
propagateNegations( [ 'mary', 'is', 'not', 'feeling', 'good', 'today' ] );
// -> [ 'mary', 'is', 'not', '!feeling', '!good', 'today' ]
Parameters
Name Type Attributes Default Description
tokens Array.<string>

the input tokens.

upto number <optional>
2

number of tokens to be negated after the negation token. Note that negation stops after upto tokens or at the token preceding any of the punctuation marks , . ; : ! ?, whichever comes first.

Returns

tokens with negation propagated.

Type
Array.<string>
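
The interaction between upto and the stopping punctuation can be sketched as below. This is an assumption-laden illustration, not the library's code: propagateNegationsSketch is a hypothetical name, the NEGATIONS list is a small stand-in (the library's actual list of negation tokens is not given here), and restarting the window on a second negation token is a design choice of the sketch.

```javascript
// Illustrative sketch only; the negation list below is an assumed stand-in.
const NEGATIONS = new Set(['not', 'no', 'never', 'neither', 'nor']);
const STOPPERS = new Set([',', '.', ';', ':', '!', '?']);

function propagateNegationsSketch(tokens, upto = 2) {
  const result = [];
  let remaining = 0; // how many upcoming tokens still need a '!' prefix
  for (const token of tokens) {
    if (STOPPERS.has(token)) {
      remaining = 0; // punctuation ends the negation scope
      result.push(token);
    } else if (NEGATIONS.has(token)) {
      remaining = upto; // open (or restart) the negation window
      result.push(token);
    } else if (remaining > 0) {
      result.push('!' + token);
      remaining -= 1;
    } else {
      result.push(token);
    }
  }
  return result;
}

propagateNegationsSketch(['mary', 'is', 'not', 'feeling', 'good', 'today']);
// -> [ 'mary', 'is', 'not', '!feeling', '!good', 'today' ]
propagateNegationsSketch(['not', 'good', ',', 'but', 'fine']);
// -> [ 'not', '!good', ',', 'but', 'fine' ]
```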

removeWords

removeWords(tokens, stopWordsopt) → {Array.<string>}

Removes the stop words from the input array of tokens.

Example
removeWords( [ 'this', 'is', 'a', 'cat' ] );
// -> [ 'cat' ]
Parameters
Name Type Attributes Default Description
tokens Array.<string>

the input tokens.

stopWords wordsFilter <optional>
defaultStopWords

default stop words are loaded from stop_words.json located under the src/dictionaries/ directory. Custom stop words can be created using helper.returnWordsFilter.

Returns

the tokens that remain after the stop words are removed.

Type
Array.<string>
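
Conceptually, stop word removal is a membership filter. The sketch below uses a plain Set as a stand-in for the wordsFilter object; removeWordsSketch and sampleStopWords are hypothetical names, and the three-word list is only an example, not the contents of stop_words.json.

```javascript
// Illustrative sketch only; a Set stands in for the library's wordsFilter.
const sampleStopWords = new Set(['this', 'is', 'a']);

function removeWordsSketch(tokens, stopWords = sampleStopWords) {
  // Keep only the tokens that are not in the stop word set.
  return tokens.filter((token) => !stopWords.has(token));
}

removeWordsSketch(['this', 'is', 'a', 'cat']);
// -> [ 'cat' ]
```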

setOfWords

setOfWords(tokens, ifnopt, idxopt) → {set}

Generates the set of words from the input tokens. It also has an alias, sow().

Example
setOfWords( [ 'rain', 'rain', 'go', 'away' ] );
// -> Set { 'rain', 'go', 'away' }
Parameters
Name Type Attributes Description
tokens Array.<string>

the input tokens.

ifn function <optional>

a function to build an index; it is called for every member word of the set and receives the word and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, no index is built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

set of words from tokens.

Type
set
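
A minimal sketch of the same idea, assuming the result is a native JavaScript Set and that ifn is invoked once per member word. The names setOfWordsSketch and seen are hypothetical; this is not the library's source.

```javascript
// Illustrative sketch only; setOfWordsSketch is a hypothetical name.
function setOfWordsSketch(tokens, ifn, idx) {
  const words = new Set(tokens); // duplicates collapse automatically
  if (typeof ifn === 'function') {
    // Call the index builder once for each member of the set.
    for (const word of words) ifn(word, idx);
  }
  return words;
}

// A hypothetical callback that records every (word, idx) pair it receives.
const seen = [];
setOfWordsSketch(['rain', 'rain', 'go', 'away'], (word, idx) => seen.push([word, idx]), 7);
// -> Set { 'rain', 'go', 'away' }
// seen -> [ [ 'rain', 7 ], [ 'go', 7 ], [ 'away', 7 ] ]
```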

soundex

soundex(tokens) → {Array.<string>}

Generates the soundex coded tokens from the input tokens.

Example
soundex( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'H000', 'A233', 'D221', 'T300' ]
Parameters
Name Type Description
tokens Array.<string>

the input tokens.

Returns

soundex coded tokens.

Type
Array.<string>
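
For readers unfamiliar with Soundex, the sketch below implements the standard American Soundex rules: keep the first letter, map the remaining consonants to digits, skip repeats of the same digit unless a vowel separates them, and pad to four characters. It reproduces the example output above, but whether the library follows exactly these rules in every edge case is an assumption; soundexWord and soundexSketch are hypothetical names.

```javascript
// Illustrative sketch of standard American Soundex; not the library's source.
const CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6
};

function soundexWord(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) return '';
  let code = letters[0].toUpperCase();
  let previous = CODES[letters[0]] || 0; // the first letter's digit also suppresses repeats
  for (let i = 1; i < letters.length && code.length < 4; i += 1) {
    const ch = letters[i];
    if (ch === 'h' || ch === 'w') continue; // h and w do not separate equal digits
    const digit = CODES[ch] || 0; // vowels and y map to 0
    if (digit !== 0 && digit !== previous) code += digit;
    previous = digit; // a vowel resets previous, so a repeated digit is coded again
  }
  return code.padEnd(4, '0');
}

function soundexSketch(tokens) {
  return tokens.map(soundexWord);
}

soundexSketch(['he', 'acted', 'decisively', 'today']);
// -> [ 'H000', 'A233', 'D221', 'T300' ]
```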

stem

stem(tokens) → {Array.<string>}

Stems input tokens using the Porter Stemming Algorithm Version 2.

Example
stem( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'he', 'act', 'decis', 'today' ]
Parameters
Name Type Description
tokens Array.<string>

the input tokens.

Returns

stemmed tokens.

Type
Array.<string>