string

String

Methods

amplifyNotElision

amplifyNotElision(str) → {string}

Amplifies the not elision (n't) by expanding it into not; for example, isn't becomes is not.

Example
amplifyNotElision( "someone's wallet, isn't it?" );
// -> "someone's wallet, is not it?"
Parameters
Name Type Description
str string

the input string.

Returns

input string after not elision amplification.

Type
string
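The expansion described above can be approximated with a single regular expression. This is a minimal sketch, not the library's implementation; the name amplifyNotElisionSketch is hypothetical, and it assumes that a not elision is any word ending in n't.

```javascript
// Illustrative sketch only; not the library's actual code.
// Assumption: a "not elision" is any word that ends in n't.
function amplifyNotElisionSketch(str) {
  // Keep the letter before n't and replace the n't suffix with " not".
  // Possessives such as someone's contain no n't, so they are untouched.
  return str.replace(/([a-z])n't\b/gi, '$1 not');
}

console.log(amplifyNotElisionSketch("someone's wallet, isn't it?"));
// -> "someone's wallet, is not it?"
```

Irregular contractions are not modeled here: this sketch turns can't into "ca not" and won't into "wo not".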

bagOfNGrams

bagOfNGrams(str, sizeopt, ifnopt, idxopt) → {object}

Generates a bag of ngrams of the given size from the input string: an object that maps each ngram to its number of occurrences. The default size is 2, so it generates a bag of bigrams by default. It also has an alias, bong().

Example
bagOfNGrams( 'mama' );
// -> { ma: 2, am: 1 }
bong( 'mamma' );
// -> { ma: 2, am: 1, mm: 1 }
Parameters
Name Type Attributes Default Description
str string

the input string.

size number <optional>
2

ngram size.

ifn function <optional>

a function to build an index; it is called for every unique ngram of str and receives the ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, the index is not built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

bag of ngrams of size from str.

Type
object
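To make the role of ifn and idx concrete, the following sketch counts ngrams and calls an index-building callback once per unique ngram. It is an assumption-laden illustration: bagOfNGramsSketch, index and build are hypothetical names, and build only stands in for the build() function of helper.returnIndexer, whose exact behavior is not shown in this page.

```javascript
// Illustrative sketch only; not the library's actual code.
function bagOfNGramsSketch(str, size = 2, ifn, idx) {
  const bag = Object.create(null); // no inherited keys to collide with ngrams
  for (let i = 0; i + size <= str.length; i += 1) {
    const ng = str.slice(i, i + size);
    if (bag[ng] === undefined) {
      bag[ng] = 1;
      // Assumption: ifn is called once per unique ngram, with ( ngram, idx ).
      if (typeof ifn === 'function') ifn(ng, idx);
    } else {
      bag[ng] += 1;
    }
  }
  return bag;
}

// Hypothetical indexer: maps each ngram to the ids of the strings containing it.
const index = new Map();
const build = (ng, idx) => {
  if (!index.has(ng)) index.set(ng, []);
  index.get(ng).push(idx);
};

console.log(JSON.stringify(bagOfNGramsSketch('mama', 2, build, 0)));
// -> {"ma":2,"am":1}
console.log(JSON.stringify(bagOfNGramsSketch('mamma', 2, build, 1)));
// -> {"ma":2,"am":1,"mm":1}
console.log(JSON.stringify([...index]));
// -> [["ma",[0,1]],["am",[0,1]],["mm",[1]]]
```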

composeCorpus

composeCorpus(str) → {Array.<string>}

Generates all possible sentences from the input argument string. The string str must follow a special syntax, as illustrated in the example below:
'[I] [am having|have] [a] [problem|question]'

Each phrase must be enclosed in [ ], and the alternative options within a phrase (if any) must be separated by a | character. The corpus is composed by computing the Cartesian product of the options of all the phrases.

Example
composeCorpus( '[I] [am having|have] [a] [problem|question]' );
// -> [ 'I am having a problem',
//      'I am having a question',
//      'I have a problem',
//      'I have a question' ]
Parameters
Name Type Description
str string

the input string.

Returns

array of all possible sentences.

Type
Array.<string>
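The Cartesian product mentioned above can be sketched as a fold over the bracketed phrases. This is a minimal illustration under the syntax described in this section, not the library's implementation; composeCorpusSketch is a hypothetical name.

```javascript
// Illustrative sketch only; not the library's actual code.
function composeCorpusSketch(str) {
  // Collect every [ ... ] phrase; anything outside brackets is ignored.
  const phrases = str.match(/\[[^\]]*\]/g);
  if (phrases === null) return [];
  // Strip the brackets and split each phrase into its | separated options.
  const options = phrases.map((p) => p.slice(1, -1).split('|'));
  // Fold the option lists into their Cartesian product, left to right.
  return options.reduce(
    (sentences, opts) =>
      sentences.flatMap((s) => opts.map((o) => (s === '' ? o : s + ' ' + o))),
    ['']
  );
}

console.log(composeCorpusSketch('[I] [am having|have] [a] [problem|question]'));
// -> [ 'I am having a problem',
//      'I am having a question',
//      'I have a problem',
//      'I have a question' ]
```

The number of generated sentences is the product of the option counts of the phrases, 1 x 2 x 1 x 2 = 4 in this example.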

edgeNGrams

edgeNGrams(str, minopt, maxopt, deltaopt, ifnopt, idxopt) → {Array.<string>}

Generates the edge ngrams from the input string.

Example
edgeNGrams( 'decisively' );
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
edgeNGrams( 'decisively', 8, 10, 1 );
// -> [ 'decisive', 'decisivel', 'decisively' ]
Parameters
Name Type Attributes Default Description
str string

the input string.

min number <optional>
2

minimum size of the edge ngram generated.

max number <optional>
8

maximum size of the edge ngram generated.

delta number <optional>
2

edge ngrams are generated in increments of this value.

ifn function <optional>

a function to build an index; it is called for every edge ngram of str and receives the edge ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, the index is not built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

array of edge ngrams.

Type
Array.<string>
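How min, max and delta interact can be seen in the following sketch, which collects prefixes of growing length. It is an illustration consistent with the two examples above, not the library's implementation; edgeNGramsSketch is a hypothetical name, and capping max at the string length is an assumption.

```javascript
// Illustrative sketch only; not the library's actual code.
function edgeNGramsSketch(str, min = 2, max = 8, delta = 2, ifn, idx) {
  const out = [];
  const step = Math.max(1, delta); // guard against a non-advancing loop
  // Assumption: prefixes never grow beyond the string itself.
  const limit = Math.min(max, str.length);
  for (let n = min; n <= limit; n += step) {
    const edge = str.slice(0, n); // prefix of length n
    out.push(edge);
    if (typeof ifn === 'function') ifn(edge, idx);
  }
  return out;
}

console.log(edgeNGramsSketch('decisively'));
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
console.log(edgeNGramsSketch('decisively', 8, 10, 1));
// -> [ 'decisive', 'decisivel', 'decisively' ]
```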

extractPersonsName

extractPersonsName(str) → {string}

Attempts to extract a person's name from the input string. It assumes the following name format:
[<salutations>] <name part as FN [MN] [LN]> [<degrees>]
Entities in square brackets are optional. Note that it is not a named entity detection mechanism.

Example
extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
// -> 'Sarah Connor'
Parameters
Name Type Description
str string

the input string.

Returns

extracted name.

Type
string

extractRunOfCapitalWords

extractRunOfCapitalWords(str) → {Array.<string>}

Extracts an array of the runs of text appearing in Title Case or in ALL CAPS from the input string.

Example
extractRunOfCapitalWords( 'In The Terminator, Sarah Connor is in Los Angeles' );
// -> [ 'In The Terminator', 'Sarah Connor', 'Los Angeles' ]
Parameters
Name Type Description
str string

the input string.

Returns

array of text appearing in Title Case or in ALL CAPS; null is returned if no such text is found.

Type
Array.<string>

lowerCase

lowerCase(str) → {string}

Converts the input string to lower case.

Example
lowerCase( 'Lower Case' );
// -> 'lower case'
Parameters
Name Type Description
str string

the input string.

Returns

input string in lower case.

Type
string

marker

marker(str) → {string}

Generates the marker of the input string, defined as its unique 1-grams (characters), sorted and joined back into a string. A marker is a quick and aggressive way to detect similarity between short strings. Its aggressiveness may lead to more false positives, such as Meter and Metre or no melon and no lemon.

Example
marker( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
Parameters
Name Type Description
str string

the input string.

Returns

the marker.

Type
string
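The definition above, and the false positives it warns about, can be reproduced in a few lines. This is a minimal sketch based on the description, not the library's implementation; markerSketch is a hypothetical name.

```javascript
// Illustrative sketch only; not the library's actual code.
function markerSketch(str) {
  // Unique characters (1-grams), sorted by code unit and joined into a string.
  return Array.from(new Set(str)).sort().join('');
}

console.log(markerSketch('the quick brown fox jumps over the lazy dog'));
// -> ' abcdefghijklmnopqrstuvwxyz'

// Different strings that collapse to the same marker (false positives):
console.log(markerSketch('Meter') === markerSketch('Metre'));       // -> true
console.log(markerSketch('no melon') === markerSketch('no lemon')); // -> true
```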

ngram

ngram(str, sizeopt) → {Array.<string>}

Generates an array of ngrams of a specified size from the input string. The default size is 2, which means it will generate bigrams by default.

Example
ngram( 'FRANCE' );
// -> [ 'FR', 'RA', 'AN', 'NC', 'CE' ]
ngram( 'FRENCH' );
// -> [ 'FR', 'RE', 'EN', 'NC', 'CH' ]
ngram( 'FRANCE', 3 );
// -> [ 'FRA', 'RAN', 'ANC', 'NCE' ]
Parameters
Name Type Attributes Default Description
str string

the input string.

size number <optional>
2

ngram size.

Returns

ngrams of size from str.

Type
Array.<string>
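Character ngrams are produced by sliding a window of the given size over the string. The sketch below illustrates this; it is not the library's implementation, and ngramSketch is a hypothetical name.

```javascript
// Illustrative sketch only; not the library's actual code.
function ngramSketch(str, size = 2) {
  const out = [];
  // Slide a window of `size` characters, one position at a time.
  for (let i = 0; i + size <= str.length; i += 1) {
    out.push(str.slice(i, i + size));
  }
  return out;
}

console.log(ngramSketch('FRANCE'));
// -> [ 'FR', 'RA', 'AN', 'NC', 'CE' ]
console.log(ngramSketch('FRANCE', 3));
// -> [ 'FRA', 'RAN', 'ANC', 'NCE' ]
```

A string of length n yields n - size + 1 ngrams, and none when it is shorter than size.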

phonetize

phonetize(word) → {string}

Phonetizes the input string using an algorithmic adaptation of Metaphone; it is not an exact implementation of Metaphone.

Example
phonetize( 'perspective' );
// -> 'prspktv'
phonetize( 'phenomenon' );
// -> 'fnmnn'
Parameters
Name Type Description
word string

the input word.

Returns

phonetic code of word.

Type
string

removeElisions

removeElisions(str) → {string}

Removes basic elisions found in the input string. Typical examples of elisions are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it retains the apostrophe used to indicate possession.

Example
removeElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is it?"
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of elisions.

Type
string

removeExtraSpaces

removeExtraSpaces(str) → {string}

Removes leading, trailing and any extra in-between whitespaces from the input string.

Example
removeExtraSpaces( '   Padded   Text    ' );
// -> 'Padded Text'
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of leading, trailing and extra whitespaces.

Type
string

removeHTMLTags

removeHTMLTags(str) → {string}

Removes each HTML tag by replacing it with a whitespace.

Extra spaces, if required, may be removed using the string.removeExtraSpaces function.

Example
removeHTMLTags( '<p>Vive la France&nbsp;&#160;!</p>' );
// -> ' Vive la France  ! '
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of HTML tags.

Type
string

removePunctuations

removePunctuations(str) → {string}

Removes each punctuation mark by replacing it with a whitespace. It looks for the following punctuation marks: .,;!?:"!'... - () [] {}.

Extra spaces, if required, may be removed using the string.removeExtraSpaces function.

Example
removePunctuations( 'Punctuations like "\'\',;!?:"!... are removed' );
// -> 'Punctuations like               are removed'
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of punctuations.

Type
string
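Because each punctuation mark is replaced by a whitespace rather than deleted, the result is usually passed through extra-space removal. The sketch below illustrates that two-step cleanup; it is not the library's implementation, the function names are hypothetical, and the character class is an assumption derived from the list of punctuation marks above.

```javascript
// Illustrative sketch only; not the library's actual code.
// Assumption: the set of punctuation marks is the one listed in this section.
function removePunctuationsSketch(str) {
  return str.replace(/[.,;!?:"'()\[\]{}-]/g, ' ');
}

// Trim both ends and collapse every run of whitespace to a single space.
function removeExtraSpacesSketch(str) {
  return str.trim().replace(/\s+/g, ' ');
}

const raw = 'Punctuations like "\'\',;!?:"!... are removed';
console.log(removeExtraSpacesSketch(removePunctuationsSketch(raw)));
// -> 'Punctuations like are removed'
```

Replacing with a space instead of deleting keeps words such as the two halves of well-known apart; the trade-off is the run of spaces that the second step removes.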

removeSplChars

removeSplChars(str) → {string}

Removes each special character by replacing it with a whitespace. It looks for the following special characters: ~@#%^*+=.

Extra spaces, if required, may be removed using the string.removeExtraSpaces function.

Example
removeSplChars( '4 + 4*2 = 12' );
// -> '4   4 2   12'
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of special characters.

Type
string

retainAlphaNums

retainAlphaNums(str) → {string}

Retains only alphabetic characters and numerals, and removes all other characters from the input string, including leading, trailing and extra in-between whitespaces.

Example
retainAlphaNums( ' This, text here, has  (other) chars_! ' );
// -> 'This text here has other chars'
Parameters
Name Type Description
str string

the input string.

Returns

input string after removal of non-alphanumeric characters, leading, trailing and extra whitespaces.

Type
string

sentences

sentences(paragraph) → {Array.<string>}

Detects the sentence boundaries in the input paragraph and splits it into an array of sentence(s).

Example
sentences( 'AI Inc. is focussing on AI. I work for AI Inc. My mail is r2d2@yahoo.com' );
// -> [ 'AI Inc. is focussing on AI.',
//      'I work for AI Inc.',
//      'My mail is r2d2@yahoo.com' ]

sentences( 'U.S.A is my birth place. I was born on 06.12.1924. I climbed Mt. Everest.' );
// -> [ 'U.S.A is my birth place.',
//      'I was born on 06.12.1924.',
//      'I climbed Mt. Everest.' ]
Parameters
Name Type Description
paragraph string

the input string.

Returns

array of sentences.

Type
Array.<string>

setOfChars

setOfChars(str, ifnopt, idxopt) → {string}

Creates a set of chars from the input string str. Used with Jaccard or Tversky similarity, it allows even more aggressive string matching than marker(). It also has an alias, soc().

Example
setOfChars( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
Parameters
Name Type Attributes Description
str string

the input string.

ifn function <optional>

a function to build an index; it receives the first character of str and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, the index is not built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

the set of chars (soc).

Type
string
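The Jaccard matching mentioned above compares two sets of chars by the ratio of shared to total distinct characters. The sketch below is an illustration only: setOfCharsSketch and jaccardSketch are hypothetical names, and representing the set of chars as a JavaScript Set is an assumption made for the comparison, not a statement about the library's return value.

```javascript
// Illustrative sketch only; not the library's actual code.
// Assumption: the set of chars is modeled as a JavaScript Set.
function setOfCharsSketch(str) {
  return new Set(str);
}

// Jaccard similarity: |A intersect B| / |A union B|, in the range 0 to 1.
function jaccardSketch(a, b) {
  let common = 0;
  for (const ch of a) {
    if (b.has(ch)) common += 1;
  }
  const union = a.size + b.size - common;
  return union === 0 ? 1 : common / union;
}

console.log(jaccardSketch(setOfCharsSketch('meter'), setOfCharsSketch('metre')));
// -> 1
console.log(jaccardSketch(setOfCharsSketch('night'), setOfCharsSketch('nacht')));
// -> 0.42857142857142855 (3 shared chars out of 7 distinct)
```

Character order and repetition are both discarded, which is what makes this comparison more aggressive than an exact or ngram-based one.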

setOfNGrams

setOfNGrams(str, sizeopt, ifnopt, idxopt) → {set}

Generates the set of ngrams of the given size from the input string. The default size is 2, so it generates a set of bigrams by default. It also has an alias, song().

Example
setOfNGrams( 'mama' );
// -> Set { 'ma', 'am' }
song( 'mamma' );
// -> Set { 'ma', 'am', 'mm' }
Parameters
Name Type Attributes Default Description
str string

the input string.

size number <optional>
2

ngram size.

ifn function <optional>

a function to build an index; it is called for every unique ngram of str and receives the ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined, the index is not built.

idx number <optional>

the index; passed as the second argument to the ifn function.

Returns

set of ngrams of size from str.

Type
set

soundex

soundex(word, maxLengthopt) → {string}

Produces the soundex code from the input word.

Example
soundex( 'Burroughs' );
// -> 'B620'
soundex( 'Burrows' );
// -> 'B620'
Parameters
Name Type Attributes Default Description
word string

the input word.

maxLength number <optional>
4

maximum length of the soundex code to be returned.

Returns

soundex code of word.

Type
string
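For readers unfamiliar with soundex, the sketch below follows the classic American Soundex rules: keep the first letter, encode the remaining consonants as digits, skip repeated codes, and pad with zeros. It is an illustration that reproduces the examples above, not the library's implementation, which may differ in its details; soundexSketch and SOUNDEX_CODES are hypothetical names.

```javascript
// Illustrative sketch only; classic American Soundex, not the library's code.
const SOUNDEX_CODES = {
  b: '1', f: '1', p: '1', v: '1',
  c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
  d: '3', t: '3',
  l: '4',
  m: '5', n: '5',
  r: '6'
};

function soundexSketch(word, maxLength = 4) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) return '';
  let code = letters[0].toUpperCase();
  let prev = SOUNDEX_CODES[letters[0]] || '';
  for (let i = 1; i < letters.length && code.length < maxLength; i += 1) {
    const ch = letters[i];
    // h and w are skipped and do not separate letters with the same code.
    if (ch === 'h' || ch === 'w') continue;
    const digit = SOUNDEX_CODES[ch] || '';
    // Emit a digit only when it differs from the previous letter's code.
    if (digit !== '' && digit !== prev) code += digit;
    // Vowels have no code, so they reset prev and allow a repeat afterwards.
    prev = digit;
  }
  return code.padEnd(maxLength, '0');
}

console.log(soundexSketch('Burroughs')); // -> 'B620'
console.log(soundexSketch('Burrows'));   // -> 'B620'
```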

splitElisions

splitElisions(str) → {string}

Splits basic elisions found in the input string. Typical examples of elisions are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it does not touch the apostrophe used to indicate possession.

Example
splitElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is n't it?"
Parameters
Name Type Description
str string

the input string.

Returns

input string after splitting of elisions.

Type
string

stem

stem(word) → {string}

Stems an inflected word using the Porter2 stemming algorithm.

Example
stem( 'consisting' );
// -> 'consist'
Parameters
Name Type Description
word string

the word to be stemmed.

Returns

the stemmed word.

Type
string

tokenize

tokenize(sentence, detailedopt) → {Array.<string>|Array.<object>}

Tokenizes the input sentence according to the value of the detailed flag. Any occurrence of ... in the sentence is converted to ellipses. In detailed = true mode, it tags every token with its type; the supported tags are word, number, url, email, mention, hashtag, emoji, emoticon, time, ordinal, currency, punctuation, symbol, and tabCRLF.

Example
tokenize( "someone's wallet, isn't it? I'll return!" );
// -> [ 'someone', '\'s', 'wallet', ',', 'is', 'n\'t', 'it', '?',
//      'I', '\'ll', 'return', '!' ]

tokenize( 'For details on wink, check out http://winkjs.org/ URL!', true );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'details', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'wink', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]
Parameters
Name Type Attributes Default Description
sentence string

the input string.

detailed boolean <optional>
false

if true, each token is an object containing the value and the tag of the token; otherwise each token is a string. Its default value of false ensures compatibility with the previous version.

Returns

an array of strings if detailed is false otherwise an array of objects.

Type
Array.<string> | Array.<object>

tokenize0

tokenize0(str) → {Array.<string>}

Tokenizes by splitting the input string on non-words. This means tokens consist only of alphas, numerals and underscores; all other characters are stripped, as they are treated as separators. It also removes all elisions; however, negations are retained and amplified.

Example
tokenize0( "someone's wallet, isn't it?" );
// -> [ 'someone', 's', 'wallet', 'is', 'not', 'it' ]
Parameters
Name Type Description
str string

the input string.

Returns

array of tokens.

Type
Array.<string>
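The two steps visible in the example above, amplifying the negation and then splitting on non-word characters, can be sketched as follows. This is an illustration that reproduces the documented example only, not the library's implementation; tokenize0Sketch is a hypothetical name, and the handling of elisions other than n't is not modeled.

```javascript
// Illustrative sketch only; not the library's actual code.
// Assumption: only the n't negation is amplified before splitting.
function tokenize0Sketch(str) {
  // Step 1: amplify negations, so that isn't becomes "is not".
  const amplified = str.replace(/([a-z])n't\b/gi, '$1 not');
  // Step 2: split on runs of non-word characters and drop empty tokens.
  return amplified.split(/\W+/).filter((token) => token.length > 0);
}

console.log(tokenize0Sketch("someone's wallet, isn't it?"));
// -> [ 'someone', 's', 'wallet', 'is', 'not', 'it' ]
```

Since \W excludes the underscore, a token such as snake_case survives the split intact, in line with the description above.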

trim

trim(str) → {string}

Trims leading and trailing whitespaces from the input string.

Example
trim( '  Padded   ' );
// -> 'Padded'
Parameters
Name Type Description
str string

the input string.

Returns

input string with leading & trailing whitespaces removed.

Type
string

upperCase

upperCase(str) → {string}

Converts the input string to upper case.

Example
upperCase( 'Upper Case' );
// -> 'UPPER CASE'
Parameters
Name Type Description
str string

the input string.

Returns

input string in upper case.

Type
string