String
Methods
amplifyNotElision
Amplifies the not elision by converting it into not; for example, isn't becomes is not.
Example
amplifyNotElision( "someone's wallet, isn't it?" );
// -> "someone's wallet, is not it?"
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after not elision amplification.
- Type
- string
bagOfNGrams
Generates the bag of ngrams of the given size from the input string. The
default size is 2, which means it generates a bag of bigrams by default. It
also has an alias, bong().
Example
bagOfNGrams( 'mama' );
// -> { ma: 2, am: 1 }
bong( 'mamma' );
// -> { ma: 2, am: 1, mm: 1 }
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
str | string | | | the input string. |
size | number | <optional> | 2 | ngram size. |
ifn | function | <optional> | | a function to build index; it is called for every unique occurrence of ngram of str. |
idx | number | <optional> | | the index; passed as the second argument to the ifn function. |
Returns
bag of ngrams of the given size from str.
- Type
- object
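To make the counting concrete, here is a minimal sketch of how a bag of ngrams could be built with a sliding window. The name bagOfNGramsSketch is hypothetical and this is not the library's implementation; it ignores the optional ifn and idx parameters and only aims to reproduce the documented examples.

```javascript
// Hypothetical helper for illustration only; not the library's code.
// Slides a window of `size` characters over `str` and counts each ngram.
function bagOfNGramsSketch(str, size = 2) {
  const bag = {};
  for (let i = 0; i + size <= str.length; i += 1) {
    const gram = str.slice(i, i + size);
    bag[gram] = (bag[gram] || 0) + 1;
  }
  return bag;
}

console.log(bagOfNGramsSketch('mama'));
// -> { ma: 2, am: 1 }
console.log(bagOfNGramsSketch('mamma'));
// -> { ma: 2, am: 1, mm: 1 }
```

A plain object is used here because the documented return type is object; a Map would be a safer choice for arbitrary input.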
composeCorpus
Generates all possible sentences from the input argument string.
The string str must follow a special syntax as illustrated in the
example below:
'[I] [am having|have] [a] [problem|question]'
Each phrase must be enclosed in [ ] and each possible option of a phrase
(if any) must be separated by a | character. The corpus is composed by
computing the cartesian product of all the phrases.
Example
composeCorpus( '[I] [am having|have] [a] [problem|question]' );
// -> [ 'I am having a problem',
// 'I am having a question',
// 'I have a problem',
// 'I have a question' ]
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
array of all possible sentences.
- Type
- Array.<string>
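The cartesian product described above can be sketched as follows. This is an illustrative reconstruction under the stated syntax rules, not the library's implementation; composeCorpusSketch is a hypothetical name, and the handling of input without any [ ] group (returning an empty array) is an assumption.

```javascript
// Hypothetical helper for illustration only; not the library's code.
function composeCorpusSketch(str) {
  // Collect the options of every [ ... ] group, splitting each on '|'.
  const groups = [];
  const pattern = /\[([^\]]*)\]/g;
  let match = pattern.exec(str);
  while (match !== null) {
    groups.push(match[1].split('|'));
    match = pattern.exec(str);
  }
  // Assumption: input without any group yields an empty corpus.
  if (groups.length === 0) return [];
  // Cartesian product: extend every partial sentence with every option.
  const combos = groups.reduce(
    (partials, options) =>
      partials.flatMap((words) => options.map((option) => [...words, option])),
    [[]]
  );
  return combos.map((words) => words.join(' '));
}

console.log(composeCorpusSketch('[I] [am having|have] [a] [problem|question]'));
// -> [ 'I am having a problem',
//      'I am having a question',
//      'I have a problem',
//      'I have a question' ]
```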
edgeNGrams
Generates the edge ngrams from the input string.
Example
edgeNGrams( 'decisively' );
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
edgeNGrams( 'decisively', 8, 10, 1 );
// -> [ 'decisive', 'decisivel', 'decisively' ]
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
str | string | | | the input string. |
min | number | <optional> | 2 | minimum size of the edge ngrams generated. |
max | number | <optional> | 8 | maximum size of the edge ngrams generated. |
delta | number | <optional> | 2 | edge ngrams are generated in increments of this value. |
ifn | function | <optional> | | a function to build index; it is called for every edge ngram of str. |
idx | number | <optional> | | the index; passed as the second argument to the ifn function. |
Returns
array of edge ngrams.
- Type
- Array.<string>
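The interplay of min, max and delta is easiest to see in code. The sketch below is an assumption about the behaviour inferred from the two examples above, not the library's implementation; edgeNGramsSketch is a hypothetical name, and the ifn and idx parameters are left out.

```javascript
// Hypothetical helper for illustration only; not the library's code.
// Emits prefixes of `str` whose lengths run from `min` to `max` in steps
// of `delta`, never exceeding the length of the string itself.
function edgeNGramsSketch(str, min = 2, max = 8, delta = 2) {
  if (delta < 1) throw new RangeError('delta must be at least 1');
  const grams = [];
  for (let len = min; len <= max && len <= str.length; len += delta) {
    grams.push(str.slice(0, len));
  }
  return grams;
}

console.log(edgeNGramsSketch('decisively'));
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
console.log(edgeNGramsSketch('decisively', 8, 10, 1));
// -> [ 'decisive', 'decisivel', 'decisively' ]
```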
extractPersonsName
Attempts to extract a person's name from the input string.
It assumes the following name format:
[<salutations>] <name part as FN [MN] [LN]> [<degrees>]
Entities in square brackets are optional. Note, it is not a
named entity detection mechanism.
Example
extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
// -> 'Sarah Connor'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
extracted name.
- Type
- string
extractRunOfCapitalWords
Extracts the array of text appearing as Title Case or in ALL CAPS from the input string.
Example
extractRunOfCapitalWords( 'In The Terminator, Sarah Connor is in Los Angeles' );
// -> [ 'In The Terminator', 'Sarah Connor', 'Los Angeles' ]
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
array of text appearing in Title Case or in ALL CAPS; if no such
text is found then null is returned.
- Type
- Array.<string>
lowerCase
Converts the input string to lower case.
Example
lowerCase( 'Lower Case' );
// -> 'lower case'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string in lower case.
- Type
- string
marker
Generates the marker of the input string; it is defined as the set of 1-grams
of the string, sorted and joined back into a string. Marker is a quick and
aggressive way to detect similarity between short strings. Its aggression may
lead to more false positives such as Meter and Metre or no melon and no lemon.
Example
marker( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
the marker.
- Type
- string
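The definition above fits in one line of code, which also shows why the false positives arise. This is a minimal sketch assuming that duplicates are dropped (as the example output suggests); markerSketch is a hypothetical name, not the library's implementation.

```javascript
// Hypothetical helper for illustration only; not the library's code.
// Unique characters of the string, sorted and joined back together.
function markerSketch(str) {
  return [...new Set(str)].sort().join('');
}

console.log(markerSketch('the quick brown fox jumps over the lazy dog'));
// -> ' abcdefghijklmnopqrstuvwxyz'

// Strings that use the same characters collapse to the same marker.
console.log(markerSketch('Meter') === markerSketch('Metre'));
// -> true
console.log(markerSketch('no melon') === markerSketch('no lemon'));
// -> true
```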
ngram
Generates an array of ngrams of a specified size from the input string. The default size is 2, which means it will generate bigrams by default.
Example
ngram( 'FRANCE' );
// -> [ 'FR', 'RA', 'AN', 'NC', 'CE' ]
ngram( 'FRENCH' );
// -> [ 'FR', 'RE', 'EN', 'NC', 'CH' ]
ngram( 'FRANCE', 3 );
// -> [ 'FRA', 'RAN', 'ANC', 'NCE' ]
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
str | string | | | the input string. |
size | number | <optional> | 2 | ngram's size. |
Returns
ngrams of the given size from str.
- Type
- Array.<string>
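For readers who want to see the windowing itself, the following sketch reproduces the documented examples with a plain loop. ngramSketch is a hypothetical name and the code is an illustration, not the library's implementation.

```javascript
// Hypothetical helper for illustration only; not the library's code.
// Returns every contiguous substring of length `size`, in order.
function ngramSketch(str, size = 2) {
  const grams = [];
  for (let i = 0; i + size <= str.length; i += 1) {
    grams.push(str.slice(i, i + size));
  }
  return grams;
}

console.log(ngramSketch('FRANCE'));
// -> [ 'FR', 'RA', 'AN', 'NC', 'CE' ]
console.log(ngramSketch('FRANCE', 3));
// -> [ 'FRA', 'RAN', 'ANC', 'NCE' ]
```

A string of length n yields n - size + 1 ngrams, and an empty array when the string is shorter than size.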
phonetize
Phonetizes the input string using an algorithmic adaptation of Metaphone; it is not an exact implementation of Metaphone.
Example
phonetize( 'perspective' );
// -> 'prspktv'
phonetize( 'phenomenon' );
// -> 'fnmnn'
Parameters
Name | Type | Description |
---|---|---|
word | string | the input word. |
Returns
phonetic code of word.
- Type
- string
removeElisions
Removes basic elisions found in the input string. Typical examples of elisions
are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it retains
the apostrophe used to indicate possession.
Example
removeElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is it?"
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of elisions.
- Type
- string
removeExtraSpaces
Removes leading, trailing and any extra in-between whitespaces from the input string.
Example
removeExtraSpaces( ' Padded Text ' );
// -> 'Padded Text'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of leading, trailing and extra whitespaces.
- Type
- string
removeHTMLTags
Removes each HTML tag by replacing it with a whitespace.
Extra spaces, if required, may be removed using string.removeExtraSpaces function.
Example
removeHTMLTags( '<p>Vive la France  !</p>' );
// -> ' Vive la France ! '
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of HTML tags.
- Type
- string
removePunctuations
Removes each punctuation mark by replacing it with a whitespace. It looks for
the following punctuations: .,;!?:"!'... - () [] {}
Extra spaces, if required, may be removed using string.removeExtraSpaces function.
Example
removePunctuations( 'Punctuations like "\'\',;!?:"!... are removed' );
// -> 'Punctuations like are removed'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of punctuations.
- Type
- string
removeSplChars
Removes each special character by replacing it with a whitespace. It looks for
the following special characters: ~@#%^*+=
Extra spaces, if required, may be removed using string.removeExtraSpaces function.
Example
removeSplChars( '4 + 4*2 = 12' );
// -> '4 4 2 12'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of special characters.
- Type
- string
retainAlphaNums
Retains only alphas and numerals, and removes all other characters from the input string, including leading, trailing and extra in-between whitespaces.
Example
retainAlphaNums( ' This, text here, has (other) chars_! ' );
// -> 'This text here has other chars'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after removal of non-alphanumeric characters, leading, trailing and extra whitespaces.
- Type
- string
sentences
Detects the sentence boundaries in the input paragraph and splits it into
an array of sentence(s).
Example
sentences( 'AI Inc. is focussing on AI. I work for AI Inc. My mail is r2d2@yahoo.com' );
// -> [ 'AI Inc. is focussing on AI.',
// 'I work for AI Inc.',
// 'My mail is r2d2@yahoo.com' ]
sentences( 'U.S.A is my birth place. I was born on 06.12.1924. I climbed Mt. Everest.' );
// -> [ 'U.S.A is my birth place.',
// 'I was born on 06.12.1924.',
// 'I climbed Mt. Everest.' ]
Parameters
Name | Type | Description |
---|---|---|
paragraph | string | the input string. |
Returns
array of sentences.
- Type
- Array.<string>
setOfChars
Creates a set of chars from the input string str. Compared to marker(), this
is useful in even more aggressive string matching using Jaccard or Tversky.
It also has an alias, soc().
Example
setOfChars( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
Parameters
Name | Type | Attributes | Description |
---|---|---|---|
str | string | | the input string. |
ifn | function | <optional> | a function to build index; it receives the first character of str. |
idx | number | <optional> | the index; passed as the second argument to the ifn function. |
Returns
the soc.
- Type
- string
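As an illustration of the matching mentioned above, the sketch below computes the Jaccard similarity (shared characters divided by total distinct characters) of two strings. It builds its own character sets with the standard Set type rather than calling the library; jaccardOfChars is a hypothetical name, and treating two empty strings as identical is an assumption.

```javascript
// Hypothetical helper for illustration only; not part of the library.
// Jaccard similarity = |A ∩ B| / |A ∪ B| over the characters of a and b.
function jaccardOfChars(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  for (const ch of setA) {
    if (setB.has(ch)) shared += 1;
  }
  const unionSize = setA.size + setB.size - shared;
  // Assumption: two empty strings are treated as identical.
  return unionSize === 0 ? 1 : shared / unionSize;
}

console.log(jaccardOfChars('Meter', 'Metre'));
// -> 1
console.log(jaccardOfChars('abc', 'bcd'));
// -> 0.5
```

Because character order and repetition are discarded, a score of 1 only means the two strings draw on the same characters, so a threshold below 1 admits even more false positives.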
setOfNGrams
Generates the set of ngrams of the given size from the input string. The
default size is 2, which means it generates a set of bigrams by default.
It also has an alias, song().
Example
setOfNGrams( 'mama' );
// -> Set { 'ma', 'am' }
song( 'mamma' );
// -> Set { 'ma', 'am', 'mm' }
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
str | string | | | the input string. |
size | number | <optional> | 2 | ngram size. |
ifn | function | <optional> | | a function to build index; it is called for every unique occurrence of ngram of str. |
idx | number | <optional> | | the index; passed as the second argument to the ifn function. |
Returns
set of ngrams of the given size from str.
- Type
- set
soundex
Produces the soundex code from the input word.
Example
soundex( 'Burroughs' );
// -> 'B620'
soundex( 'Burrows' );
// -> 'B620'
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
word | string | | | the input word. |
maxLength | number | <optional> | 4 | maximum length of the soundex code to be returned. |
Returns
soundex code of word.
- Type
- string
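To show why Burroughs and Burrows receive the same code, here is a sketch of the classic American Soundex rules. It is an assumption that the library follows these rules in every edge case; soundexSketch and SOUNDEX_CODES are hypothetical names, and the code is an illustration rather than the library's implementation.

```javascript
// Hypothetical helper for illustration only; not the library's code.
// Consonant groups of classic American Soundex; vowels, Y, H and W have no code.
const SOUNDEX_CODES = {
  B: '1', F: '1', P: '1', V: '1',
  C: '2', G: '2', J: '2', K: '2', Q: '2', S: '2', X: '2', Z: '2',
  D: '3', T: '3',
  L: '4',
  M: '5', N: '5',
  R: '6'
};

function soundexSketch(word, maxLength = 4) {
  const letters = word.toUpperCase().replace(/[^A-Z]/g, '');
  if (letters.length === 0) return '';
  // The first letter is kept as is; its code still suppresses a repeat.
  let code = letters[0];
  let prev = SOUNDEX_CODES[letters[0]] || '';
  for (let i = 1; i < letters.length && code.length < maxLength; i += 1) {
    const ch = letters[i];
    // H and W are skipped without separating equal neighbouring codes.
    if (ch === 'H' || ch === 'W') continue;
    const digit = SOUNDEX_CODES[ch] || '';
    // Vowels reset `prev`, so a repeated consonant after a vowel is coded again.
    if (digit !== '' && digit !== prev) code += digit;
    prev = digit;
  }
  return code.padEnd(maxLength, '0');
}

console.log(soundexSketch('Burroughs'));
// -> 'B620'
console.log(soundexSketch('Burrows'));
// -> 'B620'
```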
splitElisions
Splits basic elisions found in the input string. Typical examples of elisions
are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it does
not touch the apostrophe used to indicate possession.
Example
splitElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is n't it?"
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string after splitting of elisions.
- Type
- string
stem
Stems an inflected word using the Porter2 stemming algorithm.
Example
stem( 'consisting' );
// -> 'consist'
Parameters
Name | Type | Description |
---|---|---|
word | string | to be stemmed. |
Returns
the stemmed word.
- Type
- string
tokenize
Tokenizes the input sentence according to the value of the detailed flag.
Any occurrence of ... in the sentence is converted to ellipses. In
detailed = true mode, it tags every token with its type; the supported tags
are word, number, url, email, mention, hashtag, emoji, emoticon, time, ordinal,
currency, punctuation, symbol, and tabCFLF.
Example
tokenize( "someone's wallet, isn't it? I'll return!" );
// -> [ 'someone', '\'s', 'wallet', ',', 'is', 'n\'t', 'it', '?',
// 'I', '\'ll', 'return', '!' ]
tokenize( 'For details on wink, check out http://winkjs.org/ URL!', true );
// -> [ { value: 'For', tag: 'word' },
// { value: 'details', tag: 'word' },
// { value: 'on', tag: 'word' },
// { value: 'wink', tag: 'word' },
// { value: ',', tag: 'punctuation' },
// { value: 'check', tag: 'word' },
// { value: 'out', tag: 'word' },
// { value: 'http://winkjs.org/', tag: 'url' },
// { value: 'URL', tag: 'word' },
// { value: '!', tag: 'punctuation' } ]
Parameters
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
sentence | string | | | the input string. |
detailed | boolean | <optional> | false | if true, each token is an object containing value and tag. |
Returns
an array of strings if detailed is false; otherwise an array of objects.
- Type
- Array.<string> | Array.<object>
tokenize0
Tokenizes by splitting the input string on non-words. This means tokens consist of only alphas, numerals and underscores; all other characters are stripped as they are treated as separators. It also removes all elisions; however, negations are retained and amplified.
Example
tokenize0( "someone's wallet, isn't it?" );
// -> [ 'someone', 's', 'wallet', 'is', 'not', 'it' ]
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
array of tokens.
- Type
- Array.<string>
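The two steps that matter for the example above, amplifying the negation and splitting on non-word characters, can be sketched as below. This is a simplified illustration, not the library's implementation: tokenize0Sketch is a hypothetical name, the regular expressions are assumptions, and other elisions (such as I'll) are not treated specially here.

```javascript
// Hypothetical helper for illustration only; not the library's code.
function tokenize0Sketch(str) {
  return str
    // Amplify the negation: "isn't" -> "is not".
    .replace(/(\w)n't\b/gi, '$1 not')
    // Split on runs of anything other than alphas, numerals and underscores.
    .split(/\W+/)
    // Drop the empty strings left by leading or trailing separators.
    .filter((token) => token.length > 0);
}

console.log(tokenize0Sketch("someone's wallet, isn't it?"));
// -> [ 'someone', 's', 'wallet', 'is', 'not', 'it' ]
```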
trim
Trims leading and trailing whitespaces from the input string.
Example
trim( ' Padded ' );
// -> 'Padded'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string with leading & trailing whitespaces removed.
- Type
- string
upperCase
Converts the input string to upper case.
Example
upperCase( 'Upper Case' );
// -> 'UPPER CASE'
Parameters
Name | Type | Description |
---|---|---|
str | string | the input string. |
Returns
input string in upper case.
- Type
- string