Introduction

wink-nlp-utils

NLP Functions for amplifying negations, managing elisions, creating ngrams, stems, phonetic codes to tokens and more.


Prepare raw text for Natural Language Processing (NLP) using wink-nlp-utils. It is a part of wink, a growing family of high quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

It offers a set of APIs to work on strings such as names, sentences and paragraphs, and on tokens represented as an array of strings/words. They perform the pre-processing required for many ML tasks such as semantic search and classification.

Installation

Use npm to install:

npm install wink-nlp-utils --save

Getting Started

// Load wink-nlp-utils
var nlp = require( 'wink-nlp-utils' );

// Extract person's name from a string:
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
console.log( name );
// -> 'Sarah Connor'

// Compose all possible sentences from a string:
var str = '[I] [am having|have] [a] [problem|question]';
console.log( nlp.string.composeCorpus( str ) );
// -> [ 'I am having a problem',
// ->   'I am having a question',
// ->   'I have a problem',
// ->   'I have a question' ]

// Sentence Boundary Detection.
var para = 'AI Inc. is focussing on AI. I work for AI Inc. My mail is r2d2@yahoo.com';
console.log( nlp.string.sentences( para ) );
// -> [ 'AI Inc. is focussing on AI.',
//      'I work for AI Inc.',
//      'My mail is r2d2@yahoo.com' ]

// Tokenize a sentence.
var s = 'For details on wink, check out http://winkjs.org/ URL!';
console.log( nlp.string.tokenize( s, true ) );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'details', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'wink', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]

// Remove stop words:
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
console.log( t );
// -> [ 'mary', 'little', 'lamb' ]

Documentation

Check out the wink NLP utilities API documentation to learn more.

Need Help?

If you spot a bug and the same has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-nlp-utils is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

string

string.amplifyNotElision

Amplifies the not elision by converting it into not; for example isn't becomes is not.

string.amplifyNotElision
Parameters
str (string) — the input string.
Returns
string: input string after not elision amplification.
Example
amplifyNotElision( "someone's wallet, isn't it?" );
// -> "someone's wallet, is not it?"

string.bong

Generates the bag of ngrams of the given size from the input string. The default size is 2, which means it will generate a bag of bigrams by default.

string.bong
Parameters
str (string) — the input string.
size (number = 2) — ngram size.
ifn (function = undefined) — a function to build index; it is called for every unique occurrence of ngram of str, and it receives the ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
object: bag of ngrams of size from str.
Example
bong( 'mama' );
// -> { ma: 2, am: 1 }
bong( 'mamma' );
// -> { ma: 2, am: 1, mm: 1 }
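
As a rough illustration of the counting and of when ifn is called, the sketch below builds a bag of ngrams in plain JavaScript. It is not the library's implementation; the name bagOfNGrams and the idx value 7 are hypothetical, and the callback timing is assumed from the parameter description above.

```javascript
// Hypothetical sketch, not the library's code: count character ngrams and
// call ifn( ngram, idx ) the first time each distinct ngram is seen.
function bagOfNGrams( str, size = 2, ifn, idx ) {
  const bag = {};
  for ( let i = 0; i + size <= str.length; i += 1 ) {
    const ng = str.slice( i, i + size );
    if ( !Object.prototype.hasOwnProperty.call( bag, ng ) ) {
      bag[ ng ] = 0;
      if ( typeof ifn === 'function' ) ifn( ng, idx );
    }
    bag[ ng ] += 1;
  }
  return bag;
}

// Collect every ( ngram, idx ) pair handed to the callback.
const seen = [];
const bag = bagOfNGrams( 'mamma', 2, ( ng, idx ) => seen.push( [ ng, idx ] ), 7 );
console.log( bag );
// -> { ma: 2, am: 1, mm: 1 }
console.log( seen );
// -> [ [ 'ma', 7 ], [ 'am', 7 ], [ 'mm', 7 ] ]
```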

string.composeCorpus

Generates all possible sentences from the input argument string. The string str must follow a special syntax as illustrated in the example below:
'[I] [am having|have] [a] [problem|question]'

Each phrase must be quoted between [ ] and each possible option of phrases (if any) must be separated by a | character. The corpus is composed by computing the cartesian product of all the phrases.

string.composeCorpus
Parameters
str (string) — the input string.
Returns
Array<string>: of all possible sentences.
Example
composeCorpus( '[I] [am having|have] [a] [problem|question]' );
// -> [ 'I am having a problem',
//      'I am having a question',
//      'I have a problem',
//      'I have a question' ]
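
The cartesian product described above can be sketched in a few lines of plain JavaScript. This is only an illustration under the stated syntax, not the library's code, and the name composeSentences is hypothetical.

```javascript
// Hypothetical sketch: expand a '[a|b] [c]' style template into sentences.
function composeSentences( template ) {
  // Pull out the text between each pair of square brackets and split
  // it into its options.
  const phrases = ( template.match( /\[[^\]]*\]/g ) || [] )
    .map( ( p ) => p.slice( 1, -1 ).split( '|' ) );
  // Cartesian product: extend every partial sentence with every option.
  return phrases.reduce(
    ( partial, options ) => partial.flatMap(
      ( prefix ) => options.map( ( o ) => ( prefix ? prefix + ' ' + o : o ) )
    ),
    [ '' ]
  );
}

console.log( composeSentences( '[I] [am having|have] [a] [problem|question]' ) );
// -> [ 'I am having a problem',
//      'I am having a question',
//      'I have a problem',
//      'I have a question' ]
```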

string.edgeNGrams

Generates the edge ngrams from the input string.

string.edgeNGrams
Parameters
str (string) — the input string.
min (number = 2) — minimum size of the edge ngram generated.
max (number = 8) — maximum size of the edge ngram generated.
delta (number = 2) — edge ngrams are generated in increments of this value.
ifn (function = undefined) — a function to build index; it is called for every edge ngram of str, and it receives the edge ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
Array<string>: of edge ngrams.
Example
edgeNGrams( 'decisively' );
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
edgeNGrams( 'decisively', 8, 10, 1 );
// -> [ 'decisive', 'decisivel', 'decisively' ]
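
How min, max and delta interact can be seen in the following plain JavaScript sketch. It mirrors the examples above but is not the library's implementation; edgeNGramsSketch is a hypothetical name and the guard on delta is an added assumption.

```javascript
// Hypothetical sketch: prefixes of str whose lengths run from min to max
// in steps of delta, never exceeding the length of str.
function edgeNGramsSketch( str, min = 2, max = 8, delta = 2 ) {
  if ( delta < 1 ) throw new Error( 'delta must be a positive number' );
  const grams = [];
  for ( let n = min; n <= max && n <= str.length; n += delta ) {
    grams.push( str.slice( 0, n ) );
  }
  return grams;
}

console.log( edgeNGramsSketch( 'decisively' ) );
// -> [ 'de', 'deci', 'decisi', 'decisive' ]
console.log( edgeNGramsSketch( 'decisively', 8, 10, 1 ) );
// -> [ 'decisive', 'decisivel', 'decisively' ]
```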

string.extractPersonsName

Attempts to extract a person's name from the input string. It assumes the following name format:
[<salutations>] <name part as FN [MN] [LN]> [<degrees>]
Entities in square brackets are optional.

string.extractPersonsName
Parameters
str (string) — the input string.
Returns
string: extracted name.
Example
extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
// -> 'Sarah Connor'

string.extractRunOfCapitalWords

Extracts the array of text appearing as Title Case or in ALL CAPS from the input string.

string.extractRunOfCapitalWords
Parameters
str (string) — the input string.
Returns
Array<string>: of text appearing in Title Case or in ALL CAPS; if no such text is found then null is returned.
Example
extractRunOfCapitalWords( 'In The Terminator, Sarah Connor is in Los Angeles' );
// -> [ 'In The Terminator', 'Sarah Connor', 'Los Angeles' ]

string.lowerCase

Converts the input string to lower case.

string.lowerCase
Parameters
str (string) — the input string.
Returns
string: input string in lower case.
Example
lowerCase( 'Lower Case' );
// -> 'lower case'

string.marker

Generates the marker of the input string; it is defined as the string's unique 1-grams (characters), sorted and joined back into a string. Marker is a quick and aggressive way to detect similarity between short strings. Its aggression may lead to more false positives, such as Meter and Metre or no melon and no lemon.

string.marker
Parameters
str (string) — the input string.
Returns
string: the marker.
Example
marker( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
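
A minimal sketch of the idea, assuming the marker is simply the unique characters sorted and joined as the example above suggests. The name markerSketch is hypothetical and this is not the library's code; it also shows why Meter and Metre (lower-cased here) collide.

```javascript
// Hypothetical sketch: unique characters of str, sorted and joined.
function markerSketch( str ) {
  return Array.from( new Set( str ) ).sort().join( '' );
}

console.log( markerSketch( 'meter' ) );
// -> 'emrt'
console.log( markerSketch( 'meter' ) === markerSketch( 'metre' ) );
// -> true
console.log( markerSketch( 'no melon' ) === markerSketch( 'no lemon' ) );
// -> true
```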

string.ngram

Generates an array of ngrams of a specified size from the input string. The default size is 2, which means it will generate bigrams by default.

string.ngram
Parameters
str (string) — the input string.
size (number = 2) — ngram's size.
Returns
Array<string>: ngrams of size from str.
Example
ngram( 'FRANCE' );
// -> [ 'FR', 'RA', 'AN', 'NC', 'CE' ]
ngram( 'FRENCH' );
// -> [ 'FR', 'RE', 'EN', 'NC', 'CH' ]
ngram( 'FRANCE', 3 );
// -> [ 'FRA', 'RAN', 'ANC', 'NCE' ]

string.phonetize

Phonetizes the input string using an algorithmic adaptation of Metaphone. It is not an exact implementation of Metaphone.

string.phonetize
Parameters
word (string) — the input word.
Returns
string: phonetic code of word.
Example
phonetize( 'perspective' );
// -> 'prspktv'
phonetize( 'phenomenon' );
// -> 'fnmnn'

string.removeElisions

Removes basic elisions found in the input string. Typical examples of elisions are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it retains the apostrophe used to indicate possession.

string.removeElisions
Parameters
str (string) — the input string.
Returns
string: input string after removal of elisions.
Example
removeElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is it?"

string.removeExtraSpaces

Removes leading, trailing and any extra in-between whitespaces from the input string.

string.removeExtraSpaces
Parameters
str (string) — the input string.
Returns
string: input string after removal of leading, trailing and extra whitespaces.
Example
removeExtraSpaces( '   Padded   Text    ' );
// -> 'Padded Text'

string.removeHTMLTags

Removes each HTML tag by replacing it with a whitespace.

Extra spaces, if required, may be removed using string.removeExtraSpaces function.

string.removeHTMLTags
Parameters
str (string) — the input string.
Returns
string: input string after removal of HTML tags.
Example
removeHTMLTags( '<p>Vive la France&nbsp;&#160;!</p>' );
// -> ' Vive la France  ! '
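
The two-step clean-up described above (blank out the markup, then remove the extra spaces) might look like the sketch below. It is not the library's implementation; stripTags and collapseSpaces are hypothetical names, and treating character entities like tags is an assumption drawn from the example output.

```javascript
// Hypothetical sketch: replace every tag and character entity with a space.
function stripTags( str ) {
  return str
    .replace( /<[^>]*>/g, ' ' )              // tags such as <p> or </p>
    .replace( /&(?:[a-z]+|#\d+);/gi, ' ' );  // entities such as &nbsp; or &#160;
}

// Hypothetical sketch: trim the ends and collapse runs of whitespace.
function collapseSpaces( str ) {
  return str.trim().replace( /\s+/g, ' ' );
}

const raw = '<p>Vive la France&nbsp;&#160;!</p>';
console.log( collapseSpaces( stripTags( raw ) ) );
// -> 'Vive la France !'
```

Regular expressions like these are adequate for simple snippets only; they are not a substitute for an HTML parser.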

string.removePunctuations

Removes each punctuation mark by replacing it with a whitespace. It looks for the following punctuations — .,;!?:"!'... - () [] {}.

Extra spaces, if required, may be removed using string.removeExtraSpaces function.

string.removePunctuations
Parameters
str (string) — the input string.
Returns
string: input string after removal of punctuations.
Example
removePunctuations( 'Punctuations like "\'\',;!?:"!... are removed' );
// -> 'Punctuations like               are removed'

string.removeSplChars

Removes each special character by replacing it with a whitespace. It looks for the following special characters — ~@#%^*+=.

Extra spaces, if required, may be removed using string.removeExtraSpaces function.

string.removeSplChars
Parameters
str (string) — the input string.
Returns
string: input string after removal of special characters.
Example
removeSplChars( '4 + 4*2 = 12' );
// -> '4   4 2   12'

string.retainAlphaNums

Retains only alphabetic characters and numerals, and removes all other characters from the input string, including leading, trailing and extra in-between whitespaces.

string.retainAlphaNums
Parameters
str (string) — the input string.
Returns
string: input string after removal of non-alphanumeric characters, leading, trailing and extra whitespaces.
Example
retainAlphaNums( ' This, text here, has  (other) chars_! ' );
// -> 'This text here has other chars'

string.sentences

Detects the sentence boundaries in the input paragraph and splits it into an array of sentence(s).

string.sentences
Parameters
paragraph (string) — the input string.
Returns
Array<string>: of sentences.
Example
sentences( 'AI Inc. is focussing on AI. I work for AI Inc. My mail is r2d2@yahoo.com' );
// -> [ 'AI Inc. is focussing on AI.',
//      'I work for AI Inc.',
//      'My mail is r2d2@yahoo.com' ]

sentences( 'U.S.A is my birth place. I was born on 06.12.1924. I climbed Mt. Everest.' );
// -> [ 'U.S.A is my birth place.',
//      'I was born on 06.12.1924.',
//      'I climbed Mt. Everest.' ]

string.soc

Creates a set of chars from the input string str. This is useful in even more aggressive string matching using Jaccard or Tversky compared to marker().

string.soc
Parameters
str (string) — the input string.
ifn (function = undefined) — a function to build index; it receives the first character of str and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
string: the soc.
Example
soc( 'the quick brown fox jumps over the lazy dog' );
// -> ' abcdefghijklmnopqrstuvwxyz'
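
To show how a set of chars feeds into Jaccard matching, here is a small sketch in plain JavaScript. Neither function comes from the library; setOfChars and jaccard are hypothetical names, and Jaccard similarity is taken as the size of the intersection divided by the size of the union.

```javascript
// Hypothetical sketch: the distinct characters of str as a Set.
function setOfChars( str ) {
  return new Set( str );
}

// Jaccard similarity of two sets: |A intersect B| / |A union B|.
function jaccard( a, b ) {
  let common = 0;
  a.forEach( ( x ) => { if ( b.has( x ) ) common += 1; } );
  const union = a.size + b.size - common;
  return union === 0 ? 1 : common / union;
}

console.log( jaccard( setOfChars( 'meter' ), setOfChars( 'metre' ) ) );
// -> 1
console.log( jaccard( setOfChars( 'cat' ), setOfChars( 'cart' ) ) );
// -> 0.75
```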

string.song

Generates the set of ngrams of the given size from the input string. The default size is 2, which means it will generate a set of bigrams by default.

string.song
Parameters
str (string) — the input string.
size (number = 2) — ngram size.
ifn (function = undefined) — a function to build index; it is called for every unique occurrence of ngram of str, and it receives the ngram and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
set: of ngrams of size of str.
Example
song( 'mama' );
// -> Set { 'ma', 'am' }
song( 'mamma' );
// -> Set { 'ma', 'am', 'mm' }

string.soundex

Produces the soundex code from the input word.

string.soundex
Parameters
word (string) — the input word.
maxLength (number = 4) — of soundex code to be returned.
Returns
string: soundex code of word.
Example
soundex( 'Burroughs' );
// -> 'B620'
soundex( 'Burrows' );
// -> 'B620'

string.splitElisions

Splits basic elisions found in the input string. Typical examples of elisions are it's, let's, where's, I'd, I'm, I'll, I've, and isn't. Note that it does not touch the apostrophe used to indicate possession.

string.splitElisions
Parameters
str (string) — the input string.
Returns
string: input string after splitting of elisions.
Example
splitElisions( "someone's wallet, isn't it?" );
// -> "someone's wallet, is n't it?"

string.stem

Stems an inflected word using Porter2 stemming algorithm.

string.stem
Parameters
word (string) — to be stemmed.
Returns
string: the stemmed word.
Example
stem( 'consisting' );
// -> 'consist'

string.tokenize

Tokenizes the input sentence according to the value of the detailed flag. Any occurrence of ... in the sentence is converted to ellipses. In detailed = true mode, it tags every token with its type; the supported tags are currency, email, emoji, emoticon, hashtag, number, ordinal, punctuation, quoted_phrase, symbol, time, mention, url, and word.

string.tokenize
Parameters
sentence (string) — the input string.
detailed (boolean = false) — if true, each token is an object containing the value and tag of the token; otherwise each token is a string. Its default value of false ensures compatibility with the previous version.
Returns
(Array<string> | Array<object>): an array of strings if detailed is false otherwise an array of objects.
Example
tokenize( "someone's wallet, isn't it? I'll return!" );
// -> [ 'someone', '\'s', 'wallet', ',', 'is', 'n\'t', 'it', '?',
//      'I', '\'ll', 'return', '!' ]

tokenize( 'For details on wink, check out http://winkjs.org/ URL!', true );
// -> [ { value: 'For', tag: 'word' },
//      { value: 'details', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'wink', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'check', tag: 'word' },
//      { value: 'out', tag: 'word' },
//      { value: 'http://winkjs.org/', tag: 'url' },
//      { value: 'URL', tag: 'word' },
//      { value: '!', tag: 'punctuation' } ]

string.tokenize0

Tokenizes by splitting the input string on non-words. This means tokens would consist of only alphas, numerals and underscores; all other characters will be stripped as they are treated as separators. It also removes all elisions; however, negations are retained and amplified.

string.tokenize0
Parameters
str (string) — the input string.
Returns
Array<string>: of tokens.
Example
tokenize0( "someone's wallet, isn't it?" );
// -> [ 'someone', 's', 'wallet', 'is', 'not', 'it' ]

string.trim

Trims leading and trailing whitespaces from the input string.

string.trim
Parameters
str (string) — the input string.
Returns
string: input string with leading & trailing whitespaces removed.
Example
trim( '  Padded   ' );
// -> 'Padded'

string.upperCase

Converts the input string to upper case.

string.upperCase
Parameters
str (string) — the input string.
Returns
string: input string in upper case.
Example
upperCase( 'Upper Case' );
// -> 'UPPER CASE'

tokens

tokens.appendBigrams

Generates bigrams from the input tokens and appends them to the input tokens.

tokens.appendBigrams
Parameters
tokens (Array<string>) — the input tokens.
Returns
Array<string>: the input tokens appended with their bigrams.
Example
appendBigrams( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'he',
//      'acted',
//      'decisively',
//      'today',
//      'he_acted',
//      'acted_decisively',
//      'decisively_today' ]
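
The behaviour shown above amounts to joining each adjacent pair of tokens with an underscore and appending the results. A minimal sketch, with the hypothetical name appendBigramsSketch and not taken from the library:

```javascript
// Hypothetical sketch: append 'left_right' for every adjacent token pair.
function appendBigramsSketch( tokens ) {
  const joined = [];
  for ( let i = 0; i < tokens.length - 1; i += 1 ) {
    joined.push( tokens[ i ] + '_' + tokens[ i + 1 ] );
  }
  // concat() returns a new array, leaving the input tokens untouched.
  return tokens.concat( joined );
}

console.log( appendBigramsSketch( [ 'he', 'acted', 'decisively', 'today' ] ) );
// -> [ 'he', 'acted', 'decisively', 'today',
//      'he_acted', 'acted_decisively', 'decisively_today' ]
```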

tokens.bigrams

Generates bigrams from the input tokens.

tokens.bigrams
Parameters
tokens (Array<string>) — the input tokens.
Returns
Array<Array<string>>: the bigrams, each as a pair of tokens.
Example
bigrams( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ [ 'he', 'acted' ],
//      [ 'acted', 'decisively' ],
//      [ 'decisively', 'today' ] ]

tokens.bow

Generates the bag of words from the input tokens. By default it uses the word count as its frequency; but if the logCounts parameter is set to true then it will use log2( word count + 1 ) as its frequency.

tokens.bow
Parameters
tokens (Array<string>) — the input tokens.
logCounts (boolean = false) — a true value flags the use of log2( word count + 1 ) instead of just word count as frequency.
ifn (function = undefined) — a function to build index; it is called for every unique occurrence of word in tokens, and it receives the word and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
object: bag of words from tokens.
Example
bow( [ 'rain', 'rain', 'go', 'away' ] );
// -> { rain: 2, go: 1, away: 1 }
bow( [ 'rain', 'rain', 'go', 'away' ], true );
// -> { rain: 1.584962500721156, go: 1, away: 1 }
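
The effect of logCounts can be reproduced with a short sketch in plain JavaScript. This is an illustration of the log2( word count + 1 ) formula only, not the library's code; bagOfWords is a hypothetical name.

```javascript
// Hypothetical sketch: word counts, optionally damped as log2( count + 1 ).
function bagOfWords( tokens, logCounts = false ) {
  const counts = new Map();
  tokens.forEach( ( t ) => counts.set( t, ( counts.get( t ) || 0 ) + 1 ) );
  const bag = {};
  counts.forEach( ( c, w ) => {
    bag[ w ] = logCounts ? Math.log2( c + 1 ) : c;
  } );
  return bag;
}

console.log( bagOfWords( [ 'rain', 'rain', 'go', 'away' ] ) );
// -> { rain: 2, go: 1, away: 1 }
console.log( bagOfWords( [ 'rain', 'rain', 'go', 'away' ], true ).go );
// -> 1, because log2( 1 + 1 ) is 1
```

The damping keeps a word that occurs once at a frequency of 1 while slowing the growth for repeated words.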

tokens.phonetize

Phonetizes input tokens using an algorithmic adaptation of Metaphone.

tokens.phonetize
Parameters
tokens (Array<string>) — the input tokens.
Returns
Array<string>: phonetized tokens.
Example
phonetize( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'h', 'aktd', 'dssvl', 'td' ]

tokens.propagateNegations

Looks for negation tokens in the input array of tokens and propagates the negation to the subsequent upto tokens by prefixing each of them with a !. It is useful in handling text containing negations during tasks like similarity detection, classification or search.

tokens.propagateNegations
Parameters
tokens (Array<string>) — the input tokens.
upto (number = 2) — number of tokens to be negated after the negation token. Note that negation stops after upto tokens or at the token preceding any of the , . ; : ! ? punctuations, whichever comes first.
Returns
Array<string>: tokens with negation propagated.
Example
propagateNegations( [ 'mary', 'is', 'not', 'feeling', 'good', 'today' ] );
// -> [ 'mary', 'is', 'not', '!feeling', '!good', 'today' ]
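
The propagation rule can be sketched as a single pass over the tokens. The code below is an illustration, not the library's implementation: the name propagateNegationsSketch is hypothetical, and the NEGATIONS list is an assumed, deliberately tiny stand-in for whatever negation words the library recognises.

```javascript
// Assumed, illustrative lists; the library's actual negation words may differ.
const NEGATIONS = new Set( [ 'not', 'no', 'never' ] );
const STOPPERS = new Set( [ ',', '.', ';', ':', '!', '?' ] );

// Hypothetical sketch: prefix up to `upto` tokens after a negation with '!',
// stopping early at any of the punctuations in STOPPERS.
function propagateNegationsSketch( tokens, upto = 2 ) {
  let remaining = 0;
  return tokens.map( ( t ) => {
    if ( NEGATIONS.has( t ) ) { remaining = upto; return t; }
    if ( STOPPERS.has( t ) ) { remaining = 0; return t; }
    if ( remaining > 0 ) { remaining -= 1; return '!' + t; }
    return t;
  } );
}

console.log( propagateNegationsSketch( [ 'mary', 'is', 'not', 'feeling', 'good', 'today' ] ) );
// -> [ 'mary', 'is', 'not', '!feeling', '!good', 'today' ]
console.log( propagateNegationsSketch( [ 'not', 'bad', ',', 'but', 'slow' ] ) );
// -> [ 'not', '!bad', ',', 'but', 'slow' ]
```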

tokens.removeWords

Removes the stop words from the input array of tokens.

tokens.removeWords
Parameters
tokens (Array<string>) — the input tokens.
stopWords (wordsFilter = defaultStopWords) — default stop words are loaded from stop_words.json located under the src/dictionaries/ directory. Custom stop words can be created using helper.returnWordsFilter.
Returns
Array<string>: the tokens that remain after removing the stop words.
Example
removeWords( [ 'this', 'is', 'a', 'cat' ] );
// -> [ 'cat' ]

tokens.soundex

Generates the soundex coded tokens from the input tokens.

tokens.soundex
Parameters
tokens (Array<string>) — the input tokens.
Returns
Array<string>: soundex coded tokens.
Example
soundex( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'H000', 'A233', 'D221', 'T300' ]

tokens.sow

Generates the set of words from the input tokens.

tokens.sow
Parameters
tokens (Array<string>) — the input tokens.
ifn (function = undefined) — a function to build index; it is called for every member word of the set, and it receives the word and the idx as input arguments. The build() function of helper.returnIndexer may be used as ifn. If undefined then index is not built.
idx (number = undefined) — the index; passed as the second argument to the ifn function.
Returns
set: of words from tokens.
Example
sow( [ 'rain', 'rain', 'go', 'away' ] );
// -> Set { 'rain', 'go', 'away' }

tokens.stem

Stems input tokens using Porter Stemming Algorithm Version 2.

tokens.stem
Parameters
tokens (Array<string>) — the input tokens.
Returns
Array<string>: stemmed tokens.
Example
stem( [ 'he', 'acted', 'decisively', 'today' ] );
// -> [ 'he', 'act', 'decis', 'today' ]

helper

helper.returnIndexer

Returns an Indexer object that contains two functions. The first function build() incrementally builds an index for each element using itsIndex — both passed as parameters to it. The second function — result() allows accessing the index anytime.

It is typically used with string.soc, string.bong, string.song, and tokens.sow.

helper.returnIndexer
Returns
indexer: used to build and access the index.
Example
var indexer = returnIndexer();
// -> { build: [function], result: [function] }
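
To make the build()/result() contract concrete, the sketch below implements a comparable indexer in plain JavaScript and feeds it the unique words of two tiny documents. It follows the description of the indexer type in this document but is not the library's code; makeIndexer and the sample documents are hypothetical.

```javascript
// Hypothetical sketch of an indexer: element -> list of index values.
function makeIndexer() {
  const index = new Map();
  return {
    // Record that `element` was seen at `itsIndex`.
    build( element, itsIndex ) {
      if ( !index.has( element ) ) index.set( element, [] );
      index.get( element ).push( itsIndex );
    },
    // Return the index built so far as a plain object.
    result() {
      return Object.fromEntries( index );
    }
  };
}

const indexer = makeIndexer();
[ 'rain go', 'go away' ].forEach( ( doc, i ) => {
  // Index each unique word of the document against the document number.
  new Set( doc.split( ' ' ) ).forEach( ( w ) => indexer.build( w, i ) );
} );
console.log( indexer.result() );
// -> { rain: [ 0 ], go: [ 0, 1 ], away: [ 1 ] }
```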

helper.returnQuotedTextExtractor

Returns a function that extracts all occurrences of every quoted text between the lq and the rq characters from its argument. This argument must be of type string.

helper.returnQuotedTextExtractor
Parameters
lq (string = '"') — the left quote character.
rq (string = '"') — the right quote character.
Returns
function: that will accept an input string argument and return an array of all substrings that are quoted between lq and rq.
Example
var extractQuotedText = returnQuotedTextExtractor();
extractQuotedText( 'Raise 2 issues - "fix a bug" & "run tests"' );
// -> [ 'fix a bug', 'run tests' ]
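
One way such an extractor could be built is with a regular expression assembled from the two quote characters. The sketch below is an assumption about the approach, not the library's implementation; makeQuotedTextExtractor is a hypothetical name and it expects lq and rq to be single characters.

```javascript
// Hypothetical sketch: build a function that extracts text between lq and rq.
function makeQuotedTextExtractor( lq = '"', rq = '"' ) {
  // Escape characters that have a special meaning inside a RegExp.
  const esc = ( c ) => c.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
  // lq, then any run of characters other than rq (captured), then rq.
  const rgx = new RegExp( esc( lq ) + '([^' + esc( rq ) + ']*)' + esc( rq ), 'g' );
  return ( str ) => Array.from( str.matchAll( rgx ), ( m ) => m[ 1 ] );
}

const extractQuoted = makeQuotedTextExtractor();
console.log( extractQuoted( 'Raise 2 issues - "fix a bug" & "run tests"' ) );
// -> [ 'fix a bug', 'run tests' ]

const extractBracketed = makeQuotedTextExtractor( '[', ']' );
console.log( extractBracketed( '[I] [have]' ) );
// -> [ 'I', 'have' ]
```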

helper.returnWordsFilter

Returns an object containing the following functions: (a) set(), which returns a set of mapped words given in the input array words. (b) exclude() that is suitable for array filtering operations.

If the second argument mappers is provided as an array of mapping functions then these are applied on the input array before converting it into a set. A mapper function must accept a string as argument and return a string as the result. Examples of mapper functions are typically string functions of wink-nlp-utils such as string.lowerCase(), string.stem() and string.soundex().

helper.returnWordsFilter
Parameters
words (Array<string>) — that can be filtered using the returned wordsFilter.
mappers (Array<function> = undefined) — optionally used to map each word before creating the wordsFilter.
Returns
wordsFilter: object containing set() and exclude() functions for words.
Example
var stopWords = [ 'This', 'That', 'Are', 'Is', 'Was', 'Will', 'a' ];
var myFilter = returnWordsFilter( stopWords, [ string.lowerCase ] );
[ 'this', 'is', 'a', 'cat' ].filter( myFilter.exclude );
// -> [ 'cat' ]
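
The shape of the returned object can be illustrated with a small plain JavaScript sketch. It is not the library's implementation: makeWordsFilter and lower are hypothetical names, and this version applies the mappers only to the stop words, not to the tokens being filtered.

```javascript
// Hypothetical sketch: build a filter from words, optionally mapped first.
function makeWordsFilter( words, mappers = [] ) {
  // Run every word through each mapper in turn.
  const mapped = words.map( ( w ) => mappers.reduce( ( acc, fn ) => fn( acc ), w ) );
  const wordSet = new Set( mapped );
  return {
    set: () => wordSet,
    // True for tokens to keep, so it can be passed to Array.prototype.filter.
    exclude: ( token ) => !wordSet.has( token )
  };
}

const lower = ( s ) => s.toLowerCase();
const filterSketch = makeWordsFilter( [ 'This', 'That', 'Are', 'Is', 'Was', 'Will', 'a' ], [ lower ] );
console.log( [ 'this', 'is', 'a', 'cat' ].filter( filterSketch.exclude ) );
// -> [ 'cat' ]
```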

type-defs

indexer

indexer

Type: object

Properties
build (function) : accepts two parameters viz. element and itsIndex to incrementally build index for each element/itsIndex combination passed.
result (function) : is used to access the index. This index is in the form of an object that contains each element as key. The value of each key is an array containing all index positions to the element in question. Note these index positions are nothing but each itsIndex value passed for the element.

wordsFilter

wordsFilter

Type: Object

Properties
set (function) : contains the set created from the array words.
exclude (function) : used with array's filter method to exclude the words, or the mapped words if mappers are defined.