Introduction

wink-distance

Distance/Similarity functions for Bag of Words, Strings, Vectors and more.

Compute distances or similarities needed for NLP, de-duplication and clustering using wink-distance. It is part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

Installation

Use npm to install:

npm install wink-distance --save

Documentation

Check out the distance/similarity API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-distance is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

bow

bow.cosine

Computes the cosine distance between the input bags of words (bow) a and b, and returns a value between 0 and 1.

Parameters
a (object) — the first bow, i.e. pairs of a word (the key) and its frequency (the value).
b (object) — the second bow.
Returns
number: cosine distance between a and b.
Example
// bow for "the dog chased the cat"
var a = { the: 2, dog: 1, chased: 1, cat: 1 };
// bow for "the cat chased the mouse"
var b = { the: 2, cat: 1, chased: 1, mouse: 1 };
cosine( a, b );
// -> 0.14285714285714302
// Note the bow could have been created directly by
// using "tokens.bow()" from the "wink-nlp-utils".
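
As a rough illustration of how such a value can be derived, the sketch below computes one minus the cosine similarity of two bows. The name bowCosineSketch is hypothetical and not part of the wink-distance API, and the code assumes both bows are non-empty plain objects with numeric frequencies.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
// Assumes a and b are non-empty objects mapping word -> frequency.
function bowCosineSketch( a, b ) {
  var dot = 0;
  var sumSqA = 0;
  var sumSqB = 0;
  Object.keys( a ).forEach( function ( word ) {
    sumSqA += a[ word ] * a[ word ];
    // Only words present in both bows contribute to the dot product.
    if ( Object.prototype.hasOwnProperty.call( b, word ) ) {
      dot += a[ word ] * b[ word ];
    }
  } );
  Object.keys( b ).forEach( function ( word ) {
    sumSqB += b[ word ] * b[ word ];
  } );
  // Distance is one minus the cosine similarity.
  return 1 - ( dot / Math.sqrt( sumSqA * sumSqB ) );
}
```

For the two bows in the example above, the dot product is 6 and each squared norm is 7, so the result is approximately 1/7. The trailing digits may differ from the library's output because of floating point rounding.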

number

number.hamming

Computes the hamming distance between two numbers; each number is assumed to be the decimal representation of a binary number.

Parameters
na (number) — the first number.
nb (number) — the second number.
Returns
number: hamming distance between na and nb.
Example
hamming( 8, 8 );
// -> 0
hamming( 8, 15 );
// -> 3
hamming( 9, 15 );
// -> 2
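
One common way to obtain this count is to XOR the two numbers and count the set bits of the result. The sketch below assumes non-negative integers that fit in 32 bits; numberHammingSketch is a hypothetical name, not the library's function.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
// Assumes na and nb are non-negative integers below 2^32.
function numberHammingSketch( na, nb ) {
  // Bits that differ between na and nb are set in the XOR.
  var diff = ( na ^ nb ) >>> 0;
  var count = 0;
  while ( diff !== 0 ) {
    count += diff & 1;
    diff >>>= 1;
  }
  return count;
}
```

For 8 (binary 1000) and 15 (binary 1111), the XOR is 0111, which has three set bits.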

set

set.jaccard

Computes the Jaccard distance between input sets sa and sb. This distance is always between 0 and 1.

Parameters
sa (set) — the first set.
sb (set) — the second set.
Returns
number: the Jaccard distance between sa and sb.
Example
// Set for :-)
var sa = new Set( ':-)' );
// Set for :-(
var sb = new Set( ':-(' );
jaccard( sa, sb );
// -> 0.5
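
The Jaccard distance is one minus the ratio of the intersection size to the union size. The following is a minimal sketch of that definition; jaccardSketch is a hypothetical name, and returning 0 for two empty sets is an assumption rather than documented behaviour.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function jaccardSketch( sa, sb ) {
  var common = 0;
  sa.forEach( function ( element ) {
    if ( sb.has( element ) ) common += 1;
  } );
  var union = sa.size + sb.size - common;
  // Assumption: two empty sets are treated as identical.
  if ( union === 0 ) return 0;
  return 1 - ( common / union );
}
```

For the sets of ':-)' and ':-(' the intersection has 2 elements and the union has 4, giving 1 - 2/4 = 0.5.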

set.tversky

Computes the tversky distance between input sets sa and sb. This distance is always between 0 and 1. Tversky calls sa the prototype and sb the variant. The alpha is the weight of the prototype, whereas beta is the weight of the variant.

Parameters
sa (set) — the first set or the prototype.
sb (set) — the second set or the variant.
alpha (number = 0.5) — the prototype weight.
beta (number = 0.5) — the variant weight.
Returns
number: the tversky distance between sa and sb.
Example
// Set for :-)
var sa = new Set( ':-)' );
// Set for :p
var sb = new Set( ':p' );
tversky( sa, sb, 1, 0 );
// -> 0.6666666666666667
tversky( sa, sb );
// -> 0.6
tversky( sa, sb, 0.5, 0.5 );
// -> 0.6
tversky( sa, sb, 0, 1 );
// -> 0.5
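
The role of alpha and beta can be seen in the usual Tversky index, where the elements found only in the prototype are weighted by alpha and those found only in the variant by beta. The sketch below is consistent with the example values above; tverskySketch is a hypothetical name, and the handling of a zero denominator is an assumption.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function tverskySketch( sa, sb, alpha, beta ) {
  var a = ( alpha === undefined ) ? 0.5 : alpha;
  var b = ( beta === undefined ) ? 0.5 : beta;
  var common = 0;
  sa.forEach( function ( element ) {
    if ( sb.has( element ) ) common += 1;
  } );
  var onlyInA = sa.size - common; // elements unique to the prototype
  var onlyInB = sb.size - common; // elements unique to the variant
  var denominator = common + ( a * onlyInA ) + ( b * onlyInB );
  // Assumption: nothing to compare means no distance.
  if ( denominator === 0 ) return 0;
  return 1 - ( common / denominator );
}
```

For the sets of ':-)' and ':p', one element is shared, two are unique to sa and one is unique to sb. With alpha = 1 and beta = 0 this gives 1 - 1/(1 + 2) = 2/3.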

string

string.hamming

Computes the hamming distance between two strings of identical length. This distance is always >= 0.

Parameters
str1 (string) — first string.
str2 (string) — second string.
Returns
number: hamming distance between str1 and str2.
Example
hamming( 'john', 'john' );
// -> 0
hamming( 'sam', 'sat' );
// -> 1
hamming( 'summer', 'samuel' );
// -> 3
hamming( 'saturn', 'urn' );
// -> throws error
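
A minimal sketch of this computation counts the positions at which the two strings differ and rejects inputs of unequal length. The name stringHammingSketch and the error message are hypothetical.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function stringHammingSketch( str1, str2 ) {
  if ( str1.length !== str2.length ) {
    throw new Error( 'strings must have identical length' );
  }
  var distance = 0;
  for ( var i = 0; i < str1.length; i += 1 ) {
    if ( str1[ i ] !== str2[ i ] ) distance += 1;
  }
  return distance;
}
```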

string.hammingNormalized

Computes the normalized hamming distance between two strings. These strings may be of different lengths. Normalized distance is always between 0 and 1.

Parameters
str1 (string) — first string.
str2 (string) — second string.
Returns
number: normalized hamming distance between str1 and str2.
Example
hammingNormalized( 'john', 'johny' );
// -> 0.2
hammingNormalized( 'sam', 'sam' );
// -> 0
hammingNormalized( 'sam', 'samuel' );
// -> 0.5
hammingNormalized( 'saturn', 'urn' );
// -> 1
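
The example values above are consistent with the following scheme: compare the strings position by position over the shorter length, count every extra character of the longer string as a mismatch, and divide by the longer length. This is a sketch under that assumption, not the library's code; hammingNormalizedSketch is a hypothetical name.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function hammingNormalizedSketch( str1, str2 ) {
  var shorter = Math.min( str1.length, str2.length );
  var longer = Math.max( str1.length, str2.length );
  // Assumption: two empty strings are treated as identical.
  if ( longer === 0 ) return 0;
  // Every character beyond the shorter string counts as a mismatch.
  var mismatches = longer - shorter;
  for ( var i = 0; i < shorter; i += 1 ) {
    if ( str1[ i ] !== str2[ i ] ) mismatches += 1;
  }
  return mismatches / longer;
}
```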

string.jaro

Computes the jaro distance between two strings. This distance is always between 0 and 1.

Parameters
str1 (string) — first string.
str2 (string) — second string.
Returns
number: jaro distance between str1 and str2.
Example
jaro( 'father', 'farther' );
// -> 0.04761904761904756
jaro( 'abcdef', 'fedcba' );
// -> 0.6111111111111112
jaro( 'sat', 'urn' );
// -> 1
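
For readers unfamiliar with the measure, the sketch below follows the textbook definition of Jaro similarity (matching characters within a window, then counting transpositions) and returns one minus that similarity. It is an illustration under those assumptions; jaroDistanceSketch is a hypothetical name and the library's implementation may differ in details.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function jaroDistanceSketch( s1, s2 ) {
  if ( s1 === s2 ) return 0;
  var len1 = s1.length;
  var len2 = s2.length;
  if ( len1 === 0 || len2 === 0 ) return 1;
  // Characters match only if they are equal and close enough in position.
  var range = Math.max( 0, Math.floor( Math.max( len1, len2 ) / 2 ) - 1 );
  var matched1 = new Array( len1 ).fill( false );
  var matched2 = new Array( len2 ).fill( false );
  var matches = 0;
  var i, j;
  for ( i = 0; i < len1; i += 1 ) {
    var lo = Math.max( 0, i - range );
    var hi = Math.min( len2 - 1, i + range );
    for ( j = lo; j <= hi; j += 1 ) {
      if ( !matched2[ j ] && s1[ i ] === s2[ j ] ) {
        matched1[ i ] = true;
        matched2[ j ] = true;
        matches += 1;
        break;
      }
    }
  }
  if ( matches === 0 ) return 1;
  // Count matched characters that appear in a different order.
  var outOfOrder = 0;
  var k = 0;
  for ( i = 0; i < len1; i += 1 ) {
    if ( !matched1[ i ] ) continue;
    while ( !matched2[ k ] ) k += 1;
    if ( s1[ i ] !== s2[ k ] ) outOfOrder += 1;
    k += 1;
  }
  var transpositions = outOfOrder / 2;
  var similarity = ( ( matches / len1 ) + ( matches / len2 ) +
    ( ( matches - transpositions ) / matches ) ) / 3;
  return 1 - similarity;
}
```

For 'father' and 'farther' all 6 characters of the first string match in order, so the similarity is (6/6 + 6/7 + 6/6) / 3 and the distance is roughly 0.0476.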

string.jaroWinkler

Computes the jaro winkler distance between two strings. This distance, controlled by the scalingFactor, is always between 0 and 1.

Parameters
str1 (string) — first string.
str2 (string) — second string.
boostThreshold (number = 0.3) — the jaro distance threshold for scaling: scaling is applied only if the jaro distance between the input strings is less than or equal to this value. Any value > 1 is capped at 1 automatically.
scalingFactor (number = 0.1) — used to scale the distance. The scaling, if applied, is proportional to the number of consecutive characters shared by str1 and str2, counted from their first character. Any value > 0.25 is capped at 0.25 automatically.
Returns
number: jaro winkler distance between str1 and str2.
Example
jaroWinkler( 'martha', 'marhta' );
// -> 0.03888888888888883
jaroWinkler( 'martha', 'marhta', 0.3, 0.2 );
// -> 0.022222222222222185
jaroWinkler( 'duane', 'dwayne' );
// -> 0.15999999999999992
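
How the two optional parameters interact can be sketched as follows. To stay self-contained, the helper takes an already computed jaro distance as its first argument. The name winklerAdjustSketch, the extra jaroDistance parameter, and the cap of 4 on the shared prefix length (taken from the usual formulation of Jaro-Winkler) are assumptions, not something this documentation states.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
// jaroDistance is the precomputed jaro distance between str1 and str2.
function winklerAdjustSketch( jaroDistance, str1, str2, boostThreshold, scalingFactor ) {
  var bt = Math.min( ( boostThreshold === undefined ) ? 0.3 : boostThreshold, 1 );
  var sf = Math.min( ( scalingFactor === undefined ) ? 0.1 : scalingFactor, 0.25 );
  // No scaling when the strings are further apart than the threshold.
  if ( jaroDistance > bt ) return jaroDistance;
  // Assumption: at most 4 leading characters are considered.
  var limit = Math.min( str1.length, str2.length, 4 );
  var prefix = 0;
  while ( prefix < limit && str1[ prefix ] === str2[ prefix ] ) prefix += 1;
  // A longer shared prefix shrinks the distance further.
  return jaroDistance * ( 1 - ( prefix * sf ) );
}
```

For 'martha' and 'marhta' the jaro distance is 1/18 and the shared prefix is 'mar', so the default parameters give (1/18) * (1 - 3 * 0.1), roughly 0.0389, in line with the first example above.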

string.levenshtein

Computes the levenshtein distance between two strings. This distance is the minimum number of deletions, insertions, or substitutions required to transform one string into the other. Levenshtein distance is always an integer with a value of 0 or more.

Parameters
str1 (string) — first string.
str2 (string) — second string.
Returns
number: levenshtein distance between str1 and str2.
Example
levenshtein( 'example', 'sample' );
// -> 2
levenshtein( 'distance', 'difference' );
// -> 5
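
The edit count described above is commonly computed with dynamic programming over prefixes of the two strings. The sketch below keeps only two rows of the table; levenshteinSketch is a hypothetical name and this is not necessarily how the library implements it.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
function levenshteinSketch( str1, str2 ) {
  var prev = [];
  var curr = [];
  var i, j, cost, tmp;
  // Distance from the empty prefix of str1 to each prefix of str2.
  for ( j = 0; j <= str2.length; j += 1 ) prev[ j ] = j;
  for ( i = 1; i <= str1.length; i += 1 ) {
    curr[ 0 ] = i;
    for ( j = 1; j <= str2.length; j += 1 ) {
      cost = ( str1[ i - 1 ] === str2[ j - 1 ] ) ? 0 : 1;
      curr[ j ] = Math.min(
        prev[ j ] + 1,        // deletion
        curr[ j - 1 ] + 1,    // insertion
        prev[ j - 1 ] + cost  // substitution or match
      );
    }
    // The current row becomes the previous row for the next character.
    tmp = prev;
    prev = curr;
    curr = tmp;
  }
  return prev[ str2.length ];
}
```

As a usage check, levenshteinSketch( 'kitten', 'sitting' ) returns 3: two substitutions and one insertion.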

string.soundex

Computes the soundex distance between two strings. This distance is either 0, indicating phonetic similarity, or 1, indicating no similarity.

Parameters
str1 (string) — first string.
str2 (string) — second string.
Returns
number: soundex distance between str1 and str2.
Example
soundex( 'Burroughs', 'Burrows' );
// -> 0
soundex( 'Ekzampul', 'example' );
// -> 0
soundex( 'sat', 'urn' );
// -> 1
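
A binary distance of this kind can be obtained by encoding both strings and comparing the codes. The sketch below assumes the classic American Soundex rules with 4-character codes; soundexCodeSketch and soundexDistanceSketch are hypothetical names, and the encoder used by the library may differ in details.

```javascript
// Hypothetical helpers for illustration; not the wink-distance source.
var SOUNDEX_DIGITS = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6
};

function soundexCodeSketch( word ) {
  var letters = word.toLowerCase().replace( /[^a-z]/g, '' );
  if ( letters.length === 0 ) return '';
  var code = letters[ 0 ].toUpperCase();
  var last = SOUNDEX_DIGITS[ letters[ 0 ] ] || 0;
  for ( var i = 1; i < letters.length && code.length < 4; i += 1 ) {
    var ch = letters[ i ];
    // h and w do not separate letters that share a digit.
    if ( ch === 'h' || ch === 'w' ) continue;
    var digit = SOUNDEX_DIGITS[ ch ] || 0;
    // Vowels have no digit but reset the duplicate check.
    if ( digit !== 0 && digit !== last ) code += digit;
    last = digit;
  }
  return ( code + '000' ).slice( 0, 4 );
}

function soundexDistanceSketch( str1, str2 ) {
  // 0 when the codes agree, 1 otherwise.
  return ( soundexCodeSketch( str1 ) === soundexCodeSketch( str2 ) ) ? 0 : 1;
}
```

Under these rules both 'Burroughs' and 'Burrows' encode to B620, so their distance is 0, while 'sat' (S300) and 'urn' (U650) are at distance 1.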

vector

vector.chebyshev

Computes the chebyshev (also called chessboard or maximum) distance between two vectors of identical length.

Parameters
va (array of numbers) — the first vector.
vb (array of numbers) — the second vector.
Returns
number: chebyshev distance between va and vb.
Example
chebyshev( [ 0, 0 ], [ 6, 6 ] );
// -> 6
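
The chebyshev distance is the largest absolute difference between corresponding elements. A minimal sketch, assuming two arrays of numbers with the same length; chebyshevSketch is a hypothetical name.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
// Assumes va and vb are arrays of numbers with identical length.
function chebyshevSketch( va, vb ) {
  var distance = 0;
  for ( var i = 0; i < va.length; i += 1 ) {
    distance = Math.max( distance, Math.abs( va[ i ] - vb[ i ] ) );
  }
  return distance;
}
```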

vector.taxicab

Computes the taxicab or manhattan distance between two vectors of identical length.

Parameters
va (array of numbers) — the first vector.
vb (array of numbers) — the second vector.
Returns
number: taxicab distance between va and vb.
Example
taxicab( [ 0, 0 ], [ 6, 6 ] );
// -> 12
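
The taxicab distance is the sum of the absolute differences between corresponding elements. A minimal sketch, assuming two arrays of numbers with the same length; taxicabSketch is a hypothetical name.

```javascript
// Hypothetical helper for illustration; not the wink-distance source.
// Assumes va and vb are arrays of numbers with identical length.
function taxicabSketch( va, vb ) {
  var distance = 0;
  for ( var i = 0; i < va.length; i += 1 ) {
    distance += Math.abs( va[ i ] - vb[ i ] );
  }
  return distance;
}
```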