similarity

similarity → { object }

This utility provides several text similarity measures: cosine similarity for bags of words (BoW), and Tversky and Otsuka-Ochiai similarity for sets. The bag-of-words model of a document can be obtained using as.bow, and its set of tokens using as.set. For example:

// Obtain the BoW of a document.
const bow1 = doc1.tokens().out(its.value, as.bow);
// Obtain the set of a document.
const set1 = doc1.tokens().out(its.value, as.set);
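
To make the two input shapes concrete, here is a minimal plain-JavaScript sketch. It assumes that as.bow yields an object mapping each token to its count and that as.set yields a JavaScript Set. The helpers toBow and toSet are hypothetical stand-ins written for illustration, not part of winkNLP:

```javascript
// Illustrative token values, similar to what doc.tokens().out(its.value) might return.
const tokens = ['to', 'be', 'or', 'not', 'to', 'be'];

// Hypothetical helper: build a bag of words as an object of token -> count.
function toBow(items) {
  // A prototype-free object avoids clashes with names like "constructor".
  const bow = Object.create(null);
  for (const item of items) {
    bow[item] = (bow[item] || 0) + 1;
  }
  return bow;
}

// Hypothetical helper: build a set, which keeps each distinct token once.
function toSet(items) {
  return new Set(items);
}

const bow = toBow(tokens); // counts: to = 2, be = 2, or = 1, not = 1
const set = toSet(tokens); // 4 distinct tokens: to, be, or, not
```

The BoW keeps frequency information, which is what cosine similarity uses, while the set records only presence, which is what the set-based measures use.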

You can also pre-process the text before comparison using winkNLP's methods. For example, stop words or punctuation can be removed using the .filter() method before obtaining a BoW or set.
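
One way this filtering might look is sketched below. The function name contentBow is hypothetical. It expects a winkNLP document plus the its and as helpers from the same winkNLP instance, and it assumes that keeping only tokens of type 'word' is an acceptable way to drop punctuation for your use case:

```javascript
// Hypothetical helper: build a BoW from content words only.
// `doc` is a winkNLP document; `its` and `as` are the helpers
// exposed by the winkNLP instance (nlp.its and nlp.as).
function contentBow(doc, its, as) {
  return doc
    .tokens()
    // Keep word tokens that are not stop words; punctuation is dropped
    // because its token type is not 'word'.
    .filter((t) => t.out(its.type) === 'word' && !t.out(its.stopWordFlag))
    // Use the normalized token text so that case differences do not matter.
    .out(its.normal, as.bow);
}
```

Swapping as.bow for as.set in the last call would give the filtered set instead.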

Require this utility using the following statement:

const similarity = require('wink-nlp/utilities/similarity.js');

The similarity variable exposes the following methods:

Name | Description
similarity.bow.cosine(bowA, bowB) | Measures similarity between the two BoWs using cosine similarity.
similarity.set.tversky(setA, setB[, alpha, beta]) | Measures similarity between the two sets using the Tversky index. Both alpha and beta default to 0.5. In the standard Tversky index, alpha = beta = 1 gives Jaccard similarity and alpha = beta = 0.5 gives Sørensen-Dice similarity.
similarity.set.oo(setA, setB) | Measures Otsuka-Ochiai similarity between the two sets; this is equivalent to cosine similarity with a binarized BoW.

All methods return a value between 0 and 1.
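
The following plain-JavaScript sketch shows what these measures compute, using the textbook formulas. It is not winkNLP's implementation: the function names are hypothetical, and returning 0 for empty inputs is an assumption of this sketch.

```javascript
// Cosine similarity between two bags of words (objects of token -> count).
function cosine(bowA, bowB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const t of Object.keys(bowA)) {
    normA += bowA[t] * bowA[t];
    if (Object.prototype.hasOwnProperty.call(bowB, t)) dot += bowA[t] * bowB[t];
  }
  for (const t of Object.keys(bowB)) normB += bowB[t] * bowB[t];
  // Assumption of this sketch: an empty BoW has similarity 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Count the elements shared by two sets.
function commonCount(setA, setB) {
  let common = 0;
  for (const x of setA) if (setB.has(x)) common += 1;
  return common;
}

// Tversky index: |A ∩ B| / (|A ∩ B| + alpha * |A - B| + beta * |B - A|).
function tversky(setA, setB, alpha = 0.5, beta = 0.5) {
  const common = commonCount(setA, setB);
  const denom = common + alpha * (setA.size - common) + beta * (setB.size - common);
  return denom === 0 ? 0 : common / denom;
}

// Otsuka-Ochiai coefficient: |A ∩ B| / sqrt(|A| * |B|).
function otsukaOchiai(setA, setB) {
  if (setA.size === 0 || setB.size === 0) return 0;
  return commonCount(setA, setB) / Math.sqrt(setA.size * setB.size);
}

// Turn a set into a binarized BoW: every token gets a count of 1.
function binarize(set) {
  return Object.fromEntries([...set].map((x) => [x, 1]));
}

const setA = new Set(['a', 'b', 'c']);
const setB = new Set(['b', 'c', 'd', 'e']);

const jaccard = tversky(setA, setB, 1, 1); // 2 / (2 + 1 + 2) = 0.4
const dice = tversky(setA, setB);          // 2 / (2 + 0.5 + 1) = 4 / 7
const oo = otsukaOchiai(setA, setB);       // 2 / sqrt(3 * 4)
const binaryCosine = cosine(binarize(setA), binarize(setB)); // same value as oo
```

With both sets sharing two of their elements, every measure lands strictly between 0 and 1. Identical non-empty inputs give 1 and disjoint inputs give 0.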

