similarity
similarity → { object }
This utility provides several text similarity measures: cosine for bag-of-words (BoW), and Tversky and Otsuka-Ochiai for sets. The bag-of-words model of a document can be obtained using as.bow, and the set can be derived using as.set. For example:
// Obtain the bow of a document (its and as are the nlp.its and nlp.as helpers).
const bow1 = doc1.tokens().out(its.value, as.bow);
// Obtain the set of a document.
const set1 = doc1.tokens().out(its.value, as.set);
It is also possible to pre-process the text prior to comparison using winkNLP's methods. For example, stop words or punctuation can be removed using the .filter() method before obtaining a bow or set.
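To illustrate why pre-processing matters, the following minimal sketch builds bags of words from a plain token array, with and without a filter. It does not call winkNLP: the STOP_WORDS list, the toBow and keepContentWords helpers, and the sample tokens are all hypothetical stand-ins chosen for this example.

```javascript
// Hypothetical stop-word list; winkNLP ships its own, this one is illustrative only.
const STOP_WORDS = new Set(['the', 'a', 'on', 'is']);

// Build a bag of words (token -> frequency) from an array of tokens.
function toBow(tokens) {
  const bow = {};
  for (const t of tokens) {
    bow[t] = (bow[t] || 0) + 1;
  }
  return bow;
}

// Keep only alphabetic tokens that are not in the stop-word list.
function keepContentWords(tokens) {
  return tokens.filter(
    (t) => /^[a-z]+$/i.test(t) && !STOP_WORDS.has(t.toLowerCase())
  );
}

const tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.'];

const rawBow = toBow(tokens);                     // still counts 'the', 'on' and '.'
const cleanBow = toBow(keepContentWords(tokens)); // only content words remain

console.log(rawBow);
console.log(cleanBow);
```

Without filtering, frequent tokens such as "the" contribute to the BoW and can inflate the similarity between otherwise unrelated documents.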
Require this utility using the following statement:
const similarity = require('wink-nlp/utilities/similarity.js');
The similarity variable exposes the following methods:
Name | Description |
---|---|
similarity.bow.cosine(bowA, bowB) | Measures similarity between the two BoWs using cosine similarity. |
similarity.set.tversky(setA, setB[, alpha, beta]) | Measures similarity between the two sets using the Tversky index. Both alpha and beta default to 0.5, which yields Sørensen-Dice similarity; setting both to 1 yields Jaccard similarity. |
similarity.set.oo(setA, setB) | Measures Otsuka-Ochiai similarity between the two sets; this is equivalent to cosine similarity with a binarized BoW. |
All methods return a value between 0 and 1.
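The relationships between these measures can be checked with the textbook formulas. The sketch below is plain JavaScript written for illustration, not winkNLP's source code; the function names tversky, otsukaOchiai, cosine, and binarize are local to this example, and returning 0 for empty inputs is a convention assumed here. It shows the Tversky index reducing to Jaccard at alpha = beta = 1 and to Sørensen-Dice at alpha = beta = 0.5, and Otsuka-Ochiai matching cosine similarity on a binarized BoW.

```javascript
// Count the elements shared by two Sets.
function intersectionSize(a, b) {
  let n = 0;
  for (const x of a) {
    if (b.has(x)) n += 1;
  }
  return n;
}

// Tversky index: |A∩B| / (|A∩B| + alpha*|A−B| + beta*|B−A|).
function tversky(a, b, alpha = 0.5, beta = 0.5) {
  const both = intersectionSize(a, b);
  const onlyA = a.size - both;
  const onlyB = b.size - both;
  const denom = both + alpha * onlyA + beta * onlyB;
  return denom === 0 ? 0 : both / denom; // empty inputs: 0 by convention in this sketch
}

// Otsuka-Ochiai coefficient: |A∩B| / sqrt(|A| * |B|).
function otsukaOchiai(a, b) {
  if (a.size === 0 || b.size === 0) return 0;
  return intersectionSize(a, b) / Math.sqrt(a.size * b.size);
}

// Cosine similarity between two bags of words (token -> count objects).
function cosine(bowA, bowB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const k of Object.keys(bowA)) {
    normA += bowA[k] * bowA[k];
    if (Object.prototype.hasOwnProperty.call(bowB, k)) dot += bowA[k] * bowB[k];
  }
  for (const k of Object.keys(bowB)) {
    normB += bowB[k] * bowB[k];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Binarize a set into a BoW where every present token has a count of 1.
const binarize = (set) => Object.fromEntries([...set].map((t) => [t, 1]));

const setA = new Set(['cat', 'sat', 'mat']);
const setB = new Set(['cat', 'sat', 'hat', 'bat']);

const jaccard = tversky(setA, setB, 1, 1);  // 2 / (2 + 1 + 2) = 2/5
const dice = tversky(setA, setB, 0.5, 0.5); // 2 / (2 + 0.5 + 1) = 4/7
const oo = otsukaOchiai(setA, setB);        // 2 / sqrt(3 * 4)
const cosBinary = cosine(binarize(setA), binarize(setB));

console.log({ jaccard, dice, oo, cosBinary });
```

Choosing alpha different from beta makes the Tversky index asymmetric, which can be useful when one set is treated as a prototype and the other as a variant.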