BM25Vectorizer()

BM25Vectorizer( configuration ) → { methods }

BM25 is a major improvement over the classical TF-IDF based algorithms. The weight of each term (i.e. token) is computed using the BM25 algorithm. Three parameters control the computation of weights in this algorithm:

k1 controls how quickly TF saturates; lower values lead to faster saturation.
b controls normalization based on document length; setting b = 1 performs full document-length normalization, while b = 0 switches normalization off.
k manages IDF's saturation.
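
To make the role of the three parameters concrete, here is a minimal sketch of a BM25 term weight in plain JavaScript. It assumes a common BM25 formulation in which the IDF is ln((N - n + 0.5) / (n + 0.5) + k), where N is the number of documents and n is the number of documents containing the term. The names `bm25Idf` and `bm25Weight` are hypothetical helpers written for this illustration; they are not part of winkNLP, whose internal implementation may differ in detail.

```javascript
// Hypothetical helpers for illustration; not part of the winkNLP API.

// IDF of a term that occurs in `n` of `N` documents. `k` is added inside the
// log, so larger values flatten the gap between rare and common terms.
function bm25Idf(N, n, k = 1) {
  return Math.log((N - n + 0.5) / (n + 0.5) + k);
}

// BM25 weight of a term occurring `tf` times in a document of `docLength` tokens.
function bm25Weight({ tf, docLength, avgDocLength, N, n }, { k1 = 1.2, b = 0.75, k = 1 } = {}) {
  // b = 0 makes this factor 1 (no length normalization); b = 1 uses the full length ratio.
  const lengthFactor = 1 - b + b * (docLength / avgDocLength);
  // The TF component approaches (k1 + 1) as tf grows; a lower k1 gets there sooner.
  const tfComponent = ((k1 + 1) * tf) / (k1 * lengthFactor + tf);
  return bm25Idf(N, n, k) * tfComponent;
}

// A term found in 1 of 4 documents, occurring once in a 2-token document,
// where the average document length is 2.25 tokens.
const weight = bm25Weight({ tf: 1, docLength: 2, avgDocLength: 2.25, N: 4, n: 1 });
console.log(weight.toFixed(6)); // -> 1.261305
```

With the default parameters, these inputs correspond to the token j in the four-document sample corpus used later on this page, and the result agrees with the weight shown for j in the `its.bow` example.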

The configuration argument is an object that defines k1, b, k and norm. The norm defines the vector norm; the supported norms are none, l1, or l2. The default values of k1, b, k and norm are 1.2, 0.75, 1 and none respectively. The default configuration works well in most situations. You can override any or all default values using the configuration argument.
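
As an illustration of what the l1 and l2 options mean, the sketch below scales a vector to unit l1 or l2 norm. `normalize` is a hypothetical helper written for this example; it is not a winkNLP function, and it makes no claim about where the library applies the norm internally.

```javascript
// Hypothetical helper for illustration; not part of the winkNLP API.
// Scales `vector` so that its l1 norm (sum of absolute values) or its
// l2 norm (Euclidean length) becomes 1; 'none' returns an unchanged copy.
function normalize(vector, norm = 'none') {
  if (norm === 'none') return vector.slice();
  let size;
  if (norm === 'l1') {
    size = vector.reduce((sum, x) => sum + Math.abs(x), 0);
  } else if (norm === 'l2') {
    size = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  } else {
    throw new Error(`Unsupported norm: ${norm}`);
  }
  // An all-zero vector cannot be scaled, so return it as is.
  return size === 0 ? vector.slice() : vector.map((x) => x / size);
}

const raw = [3, 0, 4];
console.log(normalize(raw, 'l1')); // each element divided by 7
console.log(normalize(raw, 'l2')); // -> [ 0.6, 0, 0.8 ]
```

Unit-length vectors are convenient when documents are compared by dot product, because the result is then independent of how long each document is.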

// Require wink-nlp, model and its helper.
const model = require('wink-eng-lite-web-model');
const nlp = require('wink-nlp')(model);
const its = nlp.its;
// Require the BM25 Vectorizer.
const BM25Vectorizer = require('wink-nlp/utilities/bm25-vectorizer');
// Instantiate a vectorizer with the default configuration; omitting the
// config argument selects the default values.
const bm25 = BM25Vectorizer();

The above creates an instance of the BM25 vectorizer that exposes the following APIs:

Method	Purpose
`learn(tokens)`	Learns the BM25 token weights from the input document's tokens. It is called iteratively for every input document. The learning process is automatically marked as completed on the first call to the `.out()` method.
`doc(n)`	Allows access to the nth document after learning is completed.
`out(its.helper)`	Produces a variety of outputs based on the input `its.helper`; some examples are `its.idf` and `its.bow`. It is also available at the `.doc()` level.
`bowOf(tokens, processOOV=false)`	Produces the bag-of-words of input tokens based on the learnings; it ignores OOV (out-of-vocabulary) tokens by default unless the `processOOV` parameter is true. For cosine similarity computation, it is recommended to set this flag to true.
`vectorOf(tokens)`	Produces the vector of input tokens based on the learnings; OOV tokens are ignored.
`config()`	Returns the current configuration.
`loadModel(json)`	Loads a previously saved model `json`. The model JSON for saving can be generated via the `.out( its.modelJSON )` API call. Once a model is successfully loaded, further learning is not permitted.

The example below and the subsequent section on helpers illustrate the API usage in detail.

// Sample corpus.
const corpus = ['Bach', 'J Bach', 'Johann S Bach', 'Johann Sebastian Bach'];
// Train the vectorizer on each document, using its tokens. The tokens are
// extracted using the .out() api of wink NLP.
corpus.forEach((doc) => bm25.learn(nlp.readDoc(doc).tokens().out(its.normal)));

// Returns the vector of the new document, "Johann Bach symphony", which is
// first tokenized using winkNLP.
bm25.vectorOf(nlp.readDoc('Johann Bach symphony').tokens().out(its.normal));
// -> [0.092717254, 0, 0.609969519, 0, 0]

In certain cases, it may be useful to use its.stem or its.lemma instead of its.normal, which is used in the above example.

BM25Vectorizer's `its` helpers

These helpers enable the .out() method of BM25Vectorizer to produce a range of different outputs, as outlined below. Although they are similar to winkNLP helpers, they should not be treated as interchangeable; these apply to BM25Vectorizer only.

winkNLP computes all the weights (or scores), such as tf, bow and idf, as per the BM25 algorithm; they should not be confused with the standard TF-IDF scores.

`its.bow`

Applies to: vectorizer.doc(n).out()

Generates the bag-of-words of the document selected via doc(n):

// Returns the bow of the document at index 1 (0-based), i.e. 'J Bach':
bm25.doc(1).out(its.bow);
// -> {j:1.261304842, bach:0.110377683}
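
A bag-of-words such as the one above is a plain object keyed by term, so two of them can be compared directly. The sketch below computes their cosine similarity; `bowCosine` is a hypothetical helper written for this example, not a winkNLP function, and the second bag-of-words uses made-up weights.

```javascript
// Hypothetical helper for illustration; not part of the winkNLP API.
// Cosine similarity between two bag-of-words objects of the form { term: weight }.
function bowCosine(bowA, bowB) {
  let dot = 0;
  let sumSqA = 0;
  let sumSqB = 0;
  for (const term of Object.keys(bowA)) {
    sumSqA += bowA[term] * bowA[term];
    // Only terms present in both bags contribute to the dot product.
    if (Object.prototype.hasOwnProperty.call(bowB, term)) {
      dot += bowA[term] * bowB[term];
    }
  }
  for (const term of Object.keys(bowB)) {
    sumSqB += bowB[term] * bowB[term];
  }
  // Treat an empty bag as having no similarity to anything.
  if (sumSqA === 0 || sumSqB === 0) return 0;
  return dot / Math.sqrt(sumSqA * sumSqB);
}

const docBow = { j: 1.261304842, bach: 0.110377683 }; // values from the example above
const otherBow = { bach: 0.5, johann: 0.5 };          // made-up weights for illustration
console.log(bowCosine(docBow, docBow));   // 1, up to floating-point rounding
console.log(bowCosine(docBow, otherBow)); // close to 0, since only "bach" is shared
```

Terms that appear in only one bag still add to that bag's length, which lowers the similarity. This is why the weights of every token, including unseen ones, matter when comparing documents.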