BM25Vectorizer()

BM25Vectorizer( configuration ) → { methods }

BM25 is a major improvement over the classical TF-IDF based algorithms. The weight of a specific term (i.e. token) is computed using the BM25 algorithm. Three parameters control the computation of weights in this algorithm:

  1. k1 controls how quickly TF saturates; lower values lead to faster saturation.
  2. b controls normalization based on document length; setting b = 1 will perform full document-length normalisation, while b = 0 will switch normalisation off.
  3. k manages IDF's saturation.

The configuration argument is an object that defines k1, b, k and norm. The norm defines the vector norm; the supported norms are none, l1 and l2. The default values of k1, b, k and norm are 1.2, 0.75, 1 and none respectively. Note that the default configuration usually works well for most situations. You can override any or all of the default values using the configuration argument.

// Require wink-nlp, model and its helper.
const model = require('wink-eng-lite-web-model');
const nlp = require('wink-nlp')(model);
const its = nlp.its;
// Require the BM25 Vectorizer.
const BM25Vectorizer = require('wink-nlp/utilities/bm25-vectorizer');
// Instantiate a vectorizer with the default configuration; omitting the
// config argument means the defaults are used.
const bm25 = BM25Vectorizer();
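
To override any or all of the defaults, a configuration object can be passed instead; the following is a minimal sketch where the chosen parameter values are purely illustrative, not recommendations:

// Instantiate with a custom configuration; these values are illustrative only.
const bm25Custom = BM25Vectorizer({ k1: 1.5, b: 0.9, k: 1, norm: 'l2' });
// The effective configuration can be inspected via .config().
bm25Custom.config();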

The above creates an instance of the BM25 vectorizer that exposes the following APIs:

learn(tokens): Learns the BM25 token weights from the input document's tokens. It is called iteratively for every input document. The learning process is automatically marked as completed on the first call to the .out() method.
doc(n): Allows access to the nth document after learning is completed.
out(its.helper): Produces a variety of outputs based on the input its.helper; some examples are its.idf and its.bow. It is also available at the .doc() level.
bowOf(tokens, processOOV=false): Produces the bag-of-words of the input tokens based on the learnings; OOV tokens are ignored by default unless the processOOV parameter is true. For cosine similarity computation, it is recommended to set this flag to true (see the sketch after the corpus example below).
vectorOf(tokens): Produces the vector of the input tokens based on the learnings; OOV tokens are ignored.
config(): Returns the current configuration.
loadModel(json): Loads a previously saved model JSON. The model JSON for saving can be generated via the .out(its.modelJSON) API call. Once a model is successfully loaded, further learning is not permitted.

The example below and the subsequent section on helpers illustrate the API usage in detail.

// Sample corpus.
const corpus = ['Bach', 'J Bach', 'Johann S Bach', 'Johann Sebastian Bach'];
// Train the vectorizer on each document, using its tokens. The tokens are
// extracted using the .out() API of winkNLP.
corpus.forEach((doc) => bm25.learn(nlp.readDoc(doc).tokens().out(its.normal)));

// Returns the vector of the new document, "Johann Bach symphony", which is
// first tokenized using winkNLP.
bm25.vectorOf(nlp.readDoc('Johann Bach symphony').tokens().out(its.normal));
// -> [0.092717254, 0, 0.609969519, 0, 0]

In certain cases, it may be useful to use its.stem or its.lemma instead of its.normal, which is used in the example above.
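
The bowOf() method can be exercised on the same trained instance to obtain the bag-of-words of new text. The sketch below uses arbitrary sample texts and assumes winkNLP's similarity utility is available at wink-nlp/utilities/similarity.js for the cosine computation:

// Require winkNLP's similarity utility (path assumed as noted above).
const similarity = require('wink-nlp/utilities/similarity.js');
// Produce bag-of-words for two texts; processOOV is set to true so that
// OOV tokens are retained, as recommended for cosine similarity.
const bowA = bm25.bowOf(nlp.readDoc('Johann Bach').tokens().out(its.normal), true);
const bowB = bm25.bowOf(nlp.readDoc('Sebastian Bach').tokens().out(its.normal), true);
// Compute the cosine similarity between the two bag-of-words.
similarity.bow.cosine(bowA, bowB);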

BM25Vectorizer's its helpers

These helpers enable the .out() method of BM25Vectorizer to produce a range of different outputs, as outlined below. While they are similar to the winkNLP helpers, they should not be treated as interchangeable; these apply to BM25Vectorizer only.

winkNLP computes all the weights (or scores), such as tf, bow and idf, as per the BM25 algorithm; these should not be confused with the standard TF-IDF scores.

its.bow

Applies to: vectorizer.doc(n).out()

Helps in generating the bag-of-words model of the document referenced in doc(n):

// Returns the bow of the document at index 1, i.e. 'J Bach':
bm25.doc(1).out(its.bow);
// -> {j:1.261304842, bach:0.110377683}

its.docBOWArray

Applies to: vectorizer.out()

Helps in producing an array containing the bag-of-words model for every document in the corpus:

// Returns an array containing bow of every document in the corpus:
bm25.out(its.docBOWArray);
// -> [
//      {bach: 0.136348903}
//      {j: 1.261304842, bach: 0.110377683}
//      {johann: 0.609969519, s: 1.059496068, bach: 0.092717254}
//      {johann: 0.609969519, sebastian: 1.059496068, bach: 0.092717254}
//    ]

its.docTermMatrix

Applies to: vectorizer.out()

Aids in generating the document term matrix for the corpus:

// Returns a 2-dimensional array, where rows correspond to documents in
// the corpus and columns correspond to terms i.e. the tokens.
bm25.out(its.docTermMatrix);
// -> [
//      [0.136348903, 0, 0, 0, 0]
//      [0.110377683, 1.261304842, 0, 0, 0]
//      [0.092717254, 0, 0.609969519, 1.059496068, 0]
//      [0.092717254, 0, 0.609969519, 0, 1.059496068]
//    ]

See also: its.terms

its.idf

Applies to: vectorizer.out()

Helps in producing inverse document frequency for each token in the corpus:

// Returns an array of token & its idf pairs.
bm25.out(its.idf);
// -> [
//      ["j", 1.203972804]
//      ["s", 1.203972804]
//      ["sebastian", 1.203972804]
//      ["johann", 0.693147181]
//      ["bach", 0.105360516]
//    ]

its.modelJSON

Applies to: vectorizer.out()

Aids in producing the JSON of BM25Vectorizer's model, which can be saved and reused later without relearning from the corpus. A saved model can be loaded using the .loadModel() API.

// Returns the model in JSON format.
bm25.out(its.modelJSON);
// -> <the model's json>
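
The generated JSON can later be loaded into a fresh instance via .loadModel(); the sketch below keeps the model in memory for simplicity, and the variable names are illustrative:

// Save the trained vectorizer's model.
const savedModel = bm25.out(its.modelJSON);
// Load it into a fresh instance; further learning is not permitted thereafter.
const restoredBM25 = BM25Vectorizer();
restoredBM25.loadModel(savedModel);
// The restored instance can vectorize new text as usual.
restoredBM25.vectorOf(nlp.readDoc('Johann Bach').tokens().out(its.normal));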

its.terms

Applies to: vectorizer.out()

Assists in generating an array of all the unique terms in the corpus, sorted alphabetically. Note that the document term matrix contains the weights for each document in the same order in which the terms appear here.

// Returns an array of unique tokens in the corpus, in alphabetical order.
bm25.out(its.terms);
// -> ['bach', 'j', 'johann', 's', 'sebastian']
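
Since the document term matrix follows this ordering, the two outputs can be paired directly; a small sketch pairing each term with its weight in the document at index 2, i.e. 'Johann S Bach':

// Pair each term with its weight in the document at index 2.
const terms = bm25.out(its.terms);
const dtm = bm25.out(its.docTermMatrix);
terms.forEach((t, i) => console.log(t, dtm[2][i]));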

its.tf

Applies to: vectorizer.doc(n).out()

Helps in producing the term frequencies of the document referenced in doc(n), as an array of (token, frequency) pairs.

// Returns an array of token & tf pairs for the document at index 1, i.e. 'J Bach'.
bm25.doc(1).out(its.tf);
// -> [["j", 1.261304842], ["bach", 0.110377683]]

its.vector

Applies to: vectorizer.doc(n).out()

Aids in producing the vector of term frequencies of a document.

// Returns the vector for the document at index 1, i.e. 'J Bach'.
bm25.doc(1).out(its.vector);
// -> [0.110377683, 1.261304842, 0, 0, 0]
