Collection

A collection may be thought of as similar to the JavaScript Array. It contains zero or more items of the same type. There are three item types — token, entity, and sentence.

Convention: A collection name is always plural, i.e. sentences, entities, and tokens, whereas an item name is always singular, i.e. sentence, entity, and token.

You can do the following with a collection:
  • Access a specific item using itemAt()
  • Filter items to form a new collection using filter()
  • Iterate through items using each()
  • Produce output with out()
  • Find its length using length()

By default, the readDoc() method automatically creates three collections — sentences, named entities, and tokens.

While collections are conceptually similar to JavaScript collections such as arrays or sets, their implementation and API are limited to the functions listed above. To get JavaScript data types, use out().
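
To make the array analogy concrete, here is a minimal sketch of a collection-like wrapper around a plain array. ToyCollection is a hypothetical name invented for this illustration; it is not winkNLP's implementation and only mirrors the five methods listed above.

```javascript
// Hypothetical stand-in, NOT winkNLP's implementation: it only mirrors the
// five collection methods listed above on top of a plain array.
class ToyCollection {
  constructor(items) {
    this.items = items.slice(); // private copy; the source array is never mutated
  }

  // 0-based access; returns undefined when the index is out of range.
  itemAt(index) {
    return this.items[index];
  }

  // Returns a new collection holding only the items that pass the test.
  filter(test) {
    return new ToyCollection(this.items.filter((item) => test(item)));
  }

  // Calls fn once per item, from index 0 to the last.
  each(fn) {
    this.items.forEach((item) => fn(item));
  }

  // Converts the collection to a plain JavaScript array.
  out() {
    return this.items.slice();
  }

  // Number of items; a method call, not a property as on arrays.
  length() {
    return this.items.length;
  }
}

const words = new ToyCollection(['The', 'Godfather', 'premiered']);
const longWords = words.filter((w) => w.length > 3);

console.log(words.itemAt(1));  // -> Godfather
console.log(words.itemAt(10)); // -> undefined
console.log(longWords.out());  // -> [ 'Godfather', 'premiered' ]
console.log(words.length());   // -> 3
```

The point of the sketch is the shape of the API: filter() hands back another collection, and only out() yields a regular array.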

Let’s explore the API with the help of the following text that has been transformed into a document:

const text = `The Godfather premiered on March 15, 1972. It was released on March 24, 1972. It is the first installment in The Godfather trilogy. The story of the movie spans from 1945 to 1955. About 90 percent of the film was shot in New York City. The movie was made on a budget of $7.2 million. And it has a running time of 177 minutes.`;

const doc = nlp.readDoc( text );

Access a specific item

An item of a collection is accessed via the itemAt() method:

// Access the sentence at index 3 (the 4th sentence of the document):
// 'The story of the movie spans from 1945 to 1955.'
const sentence3 = doc.sentences().itemAt(3);

// Access the entity at index 2 (the 3rd entity of the document):
// 'first'
const entity2 = doc.entities().itemAt(2);

// Access the token at index 1 (the 2nd token of the document):
// 'Godfather'
const token1 = doc.tokens().itemAt(1);

Like JavaScript arrays, itemAt() uses a 0-based index and returns undefined when the index is outside the valid range.

Filter items

Items can be filtered on the basis of their properties, to form a filtered collection. For example, you can filter all “date” entities, remove stop words, or select only words that have been negated.

Like JavaScript’s array filter method, the filter() method in a collection returns a new collection of items that pass the test provided in the callback function.

Filters provide a quick and easy way to extract information. The following example takes the sentence at index 3 (the 4th sentence of the document) and selects its tokens that are of type ‘word’ and are not stop words:

doc.sentences()
  .itemAt(3) // The story of the movie spans from 1945 to 1955.
  .tokens()
  .filter(
    (t) => t.out(its.type) === 'word' && !t.out(its.stopWordFlag)
   )
  .out();
// Returns:
// [ 'story', 'movie', 'spans' ]

The its helper used above is loaded with the following statement:

const its = require( 'wink-nlp/src/its.js' );

Iterate through items

Similar to the forEach() method of a JavaScript array, a collection has an each() method to iterate over its items. It also has a length() method to get the number of items. Let’s count the number of dates in the example text and then print them out:

doc.entities()
  .filter((e) => e.out(its.type) === 'DATE')
  .length();
// -> 2

doc.entities()
  .each((e) => {
    if (e.out(its.type) === 'DATE') {
      console.log(e.out());
    }
  });
// -> 'March 15, 1972'
// -> 'March 24, 1972'

Note that 'from 1945 to 1955' is not printed because its type is DURATION, not DATE.

The each() method calls the provided callback function once for every entity in the entities collection, starting with the item at index 0 and ending with the last item.

Produce output

The collection.out() method by default returns an array of values of items in the collection:

doc.sentences().out(); // Returns:
[ 'The Godfather premiered on March 15, 1972.',
  'It was released on March 24, 1972.',
  'It is the first installment in The Godfather trilogy.',
  'The story of the movie spans from 1945 to 1955.',
  'About 90 percent of the film was shot in New York City.',
  'The movie was made on a budget of $7.2 million.',
  'And it has a running time of 177 minutes.' ]
doc.entities().out(); // Returns:
[ 'March 15, 1972',
  'March 24, 1972',
  'first',
  'from 1945 to 1955',
  'About 90 percent',
  '$7.2 million',
  '177 minutes' ]
doc.tokens().out(); // Returns:
[ 'The', 'Godfather', 'premiered', ... '177', 'minutes', '.' ]

It is also possible to obtain an array of any property of the items in a collection:

doc.entities().out( its.type ); // Returns:
[ 'DATE',
  'DATE',
  'ORDINAL',
  'DURATION',
  'PERCENT',
  'MONEY',
  'DURATION' ]
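
Because out() returns a plain JavaScript array, ordinary array and loop constructs apply to its result. The sketch below tallies the entity types; the array is copied by hand from the output above rather than computed by winkNLP so that the snippet runs on its own, and the names entityTypes and countsByType are chosen only for this illustration.

```javascript
// Array copied verbatim from the doc.entities().out( its.type ) output above;
// it is hardcoded here so the snippet runs without winkNLP.
const entityTypes = [
  'DATE', 'DATE', 'ORDINAL', 'DURATION', 'PERCENT', 'MONEY', 'DURATION'
];

// Count how many entities there are of each type.
const countsByType = {};
for (const type of entityTypes) {
  countsByType[type] = (countsByType[type] || 0) + 1;
}

console.log(countsByType);
// -> { DATE: 2, ORDINAL: 1, DURATION: 2, PERCENT: 1, MONEY: 1 }
```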
