Collection
A collection may be thought of as similar to the JavaScript Array. It contains zero or more items of the same type. There are three item types — token, entity, and sentence.
- Access a specific item using
itemAt()
- Filter items to form a new collection using
filter()
- Iterate through items using
each()
- Produce output with
out()
- Find its length using
length()
By default, the readDoc()
method automatically creates three collections — sentences, named entities, and tokens.
Let’s explore the API with the help of the following text that has been transformed into a document:
const text = `The Godfather premiered on March 15, 1972. It was released on March 24, 1972. It is the first installment in The Godfather trilogy. The story of the movie spans from 1945 to 1955. About 90 percent of the film was shot in New York City. The movie was made on a budget of $7.2 million. And it has a running time of 177 minutes.`;
const doc = nlp.readDoc( text );
Access a specific item
An item of a collection is accessed via the itemAt()
method:
/// Access 3rd sentence of the document:
// 'The story of the movie spans from 1945 to 1955.'
const sentence3 = doc.sentences().itemAt(3);
// Access 2nd entity of the document:
// 'first'
const entity2 = doc.entities().itemAt(2);
// Access 1st token of the document:
// 'Godfather'
const token1 = doc.tokens().itemAt(1);
itemAt()
uses a 0-based index and returns undefined when a non-existent item (i.e. outside the valid range) is accessed. Filter items
Items can be filtered on the basis of their properties, to form a filtered collection. For example, you can filter all “date” entities, remove stop words, or select only words that have been negated.
Like JavaScript’s array filter method, the filter()
method in a collection returns a new collection of items that pass the test provided in the callback function.
Filters provide a quick and easy way to extract information. The following example selects tokens from the 3rd sentence of the document that are of type ‘word’ and are not stop words:
doc.sentences()
.itemAt(3) // The story of the movie spans from 1945 to 1955.
.tokens()
.filter(
(t) => t.out(its.type) === 'word' && !t.out(its.stopWordFlag)
)
.out();
// Returns:
// [ 'story', 'movie', 'spans' ]
its
is a helper and is required using the following statement:const its = require( 'wink-nlp/src/its.js' );
Iterate through items
Similar to a JavaScript array, collections have a forEach()
method to iterate over its items. It also has a length()
method to get the number of items. Let’s count the number of dates in the example text and then print them out:
doc.entities().length()
// -> 2
doc.entities()
.each((e) => {
if (e.out(its.type) === 'DATE')
console.log(e.out());
} );
// -> 'March 15, 1972'
// -> 'March 24, 1972'
The each()
method calls the provided callback function once for every entity in the entities collection beginning from 0th entity item to the last.
Produce output
The collection.out()
method by default returns an array of values of items in the collection:
doc.sentences().out(); // Returns:
[ 'The Godfather premiered on March 15, 1972.'
'It was released on March 24, 1972.'
'It is the first installment in The Godfather trilogy.'
'The story of the movie spans from 1945 to 1955.'
'About 90 percent of the film was shot in New York City.'
'The movie was made on a budget of $7.2 million.'
'And it has a running time of 177 minutes.' ]
doc.entities().out(); // Returns:
[ 'March 15, 1972',
'March 24, 1972',
'first',
'from 1945 to 1955',
'About 90 percent',
'$7.2 million',
'177 minutes' ]
doc.tokens().out(); // Returns:
[ 'The', 'Godfather', 'premiered', ... '177', 'minutes', '.' ]
It is possible to obtain array of any property of items in a collection:
doc.entities().out( its.type ); // Returns:
[ 'DATE',
'DATE',
'ORDINAL',
'DURATION',
'PERCENT',
'MONEY',
'DURATION' ]