Document

The document offers different views of the same text, depending on what you need. In one context it is a collection of tokens; in another it is a collection of sentences, or of named entities such as times, dates, or URLs. You can access any of these views flexibly. Consider the following text:

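// Load winkNLP and a language model. The lite English web model is
// assumed here; any compatible model can be used instead.
const winkNLP = require('wink-nlp');
const model = require('wink-eng-lite-web-model');
const nlp = winkNLP(model);

const text = `On July 20, 1969, a voice crackled from the speakers. He said simply, "the Eagle has landed." They spent nearly 21 hours on the lunar surface. 20% of the world\'s population watched humans walk on Moon.`;

const doc = nlp.readDoc(text);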

The document has sentences(), entities(), and tokens() methods, each of which returns the corresponding collection:

doc.sentences().out();
// Returns:
// [ 'On July 20, 1969, a voice crackled from the speakers.',
//   'He said simply, "the Eagle has landed."',
//   'They spent nearly 21 hours on the lunar surface.',
//   '20% of the world\'s population watched humans walk on Moon.'
// ]

doc.entities().out();
// Returns:
// [ 'July 20, 1969', 'nearly 21 hours', '20%' ]

doc.tokens().out();
// Returns:
// [ 'On', 'July', '20', ',', ... 'walk', 'on', 'Moon', '.' ]

Each element of a collection is referred to as an item. In other words, a single token, entity, or sentence is an item. An item is accessed via the itemAt(n) method, where n is the index of the item. As with JavaScript arrays, this index is 0-based. For example:

doc.entities().itemAt(1).out();
// Returns:
// 'nearly 21 hours'

Since out() was called on an item, it automatically returned a string instead of an array.

By default, out() returns a string when it is applied to an item and an array of strings when it is applied to a collection.
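
The item-versus-collection behavior of out() can be pictured with a small stand-alone sketch. The Item and Collection classes below are hypothetical toy stand-ins written for illustration only; they are not winkNLP's implementation and only mimic the itemAt() and out() behavior described above.

```javascript
// Toy stand-ins for illustration; NOT winkNLP's actual classes.
class Item {
  constructor(value) {
    this.value = value;
  }

  // An item yields a single string.
  out() {
    return this.value;
  }
}

class Collection {
  constructor(values) {
    this.values = values;
  }

  // 0-based access; indexes outside the collection yield undefined.
  itemAt(n) {
    const inRange = Number.isInteger(n) && n >= 0 && n < this.values.length;
    return inRange ? new Item(this.values[n]) : undefined;
  }

  // A collection yields an array of strings (a copy, so callers cannot
  // mutate the collection's internal state).
  out() {
    return this.values.slice();
  }
}

const entities = new Collection(['July 20, 1969', 'nearly 21 hours', '20%']);

console.log(Array.isArray(entities.out()));  // prints: true
console.log(entities.itemAt(1).out());       // prints: nearly 21 hours
```

The same two-method shape is what makes chained calls read uniformly: whatever level you stop at, out() gives either one string or an array of strings.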

Next, let’s look at what a single sentence or entity contains:

doc.sentences().itemAt(0).entities().out();
// Returns:
// [ 'July 20, 1969' ]

doc.sentences()    // Collection of all sentences.
   .itemAt(0)      // Its 0th sentence.
   .entities()     // Collection of entities in sentence #0.
   .itemAt(0)      // Its 0th entity.
   .tokens()       // Collection of tokens in entity #0.
   .out();         // Array of tokens in the 0th entity of
                   // the 0th sentence of the document!
// Returns:
// [ 'July', '20', ',', '1969' ]

An attempt to access a non-existent item using itemAt() returns undefined:

doc.sentences().itemAt(-1);
// Returns:
// undefined
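
Because itemAt() can return undefined, calling out() directly on its result will throw when the index is out of range. A minimal sketch of a guard follows. The outAt helper and the stubEntities object are hypothetical names introduced here; the stub merely stands in for doc.entities() so the example runs without winkNLP installed.

```javascript
// Hypothetical helper: returns the item's out() value, or a fallback
// when itemAt(n) yields undefined.
function outAt(collection, n, fallback = null) {
  const item = collection.itemAt(n);
  return item === undefined ? fallback : item.out();
}

// Stub standing in for doc.entities(); it relies only on the itemAt()
// and out() behavior described in the text.
const values = ['July 20, 1969', 'nearly 21 hours', '20%'];
const stubEntities = {
  itemAt: (n) =>
    Number.isInteger(n) && n >= 0 && n < values.length
      ? { out: () => values[n] }
      : undefined,
};

console.log(outAt(stubEntities, 1));             // prints: nearly 21 hours
console.log(outAt(stubEntities, -1, '(none)'));  // prints: (none)
```

With a real document, the same helper could presumably be called as outAt(doc.entities(), 1), since it uses nothing beyond itemAt() and out().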

The document also provides the pipeConfig() method, which returns the currently active processing pipeline based on the loaded language model.

In essence, a document is composed of collections of sentences, named entities, and tokens. Collections and items along with their methods are explained in the next section.

