Document
The document dynamically provides different views of the same text, depending on your context. In one context it can be viewed as a collection of tokens; in another, as a collection of sentences or of named entities such as times, dates, or URLs. It lets you access each of these views in a flexible manner. Consider the following text:
const text = `On July 20, 1969, a voice crackled from the speakers. He said simply, "the Eagle has landed." They spent nearly 21 hours on the lunar surface. 20% of the world\'s population watched humans walk on Moon.`;
const doc = nlp.readDoc(text);
The document has sentences(), entities(), and tokens() methods to obtain the corresponding collections:
doc.sentences().out();
// Returns:
// [ 'On July 20, 1969, a voice crackled from the speakers.',
// 'He said simply, "the Eagle has landed."',
// 'They spent nearly 21 hours on the lunar surface.',
// '20% of the world\'s population watched humans walk on Moon.'
// ]
doc.entities().out();
// Returns:
// [ 'July 20, 1969', 'nearly 21 hours', '20%' ]
doc.tokens().out();
// Returns:
// [ 'On', 'July', '20', ',', ... 'walk', 'on', 'Moon', '.' ]
Each element of a collection is referred to as an item. In other words, a single token, entity, or sentence is an item. An item is accessed via the itemAt(n) method, where n is the index of the item. As with JavaScript arrays, this index is 0-based. For example:
doc.entities().itemAt(1).out();
// Returns:
// 'nearly 21 hours'
Since out() was called on an item, it returned a string instead of an array: out() returns a string when it is applied to an item and an array of strings when it is applied to a collection. Next, let’s look at what a single sentence or entity contains:
doc.sentences().itemAt(0).entities().out();
// Returns:
// [ 'July 20, 1969' ]
doc.sentences() // Collection of all sentences.
.itemAt(0) // Its 0th sentence.
.entities() // Collection of entities in sentence #0.
.itemAt(0) // Its 0th entity.
.tokens() // Collection of tokens in entity #0.
.out(); // Array of tokens in 0th entity of
// 0th sentence of the document!
// Returns:
// [ 'July', '20', ',', '1969' ]
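The item/collection contract used in the examples above can be illustrated with a small stand-alone sketch. The ToyItem and ToyCollection classes below are hypothetical stand-ins written only for illustration; they are not winkNLP's implementation and assume nothing beyond the behavior described in this section: itemAt(n) is 0-based, and out() returns a string on an item and an array of strings on a collection.

```javascript
// Toy stand-ins, NOT winkNLP's implementation: they only mimic the
// item/collection contract described above.
class ToyItem {
  constructor(text) {
    this.text = text;
  }

  // On an item, out() returns a single string.
  out() {
    return this.text;
  }
}

class ToyCollection {
  constructor(texts) {
    this.items = texts.map((t) => new ToyItem(t));
  }

  // 0-based access to a single item.
  itemAt(n) {
    return this.items[n];
  }

  // On a collection, out() returns an array of strings.
  out() {
    return this.items.map((item) => item.out());
  }
}

const toyEntities = new ToyCollection(['July 20, 1969', 'nearly 21 hours', '20%']);

console.log(toyEntities.out());           // array of three strings
console.log(toyEntities.itemAt(1).out()); // 'nearly 21 hours'
```

The same method name behaves differently depending on what it is called on, which is why chained calls such as doc.entities().itemAt(1).out() read the same at every level of nesting.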
An attempt to access a non-existent item using itemAt() returns undefined:
doc.sentences().itemAt(-1);
// Returns:
// undefined
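Because itemAt() returns undefined for a non-existent item, calling out() directly on that result would throw a TypeError. One way to guard a chain is JavaScript's optional chaining, sketched below. The itemTextOr helper is hypothetical and not part of winkNLP, and stubEntities is a stand-in object that only mimics the itemAt()/out() shape described in this section, so that the snippet runs without the library.

```javascript
// Hypothetical helper (not part of winkNLP): returns the text of the n-th
// item of a collection, or `fallback` when itemAt(n) yields undefined.
function itemTextOr(collection, n, fallback = '') {
  // Optional chaining stops at undefined instead of throwing a TypeError;
  // the nullish coalescing operator then supplies the fallback.
  return collection.itemAt(n)?.out() ?? fallback;
}

// Stand-in collection so the snippet runs without the library; it only
// mimics the itemAt()/out() shape described in this section.
const stubEntities = {
  texts: ['July 20, 1969', 'nearly 21 hours', '20%'],
  itemAt(n) {
    const text = this.texts[n];
    return text === undefined ? undefined : { out: () => text };
  },
};

console.log(itemTextOr(stubEntities, 1));          // 'nearly 21 hours'
console.log(itemTextOr(stubEntities, -1, 'none')); // 'none'
```

Assuming a real collection behaves as described above, the same helper could be called as itemTextOr(doc.entities(), 5, 'none').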
The document also has a pipeConfig() method, which returns the currently active processing pipeline based on the loaded language model.
In essence, a document is composed of collections of sentences, named entities, and tokens. Collections and items, along with their methods, are explained in the next section.