Leveraging out()

The out() method produces appropriate JavaScript built-in data types depending on the usage. It is available at all levels: document, collection, and item. By default, i.e. without any input parameter, out() returns a string when applied to an item and an array of strings when applied to a collection. The behaviour of doc.out() is similar to that of item.out(), as shown below:

const text = `Its quarterly profits jumped 76% to $1.13 billion for the three months to December, from $639million of previous year.`;
const doc = nlp.readDoc( text );

doc.out() reproduces the original text:

doc.out()
// -> Its quarterly profits jumped 76% to
//    $1.13 billion for the three months to
//    December, from $639million of previous year.

The out() method has two optional arguments — its.propertyName and as.reducedValue. These optional arguments are useful in information extraction:

  1. A token, entity, sentence or document has several contextual properties that are accessible via its.propertyName such as its.stopWordFlag, its.shape and its.vector.
  2. A collection of tokens or entities can be reduced via as.reducedValue to a form such as as.freqTable, as.bow (bag of words), or as.bigrams.

Convention: The placeholder part of a code fragment, such as propertyName or reducedValue, needs to be substituted with an actual value according to the requirement. For example, as observed above, the propertyName in its.propertyName can have a value such as stopWordFlag, shape or vector.

item.out()

While working with an item, any of its properties can be extracted by passing the its.propertyName parameter to the item.out() method. For example, doc.tokens().itemAt(0).out(its.shape) would return Xxx, the shape of the zeroth token, "Its". Similarly, doc.tokens().itemAt(0).out(its.case) would return titleCase.

its is a helper and is required using the following statement:
const its = require( 'wink-nlp/src/its.js' );

Each item type has several properties, including a few that are common across all types. The most prominent one is its.value, the default for the out() method. Another important common property, applicable to Latin-script languages such as English or French, is its.normal. It is useful for obtaining the lower-cased value. It also has a language-specific flavour: in English, for example, apart from lower-casing the token, it also maps British spellings to their American equivalents, if any.
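
As a rough illustration of what its.normal describes, the sketch below lower-cases a token and then looks it up in a small spelling map. The function normalize and the three-entry britishToAmerican map are hypothetical stand-ins written for this example; they are not winkNLP's implementation, whose actual mapping comes from the language model.

```javascript
// Hypothetical, tiny British-to-American map; for illustration only.
const britishToAmerican = new Map([
  ['colour', 'color'],
  ['organise', 'organize'],
  ['centre', 'center']
]);

// Lower-case the token, then swap in the American spelling if one is listed.
function normalize(token) {
  const lower = token.toLowerCase();
  return britishToAmerican.get(lower) || lower;
}

console.log(['Colour', 'Profits', 'CENTRE'].map(normalize));
// -> [ 'color', 'profits', 'center' ]
```

Words that are missing from the map simply come back lower-cased, which mirrors the "if any" qualifier in the description above.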

A comprehensive list of properties is available in the reference section titled “its helper”. A select few are outlined below:

Type Properties
Token
  • its.normal — lower-cased value of token; applies to Latin-script languages such as English or French. In English, it also maps a British spelling to its American equivalent, if any.
  • its.pos — part-of-speech.
  • its.stopWordFlag — true if the token is a stop word.
  • its.type — type of token determined during tokenization e.g. word, number, punctuation or symbol.
  • its.vector — the word vector of the token.
Entity
  • its.type — type of entity determined during named entity recognition e.g. DATE, DURATION, or MONEY.
  • its.span — span of entity in terms of the indexes of the first and the last tokens of the entity.
Sentence
  • its.span — span of sentence in terms of the indexes of the first and the last tokens of the sentence.
  • its.negationFlag — true if the sentence has any negation; for example, “I didn’t like it.” would have negation.
  • its.markedUpText — marked up text of the sentence, which has already been marked up using item.markup() API method. Useful in text visualization and highlighting.
  • its.sentiment - sentiment score of the sentence. Its value is between -1 and +1.
Document
  • its.sentiment — sentiment score of the document. Its value is between -1 and +1.
  • its.markedUpText — marked up text of the document, which has already been marked up using item.markup() API method. Useful in text visualization and highlighting.
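
The span properties listed above can be read as a pair of token indexes. The sketch below assumes the pair is [first, last] with the last index inclusive, as the descriptions suggest. The tokens array, the span value and the helper tokensInSpan are hand-written for illustration and were not produced by winkNLP.

```javascript
// Hand-written tokens for the opening words of the example text.
const tokens = [
  'Its', 'quarterly', 'profits', 'jumped', '76', '%', 'to',
  '$', '1.13', 'billion'
];

// Assumed span of the entity "$1.13 billion": [first index, last index].
const span = [7, 9];

// The last index is inclusive, so slice up to last + 1.
function tokensInSpan(allTokens, [first, last]) {
  return allTokens.slice(first, last + 1);
}

console.log(tokensInSpan(tokens, span));
// -> [ '$', '1.13', 'billion' ]
```
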
The item.out() method automatically falls back to the default, i.e. its.value, whenever the input parameter is invalid or the property does not apply to the item in question. For example, doc.out(its.case) would return the same as doc.out().
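
The fallback rule can be pictured with a small stand-in written in plain JavaScript. Here itemOut and the two plain objects are hypothetical; they only imitate the behaviour described above and say nothing about how winkNLP stores items internally.

```javascript
// Hypothetical stand-ins for items: plain objects holding only the
// properties that apply to them.
const tokenItem = { value: 'Its', shape: 'Xxx', case: 'titleCase' };
const docItem = { value: 'Its quarterly profits jumped 76%.' };

// Return the requested property when the item has it; otherwise
// fall back to the item's value.
function itemOut(item, propertyName = 'value') {
  return Object.prototype.hasOwnProperty.call(item, propertyName)
    ? item[propertyName]
    : item.value;
}

console.log(itemOut(tokenItem, 'case')); // -> titleCase
console.log(itemOut(docItem, 'case'));   // -> Its quarterly profits jumped 76%.
console.log(itemOut(docItem));           // -> Its quarterly profits jumped 76%.
```

The last two calls return the same string, which is the point of the fallback: a property that does not apply to the item never causes an error.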

Extracting item properties with out() is useful in a variety of NLP tasks such as text pre-processing and information extraction. For example, extracting the nouns from a text gives a rough sense of its context:

doc.tokens()
  .filter(
    // Exclude nouns inside an entity
    (t) => !t.parentEntity() && t.out(its.pos) === 'NOUN'
  )
  .out();
// -> [ 'profits' ]

Consider another example: text classification or intent detection. These tasks sometimes require replacing entity values with their types. Such replacements help when an entity’s individual value matters less semantically than its type. They are typically applied in addition to punctuation and stop word removal. Here is an example that illustrates how all of this can be achieved using the out() method:

const processedTokens = [];
const detectedEntities = new Set();
doc.tokens().each( (t) => {
  const pe = t.parentEntity();
  if (pe && !detectedEntities.has(pe.index())) {
    // First token seen for this entity: emit the entity's type once.
    detectedEntities.add(pe.index());
    processedTokens.push('#' + pe.out(its.type));
  } else if (!pe && !t.out(its.stopWordFlag) && t.out(its.type) === 'word') {
    // A word outside any entity that is not a stop word: keep its normal form.
    processedTokens.push(t.out(its.normal));
  }
});
console.log( processedTokens );
// -> [ 'quarterly', 'profits', 'jumped', '#PERCENT', '#MONEY',
//      '#DURATION', '#DATE', '#MONEY', '#DATE']

collection.out()

By default, the collection.out() method produces an array of strings, where the collection can consist of sentences, entities, customEntities or tokens. For example:

// Each string in the array is an entity.
doc.entities().out()
// -> ['76%', '$1.13 billion', 'three months', 'December',
//      '$639million', 'previous year']

The its.propertyName parameter in this case acts like a mapper:

doc.entities().itemAt(1).tokens().out(its.type);
// -> [ 'currency', 'number', 'word' ]
doc.entities().itemAt(1).tokens().out(its.shape);
// -> [ '$', 'd.dd', 'xxxx' ]

Note that its.shape truncates any run of more than four consecutive identical shape characters to four, which is why the shape of “billion” is “xxxx” and not “xxxxxxx”.
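
A minimal sketch of this trimming rule in plain JavaScript follows. The functions charShape and shape are hypothetical and written only for this illustration. They assume that upper-case letters map to X, lower-case letters to x, digits to d, and every other character to itself. That assumption matches the outputs shown above but may not cover everything winkNLP does.

```javascript
// Map one character to its shape symbol (assumed rules, see lead-in).
function charShape(ch) {
  if (/[A-Z]/.test(ch)) return 'X';
  if (/[a-z]/.test(ch)) return 'x';
  if (/[0-9]/.test(ch)) return 'd';
  return ch;
}

// Build the full shape, then cut every run of five or more identical
// symbols down to exactly four.
function shape(text) {
  const full = Array.from(text, charShape).join('');
  return full.replace(/(.)\1{4,}/g, '$1$1$1$1');
}

console.log(['$', '1.13', 'billion', 'Its'].map(shape));
// -> [ '$', 'd.dd', 'xxxx', 'Xxx' ]
```

Runs of four or fewer symbols are left alone, so short tokens such as "Its" keep their complete shape.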

The collection.out() method also accepts a second parameter — as.reducedValue. Here “as” is another helper like “its”.

as is a helper and is required using the following statement:
const as = require( 'wink-nlp/src/as.js' );

The as.reducedValue parameter acts like a reducer and defaults to as.array. Other “as” options include as.bow (bag of words) and as.bigrams. These reducers simplify a number of common NLP tasks. Here is an example of bag-of-words creation:

const poem = `Rain, rain, go away
Come again another day!`;
// A separate name avoids redeclaring the earlier `doc` constant.
const poemDoc = nlp.readDoc( poem );
poemDoc.tokens()
  .filter(
    (t) => !t.out(its.stopWordFlag) && t.out(its.type) === 'word'
  )
  .out(its.normal, as.bow);
// -> { rain: 2, away: 1, come: 1, day: 1 }
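
The mapper and reducer roles can be imitated in plain JavaScript to make the data flow explicit. Everything in the sketch below (collectionOut, the reducers object and the plain word objects) is a hypothetical stand-in. It is not winkNLP's implementation, and the pair-of-strings format used for bigrams is an assumption.

```javascript
// Hypothetical reducers, named after the helpers they imitate.
const reducers = {
  // Keep the mapped values as they are.
  array: (values) => values,
  // Count how often each value occurs.
  bow: (values) => values.reduce((bag, v) => {
    bag[v] = (bag[v] || 0) + 1;
    return bag;
  }, {}),
  // Pair each value with its successor.
  bigrams: (values) => values.slice(0, -1).map((v, i) => [v, values[i + 1]])
};

// Map every item first, then reduce the mapped values.
function collectionOut(items, mapper = (item) => item.value, reducer = reducers.array) {
  return reducer(items.map(mapper));
}

const words = ['Rain', 'rain', 'away', 'Come', 'day'].map((value) => ({ value }));
const normal = (item) => item.value.toLowerCase();

console.log(collectionOut(words));
// -> [ 'Rain', 'rain', 'away', 'Come', 'day' ]
console.log(collectionOut(words, normal, reducers.bow));
// -> { rain: 2, away: 1, come: 1, day: 1 }
console.log(collectionOut(words, normal, reducers.bigrams));
// -> [ ['rain', 'rain'], ['rain', 'away'], ['away', 'come'], ['come', 'day'] ]
```

Keeping the mapper and the reducer as separate arguments means any property can be combined with any reduction, which is the design the two parameters of out() express.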

The out() method plays an important role in winkNLP applications. Here is its summary:

  1. The item.out() method accepts its.propertyName as a parameter. Its default value is its.value, which is also the fallback when a contextually invalid value is passed.

  2. The doc.out() method behaves like item.out().

  3. The collection.out() method has two parameters — its.propertyName and as.reducedValue — think of them as a mapper and reducer respectively. Their default values are its.value and as.array.

