Introduction

wink-regression-tree

Decision Tree to predict the value of a continuous target variable


Predict the value of a continuous variable such as price, turnaround time, or mileage using wink-regression-tree. It is a part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in Node.js.

Installation

Use npm to install:

npm install wink-regression-tree --save

Getting Started

Here is an example of predicting a car's mileage (miles per gallon, mpg) from attributes like displacement, horsepower, acceleration, country of origin, and a few more. A sample data row is given for quick reference:

Model | MPG | Cylinders | Displacement | Power | Weight | Acceleration | Year | Origin
Toyota Mark II | 20 | 6 | large displacement | high power | high weight | slow | 73 | Japan

The code below provides a potential configuration to predict the value of miles per gallon:

// Load wink-regression-tree.
var regressionTree = require( 'wink-regression-tree' );

// Load cars training data set.
// In practice an async mechanism may be used to
// read data asynchronously and call `ingest()` on
// every row of data read.
var cars = require( 'wink-regression-tree/sample-data/cars.json' );

// Create sample data to test prediction for a
// Ford Gran Torino with an actual mpg of 14.5, very
// large displacement, extremely high power, very
// high weight, slow acceleration, and US origin.
var input = {
  model: 'Ford Gran Torino',
  weight: 'very high weight',
  displacement: 'very large displacement',
  horsepower: 'extremely high power',
  origin: 'US',
  acceleration: 'slow'
};
// The above record is not part of the training data.

// Create an instance of the regression tree.
var rt = regressionTree();

// Specify columns of the training data.
var columns = [
  { name: 'model', categorical: true, exclude: true },
  { name: 'mpg', categorical: false, target: true },
  { name: 'cylinders', categorical: true, exclude: false },
  { name: 'displacement', categorical: true, exclude: false },
  { name: 'horsepower', categorical: true, exclude: false },
  { name: 'weight', categorical: true, exclude: false },
  { name: 'acceleration', categorical: true, exclude: false },
  { name: 'year', categorical: true, exclude: true },
  { name: 'origin', categorical: true, exclude: false  }
];
// Specify configuration for learning.
var treeParams = {
  minPercentVarianceReduction: 0.5,
  minLeafNodeItems: 10,
  minSplitCandidateItems: 30,
  minAvgChildrenItems: 2
};
// Define the regression tree configuration using
// `columns` and `treeParams`.
rt.defineConfig( columns, treeParams );

// Ingest the data.
cars.forEach( function ( row ) {
  rt.ingest( row );
} );

// Data ingested! Now time to learn from data!!
console.log( rt.learn() );
// -> 16 (Number of Rules Learned)

// Predict the **mean** value.
var mean = rt.predict( input );
console.log( +mean.toFixed( 1 ) );
// -> 14.3 ( compare with actual mpg of 14.5 )

// In practice one may like to compute a range
// or upper limit using the `modifier` function
// during prediction. Note `size`, `mean`, and `stdev`
// values, passed to this function, can be used
// for computing the range or the upper limit.

Try experimenting with this example on Runkit in the browser.

Documentation

For detailed API docs, check out http://winkjs.org/wink-regression-tree/

Need Help?

If you spot a bug that has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-regression-tree is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

How to create an Instance

regressionTree

Creates an instance of wink-regression-tree.

regressionTree(): methods
Returns
methods: object containing the set of API methods for tasks such as configuration, data ingestion, learning, and prediction.
Example
// Load wink regression tree.
var regressionTree = require( 'wink-regression-tree' );
// Create your instance of regression tree.
var myRT = regressionTree();

API Methods

defineConfig

Defines the configuration required to read the input data and to generate the regression tree.

defineConfig(inputDataCols: Array<object>, tree: object): number
Parameters
inputDataCols (Array<object>) — each object in this array defines a column of input data in the same sequence in which data will be supplied to ingest(). It is defined in terms of the following details:
Name | Description
inputDataCols[].name | string: name of the column.
inputDataCols[].categorical | boolean: defines the column's data type, true for categorical or false for numeric. Numeric input columns are currently not supported; only the target column is numeric.
inputDataCols[].exclude | boolean (default false): used to exclude a column during tree building.
inputDataCols[].target | boolean (default false): set to true only for the target column, whose value needs to be predicted. Note that this column must be numeric.
tree (object) — contains key value pairs of the following regression tree's parameters:
Name | Description
tree.maxDepth | number (default 20): the maximum depth of the tree, after which learning stops.
tree.minPercentVarianceReduction | number (default 10): the minimum percentage variance reduction required for a split to occur.
tree.minSplitCandidateItems | number (default 50): the minimum number of items that must be present at a node for it to be split further, even after the minPercentVarianceReduction target has been achieved.
tree.minLeafNodeItems | number (default 10): the minimum number of items that must be present at a leaf node for it to be retained as an independent node. Nodes with fewer items than this are merged together.
tree.minAvgChildrenItems | number (default 2): the average number of items across children must be greater than this number for a column to become a candidate for a split. A higher number discourages splits that create many branches, each containing few items.
Returns
number: number of columns defined.
Example
// Define each column.
var columns = [
  { name: 'model', categorical: true, exclude: true },
  { name: 'mpg', categorical: false, target: true },
  { name: 'cylinders', categorical: true },
  { name: 'displacement', categorical: true, exclude: false },
  { name: 'horsepower', categorical: true, exclude: false },
  { name: 'weight', categorical: true, exclude: false },
  { name: 'acceleration', categorical: true, exclude: false },
  { name: 'year', categorical: true, exclude: true },
  { name: 'origin', categorical: true, exclude: false  }
];
// Define parameters to grow the tree.
var treeParams = {
  minPercentVarianceReduction: 2.5,
  minLeafNodeItems: 10,
  minSplitCandidateItems: 30,
  minAvgChildrenItems: 3
};
// Define the configuration using above 2 variables.
myRT.defineConfig( columns, treeParams );
// -> 8
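
To build intuition for minPercentVarianceReduction, the sketch below computes the percentage variance reduction of one candidate split using a common textbook definition: parent variance minus the size-weighted variance of the children, relative to the parent variance. The helper names and the mpg values are made up for illustration, and the library's internal formula is not documented here, so treat this as an approximation of the idea, not the package's implementation.

```javascript
// Population variance of an array of numbers.
function variance( values ) {
  var n = values.length;
  var mean = values.reduce( function ( sum, v ) { return sum + v; }, 0 ) / n;
  return values.reduce( function ( sum, v ) {
    return sum + ( v - mean ) * ( v - mean );
  }, 0 ) / n;
}

// Percentage variance reduction achieved by splitting `parent`
// into `children` (an array of arrays that partition `parent`).
// Hypothetical helper; not part of wink-regression-tree.
function percentVarianceReduction( parent, children ) {
  var parentVar = variance( parent );
  if ( parentVar === 0 ) return 0;
  var weightedChildVar = children.reduce( function ( sum, child ) {
    return sum + ( child.length / parent.length ) * variance( child );
  }, 0 );
  return 100 * ( parentVar - weightedChildVar ) / parentVar;
}

// Made-up mpg values, split by an imaginary two-valued column.
var parent = [ 10, 12, 30, 32 ];
var children = [ [ 10, 12 ], [ 30, 32 ] ];
console.log( percentVarianceReduction( parent, children ) );
// -> about 99.01
```

Under this definition, a split like the one above would comfortably clear a threshold of 2.5, whereas a split whose children look just like the parent would score near 0 and be rejected.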

ingest

Ingests one row of the data at a time. It is especially useful for reading data in an asynchronous manner, where it may be used as a callback function on every row read event.

ingest(row: array): boolean
Parameters
row (array): one row of the data to be ingested; column values should be in the same sequence in which they are defined in the data configuration via defineConfig().
Returns
boolean: always true.
Throws
  • error: if the number of elements in row doesn't match the number of columns defined.
Example
// Load cars training data set.
var cars = require( 'wink-regression-tree/sample-data/cars.json' );
// Ingest the data.
cars.forEach( function ( row ) {
  myRT.ingest( row );
} );
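
When rows arrive one at a time, for example from a line-by-line file reader, a small adapter can turn each line into a row array before handing it to ingest(). The sketch below assumes a simple comma-separated format with no quoted fields; makeLineHandler and the stand-in ingest function are hypothetical names used so the snippet runs without the library.

```javascript
// Column names in the order they would be passed to defineConfig().
var columnNames = [ 'model', 'mpg', 'cylinders', 'displacement',
  'horsepower', 'weight', 'acceleration', 'year', 'origin' ];

// Returns a callback that converts one comma-separated line into a
// row array and passes it to `ingest`. Hypothetical helper.
function makeLineHandler( expectedColumns, targetIndex, ingest ) {
  return function ( line ) {
    var row = line.split( ',' ).map( function ( cell ) {
      return cell.trim();
    } );
    // Fail early, before the row reaches the tree.
    if ( row.length !== expectedColumns ) {
      throw new Error( 'Expected ' + expectedColumns + ' values, got ' + row.length );
    }
    // The target column must be numeric.
    row[ targetIndex ] = Number( row[ targetIndex ] );
    return ingest( row );
  };
}

// Stand-in for myRT.ingest, so this sketch has no dependencies.
var collected = [];
var onLine = makeLineHandler( columnNames.length, 1, function ( row ) {
  collected.push( row );
  return true;
} );

onLine( 'Toyota Mark II, 20, 6, large displacement, high power, high weight, slow, 73, Japan' );
console.log( collected[ 0 ][ 1 ] );
// -> 20
```

With a real instance, the stand-in would be replaced by a function that calls myRT.ingest( row ), and onLine could be attached to the 'line' event of Node's readline interface.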

learn

Learns from the ingested data and generates the rule tree that is used to predict() the value of target variable from the input. It requires at least 60 data rows to initiate meaningful learning.

learn(): number
Returns
number: number of rules learned from the input data.
Throws
  • error: if the number of rows in the ingested data is less than 60.
Example
myRT.learn();
// -> Number of rules learned

predict

Predicts the value of the target variable from the input using the rule tree generated by learn(). If the value of a column required for the prediction is missing from the input, it throws an error by default. If the modifier function is defined, no error is thrown; instead, the name of the missing column is passed to this function, which is expected to handle it.

predict(input: object, modifier: function): number
Parameters
input (object): data containing column name/value pairs; the column names must be the same as defined via defineConfig().
modifier (function = undefined): called once a leaf node is reached during prediction, with the following 5 parameters: the size, mean, and stdev values at the node; an array of column names navigated to reach the leaf; and the name of the column whose value is missing in the input (default undefined). The value returned from this function becomes the prediction.
Returns
number: mean value or whatever is returned by the modifier function, if defined.
Throws
  • error: if the input is not a JavaScript object.
  • error: if the value of a column required for prediction is missing in input, provided modifier has not been defined.
Example
// Populate sample input
var input = {
  model: 'Ford Gran Torino',
  weight: 'very high weight',
  displacement: 'very large displacement',
  horsepower: 'extremely high power',
  origin: 'US',
  acceleration: 'slow'
};
// Attempt prediction.
myRT.predict( input );
// -> 14.3
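
A modifier can return something richer than the mean, such as a range. The sketch below is one possible modifier that follows the five-parameter order described above and returns a band of two standard deviations around the leaf's mean. The name rangeModifier, the choice of two standard deviations, and the leaf statistics in the direct call are all illustrative assumptions.

```javascript
// Hypothetical modifier: returns a range around the leaf's mean
// instead of the mean alone.
function rangeModifier( size, mean, stdev, columns, missingColumn ) {
  return {
    lower: mean - 2 * stdev,
    upper: mean + 2 * stdev,
    size: size,              // number of items at the leaf
    path: columns,           // columns navigated to reach the leaf
    missing: missingColumn   // undefined when nothing was missing
  };
}

// Direct call with made-up leaf statistics, to show the shape of
// the result without needing a trained tree.
var range = rangeModifier( 25, 14.5, 1.5, [ 'weight', 'displacement' ], undefined );
console.log( range.lower, range.upper );
// -> 11.5 17.5
```

With a trained instance, this function would be passed as the second argument, as in myRT.predict( input, rangeModifier ), and its return value would become the prediction.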

summary

Generates summary of the learnings in terms of the following:

  1. Relative importance of columns along with the corresponding min/max variance reductions (VR).
  2. The min/max mean values along with the corresponding standard deviations (SD).
  3. The minimum standard deviation (SD) discovered during the learning.
summary(): object
Returns
object: containing the following:
  1. table: array of objects, where each object defines level, columnHierarchy, nodesSplit, minVR and maxVR. A lower value of level indicates higher importance; similarly, more nodes split on a columnHierarchy at a level indicates higher importance. Therefore, it is sorted in ascending order of level and then in descending order of nodesSplit.
  2. stats: object containing min.mean, min.itsSD, max.mean, max.itsSD, and minSD.
Example
myRT.summary();
// -> returns the summary object.
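
The ordering of table described above (ascending level, then descending nodesSplit) can be expressed as a comparator. The sketch below applies it to a few made-up entries that carry only the fields needed for ordering; the columnHierarchy strings are an assumed representation, and byImportance is a hypothetical name.

```javascript
// Comparator for the documented ordering: lower level first,
// and within a level, more nodes split first.
function byImportance( a, b ) {
  return ( a.level - b.level ) || ( b.nodesSplit - a.nodesSplit );
}

// Made-up entries; real ones also carry minVR and maxVR.
var table = [
  { level: 2, columnHierarchy: 'weight/origin', nodesSplit: 1 },
  { level: 1, columnHierarchy: 'weight', nodesSplit: 1 },
  { level: 2, columnHierarchy: 'weight/horsepower', nodesSplit: 3 }
];

// Sort a copy so the original array is left untouched.
var ordered = table.slice().sort( byImportance );
ordered.forEach( function ( entry ) {
  console.log( entry.level, entry.columnHierarchy, entry.nodesSplit );
} );
```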

evaluate

Incrementally evaluates variance reduction for one data row at a time.

evaluate(rowObject: object): boolean
Parameters
rowObject (object) — contains column name/value pairs including the target column name/value pair as well, which is used in evaluating the variance reduction.
Returns
boolean: always true.
Example
myRT.evaluate( input );

metrics

Computes the variance reduction observed in the validation data passed to evaluate().

metrics(): object
Returns
object: containing the varianceReduction in percentage and the data size.
Example
myRT.metrics();
// -> object containing varianceReduction and data size.
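
How the library accumulates its statistics row by row is not described here. As a general illustration of the incremental idea behind evaluate() and metrics(), the sketch below uses Welford's algorithm, a well-known way to maintain a running mean and variance without storing the rows. The name createRunningStats and the sample values are assumptions made for this example.

```javascript
// Running mean and population variance via Welford's algorithm.
// Hypothetical helper; not the library's internal code.
function createRunningStats() {
  var n = 0;
  var mean = 0;
  var m2 = 0; // sum of squared deviations from the running mean
  return {
    // Fold one value into the running statistics.
    push: function ( x ) {
      n += 1;
      var delta = x - mean;
      mean += delta / n;
      m2 += delta * ( x - mean );
      return true;
    },
    // Report what has been accumulated so far.
    result: function () {
      return { size: n, mean: mean, variance: n > 0 ? m2 / n : 0 };
    }
  };
}

// Feed made-up mpg values one at a time, then read the totals.
var stats = createRunningStats();
[ 10, 12, 30, 32 ].forEach( function ( mpg ) {
  stats.push( mpg );
} );
console.log( stats.result() );
// size is 4; mean and variance come out at roughly 21 and 101
```

The push/result pair mirrors the evaluate/metrics pattern: one cheap update per row, and a summary that can be read at any point.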

exportJSON

Exports the JSON of the rule tree generated by learn(), which may be saved in a file for later predictions.

exportJSON(): json
Returns
json: of the rule tree.
Example
var rules = myRT.exportJSON();

importJSON

Imports the rule tree from the input rulesTree for subsequent use by predict(). Note that after a successful import, the instance can be used ONLY for prediction, not for further ingestion or learning.

importJSON(rulesTree: json): boolean
Parameters
rulesTree (json): containing an earlier exported rule tree in JSON format.
Returns
boolean: always true.
Throws
  • error: if rulesTree is null.
  • error: if rulesTree cannot be parsed as valid JSON.
  • error: if rulesTree is of an incorrect version or incorrect format.
Example
var anRT = regressionTree();
// Assuming that `rules` holds a valid exported rule tree.
anRT.importJSON( rules );
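
To reuse a learned tree across runs, the exported JSON can be written to disk and read back before importing. The sketch below assumes exportJSON() returns a JSON string and uses a placeholder string in its place so that it runs without a trained tree; the file name is arbitrary.

```javascript
var fs = require( 'fs' );
var os = require( 'os' );
var path = require( 'path' );

// Placeholder standing in for the value of myRT.exportJSON();
// the real content is produced by the library.
var rules = JSON.stringify( { placeholder: true } );

// Save the rules to a file (arbitrary name in the temp directory).
var file = path.join( os.tmpdir(), 'wink-rt-rules-example.json' );
fs.writeFileSync( file, rules, 'utf8' );

// Later, possibly in another process: read the rules back.
var loaded = fs.readFileSync( file, 'utf8' );
console.log( loaded === rules );
// -> true

// Clean up the example file.
fs.unlinkSync( file );
```

The loaded string would then be passed to a fresh instance via anRT.importJSON( loaded ).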

reset

It completely resets the tree by re-initializing all the learning-related variables, except its configuration. It is useful during k-fold cross-validation.

reset(): undefined
Returns
undefined: nothing!
Example
myRT.reset();
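
For cross-validation, the data is first partitioned into folds, and the tree is reset before each round. The sketch below shows one simple way to build the folds by assigning rows round-robin; kFoldSplit is a hypothetical helper, and the row labels are placeholders for real data rows.

```javascript
// Splits `rows` into `k` folds. Each fold holds the rows used for
// validation and the remaining rows used for training.
function kFoldSplit( rows, k ) {
  var folds = [];
  for ( var i = 0; i < k; i += 1 ) {
    folds.push( { train: [], validate: [] } );
  }
  rows.forEach( function ( row, index ) {
    folds.forEach( function ( fold, f ) {
      // Row `index` is validation data for exactly one fold.
      if ( index % k === f ) fold.validate.push( row );
      else fold.train.push( row );
    } );
  } );
  return folds;
}

var folds = kFoldSplit( [ 'r0', 'r1', 'r2', 'r3', 'r4', 'r5' ], 3 );
console.log( folds[ 0 ].validate );
// -> [ 'r0', 'r3' ]

// With a configured instance, each round might then look like:
//   myRT.reset();
//   fold.train.forEach( function ( row ) { myRT.ingest( row ); } );
//   myRT.learn();
//   ...call myRT.evaluate() on each validation row, as an object...
//   myRT.metrics();
```

Because reset() keeps the configuration, defineConfig() needs to be called only once before the first round. Keep in mind that learn() needs at least 60 training rows in every fold.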