Introduction

wink-statistics

Fast and Numerically Stable Statistical Analysis Utilities

Perform fast and numerically stable statistical analysis using wink-statistics. It is a part of wink — a growing family of high quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

Summarize, discover and analyze data with its rich set of features:

  1. Analyzes both continuous and categorical data.

  2. Handles real-time streams of data and incrementally computes statistics that would usually take more than one pass over the data, such as standard deviation or simple linear regression.

  3. Minimizes data preprocessing by handling arrays of data structures containing numerical values, e.g. an array of objects.

Installation

Use npm to install:

npm install wink-statistics --save

API

Check out the statistics API documentation to learn more.

Need Help?

If you spot a bug and the same has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

wink-statistics is copyright 2017-18 GRAYPE Systems Private Limited.

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

Probability

probability.aggregate

Aggregates two probability estimates from independent sources about the occurrence of a single event a. It returns the aggregated probability of occurrence of the event a. The assumption here is that the two probabilities (estimates) are not correlated with each other and the common prior probability of a is 0.5.

For a detailed explanation, refer to the paper titled Bayesian Group Belief by Franz Dietrich published in Social Choice and Welfare October 2010, Volume 35, Issue 4, pp 595–626.

probability.aggregate
Parameters
pa1 (number) — first estimate of probability of occurrence of event a .
pa2 (number) — second estimate of probability of occurrence of event a .
Returns
number: the aggregated probability.
Example
aggregate( 0.5, 0.6 );
// returns 0.6
aggregate( 0.5, 0.4 );
// returns 0.4
aggregate( 0.6, 0.6 );
// returns 0.6923076923076923
aggregate( 0.4, 0.6 );
// returns 0.5
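
The example values above are consistent with the usual rule for combining two independent estimates under a common prior of 0.5. The following is a minimal sketch for illustration only, not the library's source; the name aggregateSketch is hypothetical.

```javascript
// Hypothetical helper: combines two independent probability estimates of
// the same event, assuming a common prior probability of 0.5.
function aggregateSketch( pa1, pa2 ) {
  // Joint support for the event versus joint support against it.
  var forA = pa1 * pa2;
  var againstA = ( 1 - pa1 ) * ( 1 - pa2 );
  return forA / ( forA + againstA );
}

console.log( aggregateSketch( 0.6, 0.6 ) );
console.log( aggregateSketch( 0.4, 0.6 ) );
```

Two estimates on the same side of 0.5 reinforce each other, while estimates that are symmetric around 0.5 cancel out.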

probability.range4CI

Computes the probability from the observed count of successes (successCount) out of the total count (totalCount), along with its range for the required level of Confidence Interval (CI), i.e. zscore. The range is the minimum and maximum probability values for the given zscore or CI.

These computations are based on the approach specified in Wilson's Notes on Probable Inference, The Law of Succession, and Statistical Inference, published in ASA's Journal.

For quick reference, typical values of zscore for 90% and 95% CI are approximately 1.645 and 1.960 respectively.

probability.range4CI
Parameters
successCount (number) — observed count of successes out of
totalCount (number) — the total count.
zscore (number = 1.645) — for the required level of CI.
Returns
object: containing probability , min and max .
Example
range4CI( 1, 10 );
// returns {
//   probability: 0.18518871952479238,
//   min: 0.02263232984000629,
//   max: 0.34774510920957846
// }
range4CI( 10, 100 );
// returns {
//   probability: 0.1105389143431459,
//   min: 0.06071598345043355,
//   max: 0.16036184523585828
// }
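
The figures in the example above can be reproduced with the Wilson score interval. The sketch below assumes that formula and a default zscore of 1.645; the name wilsonRangeSketch is hypothetical and the code is not taken from the library.

```javascript
// Hypothetical helper: Wilson score interval for a binomial proportion.
function wilsonRangeSketch( successCount, totalCount, zscore ) {
  var z = ( zscore === undefined ) ? 1.645 : zscore;
  var n = totalCount;
  var p = successCount / n;
  var z2 = z * z;
  var denom = 1 + ( z2 / n );
  // Centre of the interval: the observed proportion pulled towards 0.5.
  var centre = ( p + ( z2 / ( 2 * n ) ) ) / denom;
  // Half-width of the interval around the centre.
  var half = ( z * Math.sqrt( ( p * ( 1 - p ) / n ) + ( z2 / ( 4 * n * n ) ) ) ) / denom;
  return { probability: centre, min: centre - half, max: centre + half };
}

console.log( wilsonRangeSketch( 1, 10 ) );
console.log( wilsonRangeSketch( 10, 100 ) );
```

Note that the reported probability is the centre of the interval rather than the raw ratio successCount / totalCount, which is why 1 out of 10 yields roughly 0.185 and not 0.1.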

Stats

stats.boxplot

Performs complete boxplot analysis including computation of notches and outliers.

stats.boxplot
Parameters
sortedData (array) — sorted in ascending order of value.
coeff (number = 1.5) — used for outliers computation.
accessor ((string | number | function) = undefined) — required when elements of sortedData are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
object: consisting of min , q1 , median , q3 , max , iqr , range , size along with leftNotch , and rightNotch . The leftOutliers/rightOutliers (object), if present, contains the count , fence and begin/end indexes to sortedData for easy extraction of exact values.
Example
var data = [
  -12, 14, 14, 14, 16, 18, 20, 20, 21, 23, 27, 27, 27, 29, 31,
  31, 32, 32, 34, 36, 40, 40, 40, 40, 40, 42, 51, 56, 60, 88
];
boxplot( data );
// returns {
//   min: -12, q1: 20, median: 31, q3: 40, max: 88,
//   iqr: 20, range: 100, size: 30,
//   leftOutliers: { begin: 0, end: 0, count: 1, fence: 14 },
//   rightOutliers: { begin: 29, end: 29, count: 1, fence: 60 },
//   leftNotch: 25.230655727612252,
//   rightNotch: 36.76934427238775
// }
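
The notch and outlier figures in the example above can be reproduced from the quartiles. The sketch below is an illustration under assumptions: the notch constant 1.58 is inferred from the example output, the quartiles are passed in rather than computed, and the name boxplotExtrasSketch is hypothetical.

```javascript
// Hypothetical helper: notches and outlier counts from precomputed quartiles.
function boxplotExtrasSketch( sortedData, q1, median, q3, coeff ) {
  var c = ( coeff === undefined ) ? 1.5 : coeff;
  var n = sortedData.length;
  var iqr = q3 - q1;
  // Notch half-width; the 1.58 constant is an assumption that matches the example.
  var notch = 1.58 * iqr / Math.sqrt( n );
  // Values beyond these limits are treated as outliers.
  var lower = q1 - ( c * iqr );
  var upper = q3 + ( c * iqr );
  var leftCount = sortedData.filter( function ( v ) { return v < lower; } ).length;
  var rightCount = sortedData.filter( function ( v ) { return v > upper; } ).length;
  return {
    leftNotch: median - notch,
    rightNotch: median + notch,
    leftOutlierCount: leftCount,
    rightOutlierCount: rightCount
  };
}
```

In the example output, fence appears to be the nearest data value that is not an outlier (14 and 60) rather than the computed limit itself; this sketch does not model that detail.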

stats.fiveNumSummary

Returns the five number summary from the sortedData.

stats.fiveNumSummary
Parameters
sortedData (array) — sorted in ascending order of value.
accessor ((string | number | function) = undefined) — required when elements of sortedData are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
object: 5-number summary consisting of min, q1, median, q3, max along with iqr, range, and size.
Example
fiveNumSummary( [ 1, 1, 2, 2, 3, 3, 4, 4 ] );
// returns {
//   q1: 1.25, median: 2.5, q3: 3.75, iqr: 2.5,
//   size: 8, min: 1, max: 4, range: 3
// }

stats.histogram

Generates histogram using Freedman–Diaconis method. If both IQR and MAD are 0 then it automatically switches to Sturges' Rule while ensuring minimum of 5 bins. It attempts to reduce excessive sparsity of distribution, if any, by adjusting the number of bins using Sturges' Rule.

stats.histogram
Parameters
sortedData (array) — sorted in ascending order of value.
dataPrecision (number = 0) — typically the minimum number of decimal places observed in the sortedData.
accessor ((string | number | function) = undefined) — required when elements of sortedData are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
object: containing arrays classes and the corresponding frequencies. Each element of the classes array is an object with values for min/max (class intervals) and mid point of a class.

In addition, the returned object contains useful statistics like q1 , q3 , iqr , min , max , and range .
Example
var data = [
  12, 14, 14, 14, 16, 18, 20, 20, 21, 23, 27, 27, 27, 29, 31,
  31, 32, 32, 34, 36, 40, 40, 40, 40, 40, 42, 51, 56, 60, 65
];
histogram( data );
// returns {
//   classes: [
//     { min: 12, mid: 19, max: 25 },
//     { min: 25, mid: 32, max: 38 },
//     { min: 38, mid: 45, max: 51 },
//     { min: 51, mid: 58, max: 64 },
//     { min: 64, mid: 71, max: 77 } ],
//   frequencies: [ 10, 10, 7, 2, 1 ],
//   q1: 20,  q3: 40, iqr: 20, size: 30, min: 12, max: 65,range: 53
// }
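
The two binning rules named above can be sketched as follows. This is only an illustration of the textbook formulas; the function names are hypothetical, and the library's rounding by dataPrecision and its sparsity adjustment are not modelled.

```javascript
// Freedman-Diaconis: bin width = 2 * IQR / cube root of n.
function fdBinCountSketch( iqr, range, n ) {
  var width = 2 * iqr / Math.cbrt( n );
  return Math.ceil( range / width );
}

// Sturges' Rule: number of bins = ceil( log2( n ) ) + 1.
function sturgesBinCountSketch( n ) {
  return Math.ceil( Math.log2( n ) ) + 1;
}

// Values taken from the example above: iqr 20, range 53, size 30.
console.log( fdBinCountSketch( 20, 53, 30 ) );
console.log( sturgesBinCountSketch( 30 ) );
```

For the example data the Freedman-Diaconis rule gives 5 bins, which agrees with the five classes returned above.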

stats.mad

Returns the median absolute deviation (MAD) of the sortedData.

stats.mad
Parameters
sortedData (array) — sorted in ascending order of value.
accessor ((string | number | function) = undefined) — required when elements of sortedData are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: median absolute deviation of the sortedData.
Example
mad( [ 1, 1, 2, 2, 3, 3, 4, 4 ] );
// returns 1
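
The example value above matches the median absolute deviation: the median of the absolute differences between each value and the median. A minimal sketch for plain numeric input follows; the helper names are hypothetical and accessor handling is omitted.

```javascript
// Median of an array that is already sorted in ascending order.
function medianOfSorted( sorted ) {
  var n = sorted.length;
  var mid = Math.floor( n / 2 );
  return ( n % 2 === 0 ) ? ( sorted[ mid - 1 ] + sorted[ mid ] ) / 2 : sorted[ mid ];
}

// Hypothetical helper: median absolute deviation of sorted numeric data.
function madSketch( sortedData ) {
  var med = medianOfSorted( sortedData );
  // Absolute deviations are not sorted in general, so sort them again.
  var deviations = sortedData
    .map( function ( v ) { return Math.abs( v - med ); } )
    .sort( function ( a, b ) { return a - b; } );
  return medianOfSorted( deviations );
}

console.log( madSketch( [ 1, 1, 2, 2, 3, 3, 4, 4 ] ) );
// prints 1
```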

stats.max

Finds the maximum value in the x array.

stats.max
Parameters
x (array) — array containing 1 or more elements.
accessor ((string | number | function) = undefined) — required when elements of x are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: maximum value from array x.
Example
max( [ 99, 1, -1, +222, 0, -99 ] )
// returns 222
max( [ { x: 33 }, { x: 11 }, { x:44 } ], 'x' )
// returns 44

stats.mean

Computes the mean of numbers contained in the x array. The computations are inspired by the method proposed by B. P. Welford.

stats.mean
Parameters
x (array) — array containing 1 or more elements.
accessor ((string | number | function) = undefined) — required when elements of x are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: mean value.
Example
mean( [ 2, 3, 5, 7 ] )
// returns 4.25
mean( [ { x: 2 }, { x: 3 }, { x: 5 }, { x: 7 } ], 'x' )
// returns 4.25
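
The accessor parameter and the incremental mean update can be sketched together as below. This is an assumption-laden illustration, not the library's code; toGetter and meanSketch are hypothetical names.

```javascript
// Hypothetical helper: turns an accessor (key, index, or function) into a getter.
function toGetter( accessor ) {
  if ( accessor === undefined ) return function ( e ) { return e; };
  if ( typeof accessor === 'function' ) return accessor;
  // A string key and a numeric index both work with bracket access.
  return function ( e ) { return e[ accessor ]; };
}

// Hypothetical helper: mean computed with an incremental, Welford-style update.
function meanSketch( x, accessor ) {
  var get = toGetter( accessor );
  var mean = 0;
  for ( var i = 0; i < x.length; i += 1 ) {
    // Move the running mean by the scaled deviation of the new value.
    mean += ( get( x[ i ] ) - mean ) / ( i + 1 );
  }
  return mean;
}

console.log( meanSketch( [ [ 'a', 2 ], [ 'b', 3 ], [ 'c', 5 ], [ 'd', 7 ] ], 1 ) );
console.log( meanSketch( [ { v: { x: 2 } }, { v: { x: 7 } } ], function ( e ) { return e.v.x; } ) );
```

The incremental form avoids accumulating one large sum, which helps limit loss of precision on long inputs.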

stats.median

Returns the median of the sortedData.

stats.median
Parameters
sortedData (array) — sorted in ascending order of value.
accessor ((string | number | function) = undefined) — Useful when each element of sortedData is an object or an array instead of number. If it is an object then it should be the key (string) to access the value; or if it is an array then it should be the index (number) to access the value; or it should be a function that extracts the value from the element passed to it.
Returns
number: median of the sortedData .
Example
median( [ 1, 1, 2, 2, 3, 3, 4, 4 ] );
// returns 2.5

stats.min

Finds the minimum value in the x array.

stats.min
Parameters
x (array) — array containing 1 or more elements.
accessor ((string | number | function) = undefined) — required when elements of x are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: minimum value from array x.
Example
min( [ 99, 1, -1, +222, 0, -99 ] )
// returns -99
min( [ { x: 33 }, { x: 11 }, { x:44 } ], 'x' )
// returns 11

stats.percentile

Returns the qth percentile from the sortedData. The computation is based on Method 11 described in Quartiles in Elementary Statistics by Eric Langford published in Journal of Statistics Education Volume 14, Number 3 (2006).

stats.percentile
Parameters
sortedData (array) — sorted in ascending order of value.
q (number) — should be between 0 and 1 indicating percentile; for example, to get the 25th percentile, it should be 0.25.
accessor ((string | number | function) = undefined) — required when elements of sortedData are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: qth percentile of sortedData.
Example
percentile( [ 1, 1, 2, 2, 3, 3, 4, 4 ], 0.25 );
// returns 1.25
percentile( [ 1, 1, 2, 2, 3, 3, 4, 4 ], 0.75 );
// returns 3.75
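
The example values above are consistent with placing the percentile at position (n + 1) * q and interpolating linearly between neighbours. The sketch below assumes that rule; percentileSketch is a hypothetical name, accessor handling is omitted, and the clamping at the ends is an assumption.

```javascript
// Hypothetical helper: percentile by the (n + 1) * q position rule.
function percentileSketch( sortedData, q ) {
  var n = sortedData.length;
  // 1-based fractional position of the requested percentile.
  var pos = ( n + 1 ) * q;
  var k = Math.floor( pos );
  var frac = pos - k;
  // Clamp positions that fall outside the data (an assumption).
  if ( k < 1 ) return sortedData[ 0 ];
  if ( k >= n ) return sortedData[ n - 1 ];
  // Linear interpolation between the k-th and (k + 1)-th values.
  return sortedData[ k - 1 ] + ( frac * ( sortedData[ k ] - sortedData[ k - 1 ] ) );
}

console.log( percentileSketch( [ 1, 1, 2, 2, 3, 3, 4, 4 ], 0.25 ) );
// prints 1.25
console.log( percentileSketch( [ 1, 1, 2, 2, 3, 3, 4, 4 ], 0.75 ) );
// prints 3.75
```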

stats.stdev

Computes the sample standard deviation of numbers contained in the x array. The computations are inspired by the method proposed by B. P. Welford.

stats.stdev
Parameters
x (array) — array containing 1 or more elements.
accessor ((string | number | function) = undefined) — required when elements of x are objects or arrays instead of numbers. For objects, use key (string) to access the value; in case of arrays, use index (number) to access the value; or it could be a function that extracts the value from the element passed to it.
Returns
number: standard deviation of sample.
Example
stdev( [ 2, 3, 5, 7 ] )
// returns 2.217355782608345
stdev( [ { x: 2 }, { x: 3 }, { x: 5 }, { x: 7 } ], 'x' )
// returns 2.217355782608345

Streaming

streaming.cov

Covariance — cov is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the covariance between x and y values passed to it in real-time. Probe the sample covariance anytime using value(), which may be reset via reset().

Number of decimals in the returned numerical values can be configured by defining fractionDigits as parameter in result() and value(). Its default value is 4.

The result() returns an object containing sample covariance cov, along with meanX, meanY and size of data i.e. number of x & y pairs. It also contains population covariance covp.

streaming.cov
Returns
object: containing compute , value , result , and reset functions.
Example
var covariance = cov();
covariance.compute( 10, 80 );
covariance.compute( 15, 75 );
covariance.compute( 16, 65 );
covariance.compute( 18, 50 );
covariance.compute( 21, 45 );
covariance.compute( 30, 30 );
covariance.compute( 36, 18 );
covariance.compute( 40, 9 );
covariance.result();
// returns { size: 8,
//   meanX: 23.25,
//   meanY: 46.5,
//   cov: -275.8571,
//   covp: -241.375
// }
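
A single-pass covariance of this kind can be sketched with a running co-moment, as below. This is an illustrative sketch only: covSketch is a hypothetical name, only compute() and result() are shown, and values are not rounded to fractionDigits.

```javascript
// Hypothetical helper: single-pass covariance using a running co-moment.
function covSketch() {
  var n = 0;
  var meanX = 0;
  var meanY = 0;
  var coMoment = 0;
  return {
    compute: function ( x, y ) {
      n += 1;
      // Deviation of x from the mean before this pair is included.
      var dx = x - meanX;
      meanX += dx / n;
      meanY += ( y - meanY ) / n;
      // Pair the old x deviation with the deviation from the updated mean of y.
      coMoment += dx * ( y - meanY );
    },
    result: function () {
      return {
        size: n,
        meanX: meanX,
        meanY: meanY,
        cov: coMoment / ( n - 1 ),
        covp: coMoment / n
      };
    }
  };
}

var c = covSketch();
[ [ 10, 80 ], [ 15, 75 ], [ 16, 65 ], [ 18, 50 ], [ 21, 45 ], [ 30, 30 ], [ 36, 18 ], [ 40, 9 ] ]
  .forEach( function ( p ) { c.compute( p[ 0 ], p[ 1 ] ); } );
console.log( c.result() );
```

Dividing the co-moment by n - 1 gives the sample covariance, and dividing by n gives the population covariance.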

streaming.freqTable

It is a higher order function that returns an object containing build(), value(), result(), and reset() functions.

Use build() to construct a frequency table from value of data items passed to it in real-time. Probe the object containing data-item/frequency pairs using value(), which may be reset via reset().

The result() returns an object containing the frequency table sorted in descending order of category counts or frequency, along with its size, sum of all counts, x2 (chi-squared statistic), df (degrees of freedom), and the entropy.

The x2 along with the df can be used to test the hypothesis that "the distribution is a uniform one". The percentage in the table gives the percentage of a category count against the sum; and expected is the count assuming a uniform distribution.

streaming.freqTable
Returns
object: containing build, value, result, and reset functions.
Example
var ft = freqTable();
ft.build( 'Tea' );
ft.build( 'Tea' );
ft.build( 'Tea' );
ft.build( 'Pepsi' );
ft.build( 'Pepsi' );
ft.build( 'Gin' );
ft.build( 'Coke' );
ft.build( 'Coke' );
ft.value();
// returns { Tea: 3, Pepsi: 2, Gin: 1, Coke: 2 }
ft.result();
// returns {
//  table: [
//   { category: 'Tea', observed: 3, percentage: 37.5, expected: 2 },
//   { category: 'Pepsi', observed: 2, percentage: 25, expected: 2 },
//   { category: 'Coke', observed: 2, percentage: 25, expected: 2 },
//   { category: 'Gin', observed: 1, percentage: 12.5, expected: 2 }
//  ],
//  size: 4,
//  sum: 8,
//  x2: 1,
//  df: 3,
//  entropy: 1.9056390622295665
// }
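
The x2, df and entropy figures above follow from the counts alone. The sketch below assumes a uniform expected count and entropy measured in bits; uniformityStatsSketch is a hypothetical name and the code is not the library's implementation.

```javascript
// Hypothetical helper: chi-squared statistic against a uniform distribution,
// degrees of freedom, and entropy in bits, from a category/count object.
function uniformityStatsSketch( counts ) {
  var categories = Object.keys( counts );
  var size = categories.length;
  var sum = categories.reduce( function ( acc, c ) { return acc + counts[ c ]; }, 0 );
  // Under the uniform hypothesis every category has the same expected count.
  var expected = sum / size;
  var x2 = 0;
  var entropy = 0;
  categories.forEach( function ( c ) {
    var observed = counts[ c ];
    x2 += ( observed - expected ) * ( observed - expected ) / expected;
    var p = observed / sum;
    entropy -= p * Math.log2( p );
  } );
  return { size: size, sum: sum, x2: x2, df: size - 1, entropy: entropy };
}

console.log( uniformityStatsSketch( { Tea: 3, Pepsi: 2, Gin: 1, Coke: 2 } ) );
```

A small x2 relative to the chi-squared critical value for df degrees of freedom means the uniform hypothesis cannot be rejected.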

streaming.max

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the maximum value of data items passed to it in real-time. Probe the maximum anytime using value(), which may be reset via reset(). The result() returns an object containing max.

streaming.max
Returns
object: containing compute , value , result , and reset functions.
Example
var maximum = max();
maximum.compute( 3 );
maximum.compute( 6 );
maximum.value();
// returns 6
maximum.result();
// returns { max: 6 }

streaming.mean

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the mean aka average value of data items passed to it in real-time. Probe the mean anytime using value(), which may be reset via reset(). The computations are inspired by the method proposed by B. P. Welford.

The result() returns an object containing sample mean along with size of data.

streaming.mean
Returns
object: containing compute , value , result , and reset functions.
Example
var avg = mean();
avg.compute( 2 );
avg.compute( 3 );
avg.compute( 5 );
avg.compute( 7 );
avg.value();
// returns 4.25
avg.result();
// returns { n: 4, mean: 4.25 }

streaming.min

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the minimum value of data items passed to it in real-time. Probe the minimum anytime using value(), which may be reset via reset(). The result() returns an object containing min.

streaming.min
Returns
object: containing compute , value , result , and reset functions.
Example
var minimum = min();
minimum.compute( 3 );
minimum.compute( 6 );
minimum.value();
// returns 3
minimum.result();
// returns { min: 3 }

streaming.slr

Simple Linear Regression — slr is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the correlation between x and y values passed to it in real-time. Probe the correlation anytime using result(), which may be reset via reset().

Number of decimals in the correlated values can be configured by defining fractionDigits as parameter in result(). Its default value is 4. The result() also has an alias value().

The correlation is an object containing slope, intercept, r, r2, se along with the size of data i.e. number of x & y pairs. In case of any error such as no input data or zero variance, correlation object will be an empty one.

streaming.slr
Returns
object: containing compute , value , result , and reset functions.
Example
var regression = slr();
regression.compute( 10, 80 );
regression.compute( 15, 75 );
regression.compute( 16, 65 );
regression.compute( 18, 50 );
regression.compute( 21, 45 );
regression.compute( 30, 30 );
regression.compute( 36, 18 );
regression.compute( 40, 9 );
regression.result();
// returns { slope: -2.3621,
//   intercept: 101.4188,
//   r: -0.9766,
//   r2: 0.9537,
//   se: 5.624,
//   size: 8
// }
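
The slope, intercept and r above come from the usual least-squares sums. The sketch below is a plain two-pass version for illustration, not the streaming implementation; slrSketch is a hypothetical name, se and r2 are omitted, and results are left unrounded, so they agree with the example only to about three decimals.

```javascript
// Hypothetical helper: least-squares fit over an array of [ x, y ] pairs.
function slrSketch( points ) {
  var n = points.length;
  var meanX = 0;
  var meanY = 0;
  points.forEach( function ( p ) {
    meanX += p[ 0 ] / n;
    meanY += p[ 1 ] / n;
  } );
  // Sums of squared and cross deviations from the means.
  var sxx = 0;
  var syy = 0;
  var sxy = 0;
  points.forEach( function ( p ) {
    var dx = p[ 0 ] - meanX;
    var dy = p[ 1 ] - meanY;
    sxx += dx * dx;
    syy += dy * dy;
    sxy += dx * dy;
  } );
  var slope = sxy / sxx;
  return {
    slope: slope,
    intercept: meanY - ( slope * meanX ),
    r: sxy / Math.sqrt( sxx * syy ),
    size: n
  };
}

console.log( slrSketch( [
  [ 10, 80 ], [ 15, 75 ], [ 16, 65 ], [ 18, 50 ],
  [ 21, 45 ], [ 30, 30 ], [ 36, 18 ], [ 40, 9 ]
] ) );
```

A zero sxx (all x values equal) makes the slope undefined, which corresponds to the zero-variance error case mentioned above; this sketch does not guard against it.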

streaming.stdev

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the standard deviation value of data items passed to it in real-time. Probe the sample standard deviation anytime using value(), which may be reset via reset(). The computations are inspired by the method proposed by B. P. Welford.

The result() returns an object containing sample stdev and variance, along with mean, size of data; it also contains population standard deviation and variance as stdevp and variancep.

streaming.stdev
Returns
object: containing compute , value , result , and reset functions.
Example
var sd = stdev();
sd.compute( 2 );
sd.compute( 3 );
sd.compute( 5 );
sd.compute( 7 );
sd.value();
// returns 2.217355782608345
sd.result();
// returns { size: 4, mean: 4.25,
//   variance: 4.916666666666666,
//   stdev: 2.217355782608345,
//   variancep: 3.6874999999999996,
//   stdevp: 1.920286436967152
// }
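
Welford's method, and the compute/value/result/reset pattern described above, can be sketched as follows. This is an illustration under assumptions, not the library's source; stdevSketch is a hypothetical name.

```javascript
// Hypothetical helper: streaming standard deviation using Welford's update.
function stdevSketch() {
  var n = 0;
  var mean = 0;
  var m2 = 0; // running sum of squared deviations from the mean
  return {
    compute: function ( x ) {
      n += 1;
      var delta = x - mean;
      mean += delta / n;
      // Uses the deviation from the old mean and from the updated mean.
      m2 += delta * ( x - mean );
    },
    value: function () {
      return Math.sqrt( m2 / ( n - 1 ) );
    },
    result: function () {
      return {
        size: n,
        mean: mean,
        variance: m2 / ( n - 1 ),
        stdev: Math.sqrt( m2 / ( n - 1 ) ),
        variancep: m2 / n,
        stdevp: Math.sqrt( m2 / n )
      };
    },
    reset: function () {
      n = 0;
      mean = 0;
      m2 = 0;
    }
  };
}

var s = stdevSketch();
[ 2, 3, 5, 7 ].forEach( function ( v ) { s.compute( v ); } );
console.log( s.result() );
```

Only three numbers are kept between calls, so memory use does not grow with the length of the stream.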

streaming.sum

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the sum of data items passed to it in real-time. Probe the sum anytime using value(), which may be reset via reset(). The sum is compensated for floating point errors using Neumaier Method. The result() returns an object containing sum.

streaming.sum
Returns
object: containing compute , value , result , and reset functions.
Example
var addition = sum();
addition.compute( 1 );
addition.compute( 10e+100 );
addition.compute( 1 );
addition.compute( -10e+100 );
addition.value();
// returns 2
addition.result();
// returns { sum: 2 }
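
The Neumaier compensation mentioned above can be sketched as below: a separate correction term collects the low-order bits that a plain addition would drop. sumSketch is a hypothetical name and the code is an illustration, not the library's source.

```javascript
// Hypothetical helper: streaming sum with Neumaier compensation.
function sumSketch() {
  var sum = 0;
  var compensation = 0;
  return {
    compute: function ( x ) {
      var t = sum + x;
      if ( Math.abs( sum ) >= Math.abs( x ) ) {
        // Low-order digits of x were lost in t; recover them.
        compensation += ( sum - t ) + x;
      } else {
        // Low-order digits of sum were lost in t; recover them.
        compensation += ( x - t ) + sum;
      }
      sum = t;
    },
    value: function () {
      return sum + compensation;
    },
    reset: function () {
      sum = 0;
      compensation = 0;
    }
  };
}

var total = sumSketch();
[ 1, 10e+100, 1, -10e+100 ].forEach( function ( v ) { total.compute( v ); } );
console.log( total.value() );
// prints 2
console.log( 1 + 10e+100 + 1 - 10e+100 );
// prints 0, the uncompensated result
```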

streaming.summary

It is a higher order function that returns an object containing compute(), value(), result(), and reset() functions.

Use compute() to continuously determine the summary statistics of data items passed to it in real-time. Probe the sample summary statistics anytime using value(), which may be reset via reset(). The result() is also an alias of value(). The computations are inspired by the method proposed by B. P. Welford.

The summary statistics is an object containing size, min, mean, max, sample stdev along with sample variance of data; it also contains population standard deviation and variance as stdevp and variancep.

streaming.summary
Returns
object: containing compute , value , result , and reset functions.
Example
var ss = summary();
ss.compute( 2 );
ss.compute( 3 );
ss.compute( 5 );
ss.compute( 7 );
ss.result();
// returns { size: 4, min: 2, mean: 4.25, max: 7,
//   variance: 4.916666666666666,
//   stdev: 2.217355782608345,
//   variancep: 3.6874999999999996,
//   stdevp: 1.920286436967152
// }