Thursday, October 20, 2016

Model Training


For Classifiers

  • Train Model: Trains a classification or regression model from a training set. Takes an Untrained Model and the Training Data and trains a Model.
  • Tune Hyperparameters: Many models have hyperparameters, so it's often better to replace the Train Model module with a Tune Hyperparameters module. It takes an Untrained Model, the Training Data, and some Validation Data, trains a Model, and tunes the hyperparameters.

For Regression

  • Train Model: Trains a classification or regression model from a training set. Takes an Untrained Model and the Training Data and trains a Model.
  • Tune Hyperparameters: Many models have hyperparameters, so it's often better to replace the Train Model module with a Tune Hyperparameters module. It takes an Untrained Model, the Training Data, and some Validation Data, trains a Model, and tunes the hyperparameters.
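What a hyperparameter sweep does can be sketched in plain Python, outside AzureML: try each candidate setting, score it on validation data, and keep the best. The toy "model" below (a single decision threshold) and all the data are made up for illustration.

```python
# A minimal sketch of a hyperparameter sweep. The "model" predicts 1
# when a feature exceeds a threshold; the threshold is the hyperparameter.
valid_x = [0.2, 0.5, 0.7, 0.8]   # made-up validation features
valid_y = [0, 0, 1, 1]           # made-up validation labels

def accuracy(threshold):
    """Score one candidate threshold on the validation data."""
    preds = [1 if x > threshold else 0 for x in valid_x]
    return sum(p == y for p, y in zip(preds, valid_y)) / len(valid_y)

candidates = [0.3, 0.5, 0.6]     # the parameter grid to sweep
best = max(candidates, key=accuracy)
print(best, accuracy(best))      # the winning threshold and its score
```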

For Anomaly Detection

  • Train Anomaly Detection Model: Trains an anomaly detector model and labels data from a training set. Takes an Untrained Model and the Training Data and trains a Model.

For Clustering

  • Train Clustering Model: Trains a clustering model and assigns data from the training set to clusters. Takes an Untrained Model and the Training Data and trains a Model.
  • Sweep Clustering: Performs a parameter sweep on a clustering model to determine the optimum parameter settings and trains the best model. Takes an Untrained Model and the Training Data and trains a Model.

Wednesday, October 19, 2016

Statistical Measures

Azure ML Evaluation results often include some statistical measures that need some explanation. 
Here is a brief summary:-

For Classifiers

  • True Positive (TP): A count of the actual positive outcomes that the algorithm predicted correctly.
  • True Negative (TN): A count of the actual negative outcomes that the algorithm predicted correctly.
  • False Positive (FP): A count of the actual negative outcomes that the algorithm incorrectly predicted as positive.
  • False Negative (FN): A count of the actual positive outcomes that the algorithm incorrectly predicted as negative.
  • Precision: The proportion of predicted positives that are classified correctly: TP/(TP+FP)
  • Recall: The proportion of actual positives that are classified correctly: TP/(TP+FN)
  • Accuracy: The proportion of all values classified correctly: (TP+TN)/(TP+TN+FP+FN). Accuracy on its own is not a reliable metric for the real performance of a classifier, particularly on imbalanced data.
  • F1 Score: The harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). The F1 Score is a good metric for the real performance of a classifier since it balances precision and recall.
  • AUC: Area Under the Curve: the area under the Receiver Operating Characteristic (ROC) curve. This is the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. The AUC is a good metric for the real performance of a classifier since it summarises performance across all classification thresholds.
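All of these measures (apart from AUC) follow directly from the four counts. The counts below are made up purely for illustration:

```python
# Made-up confusion-matrix counts for a classifier on 200 records.
tp, tn, fp, fn = 80, 90, 10, 20

precision = tp / (tp + fp)                    # predicted positives that were right
recall = tp / (tp + fn)                       # actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)    # all records classified correctly
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"F1 Score:  {f1:.3f}")         # 0.842
```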

For Regression

  • Negative Log Likelihood: A measure of how far the actual data varies from the predicted values. A regression model attempts to minimise the Negative Log Likelihood, so a low value indicates a well trained model.
  • Mean Absolute Error: The average of the absolute differences between predicted and actual values. A low value indicates a well trained model.
  • Root Mean Squared Error: The square root of the average of the squared differences between predicted and actual values. A low value indicates a well trained model.
  • Relative Absolute Error: The total absolute error relative to the total absolute error of simply predicting the mean of the actual values. A low value indicates a well trained model.
  • Relative Squared Error: The total squared error relative to the total squared error of simply predicting the mean of the actual values. A low value indicates a well trained model.
  • Coefficient of Determination (R2): A statistical measure of how well the regression line approximates the real data points. The coefficient of determination normally ranges from 0 to 1. An R2 of 1 indicates that the regression line perfectly fits the data, but low values can be entirely normal.
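The error measures above (apart from Negative Log Likelihood) can be computed directly from a set of actual values and predictions. The numbers below are made up for illustration:

```python
import math

# Made-up actual values and model predictions.
actual    = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.8, 5.3, 2.9, 6.5, 4.4]
n = len(actual)
mean_actual = sum(actual) / n

abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
sq_errors  = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mae  = sum(abs_errors) / n             # Mean Absolute Error
rmse = math.sqrt(sum(sq_errors) / n)   # Root Mean Squared Error
# The relative errors compare the model against simply predicting the mean.
rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
rse = sum(sq_errors) / sum((a - mean_actual) ** 2 for a in actual)
r2  = 1 - rse                          # Coefficient of Determination

print(f"MAE={mae:.3f} RMSE={rmse:.3f} RAE={rae:.3f} RSE={rse:.3f} R2={r2:.3f}")
```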

For Clustering

It's not clear to me yet what best indicates a well trained Clustering model.
  • Average Distance to Cluster Center: The average distance of all points in a cluster to the centroid of that cluster.
  • Average Distance to Other Center: The average distance of all points in a cluster to the centroids of the other clusters.
  • Number of Points: The number of points in that cluster.
  • Maximal Distance to Cluster Center: The maximum of the distances between each point and the centroid of that point's cluster.
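For a single cluster, the average and maximal distances to the cluster center can be computed like this (the points and centroid below are made up for illustration):

```python
import math

# Made-up 2-D points belonging to one cluster.
points   = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (3.0, 2.5)]
centroid = (1.875, 1.75)   # the mean of the points above

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

distances = [distance(p, centroid) for p in points]
avg_dist = sum(distances) / len(distances)  # Average Distance to Cluster Center
max_dist = max(distances)                   # Maximal Distance to Cluster Center
print(avg_dist, max_dist)
```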

Tuesday, October 18, 2016

AzureML Machine Learning Models Summary

Two-Class (Binary) Classifiers

Binary classifiers learn how to predict one of two outcomes. Binary classification is always a supervised learning problem. The Scored Label is either 1 or 0. This is probably the most common type of Machine Learning algorithm.
In an AzureML binary classifier the Scored Probability is the probability that the Label should be 1. If the Scored Probability is less than 0.5 the Scored Label will be 0.
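The mapping from Scored Probability to Scored Label can be sketched in a line of Python (assuming the default 0.5 threshold; the probabilities are made up):

```python
# Made-up Scored Probabilities; a probability >= 0.5 yields label 1, else 0.
scored_probabilities = [0.92, 0.48, 0.50, 0.13]
scored_labels = [1 if p >= 0.5 else 0 for p in scored_probabilities]
print(scored_labels)  # [1, 0, 1, 0]
```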

Two-Class Boosted Decision Tree

Visualized as a large number of small trees. A Microsoft support person told me that the last tree is the one it uses, but I'm not convinced of that. The first tree usually looks pretty good.

Two-Class Boosted Decision Forest

Visualized as a small number of large trees. The last tree is the one it uses.

Can run on Live data with nulls (or NaNs) and scores records even where some fields are null/NaN.

Two-Class Bayes Point Machine

Visualized as feature-sets with weights

Two-Class Logistic Regression

Visualized as feature-sets with weights

Can run on Live data with nulls (or NaNs) but only scores records where all fields have values (i.e. are not null/NaN).

Two-Class Neural Network

Can run on Live data with nulls (or NaNs) but only scores records where all fields have values (i.e. are not null/NaN).

Two-Class Averaged Perceptron

Visualized as feature-sets with weights

Can run on Live data with nulls (or NaNs) but only scores records where all fields have values (i.e. are not null/NaN).

Multiclass Classifiers

Multiclass classifiers are always supervised learning problems.
  • Multiclass Decision Forest
  • Multiclass Decision Jungle
  • Multiclass Logistic Regression
  • Multiclass Neural Network
  • One-vs-All Multiclass

Regression

Regression models predict where a record might appear along a continuum given the supplied features. For example, predicting a house price based on features of the house. Regression problems are by nature always Supervised.

For some algorithms the scored dataset has two output columns:
  1. Scored Label Mean
  2. Scored Label Standard Deviation
The mean places each prediction on the continuum, and the standard deviation expresses the uncertainty of that prediction.

Evaluation of Regression models can be done using one or more of the following statistics
  • Negative Log Likelihood
  • Mean Absolute Error
  • Root Mean Squared Error
  • Relative Absolute Error
  • Relative Squared Error
  • Coefficient of Determination
AzureML currently supports the following Regression algorithms:-

Bayesian Linear Regression

The trained model has no useful visualization.

The “Scored Label Mean” is the prediction, and “Scored Label Standard Deviation” is the uncertainty around that prediction.

Boosted Decision Tree Regression

The trained model is visualized as 100 decision trees.

Decision Forest Regression

The trained model is visualized as 8 huge decision trees.

The “Scored Label Mean” is the prediction, and “Scored Label Standard Deviation” is the uncertainty around that prediction.

Fast Forest Quantile Regression

The trained model has no useful visualization.

Linear Regression

Neural Network Regression

The trained model has no useful visualization.

Ordinal Regression

Poisson Regression

The trained model is visualized as a series of features and weights.

Anomaly Detection

Anomaly Detection is normally unsupervised. We don't know in advance what an anomaly is, we can only train the algorithm on “normal” data. An email spam detector is a typical Anomaly Detection problem.

In AzureML Anomaly Detection the Scored Probability is the probability that the record is an Anomaly. If the Scored Probability is high the record is an anomaly; if it's low, the record closely matches the training data.
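This is not AzureML's algorithm, but the train-only-on-normal-data idea can be sketched in plain Python with a simple z-score detector (all numbers made up):

```python
import statistics

# Training data contains only "normal" records.
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu = statistics.mean(normal_data)
sigma = statistics.stdev(normal_data)

def anomaly_score(x):
    """Distance from the normal mean, in standard deviations.

    A high score means the record looks nothing like the training data."""
    return abs(x - mu) / sigma

print(anomaly_score(10.0))  # typical record: low score
print(anomaly_score(15.0))  # far-off record: high score -> anomaly
```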

AzureML supports two algorithms suitable for Anomaly Detection

One-Class SVM (Support Vector Machine)

The One Class Support Vector Machine has no useful Visualization.

A trained One-Class model will run on Live data that contains null (or NaN) fields, but it will only Score records where all fields are present (i.e. the record has no null or NaN fields)

PCA (Principal Component Analysis) -Based Anomaly Detection

The PCA Anomaly Detection Model has no useful Visualization.

A trained PCA-Based model will fail to run on Live data if the data contains null (or NaN) fields.

Clustering

Clustering Models learn how to group records into n-clusters. This can be done either Supervised or Unsupervised.

Supervised clustering is used for predicting which records should fall into which predefined categories. This is really just a Multi-Class classifier.

An unsupervised clustering algorithm will define its own categorizations, which may not correspond to anything intuitive to the human observer. In the real world these algorithms are used for Data Discovery problems such as discovering market segmentation.

K-Means Clustering
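The core K-Means loop can be sketched in plain Python on 1-D data (this is a toy illustration with made-up points, not AzureML's implementation): alternately assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```python
# Made-up 1-D data with two obvious groups, and two starting centroids.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids = [0.0, 6.0]

for _ in range(10):  # a fixed number of iterations keeps the sketch simple
    # Assignment step: put each point in the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    # (This sketch assumes no cluster ever ends up empty.)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # the centroids settle near 1.0 and 5.0
```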