Uncertainty Estimates from Classifiers
Another useful part of the scikit-learn interface that we haven't talked about yet is the ability of classifiers to provide uncertainty estimates for their predictions. Often you are interested not only in which class a classifier predicts for a test point, but also in how confident it is in that prediction. In practice, different kinds of mistakes lead to very different outcomes in real-world applications. Imagine a medical application that tests for cancer. A false positive prediction may only lead to the patient undergoing additional tests, but a false negative prediction may leave a serious disease untreated.
There are two functions in scikit-learn that can be used to obtain uncertainty estimates from classifiers: decision_function and predict_proba. Most (but not all) classifiers have at least one of them, and many have both. Let's build a GradientBoostingClassifier, which has both a decision_function and a predict_proba method, to see what these two functions do on a synthetic two-dimensional dataset:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
import numpy as np
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

# For illustration purposes, we rename the two classes "blue" and "red"
y_named = np.array(["blue", "red"])[y]

# We can call train_test_split with any number of arrays;
# all of them are split in the same way
X_train, X_test, y_train_named, y_test_named, y_train, y_test = \
    train_test_split(X, y_named, y, random_state=0)

# Build the gradient boosting model
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)
1. The Decision Function
In the binary classification case, the return value of decision_function has shape (n_samples,), and it contains one floating-point number for each sample:
print("X_test.shape: {}".format(X_test.shape))
# X_test.shape: (25, 2)
print("Decision function shape: {}".format(gbrt.decision_function(X_test).shape))
# Decision function shape: (25,)
This value encodes how strongly the model believes that a data point belongs to the "positive" class, in this case class 1. Positive values indicate a preference for the positive class, and negative values indicate a preference for the "negative" (other) class:
# Show the first few entries of decision_function
print("Decision function:\n{}".format(gbrt.decision_function(X_test)[:6]))
'''
Decision function:
[ 4.13592603 -1.70169917 -3.95106099 -3.62609552  4.28986642  3.66166081]
'''
We can recover the predictions by looking only at the sign of the decision function:
print("Thresholded decision function:\n{}".format(gbrt.decision_function(X_test) > 0))
'''
Thresholded decision function:
[ True False False False  True  True False  True  True  True False  True
  True False  True False False False  True  True  True  True  True False
 False]
'''
print("Predictions:\n{}".format(gbrt.predict(X_test)))
'''
Predictions:
['red' 'blue' 'blue' 'blue' 'red' 'red' 'blue' 'red' 'red' 'red' 'blue'
 'red' 'red' 'blue' 'red' 'blue' 'blue' 'blue' 'red' 'red' 'red' 'red'
 'red' 'blue' 'blue']
'''
For binary classification, the "negative" class is always the first entry of the classes_ attribute, and the "positive" class is the second entry. So if you want to fully reproduce the output of predict, you need to use the classes_ attribute:
# Convert the Boolean True/False values to 0 and 1
greater_zero = (gbrt.decision_function(X_test) > 0).astype(int)
# Use 0 and 1 as indices into classes_
pred = gbrt.classes_[greater_zero]
# pred is the same as the output of gbrt.predict
print("pred is equal to predictions: {}".format(np.all(pred == gbrt.predict(X_test))))
# pred is equal to predictions: True
The range of decision_function can be arbitrary, and depends on the data and the model parameters:
decision_function = gbrt.decision_function(X_test)
print("Decision function minimum: {:.2f} maximum: {:.2f}".format(
    np.min(decision_function), np.max(decision_function)))
# Decision function minimum: -7.69 maximum: 4.29
Because it can be scaled arbitrarily, the output of decision_function is often hard to interpret.
In the following example, we plot the decision_function for all points in the two-dimensional plane using a color coding, next to a visualization of the decision boundary. We show training points as circles and test data as triangles:
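One way to make the raw scores more tangible is to compare them with the probabilities that the same model reports through predict_proba. The sketch below rests on an assumption that is not stated in the text above: for a binary GradientBoostingClassifier with its default loss, the probability of the positive class is the logistic sigmoid of the decision function. It refits the model on the integer labels and checks that relationship numerically. Other classifiers map scores to probabilities differently, or not at all.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid, 1 / (1 + exp(-x))
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same synthetic data and split settings as in the text
X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Raw, unbounded scores for the positive class
scores = gbrt.decision_function(X_test)

# Assumption: default loss, so the sigmoid maps scores to probabilities
proba_positive = expit(scores)

matches = np.allclose(proba_positive, gbrt.predict_proba(X_test)[:, 1])
print("sigmoid(decision_function) matches predict_proba[:, 1]: {}".format(matches))
```

A score of 0 corresponds to a probability of 0.5, which is why the sign of the decision function is enough to recover the predicted class.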
import matplotlib.pyplot as plt
import mglearn

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4,
                                fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1],
                                            alpha=.4, cm=mglearn.ReBl)

for ax in axes:
    # Plot training and test points
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                             markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                             markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
cbar = plt.colorbar(scores_image, ax=axes.tolist())
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1))
plt.show()
Encoding not only the predicted outcome but also how certain the classifier is provides additional information. However, in this visualization it is hard to make out the boundary between the two classes.
2. Predicted Probabilities
The output of predict_proba is a probability for each class, and it is usually easier to understand than the output of decision_function. For binary classification, its shape is always (n_samples, 2):
print("Shape of probabilities: {}".format(gbrt.predict_proba(X_test).shape))
# Shape of probabilities: (25, 2)
The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class. Because the output of predict_proba is a probability, it is always between 0 and 1, and the entries for the two classes always sum to 1:
# Show the first few entries of predict_proba
print("Predicted probabilities:\n{}".format(gbrt.predict_proba(X_test[:6])))
'''
Predicted probabilities:
[[0.01573626 0.98426374]
 [0.84575653 0.15424347]
 [0.98112869 0.01887131]
 [0.97407033 0.02592967]
 [0.01352142 0.98647858]
 [0.02504637 0.97495363]]
'''
Because the probabilities of the two classes sum to 1, only one of the classes can have a probability above 50%. That class is the one the model predicts.
As you can see from the previous output, the classifier is relatively certain for most points. How well the reported uncertainty reflects the actual uncertainty in the data depends on the model and its parameters. A model that is more overfitted tends to make more certain predictions, even when they are wrong. A model with lower complexity usually has more uncertainty in its predictions. If the uncertainty a model reports matches how often it is actually correct, the model is called calibrated. In a calibrated model, a prediction made with 70% certainty is correct 70% of the time.
In the following example, we again show the decision boundary on the dataset, next to the class probabilities for class 1:
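The 50% rule can be checked directly, and it also suggests how the probabilities could be used when the two kinds of error have different costs, as in the cancer example at the start of this section. The following is a minimal sketch: it refits the model on the integer labels, reproduces predict by thresholding the class 1 probability at 0.5, and then applies a lower threshold. The value 0.3 is a hypothetical choice made only for illustration, not a recommendation from the text.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same synthetic data and split settings as in the text
X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Estimated probability of the positive class, classes_[1]
proba_class1 = gbrt.predict_proba(X_test)[:, 1]

# Thresholding at 0.5 reproduces the output of predict
default_pred = gbrt.classes_[(proba_class1 > 0.5).astype(int)]
same_as_predict = np.array_equal(default_pred, gbrt.predict(X_test))
print("Threshold 0.5 reproduces predict: {}".format(same_as_predict))

# Hypothetical lower threshold: predict class 1 more readily, trading
# additional false positives for fewer false negatives
threshold = 0.3
cautious_pred = gbrt.classes_[(proba_class1 > threshold).astype(int)]
print("Class 1 predictions at 0.5: {}".format((default_pred == 1).sum()))
print("Class 1 predictions at {}: {}".format(threshold, (cautious_pred == 1).sum()))
```

Lowering the threshold can only add points to the positive class, never remove them, so the second count is at least as large as the first.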
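Calibration can be inspected by grouping predictions into bins of similar predicted probability and comparing each bin's average prediction with the fraction of points in it that actually belong to class 1. The sketch below uses calibration_curve from sklearn.calibration for this. As an assumption that departs from the text, it draws a larger sample (n_samples=2000) so that each bin holds enough test points to be meaningful, and the choice of five bins is arbitrary.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: a larger sample than in the text, so that bins are not nearly empty
X, y = make_circles(n_samples=2000, noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

proba_class1 = gbrt.predict_proba(X_test)[:, 1]

# For each probability bin: observed fraction of class 1, mean predicted probability
frac_positive, mean_predicted = calibration_curve(y_test, proba_class1, n_bins=5)

for predicted, observed in zip(mean_predicted, frac_positive):
    print("mean predicted probability {:.2f} -> observed fraction of class 1 {:.2f}".format(
        predicted, observed))
```

For a well-calibrated model the two numbers on each line are close. Large gaps indicate that the model is more, or less, certain than the data justifies.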
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(
    gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(
    gbrt, X, ax=axes[1], alpha=.5, cm=mglearn.ReBl, function='predict_proba')

for ax in axes:
    # Plot training and test points
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                             markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                             markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
cbar = plt.colorbar(scores_image, ax=axes.tolist())
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1))
plt.show()
The boundaries in this picture are clearer, and small areas of uncertainty are clearly visible.
The scikit-learn website includes a comparison of many models and what their uncertainty estimates look like.
3. Uncertainty in Multiclass Classification
So far, we have only discussed uncertainty estimates in binary classification. But decision_function and predict_proba also work in the multiclass setting. Let's apply them to the Iris dataset, which is a three-class classification dataset:
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)
print("Decision function shape: {}".format(gbrt.decision_function(X_test).shape))
# Decision function shape: (38, 3)

# Show the first few entries of the decision function
print("Decision function:\n{}".format(gbrt.decision_function(X_test)[:6, :]))
'''
Decision function:
[[-1.995715    0.04758267 -1.92720695]
 [ 0.06146394 -1.90755736 -1.92793758]
 [-1.99058203 -1.87637861  0.09686725]
 [-1.995715    0.04758267 -1.92720695]
 [-1.99730159  0.13469108 -1.20341483]
 [ 0.06146394 -1.90755736 -1.92793758]]
'''
In the multiclass case, decision_function has shape (n_samples, n_classes), and each column provides a "certainty score" for one class. A large score means that a class is more likely, and a small score means that it is less likely. You can recover the predictions from these scores by finding the largest entry for each data point:
print("Argmax of decision function:\n{}".format(
    np.argmax(gbrt.decision_function(X_test), axis=1)))
'''
Argmax of decision function:
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
'''
print("Predictions:\n{}".format(gbrt.predict(X_test)))
'''
Predictions:
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
'''
The output of predict_proba has the same shape, (n_samples, n_classes). Again, the probabilities of the possible classes for each data point sum to 1:
# Show the first few entries of predict_proba
print("Predicted probabilities:\n{}".format(gbrt.predict_proba(X_test)[:6]))
'''
Predicted probabilities:
[[0.10217718 0.78840034 0.10942248]
 [0.78347147 0.10936745 0.10716108]
 [0.09818072 0.11005864 0.79176065]
 [0.10217718 0.78840034 0.10942248]
 [0.10360005 0.66723901 0.22916094]
 [0.78347147 0.10936745 0.10716108]]
'''
# Show that each row sums to 1
print("Sums: {}".format(gbrt.predict_proba(X_test)[:6].sum(axis=1)))
# Sums: [1. 1. 1. 1. 1. 1.]
Again, we can recover the predictions by computing the argmax of predict_proba:
print("Argmax of predicted probabilities:\n{}".format(
    np.argmax(gbrt.predict_proba(X_test), axis=1)))
'''
Argmax of predicted probabilities:
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
'''
print("Predictions:\n{}".format(gbrt.predict(X_test)))
'''
Predictions:
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
'''
In summary, predict_proba and decision_function always have shape (n_samples, n_classes), apart from decision_function in the special case of binary classification. In the binary case, decision_function has only one column, corresponding to the positive class, classes_[1]. This is mostly for historical reasons.
If there are n_classes columns, you can recover the predictions by computing the argmax across the columns. Be careful, though, if your classes are strings, or if they are integers that are not consecutive and starting from 0. If you want to compare the results of predict with the results of decision_function or predict_proba, make sure to use the classes_ attribute of the classifier to get the actual class names:
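The shape rule is easy to verify side by side. The sketch below is an illustration under assumptions that are not in the text: it uses LogisticRegression on the full Iris data rather than the gradient boosting model, builds a binary problem by keeping only classes 0 and 1, and sets max_iter=1000 simply to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Assumption for illustration: a binary problem made from classes 0 and 1 only
binary_mask = y < 2
X_binary, y_binary = X[binary_mask], y[binary_mask]

binary_clf = LogisticRegression(max_iter=1000).fit(X_binary, y_binary)
multi_clf = LogisticRegression(max_iter=1000).fit(X, y)

shapes = {
    "binary decision_function": binary_clf.decision_function(X_binary).shape,
    "binary predict_proba": binary_clf.predict_proba(X_binary).shape,
    "multiclass decision_function": multi_clf.decision_function(X).shape,
    "multiclass predict_proba": multi_clf.predict_proba(X).shape,
}
for name, shape in shapes.items():
    print("{}: {}".format(name, shape))
# binary decision_function: (100,)
# binary predict_proba: (100, 2)
# multiclass decision_function: (150, 3)
# multiclass predict_proba: (150, 3)
```

Only the binary decision_function drops the class dimension. Code that handles both settings generically has to treat that case separately.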
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

# Represent each target by its class name in the Iris dataset
named_target = iris.target_names[y_train]
logreg.fit(X_train, named_target)
print("unique classes in training data: {}".format(logreg.classes_))
# unique classes in training data: ['setosa' 'versicolor' 'virginica']

print("predictions: {}".format(logreg.predict(X_test)[:10]))
'''
predictions: ['versicolor' 'setosa' 'virginica' 'versicolor' 'versicolor'
 'setosa' 'versicolor' 'virginica' 'versicolor' 'versicolor']
'''
argmax_dec_func = np.argmax(logreg.decision_function(X_test), axis=1)
print("argmax of decision function: {}".format(argmax_dec_func[:10]))
# argmax of decision function: [1 0 2 1 1 0 1 2 1 1]

print("argmax combined with classes_: {}".format(
    logreg.classes_[argmax_dec_func][:10]))
'''
argmax combined with classes_: ['versicolor' 'setosa' 'virginica' 'versicolor'
 'versicolor' 'setosa' 'versicolor' 'virginica' 'versicolor' 'versicolor']
'''
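The same caveat applies to integer labels that do not start at 0 or are not consecutive. The sketch below uses a hypothetical relabeling of the Iris targets to 10, 20 and 30, which is not part of the original example, and fits on the full dataset with max_iter=1000 to avoid convergence warnings. It shows that the argmax yields column indices, which only become labels after indexing into classes_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Hypothetical relabeling for illustration: 0 -> 10, 1 -> 20, 2 -> 30
shifted_target = np.array([10, 20, 30])[iris.target]

logreg = LogisticRegression(max_iter=1000).fit(iris.data, shifted_target)
print("classes_: {}".format(logreg.classes_))

# The argmax is a column index (0, 1 or 2), not a class label
column_index = np.argmax(logreg.decision_function(iris.data), axis=1)
print("column indices found: {}".format(np.unique(column_index)))

# Indexing into classes_ turns column indices into the labels used by predict
labels = logreg.classes_[column_index]
print("matches predict: {}".format(
    np.array_equal(labels, logreg.predict(iris.data))))
```

Comparing column_index directly with the output of predict would report a mismatch for every sample here, even though the two encode the same predictions.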
4. Summary
This chapter began with a discussion of model complexity, followed by a discussion of generalization, that is, learning a model that performs well on new, previously unseen data. This led to the concepts of underfitting and overfitting: underfitting describes a model that cannot capture the variation present in the training data, while overfitting describes a model that focuses too much on the training data and generalizes poorly to new data.
The chapter then discussed a range of machine learning models for classification and regression, their advantages and disadvantages, and how to control their model complexity. We saw that for many algorithms, setting the right parameters is critical to good performance. Some algorithms are also sensitive to how the input data is represented, in particular to how the features are scaled. Blindly applying an algorithm to a dataset, without understanding the assumptions the model makes and the meaning of its parameter settings, is therefore unlikely to produce an accurate model.
This chapter contains a lot of information about the algorithm, and you don't have to remember all these details before continuing to read the following chapters. However, some of the knowledge about models mentioned here (and which models to use in specific situations) is important for the successful application of machine learning models in practice. Below is a quick summary of when and which model to use.

Nearest neighbors
A good baseline model for small datasets, and easy to explain.

Linear models
A reliable first algorithm to try. Good for very large datasets and for high-dimensional data.

Naive Bayes
Applicable only to classification. Even faster than linear models, and suitable for very large datasets and high-dimensional data. Often less accurate than linear models.

Decision trees
Very fast, need no scaling of the data, can be visualized, and are easy to explain.

Random forests
Nearly always perform better than a single decision tree, and are very robust and powerful. Need no scaling of the data. Not suitable for high-dimensional sparse data.

Gradient boosted decision trees
Often slightly more accurate than random forests. Slower to train than random forests, but faster to predict and smaller in memory. Need more parameter tuning than random forests.

Support vector machines
Powerful for medium-sized datasets whose features have similar meaning. Require scaling of the data and are sensitive to parameters.

Neural networks
Can build very complex models, particularly for large datasets. Sensitive to scaling of the data and to the choice of parameters. Large models take a long time to train.

When working with a new dataset, it is usually best to start with a simple model, such as a linear model, naive Bayes, or a nearest neighbors classifier, and see how far you can get. Once you understand the data better, you can consider an algorithm that builds more complex models, such as random forests, gradient boosted decision trees, SVMs, or neural networks.
You should now have a good understanding of how to apply, tune, and analyze the models we have introduced. This chapter focused on binary classification, which is usually the easiest case to understand. However, most of the algorithms in this chapter can be used for both classification and regression, and all of the classification algorithms support both binary and multiclass classification. Try applying these algorithms to scikit-learn's built-in datasets, such as the boston_housing or diabetes datasets for regression, or the digits dataset for multiclass classification. Experimenting with the algorithms on different datasets will give you a better feel for how long they take to train, how easy the models are to analyze, and how sensitive they are to the representation of the data.
While we analyzed the effect of different parameter settings on the algorithms, building a model that actually generalizes well to new data in a production environment is somewhat more involved.
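As a starting point for such experiments, the sketch below compares two simple models with one more complex model on the digits dataset, using 5-fold cross-validation. The particular models, the scaling step in front of the linear model, and settings such as max_iter=1000 and n_estimators=100 are assumptions chosen for illustration, not recommendations from this chapter.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Two simple baselines first, then a more complex model
models = {
    "naive Bayes": GaussianNB(),
    # Linear models are sensitive to feature scaling, so scale inside a pipeline
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print("{}: mean accuracy {:.3f}".format(name, mean_scores[name]))
```

Swapping in another dataset loader or another estimator only changes one line each, which makes it cheap to try several combinations.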