preface
Today, I read the third chapter of machine learning practice: decision tree. I was very confused. I didn't understand it. There were too many professional terms, and it was a bit confusing. For a freshman who didn't learn probability theory well, I had a headache when I saw those probability formulas and some professional words of probability theory in my junior year. I immediately withdrew from the classroom. I got up early and didn't understand them for half an hour. I decided to lie back in bed again. Until I saw this blog post.
First of all, thank the blogger: Jack Cui
Home page: http://blog.csdn.net/c406495762
Decision tree blog address: https://blog.csdn.net/c406495762/article/details/75663451
The content of the book was vividly expressed in this blog post, which was easy to understand. I explained it with my own examples, which was much clearer than that in the book, so I began to learn. I am grateful and sincerely recommend it.
Most of my blog posts are excerpted from its blog posts, but I also typed them out word by word. I have also calculated the algorithm myself. It can be regarded as a summary of the blog posts after learning it. If you still can't understand it, you can look at it directly.
Tomorrow I will continue to study according to its blog.
Get to the point.
Decision tree
concept
What is a decision tree? Decision tree is a basic method of classification and regression. For an easy to understand example, the flow chart shown in the figure below is a decision tree. The rectangle represents the judgment module, and the ellipse represents the termination module, indicating that the operation can be terminated after a conclusion has been reached. The left and right arrows from the judgment module are called branches, which can reach another judgment module or termination module. We can also understand that the classification decision tree model is a tree structure that describes the classification of instances. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class. Did you cover the circle?? In the decision tree shown in the figure below, both rectangle and ellipse are nodes. Rectangular nodes belong to internal nodes, elliptical nodes belong to leaf nodes, and the left and right arrows from the nodes are directed edges. The top node is the root node of the decision tree. In this way, the node statement corresponds to the module statement, and the understanding is good.
Let's go back to the flow chart. Yes, you are right. This is a hypothetical dating object classification system. It first detects whether a blind date has a house. If you have a room, you can consider further contact with this blind date. If there is no room, observe whether the blind date is selfmotivated. If not, Say Goodbye directly. At this time, you can say, "you are very nice, but we are not suitable." If yes, you can put this blind date on the candidate list. It sounds like a candidate list. To put it mildly, it is a spare tire.
But this is just a simple dating object classification system, just a simple classification. The real situation may be much more complicated, and the considerations may be varied. Good temper? Can you cook? Would you like to do housework? How many children are there in the family? What do parents do? God, I don't want to say any more. It's terrible to think about it.
We can regard the decision tree as a set of if then rules. The process of transforming the decision tree into if then rules is as follows: each path from the root node to the leaf node of the decision tree constructs a rule; The characteristics of the internal nodes on the path correspond to the conditions of the rules, while the classes of the leaf nodes correspond to the conclusions of the rules. The path of decision tree or its corresponding if then rule set has an important property: mutual exclusion and completeness. That is, each instance is covered by a path or a rule, and only by a path or a rule. The coverage here means that the characteristics of the instance are consistent with those on the path, or the instance meets the conditions of the rule.
Using the decision tree to make predictions requires the following process:
 Collect data: you can use any method. For example, if we want to build a blind date system, we can obtain data from matchmakers or by visiting blind date objects. According to the factors they consider and the final selection results, we can get some data for us to use.
 Data preparation: we should sort out the collected data, sort out all the collected information according to certain rules, and typeset it for our subsequent processing.
 Analyze data: any method can be used. After the decision tree is constructed, we can check whether the decision tree graph meets the expectations.
 Training algorithm: this process is to construct a decision tree. It can also be called decision tree learning, which is to construct a data structure of a decision tree.
 Test algorithm: use the experience tree to calculate the error rate. When the error rate reaches the acceptable range, the decision tree can be put into use.
 Use algorithm: this step can use any supervised learning algorithm, and use the decision tree to better understand the internal meaning of the data.
Preparation for decision tree construction
The Every step of using decision tree to make prediction is very important. If the data is not collected in place, there will be insufficient features for us to build a decision tree with low error rate. If the data features are sufficient, but we do not know which features are good, we will not be able to build a decision tree model with good classification effect. From the perspective of algorithm, the construction of decision tree is our core content.
The How to build a decision tree? Generally, this process can be summarized as three steps: feature selection, decision tree generation and decision tree pruning.
1. feature selection
The Feature selection is to decide which feature to use to divide the feature space. Usually, the standard of feature selection is information gain.
What is information gain? Before and after the data set is divided, the change of information becomes the information gain. Knowing how to calculate the information gain, we can calculate the information gain obtained by dividing the data set by each eigenvalue. The feature with the highest information gain is the best choice.
The Firstly, the formula of information gain is given. The information gain g(D,A) of feature A to training data set D is defined as the difference between the empirical entropy H(D) of set D and the conditional empirical entropy H(DA) of D under the given condition of feature A, that is, g(D,A) = H(D)  H(DA).
1.1 entropy, empirical entropy and conditional entropy
Entropy is defined as the expected value of information. In information theory and probability theory, entropy is a measure of uncertainty of random variables. If the transaction to be classified may be divided into multiple classifications, the information of symbol xi is defined as:
Where p(xi) is the probability of selecting the classification. Through the above formula, we can get all kinds of information. In order to calculate entropy, we need to calculate the information expectation (mathematical expectation) contained in all possible values of all categories, which is obtained by the following formula:
Where n is the number of classifications. The greater the entropy, the greater the uncertainty of random variables.
When the probability obtained from entropy is obtained by data estimation (especially maximum likelihood estimation), the corresponding entropy is called empirical entropy.
What is estimated by data? For example, there are 10 data, and there are two categories, class A and class B. If 7 of the data belong to class A, the probability of class A is 7 in 10. If three of the data belong to class B, the probability of class B is three tenths. The simple explanation is that the probability is calculated according to the data. We define the data in the sample data table as training data set D, then the empirical entropy of training data set D is H(D), d represents its sample size and the number of samples. There are k classes Ck, k = 1,2,3, ···, K, Ck is the number of samples belonging to class Ck, and the empirical entropy formula can be written as:
The empirical entropy H(D) is calculated according to this formula.
The conditional entropy H(YX) represents the uncertainty of the random variable y under the condition that the random variable x is known. The conditional entropy H(YX) of the random variable y under the given condition of the random variable x defines the mathematical expectation of the entropy of the conditional probability distribution of Y on X under the given condition of X:
among
Similarly, when the probability in conditional entropy is obtained by data estimation (especially maximum likelihood estimation), the corresponding conditional entropy becomes conditional empirical entropy.
The Let feature a have n different values {a1,a2,..., an}, and divide data set D into n subsets D1,D2,..., Dn, di as the number of samples of subset Di according to the values of feature a. Note that the set of samples belonging to Ck (Ck is the final category) in subset Di is Dik, that is, Dik = Di ∩ Ck, and  dik  is the number of samples of Dik. So the formula of conditional entropy can be written as:
1.2 information gain
The After defining the concepts of empirical entropy and conditional entropy, we return to information gain. As mentioned earlier, how to select features depends on the information gain. That is to say, the information gain is relative to the feature. The greater the information gain, the greater the impact of the feature on the final classification result. We should choose the feature that has the greatest impact on the final classification result as our classification feature.
The So, back to the definition and formula of information gain mentioned earlier:
Give an example
1.3 examples and programming
Take the sample data sheet of loan application as an example, as shown in the following table.
ID Age Have work Own a house Credit situation Category (whether it is a loan)
one youth no no commonly no
two youth no no good no
three youth yes no good yes
four youth yes yes commonly yes
five youth no no commonly no
six middle age no no commonly no
seven middle age no no good no
eight middle age yes yes good yes
nine middle age no yes Very good yes
ten middle age no yes Very good yes
eleven old age no yes Very good yes
twelve old age no yes good yes
thirteen old age yes no good yes
fourteen old age yes no Very good yes
fifteen old age no no commonly no
The Purpose: we hope to learn a decision tree of loan application through the given training data to classify future loan applications, that is, when a new customer applies for a loan, we can use the decision tree to decide whether to approve the loan application according to the characteristics of the applicant.
The First, we calculate the information gain of a feature:
First, calculate the information gain of the feature of age, so that the feature of age is A1. According to the formula
That is, the difference between empirical entropy and conditional empirical entropy. First, according to the provided training data, calculate the empirical entropy of training set D:
 Empirical entropy
Using Python3 to write code and calculate empirical entropy:
Before writing the code, you need to label the training data.
 Age: 0 for youth, 1 for middle age, 2 for old age;
 Yes: 0 means no, 1 means yes;
 Have your own house: 0 means no, 1 means yes;
 Credit: 0 means average, 1 means good, 2 means very good;
 Category (loan or not): no means no, yes means yes.
After these are determined, we can create a data set and calculate the empirical entropy. The code is written as follows:
# * coding: UTF8 * from math import log """ Function description:Create test data set Parameters: nothing Returns: dataSet  data set labels  Classification properties Author: Jack Cui Modify: 20170720 """ #Create workout data def createDataSet(): dataSet = [[0, 0, 0, 0, 'no'], #data set [0, 0, 0, 1, 'no'], [0, 1, 0, 1, 'yes'], [0, 1, 1, 0, 'yes'], [0, 0, 0, 0, 'no'], [1, 0, 0, 0, 'no'], [1, 0, 0, 1, 'no'], [1, 1, 1, 1, 'yes'], [1, 0, 1, 2, 'yes'], [1, 0, 1, 2, 'yes'], [2, 0, 1, 2, 'yes'], [2, 0, 1, 1, 'yes'], [2, 1, 0, 1, 'yes'], [2, 1, 0, 2, 'yes'], [2, 0, 0, 0, 'no']] labels = ['Age', 'Have work', 'Own a house', 'Credit situation']#Classification properties return dataSet, labels#Return dataset and classification properties """ Function description:Calculate the empirical entropy of a given data set(Shannon entropy) Parameters: dataSet  data set Returns: shannonEnt  Empirical entropy(Shannon entropy) Author: Jack Cui Modify: 20170329 """ def calcShannonEnt(dataSet): numEntires = len(dataSet) #Returns the number of rows in the dataset labelCounts = {} #Save a dictionary of occurrences of each label for featVec in dataSet: #Statistics for each group of eigenvectors currentLabel = featVec[1] #Extract label information if currentLabel not in labelCounts.keys(): #If the label is not put into the dictionary for counting times, add it labelCounts[currentLabel] = 0 labelCounts[currentLabel] += 1 #Label count shannonEnt = 0.0 #Empirical entropy (Shannon entropy) for key in labelCounts: #Calculate Shannon entropy prob = float(labelCounts[key]) / numEntires #Probability of selecting the label shannonEnt = prob * log(prob, 2) #Calculation by formula return shannonEnt #Return empirical entropy (Shannon entropy) if __name__ == '__main__': dataSet, features = createDataSet() print(dataSet) print(calcShannonEnt(dataSet))

After the empirical entropy calculation is completed, the conditional empirical entropy is calculated (all under the condition that the eigenvalue is the age):
Take the sample data sheet of loan application as an example. Look at the data in the age column, that is, feature A1. There are three categories: youth, middle age and old age. We only look at the data that age is youth. There are five data that age is youth, so the probability of the data that age is youth appearing in the training data set is five fifths, that is, one third. Similarly, the probability that the data of middle age and old age appear in the training data set is also one third. Now we only look at the data that age is youth, and the probability of getting the loan is two fifths, because only two of the five data show that we have got the loan. Similarly, the probability of getting the loan for the data that age is middle age and old age is three fifths and four fifths respectively.
The specific calculation process is shown in the following figure: the calculation result is 0.888

After calculating the conditional empirical entropy, the information gain can be calculated
The information gain calculated by taking the eigenvalue A1 as an example can be directly obtained from the above, and the result is 0.9710.888 = 0.083.
Similarly, the information gains g(D,A2), g(D,A3) and g(D,A4) of other features can be calculated. As follows:
Finally, from the information gain results of each eigenvalue, the information gain value of feature A3 (with its own house) is the largest, so A3 is selected as the optimal feature.

Write code to calculate information gain
After learning to calculate the information gain through the formula, you should start to write code to calculate the information gain and select the optimal eigenvalue.
The codes are as follows:
# * coding: UTF8 * from math import log """ Function description:Calculate the empirical entropy of a given data set(Shannon entropy) Parameters: dataSet  data set Returns: shannonEnt  Empirical entropy(Shannon entropy) Author: Jack Cui Modify: 20170329 """ def calcShannonEnt(dataSet): numEntires = len(dataSet) #Returns the number of rows in the dataset labelCounts = {} #Save a dictionary of occurrences of each label for featVec in dataSet: #Statistics for each group of eigenvectors currentLabel = featVec[1] #Extract label information if currentLabel not in labelCounts.keys(): #If the label is not put into the dictionary for counting times, add it labelCounts[currentLabel] = 0 labelCounts[currentLabel] += 1 #Label count shannonEnt = 0.0 #Empirical entropy (Shannon entropy) for key in labelCounts: #Calculate Shannon entropy prob = float(labelCounts[key]) / numEntires #Probability of selecting the label shannonEnt = prob * log(prob, 2) #Calculation by formula return shannonEnt #Return empirical entropy (Shannon entropy) """ Function description:Create test data set Parameters: nothing Returns: dataSet  data set labels  Classification properties Author: Jack Cui Modify: 20170720 """ def createDataSet(): dataSet = [[0, 0, 0, 0, 'no'], #data set [0, 0, 0, 1, 'no'], [0, 1, 0, 1, 'yes'], [0, 1, 1, 0, 'yes'], [0, 0, 0, 0, 'no'], [1, 0, 0, 0, 'no'], [1, 0, 0, 1, 'no'], [1, 1, 1, 1, 'yes'], [1, 0, 1, 2, 'yes'], [1, 0, 1, 2, 'yes'], [2, 0, 1, 2, 'yes'], [2, 0, 1, 1, 'yes'], [2, 1, 0, 1, 'yes'], [2, 1, 0, 2, 'yes'], [2, 0, 0, 0, 'no']] labels = ['Age', 'Have work', 'Own a house', 'Credit situation'] #Classification properties return dataSet, labels #Return dataset and classification properties """ Function description:Divide the data set according to the given characteristics Parameters: dataSet  Data set to be divided axis  Characteristics of partitioned datasets value  Value of the characteristic to be returned Returns: nothing Author: Jack Cui Modify: 20170330 """ def splitDataSet(dataSet, axis, value): retDataSet = [] #Create a list of returned datasets for featVec in dataSet: #Traversal dataset if featVec[axis] == value: reducedFeatVec = featVec[:axis] #Remove axis feature reducedFeatVec.extend(featVec[axis+1:]) #Add qualified to the returned dataset retDataSet.append(reducedFeatVec) return retDataSet #Returns the partitioned dataset """ Function description:Select optimal feature Parameters: dataSet  data set Returns: bestFeature  Maximum information gain(optimal)Index value of feature Author: Jack Cui Modify: 20170330 """ def chooseBestFeatureToSplit(dataSet): numFeatures = len(dataSet[0])  1 #Number of features baseEntropy = calcShannonEnt(dataSet) #Calculate Shannon entropy of data set bestInfoGain = 0.0 #information gain bestFeature = 1 #Index value of optimal feature for i in range(numFeatures): #Traverse all features #Get all the ith features of the dataSet and save them in the featList featList = [example[i] for example in dataSet] #print(featList)#List of 15 characteristic values for each characteristic uniqueVals = set(featList) #Create set set {}, element cannot be repeated #print(uniqueVals)#Remove duplicates newEntropy = 0.0 #Empirical conditional entropy for value in uniqueVals: #Calculate information gain subDataSet = splitDataSet(dataSet, i, value) #Sub dataset partitioned subset prob = len(subDataSet) / float(len(dataSet)) #Calculate the probability of subsets newEntropy += prob * calcShannonEnt(subDataSet) #Calculating empirical conditional entropy according to formula infoGain = baseEntropy  newEntropy #information gain print("Section%d The gain of the features is%.3f" % (i, infoGain)) #Print information gain for each feature if (infoGain > bestInfoGain): #Calculate information gain bestInfoGain = infoGain #Update the information gain to find the maximum information gain bestFeature = i #Record the index value of the feature with the largest information gain return bestFeature #Returns the index value of the feature with the largest information gain if __name__ == '__main__': dataSet, features = createDataSet() print("Optimal feature index value:" + str(chooseBestFeatureToSplit(dataSet)))
The test results are as follows:
The gain of the 0th feature is 0.083
Gain of the first feature is 0.324
Gain of the second feature is 0.420
Gain of the third feature is 0.363
Optimal feature index value: 2