Decision trees are discriminative models that can solve both classification and regression problems. The classification tree makes decisions on discrete variables, and the regression tree makes decisions on continuous variables; its representation in space is similar to splitting in different dimensions. (The decision tree can be a binary tree or a non-binary tree)

Three parts of decision tree construction: feature selection, decision tree generation, pruning

## ID3 tree

ID3 uses information gain as a measure of node split selection features.

The concept of entropy: derived from Shannon's information theory, a measure used to describe the degree of information chaos.

Information entropy formula: *Entropy=-p*logp

Information gain = experience entropy of the original data set - deconditioning experience entropy

python code:

from math import log import operator def createDataSet(): dataSet =[[1,1,'yes'], [1,1,'yes'], [1,0,'no'], [0,1,'no'], [0,1,'no']] labels = ['no surfacing','flippers'] #Classified properties return dataSet,labels def calcShannonEnt(dataSet): numEntries=len(dataSet) ##Count the number of different label s labelCounts={} for featVec in dataSet: currentLabel=featVec[-1] if currentLabel not in labelCounts: labelCounts[currentLabel]=0 labelCounts[currentLabel]+=1 shannonEnt=0.0 for key in labelCounts: prob=float(labelCounts[key])/numEntries shannonEnt-=prob*log(prob,2) return shannonEnt #Divide the data set, the three parameters are the data set with division, the characteristics of the divided data set, and the return value of the characteristics def splitDataSet(dataSet,axis,value): retDataSet=[] ##Remove the feature in the axis column for featVec in dataSet: if featVec[axis]==value: reducedFeatVec=featVec[:axis] reducedFeatVec.extend(featVec[axis+1:]) retDataSet.append(reducedFeatVec) return retDataSet def chooseBestFeatureToSplit(dataSet): numFea=len(dataSet[0])-1 baseEntropy=calcShannonEnt(dataSet) bestInfoGain=0.0 beatFeature=-1 ##Pick the feature with the best information gain among all the features for i in range(numFea): fea_list=set([fea[i] for fea in dataSet]) newEntropy=0.0 for fea in fea_list: subDataSet=splitDataSet(dataSet,i,fea) ###The probability that the different values corresponding to fea account for the total data volume prob=len(subDataSet)/float(len(dataSet)) newEntropy+=prob*calcShannonEnt(subDataSet) infoGain=baseEntropy-newEntropy if bestInfoGain>infoGain: bestInfoGain=infoGain bestFea=i return besFea ##Count the categories in the classList and return the one with the most statistics def majorityCnt(classList): classCount={} for vote in classList: if vote not in classCount.keys(): classCount[vote]=0 classCount[vote]+=1 sortedClassCount=sorted(classCount.iteritems(),key=operator.itemgetter(1),reverse=True) return sortedClassCount[0][0] def createTree(dataSet,labels): classList=[example[-1] for example in dataSet]##type of data if classList.count(classList[0])==len(classList):##If there is only one category left, return the category return classList[0] if len(dataSet[0])==1:##If the data set is split according to the features, there is only one feature left in the final data set return majorityCnt(classList) bestFeat=chooseBestFeatureToSplit(dataSet) bestFeatLabel=labels[bestFeat] myTree={bestFeatLabel:{}} del (labels[bestFeat]) featValues=[example[bestFeat] for example in dataSet] uniqueVals=set(featValues) for value in uniqueVals: subLabels=labels[:] ##Recursively call the create decision tree function myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels) return myTree if __name__ == '__main__': dataSet,labels=createDataSet() print(createTree(dataSet,labels))

Advantages and disadvantages of decision trees: decision trees are suitable for numerical types, read data sets, and learn the implied rules from the continuous splitting process. Decision trees are computationally inexpensive, easy to use, and efficient. Decision trees can handle data with unrelated characteristics, and can easily construct understandable rules, which are often easy to interpret and understand. Decision tree models also have some disadvantages, such as difficulty in dealing with missing data, overfitting, and ignoring the correlation between attributes in the dataset, etc.

## C4.5 tree

C4.5 has improved a lot compared to ID3

- Using the information gain rate to select attributes overcomes the disadvantage of choosing attributes with more values when using information gain;
- pruning during tree construction;
- Able to complete discretization of continuous attributes;
- Able to handle incomplete data.

Python code: Here only the code for calculating feature information entropy and calculating information gain ratio is added in the method of chooseBestFeatureToSplit. For the specific method of processing continuous values, please refer to: https://blog.csdn.net/leaf_zizi/article/details/83105836

def chooseBestFeatureToSplit(dataSet): numFea=len(dataSet[0])-1 baseEntropy=calcShannonEnt(dataSet) bestInfoGain=0.0 beatFeature=-1 ##Pick the feature with the best information gain among all the features for i in range(numFea): fea_list=set([fea[i] for fea in dataSet]) infoGainRatio=0 newEntropy=0.0 fea_entro=0.0 for fea in fea_list: subDataSet=splitDataSet(dataSet,i,fea) ###The probability that the different values corresponding to fea account for the total data volume prob=len(subDataSet)/float(len(dataSet)) newEntropy+=prob*calcShannonEnt(subDataSet) ##Calculate the entropy of the feature fea fea_entro-=prob*log(prob,2) infoGain=baseEntropy-newEntropy newInfoRatio=infoGain/fea_entro if newInfoRatio>infoGainRatio: newInfoRatio=infoGainRatio bestFea=i return bestFea

## CART tree

Classification and regression tree (classification and regression tree, CART), cart tree is a binary tree, which is the conditional probability distribution of output random variable Y under the condition of given random variable X.

Strategy: Regression tree uses squared error minimization, classification tree uses Gini index minimization criterion

How is the final output y obtained in the regression tree?

In the region R divided by the input space, the mean value of the output y corresponding to all input instances x

The construction process of the least squares regression tree:

- Traverse all input variables, find the optimal segmentation variable (according to the following formula), use variable j to divide the input space into two regions R1,R2, and the output of y in the corresponding region is the mean value of the input instances in the region;

- According to 1, the segmentation variable j and the segmentation point s are obtained, and the corresponding output value is determined by dividing the area;
- Continue to repeat steps 1 and 2 until the conditions are met to stop;
- Divide the input space into M regions and generate a decision tree

Classification tree construction: slightly

How the cart classification tree handles continuous values:

For the processing of continuous values of the CART classification tree, the idea is the same as C4.5, which discretizes continuous features. The only difference is that the measurement methods are different when selecting the division point. C4.5 uses the information gain ratio, while the CART classification tree uses the Gini coefficient.

The specific idea is as follows. For example, there are m continuous features A of m samples, and they are arranged as a1,a2,...,am from small to large. Then the CART algorithm takes the average of the values of two adjacent samples, and obtains a total of m-1 divisions point, where the i-th dividing point Ti is expressed as: Ti=(ai+ai+1)/2. For these m-1 points, calculate the Gini coefficient when this point is used as a binary classification point. Select the point with the smallest Gini coefficient as the binary discrete classification point of the continuous feature. For example, the point with the smallest Gini coefficient is at, then the value smaller than at is category 1, and the value greater than at is category 2, so that we can achieve the discretization of continuous features.

Recommend Liu Jianping's blog garden, the analysis is very thorough: https://www.cnblogs.com/pinard/p/6053344.html