(Li Hang, Statistical Learning Methods) Decision Tree Python Implementation

A decision tree is a discriminative model that can solve both classification and regression problems. A classification tree predicts a discrete target variable, while a regression tree predicts a continuous one; geometrically, the tree corresponds to recursively splitting the input space along different dimensions. (A decision tree can be binary or non-binary.)
Decision tree construction has three parts: feature selection, tree generation, and pruning.

ID3 tree

ID3 uses information gain as the measure for selecting the splitting feature at each node.
The concept of entropy comes from Shannon's information theory; it is a measure of the uncertainty (disorder) of information.
Information entropy formula: H(X) = -∑ p_i log p_i
Information gain = empirical entropy of the original data set - empirical conditional entropy of the data set given the feature
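
In the book's notation, with data set D, classes C_k, and feature A splitting D into subsets D_1, ..., D_n, these quantities are:

H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)

g(D, A) = H(D) - H(D|A)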
Python code:

from math import log
import operator

def createDataSet():
    dataSet =[[1,1,'yes'],
              [1,1,'yes'],
              [1,0,'no'],
              [0,1,'no'],
              [0,1,'no']]
    labels = ['no surfacing','flippers'] #Feature names
    return dataSet,labels
def calcShannonEnt(dataSet):
    numEntries=len(dataSet)
    ##Count the number of occurrences of each class label
    labelCounts={}
    for featVec in dataSet:
        currentLabel=featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    shannonEnt=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        shannonEnt-=prob*log(prob,2)
    return shannonEnt
#Split the data set: the three parameters are the data set to be split, the index (axis) of the splitting feature, and the value of that feature to keep
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    ##Keep the samples whose axis-th feature equals value, and remove that feature column
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFea=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)
    bestInfoGain=0.0
    bestFea=-1
    ##Pick the feature with the best information gain among all the features
    for i in range(numFea):
        fea_list=set([fea[i] for fea in dataSet])
        newEntropy=0.0
        for fea in fea_list:
            subDataSet=splitDataSet(dataSet,i,fea)
            ##The proportion of the data set that takes this value of feature i
            prob=len(subDataSet)/float(len(dataSet))
            newEntropy+=prob*calcShannonEnt(subDataSet)
        infoGain=baseEntropy-newEntropy
        if infoGain>bestInfoGain:
            bestInfoGain=infoGain
            bestFea=i
    return bestFea
##Count the classes in classList and return the most frequent one
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return  sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]##class labels of all samples
    if classList.count(classList[0])==len(classList):##If there is only one category left, return the category
        return classList[0]
    if len(dataSet[0])==1:##If all features have been used up and only the label column remains, return the majority class
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    del (labels[bestFeat])
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLabels=labels[:]
        ##Recursively call the create decision tree function
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

if __name__ == '__main__':
    dataSet,labels=createDataSet()
    print(createTree(dataSet,labels))
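
Running this script should print a nested dictionary representing the learned tree; for this toy data set the expected output is:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

Each key is a feature name or a feature value, and each leaf is a class label.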

Advantages and disadvantages of decision trees: decision trees handle numerical data well and learn the rules implied by the data through the repeated splitting process. They are computationally inexpensive, easy to use, and efficient; they can cope with data containing irrelevant features, and the rules they produce are easy to interpret and understand. Their disadvantages include difficulty in handling missing data, a tendency to overfit, and ignoring correlations between attributes in the data set.

C4.5 tree

C4.5 improves on ID3 in several ways:

  1. It uses the information gain ratio to select attributes, which overcomes the bias of information gain toward attributes with many values;
  2. It prunes during tree construction;
  3. It can discretize continuous attributes;
  4. It can handle incomplete data.
    Python code: only the calculation of the feature's own entropy and of the information gain ratio is added to the chooseBestFeatureToSplit method. For the specific handling of continuous values, see: https://blog.csdn.net/leaf_zizi/article/details/83105836
def chooseBestFeatureToSplit(dataSet):
    numFea=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)
    bestInfoGainRatio=0.0
    bestFea=-1
    ##Pick the feature with the largest information gain ratio among all the features
    for i in range(numFea):
        fea_list=set([fea[i] for fea in dataSet])
        newEntropy=0.0
        fea_entro=0.0
        for fea in fea_list:
            subDataSet=splitDataSet(dataSet,i,fea)
            ##The proportion of the data set that takes this value of feature i
            prob=len(subDataSet)/float(len(dataSet))
            newEntropy+=prob*calcShannonEnt(subDataSet)
            ##Accumulate the entropy of the feature itself (the split information)
            fea_entro-=prob*log(prob,2)
        infoGain=baseEntropy-newEntropy
        if fea_entro==0.0:
            ##The feature takes a single value; skip it to avoid division by zero
            continue
        newInfoRatio=infoGain/fea_entro

        if newInfoRatio>bestInfoGainRatio:
            bestInfoGainRatio=newInfoRatio
            bestFea=i
    return bestFea
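
The quantity fea_entro computed above is the entropy of the feature itself (the split information) H_A(D), so the quantity the code selects on is the information gain ratio:

g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}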

CART tree

The classification and regression tree (CART) is a binary tree; it represents the conditional probability distribution of the output random variable Y given the input random variable X.
Strategy: the regression tree uses squared-error minimization, and the classification tree uses the Gini index minimization criterion.
How is the final output y obtained in a regression tree?
For each region R into which the input space is divided, the output is the mean of the y values of all training instances x that fall in that region.
The construction process of the least squares regression tree:

  1. Traverse all input variables and find the optimal splitting variable j and split point s (according to the least-squares criterion shown after this list); variable j and point s divide the input space into two regions R1 and R2, and the output y in each region is the mean of the y values of the instances in that region;
  2. With the splitting variable j and split point s obtained in step 1, divide the region and determine the corresponding output values;
  3. Repeat steps 1 and 2 on each resulting region until the stopping condition is met;
  4. The input space is finally divided into M regions, which form the regression tree.
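
The splitting criterion referred to in step 1 chooses the splitting variable j and split point s by solving

\min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]

where R_1(j,s) = \{x \mid x^{(j)} \le s\} and R_2(j,s) = \{x \mid x^{(j)} > s\}; the optimal c_m in each region is simply the mean of the y_i whose x_i fall in R_m.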
    Classification tree construction: omitted here.
    How the CART classification tree handles continuous values:
    For continuous values, the CART classification tree uses the same idea as C4.5: it discretizes the continuous feature. The only difference is the measure used when choosing the division point: C4.5 uses the information gain ratio, while the CART classification tree uses the Gini index.
        The specific idea is as follows. Suppose the continuous feature A takes m values among the m samples, arranged from small to large as a_1, a_2, ..., a_m. The CART algorithm takes the average of each pair of adjacent values, giving a total of m-1 candidate division points, where the i-th division point is T_i = (a_i + a_{i+1})/2. For each of these m-1 points, compute the Gini index obtained when the point is used as a binary split, and select the point with the smallest Gini index as the binary split point for the continuous feature. For example, if the point with the smallest Gini index is a_t, then values smaller than a_t form one category and values greater than a_t form the other, which discretizes the continuous feature. (A short code sketch of this procedure follows below.)
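
Below is a minimal, self-contained sketch of this procedure (it is not from the original post; gini and bestContinuousSplit are illustrative names): it sorts the feature values, forms the midpoints T_i = (a_i + a_{i+1})/2, and returns the candidate split with the smallest weighted Gini index.

def gini(labels):
    ##Gini index of a list of class labels: 1 - sum of squared class proportions
    total = float(len(labels))
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def bestContinuousSplit(values, labels):
    ##values: the m values of one continuous feature; labels: the corresponding class labels
    sortedVals = sorted(set(values))
    bestGini, bestPoint = float('inf'), None
    for i in range(len(sortedVals) - 1):
        t = (sortedVals[i] + sortedVals[i + 1]) / 2.0  ##candidate point T_i
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        ##Weighted Gini index of the binary split at t
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / float(len(values))
        if weighted < bestGini:
            bestGini, bestPoint = weighted, t
    return bestPoint, bestGini

For example, bestContinuousSplit([1.5, 2.0, 3.5, 4.0], ['no', 'no', 'yes', 'yes']) returns (2.75, 0.0): splitting at 2.75 separates the two classes perfectly.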

Recommended: Liu Jianping's blog on Cnblogs, whose analysis is very thorough: https://www.cnblogs.com/pinard/p/6053344.html
