Decision tree part 2 pre-pruning

Decision tree pre-pruning:

Decision trees can be divided into ID3, C4.5 and CART.

Algorithm purpose: The pruning of the decision tree is to simplify the decision tree model and avoid overfitting.

Pruning type: pre-pruning, post-pruning

  • Pre-pruning: pruning while constructing the decision tree. All decision tree construction methods stop the process of creating branches when the entropy cannot be further reduced. In order to avoid overfitting, a threshold can be set. The amount of entropy reduction is less than this threshold, even if it can continue Lowers entropy and also stops further branching. But this method doesn't work well in practice.
  • Post-pruning is to prune the tree after the growth of the decision tree is completed to obtain a simplified version of the decision tree.

    The process of pruning is to check a group of nodes with the same parent node to determine whether the increase in entropy is less than a certain threshold if they are merged. If it is really small, this set of nodes can be merged into one node, which contains all possible outcomes. Post-pruning is by far the most common practice.
    The pruning process of post-pruning is to delete some subtrees and replace them with their leaf nodes. The category identified by the leaf nodes is determined by the majority class criterion.


• Pre-pruning is to stop the growth of the tree early, before the training set is completely correctly classified. There are a number of different ways to specify when to stop the growth of a decision tree:
(1) One of the simplest methods is to stop the growth of the tree when the decision tree reaches a certain height.
(2) The instances that reach this node have the same feature vector, but do not necessarily belong to the same class, and can also stop growing.
(3) The growth of the tree can also be stopped if the number of instances reaching this node is less than a certain threshold.

(4) Another more common approach is to calculate the gain of each expansion to the system performance, and if the gain value is less than a certain threshold, the expansion will not be performed.


(Omit basic information entropy and other mathematical operation codes, see the first experiment)

data set:
def createDataXG20():
    data = np.array([['Green', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery']
                    , ['jet black', 'curl up', 'boring', 'clear', 'sunken', 'Hard and slippery']
                    , ['jet black', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery']
                    , ['Green', 'curl up', 'boring', 'clear', 'sunken', 'Hard and slippery']
                    , ['light white', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery']
                    , ['Green', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'soft sticky']
                    , ['jet black', 'slightly curled up', 'Loudness', 'Slightly blurred', 'Slightly concave', 'soft sticky']
                    , ['jet black', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'Hard and slippery']
                    , ['jet black', 'slightly curled up', 'boring', 'Slightly blurred', 'Slightly concave', 'Hard and slippery']
                    , ['Green', 'Stiff', 'crisp', 'clear', 'flat', 'soft sticky']
                    , ['light white', 'Stiff', 'crisp', 'Vague', 'flat', 'Hard and slippery']
                    , ['light white', 'curl up', 'Loudness', 'Vague', 'flat', 'soft sticky']
                    , ['Green', 'slightly curled up', 'Loudness', 'Slightly blurred', 'sunken', 'Hard and slippery']
                    , ['light white', 'slightly curled up', 'boring', 'Slightly blurred', 'sunken', 'Hard and slippery']
                    , ['jet black', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'soft sticky']
                    , ['light white', 'curl up', 'Loudness', 'Vague', 'flat', 'Hard and slippery']
                    , ['Green', 'curl up', 'boring', 'Slightly blurred', 'Slightly concave', 'Hard and slippery']])
    label = np.array(['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no'])
    name = np.array(['color', 'roots', 'Knock', 'texture', 'Navel', 'touch'])
    return data, label, name

def splitXgData20(xgData, xgLabel):
    xgDataTrain = xgData[[0, 1, 2, 5, 6, 9, 13, 14, 15, 16],:]
    xgDataTest = xgData[[3, 4, 7, 8, 10, 11, 12],:]
    xgLabelTrain = xgLabel[[0, 1, 2, 5, 6, 9, 13, 14, 15, 16]]
    xgLabelTest = xgLabel[[3, 4, 7, 8, 10, 11, 12]]
    return xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest
# Create a pre-pruned decision tree
def createTreePrePruning(dataTrain, labelTrain, dataTest, labelTest, names, method = 'id3'):

    trainData = np.asarray(dataTrain)
    labelTrain = np.asarray(labelTrain)
    testData = np.asarray(dataTest)
    labelTest = np.asarray(labelTest)
    names = np.asarray(names)# If the result is a single result
    if len(set(labelTrain)) == 1: 
        return labelTrain[0] # If there are no features to classify
    elif trainData.size == 0: 
        return voteLabel(labelTrain)# In other cases, select features 
    bestFeat, bestEnt = bestFeature(dataTrain, labelTrain, method = method) # take feature name
    bestFeatName = names[bestFeat] # removes the acquired feature name from the list of feature names
    names = np.delete(names, [bestFeat])# Segmentation based on optimal features
    dataTrainSet, labelTrainSet = splitFeatureData(dataTrain, labelTrain, bestFeat)# Pre-pruning evaluation
    labelTrainLabelPre = voteLabel(labelTrain)
    labelTrainRatioPre = equalNums(labelTrain, labelTrainLabelPre) / labelTrain.size# Accuracy calculation after division 
    if dataTest is not None: 
        dataTestSet, labelTestSet = splitFeatureData(dataTest, labelTest, bestFeat)# Correct ratio of test labels before division
        labelTestRatioPre = equalNums(labelTest, labelTrainLabelPre) / labelTest.size# After division, the correct number of classification labels for each feature value
        labelTrainEqNumPost = 0
        for val in labelTrainSet.keys():
            labelTrainEqNumPost += equalNums(labelTestSet.get(val), voteLabel(labelTrainSet.get(val))) + 0.0# After dividing the correct ratio
        labelTestRatioPost = labelTrainEqNumPost / labelTest.size 
# If the data is not evaluated but the precision before division is equal to the minimum value of 0.5 then continue to divide
    if dataTest is None and labelTrainRatioPre == 0.5:
        decisionTree = {bestFeatName: {}}
        for featValue in dataTrainSet.keys():
            decisionTree[bestFeatName][featValue] = createTreePrePruning(dataTrainSet.get(featValue), labelTrainSet.get(featValue)
                                      , None, None, names, method)
    elif dataTest is None:
        return labelTrainLabelPre 
    # If the accuracy after division is lower than the accuracy before division, it will be directly returned as a leaf node
    elif labelTestRatioPost < labelTestRatioPre:
        return labelTrainLabelPre
    else :
        # Create tree nodes based on selected feature names
        decisionTree = {bestFeatName: {}}
        # Calculate the subset of data divided by each eigenvalue of the optimal feature
        for featValue in dataTrainSet.keys():
            decisionTree[bestFeatName][featValue] = createTreePrePruning(dataTrainSet.get(featValue), labelTrainSet.get(featValue)
                                      , dataTestSet.get(featValue), labelTestSet.get(featValue)
                                      , names, method)
    return decisionTree 

Pre-pruning test:

# Split into test set and training set
xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest = splitXgData20(xgData, xgLabel)
# generate unpruned tree
xgTreeTrain = createTree(xgDataTrain, xgLabelTrain, xgName, method = 'id3')
# generate pre-pruned trees
xgTreePrePruning = createTreePrePruning(xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest, xgName, method = 'id3')
# drawing tree before pruning
print("original tree")
# drawing pruned tree
print("pruning tree")

Output comparison:





advantage disadvantage

•Because pre-pruning does not need to generate the entire decision tree, and the algorithm is relatively simple and efficient, it is suitable for solving large-scale problems. But although this method seems straightforward, it is quite difficult to estimate exactly when to stop the growth of the tree.

• One disadvantage of pre-pruning is the field of view problem. That is to say, under the same standard, the current expansion may cause overfitting to the training data, but further expansion can meet the requirements, and it is also possible to accurately fit the training data. This will cause the algorithm to stop constructing the decision tree prematurely.

If using visual pruning code:

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0, min_samples_split=5)


clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0, min_samples_split=10)

It can be known that the larger the value of min_samples_split, the smaller the branch, which is the pruning operation.

After pruning:

After pruning, pruning is performed on the generated overfitting decision tree, and a simplified version of the pruning decision tree can be obtained

1. REP-error rate reduction pruning

2. PEP-pessimistic pruning

3. CCP-Cost Complexity Pruning

4. MEP-minimum error pruning


The pruning process occupies an extremely important position in the decision tree model. There are many studies that show that pruning is more critical than the tree generation process. For the overfitting decision tree generated by different division standards, the most important attribute division can be retained after pruning, so the final performance gap is not large. Understand the theory of pruning methods. In practical applications, according to different data types and scales, decide which decision tree to use and the corresponding pruning strategy, and be flexible to find the optimal choice.


1. Wang Lu ylu: decision tree python

2. Machine learning in practice

3 Decision tree pruning of Xie Xiaojiao's teaching package meeting decision tree


Tags: Decision Tree

Posted by flatpooks on Sun, 20 Nov 2022 03:11:08 +0530