Decision tree pre-pruning:
Decision tree algorithms can be divided into ID3, C4.5, and CART.
Algorithm purpose: pruning a decision tree simplifies the model and helps avoid overfitting.
Pruning types: pre-pruning and post-pruning.
- Pre-pruning: prune while the decision tree is being constructed. Every construction method stops creating branches once the entropy can no longer be reduced. To avoid overfitting, a threshold can be set: if a split reduces the entropy by less than this threshold, branching stops even though the entropy could still be lowered. In practice, however, this method does not work very well.
- Post-pruning: prune the tree after the decision tree has finished growing, to obtain a simplified version of the decision tree.
The pruning process examines a group of nodes that share the same parent and checks whether merging them would increase the entropy by less than a certain threshold. If the increase is small enough, the group of nodes can be merged into a single node that contains all of the possible outcomes (a minimal sketch of this merge test is given below). Post-pruning is by far the more common practice.
The post-pruning procedure deletes some subtrees and replaces each of them with a leaf node whose class is determined by the majority-class rule.
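As a concrete illustration of the merge test described above, here is a minimal, self-contained sketch (not the author's code): it compares the weighted entropy of two sibling leaves with the entropy of the merged node, using an arbitrary threshold of 0.1.

import numpy as np

def entropy(counts):
    # Shannon entropy of a class-count vector
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def should_merge(left_counts, right_counts, threshold=0.1):
    # Entropy of the merged node minus the weighted entropy of the two leaves;
    # merge the sibling leaves when this increase is below the threshold
    left = np.asarray(left_counts, dtype=float)
    right = np.asarray(right_counts, dtype=float)
    n_left, n_right = left.sum(), right.sum()
    before = (n_left * entropy(left) + n_right * entropy(right)) / (n_left + n_right)
    after = entropy(left + right)
    return (after - before) < threshold

# Example: two nearly pure sibling leaves can be merged, two opposite ones cannot
print(should_merge([9, 1], [8, 2]))   # True: little entropy is lost by merging
print(should_merge([9, 1], [1, 9]))   # False: merging would mix the classes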
Pre-pruning:
• Pre-pruning stops the growth of the tree early, before the training set is completely and correctly classified. There are several different ways to decide when to stop growing a decision tree:
(1) One of the simplest methods is to stop growing the tree once it reaches a certain height.
(2) Growth can also be stopped when the instances reaching a node all have the same feature vector, even if they do not all belong to the same class.
(3) Growth can also be stopped when the number of instances reaching a node falls below a certain threshold.
(4) A more common approach is to compute the gain in system performance for each candidate expansion and to skip the expansion when the gain is below a certain threshold (a minimal sketch of these stopping checks follows this list).
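A minimal sketch of such stopping checks; the threshold values max_depth, min_samples, and min_gain are illustrative assumptions, not values from the original code.

def should_stop(depth, n_samples, best_gain, all_same_features,
                max_depth=4, min_samples=3, min_gain=0.01):
    # Each test mirrors one of the criteria (1)-(4) above; thresholds are assumptions
    if depth >= max_depth:        # (1) the tree has reached a fixed height
        return True
    if all_same_features:         # (2) remaining instances share one feature vector
        return True
    if n_samples < min_samples:   # (3) too few instances reach this node
        return True
    if best_gain < min_gain:      # (4) the best split improves too little
        return True
    return False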
Code:
(The basic information-entropy and other mathematical utility functions are omitted here; see the first experiment. Hypothetical sketches of the helpers used below are given next for completeness.)
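The pre-pruning code below relies on helpers such as equalNums, voteLabel, splitFeatureData, and bestFeature from the first experiment. Their implementations are not reproduced in this post; the following are plausible minimal sketches (assumptions, not the original code) that match how they are called:

import numpy as np
from collections import Counter

def equalNums(labels, value):
    # Number of labels equal to value; treats a missing subset as zero matches
    return 0 if labels is None else np.sum(labels == value)

def voteLabel(labels):
    # Majority class among the labels
    return Counter(labels).most_common(1)[0][0]

def splitFeatureData(data, labels, feat):
    # Split data/labels into dicts keyed by the values of feature `feat`,
    # dropping that feature column from each subset
    data = np.asarray(data)
    labels = np.asarray(labels)
    dataSet, labelSet = {}, {}
    for value in set(data[:, feat]):
        mask = data[:, feat] == value
        dataSet[value] = np.delete(data[mask], feat, axis=1)
        labelSet[value] = labels[mask]
    return dataSet, labelSet

# bestFeature(data, labels, method='id3') is assumed to return the index of the
# best splitting feature together with its entropy/gain value.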
Data set:

import numpy as np

def createDataXG20():
    data = np.array([['Green', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery'],
                     ['jet black', 'curl up', 'boring', 'clear', 'sunken', 'Hard and slippery'],
                     ['jet black', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery'],
                     ['Green', 'curl up', 'boring', 'clear', 'sunken', 'Hard and slippery'],
                     ['light white', 'curl up', 'Loudness', 'clear', 'sunken', 'Hard and slippery'],
                     ['Green', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'soft sticky'],
                     ['jet black', 'slightly curled up', 'Loudness', 'Slightly blurred', 'Slightly concave', 'soft sticky'],
                     ['jet black', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'Hard and slippery'],
                     ['jet black', 'slightly curled up', 'boring', 'Slightly blurred', 'Slightly concave', 'Hard and slippery'],
                     ['Green', 'Stiff', 'crisp', 'clear', 'flat', 'soft sticky'],
                     ['light white', 'Stiff', 'crisp', 'Vague', 'flat', 'Hard and slippery'],
                     ['light white', 'curl up', 'Loudness', 'Vague', 'flat', 'soft sticky'],
                     ['Green', 'slightly curled up', 'Loudness', 'Slightly blurred', 'sunken', 'Hard and slippery'],
                     ['light white', 'slightly curled up', 'boring', 'Slightly blurred', 'sunken', 'Hard and slippery'],
                     ['jet black', 'slightly curled up', 'Loudness', 'clear', 'Slightly concave', 'soft sticky'],
                     ['light white', 'curl up', 'Loudness', 'Vague', 'flat', 'Hard and slippery'],
                     ['Green', 'curl up', 'boring', 'Slightly blurred', 'Slightly concave', 'Hard and slippery']])
    label = np.array(['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
                      'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no'])
    name = np.array(['color', 'roots', 'Knock', 'texture', 'Navel', 'touch'])
    return data, label, name

def splitXgData20(xgData, xgLabel):
    xgDataTrain = xgData[[0, 1, 2, 5, 6, 9, 13, 14, 15, 16], :]
    xgDataTest = xgData[[3, 4, 7, 8, 10, 11, 12], :]
    xgLabelTrain = xgLabel[[0, 1, 2, 5, 6, 9, 13, 14, 15, 16]]
    xgLabelTest = xgLabel[[3, 4, 7, 8, 10, 11, 12]]
    return xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest
# Create a pre-pruned decision tree
def createTreePrePruning(dataTrain, labelTrain, dataTest, labelTest, names, method='id3'):
    trainData = np.asarray(dataTrain)
    labelTrain = np.asarray(labelTrain)
    testData = np.asarray(dataTest)
    labelTest = np.asarray(labelTest)
    names = np.asarray(names)
    # If all training samples share a single label, return it directly
    if len(set(labelTrain)) == 1:
        return labelTrain[0]
    # If there are no features left to classify on, return the majority label
    elif trainData.size == 0:
        return voteLabel(labelTrain)
    # Otherwise, select the best feature
    bestFeat, bestEnt = bestFeature(dataTrain, labelTrain, method=method)
    # Take the feature name
    bestFeatName = names[bestFeat]
    # Remove the selected feature name from the list of feature names
    names = np.delete(names, [bestFeat])
    # Split the training data on the optimal feature
    dataTrainSet, labelTrainSet = splitFeatureData(dataTrain, labelTrain, bestFeat)
    # Pre-pruning evaluation: majority label and its ratio before splitting
    labelTrainLabelPre = voteLabel(labelTrain)
    labelTrainRatioPre = equalNums(labelTrain, labelTrainLabelPre) / labelTrain.size
    # Accuracy calculation after splitting
    if dataTest is not None:
        dataTestSet, labelTestSet = splitFeatureData(dataTest, labelTest, bestFeat)
        # Accuracy on the test labels before splitting
        labelTestRatioPre = equalNums(labelTest, labelTrainLabelPre) / labelTest.size
        # After splitting, count the correctly classified test labels for each feature value
        labelTrainEqNumPost = 0
        for val in labelTrainSet.keys():
            labelTrainEqNumPost += equalNums(labelTestSet.get(val), voteLabel(labelTrainSet.get(val))) + 0.0
        # Accuracy after splitting
        labelTestRatioPost = labelTrainEqNumPost / labelTest.size
    # If there is no test data but the pre-split accuracy equals the minimum value 0.5, keep splitting
    if dataTest is None and labelTrainRatioPre == 0.5:
        decisionTree = {bestFeatName: {}}
        for featValue in dataTrainSet.keys():
            decisionTree[bestFeatName][featValue] = createTreePrePruning(
                dataTrainSet.get(featValue), labelTrainSet.get(featValue),
                None, None, names, method)
    elif dataTest is None:
        return labelTrainLabelPre
    # If the accuracy after splitting is lower than before splitting, return a leaf node directly
    elif labelTestRatioPost < labelTestRatioPre:
        return labelTrainLabelPre
    else:
        # Create a tree node for the selected feature name
        decisionTree = {bestFeatName: {}}
        # Recurse on the data subset for each value of the optimal feature
        for featValue in dataTrainSet.keys():
            decisionTree[bestFeatName][featValue] = createTreePrePruning(
                dataTrainSet.get(featValue), labelTrainSet.get(featValue),
                dataTestSet.get(featValue), labelTestSet.get(featValue),
                names, method)
    return decisionTree
Pre-pruning test:
# Load the watermelon data set (createDataXG20 is defined above)
xgData, xgLabel, xgName = createDataXG20()
# Split into training set and test set
xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest = splitXgData20(xgData, xgLabel)
# Generate the unpruned tree
xgTreeTrain = createTree(xgDataTrain, xgLabelTrain, xgName, method='id3')
# Generate the pre-pruned tree
xgTreePrePruning = createTreePrePruning(xgDataTrain, xgLabelTrain, xgDataTest, xgLabelTest, xgName, method='id3')
# Draw the tree before pruning
print("original tree")
createPlot(xgTreeTrain)
# Draw the pre-pruned tree
print("pruning tree")
createPlot(xgTreePrePruning)
Output comparison:
Original tree: (figure: the unpruned decision tree plotted by createPlot)
After pre-pruning: (figure: the simplified decision tree plotted by createPlot)
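To compare the two trees numerically as well (this is not part of the original post), a hypothetical classify helper can walk the nested-dict tree produced above; it assumes that splitFeatureData drops the used feature column from each subset, as in the sketch given earlier.

def classify(treeNode, featNames, sample):
    # Walk the nested-dict tree; leaves are plain labels
    if not isinstance(treeNode, dict):
        return treeNode
    featName = next(iter(treeNode))
    featNames = list(featNames)
    idx = featNames.index(featName)
    branch = treeNode[featName].get(sample[idx])
    if branch is None:
        return None  # feature value never seen during training
    # Drop the used feature so the indices match the recursive tree construction
    restNames = featNames[:idx] + featNames[idx + 1:]
    restSample = list(sample[:idx]) + list(sample[idx + 1:])
    return classify(branch, restNames, restSample)

def accuracy(treeNode, featNames, data, labels):
    # Fraction of test samples whose predicted label matches the true label
    hits = sum(classify(treeNode, featNames, row) == lab for row, lab in zip(data, labels))
    return hits / len(labels)

print(accuracy(xgTreeTrain, xgName, xgDataTest, xgLabelTest))
print(accuracy(xgTreePrePruning, xgName, xgDataTest, xgLabelTest))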
Advantages and disadvantages:
• Because pre-pruning does not need to generate the entire decision tree, and the algorithm is relatively simple and efficient, it is well suited to large-scale problems. However, although the approach seems straightforward, it is quite difficult to estimate exactly when to stop growing the tree.
• One disadvantage of pre-pruning is the horizon (limited look-ahead) problem: under a given criterion, the current expansion may appear to overfit the training data, yet a further expansion could satisfy the requirement and even fit the training data accurately. As a result, the algorithm may stop constructing the decision tree too early.
If pruning is done with scikit-learn's DecisionTreeClassifier (e.g., for visual comparison):
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0, min_samples_split=5)
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0, min_samples_split=10)
The larger the value of min_samples_split, the fewer branches the tree grows, which is exactly the pre-pruning effect.
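A minimal sketch showing this effect with scikit-learn's built-in pre-pruning parameter; the iris dataset is used here only as a stand-in for the watermelon data.

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

for min_split in (2, 5, 10):
    clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0,
                                      min_samples_split=min_split)
    clf.fit(X, y)
    # A larger min_samples_split yields a shallower tree with fewer nodes
    print(min_split, clf.get_depth(), clf.tree_.node_count)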
Post-pruning:
Post-pruning operates on the fully grown, overfitted decision tree and prunes it afterwards, yielding a simplified version of the decision tree. Common post-pruning methods include the following (a scikit-learn sketch of CCP follows the list):
1. REP - Reduced Error Pruning
2. PEP - Pessimistic Error Pruning
3. CCP - Cost Complexity Pruning
4. MEP - Minimum Error Pruning
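A hedged sketch of CCP using scikit-learn's cost_complexity_pruning_path and the ccp_alpha parameter; the iris dataset again stands in for the real data, and the chosen alpha is arbitrary.

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Compute the effective alphas of the subtrees obtained by cost complexity pruning
full = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a non-zero ccp_alpha to obtain a heavily pruned (simpler) tree
pruned = tree.DecisionTreeClassifier(criterion='entropy', random_state=0,
                                     ccp_alpha=path.ccp_alphas[-2])
pruned.fit(X, y)
print(pruned.tree_.node_count)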
Summary:
Pruning occupies an extremely important position in the decision tree model; many studies show that pruning is even more critical than the tree-generation process itself. For the overfitted decision trees generated under different splitting criteria, the most important attribute splits can be retained after pruning, so the final performance gap is usually small. Understand the theory behind the pruning methods; in practical applications, choose which decision tree and which pruning strategy to use according to the type and scale of the data, and stay flexible in searching for the best option.