Note: All functions here are written for the watermelon dataset. To use them on another dataset, find the column that holds the target value (in the watermelon dataset it is the last column, with "yes" or "no" indicating whether the melon is good) and adjust the code that tests the target value accordingly.
Hold-Out Implementation
Implementation idea: In the hold-out method, the dataset is divided into two mutually exclusive sets, one used as the training set and the other as the test set. The split should use stratified sampling so that both sets keep roughly the class proportions of the full dataset. To generate multiple groups of data, the split is randomized each time.
First count the number of positive examples in the dataset:
    # Calculate positive examples in a dataset
    for data in data_set:
        if data[-1] == "yes":
            positive_size += 1
Then calculate the size of the training set you need:
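    # Calculate training set size
    train_set_size = int(percent / 100 * len(data_set) + 0.5)
Here percent is the size of the training set as a percentage of the whole dataset; it is entered by the user. Adding 0.5 before truncating with int rounds the result to the nearest integer.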
Then calculate the number of positive examples needed in the training set:
    # Calculate the number of positive examples required in the training set
    positive_train_data_size = int(
        train_set_size / len(data_set) * positive_size + 0.5)
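As a worked example of the two rounding formulas, the following sketch assumes a dataset of 17 rows with 8 positive examples and a 70% training share. These numbers are illustrative assumptions, not values taken from the text above.

```python
# Illustrative assumptions: 17 rows, 8 of them positive, 70% for training.
total_size = 17
positive_size = 8
percent = 70

# Same rounding formulas as above: int(x + 0.5) rounds to the nearest integer.
train_set_size = int(percent / 100 * total_size + 0.5)
positive_train_data_size = int(train_set_size / total_size * positive_size + 0.5)
negative_train_data_size = train_set_size - positive_train_data_size

print(train_set_size, positive_train_data_size, negative_train_data_size)
# prints: 12 6 6
```

Under these assumptions the training set holds 6 positive and 6 negative rows, which leaves 2 positive and 3 negative rows for the test set.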
To make each split random while keeping the training and test sets mutually exclusive, use random.sample to generate a random permutation of the indices, and visit the samples in that order:
    # Shuffle the indices so that each group is partitioned randomly
    index_list = random.sample(range(0, len(data_set)), len(data_set))
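A small check of why this works: random.sample(range(n), n) returns every index exactly once in random order, so a single pass over the result can never place a row in both sets. The seed and the size n below are assumptions, used only to make the sketch repeatable.

```python
import random

random.seed(0)  # assumption: fixed seed, only for repeatability
n = 10          # hypothetical dataset size
index_list = random.sample(range(0, n), n)

# Every index from 0 to n-1 appears exactly once.
is_permutation = sorted(index_list) == list(range(n))
print(index_list, is_permutation)
```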
Then use a for loop to assign the samples to the training set and the test set, class by class:
    # Split into training set and test set
    for j in index_list:
        if data_set[j][-1] == "yes":
            if positive_count < positive_train_data_size:
                train_set.append(data_set[j])
                positive_count += 1
            else:
                test_set.append(data_set[j])
        else:
            if negative_count < train_set_size - positive_train_data_size:
                train_set.append(data_set[j])
                negative_count += 1
            else:
                test_set.append(data_set[j])
Save each group of training and test sets as it is generated:
    k_train_set.append(train_set)
    k_test_set.append(test_set)
The complete function is as follows:
    import random

    def setAside(data_set):
        '''Generate datasets using the hold-out method'''
        positive_size = 0
        # Count the positive examples in the dataset
        for data in data_set:
            if data[-1] == "yes" or data[-1] == "1":
                positive_size += 1
        percent = int(input("Percentage of training set to total dataset(1~99): "))
        # Calculate training set size
        train_set_size = int(percent / 100 * len(data_set) + 0.5)
        # Calculate the number of positive examples required in the training set
        positive_train_data_size = int(
            train_set_size / len(data_set) * positive_size + 0.5)
        group_num = int(input("How many groups are needed? "))
        k_train_set, k_test_set = [], []
        # Generate training sets and test sets
        for i in range(group_num):
            # Shuffle the indices so that each group is partitioned randomly
            index_list = random.sample(range(0, len(data_set)), len(data_set))
            # Store one group's training set and test set
            train_set, test_set = [], []
            # Positive and negative examples placed in the training set so far,
            # used to check whether each class has reached its quota
            positive_count, negative_count = 0, 0
            # Split into training set and test set
            for j in index_list:
                # Test the current row data_set[j], not the leftover loop variable data
                if data_set[j][-1] == "yes" or data_set[j][-1] == "1":
                    if positive_count < positive_train_data_size:
                        train_set.append(data_set[j])
                        positive_count += 1
                    else:
                        test_set.append(data_set[j])
                else:
                    if negative_count < train_set_size - positive_train_data_size:
                        train_set.append(data_set[j])
                        negative_count += 1
                    else:
                        test_set.append(data_set[j])
            k_train_set.append(train_set)
            k_test_set.append(test_set)
        # Return the training sets, the test sets, and the number of groups
        return k_train_set, k_test_set, group_num
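Because setAside reads its parameters with input, it is awkward to test. The sketch below is a hypothetical non-interactive variant (the name hold_out_once, its parameters, and the toy data are assumptions for illustration, not part of the original files) that performs one stratified split so the class counts can be checked.

```python
import random


def hold_out_once(data_set, percent, rng=random):
    """One stratified hold-out split; hypothetical helper for illustration."""
    positive_size = sum(1 for row in data_set if row[-1] in ("yes", "1"))
    train_set_size = int(percent / 100 * len(data_set) + 0.5)
    positive_quota = int(train_set_size / len(data_set) * positive_size + 0.5)
    negative_quota = train_set_size - positive_quota

    train_set, test_set = [], []
    positive_count, negative_count = 0, 0
    # Visit the rows in random order and fill each class's quota first.
    for j in rng.sample(range(len(data_set)), len(data_set)):
        row = data_set[j]
        if row[-1] in ("yes", "1"):
            if positive_count < positive_quota:
                train_set.append(row)
                positive_count += 1
            else:
                test_set.append(row)
        elif negative_count < negative_quota:
            train_set.append(row)
            negative_count += 1
        else:
            test_set.append(row)
    return train_set, test_set


# Toy data (made up): 8 positive and 9 negative rows, each [id, label].
toy = [[i, "yes"] for i in range(8)] + [[i, "no"] for i in range(8, 17)]
train, test = hold_out_once(toy, 70, random.Random(1))
train_pos = sum(1 for row in train if row[-1] == "yes")
print(len(train), train_pos, len(test))
```

Which rows end up in each set depends on the seed, but the sizes do not: with this toy data and a 70% share, the training set always has 12 rows, 6 of them positive.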
Implementation of k-fold cross-validation
Implementation idea: k-fold cross-validation divides the dataset into k mutually exclusive subsets; each subset in turn serves as the test set while the remaining k-1 subsets form the training set. The subsets should be produced by stratified sampling. Computing the per-class counts needed for each subset, as in the hold-out method, would make the program too complex. A simpler approach is to sort the dataset by whether the melon is good, and then deal the sorted samples out in turn to k lists.
First, sort the dataset:
    # Sort by the last element of each row (the label) to facilitate stratification
    data_set.sort(key=lambda k: k[-1])
The sorted samples are then dealt out in turn to the k lists:
    # Deal the sorted rows out in turn into k stratified parts
    for i in range(len(data_set)):
        k_list[i % k].append(data_set[i])
Because samples of the same class are adjacent after sorting, this divides the data into k mutually exclusive parts with nearly equal class proportions.
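To see the effect of dealing sorted rows out in turn, the sketch below uses made-up labels (8 "yes" and 9 "no") with k = 3 and counts the classes in each part. The data and the choice of k are assumptions for illustration.

```python
from collections import Counter

# Made-up rows holding only a label; a real row would carry attributes first.
data_set = [["yes"] for _ in range(8)] + [["no"] for _ in range(9)]
k = 3

data_set.sort(key=lambda row: row[-1])  # "no" rows first, then "yes" rows
k_list = [[] for _ in range(k)]
for i in range(len(data_set)):
    k_list[i % k].append(data_set[i])

# Count the labels that ended up in each part.
fold_counts = [Counter(row[-1] for row in part) for part in k_list]
for counts in fold_counts:
    print(dict(counts))
```

Each part receives 3 "no" rows, and the 8 "yes" rows are spread as 3, 3 and 2, so the class counts differ by at most one between parts.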
Then each of the k parts in turn is used as the test set, with the other k-1 parts merged into the training set, and every split is saved:
    # Generate k test sets and k training sets
    for i in range(k):
        temp_set = []
        for j in range(k):
            if j != i:
                temp_set.extend(k_list[j])
            else:
                k_test_set.append(k_list[j])
        k_train_set.append(temp_set)
The complete function is as follows:
    def crossValidate(data_set):
        '''Generate datasets using k-fold cross-validation'''
        # Sort by the last element of each row (the label) to facilitate stratification
        data_set.sort(key=lambda row: row[-1])
        k = int(input("Input Fold Number k(k>1): "))
        # Generate k empty lists, one per fold
        k_list = [[] for i in range(k)]
        k_train_set = []
        k_test_set = []
        # Deal the sorted rows out in turn into k stratified parts
        for i in range(len(data_set)):
            k_list[i % k].append(data_set[i])
        # Generate k test sets and k training sets
        for i in range(k):
            temp_set = []
            for j in range(k):
                if j != i:
                    temp_set.extend(k_list[j])
                else:
                    k_test_set.append(k_list[j])
            k_train_set.append(temp_set)
        # Return the training sets, the test sets, and the number of folds
        return k_train_set, k_test_set, k
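The properties that matter for k-fold splitting can be checked with a small non-interactive sketch. The function name k_fold_split and the toy data are assumptions for illustration. Unlike crossValidate, which sorts data_set in place and therefore reorders the caller's list, this variant sorts a copy.

```python
def k_fold_split(data_set, k):
    """Hypothetical non-interactive variant of crossValidate; sorts a copy."""
    ordered = sorted(data_set, key=lambda row: row[-1])
    k_list = [[] for _ in range(k)]
    for i, row in enumerate(ordered):
        k_list[i % k].append(row)
    # Training set i is every part except part i.
    k_train_set = [
        [row for j in range(k) if j != i for row in k_list[j]]
        for i in range(k)
    ]
    return k_train_set, k_list


# Toy data (made up): rows are [id, label].
toy = [[i, "yes" if i < 8 else "no"] for i in range(17)]
train_sets, test_sets = k_fold_split(toy, 4)

# Every row lands in exactly one test fold.
all_test_ids = sorted(row[0] for fold in test_sets for row in fold)
print(all_test_ids == list(range(17)))
print([len(fold) for fold in test_sets])
```

With 17 rows and k = 4 the test folds have sizes 5, 4, 4 and 4, and each training set contains the remaining rows.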
Bootstrap Implementation
Implementation idea: The bootstrap method draws m samples at random, with replacement, from a dataset of size m and uses them as the training set; the samples that were never drawn form the test set.
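A quick calculation shows how large the test set tends to be. A given sample is missed by one draw with probability 1 - 1/m, so it is missed by all m draws with probability (1 - 1/m)^m, which approaches 1/e (about 0.368) as m grows. The sketch below evaluates this for a few values of m chosen for illustration; the function name is an assumption.

```python
import math


def prob_never_drawn(m):
    """Probability that a fixed sample is absent from m draws with replacement."""
    return (1 - 1 / m) ** m


for m in (17, 100, 10000):  # illustrative dataset sizes
    print(m, round(prob_never_drawn(m), 4))

print("limit 1/e:", round(math.exp(-1), 4))
```

So roughly a third of the rows can be expected to end up in the test set of each bootstrap group.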
First get the dataset size m:
    m = len(data_set)
To generate multiple groups of training and test sets, generate one list of random indices per group. Each index lies between 0 and m-1, since the dataset's indices run from 0 to m-1:
    # Generate `group` lists of m random indices each, ranging from 0 to m-1
    ran_lis = [[random.randint(0, m - 1) for i in range(m)]
               for j in range(group)]
Then separate the training set from the test set:
    # Generate training set, test set
    for i in range(group):
        train_set = []
        test_set = []
        # Randomly sample m rows (with replacement) for the training set
        for j in ran_lis[i]:
            train_set.append(data_set[j])
        # Put the rows that were never sampled in the test set
        # (rows are compared by value, so they are assumed to be distinct)
        for data in data_set:
            if data not in train_set:
                test_set.append(data)
        k_train_set.append(train_set)
        k_test_set.append(test_set)
The complete function is as follows:
    import random

    def bootstrap(data_set):
        '''Generate datasets using the bootstrap method'''
        m = len(data_set)
        group = int(input("How many groups are needed? "))
        # Generate `group` lists of m random indices each, ranging from 0 to m-1
        ran_lis = [[random.randint(0, m - 1) for i in range(m)]
                   for j in range(group)]
        k_train_set = []
        k_test_set = []
        # Generate training set, test set
        for i in range(group):
            train_set = []
            test_set = []
            # Randomly sample m rows (with replacement) for the training set
            for j in ran_lis[i]:
                train_set.append(data_set[j])
            # Put the rows that were never sampled in the test set
            # (rows are compared by value, so they are assumed to be distinct)
            for data in data_set:
                if data not in train_set:
                    test_set.append(data)
            k_train_set.append(train_set)
            k_test_set.append(test_set)
        # Return the training sets, the test sets, and the number of groups
        return k_train_set, k_test_set, group
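The test `data not in train_set` compares rows by value, so if a dataset contained two identical rows, sampling one of them would also keep the other out of the test set. A minimal sketch of an alternative that tracks indices instead is shown below; the name bootstrap_once, its parameters, and the toy data are assumptions for illustration, and it produces a single group without calling input.

```python
import random


def bootstrap_once(data_set, rng=random):
    """Hypothetical single bootstrap split that tracks indices, not row values."""
    m = len(data_set)
    # Draw m indices with replacement.
    drawn = [rng.randint(0, m - 1) for _ in range(m)]
    drawn_set = set(drawn)
    train_set = [data_set[j] for j in drawn]
    # Rows whose index was never drawn form the test set.
    test_set = [data_set[j] for j in range(m) if j not in drawn_set]
    return train_set, test_set, drawn_set


# Toy data (made up) containing two identical rows.
toy = [["a", "yes"], ["a", "yes"], ["b", "no"], ["c", "no"], ["d", "yes"]]
train, test, drawn_set = bootstrap_once(toy, random.Random(0))
print(len(train), len(test))
```

The training set always has m rows, while the size of the test set varies from group to group because it depends on how many distinct indices were drawn.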
Other auxiliary programs, such as those that read the dataset file, save the generated training and test sets to files, and call these three functions, are not covered here. All the files can be downloaded from this link.