Note: All of the functions below are written for the watermelon dataset. To use them on another dataset, find the column that holds the target values (for the watermelon dataset, "yes"/"no" for a good melon in the last column) and adjust the code that checks those values accordingly.
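As a sketch of that adaptation, the target check could be pulled into a small helper whose column index and positive label are parameters. The function name and parameters here are hypothetical, not part of the original code:

```python
def is_positive(row, label_index=-1, positive_label="yes"):
    """Hypothetical helper: check a row's target column.

    label_index and positive_label are illustrative parameters; for the
    watermelon dataset the target is the last column and "yes" marks
    a good melon.
    """
    return row[label_index] == positive_label

# Watermelon-style row: features first, label last
good = is_positive(["green", "curled", "yes"])
```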

## Hold-out Method Implementation

Implementation idea: following the hold-out method, the dataset is split into two mutually exclusive sets, one used for training and the other for testing. The split must use stratified sampling, and to generate several groups of splits, each split is randomized.

First count the number of positive examples in the dataset:

```python
# Count the positive examples in the dataset
for data in data_set:
    if data[-1] == "yes":
        positive_size += 1
```

Then calculate the size of the training set you need:

```python
# Compute the training set size
train_set_size = int(percent / 100 * len(data_set) + 0.5)
```

`percent` is the percentage of the dataset used for training and is entered by the user.

Then calculate the number of positive examples needed in the training set:

```python
# Number of positive examples needed in the training set
positive_train_data_size = int(
    train_set_size / len(data_set) * positive_size + 0.5)
```

To make each split random while keeping the training and test sets mutually exclusive, use `random.sample` to generate a shuffled list of indices into the dataset:

```python
# Shuffle the indices so that each split is random
index_list = random.sample(range(0, len(data_set)), len(data_set))
```

Then split the data into stratified training and test sets with a for loop:

```python
# Split into training and test sets
for j in index_list:
    if data_set[j][-1] == "yes":
        if positive_count < positive_train_data_size:
            train_set.append(data_set[j])
            positive_count += 1
        else:
            test_set.append(data_set[j])
    else:
        if negative_count < train_set_size - positive_train_data_size:
            train_set.append(data_set[j])
            negative_count += 1
        else:
            test_set.append(data_set[j])
```

Save each generated pair of training and test sets:

```python
k_train_set.append(train_set)
k_test_set.append(test_set)
```

The general procedure is as follows:

```python
import random

def setAside(data_set):
    '''Generate training/test set groups with the hold-out method'''
    positive_size = 0
    # Count the positive examples in the dataset
    for data in data_set:
        if data[-1] == "yes" or data[-1] == "1":
            positive_size += 1
    percent = int(input("Percentage of training set to total dataset(1~99): "))
    # Compute the training set size
    train_set_size = int(percent / 100 * len(data_set) + 0.5)
    # Number of positive examples needed in the training set
    positive_train_data_size = int(
        train_set_size / len(data_set) * positive_size + 0.5)
    group_num = int(input("How many groups are needed? "))
    k_train_set, k_test_set = [], []
    # Generate the groups of training and test sets
    for i in range(group_num):
        # Shuffle the indices so that each split is random
        index_list = random.sample(range(0, len(data_set)), len(data_set))
        # Hold one group's training set and test set
        train_set, test_set = [], []
        # Counters for positive/negative examples already placed in the training set
        positive_count, negative_count = 0, 0
        # Split into training and test sets
        for j in index_list:
            if data_set[j][-1] == "yes" or data_set[j][-1] == "1":
                if positive_count < positive_train_data_size:
                    train_set.append(data_set[j])
                    positive_count += 1
                else:
                    test_set.append(data_set[j])
            else:
                if negative_count < train_set_size - positive_train_data_size:
                    train_set.append(data_set[j])
                    negative_count += 1
                else:
                    test_set.append(data_set[j])
        k_train_set.append(train_set)
        k_test_set.append(test_set)
    # Return the groups of training sets, test sets, and the group count
    return k_train_set, k_test_set, group_num
```
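For a quick sanity check, the same stratified split can be condensed into a non-interactive sketch, with `percent` and the positive label passed as parameters instead of read with `input()`. The function name is illustrative, not part of the original code:

```python
import random

def holdout_split(data_set, percent, positive_label="yes"):
    """Sketch of a single stratified hold-out split: (train_set, test_set)."""
    positive_size = sum(1 for row in data_set if row[-1] == positive_label)
    train_size = int(percent / 100 * len(data_set) + 0.5)
    pos_train = int(train_size / len(data_set) * positive_size + 0.5)
    neg_train = train_size - pos_train
    train_set, test_set = [], []
    pos_count = neg_count = 0
    # Visit rows in random order; fill the training quota first
    for j in random.sample(range(len(data_set)), len(data_set)):
        row = data_set[j]
        if row[-1] == positive_label:
            if pos_count < pos_train:
                train_set.append(row)
                pos_count += 1
            else:
                test_set.append(row)
        else:
            if neg_count < neg_train:
                train_set.append(row)
                neg_count += 1
            else:
                test_set.append(row)
    return train_set, test_set

# Toy dataset: 4 positive and 4 negative rows, split 50/50
toy = [["a", "yes"]] * 4 + [["b", "no"]] * 4
train, test = holdout_split(toy, 50)
```

With a 50% split of 8 rows (4 positive), the training set always ends up with 2 positive and 2 negative rows, whatever order the shuffle produces.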

## k-Fold Cross-Validation Implementation

Implementation idea: k-fold cross-validation divides the dataset into k mutually exclusive subsets; one subset serves as the test set and the remaining subsets form the training set, again using stratified sampling. Computing all the per-subset counts as in the hold-out method would make the program overly complex, so a simpler approach is to sort the dataset by whether the melon is good and then deal the rows, in turn, into k lists.

First, sort the dataset:

```python
# Sort by the last element of each row to make stratification easy
data_set.sort(key=lambda k: k[-1])
```

The sorted rows are then dealt, in order, into a list of k sublists:

```python
# Deal the sorted dataset into k stratified parts
for i in range(len(data_set)):
    k_list[i % k].append(data_set[i])
```

This stratifies the data into k mutually exclusive parts.

Then each of the k parts is used in turn as the test set, with the remaining parts combined into the training set, and every split is saved:

```python
# Generate k test sets and k training sets
for i in range(k):
    temp_set = []
    for j in range(k):
        if j != i:
            temp_set.extend(k_list[j])
        else:
            k_test_set.append(k_list[j])
    k_train_set.append(temp_set)
```

The general procedure is as follows:

```python
def crossValidate(data_set):
    '''Generate training/test set groups with k-fold cross-validation'''
    # Sort by the last element of each row to make stratification easy
    data_set.sort(key=lambda k: k[-1])
    k = int(input("Input Fold Number k(k>1): "))
    # Generate k lists inside one list
    k_list = list([] for i in range(k))
    k_train_set = list([])
    k_test_set = list([])
    # Deal the sorted dataset into k stratified parts
    for i in range(len(data_set)):
        k_list[i % k].append(data_set[i])
    # Generate k test sets and k training sets
    for i in range(k):
        temp_set = []
        for j in range(k):
            if j != i:
                temp_set.extend(k_list[j])
            else:
                k_test_set.append(k_list[j])
        k_train_set.append(temp_set)
    # Return the groups of training sets, test sets, and the fold count
    return k_train_set, k_test_set, k
```
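As a sanity check on the sort-then-deal trick, here is the same partitioning sketched with k passed as a parameter instead of read with `input()` (the function name is illustrative). The assertions below it confirm that each test fold is stratified and that the folds are mutually exclusive:

```python
def kfold_split(data_set, k):
    """Sketch of stratified k-fold split: (k_train_set, k_test_set)."""
    data_set = sorted(data_set, key=lambda row: row[-1])
    k_list = [[] for _ in range(k)]
    # Deal the sorted rows round-robin so each part gets a balanced mix
    for i in range(len(data_set)):
        k_list[i % k].append(data_set[i])
    k_train_set, k_test_set = [], []
    for i in range(k):
        temp_set = []
        for j in range(k):
            if j != i:
                temp_set.extend(k_list[j])
            else:
                k_test_set.append(k_list[j])
        k_train_set.append(temp_set)
    return k_train_set, k_test_set

# Toy dataset: 5 positive and 5 negative rows, 5 folds
toy = [[i, "yes"] for i in range(5)] + [[i, "no"] for i in range(5)]
trains, tests = kfold_split(toy, 5)
```

Because the rows are sorted by label before dealing, each of the 5 test folds receives exactly one "yes" and one "no" row.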

## Bootstrap Method Implementation

Implementation idea: the bootstrap method draws m samples with replacement from a dataset of size m to form the training set; the rows that were never drawn form the test set.

First, get the dataset size m:

```python
m = len(data_set)
```

To generate several groups of training and test sets, first generate the required arrays of random numbers. Each number ranges from 0 to m-1, since dataset indices run from 0 to m-1:

```python
# Generate `group` arrays of m random numbers each, ranging from 0 to m-1
ran_lis = list([[random.randint(0, m - 1) for i in range(m)]
                for j in range(group)])
```

Then separate the training set from the test set:

```python
# Generate the groups of training and test sets
for i in range(group):
    train_set = []
    test_set = []
    # Draw m samples (with replacement) into the training set
    for j in ran_lis[i]:
        train_set.append(data_set[j])
    # Put the remaining rows into the test set
    for data in data_set:
        if data not in train_set:
            test_set.append(data)
    k_train_set.append(train_set)
    k_test_set.append(test_set)
```

The general procedure is as follows:

```python
import random

def bootstrap(data_set):
    '''Generate training/test set groups with the bootstrap method'''
    m = len(data_set)
    group = int(input("How many groups are needed? "))
    # Generate `group` arrays of m random numbers each, ranging from 0 to m-1
    ran_lis = list([[random.randint(0, m - 1) for i in range(m)]
                    for j in range(group)])
    k_train_set = []
    k_test_set = []
    # Generate the groups of training and test sets
    for i in range(group):
        train_set = []
        test_set = []
        # Draw m samples (with replacement) into the training set
        for j in ran_lis[i]:
            train_set.append(data_set[j])
        # Put the remaining rows into the test set
        for data in data_set:
            if data not in train_set:
                test_set.append(data)
        k_train_set.append(train_set)
        k_test_set.append(test_set)
    # Return the groups of training sets, test sets, and the group count
    return k_train_set, k_test_set, group
```
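A well-known property of bootstrap sampling is that each row has probability (1 - 1/m)^m ≈ 1/e ≈ 0.368 of never being drawn in the m draws, so roughly 36.8% of the data ends up in the test set. A small simulation (the function name is illustrative) can be used to check this:

```python
import random

def oob_fraction(m, trials=200):
    """Average fraction of indices never drawn in m draws with replacement."""
    total = 0.0
    for _ in range(trials):
        # The set of distinct indices that were drawn at least once
        drawn = {random.randint(0, m - 1) for _ in range(m)}
        total += (m - len(drawn)) / m
    return total / trials

frac = oob_fraction(100)
# frac should be close to (1 - 1/100) ** 100, about 0.366
```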

Other auxiliary programs, such as reading the dataset file, saving the generated training and test sets to files, and the driver that calls these three functions, are not covered here. All of the files can be downloaded via the link.