Getting started with machine learning (but the door is welded shut): implementing the KNN classification algorithm


For a senior who is about to graduate, suddenly switching from back-end development to machine learning is hard, especially regression. The mathematics is very difficult!

Because the regression formulas are too difficult, I'll start with a classification algorithm instead: KNN.

Introduction to the algorithm

What is KNN?
I'll skip the formal definition. Put simply, KNN (k-nearest neighbors) finds the k points closest to a target point, where those k points have already been classified, and picks the class that occurs most often among them. That class is the prediction.

From the description above, the algorithm has three steps:

  • Calculate the distance from the target point to every other point
  • Sort the distances in ascending order
  • Take the first k entries of the sorted result and find the class that occurs most often among them

It sounds quite simple, so let's write the code.
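Before bringing in NumPy, the three steps above can be sketched in plain Python. This is only an illustration: the toy points, the labels "red" and "blue", and the helper name knn_predict are made up for this example and are not part of the implementation that follows.

```python
from collections import Counter

# Hypothetical toy data: 2D points that already carry a class label.
labeled_points = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"),
    ((7.0, 5.5), "blue"),
]

def knn_predict(target, points, k):
    # Step 1: distance from the target to every labeled point
    # (Manhattan distance: sum of absolute coordinate differences).
    distances = [
        (sum(abs(t - p) for t, p in zip(target, point)), label)
        for point, label in points
    ]
    # Step 2: sort by distance, smallest first.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: majority vote among the labels of the k nearest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict((2.0, 2.0), labeled_points, k=3)
print(prediction)  # red: the three nearest points are all red
```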


Calculate distance

There are two ways to calculate the distance here: Manhattan distance and Euclidean distance.

A few NumPy functions are used:
np.sum sums the elements of an array. The expression a - b that is passed to it works for two vectors and also for a matrix and a vector (provided the number of columns of the matrix equals the dimension of the vector), in which case the vector is subtracted from every row.

axis=1 means the sum is taken along each row, leaving one value per row. For example, a row [1,2,4] becomes 7.

np.abs takes the absolute value of every element.
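To make the axis argument concrete, here is a small sketch with made-up arrays a and b. It only shows how broadcasting, np.abs and np.sum behave on a 2x3 matrix and a 3-element vector.

```python
import numpy as np

a = np.array([[1, 2, 4],
              [0, 5, 1]])    # matrix: 2 rows, 3 columns
b = np.array([1, 1, 1])      # vector with 3 elements

diff = a - b                 # b is subtracted from every row of a
abs_diff = np.abs(diff)      # element-wise absolute value

row_sums = np.sum(a, axis=1)   # one sum per row:    7 and 6
col_sums = np.sum(a, axis=0)   # one sum per column: 1, 7 and 5

print(diff)
print(abs_diff)
print(row_sums, col_sums)
```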

import numpy as np

# Distance function: Manhattan distance
def distance1(a, b):
    # a is a matrix and b is a vector: b is subtracted from every row of a,
    # and axis=1 sums the absolute differences of each row into one distance per row
    return np.sum(np.abs(a - b), axis=1)

# Distance function: Euclidean distance
def distance2(a, b):
    return np.sqrt(np.sum((a - b) ** 2, axis=1))
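A quick way to see the difference between the two distances is a 3-4-5 triangle. The sketch below uses stand-in functions named manhattan and euclidean (hypothetical names that mirror distance1 and distance2) and a few made-up points.

```python
import numpy as np

def manhattan(a, b):
    # Sum of absolute coordinate differences, one value per row of a
    return np.sum(np.abs(a - b), axis=1)

def euclidean(a, b):
    # Square root of the sum of squared differences, one value per row of a
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

points = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])
origin = np.array([0.0, 0.0])

m = manhattan(points, origin)   # 3 + 4 = 7 for the first point
e = euclidean(points, origin)   # sqrt(9 + 16) = 5 for the first point
print(m, e)
```

The two distances agree when the points differ along a single axis and diverge otherwise, which can change which neighbors count as "nearest".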

Load and split the dataset

After using Java and then switching to Python, you will want to say: Python, you are my god!
You don't need to write your own code to split the dataset. Just import a library and you're done.

import pandas as pd
from sklearn.model_selection import train_test_split  # splits data into a training set and a test set
from sklearn.datasets import load_iris  # a ready-made classification dataset

# Data loading and preprocessing
iris = load_iris()

# Assumption: the features come from iris.data and the integer labels from iris.target
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target'] = df['target'].map({0: iris.target_names[0], 1: iris.target_names[1], 2: iris.target_names[2]})

x = iris.data
y = iris.target
y = y.reshape(-1, 1)
# test_size: fraction of the data that goes into the test set
# random_state: seed for the shuffle, so the split is reproducible
# stratify: keep the class proportions of y the same in the training set and the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, stratify=y)
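To check what the split actually does, the sketch below repeats it and looks at the sizes and the per-class counts. It assumes x is iris.data and y is iris.target; the variable names train_counts and test_counts are made up for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x = iris.data                    # assumption: features taken from iris.data
y = iris.target.reshape(-1, 1)   # assumption: integer labels from iris.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1, stratify=y)

# 150 samples with test_size=0.3 give 105 training rows and 45 test rows
print(x_train.shape, x_test.shape)

# With stratify=y, each of the three classes keeps the same share in both sets
train_counts = np.bincount(y_train.ravel())
test_counts = np.bincount(y_test.ravel())
print(train_counts, test_counts)
```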

Classifier (to predict which class the target point belongs to)

# Classifier
class KNN(object):
    # Initialization method; self refers to the instance of the class
    def __init__(self, k_neighbors=1, disufunc=distance1):
        self.k_neighbors = k_neighbors
        self.disufunc = disufunc

    # "Training": KNN only stores the training data (it could equally be passed to the constructor)
    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    # Model prediction
    def predict(self, x):
        # Result array with one row per sample in x and one column, using the dtype of the training labels
        y_pred = np.zeros((x.shape[0], 1), dtype=self.y_train.dtype)
        for index, x_val in enumerate(x):  # enumerate yields (index, value) pairs
            distance = self.disufunc(self.x_train, x_val)  # call the distance function defined earlier
            # argsort returns the indices that would sort the distances in ascending order
            nn_index = np.argsort(distance)
            # Labels of the k nearest training points
            nn_y = self.y_train[nn_index[:self.k_neighbors]].ravel()
            # bincount counts how often each label occurs (counts[3] == 4 means label 3 occurred 4 times);
            # argmax returns the label with the highest count
            y_pred[index] = np.argmax(np.bincount(nn_y))
        return y_pred
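The voting step inside predict leans on three NumPy calls, so here is a small sketch of them in isolation. The distances and labels are made up, and the helper name vote is hypothetical. One thing worth knowing: np.argmax returns the first maximum, so a tie goes to the smallest label.

```python
import numpy as np

distance = np.array([0.9, 0.1, 0.5, 0.3])   # hypothetical distances to 4 training points
labels = np.array([2, 0, 2, 1])             # class label of each training point

order = np.argsort(distance)   # indices from nearest to farthest: 1, 3, 2, 0

def vote(k):
    nearest_labels = labels[order[:k]]      # labels of the k nearest points
    counts = np.bincount(nearest_labels)    # counts[c] = how often class c occurs
    return int(np.argmax(counts))           # class with the highest count

print(vote(3))   # labels 0, 1, 2 each occur once, so the tie goes to label 0
print(vote(4))   # labels 0, 1, 2, 2: label 2 occurs twice and wins
```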


Prediction accuracy
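One way to measure accuracy is the fraction of test samples whose predicted class matches the true class. The sketch below is self-contained, so it repeats a condensed copy of the classifier and the Manhattan distance. It assumes x is iris.data and y is iris.target, and k_neighbors=3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def distance1(a, b):
    # Manhattan distance from vector b to every row of matrix a
    return np.sum(np.abs(a - b), axis=1)

class KNN:
    # Condensed copy of the classifier above
    def __init__(self, k_neighbors=1, disufunc=distance1):
        self.k_neighbors = k_neighbors
        self.disufunc = disufunc

    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    def predict(self, x):
        y_pred = np.zeros((x.shape[0], 1), dtype=self.y_train.dtype)
        for index, x_val in enumerate(x):
            nn_index = np.argsort(self.disufunc(self.x_train, x_val))
            nn_y = self.y_train[nn_index[:self.k_neighbors]].ravel()
            y_pred[index] = np.argmax(np.bincount(nn_y))
        return y_pred

iris = load_iris()
x = iris.data                    # assumption: features from iris.data
y = iris.target.reshape(-1, 1)   # assumption: integer labels from iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1, stratify=y)

knn = KNN(k_neighbors=3)         # k=3 is an arbitrary choice for this sketch
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Both arrays have shape (45, 1), so the comparison is element by element
accuracy = np.mean(y_pred == y_test)
print(f"accuracy: {accuracy:.3f}")
```

Trying a few values of k_neighbors, or swapping in the Euclidean distance, is an easy way to see how much these choices matter on this dataset.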

That completes all of the code, and it turned out to be quite simple.


I learned how to use a lot of numpy functions along the way, and Python really does feel easy to use, so I wrote them all down.

A lot of this still needs to be reviewed, and I should write another blog post to summarize it.

Tags: Algorithm Machine Learning

Posted by MagicMikey on Tue, 31 May 2022 15:41:30 +0530