Foreword
For a senior who is about to graduate, suddenly switching from back-end development to machine learning is very hard, especially regression. The math is brutal!
Because the regression formulas are too difficult for now, I'm starting with a classification algorithm instead: KNN (k-nearest neighbors).
Introduction to the Algorithm
What is KNN?
I won't give the formal definition here. Put simply, you find the k points closest to a given point (these k points have already been classified), then pick the class that occurs most often among them. That class is the prediction.
From this description, the algorithm has three steps:
- Calculate the distance from the target point to every other point
- Sort the distances in ascending order
- Take the first k entries of the sorted result and find the class that occurs most often among them
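The three steps above can be sketched in plain Python before bringing in NumPy. This is only a minimal illustration: the function name `knn_predict_one` and the toy points are made up for this example, and it uses Manhattan distance as an arbitrary choice.

```python
from collections import Counter

def knn_predict_one(points, labels, target, k):
    """Predict the label of `target` from its k nearest labeled points."""
    # Step 1: distance from the target to every labeled point (Manhattan here).
    distances = [
        sum(abs(p_i - t_i) for p_i, t_i in zip(p, target))
        for p in points
    ]
    # Step 2: indices of the points, sorted by ascending distance.
    order = sorted(range(len(points)), key=lambda i: distances[i])
    # Step 3: the most frequent label among the k nearest points.
    nearest_labels = [labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: three points near (1, 1) and two near (5, 5).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5), (0.9, 1.1)]
labels = ["a", "a", "b", "b", "a"]
prediction = knn_predict_one(points, labels, (1.1, 1.0), k=3)
print(prediction)
```

With k=3 the three nearest points all carry the label "a", so that is what gets predicted.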
It looks quite simple, so let's write the code.
Code
Calculate distance
There are two ways to calculate the distance here: Manhattan distance and Euclidean distance.
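For reference, the two distances between points a and b with n features are usually written as below. The symbols a, b, n and i are notation chosen for this note, not names from the code.

```latex
d_{\text{Manhattan}}(a, b) = \sum_{i=1}^{n} |a_i - b_i|
\qquad
d_{\text{Euclidean}}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```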
A few NumPy functions are needed first:
np.sum
The summation function. NumPy arrays support element-wise arithmetic between two vectors, and also between a matrix and a vector (provided the number of columns of the matrix equals the length of the vector, in which case the vector is applied to every row). np.sum then adds up the elements of the result.
axis=1 means the sum is taken along each row, so every row collapses to a single value. For example, a row [1,2,4] becomes 7, and a matrix with m rows gives m sums.
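To make the axis behaviour concrete, here is a small sketch with made-up numbers. The arrays `a` and `b` are just illustrative values.

```python
import numpy as np

a = np.array([[1, 2, 4],
              [0, 5, 1]])   # a matrix with 2 rows and 3 columns
b = np.array([1, 1, 1])     # a vector whose length equals the column count

diff = a - b                      # b is subtracted from every row of a
row_sums = np.sum(diff, axis=1)   # one sum per row
col_sums = np.sum(diff, axis=0)   # one sum per column
total = np.sum(diff)              # no axis: sum of every element

print(diff.tolist())      # [[0, 1, 3], [-1, 4, 0]]
print(row_sums.tolist())  # [4, 3]
print(col_sums.tolist())  # [-1, 5, 3]
print(int(total))         # 7
```

The distance functions in this post rely on the axis=1 case: each row is one training point, so the result holds one distance per training point.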
np.abs
Computes the absolute value of each element.
import numpy as np

# Distance function: Manhattan distance
def distance1(a, b):
    # When a is a matrix and b is a vector, b is subtracted from each row of a.
    # axis=1 then sums each row, giving one distance per row of a.
    return np.sum(np.abs(a - b), axis=1)

# Distance function: Euclidean distance
def distance2(a, b):
    # Square the differences, sum each row, then take the square root.
    return np.sqrt(np.sum((a - b) ** 2, axis=1))
Load and split the dataset
After using Java and then switching to Python, all you can say is: Python, you are my god!
You don't need to write your own code to split the dataset. Just import a library and you're done.
import pandas as pd
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.datasets import load_iris  # a ready-made classification dataset

# Load and preprocess the data
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target'] = df['target'].map({0: iris.target_names[0], 1: iris.target_names[1], 2: iris.target_names[2]})
x = iris.data
y = iris.target
y = y.reshape(-1, 1)

# test_size: fraction of the data that goes into the test set
# random_state: seed for the random shuffle, so the split is reproducible
# stratify: keep the class proportions of y the same in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, stratify=y)
Classifier (to predict which class the target point belongs to)
# Classifier
class KNN(object):
    # Initialization method; self refers to the instance of the class
    def __init__(self, k_neighbors=1, disufunc=distance1):
        self.k_neighbors = k_neighbors
        self.disufunc = disufunc

    # Training: KNN only stores the data. This could also be done in the
    # constructor, depending on what you prefer.
    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    # Model prediction
    def predict(self, x):
        # Initialize the result array: (x.shape[0], 1) is its shape (rows, columns),
        # dtype is the element type
        y_pred = np.zeros((x.shape[0], 1), dtype=self.y_train.dtype)

        # enumerate yields (index, value) pairs for each row of x
        for index, x_val in enumerate(x):
            # Distance from this test point to every training point
            distance = self.disufunc(self.x_train, x_val)
            # argsort returns the original indices in ascending order of distance
            nn_index = np.argsort(distance)
            # Classes of the k nearest training points
            nn_y = self.y_train[nn_index[:self.k_neighbors]].ravel()
            # bincount counts occurrences of each value (counts[3] == 4 means the
            # value 3 occurred 4 times); argmax returns the value with the highest count
            y_pred[index] = np.argmax(np.bincount(nn_y))

        return y_pred
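The voting step inside predict is the least obvious part, so here is a small standalone sketch of how argsort, bincount and argmax work together. The distances and labels are made-up values, and k=3 is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical distances from one test point to four training points,
# and the integer class label of each of those training points.
distance = np.array([0.9, 0.1, 0.5, 0.3])
labels = np.array([2, 1, 1, 0])

nn_index = np.argsort(distance)    # indices ordered from nearest to farthest
nearest = labels[nn_index[:3]]     # labels of the 3 nearest points
counts = np.bincount(nearest)      # counts[v] = how many times label v appears
winner = int(np.argmax(counts))    # label with the highest count

print(nn_index.tolist())  # [1, 3, 2, 0]
print(nearest.tolist())   # [1, 0, 1]
print(counts.tolist())    # [1, 2]
print(winner)             # 1
```

Note that np.bincount only accepts non-negative integers, which is why this approach works with the integer iris targets. When two classes tie, np.argmax returns the first maximum, so the smaller label wins.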
Test
from sklearn.metrics import accuracy_score

knn = KNN(k_neighbors=5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Prediction accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
That is all the code, and it is quite simple.
Summary
I learned how to use a lot of NumPy functions along the way, and Python really is very pleasant to use, so I wrote them all down.
Many things still need to be reviewed and summarized in blog posts.