catalogue
Implementation of Kmean algorithm
Step 1: randomly select the initial centroid
Step 2: calculate Euclidean distance and find the nearest centroid
Step 3: recalculate the centroid according to the sorted clusters
Step 4: combine Kmean modules
Main function (including data standardization)
Image compression using Kmean
Using sklearn to implement Kmean
Disadvantages of Kmean algorithm
Preface: please make sure you know the definitions of unsupervised learning and clustering.
Kmean clustering algorithm
Kmean, also known as kmeans algorithm, means that the data set is divided into K clusters, and each cluster takes the mean value of the data set belonging to the cluster as the center (or centroid) of the cluster. The flow of kmean algorithm is roughly as follows:

First, K initial clustering centers are randomly selected.

Calculate the distance from each point to each cluster center, and classify each point into the nearest cluster.

Recalculate the respective centers with the data of each cluster.

Repeat the processes 2 and 3 until the cluster center no longer changes or exceeds the maximum number of iterations.
It can be seen that the Kmean algorithm is relatively simple and easy to implement. However, it also has some disadvantages. The advantages and disadvantages of Kmean algorithm will be discussed later.
In the above process, distance calculation is mentioned in the second step. Next, distance calculation is discussed first.
Distance calculation
The choice of distance measurement is different in different cases. Here are three common methods:
Euclidean distance
Manhattan distance
cosine similarity
Implementation of Kmean algorithm
Those that need data sets can be private.
Required packages:
import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler
The StandardScal package is used to standardize data sets. For manual implementation of Zscore standardization, see:
Multiple linear regression model_ Twilight Sparkle's blog  CSDN blog
Each of the following steps is compared with the above steps.
Step 1: randomly select the initial centroid
# Randomly select the sample as the initial centroid def init_centroid(daraSet,K): randIdx = np.random.permutation(daraSet.shape[0]) centroids = daraSet[randIdx[:K]] return centroids
Step 2: calculate Euclidean distance and find the nearest centroid
# Find the nearest centroid index def find_closed_centroid(dataSet,centroids): K = centroids.shape[0] # Number of centroids idx = np.zeros(dataSet.shape[0],dtype=int) # Index of the sample to the nearest centroid for i in range(dataSet.shape[0]): # Calculate the Euclidean distance from a sample to all centroids diff = np.tile(dataSet[i],(K,1))  centroids # tile can copy matrix and check usage squaredDist = np.sum(diff**2,axis=1) # axis = 1, calculate the sum of each line distance = squaredDist ** 0.5 # Select the particle index with the smallest distance idx[i] = np.argmin(distance) return idx
Step 3: recalculate the centroid according to the sorted clusters
# Calculate the center of mass def compute_centroids(dataSet,idx,K,oldCentroids): m,n = dataSet.shape centroids = np.zeros((K, n)) for k in range(K): # Extract samples belonging to the same category points = np.array(dataSet[np.where(idx == k)]) # Calculate the center of mass centroids[k] = points.mean(axis=0) # Calculate the mean value of each column # Calculate variation changed = centroids  oldCentroids return centroids,changed
Step 4: combine Kmean modules
# Combination Kmean def run_Kmean(dataSet,K,max_iters = 100): m,n = dataSet.shape idx = [] # Initialize particle centroids = init_centroid(dataSet,K) for i in range(max_iters): # Find the nearest particle (Group) idx = find_closed_centroid(dataSet,centroids) # Calculate new particle centroids,changed = compute_centroids(dataSet,idx,K,centroids) if changed.all() == 0: break return idx
mapping
# mapping def Kmean_plot(dataSet,idx,K): for k in range(K): # Extract samples belonging to the same category points = np.array(dataSet[np.where(idx == k)]) plt.scatter(points[:, 0], points[:, 1]) plt.show()
Main function (including data standardization)
if __name__ == '__main__': dataSet = np.load('ex7_X.npy') # Kmean needs data standardization scaler = StandardScaler() dataSet = scaler.fit_transform(dataSet) K = 3 classification = run_Kmean(dataSet,K) Kmean_plot(dataSet,classification,K)
Operation results:
Image compression using Kmean
Now use Kmean to do some interesting things!
Original image:
We need to change this image into an image composed of only eight colors.
Data preprocessing
The picture needs to be converted into a 2dimensional matrix. This picture isYes, it needs to be changed to And needs to be normalized.
original_img = plt.imread('bird_small.png') # normalization original_img/=255 # Convert the image into a twodimensional matrix of m*3, where m is the image pixel, where m should be 128 * 128. X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 3)) # print(X_img.shape)
Compressed image
You need to run_ Add a centroids to the return value of kmean, and the rest of the code remains unchanged.
K = 8 # max_iters = 10 idx,centroids = run_Kmean(X_img,K) # Now, overwrite each pixel with the value of the centroid of the cluster to which it belongs X_recovered = centroids[idx] X_recovered = np.reshape(X_recovered, original_img.shape)
Draw image
# Display original image fig, ax = plt.subplots(1, 2, figsize=(8, 8)) plt.axis('off') ax[0].imshow(original_img * 255) ax[0].set_title('Original') ax[0].set_axis_off() # Do not display axis # Display compressed image ax[1].imshow(X_recovered * 255) ax[1].set_title('Compressed with %d colours' % K) ax[1].set_axis_off() plt.show()
Results:
Using sklearn to implement Kmean
I found a picture on the Internet and compressed it!
When writing the code, I found a very strange problem. This picture actually has four layers. When it becomes a twodimensional array, I need to pay attention to that it does not become m*3 but m*4. (it was later found that as long as the Q screenshot was taken, it was all 4layer.)
import numpy as np import matplotlib.pyplot as plt from sklearn.cluster import KMeans if __name__ == '__main__': original_img = plt.imread('color.png') print("Shape of original_img is:", original_img.shape) original_img /= 255 X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 4)) # model training K = 8 model = KMeans(n_clusters=K) model.fit(X_img) centroids = model.cluster_centers_ # labels gets the centroid index labels = model.predict(X_img) # print(labels[:6]) # Replacement sample X_recovered = centroids[labels] X_recovered = np.reshape(X_recovered, original_img.shape) plt.imshow(X_recovered*255) plt.axis('off') plt.show()
After compression:
Disadvantages of Kmean algorithm

K in the algorithm is given in advance, but the choice of K is difficult to estimate in many cases, and it is often unknown how many classes the data set should be divided into.

If the initial centroid of Kmean is not selected well, it may not get effective clustering results. Local convergence rather than global convergence may occur. In order to solve this problem, a binary kmean algorithm is proposed. Experiments show that the clustering effect of the binary kmean algorithm is better than that of the ordinary kmean algorithm.

Convergence is slow when the dataset is large.

If non spherical data is encountered, Kmean algorithm is not applicable.
Source of this figure: The principle and python implementation of Kmeans clustering algorithm