Preface: please make sure you know the definitions of unsupervised learning and clustering.
K-mean clustering algorithm
K-mean, also known as k-means algorithm, means that the data set is divided into K clusters, and each cluster takes the mean value of the data set belonging to the cluster as the center (or centroid) of the cluster. The flow of k-mean algorithm is roughly as follows:
First, K initial clustering centers are randomly selected.
Calculate the distance from each point to each cluster center, and classify each point into the nearest cluster.
Recalculate the respective centers with the data of each cluster.
Repeat the processes 2 and 3 until the cluster center no longer changes or exceeds the maximum number of iterations.
It can be seen that the K-mean algorithm is relatively simple and easy to implement. However, it also has some disadvantages. The advantages and disadvantages of K-mean algorithm will be discussed later.
In the above process, distance calculation is mentioned in the second step. Next, distance calculation is discussed first.
The choice of distance measurement is different in different cases. Here are three common methods:
Implementation of K-mean algorithm
Those that need data sets can be private.
import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler
The StandardScal package is used to standardize data sets. For manual implementation of Z-score standardization, see:
Each of the following steps is compared with the above steps.
Step 1: randomly select the initial centroid
# Randomly select the sample as the initial centroid def init_centroid(daraSet,K): randIdx = np.random.permutation(daraSet.shape) centroids = daraSet[randIdx[:K]] return centroids
Step 2: calculate Euclidean distance and find the nearest centroid
# Find the nearest centroid index def find_closed_centroid(dataSet,centroids): K = centroids.shape # Number of centroids idx = np.zeros(dataSet.shape,dtype=int) # Index of the sample to the nearest centroid for i in range(dataSet.shape): # Calculate the Euclidean distance from a sample to all centroids diff = np.tile(dataSet[i],(K,1)) - centroids # tile can copy matrix and check usage squaredDist = np.sum(diff**2,axis=1) # axis = 1, calculate the sum of each line distance = squaredDist ** 0.5 # Select the particle index with the smallest distance idx[i] = np.argmin(distance) return idx
Step 3: recalculate the centroid according to the sorted clusters
# Calculate the center of mass def compute_centroids(dataSet,idx,K,oldCentroids): m,n = dataSet.shape centroids = np.zeros((K, n)) for k in range(K): # Extract samples belonging to the same category points = np.array(dataSet[np.where(idx == k)]) # Calculate the center of mass centroids[k] = points.mean(axis=0) # Calculate the mean value of each column # Calculate variation changed = centroids - oldCentroids return centroids,changed
Step 4: combine K-mean modules
# Combination K-mean def run_Kmean(dataSet,K,max_iters = 100): m,n = dataSet.shape idx =  # Initialize particle centroids = init_centroid(dataSet,K) for i in range(max_iters): # Find the nearest particle (Group) idx = find_closed_centroid(dataSet,centroids) # Calculate new particle centroids,changed = compute_centroids(dataSet,idx,K,centroids) if changed.all() == 0: break return idx
# mapping def Kmean_plot(dataSet,idx,K): for k in range(K): # Extract samples belonging to the same category points = np.array(dataSet[np.where(idx == k)]) plt.scatter(points[:, 0], points[:, 1]) plt.show()
Main function (including data standardization)
if __name__ == '__main__': dataSet = np.load('ex7_X.npy') # K-mean needs data standardization scaler = StandardScaler() dataSet = scaler.fit_transform(dataSet) K = 3 classification = run_Kmean(dataSet,K) Kmean_plot(dataSet,classification,K)
Image compression using K-mean
Now use K-mean to do some interesting things!
We need to change this image into an image composed of only eight colors.
The picture needs to be converted into a 2-dimensional matrix. This picture isYes, it needs to be changed to And needs to be normalized.
original_img = plt.imread('bird_small.png') # normalization original_img/=255 # Convert the image into a two-dimensional matrix of m*3, where m is the image pixel, where m should be 128 * 128. X_img = np.reshape(original_img, (original_img.shape * original_img.shape, 3)) # print(X_img.shape)
You need to run_ Add a centroids to the return value of kmean, and the rest of the code remains unchanged.
K = 8 # max_iters = 10 idx,centroids = run_Kmean(X_img,K) # Now, overwrite each pixel with the value of the centroid of the cluster to which it belongs X_recovered = centroids[idx] X_recovered = np.reshape(X_recovered, original_img.shape)
# Display original image fig, ax = plt.subplots(1, 2, figsize=(8, 8)) plt.axis('off') ax.imshow(original_img * 255) ax.set_title('Original') ax.set_axis_off() # Do not display axis # Display compressed image ax.imshow(X_recovered * 255) ax.set_title('Compressed with %d colours' % K) ax.set_axis_off() plt.show()
Using sklearn to implement K-mean
I found a picture on the Internet and compressed it!
When writing the code, I found a very strange problem. This picture actually has four layers. When it becomes a two-dimensional array, I need to pay attention to that it does not become m*3 but m*4. (it was later found that as long as the Q screenshot was taken, it was all 4-layer.)
import numpy as np import matplotlib.pyplot as plt from sklearn.cluster import KMeans if __name__ == '__main__': original_img = plt.imread('color.png') print("Shape of original_img is:", original_img.shape) original_img /= 255 X_img = np.reshape(original_img, (original_img.shape * original_img.shape, 4)) # model training K = 8 model = KMeans(n_clusters=K) model.fit(X_img) centroids = model.cluster_centers_ # labels gets the centroid index labels = model.predict(X_img) # print(labels[:6]) # Replacement sample X_recovered = centroids[labels] X_recovered = np.reshape(X_recovered, original_img.shape) plt.imshow(X_recovered*255) plt.axis('off') plt.show()
Disadvantages of K-mean algorithm
K in the algorithm is given in advance, but the choice of K is difficult to estimate in many cases, and it is often unknown how many classes the data set should be divided into.
If the initial centroid of K-mean is not selected well, it may not get effective clustering results. Local convergence rather than global convergence may occur. In order to solve this problem, a binary k-mean algorithm is proposed. Experiments show that the clustering effect of the binary k-mean algorithm is better than that of the ordinary k-mean algorithm.
Convergence is slow when the dataset is large.
If non spherical data is encountered, K-mean algorithm is not applicable.
Source of this figure: The principle and python implementation of K-means clustering algorithm