Implementation and application of K-means clustering

Table of contents

K-means clustering algorithm

Distance calculation

Euclidean distance

Manhattan distance

Cosine similarity

Implementation of the K-means algorithm

Step 1: randomly select the initial centroids

Step 2: calculate Euclidean distances and find the nearest centroid

Step 3: recalculate the centroids from the assigned clusters

Step 4: combine the K-means modules

Plotting

Main function (including data standardization)

Results

Image compression using K-means

Data preprocessing

Compressing the image

Drawing the images

Using sklearn to implement K-means

Disadvantages of the K-means algorithm

Preface: please make sure you are familiar with the definitions of unsupervised learning and clustering before reading on.

K-means clustering algorithm

K-means divides a data set into K clusters, where each cluster takes the mean of the samples assigned to it as its center (or centroid). The flow of the k-means algorithm is roughly as follows:

  1. Randomly select K initial cluster centers.

  2. Compute the distance from each point to every cluster center and assign each point to the nearest cluster.

  3. Recompute each cluster's center from the points assigned to it.

  4. Repeat steps 2 and 3 until the cluster centers no longer change or the maximum number of iterations is exceeded.

As you can see, the K-means algorithm is simple and easy to implement. It does, however, have some drawbacks; the advantages and disadvantages of K-means are discussed later.

Step 2 above mentions distance calculation, so let's look at that first.

Distance calculation

The appropriate distance measure differs from case to case. Three common choices are listed below, with a small code sketch after them:

Euclidean distance

d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )

Manhattan distance

d(x, y) = Σᵢ |xᵢ − yᵢ|

Cosine similarity

cos(x, y) = (x · y) / (‖x‖ ‖y‖)
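A minimal NumPy sketch of the three measures (the function names here are mine, for illustration only):

import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))          # 5.0
print(manhattan(x, y))          # 7.0
print(cosine_similarity(x, y))  # about 0.992

The K-means implementation below uses the Euclidean distance.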

Implementation of the K-means algorithm

If you need the data set, you can send me a private message.

Required packages:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

The StandardScaler class is used to standardize the data set. For a manual implementation of Z-score standardization, see:

Multiple linear regression model - Twilight Sparkle's blog - CSDN blog
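For reference, a minimal sketch of the Z-score standardization that StandardScaler performs (column-wise: subtract the mean, divide by the standard deviation; this assumes dataSet is a NumPy array):

def z_score(dataSet):
    # Standardize each column to zero mean and unit variance
    return (dataSet - dataSet.mean(axis=0)) / dataSet.std(axis=0)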

Each of the following steps corresponds to the numbered steps above.

Step 1: randomly select the initial centroids

# Randomly pick K samples as the initial centroids
def init_centroid(dataSet, K):
    # Shuffle the row indices and take the first K rows
    randIdx = np.random.permutation(dataSet.shape[0])
    centroids = dataSet[randIdx[:K]]
    return centroids

Step 2: calculate Euclidean distances and find the nearest centroid

# Find the index of the nearest centroid for every sample
def find_closed_centroid(dataSet, centroids):
    K = centroids.shape[0] # Number of centroids
    idx = np.zeros(dataSet.shape[0], dtype=int) # Index of the nearest centroid for each sample
    for i in range(dataSet.shape[0]):
        # Compute the Euclidean distance from this sample to every centroid
        diff = np.tile(dataSet[i], (K, 1)) - centroids # np.tile repeats the sample K times to match the centroid matrix
        squaredDist = np.sum(diff**2, axis=1) # axis=1: sum along each row
        distance = squaredDist ** 0.5
        # Pick the index of the centroid with the smallest distance
        idx[i] = np.argmin(distance)
    return idx
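For larger data sets, the per-sample loop above can be replaced by a fully vectorized version using broadcasting. A sketch of an equivalent variant (my own, not part of the original code):

# Vectorized variant: broadcasting computes all sample-to-centroid distances at once
def find_closed_centroid_vec(dataSet, centroids):
    diff = dataSet[:, None, :] - centroids        # shape (m, K, n)
    dist = np.sqrt(np.sum(diff ** 2, axis=2))     # shape (m, K)
    return np.argmin(dist, axis=1)                # nearest centroid index per sample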

Step 3: recalculate the centroids from the assigned clusters

# Recompute the centroids
def compute_centroids(dataSet, idx, K, oldCentroids):
    m, n = dataSet.shape
    centroids = np.zeros((K, n))
    for k in range(K):
        # Extract the samples assigned to cluster k
        points = dataSet[idx == k]
        # The new centroid is the mean of each column (an empty cluster would yield NaN here)
        centroids[k] = points.mean(axis=0)
    # How much each centroid moved
    changed = centroids - oldCentroids
    return centroids, changed

Step 4: combine the K-means modules

# Combine into the full K-means routine
def run_Kmean(dataSet, K, max_iters=100):
    m, n = dataSet.shape
    idx = []
    # Initialize the centroids
    centroids = init_centroid(dataSet, K)
    for i in range(max_iters):
        # Assign each sample to its nearest centroid
        idx = find_closed_centroid(dataSet, centroids)
        # Recompute the centroids
        centroids, changed = compute_centroids(dataSet, idx, K, centroids)
        # Stop once no centroid has moved
        if not changed.any():
            break
    return idx

Plotting

# Plot the clusters
def Kmean_plot(dataSet, idx, K):
    for k in range(K):
        # Extract the samples assigned to cluster k and scatter-plot them
        points = dataSet[idx == k]
        plt.scatter(points[:, 0], points[:, 1])
    plt.show()

Main function (including data standardization)

if __name__ == '__main__':
    dataSet = np.load('ex7_X.npy')
    # K-means benefits from standardized data
    scaler = StandardScaler()
    dataSet = scaler.fit_transform(dataSet)
    K = 3
    classification = run_Kmean(dataSet, K)
    Kmean_plot(dataSet, classification, K)

Results:

Image compression using K-means

Now let's use K-means to do something fun!

Original image:

We need to turn this image into one composed of only eight colors.

Data preprocessing

The picture has to be converted into a two-dimensional matrix. This picture is 128 x 128 x 3 (RGB), so it needs to be reshaped to 16384 x 3, and the values need to be normalized.

    original_img = plt.imread('bird_small.png')
    # Normalization. Note: for PNG files plt.imread already returns floats in [0, 1],
    # so this division is redundant; it is undone by the *255 when plotting below
    original_img /= 255
    # Convert the image into an m*3 matrix, where m is the number of pixels (here m = 128 * 128)
    X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 3))
    # print(X_img.shape)

Compressing the image

You need to add centroids to the return value of run_Kmean; the rest of the code stays unchanged.

    K = 8
    # max_iters = 10
    idx, centroids = run_Kmean(X_img, K)
    # Replace each pixel with the centroid of the cluster it belongs to (fancy indexing)
    X_recovered = centroids[idx]
    X_recovered = np.reshape(X_recovered, original_img.shape)

Drawing the images

    # Display the original image
    fig, ax = plt.subplots(1, 2, figsize=(8, 8))

    ax[0].imshow(original_img * 255)
    ax[0].set_title('Original')
    ax[0].set_axis_off() # Hide the axes

    # Display the compressed image
    ax[1].imshow(X_recovered * 255)
    ax[1].set_title('Compressed with %d colours' % K)
    ax[1].set_axis_off()
    plt.show()

Results:

Using sklearn to implement K-means

I found a picture online and compressed it!

While writing the code I ran into an odd issue: this picture actually has four channels (RGBA), so when flattening it into a two-dimensional array, note that it becomes m*4 rather than m*3. (I later found that any screenshot taken with QQ has four channels.)
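If you want the reshape to work for 3-channel and 4-channel images alike, one option (my own tweak, not in the original code) is to read the channel count from the image shape:

# Channel-agnostic reshape: handles RGB (m*3) and RGBA (m*4) images
n_channels = original_img.shape[-1]
X_img = np.reshape(original_img, (-1, n_channels))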

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

if __name__ == '__main__':
    original_img = plt.imread('color.png')
    print("Shape of original_img is:", original_img.shape)
    # For PNG files plt.imread already returns floats in [0, 1]; this division is
    # redundant but harmless, since the plot below multiplies by 255 again
    original_img /= 255
    X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 4))
    # Model training
    K = 8
    model = KMeans(n_clusters=K)
    model.fit(X_img)
    centroids = model.cluster_centers_
    # predict returns the index of the nearest centroid for each pixel
    # (equivalent to model.labels_, since we predict on the training data)
    labels = model.predict(X_img)
    # print(labels[:6])
    # Replace each pixel with its cluster centroid
    X_recovered = centroids[labels]
    X_recovered = np.reshape(X_recovered, original_img.shape)

    plt.imshow(X_recovered * 255)
    plt.axis('off')
    plt.show()

After compression:

Disadvantages of the K-means algorithm

  • K must be given in advance, but a good value of K is often hard to estimate: it is frequently unknown how many classes the data set should be divided into (see the sketch after this list).

  • If the initial centroids are chosen badly, K-means may fail to produce a useful clustering: it can converge to a local optimum rather than the global one. The bisecting k-means algorithm was proposed to address this, and experiments show that its clustering results are better than those of plain k-means.

  • Convergence is slow when the data set is large.

  • K-means is not suitable for non-spherical (non-convex) clusters.
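As a rough aid for the first point, a common heuristic is the "elbow method": plot the within-cluster sum of squared distances (sklearn's inertia_) against K and look for the bend. A sketch, reusing X_img, KMeans, and plt from the sklearn example above:

# Elbow heuristic: inertia (within-cluster sum of squares) for a range of K values
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k).fit(X_img)
    inertias.append(model.inertia_)
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()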

Figure source: The principle and Python implementation of the K-means clustering algorithm
