Modeling and Analysis of Bilibili Video Data Based on R——Cluster Analysis

  • 0. Preface
  • 1. Data analysis
    • 1.1 Cluster analysis
    • 1.2 Cluster statistics
    • 1.3 Hierarchical clustering
    • 1.4 K-means and principal component analysis
  • 2. Reference materials

0. Preface

Lab environment

  • Python version: Python 3.9
  • PyCharm version: PyCharm 2021.1.3
  • R version: R-4.2.0
  • RStudio version: RStudio-2021.09.2-382

This experiment uses four data sets in total, but this article describes only one of them (data set 1). The same analysis is applied to each data set, and each contains about 110 records.

  • The data comes from the Heywhale (Hejing) Community

https://www.heywhale.com/mw/dataset/62a45d284619d87b3b2b9147/file

Data field description

  • title: the title of the video
  • duration: video duration
  • publisher: video author
  • descriptions: video description information
  • pub_time: video publishing time
  • view: the number of video plays
  • comments: the number of video comments
  • praise: the number of likes on the video
  • coins: the number of coins given to the video
  • favors: the number of times the video was added to favorites
  • forwarding: the number of times the video was shared

1. Data analysis

The data analysis covers three aspects: variable correlation analysis, cluster analysis, and factor analysis modeling.

The cluster analysis stage is described below.

1.1 Cluster analysis

This stage consists of cluster statistics, hierarchical clustering, and K-means with principal component analysis.

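# data1 is data set 1, assumed to be loaded earlier as a data frame (loading code not shown)
viewArr = data1[,6]          # column 6: view (play count)
forwardingArr = data1[,11]   # column 11: forwarding (share count)
view = c(viewArr)
forwarding = c(forwardingArr)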

1.2 Cluster statistics

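  • Euclidean distance
X = cbind(view, forwarding)   # two-column matrix: view and forwarding
dist(X)                       # Euclidean distance (the default), lower triangle only
dist(X, diag = TRUE)          # also print the main diagonal
dist(X, upper = TRUE)         # also print the upper triangle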

  • Manhattan distance
dist(X, method = "manhattan")   # sum of absolute coordinate differences

The Manhattan distance matrix of data set 1 is shown in the figure below:
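
As a quick check of what the two dist() methods compute, here is a minimal sketch on a made-up two-row matrix. The values and the name X_toy are hypothetical and are not taken from data set 1.

```r
# Toy matrix with hypothetical values; the columns mimic view and forwarding
X_toy <- rbind(a = c(view = 0, forwarding = 0),
               b = c(view = 3, forwarding = 4))

d_euc <- dist(X_toy)                        # Euclidean: sqrt(3^2 + 4^2)
d_man <- dist(X_toy, method = "manhattan")  # Manhattan: |3| + |4|

as.numeric(d_euc)  # 5
as.numeric(d_man)  # 7
```

Note that method = "manhattan" gives the Manhattan (city-block) distance. The Mahalanobis distance is a different measure that dist() does not offer; base R has a separate mahalanobis() function, which returns squared distances from a given center.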

1.3 Hierarchical clustering

Hierarchical clustering is applied to the play count (view) of data set 1.

  • Shortest distance method (single linkage):
hc=hclust(dist(data1[,6]),method = "single")
plot(hc)                     # dendrogram
cbind(hc$merge,hc$height)    # merge steps and the height of each merge
cutree(hc,5:1)               # cluster membership for k = 5, 4, ..., 1
  • Class average method (average linkage):
hc=hclust(dist(data1[,6]),method = "average")
plot(hc)
cbind(hc$merge,hc$height)
cutree(hc,5:1)
  • Intermediate distance method (median linkage):
hc=hclust(dist(data1[,6]),method = "median")
plot(hc)
cbind(hc$merge,hc$height)
cutree(hc,5:1)
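
To make the hclust() and cutree() output easier to read, here is a minimal sketch on six made-up play counts. The vector views_toy is hypothetical and was chosen so that two groups are obvious; it is not part of data set 1.

```r
# Hypothetical play counts: three small values and three large values
views_toy <- c(10, 12, 11, 500, 520, 510)

hc_toy <- hclust(dist(views_toy), method = "single")

# merge: which observations or clusters are joined at each step
# height: the distance at which each merge happens
cbind(hc_toy$merge, hc_toy$height)

# Cut the tree into 2 groups: observations 1-3 land in one group, 4-6 in the other
groups <- cutree(hc_toy, k = 2)
groups
```

Passing a vector such as 5:1 to cutree(), as in the code above, returns one column of memberships per requested number of clusters.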

1.4 K-means and principal component analysis

K-means is one of the most important clustering methods; it is applied here to data set 1.

  • Use the silhouette method to choose the number of clusters
library(factoextra)
fviz_nbclust(data1[6:11],kmeans,method = "silhouette")   # average silhouette width for each candidate k

The resulting plot is used to determine how many clusters the k-means method should produce for data set 1. The gap statistic can be plotted in the same way:

library(cluster)
gap_stat <- clusGap(data1[6:11], FUN = kmeans,K.max = 10)   # gap statistic for k = 1..10
fviz_gap_stat(gap_stat)
  • According to the figure above, data set 1 should be divided into 8 clusters

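The code is as follows:

km=kmeans(data1[6:11], 8)       # k-means with 8 clusters on columns 6 to 11
fviz_cluster(km,data1[6:11])    # plot the clusters on the first two principal components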
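
The silhouette criterion used above to pick the number of clusters can also be computed by hand with silhouette() from the cluster package. The sketch below uses made-up two-dimensional data (the name toy and its values are hypothetical, not data set 1). Because kmeans() starts from random centers, it also fixes the seed and uses several random starts so that the run is repeatable.

```r
library(cluster)  # provides silhouette()

set.seed(42)  # fix the random number generator so the run is repeatable

# Hypothetical data: two well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
d <- dist(toy)

# Average silhouette width for k = 2..5; nstart = 25 keeps the best of 25 starts
avg_sil <- sapply(2:5, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:5
avg_sil

# The k with the largest average silhouette width is the suggested choice
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

On real data the maximum is usually less clear-cut than in this toy case, which is one reason to compare it with a second criterion such as the gap statistic.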
  • View PCA loadings (principal component loadings)

The corrplot package in R provides a visual exploration tool for correlation matrices. It supports automatic variable reordering, which helps detect hidden patterns among variables.

corrplot is easy to use and offers many plotting options: visualization method, graph layout, colors, legends, text labels, and more. It can also display p-values and confidence intervals to help users judge the statistical significance of correlations.

# PCA is not defined in this article; its $loadings and $scores fields suggest a princomp() result
plot(as.matrix(data1[6:11]),col=km$cluster+1, pch=10)   # first two columns, colored by cluster
points(km$centers,col=3,pch ="*",cex=3)                  # cluster centers
library(corrplot)
corpic=corrplot(PCA$loadings)
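  • Use the function fviz_contrib() from the factoextra package to draw bar graphs of contributions. Note that choice = "ind" plots the contributions of individuals (observations); use choice = "var" for variable contributions
library(factoextra)
contri=fviz_contrib(PCA,choice = "ind",axes = 1:6)
contri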
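
For variables, the percentage contribution to each component can also be worked out by hand from the loadings. This is a minimal sketch that assumes the PCA was fitted with princomp() and uses the built-in USArrests data as a stand-in for data1[6:11]; the names pca_toy, load_mat and contrib_var are hypothetical. I believe this matches what fviz_contrib() reports with choice = "var" for a single axis, but treat that as an assumption.

```r
# Stand-in data and an assumed princomp() fit (not the article's PCA object)
pca_toy <- princomp(USArrests, cor = TRUE)

# princomp loadings are unit-length eigenvectors, so the squared loadings
# in each column sum to 1 and can be read as percentage contributions
load_mat <- unclass(pca_toy$loadings)
contrib_var <- 100 * load_mat^2

round(contrib_var, 1)
colSums(contrib_var)   # the contributions to each component add up to 100
```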
  • K-means combined with principal component analysis for cluster analysis

fviz_pca_ind() is a function in the factoextra package that displays the individuals (observations) of a PCA as a scatter plot.

fviz_pca_ind(PCA, col.ind = "cos2", gradient.cols=c("red","yellow","green"))   # color points by cos2 (quality of representation)

fviz_cluster() provides ggplot2-based visualization of partitioning methods such as k-means. If ncol(data) > 2, the observations are plotted on the first two principal components, and an ellipse is drawn around each cluster.

pca.km=kmeans(PCA$scores[,1:6],8)     # k-means on the six principal component scores
fviz_cluster(km,data1[6:11])
plot(as.matrix(PCA$scores[,1:6]),col=pca.km$cluster,cex=0.7)   # first two components, colored by cluster
points(pca.km$centers,col=3,pch ="*",cex=3)
fviz_pca_ind(PCA,habillage = pca.km$cluster,addEllipses = T,repel = T,ellipse.level=0.9)+ggtitle("PCA_KM")
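
The PCA object used above is never defined in this article. Here is a minimal end-to-end sketch, assuming it was created with princomp() (inferred from the $loadings and $scores fields) and using the built-in USArrests data as a stand-in for data1[6:11]. The names dat, PCA_toy and pca_km_toy, and the choice of 3 clusters, are hypothetical.

```r
# Stand-in data: USArrests ships with R (4 numeric columns, 50 rows)
dat <- USArrests

# Assumption: PCA on the correlation matrix, so the variables are standardized
PCA_toy <- princomp(dat, cor = TRUE)

PCA_toy$loadings          # variable loadings on each component
head(PCA_toy$scores)      # observation coordinates on the components

# k-means on the first two component scores; 3 clusters chosen arbitrarily
set.seed(1)
pca_km_toy <- kmeans(PCA_toy$scores[, 1:2], centers = 3, nstart = 25)

# Base-graphics scatter plot of the scores, colored by cluster
plot(PCA_toy$scores[, 1:2], col = pca_km_toy$cluster, pch = 19)
points(pca_km_toy$centers, col = 4, pch = 8, cex = 2)   # cluster centers
```

Using cor = TRUE is a design choice for variables on very different scales, as counts such as plays and comments typically are; whether the original analysis did the same is not stated.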

draw better pictures

2. Reference materials

  • Multivariate Statistical Analysis and the Use of R (Fifth Edition)

Finish!

Tags: Machine Learning Data Analysis

Posted by bruceleejr on Tue, 31 Jan 2023 13:12:55 +0530