# Modeling and Analysis of Bilibili Video Data Based on R: Cluster Analysis

- 0. Preface
- 1. Data analysis
- 1.1 Cluster analysis
- 1.2 Cluster statistics
- 1.3 Hierarchical clustering
- 1.4 K-means and principal component analysis
- 2. References

## 0. Preface

Lab environment:

- Python version: Python3.9
- Pycharm version: Pycharm2021.1.3
- R version: R-4.2.0
- RStudio version: RStudio-2021.09.2-382

This experiment uses four data sets in total, but the article walks through only one of them; each data set contains roughly 110 records.

- The data comes from the Heywhale community:

https://www.heywhale.com/mw/dataset/62a45d284619d87b3b2b9147/file

Data field description

- title: video title
- duration: video duration
- publisher: video uploader
- descriptions: video description text
- pub_time: publication time
- view: play count
- comments: number of comments
- praise: number of likes
- coins: number of coins received
- favors: number of favorites
- forwarding: number of shares
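The snippets below refer to a data frame `data1` whose construction the post does not show; presumably it is loaded with `read.csv()` from the downloaded file. The following stand-in (all values invented) reproduces the eleven-column layout, so the column indices used later (6 = view, 11 = forwarding) line up:

```r
# Hypothetical stand-in for data1; the real data would come from read.csv()
data1 <- data.frame(
  title        = c("video A", "video B"),
  duration     = c("10:01", "03:25"),
  publisher    = c("uploader1", "uploader2"),
  descriptions = c("desc A", "desc B"),
  pub_time     = c("2022-01-01", "2022-01-02"),
  view         = c(12000, 3400),
  comments     = c(120, 30),
  praise       = c(800, 200),
  coins        = c(300, 50),
  favors       = c(400, 90),
  forwarding   = c(60, 10)
)
names(data1)[c(6, 11)]  # "view" "forwarding"
```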

## 1. Data analysis

The data analysis stage approaches the data from three angles: variable correlation analysis, cluster analysis, and factor-analysis modeling.

This article covers the cluster analysis phase.

### 1.1 Cluster analysis

This stage consists of cluster statistics, hierarchical clustering, and K-means combined with principal component analysis.

```r
# Extract the numeric columns used below: column 6 = view, column 11 = forwarding
viewArr <- data1[, 6]
forwardingArr <- data1[, 11]
view <- c(viewArr)
forwarding <- c(forwardingArr)
```

### 1.2 Cluster statistics

- Euclidean distance

```r
X <- cbind(view, forwarding)
dist(X)                # Euclidean distance matrix (lower triangle only)
dist(X, diag = TRUE)   # also print the main diagonal
dist(X, upper = TRUE)  # also print the upper triangle
```

- Manhattan distance

```r
# method = "manhattan" computes the city-block (L1) distance
dist(X, method = "manhattan")
```

The Manhattan distance matrix of data set 1 is shown in the figure below:
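For clarity, `method = "manhattan"` selects the city-block (L1) metric, not the Mahalanobis distance. A tiny self-contained example (the two points are invented for illustration) contrasts it with the default Euclidean metric:

```r
# Two 2-D points chosen so the metrics differ visibly (illustrative values)
P <- rbind(c(0, 0), c(3, 4))
d_euc <- dist(P)                        # Euclidean: sqrt(3^2 + 4^2) = 5
d_man <- dist(P, method = "manhattan")  # city-block: |3| + |4| = 7
print(c(euclidean = as.numeric(d_euc), manhattan = as.numeric(d_man)))
```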

### 1.3 Hierarchical clustering

Hierarchical clustering is carried out on the view (play count) column of data set 1.

- Shortest distance method:

```r
hc <- hclust(dist(data1[, 6]), method = "single")
plot(hc)                    # dendrogram
cbind(hc$merge, hc$height)  # merge steps and their heights
cutree(hc, 5:1)             # cluster memberships for k = 5, 4, 3, 2, 1
```

- Average linkage (class average) method:

```r
hc <- hclust(dist(data1[, 6]), method = "average")
plot(hc)
cbind(hc$merge, hc$height)
cutree(hc, 5:1)
```

- Median (intermediate distance) method:

```r
hc <- hclust(dist(data1[, 6]), method = "median")
plot(hc)
cbind(hc$merge, hc$height)
cutree(hc, 5:1)
```
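To make the `cutree()` output concrete, here is a self-contained toy example with invented one-dimensional values (not taken from the dataset):

```r
x <- c(1, 2, 10, 11, 50, 52)              # three well-separated pairs
hc <- hclust(dist(x), method = "single")  # single linkage, as above
groups <- cutree(hc, k = 3)               # cut the dendrogram into 3 clusters
print(groups)   # each observation gets a cluster label 1, 2, or 3
```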

### 1.4 K-means and principal component analysis

K-means, one of the most widely used clustering algorithms, is now applied to data set 1.

- Use the silhouette method to choose the number of clusters

```r
library(factoextra)
fviz_nbclust(data1[6:11], kmeans, method = "silhouette")
```

The resulting plot indicates how many clusters to use when running k-means on data set 1. The gap statistic offers a second check:

```r
library(cluster)
# Gap statistic over k = 1..10
gap_stat <- clusGap(data1[6:11], FUN = kmeans, K.max = 10)
fviz_gap_stat(gap_stat)
```

- According to the figures above, data set 1 should be clustered into 8 groups

The code is as follows:

```r
km <- kmeans(data1[6:11], 8)   # k-means with 8 clusters
fviz_cluster(km, data1[6:11])  # cluster plot on the first two principal components
```

- View PCA Loadings (Principal Component Loadings)

The corrplot package in R is a visual exploration tool for correlation matrices; it supports automatic variable reordering to help reveal hidden patterns among variables.

corrplot is very easy to use and offers a wealth of plotting options in terms of visualization methods, graph layout, colors, legends, text labels, and more. It also provides p-values and confidence intervals to help users determine the statistical significance of correlations.
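The snippets that follow use an object named `PCA`, whose construction is not shown in the post; presumably it is the result of `princomp()` on the six numeric columns. A self-contained sketch (random stand-in data, since the real `data1` is not reproduced here):

```r
# Random stand-in for the six numeric columns (invented values)
set.seed(1)
num <- as.data.frame(matrix(rexp(60 * 6, rate = 1e-3), ncol = 6))
names(num) <- c("view", "comments", "praise", "coins", "favors", "forwarding")

# cor = TRUE runs the PCA on the correlation matrix, which standardizes
# columns of very different scales (views vs. coins)
PCA <- princomp(num, cor = TRUE)
PCA$loadings      # principal component loadings (visualized with corrplot below)
head(PCA$scores)  # component scores (used in the PCA + k-means step)
```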

```r
# Scatter plot (first two numeric columns) colored by k-means cluster,
# with cluster centers marked
plot(as.matrix(data1[6:11]), col = km$cluster + 1, pch = 10)
points(km$centers, col = 3, pch = "*", cex = 3)

# Visualize the principal component loadings
library(corrplot)
corpic <- corrplot(PCA$loadings)
```

- Use `fviz_contrib()` from the factoextra package to draw a contribution bar chart

```r
library(factoextra)
# choice = "ind" plots each observation's contribution to components 1-6;
# use choice = "var" to plot each variable's contribution instead
contri <- fviz_contrib(PCA, choice = "ind", axes = 1:6)
contri
```

- Kmeans combined with principal component analysis for cluster analysis

fviz_pca_ind() is a function in the factoextra package that displays the PCA results for individual observations as a scatter plot.

```r
# Color each observation by cos2, its quality of representation on the axes
fviz_pca_ind(PCA, col.ind = "cos2",
             gradient.cols = c("red", "yellow", "green"))
```

fviz_cluster() provides an elegant ggplot2-based visualization of partitioning results. If ncol(data) > 2, the observations are plotted as points on the first two principal components, and an ellipse is drawn around each cluster.

```r
# K-means on the first six principal component scores
pca.km <- kmeans(PCA$scores[, 1:6], 8)
fviz_cluster(pca.km, PCA$scores[, 1:6])  # the original passed km/data1 here,
                                         # but the PCA-based fit is intended
plot(as.matrix(PCA$scores[, 1:6]), col = pca.km$cluster, cex = 0.7)
points(pca.km$centers, col = 3, pch = "*", cex = 3)
fviz_pca_ind(PCA, habillage = pca.km$cluster, addEllipses = TRUE, repel = TRUE,
             ellipse.level = 0.9) + ggtitle("PCA_KM")
```


The resulting plots are clearer.

## 2. References

- Multivariate Statistical Analysis and the Use of R (Fifth Edition)

Finish!