Modeling and Analysis of Bilibili Video Data Based on R: Cluster Analysis

• 0. Preface
• 1. Data analysis
• 1.1 Cluster analysis
• 1.2 Cluster statistics
• 1.3 Hierarchical clustering
• 1.4 K-means and principal component analysis
• 2. Reference materials

0. Preface

Lab environment

• Python version: Python3.9
• Pycharm version: Pycharm2021.1.3
• R version: R-4.2.0
• RStudio version: RStudio-2021.09.2-382

This experiment uses four data sets in total, but this article describes the analysis of only one of them. Each data set has a size of about 110.

• The data comes from the Hejing (Heywhale) Community

https://www.heywhale.com/mw/dataset/62a45d284619d87b3b2b9147/file

Data field description

• title: the title of the video
• duration: video duration
• publisher: video author
• descriptions: video description information
• pub_time: video publishing time
• view: the number of video views
• praise: the number of likes on the video
• coins: the number of coins given to the video
• favors: the number of times the video was added to favorites
• forwarding: the number of times the video was forwarded (shared)

1. Data analysis

The data analysis is carried out from three angles: variable correlation analysis, cluster analysis, and modeling with factor analysis.

The cluster analysis part is described below.

1.1 Cluster analysis

This stage is divided into three parts: cluster statistics, hierarchical clustering, and K-means combined with principal component analysis.

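```r
# data1 holds data set 1 (a data frame that is assumed to be loaded already)
viewArr = data1[,6]           # column 6: view
forwardingArr = data1[,11]    # column 11: forwarding; confirm positions with names(data1)
view = c(viewArr)
forwarding = c(forwardingArr)
```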
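Selecting columns by position depends on the exact column order of the file. As a hedged alternative, the sketch below selects the same two variables by name. The data frame `toy` and its values are made up for illustration; only the field names follow the data field description above.

```r
# Hypothetical stand-in for data1 with made-up values.
toy <- data.frame(
  title      = c("a", "b", "c", "d"),
  view       = c(1200, 560, 9800, 430),
  forwarding = c(15, 4, 210, 2)
)

# Selecting by name does not depend on where the column sits in the file.
view       <- toy$view
forwarding <- toy[["forwarding"]]

# The position of a named column can still be looked up when it is needed.
forwarding_pos <- match("forwarding", names(toy))
forwarding_pos
```

With the real data, `names(data1)` shows which positions the view and forwarding columns actually occupy.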

1.2 Cluster statistics

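• Euclidean distance
```r
X = cbind(view, forwarding)   # one row per video, two columns
dist(X)                       # Euclidean distance (the default), lower triangle
dist(X, diag = TRUE)          # also print the zero diagonal
dist(X, upper = TRUE)         # also print the upper triangle
```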
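To make the output of `dist()` concrete, here is a small sketch on three made-up points (not the Bilibili data). `dist()` stores only the lower triangle of the pairwise distance matrix; `diag` and `upper` only change how it is printed, and `as.matrix()` gives the full symmetric matrix.

```r
# Three made-up points in the plane.
P <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d <- dist(P)                         # Euclidean distance by default
print(d)                             # lower triangle only
print(d, diag = TRUE, upper = TRUE)  # same values, printed as a full matrix

D <- as.matrix(d)                    # full symmetric matrix for indexing
D["a", "b"]                          # sqrt(3^2 + 4^2) = 5
```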

• Manhattan distance
```r
# method = "manhattan" is the city-block (L1) distance, not the Mahalanobis distance
dist(X, method = "manhattan")
```

Figure: Manhattan distance matrix of data set 1.
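Note that `method = "manhattan"` in `dist()` gives the Manhattan (city-block) distance, and `dist()` has no Mahalanobis option. For readers who also want the Mahalanobis distance, here is a minimal sketch using `mahalanobis()` from the stats package. The numbers are made up and only stand in for the view and forwarding columns.

```r
# Made-up stand-ins for the view and forwarding columns.
view <- c(1200, 560, 9800, 430, 2750, 15000)
forwarding <- c(15, 4, 210, 2, 38, 160)
X <- cbind(view, forwarding)

# Manhattan distance: sum of absolute coordinate differences.
d_manhattan <- dist(X, method = "manhattan")

# Squared Mahalanobis distance of each row from the column means.
d2_center <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Pairwise Mahalanobis distances: whiten the data with the Cholesky factor
# of the covariance matrix, then take ordinary Euclidean distances.
R <- chol(cov(X))        # cov(X) equals t(R) %*% R
Z <- X %*% solve(R)
d_mahalanobis <- dist(Z)
```

Unlike the Manhattan distance, the Mahalanobis distance accounts for the different scales of the two variables and for their correlation.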

1.3 Hierarchical clustering

Hierarchical clustering is carried out on the view count (view) of data set 1.

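• Shortest distance method (single linkage):
```r
hc = hclust(dist(data1[,6]), method = "single")
plot(hc)                      # dendrogram
cbind(hc$merge, hc$height)    # merge steps and the height of each merge
cutree(hc, 5:1)               # cluster labels for k = 5, 4, ..., 1
```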
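• Class average method (average linkage):
```r
hc = hclust(dist(data1[,6]), method = "average")
plot(hc)                      # dendrogram
cbind(hc$merge, hc$height)    # merge steps and the height of each merge
cutree(hc, 5:1)               # cluster labels for k = 5, 4, ..., 1
```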
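• Intermediate distance method (median linkage):
```r
# Median linkage can produce inversions (non-monotone heights) in the dendrogram
hc = hclust(dist(data1[,6]), method = "median")
plot(hc)                      # dendrogram
cbind(hc$merge, hc$height)    # merge steps and the height of each merge
cutree(hc, 5:1)               # cluster labels for k = 5, 4, ..., 1
```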
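The difference between the linkage methods is easiest to see on a tiny example. The sketch below uses five made-up one-dimensional values (not the Bilibili data) and compares the merge heights of single and average linkage.

```r
# Made-up data: two tight pairs and one outlier.
x <- c(1, 2, 6, 7, 20)
d <- dist(x)

hc_single  <- hclust(d, method = "single")   # distance between closest members
hc_average <- hclust(d, method = "average")  # mean of all pairwise distances

# Merge heights: the linkage distance at which each merge happens.
hc_single$height    # 1 1 4 13
hc_average$height   # 1 1 5 16

# cutree() with a vector of k returns one column of labels per k.
labels_3 <- cutree(hc_single, k = 3)   # 1 1 2 2 3
cutree(hc_single, k = 3:2)
```

Both methods merge the two pairs first. They differ in the height at which larger groups are joined, which is what changes the shape of the dendrogram.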

1.4 K-means and principal component analysis

K-means is one of the most important clustering methods; here it is applied to data set 1.

• Use the silhouette method to choose the number of clusters
```r
library(factoextra)
# Average silhouette width for each candidate number of clusters
fviz_nbclust(data1[6:11], kmeans, method = "silhouette")
```
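The quantity behind the silhouette method is the average silhouette width for each candidate number of clusters; the k with the largest value is preferred. A minimal sketch of that computation follows, using `silhouette()` from the cluster package on made-up data with three well-separated groups. The data, the seed, and the range of k are assumptions for illustration.

```r
library(cluster)  # silhouette()

set.seed(42)  # arbitrary seed so the sketch is repeatable
# Made-up data: three groups of 15 points around three distant centers.
mu <- rbind(c(0, 0), c(10, 10), c(20, 0))
toy <- mu[rep(1:3, each = 15), ] + matrix(rnorm(90, sd = 0.5), ncol = 2)
d <- dist(toy)

# Average silhouette width for k = 2..6 (it is not defined for k = 1).
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
best_k
```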

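The plot produced by fviz_nbclust() is used to decide how many clusters the K-means method should produce for data set 1. The gap statistic gives a second estimate:

```r
library(cluster)
# Gap statistic for k = 1..10 (B = 100 reference data sets by default)
gap_stat <- clusGap(data1[6:11], FUN = kmeans, K.max = 10)
fviz_gap_stat(gap_stat)
```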
• According to the figure above, data set 1 should be divided into 8 clusters
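The number read off a gap statistic plot comes from a table that `clusGap()` returns. The sketch below shows that table and one common decision rule on made-up data; the data, `B = 50`, `nstart = 25`, and the `"firstSEmax"` rule are choices made for illustration, not settings taken from the analysis above.

```r
library(cluster)  # clusGap(), maxSE()

set.seed(42)  # arbitrary seed
# Made-up data: three groups of 15 points around three distant centers.
mu <- rbind(c(0, 0), c(10, 10), c(20, 0))
toy <- mu[rep(1:3, each = 15), ] + matrix(rnorm(90, sd = 0.5), ncol = 2)

# B is the number of reference data sets; nstart is passed on to kmeans().
gap <- clusGap(toy, FUNcluster = kmeans, nstart = 25, K.max = 6, B = 50)

# One row per k, with columns logW, E.logW, gap and SE.sim.
gap$Tab

# One decision rule among several; other rules can pick a different k.
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_hat
```

Because the rule matters, it is worth checking whether the silhouette method and the gap statistic agree before fixing the number of clusters.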

The code is as follows:

```r
set.seed(123)                    # arbitrary seed; kmeans() starts from random centers
km = kmeans(data1[6:11], 8)      # 8 clusters
fviz_cluster(km, data1[6:11])
```
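K-means minimizes squared Euclidean distances, so a column with a much larger range (view counts compared with forwarding counts, for example) can dominate the result. The code above clusters the raw columns. The sketch below, on made-up numbers, only illustrates how standardizing with `scale()` can change the partition; whether that is appropriate for the Bilibili data is a modeling decision.

```r
# Made-up counts on very different scales (not the Bilibili data).
toy <- data.frame(
  view       = c(1000, 3000, 5000, 7000, 9000, 11000),
  forwarding = c(1, 50, 2, 51, 1, 50)
)

set.seed(1)  # arbitrary seed; nstart = 25 tries 25 random starts
km_raw    <- kmeans(toy, centers = 2, nstart = 25)
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 25)

# Raw data: the split follows view alone (rows 1-3 against rows 4-6).
# Scaled data: the split follows forwarding (rows 1, 3, 5 against rows 2, 4, 6).
table(raw = km_raw$cluster, scaled = km_scaled$cluster)
```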

The corrplot package in R provides a visual exploration tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.

corrplot is very easy to use and offers a wealth of plotting options in terms of visualization methods, graph layout, colors, legends, text labels, and more. It also provides p-values and confidence intervals to help users determine the statistical significance of correlations.

```r
# Scatter plot of the first two columns, colored by cluster
plot(as.matrix(data1[6:11]), col = km$cluster + 1, pch = 10)
points(km$centers, col = 3, pch = "*", cex = 3)   # cluster centers
library(corrplot)
```
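• Use the function fviz_contrib() [factoextra package] to draw a bar graph of contributions. With choice = "ind" the bars show the contributions of individual observations; choice = "var" would show the contributions of variables.
```r
library(factoextra)
# PCA is not defined in this article; PCA$scores further down suggests a princomp() result
contri = fviz_contrib(PCA, choice = "ind", axes = 1:6)
contri
```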
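The contribution of an observation to a single component is commonly defined as its squared score divided by the sum of squared scores on that component, in percent. The sketch below computes this by hand in base R for the first component only. It uses the built-in `USArrests` data as a stand-in, and `cor = TRUE` is an assumption; how `fviz_contrib()` weights several axes together is not reproduced here.

```r
# USArrests ships with R; it stands in for the six numeric columns of data1.
pca <- princomp(USArrests, cor = TRUE)

# Contribution (in percent) of each observation to the first component.
contrib_pc1 <- 100 * pca$scores[, 1]^2 / sum(pca$scores[, 1]^2)

# Largest contributors first, as in a contribution bar graph.
head(sort(contrib_pc1, decreasing = TRUE), 5)

# Reference level: the contribution each observation would have if all were equal.
100 / nrow(USArrests)
```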
• K-means combined with principal component analysis for cluster analysis

fviz_pca_ind() is a function in the factoextra package that displays the PCA results for individual observations as a scatter plot.

```r
# Color each observation by cos2, its quality of representation on the plotted axes
fviz_pca_ind(PCA, col.ind = "cos2", gradient.cols = c("red", "yellow", "green"))
```

fviz_cluster() provides ggplot2-based visualization of the results of partitioning methods such as K-means. Observations are drawn as points; if ncol(data) > 2, they are plotted on the principal components. An ellipse is drawn around each cluster.

```r
pca.km = kmeans(PCA$scores[,1:6], 8)      # K-means on the principal component scores
fviz_cluster(pca.km, PCA$scores[,1:6])
# Base-graphics version: first two components, colored by cluster
plot(as.matrix(PCA$scores[,1:6]), col = pca.km$cluster, cex = 0.7)
points(pca.km$centers, col = 3, pch = "*", cex = 3)
fviz_pca_ind(PCA, habillage = pca.km$cluster, addEllipses = TRUE, repel = TRUE, ellipse.level = 0.9) + ggtitle("PCA_KM")
```
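For readers without factoextra, the same idea can be sketched in base R. The example below uses the built-in `USArrests` data as a stand-in for the six numeric columns; `cor = TRUE`, four clusters, and the choice of two components are assumptions made for illustration.

```r
# USArrests ships with R (datasets package).
pca <- princomp(USArrests, cor = TRUE)  # cor = TRUE: PCA on standardized variables

# Proportion of variance explained, cumulated over the components.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(prop_var), 3)

# Cluster the observations in the space of the first two components.
set.seed(1)  # arbitrary seed
pca_km <- kmeans(pca$scores[, 1:2], centers = 4, nstart = 25)

# Scores colored by cluster, with the cluster centers marked.
plot(pca$scores[, 1:2], col = pca_km$cluster, pch = 19)
points(pca_km$centers, pch = 8, cex = 2)
```

One design point: keeping every component only rotates the data, so K-means sees the same distances as before the PCA. The clustering changes only when fewer components are kept than there are variables.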


2. Reference materials

• Multivariate Statistical Analysis and the Use of R (Fifth Edition)

Finish!

Posted by bruceleejr on Tue, 31 Jan 2023 13:12:55 +0530