[hands on data analysis] Task04 - Data Visualization

1, matplotlib Library

Visualizing data helps us spot outliers, decide which data transformations are needed, and judge which model is appropriate; it can also serve as a basis for interactive data exploration.

matplotlib is a third-party package for drawing high-quality charts, and several other visualization tools, such as seaborn, are built on top of it. Together, pandas, matplotlib and seaborn cover a wide range of static graphs.

1.1 drawing images using matplotlib

Draw a line graph of the integers 0 to 9:

import matplotlib.pyplot as plt
import numpy as np

# Draw a simple line diagram
data = np.arange(10)
plt.plot(data)

1.2 Figure and Subplot

matplotlib draws its images inside a Figure object. Use plt.figure() to create a new one:

fig = plt.figure()

# You cannot draw on an empty Figure; use add_subplot to create one or more subplots first
# Lay out a 2 x 2 grid and create the first three subplots
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

# A plt.* drawing command uses the most recently created subplot (creating one if none exists), which is ax3 here
plt.plot(np.random.randn(50).cumsum(), 'k--')
# 'k--' is a style option: it tells matplotlib to draw a black dashed line
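As a variation on the steps above, plt.subplots() creates the Figure and a grid of Axes in one call, and calling the plotting method on a specific Axes makes the target subplot explicit instead of relying on the most recently used one. This is a minimal sketch; the Agg backend, the seed and the variable names are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Create a Figure and a 2 x 2 grid of Axes in a single call
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Fixed seed so the random data is reproducible
rng = np.random.default_rng(0)
walk = rng.standard_normal(50).cumsum()

# Draw on a chosen subplot by calling the method on that Axes object
axes[1, 0].plot(walk, "k--")  # black dashed line in the bottom-left subplot
axes[0, 0].hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)

fig.tight_layout()
```

Indexing axes[row, column] plays the same role as the third argument of add_subplot, but counts from zero.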

1.3 drawing with pandas and seaborn

Drawing with matplotlib alone means assembling a chart from its parts: the plot type, legend, title, tick labels and other annotations. pandas and seaborn take over much of this work: pandas has built-in plotting methods that simplify drawing from DataFrame and Series objects, and seaborn simplifies the creation of many common chart types.
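To illustrate how much assembly pandas takes over, the sketch below plots a small DataFrame with a single call. The data is randomly generated for this example, and the column names and title are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: three cumulative random walks, one per column
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.standard_normal((10, 3)).cumsum(axis=0),
    columns=["A", "B", "C"],
)

# One call draws a line per column and builds the legend from the column names
ax = df.plot(title="cumulative sums", grid=True)
ax.set_xlabel("step")

# The returned object is a regular matplotlib Axes, so it can still be customized
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Because DataFrame.plot returns a matplotlib Axes, anything that works on a hand-built subplot also works on the result.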

2, Visualization of Titanic data (analysis of survival)

2.1 importing data

# Import module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read in data
text = pd.read_csv(r'result.csv')

2.2 analyze the impact of gender on survival

sex = text.groupby('Sex')['Survived'].sum()
sex.plot.bar()
plt.title('survived_count')
plt.show()

  • More women than men survived. To compare survival rates, we also need the total number of passengers of each sex.

# Count survivors and deaths for each sex (1 = survived, 0 = died)
text.groupby(['Sex','Survived'])['Survived'].count().unstack().plot(kind='bar', stacked=True)
plt.title('survived_count')
plt.ylabel('count')

  • As can be seen from the above figure, the survival rate of women is significantly higher than that of men
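The charts show counts; the survival rate itself can also be computed directly. Because Survived is coded as 0 and 1, the mean of each group is its survival rate. The sketch below uses a made-up toy table in place of result.csv, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up toy data standing in for result.csv
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 4,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is 0/1, so the group mean equals the survival rate
rate = toy.groupby("Sex")["Survived"].mean()

# Survivors and totals side by side, with the rate derived from them
summary = toy.groupby("Sex")["Survived"].agg(survived="sum", total="count")
summary["rate"] = summary["survived"] / summary["total"]
print(summary)
```

On the real data, replacing toy with text gives the rates that the stacked bar chart only shows implicitly.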

2.3 analyze the survival rate of people with different ticket prices

① Line chart

# Count survivors and deaths for each ticket price (1 = survived, 0 = died)
# Line chart with the counts in Fare order (not sorted by count)
fare_sur = text.groupby(['Fare'])['Survived'].value_counts()
fig = plt.figure(figsize=(20, 18))
fare_sur.plot(grid=True)
plt.legend()
plt.show()

# Count survivors and deaths for each ticket price (1 = survived, 0 = died), largest counts first
fare_sur = text.groupby(['Fare'])['Survived'].value_counts().sort_values(ascending=False)

# Draw a line chart of the counts in descending order
fig = plt.figure(figsize=(20, 10))
fare_sur.plot(grid=True)
plt.legend()
plt.show()

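  • Sorting the counts in descending order gives a smooth, falling line, which makes it easier to see which fare and outcome combinations occur most often. Note that the x-axis is then ordered by count rather than by ticket price.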
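Since Fare takes many distinct values, one alternative is to group fares into bands and compute the survival rate per band. This is a sketch of a different technique from the line charts above, not the original analysis: the toy data is made up and the bin edges are arbitrary assumptions.

```python
import pandas as pd

# Made-up toy data standing in for result.csv
toy = pd.DataFrame({
    "Fare": [5.0, 7.25, 8.05, 13.0, 26.0, 30.0, 71.3, 80.0],
    "Survived": [0, 0, 1, 0, 1, 1, 1, 1],
})

# Arbitrary bin edges chosen for this example
bins = [0, 10, 50, 100]
# include_lowest=True keeps fares equal to the lowest edge (0) in the first band
toy["FareBand"] = pd.cut(toy["Fare"], bins=bins, include_lowest=True)

# Mean of the 0/1 column within each band is the survival rate of that band
band_rate = toy.groupby("FareBand", observed=False)["Survived"].mean()
print(band_rate)
```

Calling band_rate.plot.bar() would then show one bar per fare band, which is usually easier to read than one point per distinct fare.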

② Bar chart

# Count survivors and deaths within each passenger class
# (1 = survived, 0 = died)
pclass_sur = text.groupby(['Pclass'])['Survived'].value_counts()

import seaborn as sns
sns.countplot(x="Pclass", hue="Survived", data=text)

  • A grouped bar chart makes it easy to compare the counts of each outcome within one category and across categories.
  • The chart shows that passengers in the lower classes were less likely to survive
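The same comparison can be expressed as proportions rather than counts. A minimal sketch using pd.crosstab with normalize="index", so that each class's row sums to 1; the toy data is made up for illustration.

```python
import pandas as pd

# Made-up toy data standing in for result.csv
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Rows are classes, columns are outcomes; normalize="index" turns counts into row shares
share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(share)
```

The column labelled 1 is then the survival rate of each class, which removes the effect of the classes having different sizes.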

2.4 distribution of survival in different age groups

Kernel density estimation (see note 1) is available through seaborn's kdeplot() function.

import seaborn as sns

facet = sns.FacetGrid(text, hue="Survived",aspect=3)
facet.map(sns.kdeplot, 'Age', fill=True)  # fill= replaces the deprecated shade= (seaborn >= 0.11)
facet.set(xlim=(0, text['Age'].max()))
facet.add_legend()
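To give a feel for what kernel density estimation computes, the sketch below builds a Gaussian KDE by hand with NumPy: a small bell curve is centred on every sample and the curves are averaged. The function name, the made-up ages and the fixed bandwidth are assumptions for this example; kdeplot() chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample: shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Mean over samples, divided by the bandwidth so the result integrates to 1
    return kernels.mean(axis=1) / bandwidth

ages = np.array([22.0, 24.0, 25.0, 30.0, 35.0, 54.0])  # made-up ages
grid = np.linspace(-20, 100, 601)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# Riemann-sum check: the area under the curve should be close to 1
area = np.sum(density) * (grid[1] - grid[0])
peak_age = grid[np.argmax(density)]
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual samples more closely, which is worth keeping in mind when reading the KDE plots here.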

2.5 age distribution of people in different classes

text.Age[text.Pclass == 1].plot(kind='kde')
text.Age[text.Pclass == 2].plot(kind='kde')
text.Age[text.Pclass == 3].plot(kind='kde')
plt.xlabel("age")
plt.legend((1,2,3),loc="best")

  • Third class is dominated by young people in their twenties, while first class peaks among middle-aged passengers in their thirties and forties
  • Older passengers tend to travel in a higher class: the typical age rises with the class level

3, Summary

The purpose of data visualization is to help people understand what the data means more quickly and intuitively. The chart type should therefore be chosen to fit the data and the question at hand. For example, a line chart is suited to showing how something develops over time, and a bar chart is suited to comparing the sizes of values. These rules are not absolute: both chart types can also show how data is distributed, so the choice ultimately depends on what needs to be shown.

  1. "What is kernel density estimation? How can it be understood intuitively?" (Zhihu)


Posted by torleone on Mon, 20 Sep 2021 11:42:36 +0530