1, matplotlib Library
Information visualization helps us find outliers, identify necessary data transformations, and judge which model to choose; in some cases the visualization itself also needs to be interactive.
matplotlib is a third-party package for drawing high-quality charts, and several data visualization tools, such as seaborn, have been built on top of it. Using pandas, matplotlib and seaborn together, many kinds of static graphs can be drawn.
1.1 drawing images using matplotlib
Draw a line graph of the integers 0 through 9:
import matplotlib.pyplot as plt
import numpy as np

# Draw a simple line chart of the values 0..9
data = np.arange(10)
plt.plot(data)
1.2 Figure and Subplot
matplotlib plots live inside a Figure object. Use plt.figure() to create a new one:
fig = plt.figure()

# You cannot draw on an empty Figure; use add_subplot to create
# one or more subplots first. Here a 2 x 2 grid is laid out and 3 cells are filled.
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

# plt.plot draws on the most recently used subplot (creating one if
# necessary), which here is ax3.
# 'k--' is a style option: a black ('k') dashed ('--') line.
plt.plot(np.random.randn(50).cumsum(), 'k--')
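As a hedged alternative to calling add_subplot repeatedly, the sketch below uses plt.subplots to create the Figure and all four subplots at once, then draws on each subplot through its own Axes object instead of relying on "the last used subplot". The random data, the seed and the choice of chart types are illustrative assumptions, not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

# Create the Figure and a 2 x 2 array of Axes in one call
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Address each subplot explicitly by its row and column
axes[0, 0].hist(rng.standard_normal(100), bins=20, color='k', alpha=0.3)
axes[0, 1].scatter(np.arange(30), np.arange(30) + 3 * rng.standard_normal(30))
axes[1, 0].plot(rng.standard_normal(50).cumsum(), 'k--')  # black dashed line

# axes[1, 1] is left empty on purpose
fig.tight_layout()
```

Calling methods on a specific Axes makes it explicit where each chart ends up, which tends to be easier to follow once a figure has more than one subplot.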
1.3 drawing with pandas and seaborn
Drawing with matplotlib is like assembling parts: the chart type, legend, title, tick labels and other annotations are each added separately. pandas and seaborn take over much of this work: the built-in plot methods of pandas simplify drawing from DataFrame and Series objects, and seaborn simplifies the creation of many common chart types.
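To illustrate how pandas shortens the "assembly", here is a minimal sketch in which a single DataFrame.plot call draws one line per column and builds the legend from the column names. The DataFrame contents, the column names and the title are made up for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, chosen only for repeatability

# Hypothetical data: four random-walk columns
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(axis=0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))

# One call: a line per column, legend taken from the column names
ax = df.plot(title='random walks')

# The returned object is a regular matplotlib Axes, so it can still be adjusted
ax.set_xlabel('step')
```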
2, Visualization of Titanic data (analysis of survival)
2.1 importing data
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read in the data
text = pd.read_csv(r'result.csv')
2.2 analyze the impact of gender on survival
# Number of survivors per sex (Survived is 1 for survival, 0 for death)
sex = text.groupby('Sex')['Survived'].sum()
sex.plot.bar()
plt.title('survived_count')
plt.show()
- More women than men survived; to compare survival rates, the total number of passengers in each group has to be taken into account as well
# Count survivors and deaths for each sex (1 = survival, 0 = death)
# and show them as a stacked bar chart
text.groupby(['Sex', 'Survived'])['Survived'].count().unstack().plot(kind='bar', stacked=True)
plt.title('survived_count')
plt.ylabel('count')
- As can be seen from the above figure, the survival rate of women is significantly higher than that of men
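The survival rate itself can also be computed directly rather than read off the bars. The sketch below is one possible way, assuming Survived is coded as 0/1 so that the group mean equals the survival rate. The small table named toy is a hypothetical stand-in, not the real result.csv.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (not the real result.csv)
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 0],
})

# Survived is 0/1, so sum = survivors, count = group size, mean = survival rate
summary = toy.groupby('Sex')['Survived'].agg(
    survivors='sum', total='count', rate='mean')
print(summary)
```

With the real data, replacing toy by text should give the rate per sex alongside the raw counts.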
2.3 analyze the survival rate of people with different ticket prices
① Line chart
# Count survivals and deaths for each ticket price (1 = survival, 0 = death)
# Line chart in the original group order (not sorted by count)
fare_sur = text.groupby(['Fare'])['Survived'].value_counts()
fig = plt.figure(figsize=(20, 18))
fare_sur.plot(grid=True)
plt.legend()
plt.show()
# Count survivals and deaths for each ticket price (1 = survival, 0 = death),
# this time sorted by count, largest first
fare_sur = text.groupby(['Fare'])['Survived'].value_counts().sort_values(ascending=False)
# Draw the sorted line chart
fig = plt.figure(figsize=(20, 10))
fare_sur.plot(grid=True)
plt.legend()
plt.show()
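- The sorted line chart is easier to read than the unsorted one. Note that after sorting, the x-axis follows the order of the counts rather than the ticket price, so the chart shows which combinations of price and outcome are most common rather than a trend over price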
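Because there are many distinct ticket prices, one hedged alternative is to group the prices into bands first and compare the survival rate per band. The sketch below uses pd.cut for this. The band edges, the band labels and the toy table are assumptions made for the illustration, not values taken from the data set.

```python
import pandas as pd

# Hypothetical stand-in for the Fare and Survived columns
toy = pd.DataFrame({
    'Fare': [7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 83.5],
    'Survived': [0, 0, 1, 0, 1, 1, 1, 1],
})

# Assumed band edges: (0, 10], (10, 50], (50, 100]
bins = [0, 10, 50, 100]
labels = ['low', 'mid', 'high']
toy['FareBand'] = pd.cut(toy['Fare'], bins=bins, labels=labels)

# Mean of a 0/1 column = survival rate within each band
rate_by_band = toy.groupby('FareBand', observed=False)['Survived'].mean()
print(rate_by_band)
```

A bar chart of rate_by_band (for example rate_by_band.plot.bar()) then shows the relationship between price band and survival in a handful of bars instead of one point per price.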
import seaborn as sns

# Count passengers by class and survival outcome (1 = survival, 0 = death)
pclass_sur = text.groupby(['Pclass'])['Survived'].value_counts()
# Grouped bar chart: one bar per outcome within each class
sns.countplot(x="Pclass", hue="Survived", data=text)
- A grouped bar chart makes it easy to compare the values of several states (here, survived or not) within one category, and to compare the categories with each other
- The chart shows that passengers in lower classes were less likely to survive
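The observation about classes can be checked numerically as well. As a sketch, pd.crosstab with normalize='index' turns the counts behind a grouped bar chart into shares within each class. The toy table below is hypothetical and only mimics the shape of the Pclass and Survived columns.

```python
import pandas as pd

# Hypothetical stand-in for the Pclass and Survived columns
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Raw counts: rows are classes, columns are outcomes (0 = death, 1 = survival)
counts = pd.crosstab(toy['Pclass'], toy['Survived'])

# Shares within each class: every row sums to 1
shares = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(shares)
```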
2.4 distribution of survival in different age groups
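Kernel density estimation, which corresponds to the kdeplot() function of seaborn:
import seaborn as sns

# One density curve of Age per survival outcome
facet = sns.FacetGrid(text, hue="Survived", aspect=3)
# fill=True shades the area under each curve (it replaces the older shade=True)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, text['Age'].max()))
facet.add_legend()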
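To make clearer what a kernel density estimate is, the following sketch computes one by hand with NumPy: a Gaussian bump is centred on every sample and the bumps are averaged. It illustrates the idea only and is not seaborn's implementation. The function name gaussian_kde_1d, the sample ages and the fixed bandwidth of 5 are assumptions (seaborn picks a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average over samples and rescale so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

ages = np.array([22, 24, 25, 29, 35, 38, 54])  # hypothetical ages
grid = np.linspace(-20, 100, 601)              # evaluation points, step 0.2
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# Sanity check: the area under a density curve should be close to 1
area = (density * (grid[1] - grid[0])).sum()
peak_age = grid[density.argmax()]
```

A smaller bandwidth gives a bumpier curve that follows individual samples, and a larger one gives a smoother curve, which is worth keeping in mind when reading the shaded curves.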
2.5 age distribution of people in different classes
# Age density for each class, drawn on the same axes
text.Age[text.Pclass == 1].plot(kind='kde')
text.Age[text.Pclass == 2].plot(kind='kde')
text.Age[text.Pclass == 3].plot(kind='kde')
plt.xlabel("age")
# Legend entries follow the order in which the curves were drawn
plt.legend(('1', '2', '3'), loc="best")
- Most third-class passengers are young people in their twenties, while first-class passengers are mostly middle-aged, in their thirties to forties
- Older passengers tend to travel in higher classes: the typical age rises with the class level
The purpose of data visualization is to let people grasp the meaning of data more quickly and intuitively. The statistical chart should therefore be chosen to suit the data and the question at hand. For example, a line chart is suited to showing how something develops over time, and a bar chart is suited to comparing magnitudes. These rules are not absolute, though: either chart can also show the distribution of data, and the choice ultimately depends on what is needed.