[DataWhale hands-on series] 01 hands-on data analysis notes
Special thanks: The DataWhale open source organization initiated this hands-on data analysis project.
Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known enterprises across many fields, gathering a team with a spirit of open source and exploration. With the vision of "for the learner, growing with the learner", Datawhale encourages genuine self-expression, openness and inclusiveness, mutual trust and mutual assistance, the courage to try and err, and the courage to take responsibility. Using open-source principles, Datawhale explores open-source content, learning, and solutions, empowers talent training, helps talent grow, and builds connections between people, between people and knowledge, between people and enterprises, and between people and the future.
🔗Official website link: https://datawhale.club/index.html
🔗DataWhale WeChat official account: Datawhale
0. Introduction to Pandas
Pandas is Python's core data analysis library, providing fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python, and its broader goal is to become the most powerful and flexible open-source data analysis and manipulation tool available in any language. After years of steady effort, pandas is getting ever closer to this goal.
Pandas is suitable for working with the following types of data:
- Tabular data with heterogeneously-typed columns, as in a SQL table or Excel spreadsheet;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Arbitrary matrix data with row and column labels, whether homogeneously or heterogeneously typed;
- Any other form of observational or statistical data set; the data need not be labeled at all to be placed into a pandas data structure.
🔗Pandas Chinese website: https://www.pypandas.cn/
1. Load data
The data used in this hands-on data analysis is the Titanic project on Kaggle (Titanic: Machine Learning from Disaster).
🖇️Dataset download link: https://www.kaggle.com/c/titanic/data
Besides downloading the data from the web page, you can fetch it directly from the command line, which is quicker and more direct.
🔸How to download data using command line:
🔹The first thing to do is to install the Kaggle API. Please check the official GitHub repository for the installation steps: https://github.com/Kaggle/kaggle-api
🔹After installation, run it directly on the computer terminal: kaggle competitions download -c titanic
1.1 Import numpy and pandas
```python
import numpy as np
import pandas as pd
```
1.2 IO tools
The pandas I/O API is a set of top-level reader functions, such as pandas.read_csv(), that return pandas objects; the corresponding writers are object methods such as DataFrame.to_csv(). Together they cover reading and writing for many formats.
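As a minimal illustration of the reader/writer pairing (the filename `demo.csv` here is hypothetical, just for the sketch):

```python
import pandas as pd

# A toy frame to demonstrate the pairing: to_csv (writer) <-> read_csv (reader).
demo = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
demo.to_csv("demo.csv", index=False)  # writer: a method on the object
demo2 = pd.read_csv("demo.csv")       # reader: a top-level function
print(demo2.equals(demo))             # -> True: the round trip preserves the data
```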
1.3 CSV and text file reading
The main method for reading text files (a.k.a. flat files) is read_csv().
- Load data using relative paths
df = pd.read_csv('train.csv')
- Load data using absolute path
df = pd.read_csv('/mydrive/Colab_Notebooks/DataWhela/Data Analysis/hands-on-data-analysis/Unit 1 Project Collection/train.csv')
- The difference between read_csv() and read_table()
```python
df = pd.read_table('train.csv')
df.head(3)
```

```python
df = pd.read_csv('train.csv')
df.head(3)
```
From the results above, read_csv() and read_table() return differently shaped data: read_csv() parses the file into separate columns, while read_table() (whose default separator is a tab) reads each comma-separated row into a single column. Looking at the two function definitions:
```python
read_csv = _make_parser_function('read_csv', sep=',')
read_csv = Appender(_read_csv_doc)(read_csv)

read_table = _make_parser_function('read_table', sep='\t')
read_table = Appender(_read_table_doc)(read_table)
```
From the definitions, the only essential difference between the two functions is the default separator sep; apart from the name, everything else is the same.
```python
# Output data in the same format as read_csv()
df = pd.read_table('train.csv', sep=',')
df.head(3)
```
1.4 Read block by block
Datasets are often huge; reading everything at once is not only slow but also memory-hungry. To read data efficiently, read it block by block instead.
```python
# df = pd.read_csv('train.csv', chunksize=1000)
df = pd.read_table('train.csv', sep=',', chunksize=1000)
chunk = df.get_chunk()  # get one block of data
print(chunk)
```
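Beyond get_chunk(), the object returned when chunksize= is passed is iterable, so a large file can be processed chunk by chunk. A small sketch (using a synthetic file, `big.csv`, in place of train.csv):

```python
import pandas as pd

# Synthetic stand-in for a large file: 10 rows.
pd.DataFrame({"x": range(10)}).to_csv("big.csv", index=False)

total_rows = 0
for part in pd.read_csv("big.csv", chunksize=4):
    total_rows += len(part)  # each chunk is an ordinary DataFrame

print(total_rows)  # -> 10 (chunks of 4, 4, and 2 rows)
```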
1.5 Replacing the header and index
Rename the column headers (translated labels here) and set the passenger ID as the index.
```python
df = pd.read_csv('train.csv',
                 names=['passenger ID', 'survived', 'passenger class', 'passenger name',
                        'gender', 'age', 'number of cousins/siblings', 'number of parents/children',
                        'ticket information', 'fare', 'cabin', 'boarding port'],
                 index_col='passenger ID', header=0)
```
- Direct (brute-force) modification
```python
df = pd.read_csv('train.csv')
df.columns = ['passenger ID', 'survived', 'passenger class', 'passenger name',
              'gender', 'age', 'number of cousins/siblings', 'number of parents/children',
              'ticket information', 'fare', 'cabin', 'boarding port']
df.head()
```
- Through the rename method; note that with inplace=True the original DataFrame is modified in place.
```python
df = pd.read_csv('train.csv')
df.rename(columns={'PassengerId': 'passenger ID', 'Survived': 'survived',
                   'Pclass': 'passenger class', 'Name': 'passenger name',
                   'Sex': 'gender', 'Age': 'age',
                   'SibSp': 'number of cousins/siblings',
                   'Parch': 'number of parents/children',
                   'Ticket': 'ticket information', 'Fare': 'fare',
                   'Cabin': 'cabin', 'Embarked': 'boarding port'},
          inplace=True)
df.head()
```
- Modifying the index: in rename(), if you do not pass columns=..., the mapping is applied to the index by default.
```python
df = pd.read_csv('train.csv')
df.rename({0: 'PassengerId'})  # renames index label 0; returns a new DataFrame
```
- The rename() function is suited to changing individual index or column labels. When the index needs to be replaced wholesale (for example, with an existing column), the set_index() function is more convenient.
df.set_index('PassengerId',inplace=True)
- Modify the headers and the index together
```python
df = pd.read_csv('train.csv')
df.rename(columns={'PassengerId': 'passenger ID', 'Survived': 'survived',
                   'Pclass': 'passenger class', 'Name': 'passenger name',
                   'Sex': 'gender', 'Age': 'age',
                   'SibSp': 'number of cousins/siblings',
                   'Parch': 'number of parents/children',
                   'Ticket': 'ticket information', 'Fare': 'fare',
                   'Cabin': 'cabin', 'Embarked': 'boarding port'},
          inplace=True)
df.set_index('passenger ID', inplace=True)
df.head()
```
1.5 View basic information about the data
```python
# Basic properties of a DataFrame
df.shape    # number of rows and columns
df.dtypes   # data type of each column
df.ndim     # number of dimensions
df.index    # row index
df.columns  # column index
df.values   # the values, as a two-dimensional ndarray
```
```python
# DataFrame overview
df.head(10)    # first 10 rows (default is 5)
df.tail()      # last rows (default is 5)
df.info()      # concise summary: row/column counts, non-null counts per column, dtypes, memory usage
df.describe()  # quick statistics: count, mean, std, min, quartiles, max
```
2. Data structure
2.1.1 Series
A Series is a one-dimensional labeled array that can hold integers, floats, strings, Python objects, and more. The axis labels are collectively called the index. A Series is created by calling pd.Series:
s = pd.Series(data, index=index)
In the above code, data supports the following data types:
- Python dictionary
- Multidimensional Arrays
- scalar value (e.g., 5)
index is a list of axis labels. The behavior depends on the type of data:
Multidimensional Arrays
When data is a multidimensional array, the length of index must be the same as the length of data. When the index parameter is not specified, a numeric index is created, i.e. [0, ..., len(data) - 1].
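A quick check of the default index described above (a tiny synthetic array for illustration):

```python
import numpy as np
import pandas as pd

# Without an explicit index, pandas generates the default 0..len(data)-1 index.
s0 = pd.Series(np.array([10, 20, 30]))
print(list(s0.index))  # -> [0, 1, 2]
```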
```python
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64
```
dictionary
Series can be instantiated with dictionaries:
```python
>>> d = {'b': 1, 'a': 0, 'c': 2}
>>> pd.Series(d)
b    1
a    0
c    2
dtype: int64
```
scalar value
When data is a scalar value, an index must be provided. Series repeats the scalar value by the index length.
```python
>>> pd.Series(8., index=['a', 'b', 'c'])
a    8.0
b    8.0
c    8.0
dtype: float64
```
2.1.2 Series is similar to multidimensional array
Series operations are similar to those on ndarrays, supporting most NumPy functions as well as index-based slicing.
```python
>>> s[0]
0.4691122999071863
>>> s[:3]
a    0.469112
b   -0.282863
c   -1.509059
dtype: float64
>>> s[s > s.median()]
a    0.469112
e    1.212112
dtype: float64
>>> s[[4, 3, 1]]
e    1.212112
d   -1.135632
b   -0.282863
dtype: float64
>>> np.exp(s)
a    1.598575
b    0.753623
c    0.221118
d    0.321219
e    3.360575
dtype: float64
```
2.1.3 Series is similar to dictionary
Series are like fixed-size dictionaries, and you can use index labels to extract or set values:
```python
>>> s['a']
0.4691122999071863
>>> s['e'] = 12.
>>> s
a     0.469112
b    -0.282863
c    -1.509059
d    -1.135632
e    12.000000
dtype: float64
>>> 'e' in s
True
```
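As with dictionaries, accessing a missing label via [] raises a KeyError, while the .get() method returns None (or an explicit default) instead. A small sketch:

```python
import pandas as pd

ser = pd.Series({"a": 0.5, "b": 1.5})
print("a" in ser)          # True: membership tests the index labels
print(ser.get("z"))        # None: .get() avoids a KeyError for missing labels
print(ser.get("z", -1.0))  # -1.0: an explicit default can be supplied
```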
2.2 DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to an Excel spreadsheet, a SQL table, or a dictionary of Series objects. DataFrame is the most commonly used pandas object. Like Series, DataFrame accepts several kinds of input data:
- Dictionaries of 1D ndarrays, lists, dicts, or Series
- 2D numpy.ndarray
- Structured or record ndarray
- Series
- DataFrame
In addition to data, you can optionally pass index (row labels) and columns (column labels) parameters. Passing an index or columns guarantees that the resulting DataFrame contains those labels. When building a DataFrame from a dictionary of Series with a specified index, any data whose labels do not match the passed index is discarded.
```python
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
pd.DataFrame(d)

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
```
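To make the index-matching rule concrete, here is a small sketch (toy data, not the Titanic set): label "c" in the data is discarded because it is absent from the passed index, and label "z" appears with NaN because it has no data.

```python
import pandas as pd

d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])}
# "c" is dropped (not in the passed index); "z" has no data -> NaN.
dfi = pd.DataFrame(d, index=["a", "b", "z"])
print(dfi)
```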
2.2.1 Extract, add, delete columns
A DataFrame behaves like a dictionary of Series objects sharing an index; extracting, setting, and deleting columns works much like the corresponding dictionary operations:
```python
>>> df['one']
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
>>> df['three'] = df['one'] * df['two']
>>> df['flag'] = df['one'] > 2
>>> df
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
```
Dropping columns (del, pop, drop) also mirrors dictionary behavior:
```python
test_1 = pd.read_csv('test_1.csv')
# Each line below removes column 'a'; use whichever one you need
# (running them in sequence would fail, since 'a' is gone after the first).
del test_1['a']
# test_1.pop('a')                            # also returns the removed column
# test_1.drop(['a'], axis=1, inplace=True)   # returns a new frame unless inplace=True
```
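The three removal styles differ in what they return; a small sketch with a toy frame (standing in for test_1.csv) contrasts them:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

del toy["a"]                    # removes in place, returns nothing
popped = toy.pop("b")           # removes in place AND returns the column as a Series
toy2 = toy.drop(["c"], axis=1)  # returns a new frame; original untouched without inplace=True

print(list(toy.columns))   # -> ['c']
print(list(popped))        # -> [3, 4]
print(list(toy2.columns))  # -> []
```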
2.2.2 Index / Selection
The basic usage of the index is as follows:
```python
# Filter by "Age" to display passengers under the age of 10
df[df.Age < 10].head(3)

# Filter by "Age" to display passengers over 10 and under 50 years old
df[(df.Age > 10) & (df.Age < 50)].head(3)

# Display "Pclass" and "Sex" for the row labeled 100
df.loc[[100], ['Pclass', 'Sex']]

# Display "Pclass", "Name" and "Sex" for the rows labeled 100, 105, 108
df.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]

# Use iloc to display "Pclass", "Name" and "Sex" at positions 100, 105, 108
df.iloc[[100, 105, 108], [2, 3, 4]]
```
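The key distinction above is that loc selects by label while iloc selects by integer position; the two only coincide when the index happens to be 0, 1, 2, .... A toy frame with a non-default index makes this visible:

```python
import pandas as pd

# Non-default integer index: labels are 100..102, positions are 0..2.
dfl = pd.DataFrame({"x": [10, 20, 30]}, index=[100, 101, 102])

print(dfl.loc[100, "x"])  # -> 10 (label 100 is the first row here)
print(dfl.iloc[2, 0])     # -> 30 (position 2 is the third row)
```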
3. Exploratory Data Analysis
3.1 Data sorting
```python
# Build a DataFrame of numbers by hand
d = {'2': [1., 2., 3., 4.], '1': [4., 3., 2., 1.], '3': [5., 23., 28., 30.]}
frame = pd.DataFrame(d, index=['a', 'c', 'd', 'b'])
```
- Sort the constructed DataFrame by a given column, in descending order

frame.sort_values(by='1', ascending=False)
- Sort the row index in ascending order
frame.sort_index()
- Sort column index in ascending order
frame.sort_index(axis=1)
- Sort column index in descending order
frame.sort_index(axis=1,ascending=False)
- Sort by two columns at the same time, both in descending order

frame.sort_values(by=['2', '3'], ascending=False)
3.2 Exploratory Analysis
In the era of big data, chaotic, unstructured, massive multimedia data accumulates continuously, recording traces of human activity through many channels. Exploratory data analysis is an effective tool for making sense of it.
In the book Doing Data Science (by Rachel Schutt and Cathy O'Neil, published in the United States in 2014; Chinese translation by Feng Lingbing and Wang Qunfeng), exploratory data analysis is listed as a key step in the data science workflow, one that can affect multiple later stages.
- Sort the Titanic data (train.csv) by both the fare and age columns, in descending order
df.sort_values(by=['fare','age'],ascending=False).head(20)
- Use Pandas arithmetic to compute the sum of two DataFrames
```python
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df + df2
```
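Note that DataFrame addition aligns on both row and column labels, so positions outside the overlap come out as NaN; the add() method with fill_value can treat the missing side as 0 instead. A small deterministic sketch (using ones instead of random data):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.ones((3, 3)), columns=["A", "B", "C"])
b = pd.DataFrame(np.ones((2, 2)), columns=["A", "B"])

summed = a + b                    # labels outside the overlap become NaN
filled = a.add(b, fill_value=0)   # missing entries on one side are treated as 0

print(int(summed.isna().sum().sum()))  # -> 5 (all of column C, plus row 2 of A and B)
print(int(filled.isna().sum().sum()))  # -> 0
```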
- Learn to use Pandas describe() function to view basic statistics of data
df.describe()
- View the basic statistics of individual columns of the Titanic dataset, such as fare and number of parents/children.
```python
df = pd.read_csv('train_chinese.csv')
df['fare'].describe()
```