Data analysis using Python: data wrangling
In many applications, data may be scattered across many files or databases, stored in a form that is not convenient for analysis. This section focuses on methods for aggregating, merging, and reshaping data.
# Import packages
import pandas as pd
import numpy as np
Merge datasets
Data held in pandas objects can be combined in several ways:
- pandas.merge connects rows in different DataFrames based on one or more keys, implementing database-style join operations.
- pandas.concat stacks multiple objects along an axis.
- The instance method combine_first splices overlapping data together, filling missing values in one object with values from another.
Database-style DataFrame merging
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df1
| | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | a | 5 |
6 | b | 6 |
df2
| | key | data2 |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | d | 2 |
Use the on= parameter to specify which column to join on:
pd.merge(df1, df2, on='key')
| | key | data1 | data2 |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
If the key column names differ between the two objects, you can specify them separately:
df1 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key2': ['a', 'b', 'd'], 'data2': range(3)})
pd.merge(df1, df2, left_on='key1', right_on = 'key2')
| | key1 | data1 | key2 | data2 |
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | b | 6 | b | 1 |
3 | a | 2 | a | 0 |
4 | a | 4 | a | 0 |
5 | a | 5 | a | 0 |
By default, merge performs an "inner join": the keys in the result are the intersection of the two key sets. The other options are 'left', 'right', and 'outer', specified via the how= parameter:
pd.merge(df1, df2, left_on='key1', right_on = 'key2',how = 'right')
| | key1 | data1 | key2 | data2 |
---|---|---|---|---|
0 | a | 2.0 | a | 0 |
1 | a | 4.0 | a | 0 |
2 | a | 5.0 | a | 0 |
3 | b | 0.0 | b | 1 |
4 | b | 1.0 | b | 1 |
5 | b | 6.0 | b | 1 |
6 | NaN | NaN | d | 2 |
The four available join options are:
option | explain |
---|---|
inner | Use keys that both tables have |
left | Use all keys in the left table |
right | Use all keys in the right table |
outer | Use all keys in both tables |
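A minimal sketch of the four options on two tiny frames (the names df_l, df_r, lval, and rval are illustrative, not from the text):

```python
import pandas as pd

df_l = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
df_r = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

inner = pd.merge(df_l, df_r, how='inner')  # keys b, c (intersection)
left = pd.merge(df_l, df_r, how='left')    # keys a, b, c (all left keys)
right = pd.merge(df_l, df_r, how='right')  # keys b, c, d (all right keys)
outer = pd.merge(df_l, df_r, how='outer')  # keys a, b, c, d (union)
```

Rows with no match on the other side (e.g. 'a' in a left join) get NaN in the missing columns.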
To merge on multiple keys, pass a list of column names:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1','key2'], how = 'outer')
| | key1 | key2 | lval | rval |
---|---|---|---|---|
0 | foo | one | 1.0 | 4.0 |
1 | foo | one | 1.0 | 5.0 |
2 | foo | two | 2.0 | NaN |
3 | bar | one | 3.0 | 6.0 |
4 | bar | two | NaN | 7.0 |
If the objects being merged have overlapping column names, the suffixes option specifies the strings appended to the duplicate names:
pd.merge(left, right, on='key1',suffixes=('_left','_right'))
| | key1 | key2_left | lval | key2_right | rval |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
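When auditing a merge, pandas also offers an indicator=True option (not covered above) that adds a _merge column recording where each row came from. A small sketch, rebuilding the left and right frames so the snippet is self-contained:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# _merge is 'both', 'left_only', or 'right_only' for each row
merged = pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
print(merged[['key1', 'key2', '_merge']])
```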
Merge on index
merge's right_index and left_index parameters specify that the row index of the right or left object be used as the join key (no example is given here). This section instead introduces the join method, which makes merging by index more convenient:
left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
left
| | Ohio | Nevada |
---|---|---|
a | 1.0 | 2.0 |
c | 3.0 | 4.0 |
e | 5.0 | 6.0 |
right
| | Missouri | Alabama |
---|---|---|
b | 7.0 | 8.0 |
c | 9.0 | 10.0 |
d | 11.0 | 12.0 |
e | 13.0 | 14.0 |
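The left_index/right_index route mentioned above can be sketched with these two frames; merging on both indexes with how='inner' keeps only the shared labels (a hedged illustration, equivalent here to an index join):

```python
import pandas as pd

left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                     index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

# Use the row indexes of both sides as the join key
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(list(merged.index))  # only the shared labels: ['c', 'e']
```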
Owing to a legacy of earlier versions, DataFrame's join method performs a left join by default, preserving the left table's row index:
left.join(right)
| | Ohio | Nevada | Missouri | Alabama |
---|---|---|---|---|
a | 1.0 | 2.0 | NaN | NaN |
c | 3.0 | 4.0 | 9.0 | 10.0 |
e | 5.0 | 6.0 | 13.0 | 14.0 |
For simple index merges, you can also pass a list of DataFrames to join:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], index=['a', 'c', 'e', 'f'], columns=['New York','Oregon'])
another
| | New York | Oregon |
---|---|---|
a | 7.0 | 8.0 |
c | 9.0 | 10.0 |
e | 11.0 | 12.0 |
f | 16.0 | 17.0 |
join supports the four modes {'left', 'right', 'outer', 'inner'}:
left.join([right,another], how='outer')
| | Ohio | Nevada | Missouri | Alabama | New York | Oregon |
---|---|---|---|---|---|---|
a | 1.0 | 2.0 | NaN | NaN | 7.0 | 8.0 |
c | 3.0 | 4.0 | 9.0 | 10.0 | 9.0 | 10.0 |
e | 5.0 | 6.0 | 13.0 | 14.0 | 11.0 | 12.0 |
b | NaN | NaN | 7.0 | 8.0 | NaN | NaN |
d | NaN | NaN | 11.0 | 12.0 | NaN | NaN |
f | NaN | NaN | NaN | NaN | 16.0 | 17.0 |
Concatenating along an axis
Suppose there are three Series with non-overlapping indexes. pd.concat can stack their values and indexes together; it works along axis=0 by default:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
If axis=1 is passed, the Series are joined column-wise and the result becomes a DataFrame:
pd.concat([s1,s2,s3], axis=1)
| | 0 | 1 | 2 |
---|---|---|---|
a | 0.0 | NaN | NaN |
b | 1.0 | NaN | NaN |
c | NaN | 2.0 | NaN |
d | NaN | 3.0 | NaN |
e | NaN | 4.0 | NaN |
f | NaN | NaN | 5.0 |
g | NaN | NaN | 6.0 |
After concat, it is not obvious which fragment each piece of the stacked data came from. The keys parameter labels each fragment in the result:
# Merging the Series along axis=0 generates a hierarchical index
pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])
s1  a    0
    b    1
s2  c    2
    d    3
    e    4
s3  f    5
    g    6
dtype: int64
# If the Series are merged along axis=1, keys become the DataFrame's column headers:
pd.concat([s1, s2, s3], axis=1, keys=['s1', 's2', 's3'])
| | s1 | s2 | s3 |
---|---|---|---|
a | 0.0 | NaN | NaN |
b | 1.0 | NaN | NaN |
c | NaN | 2.0 | NaN |
d | NaN | 3.0 | NaN |
e | NaN | 4.0 | NaN |
f | NaN | NaN | 5.0 |
g | NaN | NaN | 6.0 |
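The keys mechanism extends to DataFrames: along axis=1 the keys become the outer level of a hierarchical column index. A minimal sketch (dfa and dfb are illustrative names):

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['x', 'y'])
dfb = pd.DataFrame(np.arange(4, 8).reshape(2, 2), columns=['x', 'y'])

# Columns become a MultiIndex: (level1, x), (level1, y), (level2, x), (level2, y)
out = pd.concat([dfa, dfb], axis=1, keys=['level1', 'level2'])
print(out['level1'])  # select one fragment back out by its key
```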
If the indexes overlap, the join= parameter controls whether the connection is 'inner' or 'outer' (the default is 'outer'):
s4 = pd.concat([s1, s3])
s1
a    0
b    1
dtype: int64
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1,s4], axis =1)
| | 0 | 1 |
---|---|---|
a | 0.0 | 0 |
b | 1.0 | 1 |
f | NaN | 5 |
g | NaN | 6 |
pd.concat([s1,s4], axis =1, join='inner')
| | 0 | 1 |
---|---|---|
a | 0 | 0 |
b | 1 | 1 |
The same logic as for Series applies to DataFrame objects. Note that if the DataFrame's row index carries no meaningful data, you can pass ignore_index=True to declare that the indexes along the concatenation axis should not be kept; a new index, range(total_length), is generated instead:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
| | a | b | c | d |
---|---|---|---|---|
0 | 0.095987 | -1.429683 | 0.213154 | 0.209158 |
1 | 0.405074 | 2.060263 | 0.154940 | -1.547939 |
2 | -0.283798 | 0.185476 | -0.833157 | 0.204095 |
df2
| | b | d | a |
---|---|---|---|
0 | -0.743573 | -0.081083 | -0.832971 |
1 | 0.432616 | -0.393941 | -1.465352 |
pd.concat([df1,df2], ignore_index = True)
| | a | b | c | d |
---|---|---|---|---|
0 | 0.095987 | -1.429683 | 0.213154 | 0.209158 |
1 | 0.405074 | 2.060263 | 0.154940 | -1.547939 |
2 | -0.283798 | 0.185476 | -0.833157 | 0.204095 |
3 | -0.832971 | -0.743573 | NaN | -0.081083 |
4 | -1.465352 | 0.432616 | NaN | -0.393941 |
Merge overlapping data
For example, you might have two datasets whose indexes overlap in whole or in part, and want to use one dataset to fill in the missing values of the corresponding fields in the other:
s1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])
s2 = pd.Series(np.arange(len(s1), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])
s2.iloc[-1] = np.nan
s1
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
s2
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
Method 1: np.where, which is equivalent to an array-oriented if-else:
# Where the condition holds (s1 is null), take the value from s2; otherwise keep the value from s1
np.where(pd.isnull(s1), s2, s1)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
Method 2: fillna
s1.fillna(s2)
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64
Method 3: combine_first
s1.combine_first(s2)
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64
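For this data the approaches agree wherever either input supplies a value; a quick sketch confirming that fillna and combine_first give the same result here (rebuilding s1 and s2 so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2 = pd.Series(np.arange(len(s1), dtype=np.float64),
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2.iloc[-1] = np.nan

filled = s1.fillna(s2)            # fill s1's holes from the aligned s2
combined = s1.combine_first(s2)   # same idea, method form
assert filled.equals(combined)    # identical on this data
assert np.isnan(combined['a'])    # missing in both inputs stays NaN
```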
These methods also apply to DataFrames. The data in the passed object is used to "patch" the missing data of the calling object:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan], 'b': [np.nan, 2., np.nan, 6.], 'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.], 'b': [np.nan, 3., 4., 6., 8.]})
df1
| | a | b | c |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | NaN | 2.0 | 6 |
2 | 5.0 | NaN | 10 |
3 | NaN | 6.0 | 14 |
df2
| | a | b |
---|---|---|
0 | 5.0 | NaN |
1 | 4.0 | 3.0 |
2 | NaN | 4.0 |
3 | 3.0 | 6.0 |
4 | 7.0 | 8.0 |
df1.combine_first(df2)
| | a | b | c |
---|---|---|---|
0 | 1.0 | NaN | 2.0 |
1 | 4.0 | 2.0 | 6.0 |
2 | 5.0 | 4.0 | 10.0 |
3 | 3.0 | 6.0 | 14.0 |
4 | 7.0 | 8.0 | NaN |
Hierarchical index
Hierarchical indexing is an important feature of pandas. It lets you have multiple (two or more) index levels on a single axis.
For a DataFrame, each axis can have a hierarchical index:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame
| | | Ohio | Ohio | Colorado |
---|---|---|---|---|
| | | Green | Red | Green |
a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
Each level can have a name:
frame.index.names = ['key1','key2']
frame.columns.names = ['state','color']
frame
state | | Ohio | Ohio | Colorado |
---|---|---|---|---|
color | | Green | Red | Green |
key1 | key2 | | | |
a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
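With named levels, partial indexing is straightforward: selecting an outer column or row label drops that level. A small sketch with the same frame (rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

ohio = frame['Ohio']    # selecting a state drops the outer column level
row_a = frame.loc['a']  # selecting a key1 label drops the outer row level
print(ohio.columns.tolist())  # ['Green', 'Red']
```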
Reordering and sorting levels
swaplevel accepts two level numbers or names and returns a new object with the levels interchanged (the data itself is unchanged):
frame.swaplevel(0,1)
state | | Ohio | Ohio | Colorado |
---|---|---|---|---|
color | | Green | Red | Green |
key2 | key1 | | | |
1 | a | 0 | 1 | 2 |
2 | a | 3 | 4 | 5 |
1 | b | 6 | 7 | 8 |
2 | b | 9 | 10 | 11 |
frame.swaplevel('key1','key2')
state | | Ohio | Ohio | Colorado |
---|---|---|---|---|
color | | Green | Red | Green |
key2 | key1 | | | |
1 | a | 0 | 1 | 2 |
2 | a | 3 | 4 | 5 |
1 | b | 6 | 7 | 8 |
2 | b | 9 | 10 | 11 |
sort_index sorts the data by the values in a single level; it is often used after swapping levels:
frame.swaplevel('key1','key2').sort_index(level=0)
state | | Ohio | Ohio | Colorado |
---|---|---|---|---|
color | | Green | Red | Green |
key2 | key1 | | | |
1 | a | 0 | 1 | 2 |
| | b | 6 | 7 | 8 |
2 | a | 3 | 4 | 5 |
| | b | 9 | 10 | 11 |
Index using DataFrame columns
In some data analysis scenarios, you may want to use one or more columns of the DataFrame as row indexes, or you may want to turn row indexes into columns of the DataFrame.
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd': [0, 1, 2, 0, 1, 2, 3]})
frame
| | a | b | c | d |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |
DataFrame's set_index method converts one or more of its columns into a row index, creating a new DataFrame:
frame2 = frame.set_index(['c','d'])
frame2
| | | a | b |
---|---|---|---|
c | d | | |
one | 0 | 0 | 7 |
| | 1 | 1 | 6 |
| | 2 | 2 | 5 |
two | 0 | 3 | 4 |
| | 1 | 4 | 3 |
| | 2 | 5 | 2 |
| | 3 | 6 | 1 |
reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns:
frame2.reset_index()
| | c | d | a | b |
---|---|---|---|---|
0 | one | 0 | 0 | 7 |
1 | one | 1 | 1 | 6 |
2 | one | 2 | 2 | 5 |
3 | two | 0 | 3 | 4 |
4 | two | 1 | 4 | 3 |
5 | two | 2 | 5 | 2 |
6 | two | 3 | 6 | 1 |
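By default set_index removes the chosen columns from the frame; the standard drop=False option keeps them as ordinary columns as well. A short sketch:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# 'c' and 'd' become index levels but also remain as columns
kept = frame.set_index(['c', 'd'], drop=False)
print(kept.columns.tolist())  # ['a', 'b', 'c', 'd']
```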
Data reshaping
A hierarchical index provides a consistent way to rearrange data in a DataFrame. There are two main methods:
- stack: rotates columns of the data into rows.
- unstack: rotates rows of the data into columns.
For example, the following data are the test scores of four students (a, b, c, d) in three courses (1, 2, 3):
data = pd.DataFrame(list(zip(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3], np.random.randn(9))), columns = ['student','course','score'])
data
| | student | course | score |
---|---|---|---|
0 | a | 1 | 0.469069 |
1 | a | 2 | 0.568829 |
2 | a | 3 | 1.293216 |
3 | b | 1 | -0.799458 |
4 | b | 3 | 2.323059 |
5 | c | 1 | 0.419970 |
6 | c | 2 | 0.576399 |
7 | d | 2 | 0.459112 |
8 | d | 3 | -0.654535 |
Hierarchical indexes play an important role in data reshaping and group-based operations (such as building pivot tables). For example:
First, set "student" and "course" as the index to form a hierarchical index:
data = data.set_index(['student','course'])
data
| | | score |
---|---|---|
student | course | |
a | 1 | 0.469069 |
| | 2 | 0.568829 |
| | 3 | 1.293216 |
b | 1 | -0.799458 |
| | 3 | 2.323059 |
c | 1 | 0.419970 |
| | 2 | 0.576399 |
d | 2 | 0.459112 |
| | 3 | -0.654535 |
Next, reshape the data with unstack to obtain a student-by-course pivot table, convenient for subsequent statistics across scores:
result = data.unstack()
result
| | score | score | score |
---|---|---|---|
course | 1 | 2 | 3 |
student | | | |
a | 0.469069 | 0.568829 | 1.293216 |
b | -0.799458 | NaN | 2.323059 |
c | 0.419970 | 0.576399 | NaN |
d | NaN | 0.459112 | -0.654535 |
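With the pivot in hand, cross-score statistics become ordinary column operations, and NaN cells are skipped automatically. A sketch using fixed scores instead of the random ones above:

```python
import pandas as pd

data = pd.DataFrame({
    'student': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'course': [1, 2, 3, 1, 3, 1, 2, 2, 3],
    'score': [80., 90., 70., 60., 50., 40., 30., 20., 10.],
}).set_index(['student', 'course'])

result = data.unstack()
course_mean = result['score'].mean()  # mean per course; NaN cells are skipped
print(course_mean)
```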
stack is the inverse of unstack:
result.stack()
| | | score |
---|---|---|
student | course | |
a | 1 | 0.469069 |
| | 2 | 0.568829 |
| | 3 | 1.293216 |
b | 1 | -0.799458 |
| | 3 | 2.323059 |
c | 1 | 0.419970 |
| | 2 | 0.576399 |
d | 2 | 0.459112 |
| | 3 | -0.654535 |
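A round-trip sketch: when no cells are missing, unstack followed by stack reproduces the original long table exactly. (With missing cells, stack's default NaN handling differs between pandas versions, so a complete table is used here; the values are illustrative.)

```python
import pandas as pd

data = pd.DataFrame({
    'student': ['a', 'a', 'b', 'b'],
    'course': [1, 2, 1, 2],
    'score': [80., 90., 70., 60.],
}).set_index(['student', 'course'])

wide = data.unstack()  # student x course pivot
long = wide.stack()    # back to the hierarchical long form
assert long.equals(data)  # lossless round trip on complete data
```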
Previously in this series:
Data analysis with Python: preparation
Data analysis using Python: missing data (based on DataFrame)
Data analysis with Python: data transformation (based on DataFrame)