Data analysis using Python: data normalization (based on DataFrame)

In many applications, data may be scattered across many files or databases, or stored in a form that is not conducive to analysis. This section focuses on methods for aggregating, merging, and reshaping data.

# Import package
import pandas as pd
import numpy as np

Merge datasets

Data held in pandas objects can be combined in several ways:

  • pandas.merge joins rows of different DataFrames on one or more keys, implementing database-style join operations.
  • pandas.concat stacks multiple objects together along an axis.
  • The instance method combine_first splices overlapping data together, filling missing values in one object with values from another.

Database-style DataFrame merging

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2
  key  data2
0   a      0
1   b      1
2   d      2

Use on= to specify which column to join on:

pd.merge(df1, df2, on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

If the key column names differ between the two objects, you can specify them separately:

df1 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df2 = pd.DataFrame({'key2': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df1, df2, left_on='key1', right_on='key2')
  key1  data1 key2  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0

By default, merge performs an inner join: the keys in the result are the intersection of the two tables. The other options are 'left', 'right', and 'outer', specified through the how= argument:

pd.merge(df1, df2, left_on='key1', right_on='key2', how='right')
  key1  data1 key2  data2
0    a    2.0    a      0
1    a    4.0    a      0
2    a    5.0    a      0
3    b    0.0    b      1
4    b    1.0    b      1
5    b    6.0    b      1
6  NaN    NaN    d      2

Here are the four join options:

option   explanation
inner    use only the keys found in both tables
left     use all keys from the left table
right    use all keys from the right table
outer    use all keys from both tables
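A quick way to see what each option keeps is merge's indicator=True flag (a standard pandas parameter), which tags every result row with its origin. A small sketch using the df1/df2 frames from above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# indicator=True adds a _merge column recording which table each row came from,
# making the difference between the four join types easy to inspect
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']])
```

Rows marked 'both' would survive an inner join; 'left_only' and 'right_only' rows appear only in left/right/outer joins respectively.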

To merge on multiple keys, pass a list of column names:

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0

If the objects being merged have overlapping column names, the suffixes option specifies strings to append to the duplicated names:

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

The remaining parameters of merge (left_index, right_index, sort, suffixes, copy, and so on) are listed in the pandas reference documentation.


Merge on index

The pd.merge function accepts right_index=True and left_index=True to use the row index of the right or left table as the join key. This section introduces the more convenient join method instead, which merges on indexes directly:
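For completeness, a minimal sketch of index-based merging with pd.merge itself, using two small frames like those in this section:

```python
import pandas as pd

left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=['a', 'c', 'e'],
                    columns=['Ohio', 'Nevada'])
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                     index=['b', 'c', 'd', 'e'],
                     columns=['Missouri', 'Alabama'])

# Use both row indexes as the join key; how='inner' keeps only
# the labels present in both tables ('c' and 'e')
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(merged)
```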

left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                     index=['b', 'c', 'd', 'e'],
                     columns=['Missouri', 'Alabama'])
left
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0
right
   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

For historical reasons, DataFrame's join method performs a left join by default and retains the row index of the left table:

left.join(right)
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
c   3.0     4.0       9.0     10.0
e   5.0     6.0      13.0     14.0

For simple index-on-index merging, you can also pass a list of DataFrames to join:

another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York','Oregon'])
another
   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0

join supports four modes: 'left', 'right', 'outer', and 'inner':

left.join([right, another], how='outer')
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
b   NaN     NaN       7.0      8.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
f   NaN     NaN       NaN      NaN      16.0    17.0

Concatenating along an axis

Suppose there are three Series with non-overlapping indexes. pd.concat stacks their values and indexes together; it works along axis=0 by default:

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

If axis=1 is passed, the objects are concatenated along the columns and the result becomes a DataFrame:

pd.concat([s1, s2, s3], axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

After a plain concat you cannot tell which input each stacked chunk came from. The keys parameter labels each fragment in the result:

#Merge the Series along axis=0 to generate a hierarchical index
pd.concat([s1,s2,s3], keys = ['s1','s2','s3'])
s1  a    0
    b    1
s2  c    2
    d    3
    e    4
s3  f    5
    g    6
dtype: int64
#If you merge Series along axis=1, keys will become the column header of the DataFrame:
pd.concat([s1, s2, s3], axis=1, keys=['s1', 's2', 's3'])
    s1   s2   s3
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

If the indexes overlap, the join= parameter controls whether the connection is 'inner' or 'outer' (the default is 'outer'):

s4 = pd.concat([s1, s3])
s1
a    0
b    1
dtype: int64
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1, s4], axis=1)
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6
pd.concat([s1, s4], axis=1, join='inner')
   0  1
a  0  0
b  1  1

The same logic as for Series applies to DataFrame objects. Note that if the row index of the DataFrame does not contain meaningful data, you can pass ignore_index=True to discard the indexes along the concatenation axis and generate a new range(total_length) index:

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
          a         b         c         d
0  0.095987 -1.429683  0.213154  0.209158
1  0.405074  2.060263  0.154940 -1.547939
2 -0.283798  0.185476 -0.833157  0.204095
df2
          b         d         a
0 -0.743573 -0.081083 -0.832971
1  0.432616 -0.393941 -1.465352
pd.concat([df1,df2], ignore_index = True)
          a         b         c         d
0  0.095987 -1.429683  0.213154  0.209158
1  0.405074  2.060263  0.154940 -1.547939
2 -0.283798  0.185476 -0.833157  0.204095
3 -0.832971 -0.743573       NaN -0.081083
4 -1.465352  0.432616       NaN -0.393941

Merge overlapping data

For example, you might have two datasets whose indexes overlap in whole or in part. You want to use one dataset to fill in the missing values of the corresponding fields in another dataset:

s1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2 = pd.Series(np.arange(len(s1), dtype=np.float64),
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2.iloc[-1] = np.nan
s1
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
s2
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

Method 1: np.where, which acts as an array-oriented if-else:

# Where s1 is null, take the corresponding value from s2; otherwise keep s1's value
np.where(pd.isnull(s1), s2, s1)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
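Note that np.where returns a plain NumPy array, so the index labels are lost. If you want a labeled Series back, one option is to rewrap the result; a small sketch using the same s1/s2:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2 = pd.Series(np.arange(len(s1), dtype=np.float64),
               index=['f', 'e', 'd', 'c', 'b', 'a'])
s2.iloc[-1] = np.nan

# np.where returns a bare ndarray, dropping the index;
# wrapping it in a Series restores the labels
patched = pd.Series(np.where(s1.isnull(), s2, s1), index=s1.index)
print(patched)
```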

Method 2: fillna

s1.fillna(s2)
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

Method 3: combine_first

s1.combine_first(s2)
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

The methods above also work on DataFrames. The data in the passed object is used to "patch" missing data in the calling object:

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                     'b': [np.nan, 2., np.nan, 6.],
                     'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

Hierarchical index

Hierarchical indexing is an important pandas feature: it lets you have multiple (two or more) index levels on a single axis.

For a DataFrame, each axis can have a hierarchical index:

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

Each layer can have a name:

frame.index.names = ['key1','key2']
frame.columns.names = ['state','color']
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
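Named levels also make selection readable: indexing with an outer label drops that level and returns the inner ones. A brief illustration (not from the original text) on the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Partial indexing: selecting an outer level returns the inner levels
print(frame.loc['a'])   # rows where key1 == 'a'
print(frame['Ohio'])    # columns under state == 'Ohio'
```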

Rearrangement and hierarchical sorting

swaplevel accepts two level numbers or names and returns a new object with those levels interchanged (the data itself is unchanged):

frame.swaplevel(0, 1)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
frame.swaplevel('key1', 'key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

sort_index sorts the data by the values in a single level. It is often applied after swapping levels:

frame.swaplevel('key1','key2').sort_index(level=0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11

Index using DataFrame columns

In some data analysis scenarios, you may want to use one or more columns of the DataFrame as row indexes, or you may want to turn row indexes into columns of the DataFrame.

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

The DataFrame method set_index converts one or more of its columns into a row index, creating a new DataFrame:

frame2 = frame.set_index(['c','d'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
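By default set_index removes the indexed columns from the DataFrame body; passing drop=False (a standard option) keeps them as regular columns as well. A quick sketch on the same frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# drop=False keeps columns 'c' and 'd' in the body while also using
# them as the hierarchical row index
frame3 = frame.set_index(['c', 'd'], drop=False)
print(frame3.columns.tolist())
```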

reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into the columns:

frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

Data reshaping

Hierarchical indexes provide a consistent way to rearrange DataFrame data. There are two main operations:

  • stack: rotate the columns of data into rows.
  • unstack: rotates rows of data into columns.

For example, the following data are the test scores of four students a, b, c and d in three courses 1, 2 and 3:

data = pd.DataFrame(list(zip(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                             [1, 2, 3, 1, 3, 1, 2, 2, 3],
                             np.random.randn(9))),
                    columns = ['student','course','score'])
data
  student  course     score
0       a       1  0.469069
1       a       2  0.568829
2       a       3  1.293216
3       b       1 -0.799458
4       b       3  2.323059
5       c       1  0.419970
6       c       2  0.576399
7       d       2  0.459112
8       d       3 -0.654535

Hierarchical indexes play an important role in reshaping data and in group-based operations such as building pivot tables. For example, first set "student" and "course" as the index to form a hierarchical index:

data = data.set_index(['student','course'])
data
                   score
student course
a       1       0.469069
        2       0.568829
        3       1.293216
b       1      -0.799458
        3       2.323059
c       1       0.419970
        2       0.576399
d       2       0.459112
        3      -0.654535

Next, reshape the data with unstack to obtain a student-by-course pivot table, which is convenient for subsequent statistics across the scores:

result = data.unstack()
result
            score
course          1         2         3
student
a        0.469069  0.568829  1.293216
b       -0.799458       NaN  2.323059
c        0.419970  0.576399       NaN
d             NaN  0.459112 -0.654535
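unstack pivots the innermost index level by default, but any level can be chosen by name or number. A small sketch (the scores here are freshly generated random numbers, not the values shown above):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'student': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     'course': [1, 2, 3, 1, 3, 1, 2, 2, 3],
                     'score': np.random.randn(9)})
data = data.set_index(['student', 'course'])

# Default: pivot the innermost level ('course') into columns;
# naming a level pivots that one instead
by_course = data.unstack()            # students as rows, courses as columns
by_student = data.unstack('student')  # courses as rows, students as columns
print(by_course.shape, by_student.shape)
```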

stack is the inverse of unstack:

result.stack()
                   score
student course
a       1       0.469069
        2       0.568829
        3       1.293216
b       1      -0.799458
        3       2.323059
c       1       0.419970
        2       0.576399
d       2       0.459112
        3      -0.654535
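One caveat worth knowing: stack drops missing values by default, so an unstack/stack round trip is only fully lossless if dropna=False is passed (a standard stack option, though its default behavior is changing in recent pandas versions). A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 4.0]}, index=['x', 'y'])

# stack silently removes NaN entries by default;
# dropna=False keeps every row/column position
print(len(df.stack()))              # the NaN at ('y', 'a') is dropped
print(len(df.stack(dropna=False)))  # all four positions are kept
```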

Previous posts in this series:
Data analysis with Python: preparation
Data analysis using Python: missing data (based on DataFrame)
Data analysis with Python: data transformation (based on DataFrame)

Tags: Python Data Analysis Data Mining

Posted by Rob the R on Wed, 01 Jun 2022 15:16:29 +0530