String parsing in pandas

Table of contents

A column (Series) is cast to string type: .astype(str)

  Error reporting experience

    Phenomenon

A string type column removes the special symbols at both ends: .str.strip()

Splitting a string type column: str.split() detailed explanation

A column (Series) is cast to string type: .astype(str)

To view the data type of each column in pandas, use df.dtypes. It is an attribute, not a method, so no parentheses follow it. A Python str is stored under the object dtype by default, but object is only a catch-all: pandas falls back to it for any column that cannot be represented as int, float, or another specific type. An object column is therefore not guaranteed to contain strings. If you want to use the string methods, it is generally safer to cast the column with .astype(str) than to set it to object.
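As a minimal sketch of the point above, assuming a small made-up frame (the column names code and label are illustrative and not from the original data), the following shows dtypes used as an attribute and what astype(str) does to the individual values:

```python
import pandas as pd

# Hypothetical sample frame; the column names are illustrative only.
df = pd.DataFrame({"code": [202, 303], "label": ["a0b", "c0d"]})

# dtypes is an attribute, so it is accessed without parentheses.
dtypes = df.dtypes

# astype(str) turns every element of the integer column into a Python str.
code_as_str = df["code"].astype(str)
element_types = set(code_as_str.map(type))

print(dtypes)
print(code_as_str.tolist())
print(element_types)
```

Checking the element types directly is a simple way to confirm what a column really contains, independent of the dtype label that pandas reports.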

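When a column is merely converted to object, its values are not turned into strings, so str.split cannot split them. In the pandas version used here the result is empty (NaN); newer pandas releases may instead raise an AttributeError as soon as .str is accessed on a column that holds only integers.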

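import pandas as pd

# data stands in for an existing DataFrame
data = pd.DataFrame(index=range(2))
data['test'] = 202
data['testStr'] = 303
data = data[['test', 'testStr']]
data['test'] = data['test'].astype(object)  # object dtype, but the values are still ints
# The values are not strings, so str.split cannot split them and the result is NaN
# (newer pandas versions may raise AttributeError at .str instead).
data['test_object'] = data.test.str.split('0')
print(data.head(2))
print(data.dtypes)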

Result: every value in the test_object column is empty (NaN).

When a column is converted to str type, str.split splits each value correctly and the parts can be retrieved.

data['testStr'] = 303
print(data.head(2))
data['testStr'] = data['testStr'].astype(str)  # convert every value to str
# The values are now strings, so str.split can split them and the parts can be used
data['test_str_spl'] = data.testStr.str.split('0')
print(data.head(2))
print(data.dtypes)

Output: in the pandas version used here, df.dtypes still displays the testStr column as object, but its values have actually been converted to str. So if you want to call a .str method on a column, it is usually safer to cast that column to str first.
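Because the dtype label alone does not say whether the values are strings, one way to check is pandas.api.types.infer_dtype, which inspects the values themselves. The sketch below uses two made-up Series rather than the original data:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical values, mirroring the 202/303 example above.
as_object = pd.Series([202, 303]).astype(object)  # the elements are still ints
as_str = pd.Series([202, 303]).astype(str)        # the elements are now str

# infer_dtype looks at the values rather than at the dtype label.
kind_object = infer_dtype(as_object)
kind_str = infer_dtype(as_str)

print(kind_object)
print(kind_str)
```

Here the first Series is reported as holding integers and the second as holding strings, which is the distinction that matters for the .str methods.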

Error reporting experience:

  Phenomenon:  

data['test_object_error']=data.test.str.split('0').str[0]

"AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas" 
When this error appears, check whether the column is merely of object dtype rather than actually made up of strings. Because the data.test column is object but its values are not str, the split returns NaN, the resulting column no longer holds string values, and the subsequent .str[0] cannot be called. If test is first cast with .astype(str), the whole chain works.
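The behaviour described above can be sketched on a small made-up Series (the values are hypothetical). An object column that mixes a string with an integer is used here because it shows the NaN result without depending on how a particular pandas version treats an all-integer object column:

```python
import pandas as pd

# Hypothetical data: an object column that mixes a string with an integer.
mixed = pd.Series(["a0b", 202], dtype=object)

# On an object column the non-string element cannot be split and becomes NaN.
split_mixed = mixed.str.split("0")

# Casting to str first makes every element splittable, so .str[0] can be chained.
first_parts = mixed.astype(str).str.split("0").str[0]

print(split_mixed.tolist())
print(first_parts.tolist())
```

After the cast, the integer 202 becomes the string "202", so splitting on "0" yields two parts and .str[0] picks the first one.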

A string type column removes the special symbols at both ends: .str.strip()

Example: remove the "[" and "]" characters at the beginning and end of each string.

data.per has already been cast to str type.
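The original snippet for this step is not preserved, so the following is only a sketch of how the brackets could be removed with str.strip, using made-up values for the per column:

```python
import pandas as pd

# Hypothetical data: comma-separated values wrapped in square brackets.
data = pd.DataFrame({"per": ["[0.2,0.25]", "[0.3,0.35]"]})

# str.strip treats its argument as a set of characters to remove from both ends.
data["per"] = data["per"].astype(str).str.strip("[]")

print(data["per"].tolist())
```

Passing "[]" removes any leading or trailing "[" or "]" characters and leaves the commas inside the string untouched, so the column is ready to be split.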

Splitting a string type column: str.split() detailed explanation

Use str.split to split the column. The pat parameter specifies the delimiter. With expand=True, the result is a DataFrame in which each field obtained from the split becomes its own column, with column names starting from 0.

import pandas as pd

# data is an existing DataFrame with a str column 'per' and a column 'timeZone'
pers = [0.2, 0.25, 0.275, 0.3, 0.325, 0.35, 0.375, 0.40, 0.50, 0.60, 0.65,
        0.7, 0.75, 0.8, 0.84, 0.86, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98]
# Map the default integer column names 0, 1, 2, ... to the values in pers
pers_dic = {k: str(v) for k, v in enumerate(pers)}
# pat is the delimiter; expand=True returns a DataFrame with one column per field
per_df = data.per.str.split(pat=",", expand=True).rename(columns=pers_dic)
per_cols = list(pers_dic.values())
per_df[per_cols] = per_df[per_cols].astype('float').round(4)  # convert to float
data_part = data[['timeZone']]
# Merge with the original frame on the index (left_index and right_index both True)
res_merge = pd.merge(data_part, per_df, how='left', left_index=True, right_index=True)

Before processing, per is of str type and its values are separated by commas.

After running the code, each comma-separated field is in its own float column named after the corresponding value in pers, alongside timeZone.

Detailed explanation of str.split function parameters:

The parameters of pandas.Series.str.split(pat=None, n=-1, expand=False) are as follows:

  • pat: a string or regular expression to split on. If not specified, the string is split on runs of whitespace (spaces, tabs, newlines).
  • n: the maximum number of splits. The default is -1; None and 0 are treated as -1, meaning split as many times as possible. This matches maxsplit=-1 in Python's str.split() and maxsplit=0 in re.split().
  • expand: whether to spread the split results across multiple columns. True returns a DataFrame; False (the default) returns a Series whose elements are lists.
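The parameters above can be tried on a small made-up Series (the sample values are hypothetical), alongside the plain-Python calls they are compared to:

```python
import re
import pandas as pd

s = pd.Series(["a,b,c", "d,e,f"])  # hypothetical sample values

as_lists = s.str.split(pat=",")               # Series whose elements are lists
one_split = s.str.split(pat=",", n=1)         # at most one split per value
as_frame = s.str.split(pat=",", expand=True)  # DataFrame with columns 0, 1, 2

# Plain-Python equivalents of "split as many times as possible".
py_all = "a,b,c".split(",", -1)
re_all = re.split(",", "a,b,c", maxsplit=0)

print(as_lists.tolist())
print(one_split.tolist())
print(as_frame)
```

With n=1 everything after the first comma stays together in the second element, while expand=True spreads the three fields over three columns.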

Tags: Python programming language

Posted by cutups on Thu, 02 Jun 2022 05:47:26 +0530