Table of contents
A column (series) is cast to string type .astype(str)
A string type column removes the special symbols at both ends .str.strip()
Splitting a string type column str.split() Detailed explanation
A column (series) is cast to string type .astype(str)
To view the data type of each column in pandas, use df.dtypes. It is an attribute, so there is no need to add parentheses after it. The correspondence table between each data type in pandas and Python is as follows: Note that Python's str type is stored as object by default in the pandas version used here, but an object column is not guaranteed to contain only strings. It can hold any Python value, because pandas falls back to object whenever a column cannot be represented as int, float, or another specific type. So if you want to use the string methods, it is generally safer to cast the column to str with .astype(str) than to rely on the object type.
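As a minimal illustration of the attribute access, the sketch below builds a small hypothetical DataFrame (the column names are invented for the example) and prints df.dtypes. The dtype reported for the string column depends on the pandas version, so no particular output is assumed for it.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({"count": [1, 2], "price": [1.5, 2.5], "name": ["a", "b"]})

# dtypes is an attribute (a Series mapping column name -> dtype), not a method.
dtypes = df.dtypes
print(dtypes)

# Calling it like a method fails, because the returned Series is not callable.
try:
    df.dtypes()
    called_ok = True
except TypeError:
    called_ok = False
print("df.dtypes() works:", called_ok)
```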
When a column of non-string data is cast to object, its values are still not strings, so str.split cannot be applied correctly: in the pandas version used here the result is empty (NaN), and other versions may raise an AttributeError instead.
    # data is an existing DataFrame
    data['test'] = 202
    data['testStr'] = 303
    data = data[['test', 'testStr']].copy()
    data['test'] = data['test'].astype(object)  # cast to object; the values are still ints
    # .str.split cannot split the object column of ints: the result is empty (NaN)
    data['test_object'] = data.test.str.split('0')
    print(data.head(2))
    print(data.dtypes)
result:
When a column is cast to str, str.split splits each value correctly and the pieces can be retrieved:
    data['testStr'] = 303
    print(data.head(2))
    data['testStr'] = data['testStr'].astype(str)  # cast to str
    # the column now holds strings, so .str.split can split the values
    data['test_str_spl'] = data.testStr.str.split('0')
    print(data.head(2))
    print(data.dtypes)
Output: although df.dtypes still shows the testStr column as object, its values have actually been converted to str. So if you want to call a .str method on a column, it is safer to cast that column to str first.
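One way to see what a column really contains, regardless of what df.dtypes reports, is to look at the Python type of its elements. The sketch below is only an illustration on a small made-up DataFrame (the values 202 and 303 mirror the example above; the variable names are hypothetical). After astype(object) the elements are still ints, while after astype(str) they are real strings, so the .str methods can be chained.

```python
import pandas as pd

# Small stand-in for the DataFrame used in the text.
data = pd.DataFrame({"test": [202, 202], "testStr": [303, 303]})

as_object = data["test"].astype(object)  # dtype changes, values stay ints
as_str = data["testStr"].astype(str)     # values become Python strings

# Inspect the element types rather than the column dtype.
print(as_object.map(type).unique())
print(as_str.map(type).unique())

# On the string column, split and then take the first piece of each list.
pieces = as_str.str.split("0")
first_piece = pieces.str[0]
print(first_piece.tolist())
```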
An error encountered in practice:
Symptom:
    data['test_object_error'] = data.test.str.split('0').str[0]
"AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas" When this error appears, check whether the column really contains only strings or is merely of object type. Here the data.test column is of object type but its values are not str, so the split produces NaN, and the subsequent .str[0] cannot be called on that result. If test is cast to str first, the call works.
A string type column removes the special symbols at both ends .str.strip()
Example: remove the "[" and "]" marks at the beginning and end of each string.
data.per has been cast to str type.
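The original code for this step is not shown, so the following is only a sketch of how it could look, using invented sample values for the per column. It relies on Series.str.strip, whose to_strip argument is treated as a set of characters to remove from both ends.

```python
import pandas as pd

# Hypothetical data: each value is a bracketed, comma-separated string.
data = pd.DataFrame({"per": ["[0.1,0.2,0.3]", "[0.4,0.5,0.6]"]})

# Make sure the column holds strings before using the .str accessor.
data["per"] = data["per"].astype(str)

# Any leading or trailing "[" or "]" is removed;
# characters in the middle of the string are left alone.
data["per"] = data["per"].str.strip("[]")
print(data["per"].tolist())
```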
Splitting a string type column str.split() Detailed explanation
Use str.split to split the column: the pat parameter specifies the delimiter, and expand=True makes it return a DataFrame in which each field obtained from the split becomes its own column, with column names starting from 0.
    import pandas as pd

    pers = [0.2, 0.25, 0.275, 0.3, 0.325, 0.35, 0.375, 0.40, 0.50, 0.60, 0.65,
            0.7, 0.75, 0.8, 0.84, 0.86, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98]
    # map the default column names 0, 1, 2, ... to '0.2', '0.25', ...
    pers_dic = {k: str(v) for k, v in enumerate(pers)}
    # pat is the delimiter; expand=True returns one column per field, named from 0;
    # rename then renames those columns of per_df using pers_dic
    per_df = data.per.str.split(pat=",", expand=True).rename(columns=pers_dic)
    # convert the split fields from str to float
    cols = [str(v) for v in pers]
    per_df[cols] = per_df[cols].astype('float').round(4)
    data_part = data[['timeZone']]
    # merge on the index of both frames (left_index and right_index are both True)
    # so that the rows line up correctly
    res_merge = pd.merge(data_part, per_df, how='left', left_index=True, right_index=True)
The data before processing: per is of str type, and its values are joined by commas.
After running the code:
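Since the original screenshots are not available, here is a scaled-down, self-contained sketch of the same split, rename, convert and merge steps. The sample rows, the three column names and the variable names are assumptions made for illustration, not the author's data.

```python
import pandas as pd

# Hypothetical data standing in for the original DataFrame.
data = pd.DataFrame(
    {
        "timeZone": ["A", "B"],
        "per": ["0.123456,0.5,0.75", "0.2,0.6,0.987654"],
    }
)
names = ["0.2", "0.25", "0.275"]  # made-up names for the three fields

# Split on commas; expand=True gives a DataFrame with columns 0, 1, 2.
per_df = data["per"].str.split(pat=",", expand=True)

# Rename the positional columns, then convert the strings to floats.
per_df = per_df.rename(columns=dict(enumerate(names)))
per_df[names] = per_df[names].astype("float").round(4)

# Join back to the original rows by index.
res_merge = pd.merge(
    data[["timeZone"]], per_df, how="left", left_index=True, right_index=True
)
print(res_merge)
```

Merging on the index works here because str.split keeps the index of the original column, so each split row lines up with the row it came from.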
Detailed explanation of str.split function parameters:
The parameters of pandas.Series.str.split(pat=None, n=-1, expand=False) are as follows:
- pat: a string or regular expression used as the delimiter. If it is not specified, the string is split on runs of whitespace (newlines, spaces, tabs).
- n: the maximum number of splits; the default value is -1. None and 0 are converted to -1 (as can be seen from the source code in the above figure), meaning the string is split as many times as possible. This matches maxsplit=-1 of Python's str.split() and maxsplit=0 of re.split().
- expand: determines whether the split results are spread across multiple columns (True returns a DataFrame) or kept as a list in a single column (the default False returns a Series of lists).
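The effect of the three parameters can be checked on a tiny example. This is a sketch with made-up input strings; the arguments are passed by keyword because newer pandas versions no longer accept them positionally.

```python
import pandas as pd

s = pd.Series(["a,b,c", "d,e,f"])

# Default n=-1: split on every comma; each element is a list.
all_splits = s.str.split(pat=",")

# n=1: split at most once, so the remainder stays in one piece.
one_split = s.str.split(pat=",", n=1)

# expand=True: one column per field, with column names starting from 0.
as_frame = s.str.split(pat=",", expand=True)

# No pat: split on runs of whitespace (spaces, tabs, newlines).
on_space = pd.Series(["a  b\tc\nd"]).str.split()

print(all_splits.tolist())
print(one_split.tolist())
print(as_frame)
print(on_space.tolist())
```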