Exercise ranking and summary statistics
Data source: https://tianchi.aliyun.com/competition/entrance/231593/information
Using the data set ccf_offline_stage1_test_revised.csv in Coupon Usage Data for O2O, try to find out the merchants that issued the largest amount of coupons and the largest number of coupons during July 2016. (here, only the full reduced amount is considered, and the coupons with a few discounts are not considered.)
To import data, first filter out the data of july2016. This time, it is also imported directly from the database
import sqlalchemy as sac import pandas as pd engine=sac.create_engine('mysql+pymysql://user:password@host:3306/ali_sql') df=pd.read_sql('select * from `ccf_offline_stage1_test_revised`',engine)
Preprocess data first
1. First, adjust the data format to date type, and add a column of "collecting date" to change it to the date format of%Y-%m. Here, make a habit of sorting before filtering date type data (in fact, the data source only has the data of july2016)
2. Use df Str.split() to filter out the data of coupon amount.
3. The filtered data will be separated into a new DataFrame and merged into the original data
4. After merging, it is found that there is a None value in the data. Here, use the method df Fillna(), replace None with 0, and then change the data type of the whole column to int (the original data is of object type). Note that df Str.replace replaces None, but it is invalid. After query, it is found that None and NaN are both missing values. Then, replace the missing value directly
5. After all the data is processed, it is finally merged into a new DataFrame
df['Receipt date']=pd.to_datetime(df['Date_received']).dt.strftime('%Y-%m') df=df.sort_values(by='Receipt date',ascending=True) df1=df[df['Receipt date']=='2016-07'] s1=df1['Discount_rate'].str.split(':',expand=True) s1=pd.concat([df1,s1],axis=1).rename(columns={1:'amount of money'}) s1['amount of money']=s1['amount of money'].fillna(0) s2=s1.amount of money.astype('int') df1=pd.concat([s1,s2],axis=1)
Final summary statistics
After the data is processed, the summary and statistics are relatively simple, df Groupby method summary statistics
Maximum total amount of coupons issued=df1.groupby('Merchant_id')['amount of money'].sum().sort_values(by='amount of money',ascending=False)[:1] print(Maximum total amount of coupons issued) Maximum number of coupons issued=df1.groupby('Merchant_id')[['Coupon_id']].count().sort_values(by='Coupon_id',ascending=False)[:1] print(Maximum number of coupons issued)
Full code
#read in data import sqlalchemy as sac import pandas as pd engine=sac.create_engine('mysql+pymysql://user:password@host:3306/ali_sql') df=pd.read_sql('select * from `ccf_offline_stage1_test_revised`',engine) #Pretreatment, cleaning data df['Receipt date']=pd.to_datetime(df['Date_received']).dt.strftime('%Y-%m') df=df.sort_values(by='Receipt date',ascending=True) df1=df[df['Receipt date']=='2016-07'] s1=df1['Discount_rate'].str.split(':',expand=True) s1=pd.concat([df1,s1],axis=1).rename(columns={1:'amount of money'}) s1['amount of money']=s1['amount of money'].fillna(0) s2=s1.amount of money.astype('int') df1=pd.concat([s1,s2],axis=1) #Summary statistics Maximum total amount of coupons issued=df1.groupby('Merchant_id')['amount of money'].sum().sort_values(by='amount of money',ascending=False)[:1] print(Maximum total amount of coupons issued) Maximum number of coupons issued=df1.groupby('Merchant_id')[['Coupon_id']].count().sort_values(by='Coupon_id',ascending=False)[:1] print(Maximum number of coupons issued)
SQL code
-- Merchants with the largest total amount of coupons SELECT Merchant_id, SUM(SUBSTRING_INDEX(`Discount_rate`,':',-1)) AS discount_amount FROM ccf_offline_stage1_test_revised WHERE Date_received BETWEEN '2016-07-01' AND '2016-07-31' GROUP BY Merchant_id ORDER BY discount_amount DESC LIMIT 1 -- The merchant that issues the most coupons SELECT Merchant_id,COUNT(1) AS cnt FROM ccf_offline_stage1_test_revised WHERE Date_received BETWEEN '2016-07-01' AND '2016-07-31' GROUP BY Merchant_id ORDER BY cnt DESC LIMIT 1
Maximum total amount of coupon
Maximum number of coupons
Generally speaking, SQL is very convenient in dealing with this problem; The problem with python is that the data source is full of object type data. Preprocessing, data type conversion and data cleaning are required first. This part costs more code.
The SQL seems to have no problem. The string data type is in substring_ After the index string is intercepted, summary calculation can be performed directly