Epidemic analysis of the 2019 novel coronavirus (COVID-19 / 2019-nCoV)
Pray LHL
Important notes
Grading weights - analysis document : completion : code quality = 3 : 5 : 2
The analysis document refers to the analysis idea for each problem and the interpretation of the results produced during the data analysis (it should be concise and comprehensive, not padding written for its own sake)
PS: your own code is better than any ghostwritten code. It has nothing to do with elegance or ugliness. Just make today better than yesterday. Keep going!
Because the dataset is large, use head() or tail() to preview the data, so the program does not hang for a long time
=======================
The data for this project comes from DXY (dingxiangyuan). The main purpose of this project is to better understand the epidemic and its development trend through analysis of historical epidemic data, and to provide data support for decision-making in the fight against the epidemic.
For the dataset used in this chapter, see the comments section of my Bilibili video.
I Posing the questions
The following questions are studied from three angles: the whole country, your own province and city, and the epidemic situation abroad:
(1) What is the trend of cumulative confirmed / suspected / cured / dead cases over time in China?
(2) What is the trend of new confirmed / suspected / cured / dead cases in China over time?
(3) What is the trend of new overseas input across the country over time?
(4) What is the situation in your province?
(5) What is the epidemic situation abroad?
(6) Based on your analysis results, what are your suggestions for individuals and society in combating the epidemic?
II Understanding data
Original dataset: AreaInfo.csv. Import the related packages and read the data:
r_hex   = '#dc2624'  # red,       RGB = 220,38,36
dt_hex  = '#2b4750'  # dark teal, RGB = 43,71,80
tl_hex  = '#45a0a2'  # teal,      RGB = 69,160,162
r1_hex  = '#e87a59'  # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'  # teal,      RGB = 125,202,169
g_hex   = '#649E7D'  # green,     RGB = 100,158,125
o_hex   = '#dc8018'  # orange,    RGB = 220,128,24
tn_hex  = '#C89F91'  # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'  # grey-50,   RGB = 108,109,108
bg_hex  = '#4f6268'  # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'  # grey-25,   RGB = 199,204,207
import numpy as np
import pandas as pd
import re
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator

data = pd.read_csv(r'data/AreaInfo.csv')
View and statistics to get a general understanding of the data
data.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | North America | North America | United States | United States of America | United States | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | South America | South America | Brazil | Brazil | Brazil | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Europe | Europe | Britain | United Kingdom | Britain | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Europe | Europe | Russia | Russia | Russia | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | South America | South America | Chile | Chile | Chile | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
III Data cleaning
(1) Basic data processing
Data cleaning mainly includes: subset selection, missing data processing, data format conversion, outlier data processing, etc.
Domestic epidemic data selection (name the final selection china)
-
Select domestic epidemic data
-
For the updateTime column, convert it to a date type, extract the year-month-day, and check the result. (Hint: dt.date)
-
Since the data is updated hourly, there are many duplicate records within a day. Deduplicate and keep only the latest record of each day.
Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)
where df is the DataFrame of the domestic epidemic data you selected
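The two hints above can be sketched on a toy frame (the column names mirror AreaInfo.csv; the numbers are made up). Sorting newest-first before deduplicating is what makes keep='first' retain the latest record of each day:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the AreaInfo layout: two intraday
# updates for one province on the same date.
df = pd.DataFrame({
    'provinceName': ['Hubei', 'Hubei', 'Hubei'],
    'updateTime': ['2020-02-01 09:00:00', '2020-02-01 18:00:00',
                   '2020-02-02 10:00:00'],
    'province_confirmedCount': [7153, 9074, 11177],
})

# Parse timestamps, sort newest-first, then keep only the calendar date.
df['updateTime'] = pd.to_datetime(df['updateTime'])
df = df.sort_values('updateTime', ascending=False)
df['updateTime'] = df['updateTime'].dt.date

# One row per (province, date); keep='first' now retains the latest update.
deduped = df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first')
print(deduped.sort_values('updateTime')['province_confirmedCount'].tolist())
# -> [9074, 11177]
```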
Analysis: select the rows whose countryName is 'China' to form CHINA.
CHINA = data.loc[data['countryName'] == 'China'].copy()  # .copy() avoids SettingWithCopyWarning
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA
Analysis: extract the list of all Chinese cities
cities = list(set(CHINA['cityName']))
Analysis: sort each city's sub-DataFrame by updateTime (the original per-city loop discarded the sorted result, so sort the whole frame instead, newest first within each city)
CHINA = CHINA.sort_values(by=['cityName', 'updateTime'], ascending=[True, False])
Analysis: remove rows with null values
CHINA.dropna(subset=['cityName'], inplace=True)
#CHINA.loc[CHINA['cityName'] == 'Qinhuangdao'].tail(20)
Analysis: format the updateTime column in CHINA
CHINA.updateTime = pd.to_datetime(CHINA.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#CHINA.loc[CHINA['cityName'] == 'Qinhuangdao'].tail(15)
CHINA.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
136 | Asia | Asia | China | China | Shaanxi Province | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | Overseas input | NaN | 0.0 | 72.0 | 0.0 | 65.0 | 0.0 |
137 | Asia | Asia | China | China | Shaanxi Province | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | Xi'an | Xi'an | 610100.0 | 120.0 | 0.0 | 117.0 | 3.0 |
138 | Asia | Asia | China | China | Shaanxi Province | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | Ankang | Ankang | 610900.0 | 26.0 | 0.0 | 26.0 | 0.0 |
139 | Asia | Asia | China | China | Shaanxi Province | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | Hanzhong | Hanzhong | 610700.0 | 26.0 | 0.0 | 26.0 | 0.0 |
140 | Asia | Asia | China | China | Shaanxi Province | Shaanxi | 610000 | 317 | 1.0 | 307 | 3 | 2020-06-23 | Xianyang | Xianyang | 610400.0 | 17.0 | 0.0 | 17.0 | 0.0 |
Analysis: for the daily deduplication, only the first record is kept; since the records were sorted by time earlier, the first record is the latest of the day
Analysis: since the per-city DataFrames will be merged with concat, an initial china frame has to be created first
real = CHINA.loc[CHINA['cityName'] == cities[1]].copy()  # .copy() avoids SettingWithCopyWarning
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real
Analysis: traverse each city's DataFrame and deduplicate its daily data; otherwise only one city's record would survive for any given date
for city in cities[2:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)
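As an aside, the per-city loop plus drop_duplicates above can be collapsed into a single grouped deduplication. This is a sketch on a toy frame (city names and counts are illustrative), not the notebook's original method:

```python
import pandas as pd

# Toy stand-in for CHINA: two cities, with duplicate records for one day.
# Records are assumed ordered newest-first within a day, as in the notebook.
CHINA = pd.DataFrame({
    'cityName': ['Wuhan', 'Wuhan', 'Xiaogan', 'Xiaogan'],
    'updateTime': ['2020-02-01', '2020-02-01', '2020-02-01', '2020-02-02'],
    'city_confirmedCount': [5142, 4109, 628, 749],
})

# Equivalent to looping over cities and deduplicating each sub-frame:
# one row per (cityName, updateTime), keeping the first (latest) record.
china = CHINA.drop_duplicates(subset=['cityName', 'updateTime'], keep='first')
print(china['city_confirmedCount'].tolist())
# -> [5142, 628, 749]
```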
Check the data information: is any data missing, and are the data types correct?
Tip: missing values that will not be processed further can simply be discarded
Analysis: some cities do not report every day. If a day's statistics counted only the cities that reported, cities that have patients but did not report would be ignored and the data would be distorted. The daily data therefore has to be completed for all cities: even a city that does not report must have a record for every day. Interpolation is used to complete part of the data; see the pivoting and analysis section for the method.
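The completion idea described here can be sketched in isolation: reindex one province's sparse series onto the full calendar and forward-fill, so an unreported day inherits the previous day's cumulative value (the frame below is a made-up stand-in):

```python
import pandas as pd

# One province's sparse daily series: 2020-03-03 is missing because
# nothing was reported that day (all values are illustrative).
s = pd.DataFrame({
    'updateTime': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-04']),
    'province_confirmedCount': [100, 103, 107],
}).set_index('updateTime')

# Reindex onto the full calendar and forward-fill: the missing day
# inherits the previous day's cumulative count.
full = pd.date_range(s.index.min(), s.index.max())
completed = s.reindex(full).ffill()
print(completed['province_confirmedCount'].tolist())
# -> [100.0, 103.0, 103.0, 107.0]
```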
china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   continentName            32812 non-null  object
 1   continentEnglishName     32812 non-null  object
 2   countryName              32812 non-null  object
 3   countryEnglishName       32812 non-null  object
 4   provinceName             32812 non-null  object
 5   provinceEnglishName      32812 non-null  object
 6   province_zipCode         32812 non-null  int64
 7   province_confirmedCount  32812 non-null  int64
 8   province_suspectedCount  32812 non-null  float64
 9   province_curedCount      32812 non-null  int64
 10  province_deadCount       32812 non-null  int64
 11  updateTime               32812 non-null  object
 12  cityName                 32812 non-null  object
 13  cityEnglishName          31968 non-null  object
 14  city_zipCode             32502 non-null  float64
 15  city_confirmedCount      32812 non-null  float64
 16  city_suspectedCount      32812 non-null  float64
 17  city_curedCount          32812 non-null  float64
 18  city_deadCount           32812 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
china.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
96106 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Epidemic data selection for your own province (name the final selection myhome)
This step can also be done later
myhome = china.loc[china['provinceName'] == 'Guangdong Province'].copy()  # filter on china itself, not the raw data
myhome.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
205259 | Asia | Asia | China | China | Guangdong Province | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | Foreign visitors to Guangdong | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
206335 | Asia | Asia | China | China | Guangdong Province | Guangdong | 440000 | 207 | 0.0 | 4 | 0 | 2020-01-28 | Heyuan City | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
205239 | Asia | Asia | China | China | Guangdong Province | Guangdong | 440000 | 277 | 0.0 | 5 | 0 | 2020-01-29 | Personnel from other places | NaN | NaN | 5.0 | 0.0 | 0.0 | 0.0 |
252 | Asia | Asia | China | China | Guangdong Province | Guangdong | 440000 | 1634 | 11.0 | 1619 | 8 | 2020-06-23 | Chaozhou | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
2655 | Asia | Asia | China | China | Guangdong Province | Guangdong | 440000 | 1634 | 11.0 | 1614 | 8 | 2020-06-21 | Chaozhou | Chaozhou | 445100.0 | 6.0 | 0.0 | 6.0 | 0.0 |
Selection of overseas epidemic data (name the final selection world)
This step can also be done later
world = data.loc[data['countryName'] != 'China'].copy()
world.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | North America | North America | United States | United States of America | United States | United States of America | 971002 | 2306247 | 0.0 | 640198 | 120351 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | South America | South America | Brazil | Brazil | Brazil | Brazil | 973003 | 1106470 | 0.0 | 549386 | 51271 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Europe | Europe | Britain | United Kingdom | Britain | United Kingdom | 961007 | 305289 | 0.0 | 539 | 42647 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Europe | Europe | Russia | Russia | Russia | Russia | 964006 | 592280 | 0.0 | 344416 | 8206 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | South America | South America | Chile | Chile | Chile | Chile | 973004 | 246963 | 0.0 | 44946 | 4502 | 2020-06-23 10:01:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pivoting and analysis
Analysis: forward-fill (interpolate) china to complete the missing data
china.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
96106 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 254 | 0.0 | 252 | 2 | 2020-04-02 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
125120 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 254 | 0.0 | 250 | 2 | 2020-03-20 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
128762 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 253 | 0.0 | 250 | 2 | 2020-03-18 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
130607 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 253 | 0.0 | 248 | 2 | 2020-03-17 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
131428 | Asia | Asia | China | China | Guangxi Zhuang Autonomous Region | Guangxi | 450000 | 252 | 0.0 | 248 | 2 | 2020-03-16 | Guigang | Guigang | 450800.0 | 8.0 | 0.0 | 8.0 | 0.0 |
Analysis: first create a province list and a date list, and initialize a draft
province = list(set(china['provinceName']))  # every province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))  # cities in each province
date_0 = []
for dt in china.loc[china['provinceName'] == province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))
date_0.sort()
start = china.loc[china['provinceName'] == province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] == province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
aid_frame = pd.DataFrame({'updateTime': dates, 'provinceName': [province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] == province[0]], aid_frame], join='outer').sort_values('updateTime')
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)
Analysis: complete the missing dates and fill each gap with the previous day's data. Some provinces stopped reporting after late April because they had no new patients, so their series can only be completed up to late April; later values gradually lose authenticity
Analysis: the dates are formatted in the same pass
for p in range(1, len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] == province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] == province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] == province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({'updateTime': dates, 'provinceName': [province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
    X = china.loc[china['provinceName'] == province[p]].reset_index(drop=True)  # reset_index returns a new frame; assign it
    Y = aid_frame.reset_index(drop=True)
    draft_d = pd.concat([X, Y], join='outer').sort_values('updateTime')
    draft = pd.concat([draft, draft_d])
draft.province_confirmedCount.fillna(method="ffill", inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)
#draft['updateTime'] = draft['updateTime'].strftime('%Y-%m-%d')
#draft['updateTime'] = pd.to_datetime(draft['updateTime'], format="%Y-%m-%d", errors='coerce').dt.date
china = draft
china.head()
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
208226 | Asia | Asia | China | China | Tianjin | Tianjin | 120000.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2020-01-26 | From other places to Tianjin | NaN | NaN | 2.0 | 0.0 | 0.0 | 0.0 |
208224 | Asia | Asia | China | China | Tianjin | Tianjin | 120000.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2020-01-26 | Hebei District | Hebei District | 120105.0 | 5.0 | 0.0 | 0.0 | 0.0 |
208228 | Asia | Asia | China | China | Tianjin | Tianjin | 120000.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2020-01-26 | Peace zone | Heping District | 120101.0 | 1.0 | 0.0 | 0.0 | 0.0 |
208227 | Asia | Asia | China | China | Tianjin | Tianjin | 120000.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2020-01-26 | Binhai New Area | Binhai New Area | 120116.0 | 1.0 | 0.0 | 0.0 | 0.0 |
208230 | Asia | Asia | China | China | Tianjin | Tianjin | 120000.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2020-01-26 | Xiqing District | Xiqing District | 120111.0 | 1.0 | 0.0 | 0.0 | 0.0 |
IV Data analysis and visualization
During data analysis and visualization, select the variables each question needs and build a new DataFrame for the analysis and visual display; this keeps the data tidy and the organization clear.
Basic analysis
For basic analysis, only numpy, pandas and matplotlib libraries are allowed.
The results may be displayed in several coordinate systems within one figure, or spread across several figures.
Choose the chart type (line chart, pie chart, bar chart, scatter chart, etc.) according to the analytical purpose. If you are short of ideas, browse Baidu's epidemic map or other epidemic-analysis sites for inspiration.
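For the layout choice mentioned above, a minimal sketch of both options (several series in one axes versus separate axes) could look like this; the numbers are invented and the Agg backend is used so the figure renders off-screen:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative daily values (not real epidemic data).
days = list(range(5))
confirmed = [41, 291, 830, 1287, 1975]
cured = [0, 5, 25, 38, 49]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: two series in one coordinate system.
ax1.plot(days, confirmed, label='confirmed')
ax1.plot(days, cured, label='cured')
ax1.set_title('One axes, two series')
ax1.legend()

# Option 2: a separate axes per metric.
ax2.bar(days, confirmed)
ax2.set_title('Separate axes per metric')

fig.savefig('layout_demo.png')
```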
(1) What is the trend of cumulative confirmed / suspected / cured / dead cases over time in China?
Analysis: to obtain the nationwide cumulative trend over time, the daily national cumulative confirmed counts must first be integrated into date_confirmed
Analysis: to do that, first extract each province's latest cumulative confirmed count for the day, then sum the provincial figures into a one-row DataFrame, and
splice each day onto date_confirmed in a for loop
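The reduction described here (latest per-province figure per day, then summed nationally) can also be expressed with groupby; a sketch on a toy frame, with all values invented:

```python
import pandas as pd

# Toy china frame: two provinces, two days, with duplicated city rows
# that carry the same province-level cumulative count.
china = pd.DataFrame({
    'provinceName': ['Hubei', 'Hubei', 'Guangdong', 'Hubei', 'Guangdong'],
    'updateTime': ['2020-02-01', '2020-02-01', '2020-02-01',
                   '2020-02-02', '2020-02-02'],
    'province_confirmedCount': [9074, 9074, 520, 11177, 604],
})

# One province-level figure per (date, province), then sum across
# provinces for the national daily total -- the same reduction the
# notebook's nested loops perform row by row.
daily = (china.groupby(['updateTime', 'provinceName'])['province_confirmedCount']
              .first()
              .groupby(level='updateTime')
              .sum())
print(daily.tolist())
# -> [9594, 11781]
```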
date = list(set(china['updateTime']))
date.sort()
date
[datetime.date(2020, 1, 24),
 datetime.date(2020, 1, 25),
 ...
 datetime.date(2020, 6, 23)]
(output abridged: one entry per day from 2020-01-24 through 2020-06-23)
china = china.set_index('provinceName')
china = china.reset_index()  # the round-trip just moves provinceName back to the first column and restores a clean integer index
Analysis: loop over provinces and dates to obtain each province's daily cumulative confirmed count. Since the results must be spliced together, first initialize date_confirmed
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_confirmedCount'])  # cumulative confirmed count of each province that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_confirmed = pd.DataFrame(list_d, index=list_e)
date_confirmed.index.name = "date"
date_confirmed.columns = ["China_confirmedCount"]
date_confirmed
China_confirmedCount | |
---|---|
date | |
2020-01-24 | 1956.0 |
Analysis: traverse the remaining dates and splice each day's national total (summed over the provinces) onto date_confirmed
l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l += 1
    for p in range(0, 32):
        try:
            con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['province_confirmedCount'])  # cumulative confirmed count of each province that day
        except:
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name = "date"
    confirmed.columns = ["China_confirmedCount"]
    date_confirmed = pd.concat([date_confirmed, confirmed], sort=False)
date_confirmed
China_confirmedCount | |
---|---|
date | |
2020-01-24 | 1956.0 |
2020-01-25 | 2253.0 |
2020-01-26 | 1956.0 |
2020-01-27 | 2825.0 |
2020-01-28 | 4589.0 |
... | ... |
2020-06-17 | 8106.0 |
2020-06-18 | 6862.0 |
2020-06-19 | 6894.0 |
2020-06-20 | 6921.0 |
2020-06-21 | 6157.0 |
150 rows × 1 columns
Analysis: removing null and incomplete values
date_confirmed.dropna(subset=['China_confirmedCount'], inplace=True)
date_confirmed.tail(20)
China_confirmedCount | |
---|---|
date | |
2020-06-02 | 78782.0 |
2020-06-03 | 78780.0 |
2020-06-04 | 76903.0 |
2020-06-05 | 76908.0 |
2020-06-06 | 8777.0 |
2020-06-07 | 8782.0 |
2020-06-08 | 8628.0 |
2020-06-09 | 8634.0 |
2020-06-10 | 8638.0 |
2020-06-11 | 8649.0 |
2020-06-12 | 8658.0 |
2020-06-13 | 8665.0 |
2020-06-14 | 8733.0 |
2020-06-15 | 8772.0 |
2020-06-16 | 8055.0 |
2020-06-17 | 8106.0 |
2020-06-18 | 6862.0 |
2020-06-19 | 6894.0 |
2020-06-20 | 6921.0 |
2020-06-21 | 6157.0 |
Analysis: from the end of April to the end of May the data is already distorted because too many provinces stopped reporting (some provinces had no new patients after late April), and from June 6, 2020 it loses its authenticity entirely. I therefore deleted the data after June 6, 2020
date_confirmed = date_confirmed.drop(['2020-06-06', '2020-06-07', '2020-06-08', '2020-06-09',
                                      '2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13',
                                      '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17',
                                      '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21'])
Analysis: construct a splicing helper function
def data_frame(self, china, element):
    # 'self' is simply the accumulator DataFrame that each day's total is spliced onto
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0, 32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
                list_p.append(con_0[element])
            except:
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name = "date"
        link.columns = ["China"]
        self = pd.concat([self, link], sort=False)
    self.dropna(subset=['China'], inplace=True)
    self = self.drop(['2020-06-06', '2020-06-07', '2020-06-08', '2020-06-09',
                      '2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13',
                      '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17',
                      '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21'])
    return self
Analysis: initializing individual variables
# Cumulative cured: date_cured
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_cured = pd.DataFrame(list_d, index=list_e)
date_cured.index.name = "date"
date_cured.columns = ["China"]

# Cumulative deaths: date_dead
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_dead = pd.DataFrame(list_d, index=list_e)
date_dead.index.name = "date"
date_dead.columns = ["China"]
# Cumulative confirmed patients: date_confirmed
plt.rcParams['font.sans-serif'] = ['SimHei']  # switch font so Chinese characters can be displayed
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_confirmed.index
y = date_confirmed.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_title('Cumulative confirmed patients', fontdict={'color': 'black', 'size': 24})
ax.set_xticks(range(0, len(x), 30))
# Cumulative cured patients: date_cured
date_cured = data_frame(date_cured, china, 'province_curedCount')
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_cured.index
y = date_cured.values
ax.set_title('Cumulative cured patients', fontdict={'color': 'black', 'size': 24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 30))
Analysis: cumulative suspected-case counts cannot be reconstructed by completing the data, so they are omitted.
# Cumulative patient deaths: date_dead
date_dead = data_frame(date_dead, china, 'province_deadCount')
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_dead.index
y = date_dead.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
x_major_locator = MultipleLocator(12)
ax = plt.gca()
ax.set_title('Cumulative dead patients', fontdict={'color': 'black', 'size': 24})
ax.xaxis.set_major_locator(x_major_locator)
ax.set_xticks(range(0, len(x), 30))
Analysis: the epidemic broke out in early January, slowed at the end of February, and leveled off by the end of April. The cured count rose sharply from early February and flattened by the end of March. Deaths climbed from late January and flattened by late February; around mid-April the death count jumped because of a statistical revision, then flattened again.
Analysis and summary: from late April to late May, the confirmed and cured series are distorted because too many provinces stopped reporting (some provinces have had no new patients since then). Missing values should be completed wherever possible, and the closer to the tail of the series, the greater the distortion. The death series was completed almost without error or omission.
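The distortion described above can be detected programmatically: each national total is summed over whichever provinces report on that date, so a date where fewer provinces report than expected is suspect. A minimal sketch with hypothetical toy data (the real frame uses the same `updateTime` / `provinceName` columns):

```python
import pandas as pd

# Toy reporting log: per-province snapshots per date (hypothetical values).
df = pd.DataFrame({
    'updateTime': ['2020-02-01'] * 3 + ['2020-02-02'] * 2,
    'provinceName': ['Hubei', 'Guangdong', 'Beijing', 'Hubei', 'Beijing'],
    'province_confirmedCount': [7153, 535, 183, 9074, 191],
})

# Count distinct provinces per date; dates below the expected count are suspect.
expected = df['provinceName'].nunique()
reported = df.groupby('updateTime')['provinceName'].nunique()
suspect_dates = reported[reported < expected].index.tolist()
print(suspect_dates)  # ['2020-02-02'] -- only 2 of 3 provinces reported
```

Totals on the flagged dates could then be dropped or forward-filled instead of being trusted as-is.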
(2) What is the trend of new confirmed / suspected / cured / dead cases in China over time?
Analysis: the daily new confirmed / cured / dead counts must be computed from china by taking a day-over-day diff for each province.
Analysis: first initialize each series, then adapt the splicing function above into one suitable for this problem.
# New confirmed cases: date_new_confirmed
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_confirmedCount'])  # cumulative confirmed cases per province that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_confirmed = pd.DataFrame(list_d, index=list_e)
date_new_confirmed.index.name = "date"
date_new_confirmed.columns = ["China"]
date_new_confirmed

# New cured cases: date_new_cured
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_cured = pd.DataFrame(list_d, index=list_e)
date_new_cured.index.name = "date"
date_new_cured.columns = ["China"]

# New deaths: date_new_dead
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_dead = pd.DataFrame(list_d, index=list_e)
date_new_dead.index.name = "date"
date_new_dead.columns = ["China"]
Analysis: construct the splicing function.
def data_new_frame(self, china, element):
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0, 32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] == province[p]].iloc[[0]].iloc[0]
                list_p.append(con_0[element])
            except:
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name = "date"
        link.columns = ["China"]
        self = pd.concat([self, link], sort=False)
    self.dropna(subset=['China'], inplace=True)
    return self
Analysis: complete the data and remove rows that cover only a subset of provinces.
d = data_new_frame(date_new_confirmed, china, 'province_confirmedCount')
# Repeatedly remove rows where the cumulative count drops (a province went missing)
for i in range(len(d)):
    dr = []
    for a, b in zip(range(0, len(d) - 1), range(1, len(d) - 2)):
        if d.iloc[b].iloc[0] < d.iloc[a].iloc[0]:
            dr.append(d.iloc[b].iloc[0])
    d = d[~d['China'].isin(dr)]
Analysis: perform difference operation
d['China'] = d['China'].diff()
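`diff()` converts a cumulative series into daily increments; the first entry becomes NaN because there is no previous day to subtract. A toy check:

```python
import pandas as pd

# Cumulative counts over four days (toy data).
cumulative = pd.Series([10, 15, 15, 22], index=['d1', 'd2', 'd3', 'd4'])

# Day-over-day differences: NaN on the first day, then the daily increments.
daily = cumulative.diff()
print(daily.tolist())  # [nan, 5.0, 0.0, 7.0]
```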
Analysis: remove two dates with missing provinces
d.drop(['2020-06-20','2020-06-21'],inplace=True)
Analysis: make a line chart to show the time trend
# Newly confirmed patients
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = d.index
y = d.values
ax.set_title('Newly diagnosed patients', fontdict={'color': 'black', 'size': 24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))
Analysis: use the initialized data to build the date_new_cured dataframe, then plot a line chart of the trend over time.
cu = data_new_frame(date_new_cured, china, 'province_curedCount')
for i in range(len(cu)):
    dr = []
    for a, b in zip(range(0, len(cu) - 1), range(1, len(cu) - 2)):
        if cu.iloc[b].iloc[0] < cu.iloc[a].iloc[0]:
            dr.append(cu.iloc[b].iloc[0])
    cu = cu[~cu['China'].isin(dr)]
cu['China'] = cu['China'].diff()
cu.drop(['2020-06-20', '2020-06-21'], inplace=True)

# New cured patients: date_new_cured
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = cu.index
y = cu.values
ax.set_title('New cured patients', fontdict={'color': 'black', 'size': 24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))
Analysis: use the initialized data to build the date_new_dead dataframe, then plot a line chart of the trend over time.
de = data_new_frame(date_new_dead, china, 'province_deadCount')
for i in range(len(de)):
    dr = []
    for a, b in zip(range(0, len(de) - 1), range(1, len(de) - 2)):
        if de.iloc[b].iloc[0] < de.iloc[a].iloc[0]:
            dr.append(de.iloc[b].iloc[0])
    de = de[~de['China'].isin(dr)]
de['China'] = de['China'].diff()
de.drop(['2020-06-21'], inplace=True)

# New dead patients: date_new_dead
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = de.index
y = de.values
ax.set_title('New dead patients', fontdict={'color': 'black', 'size': 24})
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 10))
Analysis: the number of newly confirmed patients began to rise at the end of January, peaked around February 14, then declined and leveled off.
Analysis: the number of newly cured patients began to rise at the end of January, peaked around March 2, then declined and leveled off from early April.
Analysis: daily deaths rose from late January to a peak in February, flattened from early March, and spiked around April 17 due to a statistical revision before falling back.
(3) What is the trend of new overseas input across the country over time?
Analysis: the daily number of new overseas-imported cases must be computed from CHINA by taking day-over-day differences.
Analysis: first extract the overseas-input records from CHINA, then complete the time series and take the difference.
imported = CHINA.loc[CHINA['cityName'] == 'Overseas input']
imported.updateTime = pd.to_datetime(imported.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
imported
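Assigning through an attribute on a `.loc` slice, as above, can raise pandas' SettingWithCopyWarning because the slice may be a view of the original frame. Taking an explicit `.copy()` of the slice before mutating avoids the ambiguity; a small sketch with toy data:

```python
import pandas as pd

# Toy stand-in for the CHINA frame (hypothetical rows).
CHINA = pd.DataFrame({
    'cityName': ['Overseas input', 'Wuhan'],
    'updateTime': ['2020-03-24 10:00:00', '2020-03-24 09:00:00'],
})

# An explicit copy() makes the slice independent of CHINA, so the column
# assignment below cannot trigger SettingWithCopyWarning.
imported = CHINA.loc[CHINA['cityName'] == 'Overseas input'].copy()
imported['updateTime'] = pd.to_datetime(
    imported['updateTime'], errors='coerce').dt.date
print(imported['updateTime'].iloc[0])  # 2020-03-24
```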
(output: 607 rows × 19 columns — 'Overseas input' records per province, spanning 2020-03-24 to 2020-06-23, with province- and city-level cumulative counts)
Analysis: complete the dates that are missing for each province.
for i in range(0, len(province)):
    list_j_d = []
    date_b = []
    for dt in imported.loc[imported['provinceName'] == province[i]]['updateTime']:
        date_b.append(str(dt))
    list_j_d = list(set(date_b))
    list_j_d.sort()
    try:
        start = imported.loc[imported['provinceName'] == province[i]]['updateTime'].min()
        end = imported.loc[imported['provinceName'] == province[i]]['updateTime'].max()
        dates_b = pd.date_range(start=str(start), end=str(end))
        aid_frame_b = pd.DataFrame({'updateTime': dates_b,
                                    'provinceName': [province[i]] * len(dates_b)})
        aid_frame_b.updateTime = pd.to_datetime(aid_frame_b.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
        draft_b = pd.concat([imported.loc[imported['provinceName'] == province[i]], aid_frame_b],
                            join='outer').sort_values('updateTime')
        draft_b.city_confirmedCount.fillna(method="ffill", inplace=True)
        draft_b.city_suspectedCount.fillna(method="ffill", inplace=True)
        draft_b.city_curedCount.fillna(method="ffill", inplace=True)
        draft_b.city_deadCount.fillna(method="ffill", inplace=True)
        draft_b.loc[draft_b['provinceName'] == province[i]].fillna(0, inplace=True, limit=1)
        draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'] = \
            draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'].diff()
        draft_b.dropna(subset=['city_confirmedCount', 'city_suspectedCount',
                               'city_curedCount', 'city_deadCount'], inplace=True)
        imported = pd.concat([imported, draft_b], join='outer').sort_values('updateTime')
    except:
        continue
imported
(output: imported after completion, 2524 rows × 19 columns; the appended helper rows carry NaN in the descriptive columns and forward-filled city-level counts)
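The completion loop above builds a helper frame per province and forward-fills it. With the data reshaped to one cumulative value per day, the same idea can be sketched more compactly with `reindex` over a complete `pd.date_range` (toy data, assuming one cumulative count per day):

```python
import pandas as pd

# One province's cumulative counts, with 2020-03-26 missing (toy data).
s = pd.Series(
    [5.0, 9.0],
    index=pd.to_datetime(['2020-03-25', '2020-03-27']),
)

# Reindex to a complete daily range, forward-fill the gap,
# then diff to recover daily new cases.
full = pd.date_range(s.index.min(), s.index.max())
daily_new = s.reindex(full).ffill().diff()
print(daily_new.tolist())  # [nan, 0.0, 4.0]
```

The forward-filled gap contributes a daily increment of zero, which matches the completion strategy used above.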
Analysis: copy() the data first so that processing errors cannot corrupt the original.
draft_i = imported.copy()
Analysis: initialize with a single province's data to verify that the method works.
real_s = imported.loc[imported['provinceName'] == province[0]]
real_s.drop_duplicates(subset='updateTime', keep='first', inplace=True)
draft_i = real_s
for p in province:
    real_data = imported.loc[imported['provinceName'] == p]
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    draft_i = pd.concat([real_data, draft_i], sort=False)
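`drop_duplicates(subset='updateTime', keep='first')` keeps only the first snapshot encountered per date, which collapses the multiple intra-day updates into one row per province per day. A toy check:

```python
import pandas as pd

# Two snapshots on 2020-03-24 and one on 2020-03-25 (toy data).
snaps = pd.DataFrame({
    'updateTime': ['2020-03-24', '2020-03-24', '2020-03-25'],
    'city_confirmedCount': [9.0, 6.0, 10.0],
})

# keep='first' retains the first row seen for each date and drops the rest.
latest = snaps.drop_duplicates(subset='updateTime', keep='first')
print(latest['city_confirmedCount'].tolist())  # [9.0, 10.0]
```

Which snapshot survives therefore depends on the sort order applied beforehand, so the frame must be sorted before deduplicating.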
Analysis: the method is confirmed correct; apply the same treatment to the remaining provinces.
imported = draft_i
imported = imported.set_index('provinceName')
imported = imported.reset_index()
Analysis: consolidate the data across provinces.
list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = imported.loc[imported['updateTime'] == date[2]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['city_confirmedCount'])  # cumulative imported confirmed cases per province that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_foreign_confirmed = pd.DataFrame(list_d, index=list_e)
date_new_foreign_confirmed.index.name = "date"
date_new_foreign_confirmed.columns = ["imported_confirmedCount"]
date_new_foreign_confirmed
(output: date_new_foreign_confirmed initialized with a single row — 2020-01-24 | 0)
l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l += 1
    for p in range(0, 32):
        try:
            con_0 = imported.loc[imported['updateTime'] == date[l]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['city_confirmedCount'])  # cumulative imported confirmed cases per province that day
        except:
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name = "date"
    confirmed.columns = ["imported_confirmedCount"]
    date_new_foreign_confirmed = pd.concat([date_new_foreign_confirmed, confirmed], sort=False)
date_new_foreign_confirmed
(output: date_new_foreign_confirmed, 150 daily rows from 2020-01-24 (0.0) to 2020-06-21 (775.0); cumulative imported counts hover around 800 by mid-June)
# New overseas input
fig = plt.figure(figsize=(16, 4), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_new_foreign_confirmed.index
y = date_new_foreign_confirmed.values
plot = ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-', label='date_new_foreign_confirmed')
ax.set_xticks(range(0, len(x), 10))
plt.xlabel('date', fontsize=20)
plt.ylabel('Number of persons', fontsize=20)
plt.title('COVID-19-New overseas input', fontsize=30)
ax.legend(loc=0, frameon=True)
Analysis and summary: imported cases rose sharply from the end of March; the growth flattened in early May and slowed further in early June.
(4) What is the situation in your province?
Analysis: first extract all timestamps for Guangdong Province, convert them to strings, then sort them.
# Collect the distinct timestamps, convert each to a string, then sort
m_dates = sorted(str(d) for d in set(myhome['updateTime']))
myhome = myhome.set_index('provinceName')
myhome = myhome.reset_index()
Analysis: iterate over the dates for my city's province to build the corresponding dataframes.
# Cumulative confirmed cases in Guangdong Province
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_confirmedCount'])
    except:
        list_g.append(0)
        continue
g_date_confirmed = pd.DataFrame(list_g, index=m_dates)
g_date_confirmed.index.name = "date"
g_date_confirmed.columns = ["g_confirmed"]
g_date_confirmed = g_date_confirmed[~g_date_confirmed['g_confirmed'].isin([0])]

# Cumulative cured cases in Guangdong Province
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_curedCount'])
    except:
        list_g.append(0)
        continue
g_date_cured = pd.DataFrame(list_g, index=m_dates)
g_date_cured.index.name = "date"
g_date_cured.columns = ["g_cured"]
g_date_cured = g_date_cured[~g_date_cured['g_cured'].isin([0])]

# Cumulative death toll in Guangdong Province
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_deadCount'])
    except:
        list_g.append(0)
        continue
g_date_dead = pd.DataFrame(list_g, index=m_dates)
g_date_dead.index.name = "date"
g_date_dead.columns = ["g_dead"]
g_date_dead = g_date_dead[~g_date_dead['g_dead'].isin([0])]
Analysis: make a line chart to show the trend of epidemic situation over time
# Cumulative confirmed and cured cases in Guangdong Province
plt.rcParams['font.sans-serif'] = ['SimHei']
x = g_date_confirmed.index
y1 = g_date_confirmed.values
y2 = g_date_cured.values
plt.figure(figsize=(20, 10), dpi=80)
plt.plot(x, y1, color=r_hex, label='confirmed')
plt.plot(x, y2, color=g_hex, label='cured')
x_major_locator = MultipleLocator(12)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.title('COVID-19 - Guangdong Province', fontsize=30)
plt.xlabel('date', fontsize=20)
plt.ylabel('Number of persons', fontsize=20)
plt.legend(loc=1, bbox_to_anchor=(1.00, 0.90), bbox_transform=ax.transAxes)
# Cumulative death toll in Guangdong Province
plt.rcParams['font.sans-serif'] = ['SimHei']
fig = plt.figure(figsize=(16, 4), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = g_date_dead.index
y = g_date_dead.values
plot = ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-', label='dead')
ax.set_xticks(range(0, len(x), 10))
plt.xlabel('date', fontsize=20)
plt.ylabel('Number of persons', fontsize=20)
plt.title('COVID-19-Guangdong Province', fontsize=30)
ax.legend(loc=0, frameon=True)
Analysis: the data completion for Guangdong Province was very successful, with high fidelity.
Analysis: according to the line chart, infections in Guangdong Province rose sharply from late January and leveled off by mid-February. From early March, wider testing and statistical adjustments produced a small short-term rise in confirmed cases. Cures rose sharply from early February; with few new infections since early June, the cured count has also flattened. There have been no new deaths in Guangdong Province since early March.
(5) What is the epidemic situation abroad?
Analysis: remove null columns from the data.
world.dropna(axis=1, how='any', inplace=True)
#world.set_index('updateTime')
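With `axis=1` and `how='any'`, `dropna` removes every column that contains at least one null value, rather than dropping rows. A toy check:

```python
import pandas as pd

# Column 'b' has a missing value; 'a' and 'c' are complete (toy data).
df = pd.DataFrame({'a': [1, 2], 'b': [None, 3], 'c': [4, 5]})

# axis=1 targets columns; how='any' drops a column if any entry is NaN.
df = df.dropna(axis=1, how='any')
print(list(df.columns))  # ['a', 'c']
```

Note that this discards a whole column for a single missing entry, which is why the earlier per-row completion steps were done first.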
Analysis: create a country list country and a date list date_y.
country = list(set(world['provinceName']))
date_y = []
for dt in world.loc[world['provinceName'] == country[0]]['updateTime']:
    date_y.append(str(dt))
date_y = list(set(date_y))  # fixed: was set(date_0), an undefined name
date_y.sort()
Analysis: iterate over the country list to sort and deduplicate updateTime in world.
for c in country:
    world.loc[world['provinceName'] == c].sort_values(by='updateTime')
world.dropna(subset=['provinceName'], inplace=True)
world.updateTime = pd.to_datetime(world.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
Analysis: build world_confirmed from the province_confirmedCount of the top 15 countries, and complete the data.
world_confirmed = world.loc[world['provinceName'] == world.head(15)['provinceName'][0]].pivot_table(
    index='updateTime', columns='provinceName', values='province_confirmedCount', aggfunc=np.mean)
for i in world.head(15)['provinceName'][1:]:
    draft_c = world.loc[world['provinceName'] == i].pivot_table(
        index='updateTime', columns='provinceName', values='province_confirmedCount', aggfunc=np.mean)
    world_confirmed = pd.merge(world_confirmed, draft_c, on='updateTime', how='outer', sort=True)
world_confirmed.fillna(0, inplace=True, limit=1)
world_confirmed.fillna(method="ffill", inplace=True)
world_confirmed
(output: world_confirmed, 144 daily rows × 15 countries from 2020-01-27 to 2020-06-23; by 2020-06-23 the United States leads with ≈2.30 million cumulative confirmed, followed by Brazil at ≈1.11 million)
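`pivot_table` with `aggfunc=np.mean` averages multiple same-day snapshots into a single value per (date, country) cell, which is why the table contains fractional case counts. A toy check:

```python
import numpy as np
import pandas as pd

# Three snapshots for one country: two on 06-19, one on 06-20 (toy data).
world = pd.DataFrame({
    'provinceName': ['United States'] * 3,
    'updateTime': ['2020-06-19', '2020-06-19', '2020-06-20'],
    'province_confirmedCount': [100, 110, 130],
})

# Same-day snapshots are averaged into one value per (date, country).
table = world.pivot_table(index='updateTime', columns='provinceName',
                          values='province_confirmedCount', aggfunc=np.mean)
print(table['United States'].tolist())  # [105.0, 130.0]
```

Using the last snapshot of each day (e.g. `aggfunc='last'` after sorting) would avoid fractional counts, at the cost of being more sensitive to a single bad record.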
Analysis: the epidemic situation of the top 15 countries over time
fig = plt.figure(figsize=(16, 10))
plt.plot(world_confirmed)
plt.legend(world_confirmed.columns)
plt.title('Cumulative number of confirmed cases in the top 15 countries', fontsize=20)
plt.xlabel('date', fontsize=20)
plt.ylabel('Number of persons/million', fontsize=20);
Analysis: the completion of the foreign data was fairly successful and reasonably faithful.
Analysis: the number of newly confirmed patients abroad has risen sharply since the end of March. The epidemic in the top four countries is not under control, and the overseas trend is a continuing increase in confirmed cases.
(6) Based on your analysis results, what are your suggestions for individuals and society in combating the epidemic?
According to the domestic line charts, the epidemic has flattened since the end of April. Abroad, by contrast, the outbreak took off in early April and shows no sign of flattening so far.
From the perspective of imported cases, we must stay alert to cases imported from overseas and prevent a renewed domestic spread of COVID-19; vigilance cannot be relaxed.
For individuals: avoid densely populated areas, wear a mask when going out, and disinfect thoroughly at home.
For society: in well-connected, densely populated areas, virus testing and site disinfection should be made widely available to cut off the virus's transmission routes and preserve China's epidemic-control gains.
Additional analysis (optional)
For additional analysis, there is no limit to the libraries used. For example, seaborn, pyecharts and other libraries can be used.
Limited by personal ability, this was not attempted.