Epidemic analysis of the 2019 novel coronavirus (COVID-19/2019-nCoV) with Python


Pray LHL

Important notes

Grading weights are analysis document : completion : code quality = 3 : 5 : 2

Here, the analysis document refers to the analysis approach for each problem and the interpretation of the results produced during data analysis (it should be concise and comprehensive, not padding written for its own sake).

PS: your own code is better than any ghostwritten code, regardless of how elegant it looks. Just aim to be better today than yesterday. Keep going!

Because the dataset is large, use head() or tail() to preview the data; this prevents the program from appearing unresponsive for a long time.

=======================

The data for this project comes from Dingxiangyuan (DXY). The main purpose is to better understand the epidemic and its development trend through analysis of historical epidemic data, and to provide data support for decision-making in the fight against the epidemic.

For the datasets used in this chapter, see the comments section of my Bilibili video.

I Posing the questions

The following questions are studied from three perspectives: the whole of China, your own province and city, and the situation abroad:

(1) What is the trend of cumulative confirmed / suspected / cured / dead cases over time in China?

(2) What is the trend of new confirmed / suspected / cured / dead cases in China over time?

(3) What is the trend of newly imported cases from overseas across the country over time?

(4) What is the situation in your province?

(5) What is the epidemic situation abroad?

(6) Based on your analysis results, what are your suggestions for individuals and society in combating the epidemic?

II Understanding data

Original dataset: AreaInfo.csv. Import the related packages and read the data:

import numpy as np
import pandas as pd
import re
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator

# Color palette used throughout the charts
r_hex = '#dc2624'     # red,       RGB = 220,38,36
dt_hex = '#2b4750'    # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2'    # teal,      RGB = 69,160,162
r1_hex = '#e87a59'    # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'   # teal,      RGB = 125,202,169
g_hex = '#649E7D'     # green,     RGB = 100,158,125
o_hex = '#dc8018'     # orange,    RGB = 220,128,24
tn_hex = '#C89F91'    # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'   # grey-50,   RGB = 108,109,108
bg_hex = '#4f6268'    # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'   # grey-25,   RGB = 199,204,207


data = pd.read_csv(r'data/AreaInfo.csv')

View the data and basic statistics to get a general understanding of it:

data.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
0 North America North America United States United States of America United States United States of America 971002 2306247 0.0 640198 120351 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
1 South America South America Brazil Brazil Brazil Brazil 973003 1106470 0.0 549386 51271 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
2 Europe Europe Britain United Kingdom Britain United Kingdom 961007 305289 0.0 539 42647 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
3 Europe Europe Russia Russia Russia Russia 964006 592280 0.0 344416 8206 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
4 South America South America Chile Chile Chile Chile 973004 246963 0.0 44946 4502 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN

III Data cleaning

(1) Basic data processing

Data cleaning mainly includes: subset selection, missing data processing, data format conversion, outlier data processing, etc.

Domestic epidemic data selection (the final selected data is named china)

  1. Select domestic epidemic data

  2. Convert the updateTime column to a date type, extract the year-month-day, and check the result. (Hint: dt.date)

  3. Since the data is updated hourly, there are many duplicate records within a day. De-duplicate so that only the latest record within each day is kept.

Tip: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

where df is the DataFrame of the domestic epidemic data you selected
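As a minimal sketch of that tip (the three-row sample frame here is hypothetical), the date conversion and de-duplication can be combined; sorting newest-first while the full timestamp is still available makes keep='first' retain the latest record of each day:

```python
import pandas as pd

# Hypothetical sample: two intraday updates for one province on 2020-02-01
df = pd.DataFrame({
    'provinceName': ['Hubei', 'Hubei', 'Hubei'],
    'updateTime': ['2020-02-01 08:00:00', '2020-02-01 20:00:00',
                   '2020-02-02 09:00:00'],
    'province_confirmedCount': [7153, 9074, 11177],
})

# Sort newest first while the full timestamp is still available,
# then reduce updateTime to the date only (year-month-day)
df['updateTime'] = pd.to_datetime(df['updateTime'])
df = df.sort_values('updateTime', ascending=False)
df['updateTime'] = df['updateTime'].dt.date

# keep='first' now retains the latest record of each province-day
df = df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first')
print(sorted(df['province_confirmedCount']))  # [9074, 11177]
```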

Analysis: select the rows whose countryName is 'China' to form CHINA.

CHINA = data.loc[data['countryName'] == 'China'].copy()  # .copy() avoids SettingWithCopyWarning
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA
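pandas emits SettingWithCopyWarning when a slice of a DataFrame is modified in place; taking an explicit copy avoids it. A minimal sketch with a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame standing in for the real dataset
data = pd.DataFrame({
    'countryName': ['China', 'China', 'Brazil'],
    'cityName': ['Wuhan', None, None],
})

# .copy() turns the slice into an independent DataFrame, so the
# in-place dropna no longer touches (or warns about) the original
CHINA = data.loc[data['countryName'] == 'China'].copy()
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
print(CHINA['cityName'].tolist())  # ['Wuhan']
```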

Analysis: extract the list of all Chinese cities

cities = list(set(CHINA['cityName']))

Analysis: sort each city's records so that the newest update comes first; the later drop_duplicates(keep='first') then retains the latest record of each day

CHINA = CHINA.sort_values(by=['cityName', 'updateTime'], ascending=[True, False])

Analysis: remove rows with null values

CHINA.dropna(subset=['cityName'],inplace=True)
#CHINA.loc[CHINA['cityName'] == 'Qinhuangdao'].tail(20)

Analysis: format the updateTime column in CHINA

CHINA.updateTime = pd.to_datetime(CHINA.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#CHINA.loc[CHINA['cityName'] == 'Qinhuangdao'].tail(15)
CHINA.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
136 Asia Asia China China Shaanxi Province Shaanxi 610000 317 1.0 307 3 2020-06-23 Overseas input NaN 0.0 72.0 0.0 65.0 0.0
137 Asia Asia China China Shaanxi Province Shaanxi 610000 317 1.0 307 3 2020-06-23 Xi'an Xi'an 610100.0 120.0 0.0 117.0 3.0
138 Asia Asia China China Shaanxi Province Shaanxi 610000 317 1.0 307 3 2020-06-23 Ankang Ankang 610900.0 26.0 0.0 26.0 0.0
139 Asia Asia China China Shaanxi Province Shaanxi 610000 317 1.0 307 3 2020-06-23 Hanzhong Hanzhong 610700.0 26.0 0.0 26.0 0.0
140 Asia Asia China China Shaanxi Province Shaanxi 610000 317 1.0 307 3 2020-06-23 Xianyang Xianyang 610400.0 17.0 0.0 17.0 0.0

Analysis: for the daily de-duplication only the first record is kept; the data was sorted by time above, so the first record is the latest of the day
Analysis: since the per-city DataFrames will be merged with concat, create an initial china first

real = CHINA.loc[CHINA['cityName'] == cities[0]].copy()
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real

Analysis: iterate over each city's DataFrame to de-duplicate its daily data; otherwise only one city's record would survive for any given date

for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china], sort=False)

Check the data information: is anything missing, and are the data types correct?

Tip: missing values that will not be used can simply be dropped

Analysis: some cities do not report every day. If a given day counted only the cities that reported, cities that have patients but did not report would be ignored and the data would be distorted. The daily data must therefore be completed for every city: even a city that does not report needs a record for each day, so interpolation is used to fill in part of the data. See the pivoting and analysis section for the method.
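A minimal sketch of this kind of completion, using a hypothetical single-city series (the real code below works per province): reindex onto the full date range, then forward-fill the cumulative counts.

```python
import pandas as pd

# Hypothetical cumulative counts for one city; 2020-03-02 was not reported
s = pd.Series([8.0, 9.0],
              index=pd.to_datetime(['2020-03-01', '2020-03-03']),
              name='city_confirmedCount')

# Reindex onto the full calendar range, then carry the last known
# cumulative count forward across the unreported days
full = pd.date_range(s.index.min(), s.index.max())
filled = s.reindex(full).ffill()
print(filled.tolist())  # [8.0, 8.0, 9.0]
```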

china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            32812 non-null  object 
 1   continentEnglishName     32812 non-null  object 
 2   countryName              32812 non-null  object 
 3   countryEnglishName       32812 non-null  object 
 4   provinceName             32812 non-null  object 
 5   provinceEnglishName      32812 non-null  object 
 6   province_zipCode         32812 non-null  int64  
 7   province_confirmedCount  32812 non-null  int64  
 8   province_suspectedCount  32812 non-null  float64
 9   province_curedCount      32812 non-null  int64  
 10  province_deadCount       32812 non-null  int64  
 11  updateTime               32812 non-null  object 
 12  cityName                 32812 non-null  object 
 13  cityEnglishName          31968 non-null  object 
 14  city_zipCode             32502 non-null  float64
 15  city_confirmedCount      32812 non-null  float64
 16  city_suspectedCount      32812 non-null  float64
 17  city_curedCount          32812 non-null  float64
 18  city_deadCount           32812 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
china.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
96106 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 254 0.0 252 2 2020-04-02 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
125120 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 254 0.0 250 2 2020-03-20 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
128762 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 253 0.0 250 2 2020-03-18 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
130607 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 253 0.0 248 2 2020-03-17 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
131428 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 252 0.0 248 2 2020-03-16 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0

Epidemic data selection in your province (the final selected data is named myhome)

This step can also be done later

myhome = china.loc[china['provinceName'] == 'Guangdong Province']
myhome.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
205259 Asia Asia China China Guangdong Province Guangdong 440000 277 0.0 5 0 2020-01-29 Foreign visitors to Guangdong NaN NaN 5.0 0.0 0.0 0.0
206335 Asia Asia China China Guangdong Province Guangdong 440000 207 0.0 4 0 2020-01-28 Heyuan City NaN NaN 1.0 0.0 0.0 0.0
205239 Asia Asia China China Guangdong Province Guangdong 440000 277 0.0 5 0 2020-01-29 Personnel from other places NaN NaN 5.0 0.0 0.0 0.0
252 Asia Asia China China Guangdong Province Guangdong 440000 1634 11.0 1619 8 2020-06-23 Chaozhou Chaozhou 445100.0 6.0 0.0 6.0 0.0
2655 Asia Asia China China Guangdong Province Guangdong 440000 1634 11.0 1614 8 2020-06-21 Chaozhou Chaozhou 445100.0 6.0 0.0 6.0 0.0

Selection of overseas epidemic data (the final selected data is named world)

This step can also be done later

world = data.loc[data['countryName'] != 'China']
world.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
0 North America North America United States United States of America United States United States of America 971002 2306247 0.0 640198 120351 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
1 South America South America Brazil Brazil Brazil Brazil 973003 1106470 0.0 549386 51271 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
2 Europe Europe Britain United Kingdom Britain United Kingdom 961007 305289 0.0 539 42647 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
3 Europe Europe Russia Russia Russia Russia 964006 592280 0.0 344416 8206 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN
4 South America South America Chile Chile Chile Chile 973004 246963 0.0 44946 4502 2020-06-23 10:01:45 NaN NaN NaN NaN NaN NaN NaN

Pivoting and analysis

Analysis: interpolate china to complete some data

china.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
96106 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 254 0.0 252 2 2020-04-02 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
125120 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 254 0.0 250 2 2020-03-20 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
128762 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 253 0.0 250 2 2020-03-18 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
130607 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 253 0.0 248 2 2020-03-17 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0
131428 Asia Asia China China Guangxi Zhuang Autonomous Region Guangxi 450000 252 0.0 248 2 2020-03-16 Guigang Guigang 450800.0 8.0 0.0 8.0 0.0

Analysis: first create a province list and a date list, and initialize a draft

province = list(set(china['provinceName']))#Each province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))#Cities in each province
date_0 = []
for dt in china.loc[china['provinceName'] ==  province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))
date_0.sort()
start = china.loc[china['provinceName'] ==  province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] ==  province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
aid_frame = pd.DataFrame({'updateTime': dates,'provinceName':[province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] ==  province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] ==  province[0]], aid_frame], join='outer').sort_values('updateTime')
draft.province_confirmedCount.fillna(method="ffill",inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)

Analysis: complete the missing dates by carrying the previous day's data forward. Because some provinces have had no new patients since the end of April and stopped reporting, their data can only be completed up to the end of April; later data gradually loses authenticity

Analysis: date formatting at the same time

for p in range(1,len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] ==  province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] ==  province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] ==  province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({'updateTime': dates,'provinceName':[province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
    X = china.loc[china['provinceName'] ==  province[p]]
    X = X.reset_index(drop=True)
    Y = aid_frame
    Y = Y.reset_index(drop=True)
    draft_d = pd.concat([X,Y], join='outer').sort_values('updateTime')
    draft = pd.concat([draft,draft_d])
    draft.province_confirmedCount.fillna(method="ffill",inplace=True)
    draft.province_suspectedCount.fillna(method="ffill", inplace=True)
    draft.province_curedCount.fillna(method="ffill", inplace=True)
    draft.province_deadCount.fillna(method="ffill", inplace=True)
    #draft['updateTime'] = draft['updateTime'].strftime('%Y-%m-%d')
    #draft['updateTime'] = pd.to_datetime(draft['updateTime'],format="%Y-%m-%d",errors='coerce').dt.date
china = draft
china.head()
continentName continentEnglishName countryName countryEnglishName provinceName provinceEnglishName province_zipCode province_confirmedCount province_suspectedCount province_curedCount province_deadCount updateTime cityName cityEnglishName city_zipCode city_confirmedCount city_suspectedCount city_curedCount city_deadCount
208226 Asia Asia China China Tianjin Tianjin 120000.0 14.0 0.0 0.0 0.0 2020-01-26 From other places to Tianjin NaN NaN 2.0 0.0 0.0 0.0
208224 Asia Asia China China Tianjin Tianjin 120000.0 14.0 0.0 0.0 0.0 2020-01-26 Hebei District Hebei District 120105.0 5.0 0.0 0.0 0.0
208228 Asia Asia China China Tianjin Tianjin 120000.0 14.0 0.0 0.0 0.0 2020-01-26 Peace zone Heping District 120101.0 1.0 0.0 0.0 0.0
208227 Asia Asia China China Tianjin Tianjin 120000.0 14.0 0.0 0.0 0.0 2020-01-26 Binhai New Area Binhai New Area 120116.0 1.0 0.0 0.0 0.0
208230 Asia Asia China China Tianjin Tianjin 120000.0 14.0 0.0 0.0 0.0 2020-01-26 Xiqing District Xiqing District 120111.0 1.0 0.0 0.0 0.0

IV Data analysis and visualization

During data analysis and visualization, select the variables each question needs and create a new DataFrame for its analysis and visual display; this keeps the data from getting tangled and the organization clear.

Basic analysis

For basic analysis, only numpy, pandas and matplotlib libraries are allowed.

The results can be shown in multiple coordinate systems within one figure, or across multiple figures

Please choose the chart type (line chart, pie chart, bar chart, scatter chart, etc.) according to the purpose of the analysis. If you are short of ideas, browse Baidu's epidemic map or other epidemic-analysis sites for inspiration.

(1) What is the trend of cumulative confirmed / suspected / cured / dead cases over time in China?

Analysis: to obtain the nationwide cumulative trend over time, first aggregate the daily national cumulative confirmed counts into date_confirmed

Analysis: to build that aggregate, first extract each province's latest cumulative confirmed count for the day, then sum across provinces into a DataFrame and splice it onto date_confirmed in a for loop
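The per-day national total built in this section can also be sketched with a groupby (hypothetical data; this assumes one row per province per day after the earlier de-duplication):

```python
import pandas as pd

# Hypothetical de-duplicated per-province cumulative counts
prov = pd.DataFrame({
    'updateTime': ['2020-01-24', '2020-01-24', '2020-01-25', '2020-01-25'],
    'provinceName': ['Hubei', 'Guangdong', 'Hubei', 'Guangdong'],
    'province_confirmedCount': [1052, 53, 1423, 78],
})

# One row per province per day -> summing within each date gives
# the national cumulative total for that day
date_confirmed = (prov.groupby('updateTime')['province_confirmedCount']
                      .sum()
                      .to_frame('China_confirmedCount'))
print(date_confirmed['China_confirmedCount'].tolist())  # [1105, 1501]
```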

date = list(set(china['updateTime']))
date.sort()
date
[datetime.date(2020, 1, 24),
 datetime.date(2020, 1, 25),
 datetime.date(2020, 1, 26),
 ...
 datetime.date(2020, 6, 22),
 datetime.date(2020, 6, 23)]
china = china.reset_index(drop=True)

Analysis: loop over provinces and dates to obtain each province's daily cumulative confirmed count. Because splicing is required, initialize date_confirmed first

list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_confirmedCount'])  # cumulative confirmed count of each province on that day
    except IndexError:  # no record for this province on this date
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_confirmed = pd.DataFrame(list_d,index=list_e)
date_confirmed.index.name="date"
date_confirmed.columns=["China_confirmedCount"]
date_confirmed
China_confirmedCount
date
2020-01-24 1956.0

Analysis: iterate over the remaining dates to build the DataFrame of the national total of confirmed patients per day

l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l +=1
    for p in range(0,32):
        try:
            con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['province_confirmedCount'])  # cumulative confirmed count of each province on that day
        except IndexError:  # no record for this province on this date
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name="date"
    confirmed.columns=["China_confirmedCount"]
    date_confirmed = pd.concat([date_confirmed,confirmed],sort=False)
date_confirmed
China_confirmedCount
date
2020-01-24 1956.0
2020-01-25 2253.0
2020-01-26 1956.0
2020-01-27 2825.0
2020-01-28 4589.0
... ...
2020-06-17 8106.0
2020-06-18 6862.0
2020-06-19 6894.0
2020-06-20 6921.0
2020-06-21 6157.0

150 rows × 1 columns

Analysis: removing null and incomplete values

date_confirmed.dropna(subset=['China_confirmedCount'],inplace=True)
date_confirmed.tail(20)
China_confirmedCount
date
2020-06-02 78782.0
2020-06-03 78780.0
2020-06-04 76903.0
2020-06-05 76908.0
2020-06-06 8777.0
2020-06-07 8782.0
2020-06-08 8628.0
2020-06-09 8634.0
2020-06-10 8638.0
2020-06-11 8649.0
2020-06-12 8658.0
2020-06-13 8665.0
2020-06-14 8733.0
2020-06-15 8772.0
2020-06-16 8055.0
2020-06-17 8106.0
2020-06-18 6862.0
2020-06-19 6894.0
2020-06-20 6921.0
2020-06-21 6157.0

Analysis: from the end of April to the end of May the data is distorted because too many provinces stopped reporting (some have had no new patients since late April), and from 2020-06-06 onward it completely loses authenticity, so I deleted the data after 2020-06-06

date_confirmed = date_confirmed.drop(['2020-06-06','2020-06-07','2020-06-08','2020-06-09','2020-06-10','2020-06-11','2020-06-12','2020-06-13','2020-06-14',
                     '2020-06-15','2020-06-16','2020-06-19','2020-06-18','2020-06-20','2020-06-17','2020-06-21'])

Analysis: constructing a reusable splicing function

def data_frame(base, china, element):
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0,32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0]
                list_p.append(con_0[element])  # per-province cumulative count on that day
            except IndexError:  # no record for this province on this date
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name = "date"
        link.columns = ["China"]
        base = pd.concat([base, link], sort=False)
    base.dropna(subset=['China'], inplace=True)
    base = base.drop(['2020-06-06','2020-06-07','2020-06-08','2020-06-09','2020-06-10','2020-06-11','2020-06-12','2020-06-13','2020-06-14',
                  '2020-06-15','2020-06-16','2020-06-19','2020-06-18','2020-06-20','2020-06-17','2020-06-21'])
    return base

Analysis: initializing individual variables

#Cumulative number of cured persons date_curedCount
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_curedCount'])
    except IndexError:  # no record for this province on this date
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_cured = pd.DataFrame(list_d, index=list_e)
date_cured.index.name="date"
date_cured.columns=["China"]



#Cumulative number of deaths date_dead
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['province_deadCount'])
    except IndexError:  # no record for this province on this date
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_dead = pd.DataFrame(list_d, index=list_e)
date_dead.index.name="date"
date_dead.columns=["China"]
#Cumulative confirmed patient date_confirmed
plt.rcParams['font.sans-serif'] = ['SimHei'] #Change the font, otherwise Chinese characters cannot be displayed
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_confirmed.index
y = date_confirmed.values
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_title('Cumulative confirmed patients',fontdict={
      'color':'black',
      'size':24
})
ax.set_xticks( range(0,len(x),30))

#Cumulative cured patient date_curedCount
date_cured = data_frame(date_cured,china,'province_curedCount')
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_cured.index
y = date_cured.values
ax.set_title('Cumulative cured patients',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),30))

Analysis: the cumulative suspected counts cannot be recovered by this data completion

#Cumulative deaths date_dead
date_dead = data_frame(date_dead,china,'province_deadCount')
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_dead.index
y = date_dead.values
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_title('Cumulative dead patients',fontdict={
      'color':'black',
      'size':24
})
# set_xticks overrides any major locator, so one tick every 30 days is used
ax.set_xticks( range(0,len(x),30))

Analysis: the epidemic broke out in early January, slowed at the end of February, and levelled off by the end of April. Cured cases rose markedly from the beginning of February and flattened by the end of March. Deaths climbed from the end of January and flattened by the end of February; around mid-April the death count jumped because of statistical revisions and then flattened again.
Analysis and summary: from the end of April to the end of May, the confirmed and cured series are distorted because too many provinces stopped reporting (some have had no new patients since). The later the date, the more distorted the completed data become. The death series was completed almost without error or omission.

(2) What is the trend of new confirmed / suspected / cured / dead cases in China over time?

Analysis: the new confirmed / cured / death series must be computed for china by taking a day-over-day diff for each province

Analysis: first initialize each series for the starting day, then adapt the splicing function above to this problem
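As a minimal sketch of the diff step described above (toy numbers, not the project's data; the single `China` column matches the frames built below):

```python
import pandas as pd

# Hypothetical cumulative national totals indexed by date
cum = pd.DataFrame(
    {"China": [100, 150, 230, 300]},
    index=pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"]),
)
cum.index.name = "date"

# diff() turns cumulative totals into daily increments;
# the first row has no predecessor, so it becomes NaN
new_cases = cum["China"].diff()
print(new_cases.tolist())  # [nan, 50.0, 80.0, 70.0]
```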

#New confirmed number date_new_confirmed (initialize with the first day, date[0])
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_confirmedCount'])#Cumulative number of confirmed cases in each province on that day
    except IndexError:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_confirmed = pd.DataFrame(list_d,index=list_e)
date_new_confirmed.index.name="date"
date_new_confirmed.columns=["China"]
date_new_confirmed


#New number of cured persons date_new_cured (initialize with the first day, date[0])
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_curedCount'])
    except IndexError:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_cured = pd.DataFrame(list_d, index=list_e)
date_new_cured.index.name="date"
date_new_cured.columns=["China"]


#New death toll date_new_dead (initialize with the first day, date[0])
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[0]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_deadCount'])
    except IndexError:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_dead = pd.DataFrame(list_d, index=list_e)
date_new_dead.index.name="date"
date_new_dead.columns=["China"]

Analysis: construct the splicing function

def data_new_frame(base, china, element):
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l += 1
        for p in range(0,32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
                list_p.append(con_0[element])
            except IndexError:
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name="date"
        link.columns=["China"]
        base = pd.concat([base,link],sort=False)
    base.dropna(subset=['China'],inplace=True)
    return base

Analysis: complete the data, then drop values distorted by missing provinces

d = data_new_frame(date_new_confirmed,china,'province_confirmedCount')
# Repeatedly drop days whose total falls below the previous day: a cumulative
# count can only decrease when some provinces are missing for that day
for i in range(len(d)):
    dr = []
    for a, b in zip(range(0, len(d) - 1), range(1, len(d))):
        if d.iloc[b].iloc[0] < d.iloc[a].iloc[0]:
            dr.append(d.iloc[b].iloc[0])
    d = d[~d['China'].isin(dr)]
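The cleanup loop above repeatedly drops days whose total falls below the previous day (an artifact of missing provinces). A one-pass alternative, sketched with `cummax()` on toy numbers (not the project's data):

```python
import pandas as pd

# Hypothetical cumulative series with dips caused by missing provinces
toy = pd.DataFrame({"China": [100, 120, 90, 130, 110, 140]})

# A cumulative count should never decrease, so any value below the
# running maximum is an artifact and can be dropped in one pass
mono = toy[toy["China"] >= toy["China"].cummax()]
print(mono["China"].tolist())  # [100, 120, 130, 140]
```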

Analysis: perform difference operation

d['China'] = d['China'].diff()

Analysis: remove two dates with missing provinces

d.drop(['2020-06-20','2020-06-21'],inplace=True)

Analysis: make a line chart to show the time trend

#New confirmed patient date_confirmed
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = d.index
y = d.values
ax.set_title('Newly diagnosed patients',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

Analysis: use the initialized data to construct the date_new_cured DataFrame, then draw a line chart of the trend over time

cu = data_new_frame(date_new_cured,china,'province_curedCount')
for i in range(len(cu)):
    dr = []
    for a, b in zip(range(0, len(cu) - 1), range(1, len(cu))):
        if cu.iloc[b].iloc[0] < cu.iloc[a].iloc[0]:
            dr.append(cu.iloc[b].iloc[0])
    cu = cu[~cu['China'].isin(dr)]
cu['China'] = cu['China'].diff()
cu.drop(['2020-06-20','2020-06-21'],inplace=True)
#New cured patient date_new_cured
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = cu.index
y = cu.values
ax.set_title('New cured patients',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

Analysis: use the initialized data to construct the date_new_dead DataFrame, then draw a line chart of the trend over time

de = data_new_frame( date_new_dead,china,'province_deadCount')
for i in range(len(de)):
    dr = []
    for a, b in zip(range(0, len(de) - 1), range(1, len(de))):
        if de.iloc[b].iloc[0] < de.iloc[a].iloc[0]:
            dr.append(de.iloc[b].iloc[0])
    de = de[~de['China'].isin(dr)]
de['China'] = de['China'].diff()
de.drop(['2020-06-21'],inplace=True)
#New dead patient date_new_dead
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = de.index
y = de.values
ax.set_title('New dead patients',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

Analysis: new confirmed cases began to rise at the end of January, peaked around February 14, then fell back and flattened out.
Analysis: new cured cases began to rise at the end of January, peaked around March 2, then declined and flattened from the beginning of April.
Analysis: new deaths rose from the end of January to a peak in February; the increase was gentle from the beginning of March, and the spike around April 17 caused by statistical revisions fell back afterwards.

(3) What is the trend of new overseas input across the country over time?

Analysis: the newly added overseas-import series must be computed from CHINA by day-by-day subtraction.

Analysis: first extract the overseas-input rows from CHINA, then complete the time series and take the difference.

imported = CHINA.loc[CHINA['cityName'] == 'Overseas input'].copy()  # copy() so later assignments do not hit a view (avoids SettingWithCopyWarning)
imported.updateTime = pd.to_datetime(imported.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
imported
Output (607 rows × 19 columns): one overseas-input row per province per update, with columns from continentName through city_deadCount. The latest rows are dated 2020-06-23 (e.g. Beijing: city_confirmedCount 174.0, city_curedCount 173.0) and the earliest 2020-03-24.
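Assigning to a slice of a DataFrame is what triggers pandas' SettingWithCopyWarning; taking an explicit copy() first gives an independent frame that can be mutated safely. A toy sketch (column names mirror the dataset):

```python
import pandas as pd

df = pd.DataFrame({"cityName": ["Overseas input", "Wuhan"],
                   "updateTime": ["2020-06-23 10:00", "2020-06-23 11:00"]})

# copy() detaches the slice from df, so the assignment below
# modifies our own frame rather than a view of the original
imported_toy = df.loc[df["cityName"] == "Overseas input"].copy()
imported_toy["updateTime"] = pd.to_datetime(imported_toy["updateTime"],
                                            errors="coerce").dt.date
print(imported_toy["updateTime"].tolist())  # [datetime.date(2020, 6, 23)]
```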

Analysis: complete the dates missing for each province

for i in range(0,len(province)):
    list_j_d = []
    date_b = []
    for dt in imported.loc[imported['provinceName'] ==  province[i]]['updateTime']:
        date_b.append(str(dt))
    list_j_d = list(set(date_b))
    list_j_d.sort()
    try:
        start = imported.loc[imported['provinceName'] ==  province[i]]['updateTime'].min()
        end = imported.loc[imported['provinceName'] ==  province[i]]['updateTime'].max()
        dates_b = pd.date_range(start=str(start), end=str(end))
        aid_frame_b = pd.DataFrame({'updateTime': dates_b,'provinceName':[province[i]]*len(dates_b)})
        aid_frame_b.updateTime = pd.to_datetime(aid_frame_b.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
        draft_b = pd.concat([imported.loc[imported['provinceName'] ==  province[i]], aid_frame_b], join='outer').sort_values('updateTime')
        # Forward-fill the four city_* count columns over the inserted dates
        draft_b.city_confirmedCount.fillna(method="ffill",inplace=True)
        draft_b.city_suspectedCount.fillna(method="ffill", inplace=True)
        draft_b.city_curedCount.fillna(method="ffill", inplace=True)
        draft_b.city_deadCount.fillna(method="ffill", inplace=True)
        draft_b.loc[draft_b['provinceName'] ==  province[i]].fillna(0,inplace=True,limit = 1)
        draft_b.loc[draft_b['provinceName'] ==  province[i]].loc[:,'city_confirmedCount':'city_deadCount'] = draft_b.loc[draft_b['provinceName'] ==  province[i]].loc[:,'city_confirmedCount':'city_deadCount'].diff()
        draft_b.dropna(subset=['city_confirmedCount','city_suspectedCount','city_curedCount','city_deadCount'],inplace=True)
        imported = pd.concat([imported,draft_b], join='outer').sort_values('updateTime')
    except Exception:
        # Skip provinces with no overseas-input rows at all
        continue
imported
imported
Output (2524 rows × 19 columns): the original overseas-input rows interleaved with the newly created completion rows (which carry NaN in the descriptive columns), sorted by updateTime from 2020-03-24 to 2020-06-23.
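The completion loop above builds an auxiliary frame of missing dates, concatenates, and forward-fills. For a single province the same gap filling can be sketched more compactly with `reindex` (toy data; the column name follows the dataset):

```python
import pandas as pd

# Hypothetical per-province counts with 2020-03-25 missing
s = pd.DataFrame({"city_confirmedCount": [5.0, 9.0]},
                 index=pd.to_datetime(["2020-03-24", "2020-03-26"]))

# Reindex onto the complete daily range, then forward-fill the gap
full = pd.date_range(s.index.min(), s.index.max())
s = s.reindex(full).ffill()
print(s["city_confirmedCount"].tolist())  # [5.0, 5.0, 9.0]
```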

Analysis: copy() to prevent data processing errors from causing the loss of original data

draft_i = imported.copy()

Analysis: initialize with one province's data to check that the method is feasible

real_s = imported.loc[imported['provinceName'] == province[0]].copy()
real_s.drop_duplicates(subset='updateTime', keep='first', inplace=True)
draft_i =  real_s
for p in province:
    real_data = imported.loc[imported['provinceName'] == p].copy()  # copy() avoids SettingWithCopyWarning
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    draft_i = pd.concat([real_data,draft_i],sort=False)

Analysis: confirm that the method is correct, and carry out the same treatment for the remaining provinces

imported = draft_i
# set_index + reset_index moves provinceName to the first column
imported = imported.set_index('provinceName')
imported = imported.reset_index()

Analysis: data consolidation of each province.

list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = imported.loc[imported['updateTime'] == date[0]].loc[imported['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['city_confirmedCount'])#Cumulative number of confirmed cases in each province on that day
    except IndexError:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_foreign_confirmed = pd.DataFrame(list_d,index=list_e)
date_new_foreign_confirmed.index.name="date"
date_new_foreign_confirmed.columns=["imported_confirmedCount"]
date_new_foreign_confirmed
Output: a one-row frame — date 2020-01-24, imported_confirmedCount 0.
l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l +=1
    for p in range(0,32):
        try:
            con_0 = imported.loc[imported['updateTime'] == date[l]].loc[imported['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
            list_p.append(con_0['city_confirmedCount'])#Cumulative number of confirmed cases in each province on that day
        except IndexError:
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name="date"
    confirmed.columns=["imported_confirmedCount"]
    date_new_foreign_confirmed = pd.concat([date_new_foreign_confirmed,confirmed],sort=False)
date_new_foreign_confirmed
imported_confirmedCount
date
2020-01-24 0.0
2020-01-25 0.0
2020-01-26 0.0
2020-01-27 0.0
2020-01-28 0.0
... ...
2020-06-17 848.0
2020-06-18 800.0
2020-06-19 800.0
2020-06-20 802.0
2020-06-21 775.0

150 rows × 1 columns

#New overseas input
fig = plt.figure( figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_new_foreign_confirmed.index
y = date_new_foreign_confirmed.values
plot = ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-',label='date_new_foreign_confirmed' )
ax.set_xticks( range(0,len(x),10))
plt.xlabel('date',fontsize=20)
plt.ylabel('Number of persons',fontsize=20)
plt.title('COVID-19-New overseas input',fontsize=30)
ax.legend( loc=0, frameon=True )

Analysis and summary: imported cases have risen sharply since the end of March; the growth flattened at the beginning of May and slowed further at the beginning of June.

(4) What is the situation in your province?

Analysis: first take all the timestamps for Guangdong Province, convert them to strings, and sort them

# Collect Guangdong's timestamps as sorted strings
m_dates = sorted(str(d) for d in set(myhome['updateTime']))
# set_index + reset_index moves provinceName to the first column
myhome = myhome.set_index('provinceName')
myhome = myhome.reset_index()

Analysis: traverse the dates for my city's province to build the corresponding DataFrames

#Total number of confirmed cases in Guangdong Province
list_g = []
for i in range(0,len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0] 
        list_g.append(con_m['province_confirmedCount'])
    except IndexError:
        list_g.append(0)
        continue
g_date_confirmed = pd.DataFrame(list_g, index=m_dates)
g_date_confirmed.index.name="date"
g_date_confirmed.columns=["g_confirmed"]
g_date_confirmed=g_date_confirmed[~g_date_confirmed['g_confirmed'].isin([0])]


#Cumulative number of cured persons in Guangdong Province
list_g = []
for i in range(0,len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0] 
        list_g.append(con_m['province_curedCount'])
    except IndexError:
        list_g.append(0)
        continue
g_date_cured = pd.DataFrame(list_g, index=m_dates)
g_date_cured.index.name="date"
g_date_cured.columns=["g_cured"]
g_date_cured=g_date_cured[~g_date_cured['g_cured'].isin([0])]


#Cumulative death toll in Guangdong Province
list_g = []
for i in range(0,len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == 'Maoming'].iloc[[0]].iloc[0] 
        list_g.append(con_m['province_deadCount'])
    except IndexError:
        list_g.append(0)
        continue
g_date_dead = pd.DataFrame(list_g, index=m_dates)
g_date_dead.index.name="date"
g_date_dead.columns=["g_dead"]
g_date_dead=g_date_dead[~g_date_dead['g_dead'].isin([0])]

Analysis: make a line chart to show the trend of epidemic situation over time

##Cumulative confirmed and cured cases in Guangdong Province (deaths get their own chart below)
plt.rcParams['font.sans-serif'] = ['SimHei'] 
x = g_date_confirmed.index
y1 = g_date_confirmed.values
y2 = g_date_cured.values
#font_manager = font_manager.FontProperties(fname = 'C:/Windows/Fonts/simsun.ttc',size = 18)
plt.figure(figsize=(20,10),dpi = 80)
plt.plot(x,y1,color = r_hex,label = 'confirmed')
plt.plot(x,y2,color = g_hex,label = 'cured')
x_major_locator=MultipleLocator(12)
ax=plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.title('COVID-19 - Guangdong Province',fontsize=30)
plt.xlabel('date',fontsize=20)
plt.ylabel('Number of persons',fontsize=20)
plt.legend(loc=1, bbox_to_anchor=(1.00,0.90), bbox_transform=ax.transAxes)

#Cumulative death toll in Guangdong Province
plt.rcParams['font.sans-serif'] = ['SimHei'] 
fig = plt.figure( figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = g_date_dead.index
y = g_date_dead.values
plot = ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-',label='dead' )
ax.set_xticks( range(0,len(x),10))
plt.xlabel('date',fontsize=20)
plt.ylabel('Number of persons',fontsize=20)
plt.title('COVID-19-Guangdong Province',fontsize=30)
ax.legend( loc=0, frameon=True )

Analysis: the data completion for Guangdong Province was very successful, with high fidelity.
Analysis: the line chart shows infections in Guangdong rising sharply from the end of January and levelling off by mid-February. From the beginning of March, wider testing and statistical adjustments produced a small short-term rise in confirmed cases. Cured cases rose sharply from early February; with new infections flat since early June, the cured count has flattened as well. Guangdong has recorded no new deaths since the beginning of March.

(5) What is the epidemic situation abroad?

Analysis: drop the columns that contain null values

world = world.copy()  # an explicit copy() avoids the SettingWithCopyWarning on this slice
world.dropna(axis=1, how='any', inplace=True)

Analysis: create a country list country and a date list date_y

country = list(set(world['provinceName']))
date_y = []
for dt in world.loc[world['provinceName'] ==  country[0]]['updateTime']:
    date_y.append(str(dt))
date_y = list(set(date_y))
date_y.sort()

Analysis: traverse the country list and tidy the updateTime column in world.

for c in country:
    # Note: the sorted result is not stored; pivot_table below indexes by updateTime anyway
    world.loc[world['provinceName'] == c].sort_values(by = 'updateTime')
world.dropna(subset=['provinceName'],inplace=True)
world.updateTime = pd.to_datetime(world.updateTime,format="%Y-%m-%d",errors='coerce').dt.date

Analysis: build world_confirmed from the province_confirmedCount of the top 15 countries, then complete the data

world_confirmed = world.loc[world['provinceName'] == world.head(15)['provinceName'][0]].pivot_table(index='updateTime', columns='provinceName', values='province_confirmedCount',aggfunc=np.mean)
for i in world.head(15)['provinceName'][1:]:
    draft_c = world.loc[world['provinceName'] == i].pivot_table(index='updateTime', columns='provinceName', values='province_confirmedCount',aggfunc=np.mean)
    world_confirmed = pd.merge(world_confirmed,draft_c,on='updateTime', how='outer',sort=True)
world_confirmed.fillna(0,inplace=True,limit = 1)
world_confirmed.fillna(method="ffill",inplace=True)
world_confirmed
Output (144 rows × 15 columns): daily cumulative confirmed counts for the top 15 countries from 2020-01-27 to 2020-06-23; on the final day the United States stands at about 2,299,650, Brazil at 1,106,470 and Britain at 305,289.
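The loop of one pivot_table per country followed by merges can usually be collapsed into a single pivot_table call on the pre-filtered frame, assuming the same column names (toy stand-in data here, not the project's dataset):

```python
import pandas as pd

# Toy stand-in for the `world` frame restricted to the countries of interest
world_toy = pd.DataFrame({
    "provinceName": ["United States", "Brazil", "United States", "Brazil"],
    "updateTime": ["2020-06-20", "2020-06-20", "2020-06-21", "2020-06-21"],
    "province_confirmedCount": [2221982, 1038568, 2253118, 1068977],
})

# One pivot: rows are dates, columns are countries, duplicates averaged
wide = world_toy.pivot_table(index="updateTime", columns="provinceName",
                             values="province_confirmedCount", aggfunc="mean")
print(wide.loc["2020-06-21", "Brazil"])  # 1068977.0
```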

Analysis: the epidemic situation of the top 15 countries over time

#plt.rcParams['font.sans-serif'] = ['SimHei']  
fig = plt.figure(figsize=(16,10))
plt.plot(world_confirmed)
plt.legend(world_confirmed.columns)
plt.title('Cumulative number of confirmed cases in the top 15 countries',fontsize=20)
plt.xlabel('date',fontsize=20)
plt.ylabel('Number of persons/million',fontsize=20);

Analysis: the completion of the foreign data was fairly successful and reasonably faithful.
Analysis: new confirmed cases abroad have risen sharply since the end of March. The epidemic in the top four countries is not under control, and the overseas trend is a continuing rise in confirmed patients.

(6) Based on your analysis results, what are your suggestions for individuals and society in combating the epidemic?

According to the domestic line charts, the epidemic has flattened since the end of April. Abroad, by contrast, the outbreak took off at the beginning of April and shows no sign of flattening so far.
Judging from the imported-case series, we must stay wary of overseas imports to prevent a renewed domestic spread of the coronavirus; vigilance cannot be relaxed.
Individuals should avoid densely populated areas, wear masks when going out, and disinfect thoroughly at home.
Society should popularize virus testing and site disinfection in transport hubs and densely populated areas, cut off the virus's transmission paths, and preserve China's gains in epidemic prevention and control.

Additional analysis (optional)

For additional analysis, there is no limit to the libraries used. For example, seaborn, pyecharts and other libraries can be used.

Limited by personal ability, this was not attempted.


Posted by joelhop on Wed, 01 Jun 2022 02:33:14 +0530