[Python] Using Python for data analysis: a case study of the novel coronavirus epidemic
Important Notes
Submit only this document; after completing it, upload it to the course's "homework" section.
1. File naming:
The file must be named number-name-major-class.ipynb (e.g. 202011030101-Qiao Feng-Measurement and Control Technology and Instrument-Measurement and Control 20-1.ipynb).
In addition, change the second line at the beginning of the document (marked in the document) to your own number-name-major-class.
2. Deadline:
24:00 on April 30, 2021
3. Scoring rules:
Analysis document : Completeness : Code quality = 3 : 5 : 2
The analysis document refers to your ideas, explanations, and interpretation of the results of your data analysis (keep it short and to the point; this is not an essay).
P.S.: Code you write yourself is worth far more than anything else, however beautiful or ugly it looks; just aim to be better today than yesterday. Keep it up!
Reminder:
The epidemic is still raging; please take care to protect yourselves.
I wish you all good results.
Since the dataset is large, use head() or tail() to preview the data so that the program does not hang for a long time.
=======================
The data for this project come from DXY (Dingxiangyuan). The main purpose of this project is to better understand the epidemic and its development through the analysis of historical epidemic data, and to provide data support for decision-making in the fight against the epidemic.
1. Ask questions
The main research questions cover three aspects: the national level, your own province or city, and the epidemic abroad:
(1) What is the nationwide trend of cumulative confirmed/suspected/cured/death counts over time?
(2) What is the situation in your own province or city?
(3) What is the overall situation of the global epidemic?
(4) Based on your analysis results, give your judgment on the epidemic trend in the next half year. What suggestions do you have for individuals and for society in fighting the epidemic?
2. Understanding data
Original dataset: AreaInfo.csv. Import the required packages, read the data, and assign it to areas.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

areas = pd.read_csv('data/AreaInfo.csv')
areas
```
continentName | continentEnglishName | countryName | countryEnglishName | provinceName | provinceEnglishName | province_zipCode | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | updateTime | cityName | cityEnglishName | city_zipCode | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Asia | Asia | China | China | Macao | Macau | 820000 | 47 | 9.0 | 46 | 0 | 2021-01-22 23:40:08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | North America | North America | U.S.A | United States of America | U.S.A | United States of America | 971002 | 24632468 | 0.0 | 10845438 | 410378 | 2021-01-22 23:40:08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | South America | South America | Brazil | Brazil | Brazil | Brazil | 973003 | 8699814 | 0.0 | 7580741 | 214228 | 2021-01-22 23:40:08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Europe | Europe | Belgium | Belgium | Belgium | Belgium | 961001 | 686827 | 0.0 | 19239 | 20620 | 2021-01-22 23:40:08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Europe | Europe | Russia | Russia | Russia | Russia | 964006 | 3677352 | 0.0 | 3081536 | 68412 | 2021-01-22 23:40:08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
429911 | Asia | Asia | China | China | Liaoning Province | Liaoning | 210000 | 0 | 1.0 | 0 | 0 | 2020-01-22 03:28:10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
429912 | Asia | Asia | China | China | Taiwan | Taiwan | 710000 | 1 | 0.0 | 0 | 0 | 2020-01-22 03:28:10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
429913 | Asia | Asia | China | Hongkong | Hong Kong | Hongkong | 810000 | 0 | 117.0 | 0 | 0 | 2020-01-22 03:28:10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
429914 | Asia | Asia | China | China | Heilongjiang Province | Heilongjiang | 230000 | 0 | 1.0 | 0 | 0 | 2020-01-22 03:28:10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
429915 | Asia | Asia | China | China | Hunan Province | Hunan | 430000 | 1 | 0.0 | 0 | 0 | 2020-01-22 03:28:10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
429916 rows × 19 columns
areas.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429916 entries, 0 to 429915
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   continentName            429872 non-null  object
 1   continentEnglishName     429872 non-null  object
 2   countryName              429916 non-null  object
 3   countryEnglishName       403621 non-null  object
 4   provinceName             429916 non-null  object
 5   provinceEnglishName      403621 non-null  object
 6   province_zipCode         429916 non-null  int64
 7   province_confirmedCount  429916 non-null  int64
 8   province_suspectedCount  429913 non-null  float64
 9   province_curedCount      429916 non-null  int64
 10  province_deadCount       429916 non-null  int64
 11  updateTime               429916 non-null  object
 12  cityName                 125143 non-null  object
 13  cityEnglishName          119660 non-null  object
 14  city_zipCode             123877 non-null  float64
 15  city_confirmedCount      125143 non-null  float64
 16  city_suspectedCount      125143 non-null  float64
 17  city_curedCount          125143 non-null  float64
 18  city_deadCount           125143 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 62.3+ MB
```
View and summarize the data to get a general idea of it.
Introduction to the meaning of the relevant fields:
Tips:
For foreign data, provinceName is not an actual province name; it is filled with the country name, i.e. the data are not subdivided into provinces.
In the Chinese data, provinceName also contains records labelled 'China', which represent the national total of all provinces on that day. Used well, these rows make analysing the national situation much easier (see the sketch below).
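A minimal sketch of using those national-total rows (assuming `areas` has been read in as above):

```python
# National daily totals are the rows whose provinceName is the literal string 'China'
national_totals = areas[(areas.countryName == 'China') & (areas.provinceName == 'China')]
national_totals.head()
```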
3. Data cleaning
(1) Basic data processing
Data cleaning mainly includes: selecting subsets, missing data processing, data format conversion, outlier data processing, etc.
Tip: Because the data are reported by every country, missing or concealed reports are unavoidable when the situation is critical, so there are many missing values; these can be filled in or dropped. See "Handling missing values in Pandas.ipynb". A minimal sketch of the two options follows.
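A minimal sketch of the two options, assuming `areas` has already been read in:

```python
# Option 1: fill gaps, e.g. treat a missing suspected count as 0 (keeps all rows)
areas_filled = areas.fillna({'province_suspectedCount': 0})

# Option 2: drop rows that contain missing values
areas_dropped = areas.dropna()
```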
Selection of domestic epidemic data (the final selected data is named china)

- Select the domestic epidemic data.

- For the updateTime column, convert it to a date type and extract the year-month-day; view the processing result. (Tip: dt.date)

- Since the data are updated several times a day and there are many duplicates within a day, deduplicate and keep only the latest record for each day.
  Tip: df.drop_duplicates(subset=['provinceName','updateTime'], keep='first', inplace=False)
  where df is the DataFrame of domestic epidemic data you selected.

- Remove the columns that are outside the scope of this study, keeping only ['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount','updateTime'], and index the rows with 'updateTime'.
  Tip: Either approach works: (1) select these columns, or (2) drop the remaining columns.
```python
# The code is given here; obtaining the provincial and global data later works in much the same way.
china = areas.loc[areas.countryName == 'China', :].copy()
china['updateTime'] = pd.to_datetime(china.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
# Keep only the latest record per province per day
china = china.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)
# Convert the character-type date column to a DatetimeIndex
china['updateTime'] = pd.to_datetime(china['updateTime'])
china.set_index('updateTime', inplace=True)
china = china[['continentName', 'countryName', 'provinceName', 'province_confirmedCount',
               'province_suspectedCount', 'province_curedCount', 'province_deadCount']]
# Keep only the national-total rows
china = china[china.provinceName == 'China']
china.head()
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-21 | Asia | China | China | 99513 | 0.0 | 92198 | 4809 |
2021-01-20 | Asia | China | China | 99285 | 0.0 | 92130 | 4808 |
2021-01-19 | Asia | China | China | 99094 | 0.0 | 92071 | 4806 |
2021-01-18 | Asia | China | China | 98922 | 0.0 | 91994 | 4805 |
Check the data for missing values and correct data types. If there are missing values, you can fill them in or drop them; see **"Handling missing values in Pandas.ipynb"**.
```python
# Check the data for missing values and correct data types; missing values can be filled in or dropped
china.info()
china.head(5)
```
```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 305 entries, 2021-01-22 to 2020-03-15
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   continentName            305 non-null    object
 1   countryName               305 non-null    object
 2   provinceName             305 non-null    object
 3   province_confirmedCount  305 non-null    int64
 4   province_suspectedCount  305 non-null    float64
 5   province_curedCount      305 non-null    int64
 6   province_deadCount       305 non-null    int64
dtypes: float64(1), int64(3), object(3)
memory usage: 19.1+ KB
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-21 | Asia | China | China | 99513 | 0.0 | 92198 | 4809 |
2021-01-20 | Asia | China | China | 99285 | 0.0 | 92130 | 4808 |
2021-01-19 | Asia | China | China | 99094 | 0.0 | 92071 | 4806 |
2021-01-18 | Asia | China | China | 98922 | 0.0 | 91994 | 4805 |
Selection of the epidemic data for your own province or city (the final selected data is named myhome)

This step can also be done later.

- Select the epidemic data of your province or city (broken down to the city level; for a municipality directly under the central government, broken down to the district level).

- For the updateTime column, convert it to a date type and extract the year-month-day; view the processing result. (Tip: dt.date)

- Since the data are updated several times a day and there are many duplicates within a day, deduplicate and keep only the latest record for each day, and index the rows with 'updateTime'.
  Tip: df.drop_duplicates(subset=['cityName','updateTime'], keep='first', inplace=False)

- Remove the columns that are outside the scope of this study.
  Tip: df.drop(['continentName','continentEnglishName','countryName','countryEnglishName','provinceEnglishName',
  'province_zipCode','cityEnglishName','updateTime','city_zipCode'],axis=1,inplace=True)
  where df is the DataFrame of the epidemic data of the province or city you selected (a sketch following these hints is shown below, ahead of the actual code).
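A minimal sketch that follows the hints above literally (deduplicating per city per day and indexing by 'updateTime'); 'Sichuan Province' is only an example:

```python
# Sketch only: select one province, keep the latest record per city per day, index by date
myhome = areas.loc[areas.provinceName == 'Sichuan Province', :].copy()
myhome['updateTime'] = pd.to_datetime(myhome.updateTime, errors='coerce').dt.date

# the raw data are ordered newest-first, so keep='first' retains the latest record of each day
myhome = myhome.drop_duplicates(subset=['cityName', 'updateTime'], keep='first')

myhome['updateTime'] = pd.to_datetime(myhome['updateTime'])
myhome.set_index('updateTime', inplace=True)
myhome.drop(['continentName', 'continentEnglishName', 'countryName', 'countryEnglishName',
             'provinceEnglishName', 'province_zipCode', 'cityEnglishName', 'city_zipCode'],
            axis=1, inplace=True)
myhome.head()
```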
```python
# Provincial/municipal data acquisition
df = areas.loc[areas.provinceName == 'Sichuan Province', :].copy()
df['updateTime'] = pd.to_datetime(df.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
# Note: deduplicating on ['provinceName', 'updateTime'] keeps only one (the latest) city row per day
df = df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)
# Convert the character-type date column to datetime
df['updateTime'] = pd.to_datetime(df['updateTime'])
# df.set_index('updateTime', inplace=True)
df.drop(['continentName', 'continentEnglishName', 'countryName', 'countryEnglishName',
         'provinceEnglishName', 'province_zipCode', 'cityEnglishName', 'updateTime', 'city_zipCode'],
        axis=1, inplace=True)
# df = df[df.provinceName=='Hubei Province']
df
# df.head(10)
```
provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | cityName | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|
2538 | Sichuan Province | 865 | 14.0 | 852 | 3 | Chengdu | 468.0 | 13.0 | 455.0 | 3.0 |
7192 | Sichuan Province | 865 | 14.0 | 852 | 3 | Chengdu | 468.0 | 13.0 | 455.0 | 3.0 |
8912 | Sichuan Province | 863 | 14.0 | 852 | 3 | Chengdu | 466.0 | 13.0 | 455.0 | 3.0 |
10704 | Sichuan Province | 863 | 14.0 | 847 | 3 | Chengdu | 466.0 | 13.0 | 450.0 | 3.0 |
12175 | Sichuan Province | 862 | 14.0 | 847 | 3 | Chengdu | 465.0 | 13.0 | 450.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
427258 | Sichuan Province | 44 | 0.0 | 0 | 0 | Chengdu | 22.0 | 0.0 | 0.0 | 0.0 |
428070 | Sichuan Province | 28 | 0.0 | 0 | 0 | Chengdu | 16.0 | 0.0 | 0.0 | 0.0 |
428837 | Sichuan Province | 15 | 0.0 | 0 | 0 | Chengdu | 7.0 | 0.0 | 0.0 | 0.0 |
429846 | Sichuan Province | 8 | 2.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
429866 | Sichuan Province | 5 | 2.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
267 rows × 10 columns
```python
df.info()
df.head()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 267 entries, 2538 to 429866
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   provinceName             267 non-null    object
 1   province_confirmedCount  267 non-null    int64
 2   province_suspectedCount  267 non-null    float64
 3   province_curedCount      267 non-null    int64
 4   province_deadCount       267 non-null    int64
 5   cityName                 265 non-null    object
 6   city_confirmedCount      265 non-null    float64
 7   city_suspectedCount      265 non-null    float64
 8   city_curedCount          265 non-null    float64
 9   city_deadCount           265 non-null    float64
dtypes: float64(5), int64(3), object(2)
memory usage: 22.9+ KB
```
provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | cityName | city_confirmedCount | city_suspectedCount | city_curedCount | city_deadCount | |
---|---|---|---|---|---|---|---|---|---|---|
2538 | Sichuan Province | 865 | 14.0 | 852 | 3 | Chengdu | 468.0 | 13.0 | 455.0 | 3.0 |
7192 | Sichuan Province | 865 | 14.0 | 852 | 3 | Chengdu | 468.0 | 13.0 | 455.0 | 3.0 |
8912 | Sichuan Province | 863 | 14.0 | 852 | 3 | Chengdu | 466.0 | 13.0 | 455.0 | 3.0 |
10704 | Sichuan Province | 863 | 14.0 | 847 | 3 | Chengdu | 466.0 | 13.0 | 450.0 | 3.0 |
12175 | Sichuan Province | 862 | 14.0 | 847 | 3 | Chengdu | 465.0 | 13.0 | 450.0 | 3.0 |
```python
df_clean = df.dropna()
df_clean.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 265 entries, 2538 to 428837
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   provinceName             265 non-null    object
 1   province_confirmedCount  265 non-null    int64
 2   province_suspectedCount  265 non-null    float64
 3   province_curedCount      265 non-null    int64
 4   province_deadCount       265 non-null    int64
 5   cityName                 265 non-null    object
 6   city_confirmedCount      265 non-null    float64
 7   city_suspectedCount      265 non-null    float64
 8   city_curedCount          265 non-null    float64
 9   city_deadCount           265 non-null    float64
dtypes: float64(5), int64(3), object(2)
memory usage: 22.8+ KB
```
Check the data for missing values and correct data types. If there are missing values, you can fill them in or drop them; see **"Handling missing values in Pandas.ipynb"**.
Selection of the global epidemic data (the final selected data is named world)

This step can also be done later.

- Select the epidemic data from abroad.
  Tip: select the foreign epidemic data with countryName != 'China'.

- For the updateTime column, convert it to a date type and extract the year-month-day; view the processing result. (Tip: dt.date)

- Since the data are updated several times a day and there are many duplicates within a day, deduplicate and keep only the latest record for each day.
  Tip: df.drop_duplicates(subset=['provinceName','updateTime'], keep='first', inplace=False)
  where df is the DataFrame of foreign epidemic data you selected.

- Remove the columns that are outside the scope of this study, keeping only ['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount','updateTime'], and index the rows with 'updateTime'.
  Tip: Either approach works: (1) select these columns, or (2) drop the remaining columns.

- Obtain the global data.
  Tip: use the concat function to concatenate the china DataFrame from before with the foreign data along the row axis (axis=0) to obtain the global data.
```python
# Foreign data acquisition
o = areas.loc[areas.countryName != 'China', :].copy()
o['updateTime'] = pd.to_datetime(o.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
o = o.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)
# Convert the character-type date column to datetime and index by it
o['updateTime'] = pd.to_datetime(o['updateTime'])
o = o[['continentName', 'countryName', 'provinceName', 'province_confirmedCount',
       'province_suspectedCount', 'province_curedCount', 'province_deadCount', 'updateTime']]
o.set_index('updateTime', inplace=True)
o
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | North America | U.S.A | U.S.A | 24632468 | 0.0 | 10845438 | 410378 |
2021-01-22 | South America | Brazil | Brazil | 8699814 | 0.0 | 7580741 | 214228 |
2021-01-22 | Europe | Belgium | Belgium | 686827 | 0.0 | 19239 | 20620 |
2021-01-22 | Europe | Russia | Russia | 3677352 | 0.0 | 3081536 | 68412 |
2021-01-22 | Europe | Serbia | Serbia | 436121 | 0.0 | 50185 | 5263 |
... | ... | ... | ... | ... | ... | ... | ... |
2020-01-27 | NaN | Malaysia | Malaysia | 3 | 0.0 | 0 | 0 |
2020-01-27 | NaN | France | France | 3 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Vietnam? | Vietnam? | 2 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Nepal | Nepal | 1 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Canada | Canada | 1 | 0.0 | 0 | 0 |
62110 rows × 7 columns
```python
world = pd.concat([china, o])
world
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-21 | Asia | China | China | 99513 | 0.0 | 92198 | 4809 |
2021-01-20 | Asia | China | China | 99285 | 0.0 | 92130 | 4808 |
2021-01-19 | Asia | China | China | 99094 | 0.0 | 92071 | 4806 |
2021-01-18 | Asia | China | China | 98922 | 0.0 | 91994 | 4805 |
... | ... | ... | ... | ... | ... | ... | ... |
2020-01-27 | NaN | Malaysia | Malaysia | 3 | 0.0 | 0 | 0 |
2020-01-27 | NaN | France | France | 3 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Vietnam? | Vietnam? | 2 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Nepal | Nepal | 1 | 0.0 | 0 | 0 |
2020-01-27 | NaN | Canada | Canada | 1 | 0.0 | 0 | 0 |
62415 rows × 7 columns
```python
world.info()
world_clean = world.dropna()
world_clean.info()
```
```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62415 entries, 2021-01-22 to 2020-01-27
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   continentName            62377 non-null  object
 1   countryName              62415 non-null  object
 2   provinceName             62415 non-null  object
 3   province_confirmedCount  62415 non-null  int64
 4   province_suspectedCount  62415 non-null  float64
 5   province_curedCount      62415 non-null  int64
 6   province_deadCount       62415 non-null  int64
dtypes: float64(1), int64(3), object(3)
memory usage: 3.8+ MB

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62377 entries, 2021-01-22 to 2020-01-29
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   continentName            62377 non-null  object
 1   countryName              62377 non-null  object
 2   provinceName             62377 non-null  object
 3   province_confirmedCount  62377 non-null  int64
 4   province_suspectedCount  62377 non-null  float64
 5   province_curedCount      62377 non-null  int64
 6   province_deadCount       62377 non-null  int64
dtypes: float64(1), int64(3), object(3)
memory usage: 3.8+ MB
```
Check the data for missing values and correct data types.
Tip: There are many missing values because the data are reported by every country, and missing or concealed reports are unavoidable when the situation is critical. You can fill in or drop the missing values; see **"Handling missing values in Pandas.ipynb"**. A quick check is sketched below.
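A quick sketch for locating the missing values before deciding whether to fill or drop them:

```python
# Count missing values per column; in this dataset only continentName has gaps
world.isna().sum()

# Either fill the missing continent names...
world_filled = world.fillna({'continentName': 'Unknown'})
# ...or drop those rows, as done above
world_clean = world.dropna()
```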
4. Data analysis and visualization
For data analysis and visualization, select the variables needed for each question and create a new DataFrame for the analysis and visualization, so that the data stay uncluttered and organized.
Basic analysis
For the basic analysis, only the numpy, pandas, and matplotlib libraries may be used.
Multiple coordinate systems can be displayed in one figure, or separate figures can be used (a sketch combining the three national curves into one figure is given after question (1)).
Please choose the chart type (line chart, pie chart, bar chart, scatter chart, etc.) according to the purpose of the analysis. If you really have no ideas, browse the Baidu epidemic map or other epidemic-analysis sites for inspiration.
(1) What is the nationwide trend of cumulative confirmed/cured/death counts over time?
china
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-21 | Asia | China | China | 99513 | 0.0 | 92198 | 4809 |
2021-01-20 | Asia | China | China | 99285 | 0.0 | 92130 | 4808 |
2021-01-19 | Asia | China | China | 99094 | 0.0 | 92071 | 4806 |
2021-01-18 | Asia | China | China | 98922 | 0.0 | 91994 | 4805 |
... | ... | ... | ... | ... | ... | ... | ... |
2020-03-19 | Asia | China | China | 81263 | 0.0 | 70561 | 3250 |
2020-03-18 | Asia | China | China | 81202 | 0.0 | 69777 | 3242 |
2020-03-17 | Asia | China | China | 81135 | 0.0 | 68820 | 3231 |
2020-03-16 | Asia | China | China | 81099 | 0.0 | 67930 | 3218 |
2020-03-15 | Asia | China | China | 81062 | 0.0 | 67037 | 3204 |
305 rows × 7 columns
```python
# Cumulative confirmed cases nationwide
plt.figure()
plt.plot(china.province_confirmedCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
```python
# Cumulative cured cases nationwide
plt.figure()
plt.plot(china.province_curedCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
```python
# Cumulative deaths nationwide
plt.figure()
plt.plot(china.province_deadCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
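As noted in the basic-analysis tips, the three national curves can also be placed in one figure with multiple coordinate systems; a minimal sketch:

```python
# Cumulative confirmed / cured / death counts nationwide, one subplot each
fig, axes = plt.subplots(3, 1, figsize=(8, 10), sharex=True)
columns = ['province_confirmedCount', 'province_curedCount', 'province_deadCount']
labels = ['confirmed', 'cured', 'deaths']
for ax, col, label in zip(axes, columns, labels):
    ax.plot(china.index, china[col])
    ax.set_title(f'China: cumulative {label}')
    ax.set_ylabel('Count')
axes[-1].set_xlabel('Time')
plt.tight_layout()
plt.show()
```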
(2) What is the situation in your province or city?
```python
# Cumulative confirmed cases in Sichuan
plt.figure()
plt.plot(df.province_confirmedCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
```python
# Cumulative cured cases in Sichuan
plt.figure()
plt.plot(df.province_curedCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
```python
# Cumulative deaths in Sichuan
plt.figure()
plt.plot(df.province_deadCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()
```
(3) What is the global epidemic situation?
- Which countries are the global TOP10 by epidemic size? (see the grouping sketch after this list)

- How do the continents compare?

- Pick a continent you are interested in and analyse the relationships, distribution, comparison, and composition of the epidemic among its countries.
  Tip: make good use of pivot tables, grouping, and aggregation.
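A minimal sketch of how grouping answers the TOP10 question directly (using the `world_clean` DataFrame from the cleaning step):

```python
# Peak cumulative confirmed count per country, then take the 10 largest
top10 = (world_clean.groupby('countryName')['province_confirmedCount']
         .max()
         .sort_values(ascending=False)
         .head(10))
print(top10)

# Bar chart of the TOP10 countries
plt.figure(figsize=(10, 5))
plt.bar(top10.index, top10.values)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Cumulative confirmed')
plt.title('Global TOP10 countries by confirmed cases')
plt.tight_layout()
plt.show()
```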
```python
# What is the global TOP10 epidemic?
# Plot the cumulative confirmed curve for each of the first ten countries in the data
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
# Note: sort_values with inplace=False returns a new object; without assignment this line has no effect
world_clean.sort_values("province_confirmedCount", inplace=False)
# Drop rows where a country's cumulative confirmed count did not change
world_clean = world_clean.drop_duplicates(subset=['province_confirmedCount', 'countryName'],
                                          keep='first', inplace=False)
name = world_clean['countryName'].drop_duplicates()
name[:10]

ww = world_clean.copy()
ii = ww.loc[ww['countryName'] == 'China']
ii.province_confirmedCount

for i in name[:10]:
    ii = ww.loc[ww['countryName'] == i]
    plt.figure()
    plt.title(i)
    plt.plot(ii.province_confirmedCount)
    plt.show()

# Earlier attempts, kept for reference:
# world_clean['provinceName'].drop_duplicates().values
# world_clean['time'] = world_clean.index
# world_clean = world_clean.drop_duplicates(subset=['updateTime','countryName'], keep='first', inplace=False)
# world_clean = world_clean.drop_duplicates(subset=['province_confirmedCount','countryName'], keep='first', inplace=False)
# world_clean.groupby(by='').max()
# df['city'].drop_duplicates()
# list(country_name[-5:])
# world_clean
```
```python
# How do the continents compare?
www = world_clean.copy()
name = www['continentName'].drop_duplicates()
for i in name:
    ii = www.loc[www['continentName'] == i]
    plt.figure()
    plt.title(i)
    plt.plot(ii.province_confirmedCount)
    plt.show()
```
(Figure output output_32_0.png through output_32_6.png: one cumulative confirmed-count plot per continent.)
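An alternative sketch for the continent comparison, aggregating each country's peak cumulative confirmed count into a single bar chart (summing per-country maxima is an approximation of the continent totals):

```python
# Peak cumulative confirmed per country, summed by continent
continent_totals = (world_clean.groupby(['continentName', 'countryName'])['province_confirmedCount']
                    .max()
                    .groupby(level='continentName')
                    .sum()
                    .sort_values(ascending=False))

plt.figure(figsize=(8, 5))
plt.bar(continent_totals.index, continent_totals.values)
plt.ylabel('Cumulative confirmed')
plt.title('Confirmed cases by continent')
plt.show()
```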
```python
# Pick a continent of interest and analyse the relationships, distribution, comparison, and composition among its countries
ww1 = world_clean.copy()
df_yazhou = ww1.loc[ww1['continentName'] == 'Asia']
df_yazhou
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-21 | Asia | China | China | 99513 | 0.0 | 92198 | 4809 |
2021-01-20 | Asia | China | China | 99285 | 0.0 | 92130 | 4808 |
2021-01-19 | Asia | China | China | 99094 | 0.0 | 92071 | 4806 |
2021-01-18 | Asia | China | China | 98922 | 0.0 | 91994 | 4805 |
... | ... | ... | ... | ... | ... | ... | ... |
2020-02-01 | Asia | Thailand | Thailand | 19 | 0.0 | 5 | 0 |
2020-02-01 | Asia | Malaysia | Malaysia | 8 | 0.0 | 0 | 0 |
2020-02-01 | Asia | The United Arab Emirates | The United Arab Emirates | 4 | 0.0 | 0 | 0 |
2020-02-01 | Asia | India | India | 1 | 0.0 | 0 | 0 |
2020-01-29 | Asia | The United Arab Emirates | The United Arab Emirates | 1 | 0.0 | 0 | 0 |
10960 rows × 7 columns
```python
# There is a lot of duplicate data within a day
# o = o.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)
# We only care about each country's latest (largest) cumulative confirmed/cured/death counts,
# so keep only the first (most recent) row per country
df_yazhou_1 = df_yazhou.drop_duplicates(subset=['countryName'], keep='first', inplace=False)
df_yazhou_1
```
continentName | countryName | provinceName | province_confirmedCount | province_suspectedCount | province_curedCount | province_deadCount | |
---|---|---|---|---|---|---|---|
updateTime | |||||||
2021-01-22 | Asia | China | China | 99667 | 0.0 | 92275 | 4810 |
2021-01-22 | Asia | Indonesia | Indonesia | 965283 | 0.0 | 781147 | 27453 |
2021-01-22 | Asia | Japan | Japan | 351020 | 0.0 | 279214 | 4830 |
2021-01-22 | Asia | Pakistan | Pakistan | 528891 | 0.0 | 482771 | 11204 |
2021-01-22 | Asia | The Philippines | The Philippines | 509887 | 0.0 | 467720 | 10136 |
2021-01-22 | Asia | Cyprus | Cyprus | 29472 | 0.0 | 2057 | 176 |
2021-01-22 | Asia | Georgia | Georgia | 251071 | 0.0 | 238101 | 2998 |
2021-01-22 | Asia | Jordan | Jordan | 318181 | 0.0 | 304200 | 4198 |
2021-01-22 | Asia | Armenia | Armenia | 165528 | 0.0 | 153857 | 3021 |
2021-01-22 | Asia | Oman | Oman | 132486 | 0.0 | 124730 | 1517 |
2021-01-22 | Asia | Kuwait | Kuwait | 159834 | 0.0 | 152826 | 951 |
2021-01-22 | Asia | Qatar | Qatar | 148521 | 0.0 | 144619 | 248 |
2021-01-22 | Asia | Thailand | Thailand | 13104 | 0.0 | 10224 | 71 |
2021-01-22 | Asia | Maldives | Maldives | 14765 | 0.0 | 13684 | 50 |
2021-01-22 | Asia | Cambodia | Cambodia | 456 | 0.0 | 399 | 0 |
2021-01-22 | Asia | India | India | 10610883 | 0.0 | 10162738 | 152869 |
2021-01-22 | Asia | Iran | Iran | 1354520 | 0.0 | 1144549 | 57150 |
2021-01-22 | Asia | Lebanon | Lebanon | 269241 | 0.0 | 158822 | 2151 |
2021-01-22 | Asia | Turkey | Turkey | 2412505 | 0.0 | 2290032 | 24640 |
2021-01-22 | Asia | Kazakhstan | Kazakhstan | 219527 | 0.0 | 121347 | 2956 |
2021-01-22 | Asia | Israel | Israel | 582869 | 0.0 | 497578 | 4245 |
2021-01-22 | Asia | The People's Republic of Bangladesh | The People's Republic of Bangladesh | 530271 | 0.0 | 475074 | 7966 |
2021-01-22 | Asia | Malaysia | Malaysia | 172549 | 0.0 | 130152 | 642 |
2021-01-22 | Asia | The United Arab Emirates | The United Arab Emirates | 267258 | 0.0 | 239322 | 766 |
2021-01-22 | Asia | Iraq | Iraq | 611407 | 0.0 | 576725 | 12977 |
2021-01-22 | Asia | Myanmar | Myanmar | 136166 | 0.0 | 119973 | 3013 |
2021-01-22 | Asia | The Republic of Korea | The Republic of Korea | 74262 | 0.0 | 61415 | 1328 |
2021-01-22 | Asia | Sri Lanka | Sri Lanka | 56076 | 0.0 | 47984 | 276 |
2021-01-22 | Asia | Azerbaijan | Azerbaijan | 228246 | 0.0 | 218387 | 3053 |
2021-01-22 | Asia | Syria | Syria | 13313 | 0.0 | 6624 | 858 |
2021-01-22 | Asia | Afghanistan | Afghanistan | 54403 | 0.0 | 46759 | 2363 |
2021-01-22 | Asia | Nepal | Nepal | 268646 | 0.0 | 262868 | 1979 |
2021-01-22 | Asia | Bahrain | Bahrain | 98573 | 0.0 | 95240 | 366 |
2021-01-22 | Asia | Kyrgyzstan | Kyrgyzstan | 83585 | 0.0 | 79509 | 1394 |
2021-01-22 | Asia | Saudi Arabia | Saudi Arabia | 365775 | 0.0 | 357337 | 6342 |
2021-01-22 | Asia | Republic of Yemen | Republic of Yemen | 2119 | 0.0 | 350 | 613 |
2021-01-22 | Asia | Uzbekistan | Uzbekistan | 78219 | 0.0 | 76624 | 620 |
2021-01-22 | Asia | Tajikistan | Tajikistan | 13714 | 0.0 | 12980 | 91 |
2021-01-22 | Asia | Mongolia | Mongolia | 1584 | 0.0 | 1046 | 1 |
2021-01-22 | Asia | Singapore | Singapore | 59197 | 0.0 | 58926 | 29 |
2021-01-22 | Asia | Bhutan | Bhutan | 851 | 0.0 | 660 | 1 |
2021-01-22 | Asia | Vietnam? | Vietnam? | 1546 | 0.0 | 1411 | 35 |
2021-01-22 | Asia | Timor-Leste | Timor-Leste | 53 | 0.0 | 30 | 0 |
2021-01-22 | Asia | Palestine | Palestine | 65802 | 0.0 | 32944 | 561 |
2021-01-22 | Asia | Brunei | Brunei | 174 | 0.0 | 169 | 3 |
2021-01-22 | Asia | Laos | Laos | 41 | 0.0 | 41 | 0 |
2020-05-03 | Asia | Turkmenistan | Turkmenistan | 0 | 0.0 | 0 | 0 |
2020-03-15 | Asia | Reunion | Reunion | 3 | 0.0 | 0 | 0 |
```python
yazhou1 = df_yazhou_1[['countryName', 'province_confirmedCount']]
yazhou1.set_index('countryName', inplace=True)
list(yazhou1.index)
```
['China', 'Indonesia', 'Japan', 'Pakistan', 'The Philippines', 'Cyprus', 'Georgia', 'Jordan', 'Armenia', 'Oman', 'Kuwait', 'Qatar', 'Thailand', 'Maldives', 'Cambodia', 'India', 'Iran', 'Lebanon', 'Turkey', 'Kazakhstan', 'Israel', 'The People's Republic of Bangladesh', 'Malaysia', 'The United Arab Emirates', 'Iraq', 'Myanmar', 'The Republic of Korea', 'Sri Lanka', 'Azerbaijan', 'Syria', 'Afghanistan', 'Nepal', 'Bahrain', 'Kyrgyzstan', 'Saudi Arabia', 'Republic of Yemen', 'Uzbekistan', 'Tajikistan', 'Mongolia', 'Singapore', 'Bhutan', 'Vietnam?', 'Timor-Leste', 'Palestine', 'Brunei', 'Laos', 'Turkmenistan', 'Reunion']
```python
# province_confirmedCount
plt.figure(figsize=(6, 9))  # Adjust figure size
plt.pie(yazhou1['province_confirmedCount'])
plt.legend(list(yazhou1.index))
plt.show()

# province_curedCount
yazhou2 = df_yazhou_1[['countryName', 'province_curedCount']]
yazhou2.set_index('countryName', inplace=True)
plt.figure(figsize=(6, 9))
plt.pie(yazhou2['province_curedCount'])
plt.legend(list(yazhou2.index))
plt.show()

# province_deadCount
yazhou3 = df_yazhou_1[['countryName', 'province_deadCount']]
yazhou3.set_index('countryName', inplace=True)
plt.figure(figsize=(6, 9))
plt.pie(yazhou3['province_deadCount'])
plt.legend(list(yazhou3.index))
plt.show()
```
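With almost fifty countries the pie slices above become unreadable; a small sketch that keeps only the ten largest slices and lumps the rest into a hypothetical 'Other' category:

```python
# Keep the 10 largest confirmed-count slices and merge the rest into 'Other'
confirmed = yazhou1['province_confirmedCount'].sort_values(ascending=False)
top = confirmed.head(10).copy()
top.loc['Other'] = confirmed.iloc[10:].sum()

plt.figure(figsize=(8, 8))
plt.pie(top, labels=top.index, autopct='%1.1f%%')
plt.title('Asia: cumulative confirmed (top 10 + Other)')
plt.show()
```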
(4) Based on your analysis results, give your judgment on the epidemic trend in the next half year. What suggestions do you have for individuals and for society in fighting the epidemic?
```python
from pyecharts import Map, Geo   # pyecharts 0.x API
import pandas as pd

# df = pd.read_csv("Major Cities Annual Data.csv", encoding="gb2312")
city = china['provinceName']
count = china['province_confirmedCount']
hospital_map = Geo("Number of urban areas", "2021-01-22")  # Title and subtitle
hospital_map.use_theme("white")
hospital_map.add("", city, count, maptype="china", is_visualmap=True, is_label_show=True,
                 visual_range=[min(count), max(count)], visual_text_color="#000")
hospital_map.render("hospital map.html")
```
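The cell above uses the pyecharts 0.x API. On a current installation (pyecharts 1.x or later) an equivalent province map would look roughly like the sketch below; this is an assumption about the environment, not part of the assignment's given code:

```python
from pyecharts import options as opts
from pyecharts.charts import Map

# Latest cumulative confirmed count per Chinese province
# (provinceName != 'China' excludes the national-total rows)
cn = areas[(areas.countryName == 'China') & (areas.provinceName != 'China')]
latest = cn.sort_values('updateTime').drop_duplicates('provinceName', keep='last')
pairs = [list(z) for z in zip(latest['provinceName'], latest['province_confirmedCount'])]

m = Map()
m.add("Confirmed", pairs, maptype="china")
m.set_global_opts(
    title_opts=opts.TitleOpts(title="Cumulative confirmed by province", subtitle="latest available date"),
    visualmap_opts=opts.VisualMapOpts(max_=int(latest['province_confirmedCount'].max())),
)
m.render("hospital_map_v1.html")
# Note: the built-in China map matches Chinese province names; if provinceName holds
# English names, the regions may not be coloured.
```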
Additional analysis (optional, but earns a better result)
The additional analysis may use any libraries you like, such as seaborn, pyecharts, and so on.
Feel free to improvise! (If you really have no ideas, browse the Baidu epidemic map or other epidemic-analysis sites for inspiration.)
For example, something like this...
(Example figure: images/chinamap.png, an epidemic map of China.)
Or like this...
(Example figure: images/worldmap.png, a world epidemic map.)
Or like this...
(Example figure: images/meiguitu.png, a Nightingale rose chart.)
We're waiting for you to take up the challenge!