[Python] Using Python for data analysis: a case study of the COVID-19 epidemic

Important Notes

Complete this document only, and then submit it to the course's "Homework" section after completion.

1. File Naming:

Must be named number-name-major-class.ipynb (e.g. 202011030101-Qiao Feng-Measurement and Control Technology and Instrument-Measurement and Control 20-1.ipynb)

In addition, change the second line at the beginning of the document (marked in the document) to your own number-name-major-class

2. Deadline:

24:00 on April 30, 2021

3. Scoring rules:

Analysis document : completeness : code quality = 3:5:2

The analysis document refers to your ideas, explanations, and interpretation of the results of your data analysis (keep it short and concise; this is not an essay).

P.S. Code you write yourself is far better than anything else, no matter how beautiful or ugly it is; just make sure today is better than yesterday! Keep going!

Reminder:

The epidemic is still raging, so please take active steps to protect yourself

I wish you all good results

Since there is a lot of data, use head() or tail() to preview it so that the program does not hang for a long time

=======================

The data for this project come from Dingxiangyuan (DXY). The main purpose of this project is to better understand the epidemic and its development through analysis of historical epidemic data, and to provide data support for decision-making in the fight against the epidemic.

1. Ask questions

The main research questions, covering three aspects (the national level, your own province or city, and the situation abroad), are as follows:

(1) What is the trend of the cumulative confirmed/suspected/cured/death counts over time for the whole country?

(2) What is the situation in your province or city?

(3) What is the overall situation of the global epidemic?

(4) Give your judgment on the epidemic trend in the next half year based on your analysis results, and what suggestions do you have for individuals and society in fighting the epidemic?

2. Understanding data

Original dataset: AreaInfo.csv. Import the relevant packages, read the data, and assign it to areas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

areas = pd.read_csv('data/AreaInfo.csv')
areas
(English-name columns, which duplicate the names shown, and the city-level columns, which are all NaN in these rows, are omitted below; province_* count columns are abbreviated to confirmed/suspected/cured/dead)

        continentName  countryName  provinceName           province_zipCode  confirmed  suspected  cured     dead    updateTime
0       Asia           China        Macao                  820000            47         9.0        46        0       2021-01-22 23:40:08
1       North America  U.S.A        U.S.A                  971002            24632468   0.0        10845438  410378  2021-01-22 23:40:08
2       South America  Brazil       Brazil                 973003            8699814    0.0        7580741   214228  2021-01-22 23:40:08
3       Europe         Belgium      Belgium                961001            686827     0.0        19239     20620   2021-01-22 23:40:08
4       Europe         Russia       Russia                 964006            3677352    0.0        3081536   68412   2021-01-22 23:40:08
...
429911  Asia           China        Liaoning Province      210000            0          1.0        0         0       2020-01-22 03:28:10
429912  Asia           China        Taiwan                 710000            1          0.0        0         0       2020-01-22 03:28:10
429913  Asia           China        Hongkong               810000            0          117.0      0         0       2020-01-22 03:28:10
429914  Asia           China        Heilongjiang Province  230000            0          1.0        0         0       2020-01-22 03:28:10
429915  Asia           China        Hunan Province         430000            1          0.0        0         0       2020-01-22 03:28:10

429916 rows × 19 columns

areas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429916 entries, 0 to 429915
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   continentName            429872 non-null  object 
 1   continentEnglishName     429872 non-null  object 
 2   countryName              429916 non-null  object 
 3   countryEnglishName       403621 non-null  object 
 4   provinceName             429916 non-null  object 
 5   provinceEnglishName      403621 non-null  object 
 6   province_zipCode         429916 non-null  int64  
 7   province_confirmedCount  429916 non-null  int64  
 8   province_suspectedCount  429913 non-null  float64
 9   province_curedCount      429916 non-null  int64  
 10  province_deadCount       429916 non-null  int64  
 11  updateTime               429916 non-null  object 
 12  cityName                 125143 non-null  object 
 13  cityEnglishName          119660 non-null  object 
 14  city_zipCode             123877 non-null  float64
 15  city_confirmedCount      125143 non-null  float64
 16  city_suspectedCount      125143 non-null  float64
 17  city_curedCount          125143 non-null  float64
 18  city_deadCount           125143 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 62.3+ MB

View the data and its summary statistics to get a general idea of it

Introduction to the meaning of the relevant fields:

Tips:

For foreign data, provinceName is not a province name but is filled with the country name; that is, foreign data are not subdivided into provinces.

The Chinese data also contain records whose provinceName is 'China', which represent the nationwide total over all provinces on that day. Used well, this makes analyzing the national situation much easier.

3. Data cleaning

(1) Basic data processing

Data cleaning mainly includes: selecting subsets, missing data processing, data format conversion, outlier data processing, etc.

Tip: Because the data are reported by every country, underreporting and missed reports are unavoidable when the situation is critical, so there are many missing values. They can be filled in or dropped; see "Treatment of missing values in Pandas.ipynb"
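The two options mentioned in the tip can be sketched on a toy frame (the column names and values here are hypothetical, not taken from AreaInfo.csv):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for epidemic records with gaps (hypothetical data)
toy = pd.DataFrame({
    'country':   ['A', 'B', 'C'],
    'confirmed': [10, np.nan, 30],
    'cured':     [np.nan, 5, 20],
})

print(toy.isna().sum())      # how many values are missing in each column
dropped = toy.dropna()       # option 1: discard any row containing a NaN
filled = toy.fillna(0)       # option 2: fill the gaps with a neutral value
print(len(dropped), len(filled))
```

Dropping is simpler but loses whole rows; filling keeps the rows at the cost of inventing a value, so choose per column.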

Selection of domestic epidemic data (the final selected data is named china)

  1. Select the domestic epidemic data

  2. Convert the updateTime column to a date type, extract the year-month-day, and check the result. (Tip: dt.date)

  3. Since the data are updated hourly and a day contains many near-duplicates, deduplicate so that only the latest record of each day is kept.

Tip: df.drop_duplicates(subset=['provinceName','updateTime'], keep='first', inplace=False)

Where df is the DataFrame you selected for the domestic epidemic data

  4. Remove the columns outside the scope of this study, keeping only ['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount','updateTime'], and index the rows with 'updateTime'.

Tip: Either method works: (1) select these columns, or (2) drop the remaining ones.
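The deduplication step deserves one remark: keep='first' only keeps the latest record because this dataset happens to be ordered newest-first. A sketch on toy data (hypothetical values) that does not depend on that ordering first sorts by updateTime with a stable sort, so the within-day order is preserved:

```python
import pandas as pd

# Hypothetical reports: two on 2020-02-01 (the first row is the later report), one on 02-02
toy = pd.DataFrame({
    'provinceName': ['Hubei', 'Hubei', 'Hubei'],
    'updateTime':   ['2020-02-01', '2020-02-01', '2020-02-02'],
    'confirmed':    [100, 90, 120],
})

# mergesort is stable, so within each day the newest-first order survives the sort
# and keep='first' then retains the latest report of every day
daily = toy.sort_values('updateTime', kind='mergesort').drop_duplicates(
    subset=['provinceName', 'updateTime'], keep='first')
print(daily)
```

The result keeps one row per (province, day): 100 for 02-01 and 120 for 02-02.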

# The code is given here, and the acquisition of later provincial and global data is much the same.
china = areas.loc[areas.countryName=='China',:].copy()
china['updateTime'] = pd.to_datetime(china.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
china = china.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

# Convert the string date column to a DatetimeIndex
china['updateTime'] = pd.to_datetime(china['updateTime'])
china.set_index('updateTime',inplace=True)

china = china[['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount']]
china = china[china.provinceName=='China']
china.head()
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName  provinceName  confirmed  suspected  cured  dead
2021-01-22  Asia           China        China         99667      0.0        92275  4810
2021-01-21  Asia           China        China         99513      0.0        92198  4809
2021-01-20  Asia           China        China         99285      0.0        92130  4808
2021-01-19  Asia           China        China         99094      0.0        92071  4806
2021-01-18  Asia           China        China         98922      0.0        91994  4805

Check the data for missing values and correct data types. If there are missing values, fill in or drop them; see **"Processing missing values in Pandas.ipynb"**

# Check the data information for missing data/data type correctness. If there are missing values, they can be completed or discarded
china.info()
china.head(5)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 305 entries, 2021-01-22 to 2020-03-15
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            305 non-null    object 
 1   countryName              305 non-null    object 
 2   provinceName             305 non-null    object 
 3   province_confirmedCount  305 non-null    int64  
 4   province_suspectedCount  305 non-null    float64
 5   province_curedCount      305 non-null    int64  
 6   province_deadCount       305 non-null    int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 19.1+ KB
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName  provinceName  confirmed  suspected  cured  dead
2021-01-22  Asia           China        China         99667      0.0        92275  4810
2021-01-21  Asia           China        China         99513      0.0        92198  4809
2021-01-20  Asia           China        China         99285      0.0        92130  4808
2021-01-19  Asia           China        China         99094      0.0        92071  4806
2021-01-18  Asia           China        China         98922      0.0        91994  4805

Selection of epidemic data in your province or city (the final selected data is named myhome)

This step can also be done later

  1. Select the epidemic data of your province or city (refined to the city level; for a municipality directly under the central government, refined to the district level)

  2. Convert the updateTime column to a date type, extract the year-month-day, and check the result. (Tip: dt.date)

  3. Since the data are updated hourly and a day contains many near-duplicates, deduplicate so that only the latest record of each day is kept, and index the rows with 'updateTime'.

Tip: df.drop_duplicates(subset=['cityName','updateTime'], keep='first', inplace=False)

  4. Remove the columns not covered by this study

Tip: df.drop(['continentName','continentEnglishName','countryName','countryEnglishName','provinceEnglishName',
'province_zipCode','cityEnglishName','updateTime','city_zipCode'],axis=1,inplace=True)

Where df is the DataFrame for the epidemic data of the province or city you selected

# Provincial data acquisition (Sichuan Province as the example)
df = areas.loc[areas.provinceName=='Sichuan Province',:].copy()
df['updateTime'] = pd.to_datetime(df.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
# Keep only the latest record per province per day
df = df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

# Convert the string date column back to datetime
df['updateTime'] = pd.to_datetime(df['updateTime'])

# Drop the columns outside the scope of this study
df.drop(['continentName','continentEnglishName','countryName','countryEnglishName','provinceEnglishName','province_zipCode','cityEnglishName','updateTime','city_zipCode'],axis=1,inplace=True)
df
(count columns abbreviated; the index is the original row number)

        provinceName      confirmed  suspected  cured  dead  cityName  city_confirmed  city_suspected  city_cured  city_dead
2538    Sichuan Province  865        14.0       852    3     Chengdu   468.0           13.0            455.0       3.0
7192    Sichuan Province  865        14.0       852    3     Chengdu   468.0           13.0            455.0       3.0
8912    Sichuan Province  863        14.0       852    3     Chengdu   466.0           13.0            455.0       3.0
10704   Sichuan Province  863        14.0       847    3     Chengdu   466.0           13.0            450.0       3.0
12175   Sichuan Province  862        14.0       847    3     Chengdu   465.0           13.0            450.0       3.0
...
427258  Sichuan Province  44         0.0        0      0     Chengdu   22.0            0.0             0.0         0.0
428070  Sichuan Province  28         0.0        0      0     Chengdu   16.0            0.0             0.0         0.0
428837  Sichuan Province  15         0.0        0      0     Chengdu   7.0             0.0             0.0         0.0
429846  Sichuan Province  8          2.0        0      0     NaN       NaN             NaN             NaN         NaN
429866  Sichuan Province  5          2.0        0      0     NaN       NaN             NaN             NaN         NaN

267 rows × 10 columns

df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 267 entries, 2538 to 429866
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   provinceName             267 non-null    object 
 1   province_confirmedCount  267 non-null    int64  
 2   province_suspectedCount  267 non-null    float64
 3   province_curedCount      267 non-null    int64  
 4   province_deadCount       267 non-null    int64  
 5   cityName                 265 non-null    object 
 6   city_confirmedCount      265 non-null    float64
 7   city_suspectedCount      265 non-null    float64
 8   city_curedCount          265 non-null    float64
 9   city_deadCount           265 non-null    float64
dtypes: float64(5), int64(3), object(2)
memory usage: 22.9+ KB
(count columns abbreviated; the index is the original row number)

        provinceName      confirmed  suspected  cured  dead  cityName  city_confirmed  city_suspected  city_cured  city_dead
2538    Sichuan Province  865        14.0       852    3     Chengdu   468.0           13.0            455.0       3.0
7192    Sichuan Province  865        14.0       852    3     Chengdu   468.0           13.0            455.0       3.0
8912    Sichuan Province  863        14.0       852    3     Chengdu   466.0           13.0            455.0       3.0
10704   Sichuan Province  863        14.0       847    3     Chengdu   466.0           13.0            450.0       3.0
12175   Sichuan Province  862        14.0       847    3     Chengdu   465.0           13.0            450.0       3.0
df_clean = df.dropna()
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 265 entries, 2538 to 428837
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   provinceName             265 non-null    object 
 1   province_confirmedCount  265 non-null    int64  
 2   province_suspectedCount  265 non-null    float64
 3   province_curedCount      265 non-null    int64  
 4   province_deadCount       265 non-null    int64  
 5   cityName                 265 non-null    object 
 6   city_confirmedCount      265 non-null    float64
 7   city_suspectedCount      265 non-null    float64
 8   city_curedCount          265 non-null    float64
 9   city_deadCount           265 non-null    float64
dtypes: float64(5), int64(3), object(2)
memory usage: 22.8+ KB

Check the data for missing values and correct data types. If there are missing values, fill in or drop them; see **"Processing missing values in Pandas.ipynb"**

Global epidemic data selection (the final selected data is named world)

This step can also be done later

  1. Select the epidemic data from abroad

Tip: select the foreign epidemic data with countryName != 'China'

  2. Convert the updateTime column to a date type, extract the year-month-day, and check the result. (Tip: dt.date)

  3. Since the data are updated hourly and a day contains many near-duplicates, deduplicate so that only the latest record of each day is kept.

Tip: df.drop_duplicates(subset=['provinceName','updateTime'], keep='first', inplace=False)

Where df is the DataFrame you selected for the foreign epidemic data

  4. Remove the columns outside the scope of this study, keeping only ['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount','updateTime'], and index the rows with 'updateTime'.

Tip: Either method works: (1) select these columns, or (2) drop the remaining ones.

  5. Get the global data

Tip: use the concat function to stack the china data from before with the foreign data along axis 0 (rows) to obtain the global data.
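Step 5 can be sketched with two toy frames (the values are hypothetical): pd.concat with axis=0 stacks the rows and aligns columns by name.

```python
import pandas as pd

# Toy stand-ins for the domestic and foreign frames (hypothetical values)
cn = pd.DataFrame({'countryName': ['China'],
                   'province_confirmedCount': [99667]})
ab = pd.DataFrame({'countryName': ['Brazil', 'Russia'],
                   'province_confirmedCount': [8699814, 3677352]})

# axis=0 stacks rows; columns with the same name are aligned
world_toy = pd.concat([cn, ab], axis=0, ignore_index=True)
print(world_toy.shape)   # one frame of 3 rows, 2 columns
```

In the actual pipeline, ignore_index is omitted so the DatetimeIndex of both frames is preserved.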

# Foreign epidemic data acquisition
o = areas.loc[areas.countryName!='China',:].copy()
o['updateTime'] = pd.to_datetime(o.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
# Keep only the latest record per country/province per day
o = o.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

# Convert the string date column to a DatetimeIndex
o['updateTime'] = pd.to_datetime(o['updateTime'])

o = o[['continentName','countryName','provinceName','province_confirmedCount','province_suspectedCount','province_curedCount','province_deadCount','updateTime']]

o.set_index('updateTime',inplace=True)
o
o
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName  provinceName  confirmed  suspected  cured     dead
2021-01-22  North America  U.S.A        U.S.A         24632468   0.0        10845438  410378
2021-01-22  South America  Brazil       Brazil        8699814    0.0        7580741   214228
2021-01-22  Europe         Belgium      Belgium       686827     0.0        19239     20620
2021-01-22  Europe         Russia       Russia        3677352    0.0        3081536   68412
2021-01-22  Europe         Serbia       Serbia        436121     0.0        50185     5263
...
2020-01-27  NaN            Malaysia     Malaysia      3          0.0        0         0
2020-01-27  NaN            France       France        3          0.0        0         0
2020-01-27  NaN            Vietnam?     Vietnam?      2          0.0        0         0
2020-01-27  NaN            Nepal        Nepal         1          0.0        0         0
2020-01-27  NaN            Canada       Canada        1          0.0        0         0

62110 rows × 7 columns

world = pd.concat([china,o])
world
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName  provinceName  confirmed  suspected  cured  dead
2021-01-22  Asia           China        China         99667      0.0        92275  4810
2021-01-21  Asia           China        China         99513      0.0        92198  4809
2021-01-20  Asia           China        China         99285      0.0        92130  4808
2021-01-19  Asia           China        China         99094      0.0        92071  4806
2021-01-18  Asia           China        China         98922      0.0        91994  4805
...
2020-01-27  NaN            Malaysia     Malaysia      3          0.0        0      0
2020-01-27  NaN            France       France        3          0.0        0      0
2020-01-27  NaN            Vietnam?     Vietnam?      2          0.0        0      0
2020-01-27  NaN            Nepal        Nepal         1          0.0        0      0
2020-01-27  NaN            Canada       Canada        1          0.0        0      0

62415 rows × 7 columns

world.info()
world_clean = world.dropna()
world_clean.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62415 entries, 2021-01-22 to 2020-01-27
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            62377 non-null  object 
 1   countryName              62415 non-null  object 
 2   provinceName             62415 non-null  object 
 3   province_confirmedCount  62415 non-null  int64  
 4   province_suspectedCount  62415 non-null  float64
 5   province_curedCount      62415 non-null  int64  
 6   province_deadCount       62415 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62377 entries, 2021-01-22 to 2020-01-29
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            62377 non-null  object 
 1   countryName              62377 non-null  object 
 2   provinceName             62377 non-null  object 
 3   province_confirmedCount  62377 non-null  int64  
 4   province_suspectedCount  62377 non-null  float64
 5   province_curedCount      62377 non-null  int64  
 6   province_deadCount       62377 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 3.8+ MB

Check the data for missing values and correct data types.

Tip: There are many missing values, because the data are reported by every country and underreporting is unavoidable when the situation is critical. You can fill in or drop the missing values; see **"Handling missing values in Pandas.ipynb"**

4. Data analysis and visualization

For data analysis and visualization, select the required variables based on each question and create a new DataFrame for analysis and visualization so that the data is less cluttered and more organized.

Basic analysis

For basic analysis, only numpy, pandas, and matplotlib libraries are allowed.

Several series can share one set of axes, or you can use multiple subplots.

Please choose the chart type (line chart, pie chart, histogram, scatter chart, etc.) according to the purpose of the analysis. If you really have no ideas, look at the Baidu epidemic map or other epidemic-analysis sites for inspiration.
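As one way of putting several series into a single figure (a sketch on synthetic counts, not the real dataset), plt.subplots returns a figure and axes, and repeated plot calls on one axes overlay the curves:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic cumulative counts (hypothetical numbers, for illustration only)
dates = pd.date_range('2020-03-15', periods=5)
toy = pd.DataFrame({'confirmed': [100, 120, 150, 170, 200],
                    'cured':     [10, 30, 60, 90, 130],
                    'dead':      [1, 2, 2, 3, 3]}, index=dates)

fig, ax = plt.subplots()
for col in toy.columns:          # overlay the three curves on one axes
    ax.plot(toy.index, toy[col], label=col)
ax.set_title('China (toy data)')
ax.set_xlabel('Time')
ax.set_ylabel('Count')
ax.legend()
fig.savefig('toy_trend.png')
```

With `plt.subplots(1, 3)` the same loop would instead fill three side-by-side panels.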

(1) What is the trend of the cumulative confirmed/cured/death counts over time for the whole country?

china
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName  provinceName  confirmed  suspected  cured  dead
2021-01-22  Asia           China        China         99667      0.0        92275  4810
2021-01-21  Asia           China        China         99513      0.0        92198  4809
2021-01-20  Asia           China        China         99285      0.0        92130  4808
2021-01-19  Asia           China        China         99094      0.0        92071  4806
2021-01-18  Asia           China        China         98922      0.0        91994  4805
...
2020-03-19  Asia           China        China         81263      0.0        70561  3250
2020-03-18  Asia           China        China         81202      0.0        69777  3242
2020-03-17  Asia           China        China         81135      0.0        68820  3231
2020-03-16  Asia           China        China         81099      0.0        67930  3218
2020-03-15  Asia           China        China         81062      0.0        67037  3204

305 rows × 7 columns

# Accumulated diagnosis nationwide
plt.figure()
plt.plot(china.province_confirmedCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()

# Accumulated national cure
plt.figure()
plt.plot(china.province_curedCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()

# Accumulated national deaths
plt.figure()
plt.plot(china.province_deadCount)
plt.title('China')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()

(2) What is the situation in your province or city?

# Cumulative diagnosis in Sichuan
plt.figure()
plt.plot(df.province_confirmedCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()

# Cumulative cures in Sichuan
plt.figure()
plt.plot(df.province_curedCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()

# Cumulative deaths in Sichuan
plt.figure()
plt.plot(df.province_deadCount)
plt.title('Sichuan')
plt.xlabel('Time')
plt.ylabel('Count')
plt.show()


(3) What is the global epidemic situation?

  1. Which countries are the global TOP10 by epidemic severity?

  2. How do the continents compare?

  3. Select a continent of interest and analyze the links, distribution, comparison, and composition of the epidemic among its countries.

Tip: make use of pivot tables, grouping, and merging
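For the TOP10 question, one hedged sketch (toy values, hypothetical country names): since the counts are cumulative, each country's maximum is its latest value, so groupby(...).max() followed by nlargest gives the ranking (here nlargest(2) on a toy frame; nlargest(10) for the real data):

```python
import pandas as pd

# Toy stand-in for the cleaned global frame (hypothetical values)
toy = pd.DataFrame({
    'countryName': ['A', 'A', 'B', 'B', 'C'],
    'province_confirmedCount': [10, 50, 40, 80, 5],
})

# Cumulative counts only grow, so the max per country is its latest value
latest = toy.groupby('countryName')['province_confirmedCount'].max()
top = latest.nlargest(2)
print(top)
```

The same pattern with 'province_deadCount' or 'province_curedCount' ranks countries on the other measures.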

# Which countries are the global TOP10 by cumulative confirmed cases?

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly

# Drop records where (count, country) repeats, keeping each country's time series
world_clean = world_clean.drop_duplicates(subset=['province_confirmedCount','countryName'], keep='first', inplace=False)

# Each country's maximum (i.e. latest, since the counts are cumulative) value, top 10
top10 = (world_clean.groupby('countryName')['province_confirmedCount']
         .max().sort_values(ascending=False).head(10))
print(top10)

ww = world_clean.copy()
for i in top10.index:
    ii = ww.loc[ww['countryName'] == i]
    plt.figure()
    plt.title(i)
    plt.plot(ii.province_confirmedCount)
    plt.show()

# How do the continents compare?
www = world_clean.copy()
for i in www['continentName'].drop_duplicates():
    ii = www.loc[www['continentName'] == i]
    plt.figure()
    plt.title(i)
    plt.plot(ii.province_confirmedCount)
    plt.show()

(figures: output_32_0.png – output_32_6.png, one cumulative-confirmed curve per continent)

# Select a continent of interest to analyze the links, distribution, comparison, and composition of epidemics among countries.

ww1 = world_clean.copy()
df_yazhou = ww1.loc[ww1['continentName']=='Asia']
df_yazhou
(province_* count columns abbreviated to confirmed/suspected/cured/dead)

updateTime  continentName  countryName               provinceName              confirmed  suspected  cured  dead
2021-01-22  Asia           China                     China                     99667      0.0        92275  4810
2021-01-21  Asia           China                     China                     99513      0.0        92198  4809
2021-01-20  Asia           China                     China                     99285      0.0        92130  4808
2021-01-19  Asia           China                     China                     99094      0.0        92071  4806
2021-01-18  Asia           China                     China                     98922      0.0        91994  4805
...
2020-02-01  Asia           Thailand                  Thailand                  19         0.0        5      0
2020-02-01  Asia           Malaysia                  Malaysia                  8          0.0        0      0
2020-02-01  Asia           The United Arab Emirates  The United Arab Emirates  4          0.0        0      0
2020-02-01  Asia           India                     India                     1          0.0        0      0
2020-01-29  Asia           The United Arab Emirates  The United Arab Emirates  1          0.0        0      0

10960 rows × 7 columns

# A day contains many records; we only care about each country's maximum
# (i.e. latest) confirmed/cured/dead counts, so keep the first record per
# country (the data are ordered newest-first).
df_yazhou_1 = df_yazhou.drop_duplicates(subset=['countryName'], keep='first', inplace=False)
df_yazhou_1
(count columns abbreviated; for every row, continentName = Asia and provinceName = countryName)

updateTime  countryName                          confirmed  suspected  cured     dead
2021-01-22  China                                99667      0.0        92275     4810
2021-01-22  Indonesia                            965283     0.0        781147    27453
2021-01-22  Japan                                351020     0.0        279214    4830
2021-01-22  Pakistan                             528891     0.0        482771    11204
2021-01-22  The Philippines                      509887     0.0        467720    10136
2021-01-22  Cyprus                               29472      0.0        2057      176
2021-01-22  Georgia                              251071     0.0        238101    2998
2021-01-22  Jordan                               318181     0.0        304200    4198
2021-01-22  Armenia                              165528     0.0        153857    3021
2021-01-22  Oman                                 132486     0.0        124730    1517
2021-01-22  Kuwait                               159834     0.0        152826    951
2021-01-22  Qatar                                148521     0.0        144619    248
2021-01-22  Thailand                             13104      0.0        10224     71
2021-01-22  Maldives                             14765      0.0        13684     50
2021-01-22  Cambodia                             456        0.0        399       0
2021-01-22  India                                10610883   0.0        10162738  152869
2021-01-22  Iran                                 1354520    0.0        1144549   57150
2021-01-22  Lebanon                              269241     0.0        158822    2151
2021-01-22  Turkey                               2412505    0.0        2290032   24640
2021-01-22  Kazakhstan                           219527     0.0        121347    2956
2021-01-22  Israel                               582869     0.0        497578    4245
2021-01-22  The People's Republic of Bangladesh  530271     0.0        475074    7966
2021-01-22  Malaysia                             172549     0.0        130152    642
2021-01-22  The United Arab Emirates             267258     0.0        239322    766
2021-01-22  Iraq                                 611407     0.0        576725    12977
2021-01-22  Myanmar                              136166     0.0        119973    3013
2021-01-22  The Republic of Korea                74262      0.0        61415     1328
2021-01-22  Sri Lanka                            56076      0.0        47984     276
2021-01-22  Azerbaijan                           228246     0.0        218387    3053
2021-01-22  Syria                                13313      0.0        6624      858
2021-01-22  Afghanistan                          54403      0.0        46759     2363
2021-01-22  Nepal                                268646     0.0        262868    1979
2021-01-22  Bahrain                              98573      0.0        95240     366
2021-01-22  Kyrgyzstan                           83585      0.0        79509     1394
2021-01-22  Saudi Arabia                         365775     0.0        357337    6342
2021-01-22  Republic of Yemen                    2119       0.0        350       613
2021-01-22  Uzbekistan                           78219      0.0        76624     620
2021-01-22  Tajikistan                           13714      0.0        12980     91
2021-01-22  Mongolia                             1584       0.0        1046      1
2021-01-22  Singapore                            59197      0.0        58926     29
2021-01-22  Bhutan                               851        0.0        660       1
2021-01-22  Vietnam?                             1546       0.0        1411      35
2021-01-22  Timor-Leste                          53         0.0        30        0
2021-01-22  Palestine                            65802      0.0        32944     561
2021-01-22  Brunei                               174        0.0        169       3
2021-01-22  Laos                                 41         0.0        41        0
2020-05-03  Turkmenistan                         0          0.0        0         0
2020-03-15  Reunion                              3          0.0        0         0
yazhou1 = df_yazhou_1[['countryName','province_confirmedCount']]
yazhou1.set_index('countryName',inplace=True)
yazhou1
list(yazhou1.index)
['China',
 'Indonesia',
 'Japan',
 'Pakistan',
 'The Philippines',
 'Cyprus',
 'Georgia',
 'Jordan',
 'Armenia',
 'Oman',
 'Kuwait',
 'Qatar',
 'Thailand',
 'Maldives',
 'Cambodia',
 'India',
 'Iran',
 'Lebanon',
 'Turkey',
 'Kazakhstan',
 'Israel',
 "The People's Republic of Bangladesh",
 'Malaysia',
 'The United Arab Emirates',
 'Iraq',
 'Myanmar',
 'The Republic of Korea',
 'Sri Lanka',
 'Azerbaijan',
 'Syria',
 'Afghanistan',
 'Nepal',
 'Bahrain',
 'Kyrgyzstan',
 'Saudi Arabia',
 'Republic of Yemen',
 'Uzbekistan',
 'Tajikistan',
 'Mongolia',
 'Singapore',
 'Bhutan',
 'Vietnam?',
 'Timor-Leste',
 'Palestine',
 'Brunei',
 'Laos',
 'Turkmenistan',
 'Reunion']
# Composition of cumulative confirmed cases among Asian countries

plt.figure(figsize=(6,9))  # adjust figure size
plt.pie(yazhou1['province_confirmedCount'])
plt.legend(list(yazhou1.index))
plt.show()


# Composition of cumulative cured counts
yazhou2 = df_yazhou_1[['countryName','province_curedCount']]
yazhou2.set_index('countryName',inplace=True)
plt.figure(figsize=(6,9))
plt.pie(yazhou2['province_curedCount'])
plt.legend(list(yazhou2.index))
plt.show()


# Composition of cumulative death counts
yazhou3 = df_yazhou_1[['countryName','province_deadCount']]
yazhou3.set_index('countryName',inplace=True)
plt.figure(figsize=(6,9))
plt.pie(yazhou3['province_deadCount'])
plt.legend(list(yazhou3.index))
plt.show()

(4) Give your judgment on the epidemic trend in the next half year based on your analysis results, and what suggestions do you have for individuals and society in fighting the epidemic?

from pyecharts import Map, Geo  # pyecharts 0.5.x API
import pandas as pd

city = china['provinceName']
count = china['province_confirmedCount']
# Title and subtitle
hospital_map = Geo("Number of urban areas", "2021-01-22")
hospital_map.add("", city, count, maptype="china", is_visualmap=True, is_label_show=True,
                 visual_range=[min(count), max(count)], visual_text_color="#000")
hospital_map.render("hospital map.html")

Additional analysis (optional, but earns a better grade)

For the additional analysis you may use any libraries, such as seaborn, pyecharts, and so on.

Classmates, feel free to improvise!! (If you really have no ideas, look at the Baidu epidemic map or other epidemic-analysis sites for inspiration.)

For example, like this...

(example figure: images/chinamap.png)

Or like this...

(example figure: images/worldmap.png)

Or like this...

(example figure: images/meiguitu.png)

Experts, we're waiting for you to join the battle!



Tags: Big Data Python Data Analysis Visualization data visualization

Posted by Afrojojo on Thu, 02 Jun 2022 20:19:38 +0530