Our Use Case and Objective
In this article, we demonstrate the data discovery process on a COVID-19 dataset. Data discovery is a necessary milestone in any data science project.
We will cover the following topics:

What is Data Science In A Big Data World?

Why Become a Data Scientist?

What are the most frequently mentioned skills in job postings for data science positions?

What are the use cases of Data Science?

What are the five stages of Data Analysis?
First, let's talk about Big Data. Big Data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as an RDBMS (relational database management system). An RDBMS is a type of database management system (DBMS) that stores data in a row-based table structure which connects related data elements. It includes functions that maintain the security, accuracy, integrity, and consistency of the data.
Now let's talk about Data Science. Data Science involves using methods to analyze massive amounts of data and extract the knowledge it contains. You can think of the relationship between big data and data science as being like the relationship between crude oil and an oil refinery.
Why Become a Data Scientist?
This is the first question you will want to ask, so let me show you a couple of reasons and how data science is now shaping the world by solving complex real-world issues.
– Every company now has the ability to collect data, and the amount of data keeps growing. This has led to a higher demand for employees with specific skills who can effectively organize and analyze this data to extract business insights.
– Data scientists now lead data-driven and digital transformation strategies in all kinds of organizations, and they have become important members of any product team.
– The median base salary for data scientists is 12,000 EGP per month in Egypt and $131,850 per year in the US, according to Glassdoor.
Skills required to be a data scientist:
The five most frequently mentioned skills in job postings for data science positions:
1 Python
2 Machine Learning
3 R
4 SQL
5 Mathematics and Statistics
The following image groups the skills and tools used by data scientists
Data Science Real-world applications
The following use cases show some of the real-world issues that data science has helped to resolve
1 Credit/Debit card fraud detection
If you are interested in finance, this topic is for you
Problem:
In recent years, with the development of technology and the digitization of services, the number of bank transactions made with credit cards has risen significantly. Importantly, the increase in credit card payments is also accompanied by an increase in identified fraud.
Solution:
Projects in this area provide insight into fraud prevention, detection, and response by using Data Science with Machine Learning to detect fraud.
2 Healthcare (Drug Development)
The healthcare sector, especially, receives great benefits from data science applications
Problem:
In times of infectious disease (like COVID-19), there is little time to wait for a new drug, yet the process of drug discovery is very complex and involves many specialties. Often the greatest ideas are limited by billions of tests, huge financial spending, and time. On average, it takes twelve years to reach an official submission.
Solution:
Data science applications and machine learning algorithms simplify and shorten this process, adding a perspective to each step from the initial screening of drug compounds to the prediction of the success rate based on biological factors. Such algorithms can forecast how the compound will act in the body using advanced mathematical modeling and simulations instead of "lab experiments".
COVID-19 Analysis
I chose a COVID-19 data set to walk you through the data discovery phase, both to increase awareness of the disease and to give you an example of how to analyze a problem, describe data in meaningful ways, and extract and visualize it.
The scope of this analysis is a causal and epidemiological analysis of the given COVID-19 dataset
Steps of Data Analysis Project:
1 Identify the Problem
COVID-19, first identified in Wuhan City, Hubei Province, China, has spread worldwide with over 40 million infected and 1 million deceased. For the United States, The New York Times is releasing a series of data files with cumulative counts of coronavirus cases at the state and county level, over time. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak. The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020, and The New York Times publishes regular updates to it.
Our goal in this analysis is to answer these questions about the United States:
 Where are new cases increasing and staying high?
 Where are new cases going down?
 Where are new cases the lowest?
 Where are new cases the highest?
 How many COVID-19 patients are hospitalized per day in South Dakota?
 What is the ratio of positive results to total test results in South Dakota?
 What share of positive cases have recovered in Arkansas?
 What share of positive cases have recovered in the top 10 states by recoveries?
2 Getting data
We have two types of data:
1 Historical dataset
Cumulative counts of coronavirus cases and deaths up to, but not including, the current day.
2 Live dataset
Figures that may be a partial count released during the day and cannot necessarily be considered the final, end-of-day tally.
Data Set Description
1 Provincial data:
the number of COVID-19 screening centers, area, and population density of each municipal district

Attribute                  Description
date                       date of each record
state                      US state of the record
positive                   cumulative count of positive cases
negative                   cumulative count of negative cases
totalTestResultsSource     source of the reported COVID-19 test totals
totalTestResults           total COVID-19 test results from the test sources
hospitalizedCurrently      number of patients currently hospitalized
recovered                  cumulative count of recovered cases
death                      cumulative count of deaths
hospitalized               cumulative count of all hospitalized cases
positiveIncrease           new positive cases on this date
negativeIncrease           new negative cases on this date
totalTestResultsIncrease   new COVID-19 test results on this date
deathIncrease              new deaths on this date
hospitalizedIncrease       new hospitalized cases on this date
First step: install the required libraries
# install the required libraries (run inside a Jupyter notebook)
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
Second step: import the required libraries
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Third step: read the data into a DataFrame and show the first 5 rows
# read data from the CSV file
df = pd.read_csv('Daily.csv')
# show the first 5 rows of the data set
df.head()
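If you already know which columns you need, you can ask pandas to load only those. The snippet below is a small sketch of that idea; the CSV text is made up for illustration (it is not the real Daily.csv), and it is read from memory through io.StringIO so the example runs on its own.

```python
import io
import pandas as pd

# Made-up CSV content standing in for Daily.csv (illustrative values only).
csv_text = (
    "date,state,positive,negative,hash\n"
    "20200326,NJ,100,900,abc\n"
    "20200326,SD,10,190,def\n"
)

# usecols keeps only the listed columns while reading.
wanted = ["date", "state", "positive", "negative"]
toy = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(toy.head())
```

With the real file you would pass the file name instead of the io.StringIO object.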
3 Explore and Clean Your Data
Before doing anything else, check whether the date column is in the right format.
# format the date column
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
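To see what this conversion does, here is a minimal sketch on a toy frame. It assumes the dates are stored as integers such as 20200326; the frame and its values are invented for the example.

```python
import pandas as pd

# Toy frame with integer dates (an assumption about how the file stores them).
toy = pd.DataFrame({"date": [20200326, 20200327], "state": ["NJ", "NJ"]})

# Convert to string first so the format is matched digit for digit.
toy["date"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")
print(toy.dtypes)

# Once the column is datetime, rows can be filtered with a Timestamp.
mask = toy["date"] == pd.Timestamp("2020-03-26")
print(toy[mask])
```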
Let's show general information about the dataset
# display information about the data frame, including the number of non-null values in each column
df.info()
Show summary statistics of the data set
# describe the numeric columns (count, mean, std, min, quartiles, max)
df.describe()
After displaying the data set you can see many NaN (missing) values, so let's check how many there are in each column. I use this function to count them:
# check whether any columns contain NaN, and how many
def checkNa(df):
    # init variable to count NaN values
    sumna = 0
    # loop over all columns
    for i in df.columns:
        # count the NaN values in this column
        sumna = df[i].isna().sum()
        # if the count is greater than 0, print it with the column name
        if sumna > 0:
            print('found {} NaN in column {}'.format(sumna, i))
checkNa(df)
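The same count can also be obtained without a loop. This is a short sketch on a toy frame (the values are invented) that uses isna().sum() on the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values.
toy = pd.DataFrame({
    "state": ["SD", "AR", "NJ"],
    "positive": [10.0, np.nan, 30.0],
    "recovered": [np.nan, np.nan, 5.0],
})

# isna() marks missing cells, sum() counts them per column.
na_counts = toy.isna().sum()

# Keep only columns that actually have missing values, largest first.
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_counts)
```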
You can see that many columns have many NaN values. I never use these columns, so I will drop them.
# drop unnecessary columns
df.drop(['pending','dataQualityGrade','onVentilatorCurrently','onVentilatorCumulative','hash','commercialScore','negativeRegularScore','negativeScore','positiveScore','score','grade','inIcuCumulative','hospitalizedDischarged','probableCases','positiveTestsViral'],axis=1,inplace=True)
df.drop(['negativeTestsViral','positiveCasesViral','deathConfirmed','deathProbable','totalTestEncountersViral','totalTestsPeopleViral','totalTestsPeopleAntigen','positiveTestsPeopleAntigen','totalTestsAntigen','dateChecked','dateModified'],axis=1,inplace=True)
df.drop(['totalTestsAntibody','positiveTestsAntibody','negativeTestsAntibody','totalTestsPeopleAntibody','positiveTestsPeopleAntibody','negativeTestsPeopleAntibody','positiveTestsAntigen','inIcuCurrently','hospitalizedCumulative'],axis=1,inplace=True)
df.head()
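Listing the columns by hand works well when you know the data. As an alternative, here is a sketch that drops every column whose share of missing values is above a cut-off. This is a different technique from the one used above, and the 0.5 threshold is a hypothetical value chosen only for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: "pending" is mostly empty, "positive" is mostly filled.
toy = pd.DataFrame({
    "state": ["SD", "AR", "NJ", "FL"],
    "positive": [10.0, 20.0, np.nan, 40.0],
    "pending": [np.nan, np.nan, np.nan, 1.0],
})

max_na_share = 0.5  # hypothetical cut-off, tune it for your own data

# isna().mean() gives the fraction of missing values per column.
na_share = toy.isna().mean()
sparse_cols = na_share[na_share > max_na_share].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(trimmed.columns.tolist())
```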
You still have NaN values, and you can't simply remove every column that contains them. I think you have two options to deal with them:
1 Remove rows with NaN values in the important columns:
# keep only rows where both positive and negative are not NaN (logical AND)
df = df[np.logical_and(df['positive'].notna(), df['negative'].notna())]
checkNa(df)
2 Fill all NaN values with an initial value:
# fill NaN in all columns with 0
df = df.fillna(value=0)
# display the first 5 rows
df.head()
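To compare the two options side by side, here is a small sketch on a toy frame with invented values. It uses dropna with the subset parameter for the first option and fillna for the second.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state": ["SD", "AR", "NJ"],
    "positive": [10.0, np.nan, 30.0],
    "negative": [90.0, 50.0, np.nan],
    "recovered": [np.nan, 5.0, 8.0],
})

# Option 1: drop rows that miss a value in the important columns only.
dropped = toy.dropna(subset=["positive", "negative"])

# Option 2: keep every row and replace missing values with 0.
filled = toy.fillna(0)

print(len(toy), len(dropped), len(filled))
```

Dropping loses rows, while filling keeps them but turns "unknown" into 0, which can distort ratios later on. Which one fits depends on the question you are answering.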
4 Data Analysis & Data Visualization:
Here is the point where you get answers to all the questions asked above
Q1 Where are new cases increasing and staying high?
To answer this question, you select three columns (date, state, positiveIncrease). I wrote a function that selects the positiveIncrease column for a given date, together with the state column, and uses the Matplotlib library to show the result on screen.
def plotincreaseOnDate(var='positiveIncrease',date='20200326'):
    """
    steps:
    1 select all rows on a specific date (20200326)
    2 select 2 columns like [positiveIncrease, state]
    3 plot a bar graph with x-axis (state) and y-axis (positiveIncrease)
    """
    y = df[df['date']==date][var]
    x = df[df['date']==date]['state']
    plt.figure(figsize=(12,4))
    plt.title("Plot of \"{}\" for {}".format(var,'state'),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='blue')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()
plotincreaseOnDate()
As you can see, the positive increase stands out in New Jersey on this date.
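A single date shows where new cases are high on that day. To judge whether they are staying high, one possible approach (not the one used in this article) is a 7-day rolling average per state. The sketch below uses invented numbers, and the 7-day window is an assumption.

```python
import pandas as pd

# Toy data: 10 days for two states (numbers are invented).
dates = pd.date_range("2020-03-20", periods=10, freq="D")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "state": ["NJ"] * 10 + ["SD"] * 10,
    "positiveIncrease": list(range(100, 1100, 100)) + [5] * 10,
})

# Sort so that each state's rows are in date order before rolling.
toy = toy.sort_values(["state", "date"])

# 7-day rolling mean of new cases, computed separately for each state.
toy["avg7"] = (
    toy.groupby("state")["positiveIncrease"]
       .transform(lambda s: s.rolling(7, min_periods=1).mean())
)

# Latest rolling average per state.
latest = toy.groupby("state").tail(1).set_index("state")["avg7"]
print(latest)
```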
Q2 Where are new cases going down?
To answer this question, you select three columns (date, state, negativeIncrease). I wrote a function that selects the negativeIncrease column for a given date, together with the state column, and uses the Matplotlib library to show the result on screen. Note that negativeIncrease counts new negative test results, so it is only a rough proxy for falling case numbers.
def plotdecreaseOnDate(var='negativeIncrease',date='20200326'):
    """
    steps:
    1 select all rows on a specific date (20200326)
    2 select 2 columns like [negativeIncrease, state]
    3 plot a bar graph with x-axis (state) and y-axis (negativeIncrease)
    """
    y = df[df['date']==date][var]
    x = df[df['date']==date]['state']
    plt.figure(figsize=(12,4))
    plt.title("Plot of \"{}\" for {}".format(var,'state'),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='blue')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()
plotdecreaseOnDate()
As you can see, the negative increase stands out in Florida on this date.
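Another way to look at "going down" is to compare the most recent week of new positive cases with the week before it. This is a sketch of that idea under my own assumptions (a 7-day window, invented numbers, and a hypothetical helper named weekly_change); it is not the method used above.

```python
import pandas as pd

# Toy data: 14 days for two states (numbers are invented).
dates = pd.date_range("2020-04-01", periods=14, freq="D")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "state": ["FL"] * 14 + ["NJ"] * 14,
    "positiveIncrease": [200] * 7 + [100] * 7 + [100] * 7 + [300] * 7,
})

def weekly_change(group):
    """Mean of the last 7 days minus mean of the 7 days before that."""
    s = group.sort_values("date")["positiveIncrease"]
    return s.tail(7).mean() - s.tail(14).head(7).mean()

# One value per state; a negative value means new cases are falling.
change = pd.Series({
    state: weekly_change(group) for state, group in toy.groupby("state")
})
falling = change[change < 0].index.tolist()
print(change)
print(falling)
```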
Q3 Where are new cases the lowest?
So now I think I should select three columns (date, state, negative)
def plotnegativeOnDate(var='negative',date='20200326'):
    """
    steps:
    1 select all rows on a specific date (20200326)
    2 select 2 columns like [negative, state]
    3 plot a bar graph with x-axis (state) and y-axis (negative)
    """
    y = df[df['date']==date][var]
    x = df[df['date']==date]['state']
    plt.figure(figsize=(12,4))
    plt.title("Plot of \"{}\" for {}".format(var,'state'),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='blue')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()
plotnegativeOnDate()
As you can see, the lowest state is Florida
Q4 Where are new cases the highest?
So now I think I should select three columns (date, state, positive)
def plotpositiveOnDate(var='positive',date='20200326'):
    """
    steps:
    1 select all rows on a specific date (20200326)
    2 select 2 columns like [positive, state]
    3 plot a bar graph with x-axis (state) and y-axis (positive)
    """
    y = df[df['date']==date][var]
    x = df[df['date']==date]['state']
    plt.figure(figsize=(12,4))
    plt.title("Plot of \"{}\" for {}".format(var,'state'),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='blue')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()
plotpositiveOnDate()
As you can see, the highest state on this date is Arkansas
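Reading the lowest and highest bars from a chart can be hard when there are many states. As a complement, this sketch ranks states directly with nsmallest and nlargest. The numbers are invented, and ranking on positiveIncrease is my assumption about what "new cases" means here.

```python
import pandas as pd

# Toy rows for a single date (numbers are invented).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-26"] * 4),
    "state": ["NJ", "FL", "SD", "AR"],
    "positiveIncrease": [2500, 510, 11, 45],
})

day = toy[toy["date"] == pd.Timestamp("2020-03-26")]

# Two states with the most and the fewest new cases on that date.
highest = day.nlargest(2, "positiveIncrease")[["state", "positiveIncrease"]]
lowest = day.nsmallest(2, "positiveIncrease")[["state", "positiveIncrease"]]

print(highest)
print(lowest)
```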
Q5 How many COVID-19 patients are hospitalized per day in South Dakota?
Here you can select two columns (date, hospitalized) for South Dakota
def plotDepOnState(var='hospitalized', state='SD'):
    """
    steps:
    1 select all rows for a specific state (South Dakota)
    2 select 2 columns like [hospitalized, date]
    3 plot a bar graph with x-axis (date) and y-axis (hospitalized)
    """
    y = df[df['state']==state][var]
    x = df[df['state']==state]['date']
    plt.figure(figsize=(12,4))
    plt.title("Plot of \"{}\" for {}".format(var,state),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='blue')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()
plotDepOnState()
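If you want the numbers behind the chart, a date-indexed series for one state is handy. The sketch below builds one from a toy frame with invented values; I use hospitalizedCurrently here as an assumption, since it holds the number of patients in hospital on each day.

```python
import pandas as pd

# Toy rows, deliberately out of date order (numbers are invented).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-04-03", "2020-04-01", "2020-04-02", "2020-04-01"]
    ),
    "state": ["SD", "SD", "SD", "AR"],
    "hospitalizedCurrently": [30, 10, 20, 99],
})

# Filter one state, sort by date, and index the values by date.
sd_daily = (
    toy[toy["state"] == "SD"]
    .sort_values("date")
    .set_index("date")["hospitalizedCurrently"]
)
print(sd_daily)
```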
Q6 What is the ratio of positive results to total test results in South Dakota?
Here you need the date, positive, and totalTestResults columns, and you select the positive and total values for a given state and date
def ratios(state='SD',ratioabout=['positive','totalTestResults']):
    """
    steps:
    1 take the date of the first row of the data set
    2 select the specific state (South Dakota)
    3 get the state's values of the 2 columns on this date ['positive','totalTestResults']
    4 divide the first value by the second and multiply by 100
    """
    date = df.iloc[0]['date']
    try:
        row = df[(df['state']==state) & (df['date']==date)]
        r = float(row[ratioabout[0]].iloc[0])
        p = float(row[ratioabout[1]].iloc[0])
        return round(r/p*100, 1)
    except (IndexError, ValueError, ZeroDivisionError):
        # no matching row, or a zero denominator: return the sentinel value 1
        return 1
print(str(ratios())+" %")
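The function above returns the ratio for one date. If you want the ratio for every day at once, column arithmetic does it in one step. This is a minimal sketch on invented numbers; positive_pct is a hypothetical column name.

```python
import pandas as pd

# Toy rows for one state over two days (numbers are invented).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-02"]),
    "state": ["SD", "SD"],
    "positive": [10, 30],
    "totalTestResults": [100, 200],
})

sd = toy[toy["state"] == "SD"].copy()

# Share of positive results among all test results, in percent.
sd["positive_pct"] = (sd["positive"] / sd["totalTestResults"] * 100).round(1)
print(sd[["date", "positive_pct"]])
```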
Q7 What share of positive cases have recovered in Arkansas?
Here you can select two columns (positive, recovered) for a given date and state
def ratios(state='AR',ratioabout=['recovered','positive']):
    """
    steps:
    1 take the date of the first row of the data set
    2 select the specific state (Arkansas)
    3 get the state's values of the 2 columns on this date ['recovered','positive']
    4 divide the first value by the second and multiply by 100
    """
    date = df.iloc[0]['date']
    try:
        row = df[(df['state']==state) & (df['date']==date)]
        r = float(row[ratioabout[0]].iloc[0])
        p = float(row[ratioabout[1]].iloc[0])
        return round(r/p*100, 1)
    except (IndexError, ValueError, ZeroDivisionError):
        # no matching row, or a zero denominator: return the sentinel value 1
        return 1
print(str(ratios())+" %")
Q8 What share of positive cases have recovered in the top 10 states by recoveries?
First, you select a specific date, sort the recovered column in descending order, and take the top 10 states, using NumPy to convert the result to an array
def getmaxtenstates(date = '20210304',column='recovered'):
    """
    steps:
    1 select the rows for a specific date
    2 sort by the recovered column in descending order
    3 select the top 10 states with the highest recovered counts
    """
    dfs = df[df['date']==date]
    dfs = dfs.sort_values(by=[column],ascending=False).head(10)
    dfs = np.array(dfs['state'])
    return dfs
getmaxtenstates()
Second, you use the ratio function to calculate the share of positive cases that have recovered in each of the top 10 states
def ratios(state='AR',ratioabout=['recovered','positive']):
    """
    steps:
    1 take the date of the first row of the data set
    2 select the specific state (Arkansas by default)
    3 get the state's values of the 2 columns on this date ['recovered','positive']
    4 divide the first value by the second and multiply by 100
    """
    date = df.iloc[0]['date']
    try:
        row = df[(df['state']==state) & (df['date']==date)]
        r = float(row[ratioabout[0]].iloc[0])
        p = float(row[ratioabout[1]].iloc[0])
        return round(r/p*100, 1)
    except (IndexError, ValueError, ZeroDivisionError):
        # no matching row, or a zero denominator: return the sentinel value 1
        return 1
Then use both functions to calculate the ratio for each state and plot the results as a bar chart
"""
steps:
1 call getmaxtenstates to get the top 10 states
2 call ratios to calculate the ratio between ['recovered','positive'] for each of the 10 states
3 skip a state if ratios returned the sentinel value 1, otherwise append the ratio and the state to 2 lists
4 plot a bar graph of the states with x-axis (state) and y-axis (ratio)
"""
states = getmaxtenstates()
tp,x = [],[]
for s in states:
    data = ratios(s,['recovered','positive'])
    if(data != 1):
        tp.append(data)
        x.append(s)
plt.figure(figsize=(8,4))
plt.title("Recovered-to-positive ratio chart",fontsize=18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.bar(x=x,height=tp,color='blue', edgecolor='k',linewidth=2)
plt.show()
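The loop above can also be written with column arithmetic. Below is a sketch under stated assumptions: the frame holds one row per state for a single date, the numbers are invented, and top_n and recovered_pct are names I made up for the example.

```python
import pandas as pd

# Toy rows, one per state for a single date (numbers are invented).
toy = pd.DataFrame({
    "state": ["AR", "TX", "SD", "NJ"],
    "positive": [1000, 4000, 200, 5000],
    "recovered": [900, 3000, 150, 0],
})

top_n = 3  # the article uses 10; 3 keeps the toy example small

# States with the most recoveries, then recovered / positive in percent.
top = toy.nlargest(top_n, "recovered").copy()
top["recovered_pct"] = (top["recovered"] / top["positive"] * 100).round(1)

print(top[["state", "recovered_pct"]])
```

The resulting frame can be passed to plt.bar with top["state"] on the x-axis and top["recovered_pct"] as the bar heights.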
You can find all the code and analysis from this article in our DataValley Market place through this link