Data preparation and data discovery consume a great deal of time in any data science or data analytics job. One solution is to write a template script that you can reuse in this phase of your work. But wouldn’t it be great to add interactive, dynamic controls to that script? Fortunately, we now have this option with Streamlit, an open-source application framework that enables data scientists and machine learning engineers to create beautiful, performant apps. In this article, we will use Streamlit to automate the most common tasks in the data discovery and data analysis phase, and to make the script as dynamic as possible.
Our application will have the following features:
- Allow users to select a CSV file containing the required dataset
- Show a preview of the dataset
- Show general statistics of the dataset
- Allow users to select one column of the dataset and profile its values
- Allow users to select one or more columns and show general statistics for those columns
At the end our application will look like the following
Install Streamlit
Before we dive into the code, let’s see how to install Streamlit. It is easy and straightforward:
pip install streamlit
Note: if you are installing on Windows, it is recommended to install using the Anaconda CLI.
Application Code
First, we import the libraries we need: Streamlit and Pandas.
import streamlit as st
import pandas as pd
In Streamlit, write is the magic method: it accepts almost any type of object, such as charts, dataframes, and text. Whatever we pass to it is rendered in the application in a format suited to the object’s type. For example, when we pass a Pandas dataframe, write displays a tabular preview of the data. Widgets, by contrast, render themselves when they are called and return the user’s current input, which can in turn be passed to write.
We start with the title method, which prints a title header in our application:
# Set title for the Application
st.title("Dataset Discovery")
Then we use the file_uploader widget to let users upload the CSV file containing the dataset they want to explore:
# Upload data
upload_file = st.file_uploader("Choose Sample data", type='csv')
Next, we check whether a file has been selected and, if so, pass it to the Pandas read_csv function to load its contents into a dataframe:
if upload_file is not None:
    # read source data
    source_data = pd.read_csv(upload_file)
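The uploader hands back either None or a file-like object, which is why it can go straight into read_csv. As a minimal sketch that runs outside Streamlit, the snippet below uses an in-memory io.BytesIO buffer to stand in for the uploaded file. The helper name load_dataset and the sample CSV content are hypothetical and only serve the illustration.

```python
import io
from typing import BinaryIO, Optional

import pandas as pd


def load_dataset(upload_file: Optional[BinaryIO]) -> Optional[pd.DataFrame]:
    """Return a DataFrame for an uploaded CSV, or None if nothing was uploaded."""
    if upload_file is None:
        return None
    return pd.read_csv(upload_file)


# Hypothetical sample data standing in for a file chosen in the browser.
fake_upload = io.BytesIO(b"name,age\nAlice,30\nBob,25\n")
source_data = load_dataset(fake_upload)

print(source_data.shape)   # rows and columns read from the buffer
print(load_dataset(None))  # nothing uploaded yet
```

Keeping the None check in one place makes it easy to see that nothing downstream should touch the dataframe before a file has been chosen.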
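In this section of the code, we use the subheader method to print a small header describing the data shown in this part of the app, then use the write method to display the dataframe. This snippet and the ones that follow belong inside the if block above, so they only run once a file has been uploaded: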
# Preview data
st.subheader("Preview data")
st.write(source_data)
Next, we use the Pandas describe method to view general statistics of the dataframe:
# Print data summary
st.subheader("Summary Statistics")
st.write(source_data.describe())
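One detail worth knowing: on a dataframe that mixes numeric and text columns, describe summarizes only the numeric columns by default. The sketch below, using a small hypothetical dataframe, compares the default call with include="all", which also reports count, unique, top, and freq for text columns.

```python
import pandas as pd

# Hypothetical sample data with one numeric and one text column.
source_data = pd.DataFrame(
    {"age": [30, 25, 35], "city": ["Cairo", "Giza", "Cairo"]}
)

# Default behaviour: only numeric columns are summarized.
numeric_summary = source_data.describe()

# include="all" adds text columns to the summary as well.
full_summary = source_data.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

If your datasets are mostly categorical, passing include="all" to describe in the app may give a more useful summary section.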
In this step, we use Streamlit’s selectbox widget to let users pick one of the dataframe’s columns and profile its values. To get the list of columns, we use the columns attribute of the Pandas dataframe:
# Column values profiling
st.subheader("Column profiling")
column = st.selectbox('Select Column',source_data.columns)
st.write(source_data[column].value_counts())
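By default, value_counts leaves out missing values and reports raw counts only. As a possible extension, the sketch below builds a small profile table with both counts and shares, and keeps missing values visible via dropna=False. The helper name profile_column and the sample data are hypothetical.

```python
import pandas as pd


def profile_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return counts and relative shares for each distinct value in a column."""
    # dropna=False keeps missing values as their own row in the profile.
    counts = df[column].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Hypothetical sample data with one missing value.
source_data = pd.DataFrame({"city": ["Cairo", "Giza", "Cairo", None]})
profile = profile_column(source_data, "city")
print(profile)
```

The resulting dataframe can be passed to st.write in place of the plain value_counts output.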
The final step is to let users select one or more columns and view summary statistics for just those columns. To do that, we use Streamlit’s multiselect widget, then apply the describe method to the selected columns:
# Column(s) summary statistics
st.subheader("Column Summary Statistics")
columns = st.multiselect("Choose Column(s):", source_data.columns)
if len(columns) > 0:
    st.write(source_data[columns].describe())
else:
    st.write("Please Select columns to show summary statistics")
To avoid errors, we check the list of selected columns: if the user has not selected any, we display a message asking them to choose at least one column.
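The same guard can be pulled into a small function, which makes it easy to try outside the app. This is a minimal sketch: the helper name summarize_columns and the sample data are hypothetical, and returning None for an empty selection is just one way to signal that there is nothing to summarize.

```python
from typing import Optional, Sequence

import pandas as pd


def summarize_columns(
    df: pd.DataFrame, columns: Sequence[str]
) -> Optional[pd.DataFrame]:
    """Describe the selected columns, or return None if none are selected."""
    if not columns:
        # Calling describe on a frame with no columns is what we want to avoid.
        return None
    return df[list(columns)].describe()


# Hypothetical sample data.
source_data = pd.DataFrame(
    {"age": [30, 25, 35], "salary": [1000.0, 1200.0, 900.0]}
)

print(summarize_columns(source_data, []))
print(summarize_columns(source_data, ["age"]))
```

In the app, a None result would map to the "Please Select columns" message, and any other result would be passed to st.write.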
Here is the full code for our application
import streamlit as st
import pandas as pd

# Set title for the Application
st.title("Dataset Discovery")

# Upload data
upload_file = st.file_uploader("Choose Sample data", type='csv')

if upload_file is not None:
    # Read source data
    source_data = pd.read_csv(upload_file)

    # Preview data
    st.subheader("Preview data")
    st.write(source_data)

    # Print data summary
    st.subheader("Summary Statistics")
    st.write(source_data.describe())

    # Column values profiling
    st.subheader("Column profiling")
    column = st.selectbox('Select Column',source_data.columns)
    st.write(source_data[column].value_counts())

    # Column(s) summary statistics
    st.subheader("Column Summary Statistics")
    columns = st.multiselect("Choose Column(s):", source_data.columns)
    if len(columns) > 0:
        st.write(source_data[columns].describe())
    else:
        st.write("Please Select columns to show summary statistics")
To run any Streamlit application, use the command streamlit run <python file name> from a terminal:
streamlit run data-discovery-app.py
A new browser window will then open with the application.
Of course, there is much more you can do with Streamlit, but this simple demo shows the power and flexibility of the framework.