Build a Data Analysis and Data Discovery Web Application for Data Science Projects

Data preparation and data discovery consume a great deal of time in any data science or data analytics job. One solution is to write a template script that you can reuse in this phase of your work. But what about adding interactive, dynamic controls to that script? Wouldn’t that be great? Fortunately, we now have this option with Streamlit, an open-source application framework that enables data scientists and machine learning engineers to create beautiful, performant apps. In this article, we will use Streamlit to automate the most common tasks in the data discovery and data analysis phase, and to make the result as dynamic as possible.

Our application will have the following features:

  1. Allow users to select a CSV file containing the dataset
  2. Show a preview of the dataset
  3. Show general statistics for the dataset
  4. Allow users to select one column of the dataset and profile its values
  5. Allow users to select one or more columns and show general statistics for them

By the end, our application will look like the following:

Install Streamlit

Before we dive into the code, let’s see how to install Streamlit. It is easy and straightforward:

pip install streamlit

Note: if you are installing on Windows, it is recommended to install using the Anaconda CLI.

Application Code

First, we import the libraries we will need: Streamlit and Pandas.

import streamlit as st
import pandas as pd 

In Streamlit, the write method is the magic method: we can use it with almost any type of object, such as graphs, dataframes, and text. When we pass an object to this method, it renders the object in our application in a format suited to its type. For example, when we pass a Pandas dataframe, write displays a tabular preview of the data. Widgets such as file_uploader and selectbox are created by calling their own functions, which display the widget and return the user’s input.

First, we will use the title method to give our application a title header:

# Set title for the Application
st.title("Dataset Discovery")

Then we will use the file_uploader widget to let users upload the CSV file containing the dataset they want to explore:

# Upload data
upload_file = st.file_uploader("Choose Sample data", type='csv')

After that, we check whether a file has been selected. If so, we pass it to the read_csv method from Pandas to load its data into a dataframe:

if upload_file is not None:
    # read source data
    source_data = pd.read_csv(upload_file)
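
The object returned by file_uploader behaves like an in-memory file, which is why it can be handed straight to read_csv. The sketch below shows the same idea without a browser, using io.BytesIO as a stand-in for the uploaded file. The helper name load_dataset and the sample bytes are assumptions made for illustration only.

```python
import io

import pandas as pd


def load_dataset(file_like):
    """Read CSV data from any file-like object into a dataframe."""
    return pd.read_csv(file_like)


# Stand-in for an uploaded file: raw CSV bytes held in memory (made-up sample)
fake_upload = io.BytesIO(b"name,age\nAna,31\nBo,25\n")

preview = load_dataset(fake_upload)
print(preview)
```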

In this section of the code, we use the subheader method to print a small header describing the data presented in this part of the app, then we use the write method to display the dataframe:

    # Preview data
    st.subheader("Preview data")
    st.write(source_data)

Next, we use the describe method from Pandas to view general statistics for the dataframe:

    # Print data summary
    st.subheader("Summary Statistics")
    st.write(source_data.describe())
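
One thing to keep in mind is that, for a dataframe with mixed column types, describe summarizes only the numeric columns by default. If you also want text columns in the summary, Pandas accepts include="all". A small sketch with made-up data:

```python
import pandas as pd

# Made-up sample with one text column and one numeric column
df = pd.DataFrame({"city": ["A", "B", "A"], "sales": [10.0, 20.0, 30.0]})

# Default: only the numeric column 'sales' is summarized
numeric_summary = df.describe()

# include="all": every column is summarized; text columns get
# rows such as 'unique', 'top' and 'freq'
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```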

In this step, we use the selectbox widget from Streamlit to let users pick one of the dataframe’s columns and profile its values. To get the list of columns, we use the columns attribute of the Pandas dataframe:

    # Column values profiling
    st.subheader("Column profiling")
    column = st.selectbox('Select Column', source_data.columns)
    st.write(source_data[column].value_counts())
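
By default, value_counts returns counts sorted from most to least frequent and skips missing values. The sketch below, on a made-up column, shows two optional parameters that are often useful when profiling: normalize=True for proportions and dropna=False to count missing values as well.

```python
import pandas as pd

# Made-up column with one missing value
colors = pd.Series(["red", "blue", "red", None, "red"], name="color")

# Counts per distinct value, most frequent first; missing values are skipped
counts = colors.value_counts()

# Same, expressed as a share of the non-missing values
shares = colors.value_counts(normalize=True)

# Include missing values as their own category
with_missing = colors.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```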

The final step is to let users select multiple columns and run summary statistics on them. To do that, we use the multiselect widget from Streamlit, then apply the describe method to the selected columns:

    # Column(s) summary statistics
    st.subheader("Column Summary Statistics")
    columns = st.multiselect("Choose Column(s):", source_data.columns)
    if len(columns) > 0:
        st.write(source_data[columns].describe())
    else:
        st.write("Please Select columns to show summary statistics")

To avoid errors, we check the list of selected columns: if the user has not selected any, we display a message asking them to choose at least one column.
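
The check matters because Pandas refuses to describe a dataframe with no columns, which is what an empty selection would produce. A minimal sketch of that guard as a plain function follows. The name summarize_columns is hypothetical and the data is made up.

```python
import pandas as pd


def summarize_columns(data, columns):
    """Return summary statistics for the chosen columns, or None if none were chosen."""
    if len(columns) == 0:
        return None
    return data[columns].describe()


# Made-up sample data
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "width": [4.0, 5.0, 6.0]})

print(summarize_columns(df, ["height"]))
print(summarize_columns(df, []))  # nothing selected, so no statistics

# Without the guard, an empty selection raises an error
try:
    df[[]].describe()
except ValueError as error:
    print("describe failed:", error)
```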

Here is the full code for our application:

import streamlit as st
import pandas as pd 

# set title for the Application
st.title("Dataset Discovery")

# Upload data
upload_file = st.file_uploader("Choose Sample data", type='csv')

if upload_file is not None:
    # read source data
    source_data = pd.read_csv(upload_file)

    # Preview data
    st.subheader("Preview data")
    st.write(source_data)

    # Print data summary
    st.subheader("Summary Statistics")
    st.write(source_data.describe())

    # Column values profiling
    st.subheader("Column profiling")
    column = st.selectbox('Select Column', source_data.columns)
    st.write(source_data[column].value_counts())

    # Column(s) summary statistics
    st.subheader("Column Summary Statistics")
    columns = st.multiselect("Choose Column(s):", source_data.columns)
    if len(columns) > 0:
        st.write(source_data[columns].describe())
    else:
        st.write("Please Select columns to show summary statistics")

To run any Streamlit application, use the syntax streamlit run <python file name> from a command terminal:

streamlit run data-discovery-app.py

A new browser window will then open with the application.

Of course, there is much more you can do with Streamlit, but this simple demo shows the power and flexibility of the framework.
