In this article, we will go through the skills and concepts you need to learn on the journey to becoming a data scientist. But first, what is data science?
Data Science is the art of uncovering the insights and trends in data. It has been around since ancient times. The ancient Egyptians used census data to increase efficiency in tax collection and they accurately predicted the flooding of the Nile river every year. Since then, people working in data science have carved out a unique and distinct field for the work they do.
Now, thanks to massive advances in storing large datasets and in processing and learning algorithms, the term Data Science has gained wide exposure, and large companies are in dire need of data professionals who can help them learn more about their business and shape new strategies.
What are the key skills for a data scientist?
Although the answer to this question can vary from business to business, we can summarize the most common and most in-demand skills in the field.
1. Mathematics
Mathematics is essential in the field of data science, and if you do not enjoy Math, this field is probably not a good fit for your career. Specifically, there are some topics in Math that are unavoidable when learning data science.
Linear Algebra
Linear Algebra is concerned with the study of vectors, matrices, and the linear transformations between them. Later on, you will see that almost any dataset can be represented as a matrix, and that a model for any type of data is ultimately a mathematical model.
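To make the idea of data as a matrix concrete, here is a minimal sketch using NumPy. The dataset, the weights, and the intercept are all made-up values chosen for illustration; they do not come from any real model.

```python
import numpy as np

# Hypothetical dataset: each row is one house, each column one feature
# (area in square metres, number of rooms).
X = np.array([
    [50.0, 2.0],
    [80.0, 3.0],
    [120.0, 4.0],
])

# Hypothetical model parameters: one weight per feature, plus an intercept.
weights = np.array([2.0, 10.0])
intercept = 5.0

# A linear model is a matrix-vector product plus a constant.
predictions = X @ weights + intercept

print(X.shape)       # (rows, columns) = (samples, features)
print(predictions)   # one prediction per row of X
```

The same pattern, a data matrix multiplied by a parameter vector, sits underneath many of the algorithms mentioned later in this article.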
Calculus
Calculus provides the tools for optimizing a mathematical model. For any data problem, arriving at the parameters that describe the system requires some optimization to find the best performance, i.e. the best description of the system.
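As a rough illustration of how calculus drives optimization, the sketch below uses gradient descent to minimize a toy loss function. The loss, the starting point, the learning rate, and the number of steps are all assumptions picked for this example.

```python
def loss(w):
    """Squared error of a toy one-parameter model; smallest at w = 3."""
    return (w - 3.0) ** 2


def loss_derivative(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the slope of the loss."""
    w = start
    for _ in range(steps):
        # The derivative tells us which direction increases the loss,
        # so we move the opposite way.
        w -= learning_rate * loss_derivative(w)
    return w


best_w = gradient_descent(start=0.0)
print(best_w, loss(best_w))
```

Real models have many parameters instead of one, but the idea is the same: compute the derivative of the loss and move the parameters in the direction that reduces it.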
Statistics
Statistics is a branch of Math concerned with collecting, interpreting, and presenting data, drawing inferences from it, and conducting experiments. Learning statistics will help any data professional think about these aspects in a scientific and evidence-driven way.
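Here is a small sketch of what evidence-driven thinking can look like in code, using only Python's standard library. The sales figures are invented, and the interval uses the normal approximation, which is a simplification for a sample this small.

```python
import math
import statistics

# Hypothetical sample: daily sales observed over ten days.
sales = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

mean = statistics.mean(sales)
std_dev = statistics.stdev(sales)              # sample standard deviation
std_error = std_dev / math.sqrt(len(sales))    # uncertainty of the mean

# Rough 95% interval for the true mean, using the normal approximation
# (1.96 standard errors). A t-distribution would be more exact for n = 10.
lower = mean - 1.96 * std_error
upper = mean + 1.96 * std_error

print(f"mean={mean:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

Reporting an interval instead of a single number is a simple habit that makes the uncertainty in the data visible.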
2. Development skills
Now that we have covered some basics about the nature of data, we need to think about a platform that will enable us to deal with data sources, design data experiments, and build models. These skills fall into two groups: programming languages and programming-related skills.
Programming Languages
The most widely used programming languages in the field are Python and R. There are a few differences between the two.
Python is more flexible: it supports multiple programming paradigms, such as object-oriented and functional programming, and it is also easier to connect to multiple data sources.
On the other hand, R is an incredible tool when it comes to applying statistical procedures to data. It also has very attractive data visualization packages. Several tools will help you develop your code in Python and R, such as:
- Jupyter Notebook (Python)
- Spyder (Python)
- PyCharm (Python)
- RStudio (R)
Programming Related Skills
Skills related to software development are countless, but among those most relevant to data science, I encourage you to learn Object-Oriented Programming, as it helps you write clean, modular, and reusable code.
Besides, a great advantage for a data scientist is to learn version control, which has tremendous benefits when it comes to editing, developing, or enhancing code. These skills become essential when integrating your code with other teams working on the back end and front end of a product.
3. Querying Data
There are countless data sources around, in different shapes, formats, and scales. Every data scientist therefore needs to be able to query these sources and get the required data. Below are the basic data querying skills.
Database Design and SQL
Databases are one of the main sources of data. They are used mainly for structured data, and any data scientist should learn a little about database design in order to understand how the data is stored.
The next step is to query the database for the required data using SQL. Because the data may be stored in very complicated structures, the ability to write efficient and fast SQL statements is invaluable. After learning the basics of SQL, it is worth having a look at the SQL differences between database vendors such as:
– Teradata
– Oracle
– Microsoft SQL Server
– PostgreSQL
All these vendors follow the SQL standard to a large extent, so you can write standard SQL statements on almost all of them. However, each vendor also has some syntax of its own.
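To give a feel for a standard SQL query, here is a minimal sketch that runs one from Python. It uses SQLite through Python's built-in sqlite3 module, which is my choice for the example because it needs no server; it is not one of the vendors listed above. The table and its rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of customer orders.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)],
)

# Plain SELECT / GROUP BY / ORDER BY: total spend per customer.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()

print(rows)
conn.close()
```

A query this simple should carry over to the vendors above with little or no change; the differences tend to show up in more advanced features.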
Big Data Ecosystem
Another important source of data is big data platforms. These platforms are used for unstructured data like text and images, and they are also very suitable for streaming data, for example the clickstream of a website. Dealing with these platforms requires some familiarity with the big data ecosystem. Below are some building blocks of that ecosystem that every data scientist should know about.
– HDFS
– Sqoop
– Apache Spark
– Apache Hive
– Impala
Cloud Solutions
Currently, some companies are beginning to use cloud infrastructure for their data storage. Storing data this way has several benefits, the most valuable of which is continuous data availability. Also, compared with conventional storage, the chance of losing data because of a failure is very low.
There are many cloud vendors, and each provides different solutions for data science. The top three vendors are:
Google Cloud Platform (GCP), which provides the following solutions for data scientists and machine learning engineers
AI Platform: https://cloud.google.com/ai-platform
BigQuery: https://cloud.google.com/bigquery
TensorFlow: https://www.tensorflow.org/
Amazon Web Services (AWS), which provides the following solutions for data scientists and machine learning engineers
AWS SageMaker: https://aws.amazon.com/sagemaker/
Microsoft Azure, which provides the following solutions for data scientists and machine learning engineers
Data Science Virtual Machines: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
4. Machine Learning
Machine Learning is the most essential skill for a data scientist, as it is at the heart of most solutions to data problems.
Although the Machine Learning concepts to be learned are countless, in the next few lines I will try to mention the most essential pillars of Machine Learning.
Data Cleaning and Preparation
Data cleaning and preparation is often the most time-consuming part of a data science project. That’s because data in its original state is, in most cases, not suitable for modeling, so being skilled in data cleaning will make your life as a data scientist much easier. Below is a list of widely used packages for data cleaning and preparation.
Pandas (Python)
NumPy (Python)
scikit-learn (Python)
dplyr (R)
data.table (R)
ggplot2 (R)
reshape2 (R)
readr (R)
tidyr (R)
lubridate (R)
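As an illustration of typical cleaning steps, here is a small sketch using Pandas. The raw table is invented, and choices such as filling missing ages with the median are assumptions made for the example, not a general recommendation.

```python
import pandas as pd

# Hypothetical raw data with common problems: stray whitespace,
# inconsistent capitalization, a duplicated row, and missing values.
raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", None, "Carol"],
    "age": ["34", "n/a", "n/a", "29", "41"],
    "city": ["Cairo", "cairo", "cairo", "Giza", "giza"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip()      # remove stray whitespace
clean["city"] = clean["city"].str.title()      # consistent capitalization
clean = clean.drop_duplicates()                # drop the repeated row
clean = clean.dropna(subset=["name"])          # a row without a name is unusable here

# Convert age to numbers; anything unparseable ("n/a") becomes NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
# Fill missing ages with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

clean = clean.reset_index(drop=True)
print(clean)
```

Which rows to drop and which values to fill depends heavily on the problem, so it is worth writing these decisions down explicitly in code like this.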
Algorithms
For any data scientist, mastering Machine Learning algorithms is unavoidable. These algorithms range from supervised methods, such as regression and classification, to unsupervised methods, such as clustering, and they enable a data scientist to work on a vast range of real-world problems and use cases. The most basic set of algorithms can be listed as follows.
– Linear Regression
– Logistic Regression
– K-Nearest Neighbours
– Support Vector Machines
– K-means Clustering
– Hierarchical Clustering
– Decision Trees
Their implementations are available in many packages and libraries, such as:
– scikit-learn (Python)
– Vowpal Wabbit (Python)
– caret (R)
– e1071 (R)
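To show how little code one of these algorithms needs, here is a minimal sketch that trains a Logistic Regression classifier with scikit-learn. The choice of the built-in Iris dataset, the 75/25 split, and the random seed are my own assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small built-in dataset: flower measurements and their species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to measure performance on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators share the same fit and score interface, so swapping in another algorithm from the list above is usually a small change.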
Feature Engineering
Besides data cleaning, Feature Engineering is another important data-related skill that every data scientist should know about. Its main focus is to transform the data into a form that Machine Learning algorithms can make the most of. As a consequence, good feature engineering can change the performance of your Machine Learning algorithms a lot.
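Here is a small sketch of a few common transformations using Pandas. The orders table and the derived columns are hypothetical; which features actually help depends on the problem.

```python
import pandas as pd

# Hypothetical raw orders.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-06", "2023-01-07", "2023-01-09"]),
    "payment": ["card", "cash", "card"],
    "price": [20.0, 5.0, 12.5],
    "quantity": [2, 4, 1],
})

features = pd.DataFrame({
    # Combine two raw columns into one more informative value.
    "total": orders["price"] * orders["quantity"],
    # Extract parts of a date that a model can use directly (Monday = 0).
    "day_of_week": orders["order_date"].dt.dayofweek,
    "is_weekend": orders["order_date"].dt.dayofweek >= 5,
})

# Turn a text category into one indicator column per value.
features = features.join(pd.get_dummies(orders["payment"], prefix="payment"))
print(features)
```

None of these columns add new information; they only present the existing information in a form that is easier for an algorithm to use.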
Model Improvement
After cleaning the data, building a model, and even doing some feature engineering along the way, you will notice that your model’s performance reaches a plateau, and you have to make additional tweaks to push it further. Below are some topics that are widely used to improve the performance of a machine learning model.
– Regularization
– Ensemble Methods
– Dimensionality Reduction Methods
– Hyperparameter Tuning
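Two of the topics above, regularization and hyperparameter tuning, can be sketched together with scikit-learn. In this example the dataset, the candidate values of C (the inverse regularization strength of Logistic Regression), and the 5-fold cross-validation are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale the features, then fit a regularized logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Try every candidate with 5-fold cross-validation and keep the best one.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, which keeps information from the validation fold out of the preprocessing step.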
5. Deep Learning
Although Machine Learning techniques can solve a lot of problems when it comes to tabular data, in some cases, where the problem is very complex or the data is not tabular, such as text and images, deep learning algorithms become a must. Here are the most important Deep Learning skills for a data scientist.
Neural Network
The neural network structure is at the core of any deep learning algorithm. It involves some advanced mathematical topics like forward propagation, backpropagation, and optimization with gradient descent.
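To show how these three pieces fit together, here is a minimal sketch of a tiny neural network written with NumPy only. The XOR data, the layer sizes, the learning rate, and the number of steps are arbitrary choices for the example, and the code is meant to show the mechanics, not to be an efficient or well-tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a tiny dataset that a single linear model cannot fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One hidden layer with 4 units (size chosen arbitrarily).
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))

learning_rate = 0.5
losses = []
for _ in range(5000):
    # Forward propagation: input -> hidden layer -> output.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    losses.append(float(np.mean((output - y) ** 2)))

    # Backpropagation: apply the chain rule from the output back to the input.
    d_output = 2 * (output - y) / len(X) * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

    # Gradient descent: move every parameter against its gradient.
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

Libraries such as TensorFlow and PyTorch compute the backpropagation step automatically, but writing it out once by hand makes it clearer what they do for you.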
Convolutional Neural Networks
This is an extension of the neural network architecture that is specialized in image and video analytics. It is widely used in algorithms for object detection, face recognition, semantic segmentation, and much more. Perhaps you have heard of the Computer Vision field, which relies heavily on Convolutional Neural Networks (CNNs).
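The basic operation behind a CNN is sliding a small kernel over an image. Here is a minimal sketch of that operation in NumPy. The toy image and the hand-picked kernel are assumptions for illustration; in a real CNN the kernel values are learned from data.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output


# Toy 4x4 "image": dark on the left, bright on the right.
image = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = np.array([
    [-1.0, 1.0],
    [-1.0, 1.0],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the middle column, where the image changes from dark to bright, which is how a convolutional layer picks up local patterns such as edges.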
Recurrent Neural Network
This is another useful extension of the neural network architecture, specialized in handling sequential data such as text. Advances in this type of algorithm have led to interesting applications like chatbots, machine translation, sentiment detection, and much more. Recurrent Neural Networks (RNNs) have also been widely used in Natural Language Processing (NLP), a specialized field concerned with text analytics and prediction.
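What makes a network recurrent is that it carries a hidden state from one step of the sequence to the next. Here is a minimal sketch of that forward pass in NumPy. The sizes, the random weights, and the random input sequence are all placeholders; a real model would learn the weights and take encoded text as input.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 4   # sizes chosen arbitrarily

# Untrained placeholder weights.
W_x = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)


def rnn_forward(sequence):
    """Run a plain RNN over a sequence and return every hidden state."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in sequence:
        # The new state depends on the current input and the previous state.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)


# Hypothetical sequence of 5 tokens, each already encoded as a 3-dim vector.
sequence = rng.normal(size=(5, input_size))
states = rnn_forward(sequence)
print(states.shape)   # one hidden state per time step
```

Because each state is computed from the previous one, the last hidden state summarizes the whole sequence, which is what makes this architecture a natural fit for text.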
The most widely used packages in deep learning are:
– TensorFlow
– PyTorch
– OpenCV (Computer Vision)
– NLTK (NLP)
6. Visualization
One of the key skills in the data science profession is being able to summarize the information in the data simply and visually. That’s because the human eye is easily distracted and overwhelmed by numbers, so being able to communicate your findings as a data scientist in neat graphs is essential. There are different kinds of visualization tools, and I will group them below.
Programming related visualization libraries
There are many packages and libraries that are fully integrated with Python and R and let you create visualizations dynamically from code. They can be listed as follows.
Python Visualization Libraries:
– Seaborn
– ggplot
– Bokeh
– Plotly
– Pygal
– Matplotlib
– Geoplotlib
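As a quick example of creating a chart from code, here is a minimal sketch using Matplotlib. The revenue numbers and the output file name are made up for the example.

```python
import matplotlib

matplotlib.use("Agg")   # render to a file, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12.0, 15.5, 14.2, 18.9]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousand USD)")

fig.tight_layout()
fig.savefig("monthly_revenue.png")   # hypothetical output file name
plt.close(fig)
```

A clear title and labeled axes with units do much of the work of making a chart readable to a non-technical audience.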
R Visualization Libraries:
– ggplot2
– Lattice
– highcharter
– Leaflet
– RColorBrewer
– Plotly
– sunburstR
– RGL
– dygraphs
Standalone Visualization Tools
1- Excel
Excel is an incredible tool for fast and easy data manipulation and, of course, data visualization. Its most important advantage is that it is widely used across many domains, not only by data scientists or data analysts.
2- Tableau
Tableau is a great tool for creating complex and interactive dashboards. It is also useful for building an end-to-end workflow, from data collection and modeling to presenting the results interactively.
3- Microsoft Power BI
Power BI is pretty similar to Tableau, and some corporations use it instead, so learning its basics will give you an edge when working for these companies.
7. Personal Skills
This section might come as a surprise to some readers, but every data scientist should work on their interpersonal skills, in particular communication and presentation skills. That’s because the job involves a lot of communication, especially with business people. After all, any solution should ultimately serve the business need, so the more efficient this communication is, the more productive the whole process will be. Besides, being able to present the data, along with your thoughts and assumptions about it, is a key skill, especially because the audience is often non-technical.