The concepts used in data science can be complex and overwhelming, but the process of managing the project and getting to the right answers is quite straightforward. To get you started, here are some fundamental ideas.
The following is a typical data science scenario: A premium airline has asked you to investigate the efficiency of its route network. Low-cost carriers (LCCs) are expanding rapidly, especially in the Asian market, and the airline wants to know how to prevent passengers from switching to LCCs. It also wants to know which routes will be the most profitable in the future. The airline therefore needs a robust, efficient model that can adapt dynamically to the current and future market environment.
This is one of the scenarios data scientists are trained for and hired to solve. So let’s see which steps the data science team should follow to deal with such a project.
What is data science?
But first, let’s have a quick look at what data science is and what its related concepts are.
Data science itself tends to be a difficult concept to grasp. When you ask five experts for the definition, you will get at least six different explanations. But let’s stick to this one:
Data science is a combination of different fields of work in statistics and programming that seeks to extract insights from large amounts of data.
If your company relies heavily on data (and that is pretty much every company), then understanding data science and its related concepts is the very first step to solving many of your problems.
Despite its technical nature, conducting a data science project is not a purely technical task. It is a mixture of technical and soft skills. And unless you are a data science “unicorn”, those skills have to be spread among all the project team members.
Like any other project, a data science project needs a method that will guide you through the process of problem-solving. And it must be independent of technologies and tools and provide a way of getting to the answers the data scientist is looking for.
Following these simple rules will save you from burning your company’s valuable resources.
The 5 Step Data Science Process
The Data Science Process (DSP) is a framework for delivering data science projects. Joe Blitzstein and Hanspeter Pfister first introduced it in their Harvard CS109 data science class. The goal of this class is to introduce the process of data science investigation. Like its predecessors, it is a non-linear, iterative process that stresses the importance of asking questions and iterating on your research as more data becomes available.
If you are a consultant doing work for external clients, then this may be your preferred way of managing the data science work.
Let’s have a closer look at how DSP handles projects.
Step 1: Frame the problem
Every data science project should start with an understanding of how the business works and what its challenges are. The stakeholders who need your advice play a critical role in defining the goals. This step is one of the hardest, and yet, when done correctly, it will save you a lot of rework in later stages.
Ask your sponsors what they are trying to achieve. This step requires a great amount of domain expertise. The biggest challenge here is to translate the ambiguous requirements into a concrete, well-defined problem.
Input from your sponsors is valuable, but in many cases not extensive. This is where talking to experts can help you close the gaps.
A clear and concise storyline should be the main outcome at this stage.
While the framework is iterative and going back to this stage will occur, try to form a solid baseline that everyone understands and agrees on.
Step 2: Get the data
In this step, you focus on getting the data. Structured, unstructured, semi-structured, all types can be useful. It can come from a lot of different sources. There is a lot of creativity involved in this step.
Potential sources are websites, social media, open data, or enterprise data. These are then merged into data pools and form the basis for the following evaluations.
Once the data has been identified and is available for consumption, it has to go through several process steps – from importing and cleaning, to splitting and aggregating.
Cleaning makes sure the data fits well into whatever software you’re using and has been checked for errors and anomalies. Most importantly, make sure that what you’re working on is valid and reliable!
Some websites contain useful data, but often no APIs exist. Web scraping is a powerful method, but remember that in many cases scraping a website for data is not permitted, and you could be violating data privacy laws. European regulations in particular tend to be strict.
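The import, clean, and aggregate steps can be sketched in a few lines of plain Python. This is only a minimal illustration: the records, the field names (`route`, `fare`, `passengers`), and the validity rules are invented assumptions, not data from a real airline.

```python
from collections import defaultdict

# Hypothetical raw booking records; all names and values are made up.
raw_records = [
    {"route": "SIN-BKK", "fare": "120.0", "passengers": "150"},
    {"route": "sin-bkk ", "fare": "135.5", "passengers": "140"},
    {"route": "SIN-HKG", "fare": "", "passengers": "90"},       # missing fare
    {"route": "SIN-HKG", "fare": "210.0", "passengers": "-5"},  # impossible count
    {"route": "SIN-HKG", "fare": "205.0", "passengers": "110"},
]


def clean(records):
    """Normalize route codes, parse numbers, and drop invalid rows."""
    cleaned = []
    for rec in records:
        try:
            fare = float(rec["fare"])
            passengers = int(rec["passengers"])
        except ValueError:
            continue  # skip rows with missing or unparseable values
        if fare <= 0 or passengers < 0:
            continue  # skip rows that fail a basic sanity check
        cleaned.append({
            "route": rec["route"].strip().upper(),
            "fare": fare,
            "passengers": passengers,
        })
    return cleaned


def revenue_by_route(records):
    """Aggregate fare * passengers per route."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["route"]] += rec["fare"] * rec["passengers"]
    return dict(totals)


cleaned = clean(raw_records)
revenue = revenue_by_route(cleaned)
print(revenue)
```

In a real project, how to treat invalid rows (drop, repair, or flag) is a decision to agree on with the domain experts rather than one to hard-code silently.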
Getting the data is the most time-consuming phase in this framework. It could account for 70-90 percent of the overall project time. When data sources are well managed and integrated, it can drop as low as 50 percent. You could optimize even further by automating some steps in data preparation.
Step 3: Explore the data
Once you are sure about the quality of the collected data, you can start to explore it. See what the distributions are like. See what the associations look like. Visualization techniques may assist you in understanding the data content and discovering initial insights.
After the first glance at the data, revisiting the previous step, data collection, might be necessary to close gaps in understanding.
The difficulty in this step is to test those ideas that are likely to turn into insights. Revisit the outcome from the first phase (Framing the problem). Here you will most likely find valuable questions or hypotheses that will help you to scope the data exploration. A data science project, like any other project, has fixed deadlines too, so you’ll have to focus your efforts. You can’t solve every problem.
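As a rough sketch of what looking at distributions and associations can mean in practice, the snippet below summarizes one variable and computes the Pearson correlation between two. The fares and load factors are invented numbers, and Pearson correlation is just one common choice of association measure, assumed here for illustration.

```python
import math
import statistics

# Hypothetical per-route averages; the values are made up.
fares = [100.0, 120.0, 140.0, 160.0, 180.0]
load_factors = [0.92, 0.88, 0.81, 0.77, 0.70]


def summarize(values):
    """Return a few basic descriptive statistics for one variable."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


fare_summary = summarize(fares)
fare_load_corr = pearson(fares, load_factors)
print(fare_summary)
print(round(fare_load_corr, 3))
```

A strongly negative coefficient in this toy data would suggest that higher fares go along with emptier planes, which is the kind of pattern worth tracing back to the questions framed in Step 1.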
A good sign is when you start seeing patterns in the data. Try to trace them and analyze them more deeply. After that, you are ready for modeling your data.
Step 4: Model the data
In this part, you do the actual modeling of the data. Data scientists use a training set, a historical dataset where the outcome is known, to develop predictive models. This stage is highly iterative.
To fit the model, data scientists tune its parameters so that it fits the data as well as possible. Fitting means tuning it in such a way that it can also be used successfully with other, new data.
Once you’ve created a model (or several), you need to validate it. That is, you need to make sure that the model is accurate and that it’s going to generalize well. You try to see how accurate it is and how much it actually tells you about the question you’re trying to answer (stick to the storyline).
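The fit-then-validate loop described above can be sketched as follows. The example assumes a simple least-squares line as the model and mean absolute error on a held-out set as the validation measure. Both are illustrative choices, and the demand data is synthetic.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute gap between predictions and known outcomes."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Synthetic history: demand falls by about 2 seats per unit of fare, plus noise.
rng = random.Random(42)
data = [(fare, 500 - 2 * fare + rng.gauss(0, 5)) for fare in range(80, 200, 4)]

# Hold out 20 percent of the records that the model never sees during fitting.
rng.shuffle(data)
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

train_x = [x for x, _ in train]
train_y = [y for _, y in train]
holdout_x = [x for x, _ in holdout]
holdout_y = [y for _, y in holdout]

slope, intercept = fit_line(train_x, train_y)
holdout_mae = mean_absolute_error(holdout_x, holdout_y, slope, intercept)
print(slope, intercept, holdout_mae)
```

If the error on the held-out records is much larger than on the training records, the model is probably not generalizing well and needs another iteration.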
Based on the evaluations, you may want to make some tweaks to ensure the model is informative and as easy to implement as possible. To check whether it addresses the business problem appropriately, revisit the first stage.
During the modeling activities, it is helpful to have George Box’s quote in mind:
“All models are wrong, but some are useful.”
Step 5: Communicate the results
The last step involves presenting your model. You usually have a client (internal or external), and you have to present the results in a way that makes sense to them. They also need to know what to do with the results. The more precise and clear the message is, the better your chances are of seeing your model at work generating valuable insights. Show what the real business value of your work is.
Putting all your complex formulas on a slide deck and using sophisticated wording to explain it won’t get you far.
Communication of the results is not always the last step in a data science project. It depends on the agreement between you and your sponsor. In some cases, you and your team will hand over the developed model(s) to those who will deploy them in a production environment. If you’re developing a predictive model that will be used, for instance, for an e-commerce website, you actually have to put it on the server, and it has to be able to generate insights and make predictions based on new customer data.
If deployment is not part of your contract, then make sure you document your work in a way that others will understand. Archive all of the assets that you’ve used. This includes the data sets in every step, from raw data to cleaned data to the final analysis, as well as the code that you used, the presentations, and the notes. That way both of you can find out what you did before, your client understands it, and if anybody needs to go back and verify the analysis, it becomes possible.
Boost Your Data Science Skills
This is the Data Science Process – easy to follow and straightforward, applicable to nearly every scenario. But as in every project, if the necessary skills are missing, even the best methodology won’t help.
As with traditional project management, the better you know the domain the project is aiming for, the easier it gets. For example, solving problems in genome sequencing requires knowledge of that field. Knowing only the process of conducting such projects is helpful, but insufficient.
To be considered a data science all-rounder (not necessarily a “unicorn”), familiarize yourself with data science-related concepts. There is a lot of literature on learning the technical aspects of it. Check out these introductory books:
- Statistics – Statistics in Plain English
- Programming with Python – Python for Data Analysis
- Programming with R – Software for Data Analysis: Programming with R
- Algorithm design – Introduction to Algorithms
- Machine Learning – Introduction to Machine Learning with Python
As for the communication skills, you may want to consult the following books:
- Storytelling – Made to Stick: Why Some Ideas Survive and Others Die
- Public speaking – The Art of Public Speaking
- Writing – HBR Guide to Better Business Writing
In addition to that, consider reviewing the “older” data science methodologies: KDD (Knowledge Discovery in Databases), CRISP-DM (Cross Industry Standard Process for Data Mining), or SEMMA (Sample, Explore, Modify, Model, and Assess).