*Stages in the Analytics/Data Science Process*

May 1, 2013

By Koo Ping Shung - Data Scientist/Analytics Instructor at Singapore Management University

In this post I will share what the Analytics or Data Science process consists of in the companies I have observed: what the stages are, what goes into each stage, and so on. Hopefully, it can serve as a good reference for companies that are embarking on the Data Science journey.

Setting Business Objectives/Questions

The value of Data Science comes from the business questions. It is therefore important that companies choose the business questions that, under current circumstances, provide the highest value. Value need not be monetary alone; it could also be a competitive advantage. Having set the right business question to answer, the data scientist then needs to know how to convert it into a modelling question and what kind of mathematical model(s) to use.

Collecting & Preparing Data

With an understanding of both the business question and the modelling question at hand, the data scientist can proceed to collect the data needed. Questions such as how many months of data to collect and which variables to collect are asked and answered at this point.

After the data is collected, the next step is to prepare it for modelling, because each type of model requires a different data structure.
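To illustrate how the same raw data may need different structures, here is a minimal sketch in Python. The transaction log, the column choices and the function names are all hypothetical, made up for this example. It reshapes one log into a one-row-per-customer table (the kind of shape a customer-level model might expect) and into a month-by-month series (the kind a time-series model might expect).

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, month, amount).
transactions = [
    ("c1", "2013-01", 20.0),
    ("c1", "2013-02", 35.0),
    ("c2", "2013-01", 10.0),
    ("c2", "2013-01", 5.0),
    ("c1", "2013-02", 15.0),
]


def per_customer(rows):
    """Aggregate to one row per customer (count and total spend)."""
    table = defaultdict(lambda: {"n_transactions": 0, "total_amount": 0.0})
    for customer_id, _month, amount in rows:
        table[customer_id]["n_transactions"] += 1
        table[customer_id]["total_amount"] += amount
    return dict(table)


def per_month(rows):
    """Aggregate to one total per month, sorted chronologically."""
    totals = defaultdict(float)
    for _customer_id, month, amount in rows:
        totals[month] += amount
    return sorted(totals.items())


customer_table = per_customer(transactions)
monthly_series = per_month(transactions)
```

Which shape is right depends on the modelling question set in the earlier stage; the two functions above are only examples of the idea.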

Exploratory Data Analysis (EDA)

The purpose of EDA is to get familiar with the data and understand the patterns in it. At this stage we usually do missing data analysis, correlations, distribution analysis, scatterplots, frequency analysis and so on. Through the EDA, we also look out for data errors. For instance, if we know that the valid values of Gender are "M" and "F" but we also see "f" and "m", there might be errors in the way the gender data is captured, and if that is the case, it should be flagged.
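The Gender example can be sketched in a few lines of Python. This is only an illustration: the sample records, the column names and the helper functions are hypothetical, and the expected coding of "M" and "F" is taken from the example above. It counts missing values per column, builds a frequency table, and flags values outside the expected set.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"gender": "M", "age": 34},
    {"gender": "F", "age": 29},
    {"gender": "f", "age": None},
    {"gender": "m", "age": 41},
    {"gender": None, "age": 52},
    {"gender": "F", "age": 38},
]

EXPECTED_GENDER_VALUES = {"M", "F"}  # assumed coding scheme


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def frequency(rows, column):
    """Frequency table for one column, ignoring missing values."""
    return Counter(row[column] for row in rows if row[column] is not None)


def unexpected_values(rows, column, expected):
    """Values present in the data that are not in the expected set."""
    return sorted(set(frequency(rows, column)) - expected)


gender_freq = frequency(records, "gender")
flagged = unexpected_values(records, "gender", EXPECTED_GENDER_VALUES)
missing = missing_counts(records)
```

Here `flagged` holds the lowercase codes, which would be reported back to whoever owns the data capture rather than silently corrected.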

The EDA results form the basis for the modelling part later. Through the EDA we can find out the potential pitfalls we may face when we start working on the mathematical models.

Building Mathematical Models

At this point, we try to build a model that can be implemented and has the highest possible predictive power (if we are building a statistical model). Usually a few models are built with varying numbers of independent variables. We also check the results from the models against those of the EDA to make sure the models make sense.
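As a rough sketch of building a few models and comparing their predictive power, the Python below fits two candidates on a training set and scores them on held-out data. Everything here is an assumption made for illustration: the numbers are invented, the two candidates (a mean-only model with no independent variable and a least-squares line with one) are just the simplest possible pair, and mean squared error is only one of many possible measures.

```python
def fit_mean(xs, ys):
    """Model with no independent variable: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Model with one independent variable: ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data, split into a training set and a holdout set.
train_x, train_y = [1, 2, 3, 4, 5, 6], [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
test_x, test_y = [7, 8], [15.0, 17.1]

candidates = {
    "mean_only": fit_mean(train_x, train_y),
    "one_variable": fit_line(train_x, train_y),
}
holdout_error = {name: mse(m, test_x, test_y) for name, m in candidates.items()}
best = min(holdout_error, key=holdout_error.get)
```

Scoring on data the models did not see during fitting is a design choice of this sketch; it keeps the comparison from rewarding a model simply for memorising the training set.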

Selecting the Mathematical Model for Implementation

With several models built, we start looking at which models can be implemented and what the pros and cons of each model are. Predictive power is definitely one factor for consideration, but other factors include the costs of implementation and of maintenance during deployment.
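One simple way to weigh these factors is sketched below in Python. It is an illustrative assumption, not a prescribed method: the candidate names, scores, costs and the budget are all hypothetical, and the rule used (drop anything that cannot be implemented or exceeds the budget, then take the highest predictive power) is only one of many reasonable selection rules.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predictive_power: float  # e.g. holdout accuracy between 0 and 1
    annual_cost: float       # implementation plus maintenance, hypothetical units
    implementable: bool      # can it be deployed with the systems available?


def select_model(candidates, max_annual_cost):
    """Pick the most predictive model among those that are feasible."""
    feasible = [
        c for c in candidates
        if c.implementable and c.annual_cost <= max_annual_cost
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.predictive_power)


# Hypothetical candidates from the model-building stage.
candidates = [
    Candidate("logistic_regression", 0.81, 10_000, True),
    Candidate("decision_tree", 0.78, 8_000, True),
    Candidate("large_ensemble", 0.86, 60_000, True),
    Candidate("needs_unavailable_data", 0.90, 15_000, False),
]

chosen = select_model(candidates, max_annual_cost=20_000)
```

In this made-up example the two models with the highest predictive power are both ruled out, one on cost and one because it cannot be implemented, which is the kind of trade-off this stage is about.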

Deployment of Models (if any)

At this stage, the different tests needed to implement the models are prepared. Tests such as the System Integration Test (SIT) and User Acceptance Test (UAT) need to be done to ensure the smooth deployment of models. Setting up the test cases is important so that the many different types of scenarios are tested and shown to work.

Continuous Model Validation

As most models suffer from model decay due to changes in the environment, there is a constant need to ensure that the model is functioning well (i.e. maintaining an acceptable level of predictive power). Thus models are validated with sufficient frequency, and when a model consistently falls below the accepted level of predictive power, processes have to be activated to either re-calibrate the model or re-build it with current data.
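A minimal sketch of such a check, in Python, might look like the following. The function name, the threshold of 0.75, the monthly scores and the reading of "consistently" as three consecutive validation periods are all assumptions made for illustration; each company would set its own.

```python
def needs_recalibration(scores, threshold, consecutive=3):
    """Return True if the latest `consecutive` validation scores
    are all below the accepted threshold."""
    if len(scores) < consecutive:
        return False  # not enough history to call it consistent
    return all(score < threshold for score in scores[-consecutive:])


# Hypothetical monthly validation results (e.g. accuracy), oldest first.
decaying_model = [0.82, 0.80, 0.79, 0.74, 0.73, 0.72]
noisy_model = [0.82, 0.74, 0.80, 0.73]

decaying_flag = needs_recalibration(decaying_model, threshold=0.75)
noisy_flag = needs_recalibration(noisy_model, threshold=0.75)
```

Requiring several consecutive low scores, rather than reacting to a single one, avoids triggering a re-calibration or re-build on what may be a one-off dip.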

Conclusion

Throughout the whole process of answering data science questions, the quality of data is a constant question that needs to be answered, so as to ensure that the model built and/or implemented is of value. Thus, to have a good start in data science, data management is of utmost importance, and a suitable data management strategy needs to be devised to ensure the data is of the highest possible quality.