8 decisions that can make or break your next Data Science project

Have you already undertaken a Data Science project? Are you looking to get into the field and hit the ground running? Here are some of the more important things you need to consider while getting into the field, building your brand, or expanding into different domains and markets.

Pre-Sales

A project doesn’t start with acquiring data and building algorithms. The pre-sales stage needs to be given equal importance in order to lock-in a project for yourself.

  • Manage expectations correctly – While performing pre-sales activities, even though you are trying to sell the product by making it look as good as possible, you can’t oversell it. You need to put yourself in the shoes of the actual data scientists who will be delivering the product, and think about what you are promising to a potential client.
    Are you promising 98% accuracy when the industry maximum is 86%? You can’t have promising results when you’re potentially setting up your data science team to fail.
    This can be improved by having close collaboration between technical and non-technical personnel in a business. This makes it easier for the salesperson(s) to be more aware of what is and isn’t possible on a technical level, as well as manage expectations for the data scientists to indicate that mediocrity will not be acceptable.

 

  • Decide your metric(s) for success – Furthermore, the approach of always promising a high value of accuracy is not always the best idea. Say you promise the client 90% accuracy while predicting weather in a specific region at a specific time. Say you deliver on your promise, too. Then what? How do you proceed from there? Understanding how your predictions translate into success and defining the end goal is very important in these scenarios.
    The metric for success isn’t always getting high accuracy. You need to go a step further to help your client. Providing consulting to make important business decisions and how those should be influenced by predicted data can go a long way when it comes to building your reputation. This ties into my next point as well –

  • Identify real world benefits – You need to understand the correlation between predicting data and actual future results. Consider the following use case – You are predicting future product sales (distribution and retail) for a certain company in 30 different markets. Using your predictions for the next 4 quarters, you advise the client to move more of their salespersons to the top 5 markets which emerge from your predictions.
    However, you need to realize that if those markets do tend to provide good numbers in the future, those could be due to your manual intervention and moving more people there. You could potentially improve sales of any 5 markets by diverting people there and focusing your attention on those markets. The benefits that you have gained in this case weren’t actually provided by data science.

  • Choose the right depth of analytics – The well known 4 levels of analytics are : descriptive, diagnostic, predictive and prescriptive. These increase in both difficulty and value, and provide you hindsight, insight, or foresight, based on the depth you select.
    Generally, we aim for prescriptive analytics to provide us better foresight, even when our data isn’t mature enough or doesn’t have the ability to provide us with that. It could be worth it to limit your scope to hindsight for a while, until your data matures, and you get to a level where you can employ prescriptive analytics.

Data Selection and Preprocessing

 

Does your client have a massive data store? Or do they have a minimal scattered amount of data that needs work before it can be used effectively? Here are some of the steps you may need to take before you can run algorithms on the data.

Manage and organize data stores – If the client’s data is not organized and/or is too much to effectively use right away, you may need to perform intermediate steps first. Consider that the client has their data in the following data stores –

In this case, the client’s data is split across several different sources. This is pretty normal, and occurs at varying different levels of scale. Now, consider an additional step before the data is used for any Data Science work – This step would involve creating a data warehouse and data marts to query and join data effectively. The data marts would be based on different lines of business and upstream use cases. It would also ideally be scalable to make up for future data.

  • Make sure the data you are using is not inferred directly from what you are predicting – In other words, if certain features or columns are inferred from the value you are predicting, those cannot be used for prediction. Also, if what you are using to predict is not available at the time of prediction, eliminate that from your dataset.
    Consider the following scenario – You are trying to predict flight delays at the destination, 24 hours before the flight takes off. Using predicted weather, location details, etc. is all well and good, and it would help your algorithm. However, if the historical data contains a column with details of the flight delays at the source, those cannot be used as a feature, since they’re not available at the time of prediction.

Feature Selection

Once you’ve selected all the relevant data for your algorithm, you need to make sure what you’re using actually contributes to what you’re trying to achieve. This could take the form of 2 problems – Not having enough features, or having too many features.

  • Proper feature engineering – Make sure you are employing enough methods to extract additional information from the (possibly limited) selection of features in your original data. These include, but are not limited to –
    1. Numerical
      • Add rolling window features
      • Add expanding window features
      • Add lags (values of variables on the previous day, previous month, etc.)
      • Add trends (change in variables over certain periods)
      • Binning data
      • Power transformations (sqrt, log, box-cox)
      • Normalization (min-max, z-score, log, linear)
    2. Temporal
      • Month (can be further processed as a categorical feature)
      • Part of day (4/6/8-hour intervals)
      • Weekend flag
      • Holiday flag
    3. Categorical
      • One-hot encoding (best with nominal data, where categories don’t have any underlying order)
      • Encode from categorical to numerical (best for ordinal data, where categories do have underlying order)
    4. Text
      • TF (Term Frequency) or TFIDF (Term Frequency Inverse Document Frequency)
      • Using NLP libraries like SpaCy to perform lemmatization, tokenization, grammatical error corrections, etc.
      • Cosine similarity for document matching
      • Topic Modeling
      • Document Clustering
      • Count mentions/hashtags when processing twitter feeds
    5. Location
      • Altitude
      • Distance from the coast
      • Weather features (if the data contains date/time features)
    6. Proper feature elimination – Too many features can add noise to the model, and worsen your results. You can perform the following steps to pick/eliminate features –
      • ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots can be used to determine which lagged features should be used.
      • PCA (Principal Component Analysis) can be used to reduce dimensionality while minimizing information loss.
      • RFE (Recursive Feature Elimination) can be used to remove features which aren’t contributing to accuracy.
      • Use correlation/chi-square tests to eliminate features.

On a departing note, most of these points mentioned above need additional time and effort. However, they are all worth spending time on since they can inherently make or break your project. At the end of the day, it comes down to prioritizing speed in a PoC, or leaving no stone unturned in a full-fledged project.

~ Riaz Moradian

Senior Consultant