Do you need data engineering before data science?

By Angela, March 23, 2017

Last week, I shared some lessons learned from a Domino Data Science Pop-up that I attended a month ago.  There were some very important discussions surrounding the world of data science today. One thread explored the differences between data science and data engineering.

I’ll admit that I was completely unaware of the engineering behind data science when we first launched Insight Data Science back in 2012. And I don’t believe that I was alone. We data scientists were too enamoured with the idea of having the sexiest job of the 21st century.

However, quietly under our radar, data engineering (our “slightly younger sibling”) was emerging, stretching its wings, and undergoing its own evolution. You can read about this from Maxime Beauchemin, data engineer at Airbnb.

So, what is the difference between a data scientist and a data engineer?  Companies often let these positions overlap, but understanding the distinction is essential to building your team and hiring the right people.

Since Insight added a Data Engineering program in recent years, we can compare it to the Data Science program to shed some light on these two important roles.

Data Science

The key responsibilities for a data scientist are:

  • Asking the right questions on any given dataset
  • Being able to answer those questions through statistical analysis, machine learning, and/or data mining
  • Clearly and effectively communicating any results to interested parties (either verbally or in writing)

Data scientists often have a PhD because “it demonstrates that s/he has spent roughly 5 intense years in graduate training to either ask the right questions about data, performing data analysis, create statistical or mathematical models, and present results.”

Data Engineering

A good data engineer:

  • Gathers data, stores it, does batch/real-time processing on it, and serves it via an API to a data scientist (some companies may call this data infrastructure or data architecture)
  • Has extensive knowledge of databases and engineering best practices

Data engineers should have very strong software engineering skills. They need to be able to quickly learn to use any of the big data tools on the market, as well as be able to improve the available tools if needed.
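To make the responsibilities above a little more concrete, here is a minimal sketch in Python of the gather, store, batch-process, and serve steps. It is only an illustration under simplifying assumptions: the event data, the table name, and the function names are all hypothetical, an in-memory SQLite database stands in for real storage, and a JSON string stands in for a real API response.

```python
import json
import sqlite3

# Hypothetical raw events; in practice these might come from logs or queues.
RAW_EVENTS = [
    {"user_id": "a", "action": "view"},
    {"user_id": "a", "action": "buy"},
    {"user_id": "b", "action": "view"},
]


def gather():
    """Gather: collect raw records from a source (here, a hard-coded list)."""
    return list(RAW_EVENTS)


def store(conn, events):
    """Store: persist the raw records in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events (user_id, action) VALUES (:user_id, :action)", events
    )
    conn.commit()


def batch_process(conn):
    """Batch process: aggregate the raw events into per-user event counts."""
    rows = conn.execute(
        "SELECT user_id, COUNT(*) FROM events GROUP BY user_id ORDER BY user_id"
    ).fetchall()
    return {user_id: count for user_id, count in rows}


def serve(summary):
    """Serve: build the JSON payload an API endpoint might return."""
    return json.dumps(summary, sort_keys=True)


conn = sqlite3.connect(":memory:")  # stand-in for a real data store
store(conn, gather())
payload = serve(batch_process(conn))
print(payload)
```

A real pipeline would swap each function for production tooling, but the division of labour is the same: the data scientist only sees the clean payload at the end.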

With all that said, the easy way to look at the two roles is this: data engineers enable data scientists to do their jobs more effectively.

So, for those of you looking to build out your data science team: before you hire your first data scientist, ask yourself, will he or she have the infrastructure to be successful? It just might be that you need to hire a data engineer first.

  • Steven McAuley

    Uber’s founder, Travis Kalanick, in Forbes affirmed the “existential” importance of data to that company. Having gathered reams of behaviour-based data about how Sellers and Buyers interact and communicate with each other in online automotive marketplaces…I can attest to the importance of both the data science and data engineering roles.

  • Even if you put data engineers before data scientists, it doesn’t mean that data scientists do not need good engineering practices. Today there is a lack of good tools for data scientists. Domino does a pretty good job of creating that kind of tooling. However, I don’t believe that a proprietary platform can be successfully adopted by the entire data science community.

    With these ideas in mind, I’ve created an open source tool, http://dataversioncontrol.com, which aims to bring good engineering practices to the data science community. It is like a pipeline management system, but for data scientists rather than data engineers. It makes pipelines reproducible and shareable, and these are very basic engineering practices that are missing in the data science world.
