Three weeks ago, I had the pleasure of attending the Domino Data Science Pop-up, which was one of the best days of talks on data science that I have attended. You can check out recordings of the talks as well as copies of the presentations on SlideShare.
I won’t dive into all the nitty-gritty discussions on math and statistics, but I do want to share some insightful discussions surrounding the definition and role of data scientists in today’s startups.
What’s a data scientist? It’s what we do, not who we are
It’s always tricky to define trendy terms – and that’s certainly true about data science. Whipple Neely, Director of Data Science at EA, gave us a better way to define a data scientist: it isn’t who we are but rather what we do.
This distinction is important for several reasons. First, it helps us better understand why and when we need data scientists (and what we can expect of them). And two, it sheds light on what we should be looking for when recruiting for data science. In the past, I have shared some insight on recruiting and hiring data scientists from my time at Insight Data Science.
Source: Data Scientists Are Analysts are Also Software Engineers, William Whipple Neely
It turns out that a data scientist is both analyst and software engineer (and everything in between) as Neely’s presentation was aptly titled “Data Scientists Are Analysts Are Also Software Engineers.” Building models is the least important part of the job. Data science isn’t just about optimization, but also communication. Taking a black box approach to data science makes it harder to explain the insights – and the findings can come off as less believable, especially when your company’s core business isn’t machine learning.
Finding that data scientist unicorn
How do we go about finding these elusive people who are technical builders with deep mathematical knowledge, who are both product and sales-driven, and who think like a CEO with an understanding of internal and external stakeholders?
As Kimberly Shenk, Director of Data Science Solutions at Domino Data Labs, explained: “It’s hard ‘to do all and be all things’.” Therefore, she suggests the following strategy when hiring and developing the team:
- First, think about which area people on your team naturally gravitate to: are they sales pitchers (i.e. good at convincing people in the organization about what work brings value), problem interpreters (good at deciphering what the problem really is), mini-strategists / CEOs (good at prioritizing what will drive the business forward), data engineering socialites (good at mobilizing others to implement your work)?
- Then, start by hiring individuals who are strong in a few of these categories and continue to build out their expertise, while developing other skills over time (if they are so interested and inclined). It’s all about “divide and conquer,” so make sure you don’t hire the same kind of individuals who all have the same subset of skills. Every addition to the team should be synergistic.
You might be wondering if it is better to hire someone who is stronger technically or business-wise? It really depends on your needs and company, as you can certainly learn technical/math skills just like you can pick up business skills. Ultimately, the consensus of the crowd was that good intuition (specifically a scientific thinking process) is important. And, curiosity is the most critical attribute that can’t be taught.
On a side note, I would love to talk to anyone who has a “curriculum” or process for fostering and cultivating data science talent? Ping me and we can compare notes.
Where does data science fit in an organization?
Another great discussion focused on how to structure a data science team within your company. Is a centralized or decentralized approach better? To summarize the two:
- Centralized: individuals are on a data science team, working with product teams on a per-project basis (almost like “consultants”)
- Decentralized: individuals are on different product/engineering teams
As with anything, there’s always a tradeoff. A centralized approach creates distance between team and product, while a decentralized approach creates a risk of redundant work.
The consensus from the group is that most companies start with data science embedded with the product team, as there’s a natural fit there. Then, as the organization grows, it needs to find ways to implement a hub-and-spoke model. For more thoughts on this, you can read my friend Clare Corthell’s blog post.
Best practices in data science and engineering
Last month, I wrote about my mission to create a crowdsourced repository for startup data – stay tuned for more on this next week.
One of the key questions I want to answer is: what are the best practices when it comes to data science and engineering. Ironically, many of the decisions made on infrastructure and analysis are made through anecdotes, i.e. asking other data scientists and engineers questions like “what do you use?” and “do you build your own tools?”.
Of course, it’s not enough to consider each database, computing platform, etc. separately because everything is intertwined (as an example, just look at data at Atlassian!).
But if it is possible for us to bring some data forward on the data pipeline to complement qualitative anecdotes we get from friends, I have no doubt that that would be valuable. So if you’re interested in working on this with me, please ping me and stay tuned!