A month ago, we hosted the second virtual hangout with data teams across our portfolio. Our goal is to link up our data teams to build a peer network, share best practices, collaborate on solutions and raise the bar on data science innovation. In case you missed the summary of our first session, here’s a recap.
We kept the same format as last time since our teams are distributed and still getting to know one another. We covered a wide range of topics, but I want to share three questions that surfaced around the theme of data management. As you’ll read below, our group didn’t reach a strong consensus but rather offered general recommendations for each question.
Question 1: How do you maintain datasets? Do you do anything for versioning of data?
Data is changing all the time: records are added, edited and deleted. Given that fact, the challenge for data teams is how to maintain integrity and reproducibility when working with dynamic data systems.
Our group still has a lot of questions about how to tie versioning systems into their stack. Some are exploring products like Pachyderm. Another product I’ve recently seen is Quilt (they position themselves as GitHub for data), and there is an open-source project called Dat.
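For teams that are not ready to adopt a dedicated tool, one lightweight way to start is to fingerprint each dataset file and log the fingerprint whenever it changes. The sketch below is only an illustration of that idea, not how Pachyderm, Quilt or Dat work. The function names and the data_manifest.json file are hypothetical, and it assumes the dataset is a single file on disk.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_version(data_path, manifest_path="data_manifest.json"):
    """Append the dataset's fingerprint to a JSON manifest if it changed.

    The manifest name and layout are assumptions made for this sketch.
    """
    manifest_file = Path(manifest_path)
    versions = json.loads(manifest_file.read_text()) if manifest_file.exists() else []
    digest = fingerprint(data_path)
    # Only log a new entry when the content differs from the last one recorded.
    if not versions or versions[-1]["sha256"] != digest:
        versions.append({"file": str(data_path), "sha256": digest})
        manifest_file.write_text(json.dumps(versions, indent=2))
    return digest
```

Storing the returned digest alongside a model or report makes it possible to check later whether the data it was built from has since changed. It does not keep the old data itself, which is the harder problem the tools above aim to solve.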
Question 2: What type of third-party data are you paying for?
Many data teams look to augment their products with data they do not collect themselves, whether that data is publicly available for free or comes from third-party providers that charge a fee.
When data isn’t readily available via APIs or isn’t free, our teams opt to collect as much as they can by scraping. There are great tools out there like Scrapinghub as well as the Python library, Beautiful Soup.
Question 3: Where does information on your data live?
In the first hangout, we discussed how the data team can and should be the company’s oracle of data. But where does this information actually live? It turns out that much of our portfolio has some sort of wiki for their data, whether that’s on Google Docs or GitHub.
Our teams are big fans of GitHub Pages over the GitHub Wiki itself. With GitHub Pages, you can use pull requests and still edit like a wiki. One of our teams is trying Notejoy. It’s like Evernote: you can embed spreadsheets, tables and docs, and it’s also searchable.
Was there a key takeaway from the hangout? Not exactly. We had a productive and collaborative discussion around the important topic of data management, but there was no strong consensus on any of the questions. Perhaps that’s because we’re all still in the early stages of the data science journey. Or perhaps there’s just no single approach that will work for every business.
We’d love to know what your team or organization is doing around any of these points. Do you pay for data? Are there data or API marketplaces that you use? Are you versioning your data? Do you have some kind of wiki for your data? If you have any feedback or insights, please leave them in the comments section so we can all share, learn and grow together.