What’s your data strategy? Defining the data hierarchy

Over the past decade, startups and enterprises have devoted hefty resources to collecting and analyzing huge volumes of data. For some, data is used to fine-tune a product; for others, data forms the foundation of the product itself.

When it comes to building a startup around data, the more unique that data, the better. As I wrote last week in reference to machine learning startups, algorithms have mainly become a commodity these days. Building a company around publicly available data just isn’t defensible.

What are the different types of data and where do they rank on the ‘uniqueness hierarchy’? We see four major classifications (in order of increasing uniqueness):

1. Accessible public data:

This is data that’s readily available on the web; all you need is an API to access it. Examples are Google Maps and open government data like Europe’s public data and San Francisco data.

2. Raw public data:

This is data that’s available to the public, but requires a lot of legwork (e.g. cleaning and scrubbing) to be usable. For this reason, its accessibility is limited to those with the technical know-how and resources.

3. Proprietary user data:

This is data that users create or share and can be used according to the site’s Terms of Service. In all cases, users are ‘opting in’ to share their data, whether by creating a product review on Amazon, liking a post on Facebook, or sharing their bank account activity with Mint.com. Keep in mind that while this data is proprietary, it’s not necessarily exclusive as users can share the same kind of content on multiple platforms.

4. Exclusive user data:

This is behavioral data that tracks how a user interacts with a product/site. Such data is typically captured in the background and is site-specific, which makes it a very valuable and exclusive feedback loop that can be used to improve the product. An example is tracking a user’s search behavior to deliver better search results in the future (more on this below).

How are companies using data?

Here are two examples that best illustrate how companies are using data across the various levels of the ‘uniqueness hierarchy.’

The first example is Google. Google started with publicly available data (type 2), but as they developed their product, they gained access to exclusive user data (type 4). They used that data to refine and personalize a user’s search results and create a vastly superior product. They became the de facto search product. And, as more people use the product, they get more user data, further strengthening their moat.

Google was able to build their initial product with publicly available data, since no one else was aggressively pursuing the same space at the time. They then built a daily use case product that throws off tons of exclusive data to fuel their growth.

The second example is user reviews on Amazon (as well as sites like Netflix, TripAdvisor…). Amazon has found a way to incentivize its users to share lots of proprietary data in the form of product reviews (type 3). The real added value for Amazon, however, comes from combining these reviews (type 3) with behavioral data (type 4), e.g. what the user buys and what they look at but don’t buy. This has enabled Amazon to develop truly personalized and effective recommendations.
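To make the idea of combining the two data types concrete, here is a minimal sketch in Python. It is not how Amazon does it; the toy data, the event weights, and the `combined_scores` function are all hypothetical and chosen only for illustration. It mixes explicit star ratings (type 3) with behavioral events (type 4) into a single per-user, per-item score.

```python
from collections import defaultdict

# Hypothetical toy data: explicit reviews (type 3) and behavioral events (type 4).
reviews = {("alice", "tent"): 5, ("alice", "stove"): 2, ("bob", "tent"): 4}
events = [
    ("alice", "tent", "purchase"),
    ("alice", "lantern", "view"),
    ("alice", "lantern", "view"),
    ("bob", "stove", "purchase"),
]

# Assumed weights: a purchase is treated as a stronger signal than a page view.
EVENT_WEIGHTS = {"purchase": 3.0, "view": 0.5}


def combined_scores(reviews, events, review_weight=1.0):
    """Return {(user, item): score} mixing star ratings with behavioral signals."""
    scores = defaultdict(float)
    for (user, item), stars in reviews.items():
        # Center ratings on 3 stars so poor reviews count against an item.
        scores[(user, item)] += review_weight * (stars - 3)
    for user, item, kind in events:
        # Unknown event types contribute nothing.
        scores[(user, item)] += EVENT_WEIGHTS.get(kind, 0.0)
    return dict(scores)


scores = combined_scores(reviews, events)
```

In this toy data, alice never reviewed the lantern, yet it still gets a positive score because she viewed it twice. A dataset made up of reviews alone would have no signal for that item at all.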

Can anyone else recreate a recommendation system at Amazon’s level? It depends on how individual tastes are and how many data points there are to start with. In my opinion, it’s much easier to build a recommendation system for movies than a broad product marketplace, since personal tastes for films are more mainstream and the underlying dataset is much smaller. In addition, companies that only have reviews, but no transaction data (e.g. Yelp), have less valuable datasets.

What does all this mean for your startup and data strategy?

There are a few takeaways from all this:

  1. Building a startup on publicly available data is hard unless you can come up with a killer daily use case and very quickly accumulate user data that helps improve the product significantly and enables you to build a moat (the Google example).
  2. Access to unique data is crucial, but combining it with user data is even more important. The incremental value of this depends on how unique the personal tastes/preferences are and how complex the underlying dataset is.
  3. And lastly, true data network effects can be built with data types 3 and 4.
