Are data more like oil or sunlight?

The question highlights the many different faces of data

Feb 20th 2020

PASSIONATE GRAMMARIANS have long quarrelled over whether data should be singular or plural (contrary to common usage, this newspaper is sticking with the latter, for now). A better question is: why are data so singularly plural? That is, why do they have so many different faces?

For an answer, start with the many metaphors used to describe flows of data. Originally they were likened to oil, suggesting that data are the fuel of the future. More recently, the comparison has been with sunlight because soon, like solar rays, they will be everywhere and underlie everything. There is also talk of data as infrastructure: they should be seen as a kind of digital twin of roads or railways, requiring public investment and new institutions to manage them.

The multiplication of metaphors reflects the malleable economics of data. First, they are “non-rivalrous”: since they are infinitely copyable, they can be used by many people without limiting the use by others. But they are also “excludable”: technologies like encryption can control who has access to them. Depending on where one sets the cryptographic slider, data can indeed be private goods like oil or public goods like sunlight—or something in between, known as a “club good”.
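The two properties named above can be read as the axes of the textbook two-by-two taxonomy of goods. The sketch below is a minimal illustration of that taxonomy, not a model of the cryptographic slider itself: the names `GoodType` and `classify_good` and the three examples are hypothetical, and the strict grid is coarser than the looser "private good like oil" usage in the text, since a non-rivalrous good that is fully locked away lands in the club-good cell.

```python
from enum import Enum


class GoodType(Enum):
    """The four cells of the standard rivalry/excludability grid."""
    PRIVATE = "private good"
    COMMON_POOL = "common-pool resource"
    CLUB = "club good"
    PUBLIC = "public good"


def classify_good(rivalrous: bool, excludable: bool) -> GoodType:
    """Return the textbook category for a good with the given properties."""
    if rivalrous:
        return GoodType.PRIVATE if excludable else GoodType.COMMON_POOL
    return GoodType.CLUB if excludable else GoodType.PUBLIC


# Hypothetical examples, chosen only for illustration.
examples = {
    "barrel of oil": (True, True),
    "encrypted data set shared among members": (False, True),
    "openly published data set": (False, False),
}

for name, (rivalrous, excludable) in examples.items():
    print(f"{name}: {classify_good(rivalrous, excludable).value}")
```

On this reading, encryption changes only the second argument: data stay non-rivalrous, and it is the degree of excludability that moves them between the club and public cells.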

This in turn means that there is not just one data economy, but three more or less distinct ones, each with its own ideology. And the big question is whether one will come to dominate, or whether the mirror world will be as much of a mixture as the real one.

If oil is still the most-used metaphor, it is because comparing data to the black stuff is easy. Like oil, data must be refined to be useful. In most cases they need to be “cleansed” and “tagged”, meaning stripped of inaccuracies and marked to identify what can be seen, say, on a video. This has spawned a global industry employing hundreds of thousands of people, mostly in low-wage countries. Scale AI, a startup in San Francisco, employs 30,000 taggers around the world who review footage from self-driving cars and ensure the firm’s software has correctly classified things like houses and pedestrians.

Before data can power AI services, they also need to be fed through algorithms, to teach them to recognise faces, steer self-driving cars and predict when jet engines need a check-up. And different data sets often need to be combined for statistical patterns to emerge. In the case of jet engines, for instance, mixing usage and weather data helps forecast wear and tear.

The oil metaphor also rings true because some types of data and some of the insights extracted from them are already widely traded. Online advertising is perhaps the biggest marketplace for personal data: clicks are bought and sold based on a detailed digital profile of each viewer. It was worth $178bn globally in 2018, according to Strategy&, a consultancy. Data brokers, which can track thousands of data points for each individual, do brisk business with personal information, too. They sell it to everyone from banks to telecoms carriers, generating annual revenue of more than $21bn, says Strategy&.

Offering insights from mining data can be very profitable, too. On Kaggle, a website owned by Google that hosts machine-learning contests, thousands of teams of data scientists compete against each other to see who can come up with the best algorithms to predict a building’s energy consumption or to detect “deepfake” videos, with prizes sometimes exceeding $1m. Selling insights is also how Facebook and Google make money. They hardly ever sell data, but they do sell insights about who is the best target for advertising.

Yet data have failed to become “a new asset class”, as the World Economic Forum, a conference-organiser and think-tank, predicted in 2011. Most data never change hands, and attempts to make them more tradable have not taken off. To change this, especially in Europe, manufacturers are pushing to secure property rights for the data generated by their products. Others want consumers to own the data they create, so they can sell them and get a bigger cut from their information.

Again, economics gets in the way. Although data are often thought of as a commodity, corporate data sets, in particular, tend not to be fungible. Each is different in the way it was collected, and in its purpose and reliability. This makes it difficult for buyers and sellers to agree on a price: the value of each sort is hard to compare and changes over time. A further barrier to trading is that the value of a data set depends on who controls it. What might simply be data exhaust to one firm could be digital gold to another. “There is no true value of data,” says Diane Coyle of the University of Cambridge.

As for personal data, defining property rights is tricky, because much information cannot be attributed to one person. Who, for instance, owns the fact that a dating site has matched a couple? The couple themselves? Or the service? Complicating matters, data have plenty of externalities, both positive and negative, meaning that markets often fail. Why should a social network, say, buy the data of an individual if it can make quite accurate predictions about him by crunching data from other users?

Although data are unlikely ever to be traded as widely as oil, tech firms keep trying to make trading them easier. Amazon Web Services (AWS), the cloud-computing arm of the e-commerce giant, recently launched a marketplace that aims to make trading in data as easy as possible. It works a bit like an online store for smartphone apps: buyers subscribe to feeds, agree to licensing conditions, and AWS processes the payment.

Light stuff not black stuff

As the oil metaphor is seen as increasingly problematic, the comparison to sunlight or similar resources, such as air and water, has risen in favour. Many people who prefer this metaphor ask: if data do not really lend themselves to being turned into a tradable good, why even try? Would it not be better instead to ensure that data are used as much as possible? After all, this would maximise social wealth. In other words, nobody puts up curtains and tries to charge for sunlight.

This line of argument has already given birth to what is known as the “open-data” movement. Its champions push organisations and universities to give away their data so they can be widely used, for instance by startups. Today, most governments, national or otherwise, boast an open-data project, although the quality of the data made available varies greatly.

More recently, companies have started to publish their data, too. Several firms that work on self-driving cars have shared some of the information collected by their vehicles. “For researchers to ask the right questions, they need the right data,” according to Dragomir Anguelov, principal scientist at Waymo, a firm owned by Alphabet, Google’s parent, that is one of the companies that have done this. Others are working on technology to make such data-sharing easier: Microsoft and other software makers will soon start to implement what they call the “open-data initiative”.

Some see such efforts as the beginning of an open-source movement for data, much like the approach that now rules large parts of the software industry. And Microsoft, in particular, is keen to see this happen. “We need to democratise AI and the data on which it relies,” writes Brad Smith, the firm’s president and chief legal officer, in his recently published book, “Tools and Weapons”. Unsurprisingly, this position also smacks of self-interest: Microsoft does not make much money from data directly, but does from tools and services that handle data.

Like the oil comparison, however, the data-as-sunlight analogy breaks down: open data, too, can go only so far. For personal data, the main limitation is increasingly strict privacy laws, such as the EU’s General Data Protection Regulation (GDPR), as well as the California Consumer Privacy Act (CCPA), which will start being enforced in July. For corporate data the checks are economic in nature: generating good data is expensive and they can reveal too much about a firm’s products. “Companies will make very strategic decisions about what data sets they will make public and which ones they will keep to themselves,” explains Michael Chui of the McKinsey Global Institute, a consultancy’s think-tank.

Separating what can be safely shared from what should be closely guarded will be tricky, but technology should, in time, make such decisions easier. Something called “differential privacy”, for instance, adds carefully calibrated random noise to a data set, or to the answers drawn from it, so that the statistical patterns survive but no individual’s record can be singled out. “Homomorphic encryption” allows algorithms to crunch data without decrypting them. And blockchains, special databases of the sort that underlie many digital currencies, enable people and companies to manage in minute detail who is allowed to access what data and to track who has done so.
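Differential privacy is commonly implemented by adding random noise to the answers a data set gives, scaled so that any one person's presence or absence barely changes the result. The sketch below shows one such approach, the Laplace mechanism applied to a simple count, using only Python's standard library. It is an illustration under stated assumptions, not the method used by any project named in this article: the function names, the decibel readings and the choice of `epsilon` are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical noise-level readings in decibels, made up for illustration.
readings = [52, 61, 70, 48, 66, 73, 58, 64]
rng = random.Random(0)  # seeded only so that repeated runs match
noisy = private_count(readings, lambda db: db > 60, epsilon=0.5, rng=rng)
print(f"noisy count of readings above 60 dB: {noisy:.2f}")
```

Any single answer is deliberately imprecise, but averaged over many queries or many people the underlying pattern remains visible, which is the trade-off the technique is designed to strike.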

Slowly these technologies are being deployed. DECODE, an initiative financed until last year by the European Union, has used a combination of them to create tools that allow people to control the data they generate and collect about their environment, for instance, on noise levels and air quality. They are being tested in Amsterdam and Barcelona. Oasis Labs, another startup in San Francisco, has built something similar for health data. Its first service, which will launch soon, will let users donate genetic information to research projects.

Such data-dividing technologies are also grist to the mill of those who liken data to infrastructure. You have to travel many digital roads—and combine many data sets and streams—to get to new insights, says Jeni Tennison, who heads the Open Data Institute, a research outfit based in Britain. Some will be private toll roads, others public multi-lane highways, but many need to be operated as shared digital resources managed in a “club” by users.

Yet technology alone will not be enough to create these “club goods”. They also need institutions that provide what Ms Tennison calls “data stewardship”. Data trusts, data co-operatives, personal data stores—all are different in detail, but the idea is essentially the same: they provide a governance structure to organise access to data in a way that takes into account the interests of those producing and using a particular sort of data.

It is early days, but such data clubs have started to pop up in many places. MIDATA is a Swiss co-operative that collects and manages members’ health-care data. In Taiwan Audrey Tang, the digital minister, has created an ongoing “Presidential Hackathon” to set up “data collaboratives”, including several for environmental data. In Finland, Sitra, a policy outfit, has launched a similar competition to help get “fair data exchanges” off the ground.

New thing on the old continent

Most projects are still small and live on the public dime, which raises doubts about whether they will ever be a big part of the data economy. But whether they are successful or not is a question of political will, says Francesca Bria, the founder of the DECODE project. Cities in particular, she argues, need to create alternatives to the big online platforms, which treat data they collect as their own. A former chief technology officer of Barcelona, she turned the city into a model of what is possible, which is now copied elsewhere in Europe. Not only can Barcelona’s citizens control the data the city holds on them, but its suppliers must add the information they gather while delivering services to the municipal data commons.

Given their respective limitations, none of the three sorts of data economies will dominate, but they are likely to have strongholds. In America data are treated like oil: whoever extracts them owns them. China—although it, too, has data-hungry online platforms of its own, including Alibaba and Tencent—is an extreme example of a place where data are public goods. They are ultimately controlled by the government, which is pushing firms to pool certain types, such as health data. In Europe, many regulators have come to see data as infrastructure. The new European Commission in Brussels has big plans to support the creation of data trusts.

This sounds as if the EU is about to condemn itself to remaining a tech laggard. But this need not be the case. A “fair data-economy”—one that takes into account the interests of citizens and consumers, who will generate much of the fuel of the future—may prove to be quite competitive, says Luukas Ilves, the co-author of a report for Sitra in Finland. If people, as well as firms, can trust the continent’s data infrastructure, they will be willing to share more and better data, which means better services for everyone. If such a “virtuous cycle” were to take off, it would be quite a reversal of the old world’s fortunes.■

This article appeared in the Special report section of the print edition under the headline "Digital plurality"

From the February 22nd 2020 edition
