The meaningful use of big data throws up a host of challenges, but the potential benefits are massive, write Axel Heitmueller and Sandy Pentland

There has been much rhetoric surrounding big data. The vagueness of its definition has allowed it to be hailed as the magic bullet for a surprisingly diverse set of problems. But, as with any hype, that vagueness has also allowed a colourful mix of myths to develop, providing a free-for-all for those who see big data as a significant threat to privacy.

A case in point is the recent health warning in the Financial Times by economist Tim Harford, which takes issue with the apparent claim that big data constitutes the end of theory as we know it.

First, big data was praised for operating outside the rules of statistical analysis, and now it is being criticised for doing just that. But is this really the fundamental point we should be focused on?

The simple fact is that big data is no different from small data; the basic rules of statistics are not suspended and never have been.

We would argue that this debate, while valid, has distracted from a more fundamental issue: the real barriers to realising big data’s full potential. Let’s explore this point more through the use of two examples.

Linking data sets

For more than a decade we have known that one of the most powerful interventions for patients with chronic obstructive airways disease (COPD) is a community rehabilitation programme of activity and education. Chances are that if you get discharged from a hospital in the UK today with an acute episode of COPD you will be referred to such a programme. In reality, what happens is that you will be discharged into a void.

Currently, the NHS has no effective way of tracking and proactively supporting you through your community programme and back into primary care, because this would involve linking at least three different data sets: that of the hospital, that of your GP and that of the community rehabilitation provider, which may be an NHS organisation or a private or voluntary provider.

The consequences for patients, healthcare professionals and commissioners of services can be profound, resulting from a lack of communication and coordination as well as accountability.

Digital breadcrumbs

The last time you used your loyalty card in a supermarket you will have left “digital breadcrumbs” of some significance, both with the supermarket and with your bank.

If you and others do this often enough, they can work out the optimal product range for your local supermarket, the days on which to stock particular items and which loyalty rewards will make you most likely to come back for more. What they could also do is understand your health, or that of your fellow shoppers, in some detail and work with you or your local health authority on targeted public health campaigns or individualised support. In all likelihood they are not doing so, and you may not even want them to.

Two very different examples of big data posing a common trade-off: do the benefits outweigh the potential risks to personal privacy?

The answer to this question will differ from person to person and, more significantly, from society to society, and will be intrinsically linked to the prevailing social norms. This is a core dilemma for big data.

Articles of faith

But of course there is also an implicit assumption that if only we had more and better data we would be able to turn it into something useful.

According to Harford, we had better be careful. His critique calls into question what he labels the four “articles of faith” of big data. Running through a list of popular examples such as Google Flu Trends, he makes a simple observation: big data does not operate outside the realm of statistical theory. He’s right.

Not having a theory to explain data patterns, confusing correlation with causation and ignoring sampling errors and biases remain cardinal sins for big data as much as they are for traditional data analysis.

Big data is not special in that respect; what is special is the hype that surrounds it.

Alexis C. Madrigal, senior editor of The Atlantic, summarised it neatly when he said: “New technology comes along. The hype that surrounds it exceeds that which its creators intended. The technology fails to live up to that false hope and is therefore declared a failure in the court of public opinion.”

His article is well worth reading if you want a more nuanced story of how Flu Trends came into being and how it interacts with and complements traditional government flu prediction systems.

But statistical viability is the easy bit. We have to get this right, but it is not a reason for calling into question the overall concept. If we did, we would have to do the same for small data. Instead, we have developed mechanisms to police statistical analysis, such as peer review and greater transparency of the data and methods used. This has not always been sufficient, and there are a number of high-profile cases where analysis has gone wrong.

The more pertinent risks to achieving the potential benefits of big data lie elsewhere. We argue there are at least three fundamental challenges standing in the way.

Social norms

First, social norms, such as the level of trust and privacy preferences, form a powerful context for the big data debate. Ignore them at your peril. We have chosen the COPD example carefully. This is not what most people would immediately regard as big data. But it is a very good demonstration of the power of linked data.

How difficult it is to achieve even this small step in a national health system has been illustrated by the recent debacle of Care.data, which, after months of public backlash, has been “paused”.

In our view, one key problem has been that policymakers took it for granted that the benefits outweigh the risks and that this is widely understood by patients and the wider public, instead of making a careful public case within the constraints of the prevailing social norms.

Wider societal benefits

Second, a key assumption in the big data debate, particularly in the healthcare context, is that there are wider societal benefits from linking and sharing data in new and more extensive ways. These wider benefits (or “positive externalities”) may be in conflict with the private interests of those owning or controlling the data in the first place. Our second example may just be a case in point: why would a supermarket share data with public health bodies, effectively giving away years of internal investment and innovation?

Finally, there is a range of “technocratic barriers” in the way of linking and sharing data. These include technology, common data standards and existing legislation governing the use of data. They may also include the wider capability of governments to safely process and store large datasets containing sensitive personal or commercial information.

Governments have a role to play in all three areas. But while most countries have made significant strides in opening up government data (see: data.gov.fr, data.gov.uk, the Open Data Institute and the Open Data Research Network), few have a comprehensive policy strategy addressing these challenges. More importantly, most information governance frameworks were written well before big data came on the scene.

We are making a big mistake if we dismiss big data on the basis of statistical viability. Its potential is huge. Let’s focus on the real issues, not on misplaced expectations that its creators never intended.

Dr Axel Heitmueller is an honorary fellow at Imperial College London and director of strategy at Imperial College Health Partners, and Professor Sandy Pentland is Toshiba professor of media arts and sciences and director of the Media Lab entrepreneurship programme at MIT