How health researchers are already using pseudonymisation of data at source

Data scramble: QRESEARCH DATABASE, NOTTINGHAM UNIVERSITY

As a GP and an academic, Julia Hippisley-Cox has a better idea than most about the value of good patient information - and the more rounded the data the better.

The GP record is valuable, of course, but if you can add hospital episodes, cancer registrations, and even cause of death then you can get a really good picture of what’s going on all along the patient pathway.

What you don’t need, at least in the vast majority of cases, is information that identifies the patient.

Professor Hippisley-Cox, professor of clinical epidemiology and general practice in the division of primary care at the University of Nottingham, is a pioneer in pseudonymised data.

Back in 2002, she co-founded the QResearch database, a joint not-for-profit partnership between Nottingham University and EMIS that holds pseudonymised data from around 700 GP practices. This was linked to cause of death data in 2007, and then to hospital episode statistics and cancer registration data in 2011-12.

Each new dataset linkage added a further dimension to the available information, offering more opportunities for research.

So randomised

It was the need to find a practical way of linking the data, while retaining patient confidentiality, that gave rise to the pseudonymisation at source software.

Essentially, the process involves taking the NHS number and adding a random password to it. The combined value is then converted using a one-way hashing algorithm to create a unique string of 128 characters.

You then apply the same process to each of the data sources. When the data is sent to a third party - in this case, Nottingham University - there is no way of identifying an individual patient.
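
As a rough illustration of the steps just described, the sketch below appends a shared secret to an NHS number and hashes the result. The article does not name the algorithm the software uses; SHA-512 is an assumption here, chosen only because its hexadecimal digest happens to be 128 characters long. The function name `pseudonymise`, the secret and the NHS number are all hypothetical.

```python
import hashlib
import secrets


def pseudonymise(nhs_number: str, shared_secret: str) -> str:
    """Return a one-way pseudonym for an NHS number.

    Assumption: SHA-512 stands in for the unnamed hashing algorithm.
    """
    # Strip spaces so "943 476 5919" and "9434765919" hash identically.
    normalised = nhs_number.replace(" ", "")
    # Append the secret ("random password") and hash the combined value.
    digest = hashlib.sha512((normalised + shared_secret).encode("utf-8"))
    return digest.hexdigest()


# Hypothetical secret, generated once and applied at every data source.
shared_secret = secrets.token_hex(32)

# The same illustrative number as it might appear in two different sources.
gp_key = pseudonymise("943 476 5919", shared_secret)
hospital_key = pseudonymise("9434765919", shared_secret)

print(len(gp_key))             # length of the pseudonym
print(gp_key == hospital_key)  # whether the two sources produce the same key
```

Because every source applies the same secret and the same hash, the resulting strings match across datasets, while the recipient never sees the NHS number itself.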

“It doesn’t have real-world meaning,” says Professor Hippisley-Cox. “But it allows you to link the datasets together.”

The result is a treasure trove of data that can be used for myriad research purposes. For example, you could look at patients who have been admitted to hospital for stroke, then check the GP record to see if there had been any signs and symptoms, or if they had been on a particular drug, then you could check the outcome in terms of cause of death.
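
A query of that kind might look something like the sketch below. It is only an illustration: the three extracts, their field names and the short pseudonym strings are invented for the example, and real QResearch data will be structured differently.

```python
# Hypothetical extracts, each keyed by the same pseudonym ("pid").
# Real pseudonyms would be long hashed strings; short ones are used for clarity.
hospital_admissions = [
    {"pid": "a3f1", "diagnosis": "stroke"},
    {"pid": "77c2", "diagnosis": "fracture"},
    {"pid": "e09b", "diagnosis": "stroke"},
]
gp_prescriptions = {"a3f1": ["drug_x"], "77c2": ["drug_y"], "e09b": []}
deaths = {"e09b": "cerebral infarction"}


def link_stroke_patients(admissions, prescriptions, death_records):
    """Join the three sources on the pseudonym for patients admitted with stroke."""
    linked = []
    for row in admissions:
        if row["diagnosis"] != "stroke":
            continue
        pid = row["pid"]
        linked.append({
            "pid": pid,
            # Drugs recorded in the GP extract for this pseudonym, if any.
            "prior_drugs": prescriptions.get(pid, []),
            # None when no death has been recorded for this pseudonym.
            "cause_of_death": death_records.get(pid),
        })
    return linked


stroke_cohort = link_stroke_patients(hospital_admissions, gp_prescriptions, deaths)
for patient in stroke_cohort:
    print(patient)
```

At no point does the join need a name, address or NHS number; the pseudonym alone ties the records together.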

“It allows you to capture events that might only be recorded on one data source,” explains Professor Hippisley-Cox. “For example, you could look at patients started on a new tablet, and you can follow them through the system and look for outcomes or side-effects.”

Linking data has practical, real-world benefits too, she says. She points to QRisk, a system which identifies people at high risk of heart attack, and which is now recommended by NICE as the standard tool.

Pseudonymisation at source was developed after Professor Hippisley-Cox, looking for a way of scrambling the NHS number, spoke to the GP software developer EMIS, which was also keen.

“We wanted to get a standard way of doing it so that any legitimate organisations that wanted to link data for patient benefit could get together.

“We held a series of workshops and there was a phenomenal response from other GP software providers, including TPP and In Practice, as well as the Department of Health and all sorts of other organisations. It was quite remarkable.”

What people wanted, she explains, was a way of pseudonymising data before it left the system.

“If the data is protected and has the right information governance controls, it’s better for patient confidentiality,” she adds. “And from the perspective of GPs, as data controllers, this isn’t identifiable data, so it doesn’t fall under the Data Protection Act.”

Although it falls outside the provisions of the act, there is still a need to ensure that patients know about it and have a chance to opt out, she adds - but this is a matter of good practice rather than law.

The workshops were oversubscribed, she says, and the pseudonymisation at source software is free to use.

“We wanted to remove barriers to using it,” she says, adding that it will work on any platform.

It was for her work on this project that Professor Hippisley-Cox won the John Perry prize last year - an award set up to recognise outstanding contributions to primary care computing.

Care.data concerns

So is she pleased with the achievement? Up to a point.

“I’m pleased with the project but we need people to get over the idea that they need patient identifiers.”

She feels that the software would answer many of the concerns that led to the postponement of Care.data.

“I’m a bit frustrated at the moment,” she says. “To me it’s a no-brainer. There’s so much benefit to be had from using patient information. If [NHS England] had started working on this earlier then we could have seen some progress by now.”

Patients with no names: CANCER EPIDEMIOLOGY GROUP, LEEDS UNIVERSITY

Without access to linked patient datasets, Amy Downing’s job simply wouldn’t be possible. And that would be a shame, both for the NHS and for the population as a whole.

Dr Downing is a research fellow in cancer epidemiology at the Leeds University Institute of Cancer and Pathology. Her research focuses on the analysis of linked datasets, such as cancer registry and hospital admissions data, to follow patient pathways and relate these to outcomes.

The work of her department means that, for example, trusts have a better idea of the factors that might lead to good - or bad - outcomes, and gives them a chance to adapt the pathway accordingly.

Essentially, the more information about what happens to each patient, at each stage, the better. Crucially, however, she never needs to know the patient’s name.

Even where her group has permission to hold identifiable data, it chooses not to use it, working instead with data that has been pseudonymised at source. This is mainly to minimise risks, but it is also possible because identifying the patient wouldn’t necessarily add anything to the research.

“Our work in the cancer epidemiology group at Leeds University consists primarily of the analysis of linked datasets,” she explains.

Bigger picture

The more information sources the better, in many ways.

“It helps us get a more rounded picture. Some of the data isn’t very good, and the more sources we have, the truer a picture we can get.”

As researchers, the group are given extracts of data, which are processed and stored according to the ethical approvals for the specific project.

“For a lot of the research we do we have full section 251/relevant ethical approvals to hold identifiable data but to minimise risks we would carry out analyses on anonymised/pseudonymised versions of the data,” she says.

“A lot of the time we have data where the identifiers (for example, date of birth, postcode, NHS number) have been replaced with an ID number. This number can be related back to the original data by the database analysts who performed the original linkage.”
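
The arrangement Dr Downing describes could be sketched along the following lines. This is an assumption about how such a replacement might work, not a description of the Leeds group's actual tooling: the function `replace_identifiers`, the field names and the sample records are all hypothetical.

```python
import itertools


def replace_identifiers(records, id_fields=("nhs_number", "date_of_birth", "postcode")):
    """Swap identifying fields for a study ID.

    Returns the research extract plus a key table that only the
    linkage analysts would keep.
    """
    counter = itertools.count(1)
    ids_by_nhs = {}        # NHS number -> study ID, used during linkage only
    key_table = {}         # study ID -> original identifiers
    research_extract = []  # what the researchers receive
    for rec in records:
        nhs = rec["nhs_number"]
        if nhs not in ids_by_nhs:
            study_id = next(counter)
            ids_by_nhs[nhs] = study_id
            key_table[study_id] = {field: rec[field] for field in id_fields}
        # Drop every identifying field and attach the study ID instead.
        stripped = {k: v for k, v in rec.items() if k not in id_fields}
        stripped["study_id"] = ids_by_nhs[nhs]
        research_extract.append(stripped)
    return research_extract, key_table


# Made-up records: two events for one patient, one event for another.
records = [
    {"nhs_number": "NHS-A", "date_of_birth": "1950-01-01", "postcode": "ZZ1 1ZZ", "event": "diagnosis"},
    {"nhs_number": "NHS-A", "date_of_birth": "1950-01-01", "postcode": "ZZ1 1ZZ", "event": "surgery"},
    {"nhs_number": "NHS-B", "date_of_birth": "1962-06-15", "postcode": "ZZ2 2ZZ", "event": "diagnosis"},
]

extract, key_table = replace_identifiers(records)
print(extract)
```

Researchers would work only with `extract`; the key table stays with the analysts, which is what lets the ID number be related back to the original data when that is genuinely needed.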

The team’s work has been well received, she says, and certainly the results are impressive and potentially useful.

“The analysis of the linked data has allowed us to look at a wide variety of topics such as variations in post-operative mortality across trusts, factors associated with early mortality after cancer diagnosis, management and treatment of the elderly. We are also looking at extending the use of the data, for example linking patient reported outcomes data with treatment and survival data.”

Although she is using the data for research, Dr Downing’s work has real-life NHS applications. For example, much of her activity has focused on breast cancer, bringing together information about care and outcomes, with a particular emphasis on applying novel methodologies to gain a better understanding of what is often complex data.

Since 2011, she has been funded by a Cancer Research UK grant looking at national colorectal cancer intelligence. Again, the idea is to find appropriate statistical methods of looking in detail at the care and outcomes of people with this disease.

Recently the department has also completed a piece of work which involved analysing the results of a survey of colorectal cancer survivors about their quality of life after treatment. This has allowed the researchers to feed back “toolkits” to trusts so that they can use the information to improve the patient experience - or to build on good practice.

Dr Downing has no doubt that the sharing of data, and the linking of different datasets, is a positive development, but says she understands the fears of people who are concerned about the confidentiality of their patient information and health records.

“People read media reports about patient information being sold to private companies and are concerned about that, which is a shame.

“Sharing patient information has the potential to do so much good, and our work shows that it doesn’t have to be identifiable patient information. That’s a message that we really need to get across.”

HSJ/NHS Confederation supplement: 'Pseudonymised' records and an inspiring hospital model