Privacy matters: When is personal data truly de-identified?

25.07.2009
The U.S. Department of Health and Human Services (HHS) is about to rule on whether health care entities will need to notify patients if their de-identified data -- patient data that has been stripped of all potential for identifying individuals, and which is often used for research and development -- is breached. As it stands now, de-identified data is not subject to the new breach-notification rules imposed by the HITECH privacy provisions of the 2009 American Recovery and Reinvestment Act (ARRA) stimulus package. The debate pits privacy activists, who often support notification, against health care organizations, which say the quality of health care hangs in the balance.

This debate hasn't been getting much attention. That's unfortunate, because the outcome could have broader implications within the U.S. and even around the world. Validating that personal data can be de-identified in a way that still retains commercial and social usefulness could set a precedent for many other privacy-related standards and debates.

The ruling will come amid a flood of medical data-breach notifications in California, the first state to impose this requirement. Since January of this year, 823 medical-data breaches have been reported to the state government; the state investigated 122 and confirmed 116 as breaches. One of them -- inappropriate staff access to the files of the so-called Octomom -- resulted in the statute's maximum fine of $250,000. So the stakes in how the question of de-identified data is resolved are high.

Let me confess my bias. I think the people who came up with the de-identification rule should be given a Presidential Medal of Freedom. I'm exaggerating only a little. The rule is one of the most practical innovations in privacy regulations worldwide and arguably has saved lives.

How does de-identification work? According to section 164.514 of Title 45 of the Code of Federal Regulations, a HIPAA-covered entity has two choices if it wants to de-identify patient data and use it for any purpose, such as research and development:

1. The "safe harbor" method: remove 18 specified categories of identifiers -- names, geographic subdivisions smaller than a state, most dates, phone numbers, Social Security Numbers and the like -- from the data set.

2. The statistical method: have a qualified expert determine, and document, that the risk of re-identifying individuals in the data set is very small.

HIPAA also provides a third, "limited data set" method. Under this method, the covered entity removes 16 of those 18 identifiers but guards the remaining data with additional security precautions. It can use the data set for research and development, but the remaining data is still regarded as protected health information (PHI) subject to HIPAA.
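The safe-harbor approach amounts to field-level suppression plus coarsening. A minimal sketch in Python (the field names and the truncation rules for dates and ZIP codes are illustrative assumptions, not the complete list of 18 identifier categories):

```python
# Sketch of HIPAA "safe harbor"-style de-identification: drop direct
# identifiers from a patient record, reduce dates to the year, and
# truncate ZIP codes to their three-digit prefix. Field names below
# are hypothetical examples only.

SAFE_HARBOR_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_number", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
    "biometric_id", "photo",
}

def de_identify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed,
    dates reduced to the year, and ZIP codes truncated to 3 digits."""
    clean = {k: v for k, v in record.items()
             if k not in SAFE_HARBOR_IDENTIFIERS}
    if "discharge_date" in clean:       # "2009-06-15" -> "2009"
        clean["discharge_date"] = clean["discharge_date"][:4]
    if "zip" in clean:                  # "61601" -> "616"
        clean["zip"] = clean["zip"][:3]
    return clean

record = {"name": "Judy Smith", "ssn": "123-45-6789",
          "zip": "61601", "discharge_date": "2009-06-15",
          "diagnosis": "heart disease"}
print(de_identify(record))
# {'zip': '616', 'discharge_date': '2009', 'diagnosis': 'heart disease'}
```

Note how much is lost even in this toy version: the exact discharge date disappears, which is precisely why clinical researchers find safe-harbor data of limited value.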

No other country has developed more rigorous or detailed guidance for how to convert personal data covered by privacy regulations into non-personal data (see Table 2). Even so, clinical-research organizations rarely use the "safe harbor" and "limited data set" options, because they must strip out so much information -- particularly dates of service and discharge -- that the remaining data is almost worthless from a clinical-research perspective.

What's the benefit to America of this obscure de-identification rule? It enables health care organizations to convert patient data they otherwise couldn't use into a format they can apply to a range of other purposes -- among them improving the efficacy of drugs and medical devices and identifying the optimal places to build new health care facilities.

And I haven't heard of a single case of a de-identified data set being breached by criminals and re-identified. I checked the major running tallies of data breaches and came up empty.

That's probably because there's far less economic incentive for a criminal to go after medical data than credit card information. It's harder to monetize knowing that Judy Smith of Peoria has heart disease -- by filing false claims in her name, for example -- than to have Judy's credit card number and expiration date. If I'm a criminal with advanced data skills and a day to spend, I'm going after financial data, not health data.

That's why I'm biased in favor of the HIPAA de-identification criteria. I think they advance public health without compromising privacy. But the activists calling for change make some arguments worth considering.

They often point to four cases:

And last month, two Carnegie Mellon researchers made headlines when they released the results of a study in which, in fewer than 1,000 attempts, they were able to identify all nine digits of the Social Security Numbers of 8.5% of deceased people who were born after 1988.

This academic version of a hacking contest -- where hackers try to outdo each other -- has led de-identification purists to gravitate toward the so-called "k-anonymity" method of statistical de-identification. Hopefully HHS will back-burner this option, because k-anonymity is to data what chemotherapy is to human tissue: It destroys the good when going after the bad.

According to Columbia University epidemiologist and statistical de-identification expert Daniel Barth-Jones, "The problem with certain de-identification approaches [such as k-anonymity] is that they can badly distort the accuracy of statistical analyses."

"Progress on numerous goals for the government's health IT agenda like quality improvement, patient safety and reducing health disparities could be seriously stunted, or even do more harm than good," he added, "if we aren't conducting our analyses with data that has been de-identified with a rigorous approach for preserving statistical accuracy."

But the growing availability of data on people and improvements in re-identification methodology have nonetheless convinced several privacy advocates that it's time to change the HIPAA de-identification rule. Some have recently submitted public comments on the impending changes to HIPAA. What are they saying?

I wonder what social good would be accomplished by sending out breach-notification letters for de-identified or limited data sets that were mishandled. I can just see it:

"Dear Grandma Cline, this is the hospital you just visited. I hope you had a pleasant stay. We regret to inform you that there has been an incident. One of our hospital staff recently lost a USB drive we believe may have contained a set of statistics that included only the date of discharge from the hospital and ZIP code. We use these data for research purposes according to the Authorization for Research form you signed when you were in. We aren't certain whether your date of discharge and ZIP code were included in these statistics, but we were obliged to re-identify everyone who may have been on the USB drive to notify them. If your data was on this USB drive, we estimate there is less than a 1% chance your data could be re-identified by anyone other than researchers at Carnegie Mellon University who may have found the USB drive, which is a small device that is inserted into a computer that you can save data on. We apologize for this oversight."

Hopefully, HHS will heed the lesson of the past six years of data-breach notification and not contribute to the overnotification and needless worry of the American public.

I recently caught up with Judith Beach, chief privacy officer of Durham, N.C.-based Quintiles. When she was chief privacy officer at Synergy -- a former subsidiary of Quintiles -- the company commissioned the first-ever statistical de-identification of a HIPAA-covered data set. Since then, Beach has become a national expert on de-identification. What's her take on the situation?

"We all want to know if there has been a serious risk to our personal-health data," she told me. "But if we get notified for all incidents, including those of very low risk, we will become inured to the numerous notifications we are bound to receive."

So what should be done? Three things:

The path HHS takes will be closely watched by other jurisdictions that have not yet defined their own de-identification parameters. If we arrive in a world where personal data is never truly de-identified, we're going to need a risk-based approach to guide our way forward.

Jay Cline is a former chief privacy officer at a Fortune 500 company and is now president of a privacy consultancy.