Networking: A Case for Data Masking

By Jeff Kabachinski, MS-T, BS-ETE, MCNE

A health care provider can generate large quantities of data every day. It comes not only from the business of providing health care, but also from other health care providers, insurance companies, and various internal functions and departments. Not only is this data spread across a range of systems and locations, it must also comply with continually stiffening regulatory standards. We have all heard of HIPAA and HL7, but there are others, like the National Council for Prescription Drug Programs (NCPDP) transaction standards. Compliance must be enforced, but health care organizations also want to make good analytical use of the torrents of protected health information (PHI) being collected. Effective data management is a must, but so is the ability to analyze the data for a myriad of reasons.

The Privacy Rule at 164.502(b)(1) of the HIPAA regulation states, “When using or disclosing protected health information or when requesting protected health information from another covered entity, a covered entity must make reasonable efforts to limit protected health information to the minimum necessary to accomplish the intended purpose of the use, disclosure, or request.”

For every production-level database, an organization makes an average of five copies for use in nonproduction environments. In database (DB) parlance, a production DB is the live, active, golden version of the data. It is the one everyone uses, it is logged into all day long, and it is constantly updated with new or revised PHI. The copies, known as nonproduction DBs, are made for testing, development, quality assurance, data migration testing or staging, and data analytics. There are many reasons to want a copy of real data, but you do not want to mess around in the live environment with the “pure” data. In the health care facility, we are talking mainly about the electronic health record (EHR) DB.

Follow the Rules

Just about every DB has rules and policies for use and security. In the credit card industry, it is the Payment Card Industry Data Security Standard, or PCI DSS. For financial institutions, the Gramm-Leach-Bliley Act safeguards sensitive data. The Sarbanes–Oxley Act has been around for 10 years now to protect corporate information and require truthfulness in financial disclosures. For health care, as we well know, it is HIPAA, or the Health Insurance Portability and Accountability Act of 1996. HIPAA was updated and given some extra teeth with the adoption of the American Recovery and Reinvestment Act of 2009, which expanded HIPAA to impose new privacy and security requirements. Beyond the expanded scope of privacy and security, the additions include new penalty provisions, the opening of enforcement to state attorneys general, and newly established federal data breach notification regulations. Heavy stuff indeed. Consider that CVS Caremark was fined $2.25 million for violating HIPAA privacy regulations. Or consider the Alaska Department of Health and Social Services, which was penalized for the loss of a hard drive containing PHI. The fine was upped to $1.7 million when regulators found that the department had failed to conduct a risk analysis as required, had not sufficiently trained its staff on regulations and security procedures, and had not applied encryption to its data and media. HIPAA has been tougher since 2009, and these are just two examples.

Data Masking

Nonproduction data is more vulnerable because it probably does not have the same security applied as the production DB. For example, all nonproduction DB users are most likely given the highest clearance and maximum access. It is the same data, but it is now open to hacking, mayhem, or data theft. Two ways to de-identify the data are data masking and safe harboring, the latter being one of the de-identification methods spelled out in HIPAA’s Privacy Rule.

Data masking de-identifies data by replacing or removing sensitive information. It is a necessary security measure, alongside encryption and access control. If you are using the data for research, however, you will want to replace the PHI with realistic values of the same kind rather than simply remove it.
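To make "replacing or removing" concrete, here is a minimal sketch in Python. The field names, the replacement name pools, and the choice of which field to drop are all assumptions made for illustration; they are not taken from any particular masking product.

```python
import random

# Hypothetical replacement pools; a real tool would draw from far larger lists.
FAKE_FIRST = ["Alex", "Jordan", "Morgan", "Casey", "Riley"]
FAKE_LAST = ["Smith", "Lee", "Garcia", "Brown", "Patel"]

def mask_record(record, rng):
    """Return a copy of the record with sensitive fields replaced or removed."""
    masked = dict(record)
    # Replace: swap the real name for a fictitious one.
    masked["name"] = f"{rng.choice(FAKE_FIRST)} {rng.choice(FAKE_LAST)}"
    # Replace: keep the SSN's shape (ddd-dd-dddd) but randomize every digit.
    masked["ssn"] = "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch for ch in record["ssn"]
    )
    # Remove: drop a field the downstream work does not need.
    masked.pop("email", None)
    return masked

rng = random.Random(2013)  # fixed seed only so the example is repeatable
patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "email": "jane@example.com",
    "diagnosis": "asthma",
}
masked_patient = mask_record(patient, rng)
print(masked_patient)
```

The clinical field (diagnosis) passes through untouched, while the identifiers are either swapped for look-alike values or dropped. Keeping the SSN's format means test software that validates the field still works against the masked copy.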

Safe harboring simply removes identifiers such as names, social security numbers, e-mail addresses, and residential addresses. See the sidebar for the complete list of the 18 data types.
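As a rough sketch of what safe-harbor-style removal can look like in practice, the Python below strips a handful of identifier fields from a record. The column names are hypothetical and cover only a few of the 18 data types. The handling of dates (keeping the year only) and of ages over 89 (grouping them into one category) reflects the safe harbor provisions as commonly summarized, so check it against the regulation before relying on it.

```python
# Hypothetical column names standing in for a few of the 18 identifier types.
DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "email",
    "ssn", "medical_record_number", "birth_date",
}

# Hypothetical date columns, assumed to hold ISO "YYYY-MM-DD" strings.
DATE_FIELDS = ("admission_date", "discharge_date")

def safe_harbor_strip(record):
    """Return a copy of the record with listed identifiers removed."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Ages over 89 are grouped into a single "90+" category.
    age = cleaned.get("age")
    if isinstance(age, int) and age > 89:
        cleaned["age"] = "90+"
    # Keep only the year of each remaining patient-related date.
    for field in DATE_FIELDS:
        if field in cleaned:
            cleaned[field] = cleaned[field][:4]
    return cleaned

record = {
    "name": "John Roe",
    "ssn": "987-65-4321",
    "birth_date": "1920-02-11",
    "age": 93,
    "admission_date": "2013-03-04",
    "diagnosis": "pneumonia",
}
cleaned = safe_harbor_strip(record)
print(cleaned)
```

Notice how much analytic detail disappears: exact dates and exact ages are gone. That loss is the price of the simpler approach.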

The recurring theme is the trade-off between removing personally identifiable information (PII) and PHI, and keeping the data meaningful enough to produce valid research results when needed.

“To analyze the data, you need to provide information,” says Xiaobai Li, professor at the University of Massachusetts in the thick of data masking research. “But at the same time, you want to protect the individuals—like patients or even doctors.” You can always use data encryption, but, “If you encrypt the data, you cannot do any statistical analysis,” Li says. In the end, the intention is to find a way for research to succeed while still maintaining appropriate privacy protections for individuals. According to Li, that balance becomes possible with data masking.

Gartner Inc, Stamford, Conn, defines data masking as, “A set of techniques and technologies aimed at preventing the abuse of sensitive data by hiding it from users.” OK, but that is a little vague. I think a better summary comes from Forrester Research Inc, Cambridge, Mass: “The process of concealing private data in nonproduction environments such that application developers, testers, privileged users, and outsourcing vendors do not get exposed to such data.”

Researchers must decide whether the masked data still provides enough realistic information to run tests and experiments reliably. They need to be confident in their results when those results drive health care action, such as detecting the spread of disease. Since data usefulness is critical, you must mask data in such a way as to maintain data integrity.
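One way to picture "masking while maintaining integrity" is date shifting: every date for a given patient is moved by the same random offset, so the real dates are hidden but the intervals between them survive. The sketch below is an illustration only. The field names and the 180-day window are assumptions, and date shifting is a masking technique, not the safe harbor method, which removes dates instead.

```python
import random
from datetime import date, timedelta

def shift_dates(visits, rng, max_days=180):
    """Shift every date for one patient by the same random offset."""
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [
        {**v, "admitted": v["admitted"] + offset, "discharged": v["discharged"] + offset}
        for v in visits
    ]

def length_of_stay(visit):
    """Days between admission and discharge."""
    return (visit["discharged"] - visit["admitted"]).days

rng = random.Random(7)  # fixed seed only so the example is repeatable
visits = [
    {"admitted": date(2013, 1, 10), "discharged": date(2013, 1, 14)},
    {"admitted": date(2013, 3, 2), "discharged": date(2013, 3, 3)},
]
shifted = shift_dates(visits, rng)

original_stays = [length_of_stay(v) for v in visits]
masked_stays = [length_of_stay(v) for v in shifted]
print(original_stays, masked_stays)
```

Comparing a statistic such as length of stay before and after masking is a simple check a researcher can run to decide whether the masked copy is still fit for purpose.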

Confidential data that has been de-identified and masked is safe to use for application development, testing, and training. PII is transformed via algorithms to produce fictional but contextually accurate data, which is substituted for the original production data. Masked data provides an effective way to protect privacy and support compliance initiatives while supplying meaningful data for analysis and development.
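The phrase "transformed via algorithms" covers many techniques. One possibility, sketched here in Python, is keyed, deterministic pseudonymization: the same real value always maps to the same fictional value, so records for one patient still line up across tables in the masked copy. The key, the name pool, and the `MRN` output format are all hypothetical, and this is not a description of how any specific vendor's tool works.

```python
import hashlib
import hmac

# Hypothetical secret; it should be kept out of the nonproduction environment.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical pool; a real tool would use a much larger list so that
# far fewer patients end up sharing one fictional name.
FAKE_NAMES = ["Alex Smith", "Jordan Lee", "Morgan Garcia", "Casey Brown", "Riley Patel"]

def _digest(value):
    """Keyed hash of the real value; not reversible without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()

def pseudonym_name(real_name):
    """Map a real name to a fictional one, consistently."""
    index = int.from_bytes(_digest(real_name)[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]

def pseudonym_mrn(real_mrn):
    """Map a real record number to a fictional one with a fixed format."""
    number = int.from_bytes(_digest(real_mrn)[:8], "big") % 10**8
    return f"MRN{number:08d}"

# The same input yields the same output, so joins across masked tables still work.
print(pseudonym_name("Jane Doe"), pseudonym_mrn("A-100234"))
print(pseudonym_name("Jane Doe"), pseudonym_mrn("A-100234"))
```

The design choice here is consistency over randomness: a developer can still follow one fictional patient through admissions, billing, and lab tables without ever seeing the real identity.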

Safeguarding Personal Health Information

The safe harbor method of de-identification to guard protected health information involves removing the 18 data types listed here:

• Names

• Street address, city, county, and zip code

• All dates related to the patient, including birth, admission, discharge, and death, as well as all ages over 89 and all dates (including year) indicative of such age

• Telephone numbers

• Fax numbers

• E-mail addresses

• Social security numbers

• Medical record numbers

• Health plan beneficiary numbers

• Account numbers

• Certificate/license numbers

• Vehicle identifiers and serial numbers

• Device identifiers and serial numbers

• URLs

• IP addresses

• Biometric identifiers, such as fingerprints and voiceprints

• Full-face photographic images

• All other unique identifying numbers, characteristics, or codes

That should cover it!

24×7 Networking April 2013

Jeff Kabachinski, MS-T, BS-ETE, MCNE, has more than 20 years of experience as an organizational development and training professional. He is the director of technical development for Aramark Healthcare Technologies in Charlotte, NC. For more information, contact [email protected].