De-Identification & Coding

Definitions

Confidentiality of Human Subject Information (HSI) can be achieved by receiving a de-identified dataset or by creating a key sheet and coding your data. In addition to the overview below, a full breakdown of the differences between identifiable and other types of data can be found in Appendix J (Identifiable/Coded/De-Identified/Anonymous Data and/or Specimens) of Harvard's Investigator Manual.

De-identified Data: Information that does not identify an individual. Coded data is not de-identified. The IRB recommends that researchers consult the OHRP/OCR Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule when seeking recommendations on methods for de-identification of research data. Standards can be found at: https://nvlpubs.nist.gov/nistpubs/ir/2015/nist.ir.8053.pdf

Coded Data: Coded data and samples have their identifiers replaced with a random code, but the investigator can still link them to their source through a key code. Additionally, Harvard affiliates can take the Principles of Research Data Confidentiality course within the Harvard Training Portal for an overview of data confidentiality protection, the risk of re-identification of data, and data management strategies for minimizing the risk of inadvertent disclosure.

Additional Best Practices & Tips

Watch or read "How to De-identify Your Data: Balancing statistical accuracy and subject privacy in large social-science data sets" by Olivia Angiuli, Joe Blitzstein, and Jim Waldo.

Consent Forms

Ideally, the consent form is collected and stored on paper. Keep the paper form under lock and key, away from collected data and samples, and shred it when it is no longer needed. If the consent form is in an electronic format, keep it separate from collected data and samples, and permanently delete the file when it is no longer needed.

Anonymization & Key Sheets

Personally identifiable information (PII) such as name, phone number, etc. should not be used outside of the key sheet. Phone numbers can potentially identify a research participant and should be stored separately from the dataset. As more individuals use their phone numbers for access authorization and recovery, the likelihood that a phone number identifies an individual is increasing. Best practice is to store names and phone numbers on the key sheet, leaving your data coded.
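As an illustration of the key sheet practice described above, the following is a minimal Python sketch, not a prescribed procedure. The function name split_into_key_sheet, the field names "name" and "phone", and the 8-character code length are assumptions made for this example. It assigns each record a random code, moves the PII fields onto the key sheet, and leaves only the code on the research data.

```python
import secrets


def split_into_key_sheet(records, pii_fields=("name", "phone")):
    """Return (key_sheet, coded_data); PII is kept only in key_sheet.

    The field names and code length are illustrative assumptions.
    """
    key_sheet = []
    coded_data = []
    used_codes = set()
    for record in records:
        # Draw a random code that carries no information about the participant.
        code = secrets.token_hex(4)
        while code in used_codes:  # regenerate on the rare collision
            code = secrets.token_hex(4)
        used_codes.add(code)

        # Key sheet row: the code plus the identifying fields.
        key_row = {"code": code}
        for field in pii_fields:
            key_row[field] = record.get(field)
        key_sheet.append(key_row)

        # Coded data row: the code plus everything that is not PII.
        coded_row = {"code": code}
        for field, value in record.items():
            if field not in pii_fields:
                coded_row[field] = value
        coded_data.append(coded_row)
    return key_sheet, coded_data
```

The two returned lists would then be saved to separate locations, with the key sheet stored apart from the coded data. Note that removing these two fields alone does not make a dataset de-identified; other fields may still identify a participant.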

When web-based survey tools are used to collect coded research responses, it is important that the results be anonymized and that IP addresses, which could identify participants, be removed. Please also review HGSE's Tool Classification Matrix to help choose an appropriate survey tool solution (e.g., Harvard Qualtrics, HSPH RedCap) according to your type of data to ensure compliance with University policies.
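If a survey export already contains IP addresses, one way to remove them before analysis is sketched below in Python. This is an example under stated assumptions rather than an official procedure: the column names IPAddress, LocationLatitude, and LocationLongitude are assumed for illustration and should be checked against the actual export from your survey tool, and the function name drop_identifying_columns is hypothetical.

```python
import csv
import io

# Assumed column names; confirm them against your own survey export.
IDENTIFYING_COLUMNS = {"IPAddress", "LocationLatitude", "LocationLongitude"}


def drop_identifying_columns(csv_text, drop=IDENTIFYING_COLUMNS):
    """Return the CSV text with the listed columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name not in drop]

    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, passing the text "code,IPAddress,Q1\nab12,203.0.113.7,yes\n" returns "code,Q1\nab12,yes\n". The original export that still contains IP addresses would need to be stored or deleted according to the same rules as other identifiable data.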