* all geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes (three-digit zip codes may be used if the geographical area contains more than 20 000 people or, for areas with fewer, the three-digit zip code is changed to 000);
* all elements of date (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death, and all ages over 89 and all elements of dates indicative of such age, unless aggregated into a single category of age 90 or older;
* telephone numbers;
* fax numbers;
* e-mail addresses;
* social security numbers;
* medical record numbers;
* health plan beneficiary numbers;
* account numbers;
* certificate and license numbers;
* vehicle identifiers and serial numbers, including license plate numbers;
* device identifiers and serial numbers;
* web universal resource locators (''URLs'');
* Internet Protocol (''IP'') address numbers;
* biometric identifiers, including finger and voice prints;
* full face photographic images and any comparable images; and
* any other unique identifying number, characteristic, or code.
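Most fields in the list are simply removed, but three of them (zip code, dates, and age) carry generalization rules. The following is a minimal sketch of those three rules, not a compliance tool. The function names and the `ZIP3_POPULATION` lookup are hypothetical, and the population figures are invented for illustration; a real implementation would draw them from census data. The sketch also does not handle the further requirement that date elements indicating an age over 89 (such as a birth year) be suppressed.

```python
from datetime import date

# Hypothetical lookup: population of each three-digit zip area.
# These figures are made up for illustration only.
ZIP3_POPULATION = {"021": 1_200_000, "036": 15_000}

POPULATION_THRESHOLD = 20_000  # "more than 20 000 people"
MAX_REPORTABLE_AGE = 89        # ages over 89 are pooled into one category


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or "000" for sparsely populated areas."""
    zip3 = zip_code[:3]
    # Unknown areas default to population 0, which errs toward "000".
    if ZIP3_POPULATION.get(zip3, 0) > POPULATION_THRESHOLD:
        return zip3
    return "000"


def generalize_date(d: date) -> int:
    """Drop every element of the date except the year."""
    return d.year


def generalize_age(age: int) -> str:
    """Pool all ages over 89 into a single "90+" category."""
    return str(age) if age <= MAX_REPORTABLE_AGE else "90+"
```

Under these assumptions, `generalize_zip("03601")` falls back to `"000"` because the hypothetical area `036` holds fewer than 20 000 people, while an admission date survives only as its year.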
Some of the data fields in the list, such as social security number, e-mail address, telephone number and the like, offer a fairly ready way to find out who a data subject is.33 The other fields chosen for stripping appear to be a list of fields that a statistician would find to be useful for triangulating databases in order to zero in on identified cases. Removal of all of the fields listed in the regulation is the only ''safe harbor'' for any data to be outside the regulation's prohibitions on use or disclosure.
33 The irony, of course, is that within a set of health care or health benefits data, even the patient's name, address, and telephone number are not necessarily adequate to know that one is looking at the same individual in different records of health encounters. The same household may have many individuals named John Smith, Maria Hernandez, or Sally Wong. As a result, date of birth or social security number, or some other unique code that is known to be associated with a single individual over time, is almost always needed for health information systems to perform at an acceptable level of accuracy in identifying individuals.
The only alternative to the safe harbor is for a statistician to find that the ''risk is very small that the information could be used ... by an anticipated recipient to identify an individual who is a subject of the information''. 45 C.F.R. § 164.514(b)(1)(i). Under this ''statistical'' method, a database can be considered ''de-identified'' if:
[a] person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (ii) Documents the methods and results of the analysis that justify such determination.34
34 See id. at 82 818 (codified at 45 C.F.R. § 164.514(b)(1)).
If the regulation is taken at face value, then it leads to preposterous results. For example, a report of the frequencies stating the total number of admissions to each of ten hospitals on a given day of the year would be ''protected health information'', even if that were the only information transmitted about any given patient's case. As the rule is constructed, the inclusion of a patient-related date of any kind in a data set appears automatically to transform the data into protected health information. As a result, unless a statistician made the risk finding, transmission of such tabular data to anyone would be a technical violation of the regulation even if the number of daily admissions for each hospital were in the hundreds. Likewise, a table showing how many of the total number of inpatient admissions in a given county went to each of several hospitals would be a violation of the regulation, as would a table for a given hospital showing how many of its admissions in a year come from which zip code. ''County'' and ''zip code'' are in the list of fields that are automatically considered to be ''identifiers'' that must be removed in order for data to fit the de-identification ''safe harbor''. Therefore, unless each patient authorizes the disclosure or unless a statistician renders a risk opinion, the regulation makes disclosure of a table of frequencies that includes any of the suspect fields a disclosure of protected health information. As a result, data that meet the de-identification safe harbor are virtually useless for sound and informative epidemiologic or outcomes research.
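The face-value reading described above can be sketched as a simple field-presence check. This illustrates the argument rather than offering an authoritative test of the regulation. The column names and the `SAFE_HARBOR_FIELDS` set are hypothetical stand-ins for a subset of the listed identifiers.

```python
# Hypothetical column names standing in for some of the listed identifier fields.
SAFE_HARBOR_FIELDS = {
    "street_address", "city", "county", "precinct", "zip_code",
    "birth_date", "admission_date", "discharge_date", "death_date",
}


def flagged_columns(columns):
    """Return the columns that the list treats as identifiers."""
    return sorted(set(columns) & SAFE_HARBOR_FIELDS)


def fits_safe_harbor(columns):
    """Under a literal reading, any listed field disqualifies the table."""
    return not flagged_columns(columns)


# Two purely aggregate tables of the kind discussed in the text.
daily_admissions = ["hospital", "admission_date", "admission_count"]
county_market_share = ["county", "hospital", "admission_count"]
```

Note that the check never inspects the counts themselves. Whether a cell holds three admissions or three hundred, the presence of `admission_date` or `county` alone takes the table outside the safe harbor under this reading.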