Assessing an NHI Checksum Weakness

The National Health Index (NHI) is a collection of demographic data about New Zealanders who have come into contact with the health system. It is used to identify individuals as well as to make contact with them. Each person is assigned a unique identifying code, often called the NHI Number, the NHI Health User Code (HUC), or simply the NHI. For all intents and purposes, almost all New Zealanders will have one of these identifiers and an associated demographic record on the national system. The identifier is made up of 3 letters followed by 4 digits (AAA0000). To help catch data entry errors, the final digit is a check digit: an algorithm can flag an NHI identifier as miskeyed if the last digit does not match the value calculated from the preceding six characters.
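The check-digit calculation itself is not spelled out here, so the following is only a minimal sketch of the rule as it is commonly published for the AAA0000 format. It assumes that the letters I and O are excluded (so letters map to 1 to 24), that the six leading characters carry weights 7 down to 2, and that a weighted sum divisible by 11 means no valid check digit exists. The names `check_digit` and `is_valid_nhi` are illustrative, not part of any official library.

```python
# Sketch of the AAA0000 check-digit rule under the assumptions stated above.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed: I and O are not used
WEIGHTS = (7, 6, 5, 4, 3, 2)           # assumed positional weights


def char_value(ch):
    """Digits keep their face value; letters map to 1..24."""
    return int(ch) if ch.isdigit() else ALPHABET.index(ch) + 1


def check_digit(first_six):
    """Return the expected check digit, or None when no valid digit exists."""
    total = sum(w * char_value(c) for w, c in zip(WEIGHTS, first_six.upper()))
    remainder = total % 11
    if remainder == 0:
        return None  # assumed: such identifiers are never issued
    return (11 - remainder) % 10  # a result of 10 is written as 0


def is_valid_nhi(nhi):
    """Check the shape (3 letters, 4 digits) and then the check digit."""
    nhi = nhi.strip().upper()
    if len(nhi) != 7:
        return False
    if any(c not in ALPHABET for c in nhi[:3]) or not nhi[3:].isdigit():
        return False
    return check_digit(nhi[:6]) == int(nhi[6])
```

Under these assumptions, `is_valid_nhi("ZZZ0016")` holds while `is_valid_nhi("ZZZ0017")` does not, because the weighted sum for ZZZ001 leaves a remainder of 5 and therefore a check digit of 6.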

We built a simulation model to examine how the NHI check-digit algorithm operates and whether it has inherent weaknesses in detecting transcription and transposition errors during manual data entry. We found an inherent weakness: because the algorithm uses a modulus-11 operation, more than one character in any given position among the first 3 characters of an NHI identifier can produce the same check digit. In some cases this allows transcription errors to pass undetected by the check-digit algorithm. The algorithm performs well against transposition errors, mainly because each character is weighted differently depending on its position within the NHI identifier.
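The modulus-11 weakness can be illustrated with a short sketch. It assumes the commonly published rule for the AAA0000 format (letters I and O excluded, letters valued 1 to 24, weights 7 down to 2), none of which is stated in this summary; the function names are hypothetical. Because letter values run past 11, letters whose values differ by 11 or 22 contribute the same amount modulo 11 and so cannot be told apart by the check digit.

```python
from collections import defaultdict

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed: I and O are not used
WEIGHTS = (7, 6, 5, 4, 3, 2)           # assumed positional weights


def value(ch):
    """Digits keep their face value; letters map to 1..24."""
    return int(ch) if ch.isdigit() else ALPHABET.index(ch) + 1


def weighted_remainder(first_six):
    """Weighted sum of the six leading characters, reduced modulo 11."""
    return sum(w * value(c) for w, c in zip(WEIGHTS, first_six)) % 11


def collision_groups():
    """Group letters whose values are congruent modulo 11."""
    groups = defaultdict(list)
    for letter in ALPHABET:
        groups[value(letter) % 11].append(letter)
    return sorted("".join(g) for g in groups.values() if len(g) > 1)


print(collision_groups())
# Substituting A with M (values 1 and 12) leaves the remainder unchanged.
print(weighted_remainder("ABC123") == weighted_remainder("MBC123"))
# Swapping the adjacent characters A and B changes it, so the error is caught.
print(weighted_remainder("ABC123") == weighted_remainder("BAC123"))
```

Swapping two adjacent characters shifts the weighted sum by the difference of their values, because neighbouring weights differ by exactly 1. That is why positional weighting catches the A/B swap in the last line while the A/M substitution slips through.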

We found that the layout of the QWERTY keyboard likely exacerbates this weakness. Eight keys on the keyboard lie adjacent to at least one other key whose substitution results in an identical check digit. The same weakness was not found in the less common Dvorak keyboard layout. We estimate that the rate of NHI identity errors from manual data entry that remain undetected by the check-digit algorithm is in the order of 2 in 1000 NHI identifiers entered. Our experiment used a simulation model approach and makes many assumptions.
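The keyboard observation can be explored with a rough sketch. It is not the simulation model used in this work. It assumes the commonly published letter values (I and O excluded, letters valued 1 to 24) and a simple, hypothetical adjacency model for the three QWERTY letter rows: a key touches its left and right neighbours and the two keys diagonally above it on a staggered keyboard.

```python
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed: I and O are not used
QWERTY_ROWS = ("QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM")


def value(letter):
    """Letters map to 1..24 under the assumed alphabet."""
    return ALPHABET.index(letter) + 1


def adjacent_pairs(rows):
    """Pairs of touching keys under the assumed staggered-row model."""
    pairs = set()
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            if c + 1 < len(row):
                pairs.add(frozenset((key, row[c + 1])))  # right-hand neighbour
            if r > 0:
                above = rows[r - 1]
                for cc in (c, c + 1):                    # two keys diagonally above
                    if cc < len(above):
                        pairs.add(frozenset((key, above[cc])))
    return pairs


def colliding_adjacent_pairs(rows=QWERTY_ROWS):
    """Adjacent letter pairs whose values are congruent modulo 11."""
    result = []
    for pair in adjacent_pairs(rows):
        a, b = sorted(pair)
        if a in ALPHABET and b in ALPHABET and (value(a) - value(b)) % 11 == 0:
            result.append((a, b))
    return sorted(result)


pairs = colliding_adjacent_pairs()
print(pairs)
print(sorted({key for pair in pairs for key in pair}))
```

Under these assumptions the sketch flags the pairs B/N, E/R, G/T and H/U, which is eight keys in total and in line with the count reported above. A different adjacency model (for example one that includes keys two positions apart) could give a different set.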

Further research using empirical data would help to clarify the operational implications of our findings. We suggest that an alternative modulus calculation and a check digit with a wider range of values would likely protect against the weaknesses identified here. Our experiment highlights the prudence of cross-checking additional demographic fields to establish that NHI identifiers are assigned correctly in data sets where they are entered manually, particularly those that are ancillary to front-line services, where identity checking may not be done as a matter of course.

2022