NHI Check Digit Scheme Collision Stats

In Uncategorised by Jayden MacRae

Summary of performance

We have assessed the failure rate of each check-digit scheme for both single character substitutions and two-character transpositions.

The table below summarises the performance of five schemes. The Classic scheme is the one currently used by the existing NHI. The Extended A scheme is the currently proposed scheme for the extended NHI range, using a modulus of 24 but with the existing sequence of factors. Extended B and C are alternative schemes which use different factor sequences to avoid collisions. The Extended D scheme uses a modulus of 23. In all schemes, the minuend of the subtraction operation is the same as the modulus value used.

The failure rate for each scheme is presented for both substitution and transposition (of two adjacent characters of the same type), and the relative failure rate is shown using the Classic scheme as the reference. A relative failure rate greater than one is the number of times more likely a collision is than under the Classic scheme. A value lower than one is a decrease in the likelihood of a collision.
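As a worked example of the relative failure rate, the sketch below divides a scheme's failure rate by the Classic rate. The function name is illustrative only, and the inputs are the rounded percentages reported in the table, so results may differ slightly from relative values computed on unrounded rates.

```python
def relative_failure_rate(scheme_rate, classic_rate):
    """Failure rate of a scheme expressed as a multiple of the Classic rate."""
    if classic_rate <= 0:
        raise ValueError("the reference rate must be positive")
    return scheme_rate / classic_rate


# Extended A substitution (7.0%) against Classic substitution (2.9%)
print(round(relative_failure_rate(7.0, 2.9), 2))  # 2.41
```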

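| Scheme | Modulus | Minuend | Sequence | Substitution failure rate (%) | Substitution relative | Transposition failure rate (%) | Transposition relative |
|---|---|---|---|---|---|---|---|
| Classic | 11 | 11 | 7,6,5,4,3,2 | 2.9 | 1.00 | 3.2 | 1.00 |
| Extended A | 24 | 24 | 7,6,5,4,3,2 | 7.0 | 2.41 | 0.0 | 0.00 |
| Extended B | 24 | 24 | 17,13,11,10,7,5 | 0.0 | 0.00 | 8.6 | 2.69 |
| Extended C | 24 | 24 | 14,13,11,10,7,5 | 0.7 | 0.24 | 6.8 | 2.13 |
| Extended D | 23 | 23 | 7,6,5,4,3,2 | 0.2 | 0.07 | 0.3 | 0.09 |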

The Classic NHI check-digit algorithm fails to detect a single character substitution 2.9% of the time and an adjacent character transposition 3.2% of the time.

The Extended A scheme, using modulus 24 and a sequence of 7,6,5,4,3 and 2, has more than double the chance of a substitution going undetected, but no collisions were detected for adjacent two-character transpositions.

Using the Extended B scheme with modulus 24 and sequence of 17,13,11,10,7 and 5 has no collisions resulting from a single character substitution, but over two and a half times the transposition collision rate of the Classic scheme.

The Extended C scheme, using modulus 24 with a sequence of 14,13,11,10,7 and 5, has a 0.7% chance of a character substitution not being detected, but over twice the transposition collision rate of the Classic scheme.

The Extended D scheme using modulus 23 with the original sequence of 7,6,5,4,3 and 2 has remarkably low failure rates for both substitution and transposition at 0.2% and 0.3% respectively. This represents a significant improvement over the Classic scheme in both situations.

The failure rates for all schemes assume that all substitutions and transpositions are equally likely to occur. In real-world situations this is unlikely to be the case. Factors such as the similarity of some written letters, or the proximity of letters to each other on QWERTY keyboards, are likely to make certain substitutions or transpositions more or less likely than others. Assessing this is beyond the scope of the current analysis.

Failure rates were determined by taking a uniform random sample of 5,000 NHIs from the available range of NHIs within each scheme and applying a substitution or transposition at a random position in the NHI. Transpositions were only calculated for valid formats (e.g. a transposition of a character to a numeric field was not factored into the calculation as this will have a 100% failure rate). The failure rates presented here are therefore approximations and do not represent a calculation over the entire NHI range for all permutations of substitution or transposition.
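To give a rough sense of how approximate a rate estimated from 5,000 samples is, the sketch below computes a standard error and an approximate 95% interval for a proportion. This uses the normal approximation and ignores any finite-population correction. Both are assumptions added here, not part of the analysis above, and the approximation is poor for rates very close to zero.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def approx_95_interval(p, n):
    """Approximate 95% interval using the normal approximation (an assumption)."""
    half_width = 1.96 * proportion_standard_error(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Classic substitution failure rate of 2.9% from a sample of 5,000
low, high = approx_95_interval(0.029, 5000)
print(f"{low:.4f} to {high:.4f}")
```

Under these assumptions, a reported rate of 2.9% is uncertain by roughly half a percentage point in either direction.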

Methods

Each scheme defines a modulus and secondary factors which vary the check-digit and collision frequency. The minuend in each scheme is the same as the modulus. The sum of products operation is modified to use the defined secondary factors. The modulo operation is modified to use the specified modulus.
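The calculation described above can be sketched as follows. The character values are an assumption not stated in this post: letters A to Z excluding I and O are taken as 1 to 24, and digits as their numeric value. How the resulting value is encoded as the final check character, and which results are treated as invalid, is deliberately left out of the sketch.

```python
# Assumed letter set: the alphabet without I and O, so A=1 ... Z=24.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"

# Modulus (also used as the minuend) and factor sequence for each scheme.
SCHEMES = {
    "Classic": (11, (7, 6, 5, 4, 3, 2)),
    "Extended A": (24, (7, 6, 5, 4, 3, 2)),
    "Extended B": (24, (17, 13, 11, 10, 7, 5)),
    "Extended C": (24, (14, 13, 11, 10, 7, 5)),
    "Extended D": (23, (7, 6, 5, 4, 3, 2)),
}


def char_value(ch):
    """Numeric value of one character under the assumed mapping."""
    if ch.isdigit():
        return int(ch)
    return ALPHABET.index(ch) + 1


def check_value(body, scheme):
    """Minuend minus (sum of products modulo the modulus) for the first six characters."""
    modulus, factors = SCHEMES[scheme]
    total = sum(char_value(c) * f for c, f in zip(body, factors))
    return modulus - (total % modulus)


print(check_value("ZZZ001", "Classic"))  # 6
```

For the body ZZZ001 the sum of products is 434, giving 11 - (434 mod 11) = 6 under the Classic scheme.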

Collision frequencies were calculated across a sample of 5,000 random NHI selections without replacement. For each candidate NHI, a check digit was calculated for both a random substitution and a random transposition.
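A minimal sketch of the sampling and counting step is shown below. It assumes that every body matching a format pattern is available for sampling (for example "AAANNN", where A is a letter slot and N a digit slot), and that the caller supplies a `mutate` function and a `check` function. The names, the patterns and the letter set are illustrative assumptions, not the code used for this analysis.

```python
import random

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed letter set: I and O excluded
DIGITS = "0123456789"


def sample_bodies(pattern, n, rng):
    """Draw n distinct bodies; 'A' marks a letter slot, any other mark a digit slot."""
    space = 1
    for slot in pattern:
        space *= len(ALPHABET) if slot == "A" else len(DIGITS)
    if n > space:
        raise ValueError("sample size exceeds the number of possible bodies")
    seen = set()
    while len(seen) < n:  # duplicates are discarded, so sampling is without replacement
        seen.add("".join(rng.choice(ALPHABET if slot == "A" else DIGITS)
                         for slot in pattern))
    return sorted(seen)  # sorted only to make the result reproducible


def failure_rate(bodies, mutate, check):
    """Share of bodies whose mutated form produces the same check value."""
    collisions = sum(1 for body in bodies if check(mutate(body)) == check(body))
    return collisions / len(bodies)


rng = random.Random(2024)
bodies = sample_bodies("AAANNN", 5000, rng)
print(len(bodies), bodies[0])
```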

Each substitution was made by randomly selecting the position at which the substitution would occur, and then randomly assigning a different character to substitute. Substitutions were only made for characters at positions 1-6. Substituted characters were selected from a pool of only valid characters (e.g. an alpha was only ever substituted with another alpha, and a number was only ever substituted with another number). It was not possible to substitute an existing character for the same character. The new check digit was evaluated and compared to the original check digit. If both check digits were equal, the substitution was determined to be a collision.
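The substitution step above might look like the following sketch. The letter set (A to Z without I and O, valued 1 to 24) and the function names are assumptions made for illustration.

```python
import random

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed letter set: I and O excluded
DIGITS = "0123456789"


def check_value(body, factors, modulus):
    """Modulus minus (sum of products mod modulus), with letters valued 1..24."""
    values = [int(c) if c.isdigit() else ALPHABET.index(c) + 1 for c in body]
    return modulus - (sum(v * f for v, f in zip(values, factors)) % modulus)


def substitute(body, rng):
    """Replace one of positions 1-6 with a different character of the same type."""
    pos = rng.randrange(6)
    pool = DIGITS if body[pos].isdigit() else ALPHABET
    replacement = rng.choice([c for c in pool if c != body[pos]])
    return body[:pos] + replacement + body[pos + 1:]


def is_collision(original, mutated, factors, modulus):
    """True when the mutated body yields the same check value as the original."""
    return check_value(mutated, factors, modulus) == check_value(original, factors, modulus)


# Changing the sixth character from A (1) to N (13) shifts the sum by 2 * 12 = 24,
# which is invisible under modulus 24 but not under modulus 23.
print(is_collision("AAA00A", "AAA00N", (7, 6, 5, 4, 3, 2), 24))  # True
print(is_collision("AAA00A", "AAA00N", (7, 6, 5, 4, 3, 2), 23))  # False
```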

Each transposition was made by first randomly selecting a position to transpose. The second position, with which the first character would be transposed, was selected from the positions adjacent to the first selected character that hold the same character type (alpha or numeric). For example, selecting position 2 as the first transposition character allows the second character for transposition to be at position 1 or 3. In this calculation, the check digit (position 7) was also allowed to be transposed with the sixth character and vice versa.
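A sketch of the adjacent, same-type transposition is given below. It covers positions 1-6 only: swapping with the check character is omitted because the encoding of the check value as a character is not specified here. The letter values, the function names, and the choice not to count a swap of two identical characters as a collision are all assumptions.

```python
import random

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed letter set: I and O excluded


def check_value(body, factors, modulus):
    """Modulus minus (sum of products mod modulus), with letters valued 1..24."""
    values = [int(c) if c.isdigit() else ALPHABET.index(c) + 1 for c in body]
    return modulus - (sum(v * f for v, f in zip(values, factors)) % modulus)


def same_type_neighbours(body, pos):
    """Zero-based indices next to pos that hold the same type (alpha or numeric)."""
    return [other for other in (pos - 1, pos + 1)
            if 0 <= other < len(body)
            and body[other].isdigit() == body[pos].isdigit()]


def transpose(body, i, j):
    """Return body with the characters at indices i and j swapped."""
    chars = list(body)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def random_transposition(body, rng):
    """Pick a position that has a same-type neighbour, then swap with one of them."""
    eligible = [p for p in range(len(body)) if same_type_neighbours(body, p)]
    pos = rng.choice(eligible)
    return transpose(body, pos, rng.choice(same_type_neighbours(body, pos)))


def is_undetected(original, mutated, factors, modulus):
    """True when the body changed but the check value did not."""
    return (mutated != original
            and check_value(mutated, factors, modulus)
            == check_value(original, factors, modulus))


# A (1) and M (12) differ by 11, so swapping them is invisible under modulus 11.
print(is_undetected("AMA000", "MAA000", (7, 6, 5, 4, 3, 2), 11))  # True
print(is_undetected("AMA000", "MAA000", (7, 6, 5, 4, 3, 2), 24))  # False
```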

Collision matrices were calculated using the same sampled NHIs, but with an exhaustive substitution scheme (all characters were substituted within each NHI) to calculate and check the affected spaces.
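The sketch below is related to, but not the same as, the sampled collision matrices described above. A substitution changes the sum of products only through the factor at the affected position, so the share of undetectable substitutions at each position can be enumerated over every pair of distinct same-type characters. The letter values (1 to 24), digit values (0 to 9) and the format patterns are assumptions.

```python
from itertools import permutations

LETTER_VALUES = range(1, 25)  # assumed values for A-Z without I and O
DIGIT_VALUES = range(0, 10)


def position_collision_rate(factor, modulus, values):
    """Fraction of ordered pairs of distinct values whose swap-in goes undetected."""
    pairs = list(permutations(values, 2))
    # The check value is unchanged exactly when factor * (a - b) is a multiple of the modulus.
    undetected = sum(1 for a, b in pairs if (factor * (a - b)) % modulus == 0)
    return undetected / len(pairs)


def collision_profile(pattern, factors, modulus):
    """Per-position rates; 'A' marks a letter slot, any other mark a digit slot."""
    return [position_collision_rate(f, modulus,
                                    LETTER_VALUES if slot == "A" else DIGIT_VALUES)
            for slot, f in zip(pattern, factors)]


# Hypothetical extended format with letters at positions 1-3 and 6.
for name, modulus, factors in [("Extended A", 24, (7, 6, 5, 4, 3, 2)),
                               ("Extended B", 24, (17, 13, 11, 10, 7, 5))]:
    profile = collision_profile("AAANNA", factors, modulus)
    print(name, [round(rate, 3) for rate in profile])
```

Under these assumptions, collisions appear only at positions whose factor shares a divisor with the modulus and where the character range is wide enough for the difference to reach a multiple of the modulus.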

Conclusion

The currently implemented Extended A scheme is significantly worse than the Classic scheme for substitutions, failing to detect 7% of them. It does, however, improve on the Classic scheme for transposition detection, with no detected collisions compared with 3.2%. Overall, however, it likely provides less protection than the Classic scheme. If possible, it should be changed to Extended B to provide better overall detection of NHI substitution and transposition.

Likely the best scheme for protecting against substitution and transposition errors in the NHI is the Extended D scheme, using a modulus of 23 with a factor sequence of 7,6,5,4,3 and 2. It does have a small rate of collision for both substitution and transposition; however, these rates are at least an order of magnitude lower than for the Classic scheme, at 0.2% and 0.3% for substitution and transposition detection respectively.

The Extended B scheme, using modulus 24 and a factor sequence of 17,13,11,10,7 and 5, provides perfect protection against single character substitution, with 0% failure. Detection of transpositions, however, is worse than the Classic scheme, failing at a rate of 8.6%.