Hierachical Analysis of SNOMED CT

In Technical by jayden.macrae

The Systematized Nomenclature of Medicine — Clinical Terms (SNOMED CT) is a vocabulary of clinical terminology that is being introduced into New Zealand for use in electronic health records. It will eventually replace the use of the Read coding system used in the primary care. It contains more than 300,000 concepts. These terms are arranged into relationships with one another, representing hierarchies of common concepts and themes.

SNOMED CT uses common but arbitrary numerical reference numbers to identify each term. This is in contrast to Read codes, which use a structured identifier which identifies the position within the classification in which the code exists. For more detailed information on SNOMED CT, see the SNOMED International website.

For example, in the Read code system, Diabetes Mellitus has a code of ‘C10..’, while the SNOMED CT Concept ID is 73211009. The ‘C’ indicates the main branch of the classification system in which the term belongs, being ‘Endocrine and Metabolic Disorders’. ‘Type I Diabetes Mellitus with Maturity Onset’ has a Read code of ‘C1089.’ and SNOMED CT Concept ID of 190372001. It is possible to understand the hierachy from the Read code, deducing that ‘Type I Diabetes Mellitus with Maturiy Onste’ is a more specific code related to ‘Diabetes Mellitus’ because of the commonality in the ‘C10’ component of each code. It is not possible to do this alone from the SNOMED CT Concept ID.

What SNOMED CT gives up in sentinel codes, it makes up for in the rich way in which codes are inter-related. Read codes have only linear relationships through the hierarchical tree and in the above example, it can only belong to ‘Other endocrine gland diseases (disorder)’ (C1…). In SNOMED CT, ‘Diabetes Mellitus’ belongs to both ‘Disorder of endocrine system’ (CID : 362969004) and ‘Disorder of glucose metabolism’ (CID : 126877002).

SNOMED CT stores these complex relationships in a separate data structure, referred to as ‘Concept Relationships’. These are simply a record of which SNOMED CT concepts are related to each other in parent-child pairs. This data includes other relationships other than parent-child pairs, for example synonyms.

A linear hierarchy with sentinel codes makes it simple to identify cohorts with a particular disorder or sub-disorder. The typical SQL pattern is to use a wildcard search for the code and any child codes, such as:

SELECT * FROM ReadCodes WHERE Code LIKE 'G10%'

This would find all records which have a Diabetes Mellitus code, or any code that is more specific than Diabetes Mellitus within the hierarchical tree (e.g. C1089).

This same search strategy does not work however for SNOMED CT. Because the SNOMED CT Concept IDs are arbitrary numeric values, they have no inherent encoding of relationship. This means that the closest similar strategy for selecting all terms related to Diabetes Mellitus would involve explicitly identifying each Concept ID and selecting for that. For Diabetes Mellitus, this is currently 132 codes.

SELECT
  *
FROM
  SNOMEDCTTerms
WHERE
  ConceptID In (73211009,609568004,609569007,105401000119101,199223000,703136005,46635009,44054006,111552007,716362006,123763000,8801005,609561005,49817004,24203005,237651005,111307005,127012008,33559001,199225007,199227004,76751001,199226008,199228009,11687002,609563008,609566000,237627000,703137001,703138006,31321000119102,23045005,28032008,426875007,71791000119104,199229001,609566000,190330002,703137001,190372001,314893005,190369008,314771006,190368000,313435000,359642000,81531005,237599002,199230006,237627000,9859006,190331003,703138006,314903002,190390000,314772004,314902007,190389009,313436004,313435000,313436004,709147009,426705001,5969009,59079001,51002006,42954008,75682002,276560009,5368009,408539000,75524006,445260006,237601000,427089005,70694009,237619009,237613005,237618001,33559001,446641003,427089005,91352004,2751001,10754881000119104,4783006,408540003,190447002,40801000119106,10753491000119101,75022004,46894009,40791000119105,237600004,190406000,190411003,190410002,190407009,190412005,199231005,57886004,609562003,237604008,609577006,609578001,609570008,609571007,609572000,609573005,609574004,609575003,609576002,609565001,237603002,609564002,609567009,609564002,609567009,609564002,609564002,609567009,609567009,314904008,1481000119100,1481000119100,413183008,106281000119103,237600004,190416008,609579009,609580007,609581006)

In order to determine these 132 codes, it is necessary to traverse the non-linear hierarchy of Concepts. This requires a recursive approach, to explore and exhaust all hierarchy branches. There are a couple of strategies that you can use to achieve this:

  • Self Joining Queries
  • Cursors
  • Recursive Common Table Expressions

The Recursive Common Table Expression approach is

create function [dbo].[ChildRelationships] ( @ParentConceptID bigint )
returns TABLE
-- Copyright 2017 by DataCraft Analytics Limited.
-- http://www.datacraft.nz/
-- This code by DataCraft Analytics Limited is made available under a:
--    Creative Commons Attribution-ShareAlike 4.0 License (International)
--    https://creativecommons.org/licenses/by-sa/4.0/
-- * You are free to use, share and adapt this code for any purpose.
-- * You must give appropriate credit, provide a link to the license
--   and indicate if changes were made.
-- * If you remix, transform or build upon the material, you must distribute
--   your contributions under the same license as the original.

as

return
with Concepts ( RelationshipID, ConceptID1, ConceptID2, Level )
as
(
	-- Anchor Expression --
	-- the purpose of this statement is to select the initial child objects of the parent --
	-- these are marked as Level 1 indicating their are direct descendants --
	select
		RelationshipID, 
		ConceptID1,		-- child concept
		ConceptID2,		-- parent concept
		Level = 1		
	from
		RelationshipsCore R
		inner join
		ConceptsCore C on R.ConceptID1 = C.ConceptID -- join to child concepts
	where
		R.RelationshipType = 116680003	-- 116680003 is the Concept ID for the 'Is A' relationship --
		and
		conceptid2 = @ParentConceptID
		
	
	union all

	-- Recursive Expression --
	-- the purpose of this statement is to match child concepts to previously found children --
	-- ths statement is called recursively until no more records are returned --
	select
		R.RelationshipID, 
		R.ConceptID1, 
		R.ConceptID2, 
		C.Level + 1			-- increment the level by 1 each time --
	from
		RelationshipsCore R 
		inner join 
		Concepts C on R.ConceptID2 = C.ConceptID1
	where
		R.RelationshipType = 116680003
)
select
	C.RelationshipID, 
	ChildLevel = Level, 
	ChildName = CC1.FullySpecifiedName, 
	ChildConceptID = C.ConceptID1, 
	RelationshipName = L.FullySpecifiedName, 
	ParentName = CC2.FullySpecifiedName, 
	ParentConceptID = C.ConceptID2, 
	ChildConceptStatus = CC1.ConceptStatus,
	ParentConceptStatus = CC2.ConceptStatus,
	ChildCTV3ID = CC1.CTV3ID,
	ParentCTV3ID = CC2.CTV3ID
from
	Concepts C
	left join
	ConceptsCore CC1 on C.ConceptID1 = CC1.ConceptID	-- join for child Concept detail --
	left join 
	ConceptsCore CC2 on C.ConceptID2 = CC2.ConceptID	-- join for parent Concept detail --
	left join
	RelationshipsCore R on C.RelationshipID = R.RelationshipID
	left join
	ConceptsCore L on R.RelationshipType = L.ConceptID	-- join for Relationship detail
go

This means that you can now find all children given a parent Concept ID by using the Table Value Function:

select * from dbo.ChildRelationships(73211009)

Because the intent is to only return a single result set, I have favoured using a Table Value Function over a Stored Procedure here. This allows us to use the TVF like any other table.

The Common Table Expression (CTE) allows me to create a recursive query using two sets of logic. The anchor expression selects the first set of results. The recursive expression selects the results there-after. The two parts of the CTE are joined using a union all. You can see above that the anchor expression returns the direct children of the parent expression. It limits relationships that to those that indicate child-parent associations.

The magic of the recursive CTE happens in the recursive expression. You should note how the recursive expression refers to the CTE itself, in this case Concepts. The CTE will continue to call itself until no results are returned.

For more information on Recursive CTEs see the Microsoft Technet article.

The introduction of SNOMED CT will possibly have some implications for how analysts must approach the querying of data. It will be necessary to have a complete reference set of SNOMED CT in order to generate appropriate static SQL where clauses for all terms and child-terms. This will be particularly relevant for those that undertake remote query execution.

Creative Commons License
Copyright 2017 DataCraft Analytics Limited for Code and Code Snippets on this page. Unless otherwise stated, code on this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.