Quick Python for ICD-10 XML Parsing

ICD-10 coding is a hot topic in medical data circles this year.  The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit.  ICD, which stands for “International Classification for Diseases” has been around since 1900… that’s right, 113 years of standard medical coding and we still have a mess of healthcare data.  Ugh.  But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.

ICD-10 is the facelift, and it’s a pretty large overhaul.  Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000.  Many U.S. laws (mostly those that are touched by HIPAA) are requiring adherance to ICD-10 by October, 2014,  spawning a flurry of headless-chickens, and a rich field for consulting and the spending of lots of money.

Enter my job.  I’m trying to graft the “official” ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow.  Naturally, I go and download the official data from here:  (Broken link, see update) and set of in SSIS to get things moving, because that’s what we use here.

UPDATE (2016-08):  The cms.gov links change annually, and the old ones die apparently… the latest is here: https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html but I’m not going to keep udpating it.  Search Google for cms.gov ICD10.  Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it… they wanted me to link to this site which may be better updated:  http://www.nuemd.com/icd-10/codes.

SSIS is plagued with issues.  I really must say that I don’t like it.  Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL,  SSIS is low on my list.  In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can’t handle XML with mixed content in the XMLSource widget.  Once I tweaked the .xsd (which I shouldn’t have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with.  Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool.  This is a monolithic hammer to handle a very simple problem.

Look, all I really want is a list of codes and descriptions from the XML document.  There is a LOT of other useful metadata in there, but for now, it can wait.  Here’s a simple (not robust) solution in a handful of python lines:


import xml.etree.ElementTree as ET
import csv

csvwriter = csv.writer(open('diagnostics.csv', 'wb'))

tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')
root = tree.getroot()

for diag in root.iter('diag'):           # Loop through every diagnostic tree
   name = diag.find('name').text.encode('utf8')  # Extract the diag code
   desc = diag.find('desc').text.encode('utf8')  # Extract the description
   csvwriter.writerow((name,desc))       # write to a .csv file

And there we have a .csv which is much easier to load with whatever tool we want. This works well for the other XML files as well such as the DIndex and EIndex files, except for some reason they use different, optional, tags for their hierarchies… “mainTerm”s are the parent diagnostic codes and “term”s are the optional children. I’ll leave that as an exercise, though, it’s not too bad. 😉

One Comment on “Quick Python for ICD-10 XML Parsing

Leave a Reply

Your email address will not be published. Required fields are marked *