{"id":441,"date":"2013-06-11T12:22:55","date_gmt":"2013-06-11T16:22:55","guid":{"rendered":"http:\/\/happytechnologist.com\/?p=340"},"modified":"2016-09-06T10:34:23","modified_gmt":"2016-09-06T14:34:23","slug":"quick-python-for-icd-10-xml-parsing","status":"publish","type":"post","link":"https:\/\/www.chiplynch.com\/wordpress\/?p=441","title":{"rendered":"Quick Python for ICD-10 XML Parsing"},"content":{"rendered":"<p>ICD-10 coding is a hot topic in medical data circles this year. \u00c2\u00a0The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit. \u00c2\u00a0ICD, which stands for &#8220;International Classification for Diseases&#8221; has been around since 1900&#8230; that&#8217;s right, 113 years of standard medical coding and we still have a mess of healthcare data. \u00c2\u00a0Ugh. \u00c2\u00a0But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.<\/p>\n<p>ICD-10 is the facelift, and it&#8217;s a pretty large overhaul. \u00c2\u00a0Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000. \u00c2\u00a0Many U.S. laws (mostly those that are touched by <a href=\"http:\/\/en.wikipedia.org\/wiki\/Health_Insurance_Portability_and_Accountability_Act\">HIPAA<\/a>) are requiring adherance to ICD-10 by October, 2014, \u00c2\u00a0spawning a flurry of headless-chickens, and a rich field for consulting and the spending of lots of money.<\/p>\n<p>Enter my job. \u00c2\u00a0I&#8217;m trying to graft the &#8220;official&#8221; ICD9\/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow. \u00c2\u00a0Naturally, I go and download the official data from here: \u00c2\u00a0<a href=\"http:\/\/www.cms.gov\/Medicare\/Coding\/ICD10\/2013-ICD-10-CM-and-GEMs.html\">(Broken link, see update)<\/a>\u00c2\u00a0and set of in <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms141026.aspx\">SSIS<\/a>\u00c2\u00a0to get things moving, because that&#8217;s what we use here.<\/p>\n<p><em>UPDATE (2016-08): \u00c2\u00a0The cms.gov links change annually, and the old ones die apparently&#8230; the latest is here:\u00c2\u00a0<a href=\"https:\/\/www.cms.gov\/Medicare\/Coding\/ICD10\/2017-ICD-10-CM-and-GEMs.html\">https:\/\/www.cms.gov\/Medicare\/Coding\/ICD10\/2017-ICD-10-CM-and-GEMs.html<\/a>\u00c2\u00a0but I&#8217;m not going to keep udpating it. \u00c2\u00a0Search Google for cms.gov ICD10. \u00c2\u00a0Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it&#8230; they wanted me to link to this site which may be better updated: \u00c2\u00a0<a href=\"http:\/\/www.nuemd.com\/icd-10\/codes\">http:\/\/www.nuemd.com\/icd-10\/codes<\/a>.<\/em><\/p>\n<p>SSIS is plagued with issues. \u00c2\u00a0I really must say that I don&#8217;t like it. \u00c2\u00a0Having worked with everything from <a href=\"http:\/\/www.informatica.com\/\">Informatica<\/a> (obnote: I own some INFA stock) to <a href=\"http:\/\/dev.mysql.com\/doc\/refman\/5.7\/en\/mysqlimport.html\">mysqlimport<\/a>\u00c2\u00a0via bash shell for ETL, \u00c2\u00a0SSIS is low on my list. \u00c2\u00a0In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can&#8217;t handle XML with mixed content in the XMLSource widget. \u00c2\u00a0Once I tweaked the .xsd (which I shouldn&#8217;t have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with. \u00c2\u00a0Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool. \u00c2\u00a0This is a monolithic hammer to handle a very simple problem.<\/p>\n<p>Look, all I really want is a list of codes and descriptions from the XML document. \u00c2\u00a0There is a LOT of other useful metadata in there, but for now, it can wait. \u00c2\u00a0Here&#8217;s a simple (not robust) solution in a handful of python lines:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n\r\nimport xml.etree.ElementTree as ET\r\nimport csv\r\n\r\ncsvwriter = csv.writer(open('diagnostics.csv', 'wb'))\r\n\r\ntree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')\r\nroot = tree.getroot()\r\n\r\nfor diag in root.iter('diag'):           # Loop through every diagnostic tree\r\n   name = diag.find('name').text.encode('utf8')  # Extract the diag code\r\n   desc = diag.find('desc').text.encode('utf8')  # Extract the description\r\n   csvwriter.writerow((name,desc))       # write to a .csv file\r\n\r\n<\/pre>\n<p>And there we have a .csv which is much easier to load with whatever tool we want. This works well for the other XML files as well such as the DIndex and EIndex files, except for some reason they use different, optional, tags for their hierarchies&#8230; &#8220;mainTerm&#8221;s are the parent diagnostic codes and &#8220;term&#8221;s are the optional children. I&#8217;ll leave that as an exercise, though, it&#8217;s not too bad. \ud83d\ude09<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ICD-10 coding is a hot topic in medical data circles this year. \u00c2\u00a0The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to&#8230; <a class=\"read-more\" href=\"https:\/\/www.chiplynch.com\/wordpress\/?p=441\">Read More<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[29,30],"tags":[],"class_list":["post-441","post","type-post","status-publish","format-standard","hentry","category-data","category-healthcare"],"_links":{"self":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/441","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=441"}],"version-history":[{"count":2,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/441\/revisions"}],"predecessor-version":[{"id":915,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/441\/revisions\/915"}],"wp:attachment":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}