{"id":442,"date":"2013-06-16T02:05:42","date_gmt":"2013-06-16T06:05:42","guid":{"rendered":"http:\/\/happytechnologist.com\/?p=357"},"modified":"2013-06-16T02:05:42","modified_gmt":"2013-06-16T06:05:42","slug":"r-sparklines-and-the-hhp-leaderboard","status":"publish","type":"post","link":"https:\/\/www.chiplynch.com\/wordpress\/?p=442","title":{"rendered":"R Slopegraphs and the HHP Leaderboard"},"content":{"rendered":"<p>I&#8217;m still working on my visualization-fu, so when the <a href=\"http:\/\/heritagehealthprize.com\/\" target=\"_blank\">Heritage Health Prize<\/a> finally got announced, the final scores provided a simple source of data that I wanted to investigate.<\/p>\n<p>I&#8217;ve <a href=\"http:\/\/happytechnologist.com\/?p=251\">written about the HHP<\/a> before. After spending three years with the competition, the <a href=\"http:\/\/online.wsj.com\/article\/PR-CO-20130603-907153.html\">winners were announced<\/a> at Health Datapalooza just a few days ago. Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%. For one thing, I personally dropped from 80th place to 111th, and the winners of the $500,000 prize jumped from 4th place to take the prize&#8230; not an unheard-of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected. The results were published on the HHP site, but I scraped them manually into .csv format to make manipulation a little simpler. 
An Excel file with the raw and manipulated data is attached here for convenience: <a href=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/Final-Standings.xlsx\">HHP Final Standings<\/a>.<\/p>\n<p>A decent visualization for this before-and-after style information is the <a href=\"http:\/\/charliepark.org\/slopegraphs\/\" target=\"_blank\">slopegraph<\/a>. Here&#8217;s an example:<\/p>\n<div id=\"attachment_360\" style=\"width: 442px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_top_50-720.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-360\" class=\" wp-image-360\" title=\"Top 50 Teams\" alt=\"hhp_slopegraph_top_50 (720)\" src=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_top_50-720.png\" width=\"432\" height=\"432\" \/><\/a><p id=\"caption-attachment-360\" class=\"wp-caption-text\">Top 50 Teams<\/p><\/div>\n<p><!--more-->The top 50 teams from the pre-announcement list and the top 50 teams from the final standings are shown (a total of 74 teams, due to teams moving into or out of the top 50). Lower scores are better, so the first-place teams are at the bottom. You notice a few things quickly from this chart. First, the top three teams from the public leaderboard (the 30% dataset) had scores that worsened significantly, allowing the fourth-place team to overtake them on the right (the 100% dataset). It&#8217;s possible this is due to &#8220;overtraining&#8221;&#8230; where, through hundreds of test submissions, scores are artificially improved by making changes customized to the 30% and losing generality in the process. 
There&#8217;s no guarantee that&#8217;s what happened here, but it&#8217;s a reasonable possibility.<\/p>\n<p>The other blatant result is the upward slope of practically every entry. This would be expected if everyone overtrained in some way, but it could also be due to some data bias between the 30% and 70% that made the held-back data harder to predict. To get a better idea, we expand the list to include the top 500 teams on each side in the same manner:<\/p>\n<div id=\"attachment_361\" style=\"width: 442px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_top_500-720.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-361\" class=\" wp-image-361\" title=\"Top 500 Teams\" alt=\"hhp_slopegraph_top_500 (720)\" src=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_top_500-720.png\" width=\"432\" height=\"432\" \/><\/a><p id=\"caption-attachment-361\" class=\"wp-caption-text\">Top 500 Teams<\/p><\/div>\n<p>I could have fixed the upper and lower y-axis limits, but I chose not to. You still see that the upward trend holds pretty consistently. There appear to be two different typical slopes, one steeper than the other, but I haven&#8217;t gone to any lengths to prove that mathematically. Alas, I seem to be in the steeper slope, which means I lost more ground than those in the shallower slope, hence my 30+ position drop.<\/p>\n<p>One last interesting tidbit: while in the first graph it appears that the overall range of scores is reduced (the left side of the graph is much taller than the right side), that trend is far less pronounced in the second chart. This is a good example of the risk of limiting the analysis to only the top 50 scores, and of how the obvious outliers (the top 3 public scores) can alter how a chart is perceived. 
Then again, some reduction was necessary. For completeness and comparison, here is the completely unfiltered chart:<\/p>\n<div id=\"attachment_362\" style=\"width: 442px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_full-720.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-362\" class=\" wp-image-362\" title=\"All Scores\" alt=\"hhp_slopegraph_full (720)\" src=\"http:\/\/happytechnologist.com\/wp-content\/uploads\/2013\/06\/hhp_slopegraph_full-720.png\" width=\"432\" height=\"432\" \/><\/a><p id=\"caption-attachment-362\" class=\"wp-caption-text\">All Scores<\/p><\/div>\n<p>Obviously this paints a very flat and oddly distributed picture. The range of outliers at the top (the very bad entries, probably made by teams just experimenting and making single near-random entries) squishes the important detail near the peloton until it is difficult to discern.<\/p>\n<p>So, a quick little chart analysis that led to some interesting insights. For those who are interested, here&#8217;s the R code for the top-50 chart (which is easy to modify to produce the others):<\/p>\n<pre class=\"brush: r; title: ; notranslate\" title=\"\">\n\n# Load the scraped leaderboard data\nhhp &lt;- read.csv(&quot;HHP Final Standings Sparkline Source.csv&quot;)\n\nhhp$Team.Name. 
&lt;- gsub(&quot;&#x5B;^&#x5B;:alnum:]\/\/\/' ]&quot;, &quot;&quot;, hhp$Team.Name.)\n# x positions of the two columns: public (left) and final (right)\nhhp$left &lt;- 1\nhhp$right &lt;- 2\n# Rank teams on each leaderboard; lower scores are better, so rank 1 is first place\nhhp$rank &lt;- rank(hhp$Final.Score)\nhhp$prerank &lt;- rank(hhp$Public.Score)\n\npng(file=&quot;hhp_slopegraph_top_50.png&quot;,width=6,height=6,units=&quot;in&quot;,res=600)\n\n# Greyscale palette; segments are shaded by final rank\npalette(grey(0:75\/75))\n\n# Keep teams that were in the top 50 on either leaderboard\nwith(hhp&#x5B;hhp$rank&lt;=50|hhp$prerank&lt;=50,],\n{\n xrange=c(.75,2.25)\n yrange=range(c(Public.Score,Final.Score))\n # Empty plot with the right ranges, then one segment per team\n plot( xrange, yrange,type=&quot;n&quot;, ylab=&quot;score&quot;, xaxt=&quot;n&quot;, xlab=&quot;&quot; )\n grid()\n axis(1,at=1:2,labels=c(&quot;public&quot;,&quot;final&quot;))\n segments(left, Public.Score, right, Final.Score,\n col=rank)\n})\ndev.off()\n\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m still working on my visualization-fu, so when the Heritage Health Prize finally got announced, the final scores provided a simple source of data that I wanted to investigate. I&#8217;ve written about the HHP before. 
After spending three years with&#8230; <a class=\"read-more\" href=\"https:\/\/www.chiplynch.com\/wordpress\/?p=442\">Read More<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[31,29,30,32],"tags":[],"class_list":["post-442","post","type-post","status-publish","format-standard","hentry","category-competitions","category-data","category-healthcare","category-r"],"_links":{"self":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/442","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=442"}],"version-history":[{"count":0,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/442\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=442"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=442"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.chiplynch.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=442"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}