ED: I spoke to a reporter yesterday for a half hour or so, discussing the final stretch of the Heritage Health Prize data mining competition I’ve been a competitor in for the past couple of years. Her article came out today and is posted here: 3-Million-Health-Puzzler-Draws-to-a-Close. I’m quoted as saying only: “They set the bar too high”. I probably said that; I said a lot of things, and I don’t want to accuse Cheryl of misquoting me (she was quite nice, and her article is helpful, well written, and correct), but I feel like a lot of context was missed on my comment, so I’m just going to write an article of my own that helps explain my perspective… I’ve been meaning to blog more anyway. 🙂
On April 4th 2011, a relatively unknown company called “Kaggle” opened a competition with a $3 Million bounty to the public. The competition was called the “Heritage Health Prize”, and it was designed to help healthcare providers determine which patients would benefit most from preventive care, hopefully saving the patients from a visit to the hospital, and saving money at the same time. And not just a little money either … the $3 Million in prize money pales in comparison to the billions of dollars that could be saved by improving preventive care. The Trust for America’s Health estimates that spending $10 in preventive care per person could save $16 billion per year, which is still just the tip of the iceberg for soaring health care prices in the United States.
The setup of the competition was fairly simple. Competitors were given two basic sets of data. The first, called the “Training Set”, was a collection of two years’ worth of health care claims information along with some details about the members receiving the care. The data was cleaned thoroughly, to avoid any privacy concerns, so that it was all but impossible to tie a real person back to it; every patient was identified by a randomly assigned ID. Still, there was a lot of information — each visit indicated a diagnosis and a procedure in very broad strokes. Cancer, renal failure, and pregnancy were among dozens of diagnostic groupings based on industry-standard ICD-9 codes. The care categories described how the patient was treated: Radiology, Surgery, Anesthesia, or just Evaluation (observation), among others. Halfway through the competition, very basic information about laboratory and drug use was added to the available data. In all there were millions of pieces of information — a solid amount of data that contestants could analyze any way they saw fit.
A second set of data — a Validation Set — was provided that included a third year’s worth of data, but the two sets differed in one way: the first two years included the number of days each patient spent in the hospital in the following year; the third did not. The goal of the competition was to predict that missing number given the rest of the third year’s claims data. For example, if older people with Congestive Heart Failure tended to spend more time hospitalized the following year in the Training Set than other patients did, then the predictions for those people would be higher in the Validation Set.
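To make the task concrete, here is a toy sketch with invented numbers (the real data and the models built on it were far richer): a naive baseline could predict each validation patient’s hospital days using the training-set average for patients with the same condition group.

```python
# Toy illustration of the prediction task (invented data):
# use the training year's average days-in-hospital per condition
# group to predict the hidden value for validation-year patients.
training = [
    {"condition": "CHF", "days_next_year": 5},
    {"condition": "CHF", "days_next_year": 3},
    {"condition": "Pregnancy", "days_next_year": 1},
    {"condition": "Pregnancy", "days_next_year": 0},
]

# Average outcome per condition group in the training set.
totals = {}
for row in training:
    s, n = totals.get(row["condition"], (0, 0))
    totals[row["condition"]] = (s + row["days_next_year"], n + 1)
averages = {cond: s / n for cond, (s, n) in totals.items()}

# Predict for validation patients, whose real outcome is hidden.
validation = [{"condition": "CHF"}, {"condition": "Pregnancy"}]
predictions = [averages[p["condition"]] for p in validation]
print(predictions)  # [4.0, 0.5]
```

Real entries naturally carried many more fields than a single condition, and competitive models combined them in far more sophisticated ways, but the shape of the problem is the same: learn from years with a known outcome, predict the year without one.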
The people running the competition knew the real values of the hidden data, and they were kept secret (apparently quite well) for two years. To score the competition, each team’s predictions were compared against the secret values: a simple formula comparing each of the 70,000+ patients’ real values to the competitors’ predicted values produced an overall score for each team’s submission. Teams could submit one solution per day, and the scores that were made public were based on only 30% of the submitted data. These steps were designed to keep people from learning too much about the secret set and to mitigate a technique called “ladder climbing”, which would be, if not cheating, at least against the spirit of the competition.
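The competition scored submissions with a root-mean-squared-logarithmic-error style metric — lower is better, and a perfect set of predictions scores 0. A minimal sketch (the offset of 1 inside the logarithm keeps zero-day hospital stays well defined):

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error: lower is better.
    Each days-in-hospital value is offset by 1 before taking
    the log so that zero-day stays don't blow up."""
    assert len(predicted) == len(actual)
    total = sum((math.log(p + 1) - math.log(a + 1)) ** 2
                for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))

print(rmsle([0, 2, 5], [0, 2, 5]))  # perfect predictions: 0.0
print(rmsle([1, 1, 1], [0, 2, 5]))  # errors push the score up
```

The log transform is why the scores cluster in such a narrow band: it compresses the penalty for missing on long hospital stays, so even large absolute errors move the score only slightly.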
Only when prizes were awarded — there were three milestone prizes every six months or so, plus the final prize due out next week — were the scores published against the full 100% of the data. The results were interesting… some algorithms fared better than others against “unseen” data, so Kaggle’s choices were borne out to be good ones.
The catch, though (and isn’t there always a catch), was that in order to win the $3,000,000 prize, the winning team had to perform VERY well against the full 100% dataset. In specific terms, their score had to be a 0.40 or lower. Lower is better, since the score is basically a measure of error… how far off their guesses were from the real values. At this point the team in the lead has a 0.435, and the 10th place team is above 0.45. While these may sound like small numbers that are all very close, it took months and months for the first team to break 0.46, and far longer still for the first team to break 0.45. Many times during the competition the forums would have conversations about how likely it was that the 0.40 barrier would be broken, and most people were of the opinion that it would not be. Now, at the end, unless there is some miracle of statistics, it’s all but impossible for that to happen.
Still, with the scores as they are the winning team will receive $500,000… only a sixth of the potential $3 Million, but nothing to sneeze at.
The question of whether the bar was set too high — whether the $3 Million was a scam and the deck stacked firmly against ever paying it — is an interesting one. There are some things to indicate that it was… the 0.40 was set after competitors had already spent some time working, possibly to ensure that the bar would at least be difficult to reach. Also, when the Drug and Labwork datasets were provided, they were far less interesting than participants had hoped they would be. If they had been more detailed — indicating WHICH labs were performed, which drugs were taken, or what the results of the labwork were — better predictions doubtless would have been made.
But these estimates were all made on known approaches to the problem. Standard data mining methods were the first attempted, and undoubtedly Kaggle had some people designing models before the competition began to work out some bugs and validate the thresholds. It would have taken some very clever and in some ways monumental advances in the state of data science to reach the 0.40 target. In fact, Kaggle and the data mining community as a whole had some background in these areas, and certainly came armed to get the most for their money.
One precursor to the HHP that deserves comparison is the Netflix Prize. Netflix, the company that provides DVD rental by mail and streaming movies over the internet, held a competition in a format that was later thoroughly mimicked by Kaggle for the Heritage Health Prize. Except, rather than health care, the dataset was video rentals and movie ratings. Netflix had a vested interest: make better recommendations and better select the movies available, and users will come back for more, meaning more money in Netflix’s pocket and happier users. The Netflix Prize progressed basically the same way. Teams signed up, downloaded the data, developed models to predict movie ratings by users, and submitted the results for scoring. The scoring algorithm was similar: a single value that indicated the total deviation between the real (secret) movie ratings and the competitors’ predictions.
The Netflix prize also wasn’t going to pay their full purse unless the results were worthy; in that case, the $1-million prize required a 10% improvement over their existing algorithms.
Depending on who you ask, there were several important lessons learned from the Netflix competition. To me, two of the most important seemed to be:
- Ensembling (or blending) — the combining of multiple unrelated predictive models into a single predictive model to rule them all, and
- Enhancement — the addition of external data to augment the predictive capabilities of the model (or models)
- But most of all: Openness
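The blending idea in the first bullet can be sketched very simply: combine two models’ predictions as a weighted average. Real ensembles are far more elaborate (stacked models, learned weights), and the models and numbers below are invented for illustration:

```python
# Minimal sketch of blending: average the predictions of two
# (hypothetical) models, weighted by how much we trust each.
def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists."""
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(preds_a, preds_b)]

model_a = [1.0, 0.0, 3.0]  # e.g. a regression model's predicted days
model_b = [2.0, 1.0, 2.0]  # e.g. a tree-based model's predicted days
print([round(x, 3) for x in blend(model_a, model_b, weight_a=0.7)])
```

The power of blending comes from the models being unrelated: where one model’s errors are uncorrelated with another’s, averaging cancels some of each, which is exactly why last-minute team mergers paid off.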
To understand the value of Openness, where teams are in cooperation and competition (coop-etition?) at the same time, just google “simonfunk”. According to some accounts, one of the turning points in the Netflix competition was the publication of the method that one of the participants (who went by the name “simonfunk”) was using, for all his competitors to see. It helped that he had done a great job on his own, but providing everyone with detailed instructions to recreate his score is credited with kicking off a push that ultimately led a team to surpass Netflix’s 10% target. In fact, without this kick, and without some last-minute teaming of different competing groups (thus the Ensembling bullet above), the 10% goal probably would not have been reached. Netflix seems almost prescient in having set it, as it was perilously close to the end of the competition when 10% was finally surpassed. Even in the Heritage Health Prize there was much discussion of techniques on the forums, and one competitor (danb) posted a good deal of knowledge on his way out. Other competitors published blog posts and sample code for no competitive advantage, merely for love of the game.
It’s not that these factors, or any of the Netflix techniques really, were completely original, but Data Science is just now coming of age, and having high-profile, widely reported events like these reach the mainstream media really helped open up discussion of these approaches. It did precisely what its instigators, and Kaggle’s business model, intend it to do: bring more minds to bear on a problem than are available in a normal corporate environment. The internet has even evolved a name for this sort of thing: “Crowdsourcing”, using large, largely anonymous groups of people to solve real problems en masse more quickly and efficiently than would otherwise be possible.
But as I said, Kaggle was aware of all of these techniques. Even outside the Netflix Prize, Kaggle had run some of its own competitions, among the most notable being one backed by Deloitte Consulting (of which I was an employee, though I was completely unattached to the Kaggle relationship) that tried to improve on the rating system used by Chess players and their governing bodies.
We’ll have to wait for the leaders to publish their results to see if they did anything differently, but as it stands now, with knowledge of the processes from the milestone winners (who had to publish their results, thus re-leveling the playing field in theory), there weren’t any great advances during the Heritage Health Prize.
In the context, then, of spurring innovation and achieving the goals of the Heritage Health Prize, the 0.40 benchmark was probably very well chosen, although perhaps not quite as precisely as Netflix’s 10%. To win the huge sum of money, someone would have to make a huge contribution to the field. The size of the gold ring was commensurate with the difficulty of reaching it. If it had been easier to achieve, would it have been less motivating? In the end, of course, the sponsor saved $2.5 million and still got good media attention and, hopefully, some of the best data science targeted at the health care prediction problem. On the other hand, it will still cost them nearly $1 million after the $500,000 prize, all the milestone prizes, and the cost of running the competition. Against the $50 billion per year (or more, or less… estimates vary wildly) spent on hospitalization care, something that could save this much money should be worth it either way. It’s almost certain that Netflix got more than its $1 million’s worth out of the improvement in movie predictions. The HHP was a gamble by both sides… a large sum of money on the table and countless thousands of hours spent by data mining experts on a high-profile, high-value problem.
In the end, I found the quest fascinating. I’ve learned a huge amount about data mining in the process, and I even began working with data for a health-care provider as a client, due in no small part to wanting more exposure to the kind of data I learned about in this competition. I owe the topic a dozen more blog posts, and I may write them, or expand this one. For now, I’m happy the competition is over… it’s been a long walk. As it stands, my team is in 84th place out of 2,000 teams, including far more experienced and competitive groups, some of which doubtless had more time to spend on the problem, and I’m super happy with the results.