May 2011
Monthly Archive
Sun 29 May 2011
Ok, I’m still no expert data miner by a large margin, but I’ve learned a LOT in just a few weeks of playing with the Heritage Health Prize data. The folks on the Kaggle/HHP Chat Board are pretty helpful, and the internet is full of useful information. I’ve taken to using Excel and MySQL far more than any mining-specific tools. I have been interested in R and RapidMiner, and I’ve been able to set up a few basic models with those tools. One thing I’ve been very happy with is the wealth of online tutorials available for just about everything. My resident 16-year-old has been using them for a while to pick up piano and guitar songs, but I haven’t had much use for them until now; I’m pleased to report that the quality of these free online video and web tutorials is pretty high. I have a list started as a del.icio.us tag set if you want to see what I’ve been watching.
I’ve made 9 submissions (the first two or three of which I don’t count; let’s call those ‘test’ submissions). The 9th actually had a worse score than the 8th. Now that interests me. On my tests, which include several different sampling and “cross validation” methods on the two years of available data, my score on each submission improved from the last… not much in this last case, but enough for me to feel reasonable in submitting the algorithm. Why, then, did my result against the real data using the same algorithm go backwards? One possibility is that I’ve been overfitting the data. Basically, my algorithm makes assumptions that are either unnecessary or only applicable to the sample data, and that don’t hold true for the final data. At the tolerances we’re dealing with, it’s still possible that this is just a random selection bias issue, but it’s still interesting, and it’s a common and very important problem in statistical data mining: how can you know when you’ve overfit? When do you know that you’re “trying too hard,” as it were?
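One common way to look for overfitting is to score a model both on the data it was fit to and on data it never saw, and watch the gap between the two. The sketch below is only an illustration of that idea, not the algorithm I submitted: the data is synthetic (a noisy straight line, not HHP data), the model is a toy nearest-neighbour averager, and the names `make_data`, `knn_predict`, and `rmse` are made up for this example. With `k=1` the model simply memorizes its training points, so its training error is zero while its holdout error is not.

```python
import random

def make_data(n, rng):
    """Synthetic data: y = 2x + noise. Purely illustrative, not HHP data."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 3.0)) for x in xs]

def knn_predict(train, x, k):
    """Average the y values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(train, points, k):
    """Root mean squared error of the k-NN model on the given points."""
    errs = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return (sum(errs) / len(errs)) ** 0.5

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)     # data the model is fit to
holdout = make_data(200, rng)   # data the model never sees

# Smaller k means a more flexible model that hugs the training data.
for k in (1, 5, 25):
    tr = rmse(train, train, k)
    ho = rmse(train, holdout, k)
    print(f"k={k:2d}  train RMSE={tr:.2f}  holdout RMSE={ho:.2f}  gap={ho - tr:.2f}")
```

A large gap between the training and holdout columns is the warning sign. A holdout set only helps if it is left alone, though: tuning against it again and again can end up fitting its quirks too.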
Thu 12 May 2011
Technology First is a local IT Trade Group, and their second annual “Technology Landscape Conference” was yesterday, so I dutifully (duty = I’m dating their intern) attended.
Ok, so there was some more duty… one of the companies presenting was ExpeData, a Dayton, Ohio (which is “local” for us folk) company that has a digital writing capture technology. We’ve been working with them for a few months to find some suitable applications and to discuss some security issues and requirements. It’s a fairly interesting technology, although I have some trouble finding its killer app.
Another interesting company whose presentation I attended was Persistent Surveillance Systems. These guys have a 190+ megapixel camera array that they fly over the Cincinnati area (among others), taking pictures about once per second. When they hear about a crime, typically a murder, after the fact, they can go back and assign analysts to review the captured images to track people in the vicinity. Their software allows analysts to assign colored tracks and markers to people, vehicles, and anything else of interest: they initially track suspects, then go back and track anyone they interacted with, anyone nearby (possible witnesses/accomplices), and whatnot. The large pixel view of the city and long video times allow them to watch people drive all the way to their destination (a home, hideout, friend’s house, or whatever), where they can then work with police to get a warrant and follow up as appropriate. Their metadata is even good enough that they can apparently cross-reference locations to find that, for example, the getaway driver from murder A may have lived next door to the suspect from murder B, which may help detectives tie together previously unrelated crimes.
Mon 9 May 2011
So my goal for the weekend was to submit an entry to the Heritage Health Prize. It took me until Monday night (I have to work off-hours, since this isn’t a work-sponsored event), but our team, the Data Monkeys (with Jeremi and Chris at this point), is now entered and, somewhat amazingly, NOT in last place! Yay!
But I’m ahead of myself… the Heritage Health Prize is a data competition run through Kaggle, which runs these sorts of things. It’s a $3 million prize competition for a method of predicting which patients will spend time in a hospital, given their prior years’ medical history.
I’ve been wanting to enter something like this for a while. I don’t harbor any real hopes of winning (I have some fake hopes, of course); this sort of money attracts teams with far more depth of experience in data mining algorithms than I have. Our team leans more toward data management than analytics. Still, this is an opportunity to head in that direction, so I’m going to take it.
Wed 4 May 2011
We had quite the celebration down in Louisville this past weekend for the KY Derby Mini Marathon. Kory, Heather, Jessica, Emily, Mike, Jackie, Maureen, and I all got together at the start line to run the little guy (well, except Mike, who actually showed up to run the FULL marathon, silly uncle). Kudos to Amy, Becca, and the 11,000 other people I don’t know who ran it with us. Angela took some photos, which you can get here. Some highlights, naturally, are here:

Kory with bouncing hair

What is that woman doing to Chip?

Joe doesn't recognize the photographer's mom
Kory had a fantastic time, especially for his first time out and having been out of training for a while. Joe beat me again by two minutes this year, but we both shaved 10 minutes or so off last year’s times.
For the first time, though, we had an out-of-town cheering section, so thanks to Melissa and Andrew, who came all the way down from Ohio to see Heather, and who pretended it was at least a little bit to see me. :-) I know, I know, you’re all really just reading this for the Star Trek pictures, right? You’re SURE you don’t want me to wax poetic on how we all got to share butt-space with the legendary Captain Kirk? No? Ok.
Click here:

Beam us somewhere!