Big Data: It’s not the size that matters

Many an article has been spent defining “Big Data.” Everyone agrees that “Big Data” must be, well, large, and made up of data. There are seemingly new ways of handling big data: tools such as Hadoop and R (my personal favorite) and concepts like NoSQL databases. There is also an explosion of data due to new collection tools: faster and more prolific sensors, higher-quality video, and social websites. Large companies with the wherewithal to build petabyte-scale and larger data centers are learning to collect and mine this data fairly effectively, and that’s all very exciting, because there’s a wealth of knowledge to be gleaned from all this data. But what about the rest of us?

The thing is, it’s not really a matter of collecting and hoarding a large amount of data yourself. It’s how you use and take advantage of the data that you do have available that is at the core of these new trends. It used to be that you had to plan what sort of data you wanted, then go out and get it. You would collect requirements and build the database first, then populate it. Accounting, inventory, customer, and whatever databases you needed were tied to computer screens and bar code scanners that were particular to their purpose. The era of Data Warehousing and Enterprise Resource Planning pulled all these databases together, but they were the same old databases — more interconnected and better maintained perhaps, but still “purpose built.” Every field is carefully designed ahead of time with particular meaning and usage requirements.

The exciting promise of Big Data is the ability to unearth useful information by combining data from seemingly unrelated sources, and to answer questions that you didn’t ask when you gathered the information in the first place. In fact, much of the Big Data trend has to do with using data that you don’t own. One of the first things you should ask yourself, if you’re considering how to make use of these new trends, is how you can benefit from the data already available on Twitter, WeatherBug, Google Maps, or any of hundreds of other publicly available data sets.

Now, I picked those examples for a reason: each of them is completely different, each has widespread (if not comprehensive) coverage of its area, and each has been used to answer questions completely unrelated to how its data was gathered. Social data can contribute all sorts of insight into opinions on products, companies, people, and places, even though the basic data collection consists of a single 140-character text field with little context other than who wrote it. Have you researched your company, your products, or your competitors on social networks to understand what people are excited (or upset) about? Have you tracked your sales or operations based on weather patterns? Have you researched targeting advertisements based on crime rates, census information, or political districts?

Big data for its own sake is just an extension of past data mining and database trends. The new trend is extending your data with other loosely related information to find new combinations of interesting facts. The promise of Big Data is about being more than the sum of its parts. There are stories about insurance companies basing rates at least partially on what browser you used to get an online quote. Researchers trying to improve Netflix suggestions mined data from IMDB to get more detailed information about the movies people were watching, and improved their predictive models by doing so. Stock market analysts have had some success predicting prices based on social media trends before news stories hit the mainstream media.
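To make “extending your data” a little more concrete, here is a minimal Python sketch of the idea: take a small in-house data set, join it by date to a data set you don’t own, and check whether the two move together. Everything in it is made up for illustration. The sales figures, the temperatures, and the names `daily_sales`, `daily_high_temp`, `join_on_date`, and `pearson` are all assumptions, not real data or a real API.

```python
from math import sqrt

# Hypothetical in-house data: units sold per day, keyed by ISO date.
daily_sales = {
    "2013-07-01": 120,
    "2013-07-02": 135,
    "2013-07-03": 160,
    "2013-07-04": 90,
    "2013-07-05": 150,
}

# Hypothetical external data: daily high temperature (degrees F),
# standing in for something pulled from a public weather source.
daily_high_temp = {
    "2013-07-01": 78,
    "2013-07-02": 82,
    "2013-07-03": 91,
    "2013-07-04": 70,
    "2013-07-05": 88,
    "2013-07-06": 85,  # no matching sales record, so the join drops it
}


def join_on_date(left, right):
    """Inner join: keep only the dates present in both sources."""
    shared_dates = sorted(set(left) & set(right))
    return [(left[d], right[d]) for d in shared_dates]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


pairs = join_on_date(daily_sales, daily_high_temp)
r = pearson(pairs)
print(f"{len(pairs)} matched days, correlation = {r:.2f}")
```

The join is the interesting part. Neither data set was collected with the other in mind, and the only thing they share is a date. A real analysis would need far more than five days of data before the correlation meant anything.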

A few years ago, Wired Magazine posited the end of the scientific method, with big data leading the charge. The idea is that if there’s enough raw data, you may not have to formulate a hypothesis and build specific tests to validate it. Science has always followed the “purpose built” approach, but now that’s changing. “Meta-studies” that reuse raw data from previous scientific studies have made new discoveries while skipping the often laborious and expensive testing step. As the saying goes: the world’s a much bigger lab.

But are these studies still valid? It’s a slippery slope, and one everyone should be careful with. With access to all this data, it should be easy to find strong correlations, but it may not be easy to figure out what to do with them. Target got in a bit of a public relations bind when it identified pregnant women based on their shopping habits. Many people blame quants (financial data analysts), who used very modern big-data-style analytics, for major stock market fluctuations and even for many of the mortgage issues in the market today. Meta-studies are sometimes criticized for missing nuance or relating data sets that weren’t truly comparable.
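Just how easy is it to find a strong correlation? Here is a rough Python sketch, using nothing but random noise, that compares one short series against a thousand unrelated ones and reports the strongest match it stumbles on. The function names, the series length, and the seed are arbitrary choices for illustration; the only point is that with enough candidate data sets, something will line up by chance.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Correlate one random series against many unrelated random series
    and return the largest absolute correlation found."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


best = strongest_chance_correlation(n_series=1000, n_points=10)
print(f"strongest correlation found in pure noise: {best:.2f}")
```

None of these series has anything to do with the others, yet the best match looks impressive. That is the trap: a strong correlation pulled out of a large pile of data is a reason to ask more questions, not an answer in itself.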

The point is that the tools are available. They don’t require a break-the-bank investment, a private data center, or a five-year plan. With an inquisitive mind, some readily available data sets, and a few tools, you can take on the risks and reap the rewards of this new “Big Data” trend.

This entry was posted in Data, happytechnologist.
