This entry was originally posted on the CloudSource blog in November 2011.

Now that cloud is slowly moving beyond the “peak of inflated expectations” on the Gartner hype cycle, a new hype is arriving quickly. It’s called “big data.” Let’s first define the term, so we’re sure what we are talking about. According to Wikipedia, big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. Examples include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.

Managing the data

Now that’s clear. The world’s ‘digital universe’ is adding 1.8 zettabytes in 2011 and continues to grow exponentially, with projections of 8 zettabytes in 2015 and 35 zettabytes in 2020. Seventy percent of that data is generated by individuals, and 85 percent consists of unstructured data. We call that human information. Every second 97,000 tweets are added, every minute 12 million texts, and every day 294 million emails.

Did you know that today a single commercial flight across the U.S. generates 240 terabytes (TB) of wireless sensor data? The key issue is no longer capturing the data, but actually storing it. In an ongoing project, we gather 1.35 TB in a 15-minute experiment using 1 million wireless sensors. With a stable 56MB wireless link, we need around 42 hours to gather and store the data, so new data transfer mechanisms and approaches need to be invented.
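
To make the transfer-time arithmetic concrete, here is a minimal Python sketch. It assumes decimal units (1 TB = 10^12 bytes) and, because "56MB" does not say whether the link is rated in megabits or megabytes per second, it tries both readings. The function and variable names are illustrative only and do not come from the project described above.

```python
def transfer_hours(data_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to move data_bytes at a sustained throughput."""
    return data_bytes / throughput_bytes_per_s / 3600


def implied_throughput(data_bytes: float, hours: float) -> float:
    """Sustained throughput (bytes/s) implied by moving data_bytes in `hours`."""
    return data_bytes / (hours * 3600)


# Assumption: 1.35 TB means 1.35 * 10**12 bytes (decimal terabytes).
DATA_BYTES = 1.35e12

# Assumption: two possible readings of the "56MB" link rate.
hours_if_megabits = transfer_hours(DATA_BYTES, 56e6 / 8)  # 56 Mbit/s
hours_if_megabytes = transfer_hours(DATA_BYTES, 56e6)     # 56 MB/s

# Sustained rate that would move the data set in the quoted 42 hours.
rate_for_42h = implied_throughput(DATA_BYTES, 42)

print(f"At 56 Mbit/s: {hours_if_megabits:.1f} hours")
print(f"At 56 MB/s:   {hours_if_megabytes:.1f} hours")
print(f"42 hours implies about {rate_for_42h / 1e6:.1f} MB/s sustained")
```

Under these assumptions, the nominal readings give roughly 54 hours and 6.7 hours, while 42 hours corresponds to a sustained rate of about 8.9 MB/s. Whichever reading applies, moving the output of a 15-minute experiment takes many hours, which is the point: the transfer, not the capture, is the bottleneck.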

Whether it is for disaster recovery, for backup, or to perform compute-intensive analytics or calculations, the time and cost of storing the data in the cloud are often forgotten. So the first question to raise is whether the data needs to be in the cloud in the first place, or whether a hybrid approach that integrates public cloud with enterprise IT resources (be it private cloud or legacy) would be a better fit. Let’s look at an example where we combine social networking data, already in the cloud, with enterprise information.

Understanding your customer behavior

If you want to know what the world thinks about your product, your brand, and your services, you had better take a look at tweets, blog entries, and forums. In the past, people moaned about how bad a service was at the local bar; today they do it on Twitter. You can no longer ignore that fact if you want to stay competitive.

A survey of senior business and technology executives that we commissioned showed that enterprises typically leverage only 5 percent of the available information, that 48 percent do not have an effective information strategy in place, and that only 2 percent can deliver the right information at the right time to support enterprise outcomes 100 percent of the time.

The social media data I talked about is located in the cloud, by definition. But ideally, companies want to cross-correlate this data with their own customer information. HP Labs did just that with their “project fusion.” They learned to predict customer behavior by merging social media and company data. Many larger companies are not interested in migrating all their customer data to the cloud, so it’s key to be able to integrate data from multiple sources into such a common analysis.

HP’s approach

HP recognizes the importance of being able to search not just structured data, but also the huge amount of unstructured data. By combining Vertica’s Analytic Platform, focused on the analysis of structured data, with Autonomy’s Meaning Based Computing approach, HP now offers an environment through which you can really understand what’s happening. The combination of multiple information sources allows you to keep the data where it is while taking full advantage of the information embedded in it. This is what we call the Human Information Era.

So, big data may still be hype, but the data is there and enterprises need to take that into account. Tools exist today, as we demonstrated with project fusion, and there is more to come. You really want to look at this because it may give you an unfair advantage in doing business. And if you don’t do it, your competitor might. That would be a real pity, wouldn’t it?