2012-01-13

Analytics for the Masses – How to mainstream Big Data and educate a much-needed wave of new data scientists


Big Data is a hot topic; who knows, maybe even hotter than the Cloud itself. But there is a notion of voodoo magic about it that prevents wider adoption and limits the potential flow of talent.

Now, many companies have made great strides towards increasing the accessibility of Big Data. For example, the recent EMC Greenplum UAP announcement, demonstrating Chorus as a human interface to data (structured or not), showed how to make Big Data analytics more accessible. But all these efforts target a very small subset of the potential users. I believe we are now in a unique position to bring the social aspects of data analytics to an even wider audience.

Challenge
When I talk to people about the value of Big Data, especially those outside of the large enterprise circles, I am often faced with the same two arguments:

  • Too complex entry point - General inability to evaluate Big Data application platforms without a substantial upfront investment in development resources and/or required infrastructure
  • Really big hammer, but where are the nails? - Even those who understand the value of Big Data analytics well enough to invest struggle with practical application and lack a general vision for valuable use cases

Forrester estimates that today businesses, on average, utilize less than 5 percent of their available data. Why so little? It is not ignorance resulting from a lack of visibility into that data; the rest is simply too expensive or too complex to analyze. Businesses will need to be educated to see the value in that data, and they will need affordable platforms to harness the other 95 percent.

Solution
With the growing popularity of Big Data appliances and the increasing availability of Hadoop and Pig implementations, like the one LinkedIn released for its homegrown PageRank solution, built on the DataFu library and a number of Pig User Defined Functions, we are now in a unique position to address both of these Big Data misperceptions.

Recently, the US government has begun to open source pieces of the Data.gov platform. The raw content aggregated from the lowest levels of state, county and municipal governmental agencies across the country represents a unique opportunity for Big Data evangelism, a chance to bring analytics to the masses.

So far, these petabytes of good quality data have only been understood at a summary level. Robust, user-friendly, publicly accessible platforms enabling immediate, unadulterated access to that data at the detail level would:

  • Demonstrate the power and capabilities of the underlying Hadoop-based platform
  • Provide a common test-bed for new products and features
  • Create a breeding ground for the next generation of data scientists
  • Contribute to the Big Data liberation story

The educational aspect of such a platform can’t really be overstated. This is where Big Data exploration and socialization will occur -- by allowing data scientists to manipulate data, collaborate with colleagues, and readily share their findings. I do not believe there are analytical platforms out there today that meet these requirements. Could this platform perhaps serve as a training ground for the Big Data practitioners of tomorrow?

Content
While the bulk of the initial data set would have to be sourced from Data.gov and Census data, additional self-propagating sources could easily be preconfigured in the system to make the solution more appealing to data scientists from day one:

  • User-defined custom sources (XML or JSON Web Service & static content like CSV)
  • Website tools such as WHOIS, bit.ly, and Compete
  • Services that use email addresses as search terms, including Github
  • People-lookup APIs that find information from just a name, such as WhitePages
  • Services, such as Klout, for locating people with Facebook and Twitter accounts
  • Search APIs, including BOSS and Wikipedia
  • Geographical data sources, including SimpleGeo and U.S. Census
  • Company information APIs, such as CrunchBase and ZoomInfo
  • APIs that list IP addresses, such as MaxMind

These data sets, when combined with features like Real-Time Collaboration, Data Search & Exploration and On-Demand Visualization, would make this platform an immediate hit and ensure the ongoing value of this solution over time.
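
As a rough illustration of the "user-defined custom sources" item above, here is a minimal Python sketch of how static CSV content and a JSON web service response might be normalized into one list of records. Everything in it is an assumption made for illustration: the function names, the field names and the sample values are hypothetical and do not describe any existing platform.

```python
import csv
import io
import json


def load_csv_source(text):
    """Parse CSV text (first row is the header) into a list of dict records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def load_json_source(text):
    """Parse a JSON payload into a list of dict records.

    Assumes the payload is either a single object or an array of objects.
    """
    payload = json.loads(text)
    return payload if isinstance(payload, list) else [payload]


def merge_sources(*record_lists):
    """Combine the records of several sources into one flat list."""
    merged = []
    for records in record_lists:
        merged.extend(records)
    return merged


# Hypothetical placeholder data standing in for a static CSV file
# and for the body of a JSON web service response.
CSV_SAMPLE = "region,count\nAlpha,100\nBeta,200\n"
JSON_SAMPLE = '[{"region": "Gamma", "count": 300}]'

records = merge_sources(load_csv_source(CSV_SAMPLE), load_json_source(JSON_SAMPLE))
```

Note that CSV values arrive as strings while JSON values keep their types, so a real platform would presumably need a schema or type-coercion step before the sources could be analyzed together.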

Conclusion
This social analytics platform will not only make Big Data more accessible to the masses, but also educate data scientists and other experts about the value of Big Data solutions. By creating a publicly accessible environment where the general public can analyze large and current data sets, we will further the adoption of Big Data concepts and gain access to large numbers of data scientists who are currently outside the reach of the still-evolving market.

The pitch is simple: "Go ahead, scrape, parse, analyze and integrate open and proprietary data. We’ve got huge collections of scientific, social and geo data ready for you to harness. We are excited to see the new insights you will develop."

2011-11-08

Big Data more than just size, true value in business application and real-time analytics


When I talk to people about Big Data, especially those not exposed to the constant onslaught of NoSQL hype, I often get asked the same question: "Is it really that much better than a database?" Assuming this is a comparison with the traditional RDBMS, I want to put down some thoughts on the subject.

The major reason why Big Data, and by extension NoSQL products like Hadoop, are so popular right now is that there simply is no other option for dealing with large, unstructured data sets. Now, here is where many fail to grasp the significance -- it is the size (petabytes) and the unstructured format that are the core problem. Sure, given enough time and money an RDBMS could maybe handle it, but it would be expensive, time consuming and not agile enough to adapt when the business questions change.

That's why a whole new breed of products is emerging, built on top of Hadoop-like solutions, to address the increasing proliferation of data. Some vendors, like Cloudera, manage to deliver pretty impressive offerings to market. The business opportunities, however, reside more in visualization of the ongoing stream of data. It is the decision-making impact that is missing from the Big Data offerings of today. Businesses today must gain new efficiencies in the midst of a very competitive market; they need to derive knowledge from their data to make smarter business decisions.

Those successful in capitalizing on this opportunity deliver applications that use big data technologies as a means of improving business processes, rather than using "Big Data" as the core value proposition of their solution.

The mobile space is a section of the big data market that many believe will develop faster than others. It will certainly require innovation because of the sheer number of devices and the bi-directional content they generate and require. There are more phones than people in the United States! Now add to this the exploding tablet....

A couple of net-new, real-world examples of companies that do understand the difference in the market today: 33Across in advertising and ipTrust in application security. Both build on Big Data as a tool to address real business problems.

On a bigger scale, very soon Big Data will stop being "big" and become a fundamental part of doing business. Accelerating that transition will allow companies to gain an advantage. I am talking about solutions that will be able to absorb the ever-increasing stream of data and deliver real-time analytics to process this information, and thus enable instantaneous decision-making and a smarter way of doing business.

These ideas are not too far in the future. Various big-data hardware optimization techniques and advances in scalability will make true, real-time analytics real.

2011-11-01

Corporate IT Oblivious To New Role, But, Their Renaissance is Coming

Enterprise IT organizations of today are largely unaware of the different ways their space is being impacted. IT personnel are increasingly unprepared to handle their shifting roles and, as a result, will become less and less relevant to the overall value of their companies.

Gross overgeneralization, yes, but so true. This post is inspired entirely by my recent work experiences, so let me vent a little here while I try to avoid being specific so as not to get myself into trouble.

IT managers are latching onto the promise of virtualization purely as a means of lowering the cost of doing business as usual. At the same time, Gartner estimates that as business units strive for more agile processes and technology delivery solutions, IT will have less and less effective control of IT spending, which is already down to 25%. With as much as 70% of their overall capacity locked into assuring the "reliability of mature technologies", IT organizations are simply unable to address the new technical challenges facing their companies.

Now, I do realize there are many organizational and market challenges that IT faces every day. However, I am convinced that it is their unwillingness to challenge the status quo that prevents them from becoming the responsive thought-leaders of the enterprise that they once were.

Peter Sondergaard, senior vice president at Gartner and global head of research, recently said: "Mature technologies are code for obsolete. You must dare to employ creative destruction to eliminate legacy, and selectively destroy low impact systems." Now, with predictions like "Mac have no place in the enterprise" and "Private Cloud is the last resource," I wouldn’t place too much value on Gartner’s advice, but Mr. Sondergaard is onto something here.

It is the "reliability of mature technologies" that sucks the life out of IT on a daily basis. I recently spoke with an IT manager whose team was almost entirely consumed by the maintenance of two systems, Exchange and SharePoint, while at the same time playing an endless game of cat-and-mouse to prevent users from accessing SaaS solutions like Springpad or document collaboration sites like Box.net. Here is the kicker -- their project for next year is to virtualize these "mature" applications! I can think of no better illustration than the Titanic's orchestra making sure their music sounds just right while sinking to their demise.

Much has been written about the Consumerization of Enterprise and the Post-document Era. These events are breaking up the internal IT monopoly and shifting its role from provider to broker. This can be illustrated by the increased use of single-purpose applications which are simple to use, easy to develop, and really good at one thing. Successful enterprise IT will develop their own Cloud strategy and evaluate every initiative against it, instead of being endlessly driven by their products or technology. Watch, the renaissance of IT is coming.

2011-10-27

Federated not Balkanized - The Future of Data and Its Current Cloud Challenges

As a long-term Cloud storage user, I recently wanted to re-evaluate my options. New content management providers had become available and I wanted to make sure I wasn’t missing out on the shiny new tech out there.

As I was considering the pros and cons of each option, I noticed a shift in my personal attitude towards cloud data storage over the last few years. My concerns used to be solely with security. Now, while data security is still critical, I am much more interested in data access, ownership, integration and control.

Many people talk about the recent consumerization of the enterprise, where the lines between our personal and work data are increasingly blurred. But nothing brought it home for me as much as what I saw during the recent VMworld in Las Vegas, where Steve Herrod, CTO of VMware, was talking about the new content storage solution code-named Project Octopus. It provides a Dropbox-like experience to corporate users while preserving IT control over the company content. Before he gave a demo, Steve asked how many of the 20,000+ attendees currently use a consumer cloud storage solution like Dropbox at work. About half of the audience raised their hands. These are some of the most network-savvy, security-conscious users in the Cloud industry!

Should we be surprised? How many of us currently use our personal devices at work? More importantly, how many businesses are actually OK with that? So, how did we get here? More importantly, how must we deal with this exponential growth of data while preserving the necessary level of control?

For starters, we need to realize that with the proliferation of SaaS-based solutions, we are giving up more and more control over our own data. When was the last time you read the agreement before so nonchalantly checking the “I Agree” box while signing up for a new Web-based app?

But SaaS is not the problem here. As we move into the post-document era and increasingly large amounts of our content are managed in the public cloud, we do not necessarily need to give up control. Rather, we need to start thinking about a more federated storage model. Now, I know the concept of “storage federation” gets some people really excited, but what I am talking about here is a model focused not on private vs. public storage but on a fabric that is intimately aware of the data content, its origin, and its access and retention policies in the context of the user’s current identity across all providers.

Tim Berners-Lee, in his recent keynote address at RSA Europe, talked about the demand for control of storage. Not as something we will need in the future, but rather as a clear and present danger that threatens cloud adoption and risks the balkanization of our data, leaving us unable to leverage its real value.

Analyst firm IDC now claims that the growing number of cloud storage providers will lead to combined storage spending of $22.6 billion by 2015. We must figure out a scalable and secure way of controlling all that storage really soon; otherwise the promise of cloud value will be overshadowed by an increased loss of control and, eventually, weaker security.

2011-10-21

The Shifting Sands of IT Infrastructure Management

Ever since his keynote at VMworld in Vegas, I've been following Steve Herrod's posts on the subject of IT management, specifically in the area of infrastructure. The need for higher levels of automation is obvious. However, the need to embed management in the infrastructure in order to achieve that level of automation represents, in my opinion, a huge shift from how we currently view cloud infrastructure.

My takeaways from Steve's latest post on the subject:

  • Automate everything in sight as this is the only way to achieve the efficiency and economics of cloud computing
  • Change IT mentality from provider to service broker
  • Get up to speed on vCenter and vFabric Application Management suites

2011-10-20

Joe Tucci, CEO, EMC Talks Information Processing in Seattle

Joe Tucci spoke in Seattle last night. I did not go, but thanks to a pretty good TV feed (45 minutes long) from the Distinguished Lecturer Series at the University of Washington Huskies, I was able to take some notes on his views on information processing over the next decade and where EMC will focus:


  • A recent IDC report puts information growth at 44x (from 0.9 to 35.2 zettabytes), with 90% of it unstructured
  • Most companies spend 3/4 on maintaining existing infrastructure (will be true for 10 years)
  • A 3D movie is about a petabyte with all camera angles and footage included
  • The average company is attacked 300 times per week
  • IT staffing will increase less than 50% in the next 10 years, but the data under management will grow much faster
  • EMC's mission: to lead customers towards a hybrid cloud
  • x86-based private clouds and hybrid clouds
  • EMC still owns 80% of VMware (was that in question?)
  • There are now more virtual machines shipped than physical machines
  • Killer app: real-time data analytics
  • EMC's 5-year investment plan: roughly 50% of investments in R&D ($10.5B) and 50% in M&A ($14.0B)
  • EMC has 14,000 sales people
  • EMC is now No. 152 in the Fortune 500
    • Revenue: $17B
    • Free cash flow: $3.4B

EMC Modular Greenplum Appliance Supports Multiple Data Types

EMC acquired Greenplum in 2010, and it wasted no time in developing an EMC Greenplum Data Computing Appliance (DCA) combining EMC storage hardware and replication and recovery options with Greenplum's massively parallel processing (MPP) database. EMC's Data Computing Division is expanding on Greenplum's deep support for in-database analytics with partners including SAS and MapR.

EMC introduced its own distribution of Hadoop software in May, and a Modular DCA set for release this fall promises to support the Greenplum SQL/relational database as well as Hadoop deployments on the same appliance. With Hadoop, EMC addresses analysis of truly big data like clickstreams and unstructured data such as social-network comments. The Modular DCA will also support high-capacity storage modules on the same appliance for long-term retention of records to meet regulatory mandates.

Sources:
EMC Tailors Storage Systems For Big Data
Big Data: Informatica Tackles The High-Velocity Problem
Big Data A Big Backup Challenge
12 Top Big Data Analytics Players