Big Data is a hot topic, perhaps even hotter than the Cloud itself. But there is still a notion of voodoo magic around it that prevents wider adoption and limits the potential flow of talent.
Many companies have made great strides towards increasing the accessibility of Big Data. For example, the recent EMC Greenplum UAP announcement, demonstrating Chorus as a human interface to data (structured or not), showed how to make Big Data analytics more accessible. But all these efforts target a very small subset of the potential users. I believe we are now in a unique position to bring the social aspects of data analytics to an even wider audience.
Challenge
When I talk to people about the value of Big Data, especially those outside of the large enterprise circles, I am often faced with the same two arguments:
- Too complex an entry point - General inability to evaluate Big Data application platforms without a substantial upfront investment in development resources and/or required infrastructure
- Really big hammer, but where are the nails? - Even those who understand the value of Big Data analytics well enough to invest struggle with practical application and lack a general vision for valuable use cases
Forrester estimates that businesses today, on average, utilize less than 5 percent of their available data. Why so little? It is not ignorance resulting from a lack of visibility into that data; the rest is simply too expensive or too complex to analyze. Businesses will need to be educated to see the value in that data, and will need affordable platforms to harness the other 95 percent.
Solution
With the growing popularity of Big Data appliances and the increasing availability of Hadoop and Pig implementations, we are now in a unique position to address both of these Big Data misperceptions. One such implementation is the homegrown PageRank solution released by LinkedIn, based on the DataFu library and Pig User Defined Functions.
Recently, the US government has begun to open source pieces of the Data.gov platform. The raw content aggregated from the lowest levels of state, county and municipal government agencies across the country represents a unique opportunity for Big Data evangelism, a chance to bring analytics to the masses.
So far, these petabytes of good-quality data have only been understood at a summary level. Robust, user-friendly, publicly accessible platforms enabling immediate, unadulterated access to that data at the detail level would:
- Demonstrate the power and capabilities of the underlying Hadoop-based platform
- Provide a common test-bed for new products and features
- Create a breeding ground for the next generation of data scientists
- Contribute to the Big Data liberation story
The educational aspect of such a platform can hardly be overstated. This is where Big Data exploration and socialization will occur, by allowing data scientists to manipulate data, collaborate with colleagues, and readily share their findings. I do not believe any analytical platform available today meets these requirements. Could this platform serve as a training ground for the Big Data practitioners of tomorrow?
Content
While the bulk of the initial data set would have to be sourced from Data.gov and Census data, additional self-propagating sources could easily be preconfigured in the system to make the solution more appealing to data scientists from day one:
- User-defined custom sources (XML or JSON web services, and static content such as CSV)
- Website tools such as WHOIS, bit.ly, and Compete
- Services that use email addresses as search terms, including Github
- Finding information from just a name, with APIs such as WhitePages
- Services, such as Klout, for locating people with Facebook and Twitter accounts
- Search APIs, including BOSS and Wikipedia
- Geographical data sources, including SimpleGeo and U.S. Census
- Company information APIs, such as CrunchBase and ZoomInfo
- APIs that list IP addresses, such as MaxMind
These data sets, when combined with features like Real-Time Collaboration, Data Search & Exploration and On-Demand Visualization, would make this platform an immediate hit and assure the ongoing value of the solution over time.
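To make the idea of user-defined custom sources a little more concrete, here is a minimal sketch in Python of how static CSV content and a JSON web service response might be normalized into one common record format. It is only an illustration under stated assumptions: the function names, the `PARSERS` registry and the sample payloads are all hypothetical and do not describe any existing platform.

```python
import csv
import io
import json


def records_from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json(text):
    """Parse a JSON array of flat objects into a list of dicts with string values."""
    return [{key: str(value) for key, value in item.items()}
            for item in json.loads(text)]


# Hypothetical registry: one parser per supported source kind.
PARSERS = {"csv": records_from_csv, "json": records_from_json}


def load_source(kind, text):
    """Dispatch to the parser registered for this source kind."""
    if kind not in PARSERS:
        raise ValueError(f"unsupported source kind: {kind!r}")
    return PARSERS[kind](text)


# Made-up sample payloads, for illustration only.
csv_payload = "county,population\nExample County,1200\n"
json_payload = '[{"county": "Sample County", "population": 3400}]'

# Records from both sources end up in the same shape and can be combined.
combined = load_source("csv", csv_payload) + load_source("json", json_payload)
```

Keeping every record as a flat dictionary of strings is a deliberate simplification; a real platform would presumably need schemas, typing and an XML parser as well.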
Conclusion
This social analytics platform will not only make Big Data more accessible to the masses, but also educate data scientists and other experts about the value of Big Data solutions. By creating a publicly accessible environment where the general public can analyze large, current data sets, we will further the adoption of Big Data concepts and gain access to large numbers of data scientists who are currently outside the reach of the still-evolving market.
The pitch is simple: "Go ahead, scrape, parse, analyze and integrate open and proprietary data. We’ve got huge collections of scientific, social and geo data ready for you to harness. We are excited to see the new insights you will develop."








