One benefit of global warming (if there is such a thing) is that it does away with yet another argument against moving to Portland from the Bay Area. Take today, for example: late October, and the weather is as if someone turned the awesome all the way up to eleven. So, let’s recap: way better coffee and beer, no sales tax, superior quality of life, nowhere near as much traffic, and now you can take that “six months of rain” argument off your list. I know what you are thinking: what about the jobs? As it happens, we are actively looking for Cloud and Data Architects at Intel to work on something…well, awesome. Are you looking for the next challenge and a more livable place? Ping me on LinkedIn.

Over eight months ago, I joined Intel to work on their next-generation data analytics platform. In large part, my decision was based on Intel’s desire to address the “voodoo magic” in Big Data: the complexity that demands deep technical skills and prevents domain experts from gaining access to large volumes of data. The idea was that by leveraging the distributed data processing capabilities of Apache Hadoop, and combining them with Intel’s breadth of infrastructure experience, we could make Big Data analytics more accessible and therefore more prevalent.

Last week, Intel demonstrated just how serious it is about this vision by announcing a strategic partnership with Cloudera, the largest distributor of Hadoop.

Much has already been written about this partnership. To me, this single largest data center technology investment demonstrates the level of Intel’s commitment to delivering on the promise of an open, performance-optimized platform for big data analytics. As part of this deal, Cloudera will optimize its software to take greater advantage of the features found in Intel processors, which already power the majority of data centers.

One of the fastest-growing areas and biggest opportunities for Hadoop optimization is the Internet of Things (IoT). Whether it is edge signal aggregation, stream processing, or the scalable storage layer, the use of Hadoop in IoT currently demands a substantial layer of specialized code, and that software layer is too complex to develop. Intel’s collaboration with Cloudera will greatly simplify the analysis of machine-generated data and become an intrinsic part of the next-generation IoT analytics platform.

The wider Hadoop ecosystem will benefit too.

The leaders of both companies are already talking about a two-year optimization roadmap and their commitment to release these enhancements upstream into the open source community.

By making this platform generally available, Intel will assure that, in the near future, you will be able to build innovative data solutions that are less expensive and easier to implement, while still realizing the rapid performance improvements happening in Hadoop.

Note: these opinions are my own and do not represent my employer

Update 2013-12-23 - Looks like the lack of programmatic access to the billing API has now been solved: Google just GA’d the Billing Data API.

As part of my recent solution review, I wanted to compare a few performance metrics specific to multi-node data service deployment on different clouds. This post is about my experience with Google Compute Engine (GCE) as part of that evaluation.

API & Tools

When targeting developers, an API and its surrounding tooling are an absolute must. The ability to easily manage and automate cloud resources is something that developers demand. Their usage patterns require efficiency, which, at that level, comes mainly from automation.

Here are three specific areas that set GCE apart from others. Remember, it is not just that other providers lack these capabilities (which in many cases they do), but rather how clean, explicit, and simple the GCE implementation of them is.

REST Interface

One of the benefits of REST as a cloud management interface is its consistent approach to provisioning and managing resources. To manage GCE, clients send authenticated requests to perform a particular action: provision a network, create an instance, associate a disk, and so on.
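For illustration, here is roughly what such a call looks like. This is only a minimal sketch, assuming you already hold an OAuth access token and using a hypothetical project and zone; the API version prefix may differ depending on when you read this:

# List all instances in a zone over the REST interface
curl -H "Authorization: Bearer ${ACCESS_TOKEN}" \
    "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/instances"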

One of the nice GCE touches in assisting programmatic implementation is that the Cloud Console GUI exposes the REST and command-line equivalents of each operation. Developers can simply copy the defined operation and use it in their automation tools, removing the guesswork from the initial message format.

Command-line tool

gcutil is a command-line tool designed to help with the management of GCE resources. Written in Python, gcutil runs natively on any UNIX-based OS or under Cygwin on Windows. The important thing to realize here is that while gcutil is a command-line tool, it still uses the same REST interface to send its commands to GCE.

One of the things that I often long for in cloud management APIs is support for multiple personas. With gcutil it is as simple as providing an existing credentials file (--credentials_file). This way, separating accounts is just a runtime flag away.
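For example (a quick sketch; the file paths and project names here are hypothetical), switching personas is just a matter of pointing gcutil at a different credentials file:

# Work identity
gcutil --credentials_file=~/.gcutil_auth_work listinstances --project=work-project

# Test identity
gcutil --credentials_file=~/.gcutil_auth_test listinstances --project=test-project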

What makes gcutil really user-friendly for developers, however, is its ability to set default values for common operations. By caching the values of common flags (--cache_flag_values), gcutil can reuse arguments like --zone or --machine_type across multiple commands.
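A rough sketch of how that plays out (the zone and machine type values are just examples):

# Cache --zone and --machine_type along with this command...
gcutil --cache_flag_values addinstance node-0 \
    --zone=us-central1-a --machine_type=n1-standard-4

# ...so subsequent commands can omit them
gcutil addinstance node-1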

Perhaps the part that makes gcutil most unique, though, is its ability to perform each command in either synchronous or asynchronous mode. By default, gcutil waits for each command to complete before returning control. In asynchronous mode, however, gcutil returns a request ID immediately after posting the request. This was a massive feature for me when testing a number of cluster node discovery strategies.

These features, combined with the ability to customize the result format of each command (JSON, CSV, table), as well as the option to return only the names of newly created resources so that the results of one command can be piped as input into another, make gcutil one of the best thought-through IaaS clients I have ever seen.

Speed & Flexibility

In my short experience, I found instance and disk (yes, not “volume”) provisioning, as well as general instance startup, on GCE to be fast. My specific interest was the time it took to spin up, configure, and terminate entire clusters of data nodes. In that specific use-case, GCE was faster than EC2, Azure, or Rackspace.

The project metaphor, while somewhat awkward for me initially, quickly became a clean way to separate distinct areas of work. Additionally, its integration with the advanced routing features allowed me to easily create gateway servers (aka VPNs) to span clusters across my local network and the GCE network.

For me personally, perhaps the biggest feature was the metadata support. In addition to basic key-value tags, every GCE instance carries metadata that includes information defined by the service itself, such as the instance host name, as well as user-defined data.

gcutil addinstance "node-${NODE_COUNT}" \
    --metadata="cluster:${CLUSTER_NAME}" \
    --metadata="index:${NODE_COUNT}" \
    ...

Instance configuration, as well as the configuration of other instances in the same project, is available in the form of a REST query against the provisioned metadata server. Metadata can also be defined at the project level.

The place where this capability really came in handy for me was node-level metadata. By simply defining a metadata value for the node index, I was able to have individual data nodes define their own unique names (--metadata=name:node-0) as well as query the project-level data for the cluster name.

Custom metadata becomes especially useful when used with startup scripts that execute during instance boot. Using gcutil, I was able to pass in a single local startup script using the --metadata_from_file flag and have it discover its variables from metadata parameters.

NODE_INDEX=$(curl http://metadata/computeMetadata/v1beta1/instance/attributes/index)
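To round this out, here is a hedged sketch of how the startup script itself can be passed in and how a project-level attribute could be read from within it. The startup-script key follows GCE’s documented metadata convention; the cluster attribute name and the script file are hypothetical:

gcutil addinstance "node-${NODE_COUNT}" \
    --metadata_from_file=startup-script:bootstrap.sh \
    --metadata="index:${NODE_COUNT}"

# inside bootstrap.sh: project-level attributes sit next to the instance-level ones
CLUSTER_NAME=$(curl http://metadata/computeMetadata/v1beta1/project/attributes/cluster)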

Pricing

In my particular test cycles, I must have deployed close to 1,000 individual instances across EC2 and GCE. Each one of these instances stayed up for a maximum of 15-20 minutes, just enough to run a set of tests on the new cluster. The part that makes GCE a lot more compelling for these kinds of use-cases is the granular pricing. Google prices its instances in one-minute increments with a 10-minute minimum, not hourly like EC2. A 15-minute test node, for example, is billed for 15 minutes on GCE but for a full hour on EC2, and across hundreds of short-lived instances that difference adds up quickly.

One area where GCE is perhaps not as flexible as I would like is billing. I do like the flexibility to charge individual projects to different credit cards, but I would like to see a consolidated billing option there too. Also, this is the one area that is not supported by the API!

GCE seems like a fundamentally different type of IaaS, designed specifically with developers in mind. It probably will not challenge EC2 anytime soon, but over time, provided Google augments its list of service offerings, GCE’s focus on developers will pay off. Having experienced their tools first-hand, it is clear these guys know how to run infrastructure at massive scale without alienating developers.

I am excited to share with you today that starting Monday I will be joining the Big Data team at Intel. Yes, Intel. While not traditionally known for its software offerings, Intel has recently entered the Big Data space by introducing their own, 100% open source Hadoop distribution with unique security and monitoring features.

As illustrated by their GitHub repository, Intel has already done a lot of work with Apache Hadoop. The repos worth mentioning in particular are the work on HBase security in Project Rhino, as well as the work on advanced SQL support in Hive and performance enhancements for HBase in Project Panthera. In addition to these projects, Intel has also established its Benchmarking Suite and Performance Analyzer projects, which aim to standardize measurements around real-life Hadoop workloads.

As a solution architect, I will work on a team dedicated to designing the next-generation data analytics platform by leveraging the full breadth of Intel’s infrastructure experience with compute (Xeon processor), storage (Intel SSD), and networking (Intel 10GbE), as well as its cryptographic and compression technologies (AES-NI, SSE).

So why Intel?

If you have read my blog over the last couple of years, you will know how passionate I am about data. I believe Apache Hadoop in particular represents a unique opportunity for the enterprise, as it challenges many preconceived notions about analytics from both a scale and a cost perspective.

My vision for Hadoop is for it to become a truly viable data platform: open, easy and performant.

A platform upon which the community can innovate and solve real, complex problems. I joined Intel because it provides me with the means to execute on this very vision. After all, Intel’s Hadoop distribution is the only open source platform backed by a Fortune 100 company.

About a year and a half ago, I wrote about Big Data Opportunities, focusing primarily on Leveraging Unused Data, Driving New Value From Summary Data and Harnessing Heterogeneous Data Sensors (more recently known as Internet of Things).

Since that post, the data space has exploded with numerous solutions addressing many of these areas. While these solutions are mostly based on batch operations and limited to serial MapReduce jobs against frequently off-line, inadequately secured Hadoop clusters, they do allow access to previously inaccessible data.

While recently attending the Hadoop Summit, I got a glimpse of the upcoming trends, and I decided to outline three new areas of opportunity in the data management space:

Complex Event Processing During Ingestion

As we are often reminded, Hadoop, in its original architecture, was not built for low-latency access. Over the last year, a few point-solutions came to market to address this very issue, but now, with the introduction of YARN as part of the upcoming Hadoop 2.0 release, I think we finally have a platform on which to build open support for ingestion-processing listeners.

I envision these implemented as a set of native Hadoop components capable of eventing on ingested data without impacting the throughput (no need to throttle the data) or the format in which the data is persisted in HDFS (no proprietary formats).

So, why is all this dynamic capability important? Many companies have already decided to use HDFS as a global data repository. But the lack of low-latency access prevents any kind of dynamic analytics on that data. I suspect that in the near future, projects like Tez and Spark, besides enabling real-time validation, ID generation, and cleansing, will greatly expedite the time it takes to derive actionable insight from ingested data. Perhaps someone will finally come up with a solution for “continuous query” on top of HDFS (yes, a little GemFire nostalgia here from my VMware days).

On the other hand, if Hortonworks, as part of its Stinger initiative, does deliver on its challenge to make Hive 100x faster, low-latency access to newly ingested data in HDFS through a SQL interface may be sufficient.
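Something as simple as the following might then be good enough for ad hoc checks and dashboards. This is only a sketch, assuming an external Hive table already defined over the ingestion directory; the table and partition names are hypothetical:

# Count today's newly ingested events straight from the Hive CLI
hive -e "SELECT COUNT(*) FROM events WHERE dt = '2013-07-01';"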

Upgrade Compliance Approach

In many cases, traditional compliance rules have not caught up with the new data management patterns of Hadoop. Most of the clusters I have ever seen not only lack a comprehensive security layer, they even lack good data access auditing. Initiatives like Apache Accumulo are introducing cell-level access management on top of Hadoop, but as more and more data is loaded into HDFS, there is a growing opportunity for a comprehensive monitoring solution, especially as it relates to large, distributed enterprise data pipelines.

This is not only a technology problem.

Many of the traditional data compliance approaches don’t fit well in the new data access paradigms.

Especially when you consider the traditional regulations on data at rest and their implementation on distributed data grids with data partitioning and query sharing during access.

Heterogeneous Data Management

Traditional Relational Database Management Systems (RDBMS) relied on the SQL interface to extend their functionality and manage their state (e.g. grant privileges on object to user;). However, with the growing number of hybrid data systems, there is now a need to read, write, and update data across multiple systems. For example, documents stored in Cassandra are often indexed in Elasticsearch; data persisted in HDFS is cataloged in HCatalog and cached in Memcached.

The goal here is that instead of programming applications directly against these underlying technologies, there should be a more transparent and centralized “abstraction layer”, exposed as an API, that simplifies this effort for developers and enables an easier upgrade/replacement path for infrastructure architects.

There you have it, my three areas of opportunity for innovative data solutions. If last year taught me anything, though, it is that by the time I write next year’s version of this post, it will likely include technologies and challenges we have not even heard of today.

Last week I attended the Hadoop Summit. This two-day conference was packed with deep sessions on many of the current and upcoming Big Data technologies. One of these technologies, YARN (Yet Another Resource Negotiator), simply stole the show. While still only in preview, YARN is quickly becoming the hottest feature of the upcoming Hadoop 2.0 release. (For a good overview of Apache Hadoop YARN and the paradigm shift it represents, read this.)

The most exciting thing about the upcoming 2.0 release is the way in which the Hadoop community responded to the enterprise adoption challenges.

Hadoop is traditionally known for its batch processing capabilities. While multiple vendors keep coming up with new point-solutions that bolt SQL support onto Hadoop, the community, with perhaps more clarity of vision than I have ever witnessed in an open source project, pressed on, believing that SQL support must be integrated inside Hadoop. The benefits of this approach are clearly visible in YARN. Not only does it improve Hadoop’s resource management capabilities (greater scalability and better performance), it also finally enables users to build robust new applications on top of the platform.

These applications are no longer bound by MapReduce capabilities. They can still leverage SQL as an online interface to HDFS-stored data through Hive, but, when more fitting, they can leverage other data processing technologies like:

  • Tez for complex, acyclic graphs of interactive tasks
  • Storm or S4 for data stream management
  • Spark for in-memory data processing

With YARN, Hadoop has finally become a highly customizable enterprise data platform capable of supporting multiple processing tasks across both low-latency and high-throughput workloads. Even more importantly, the Hadoop community has proven its ability to address hard adoption challenges inside the platform, in a structured, pragmatic manner.

It is that proof point that will now make Hadoop a key component of enterprise data architecture.

Earlier this week I published a post on the importance of “native” data services. These services deliver predictable, low-latency connectivity to your data regardless of the underlying infrastructure in which your application is deployed. After I published that post, however, my colleagues challenged me: while assuring data locality is certainly important, and should be aspired to, the reality of the shifting landscape in today’s enterprise makes it a utopian notion.

As one who strives for pragmatism, I aim for a less unicorn-like approach to data provisioning. So, I admit, unless your organization starts with green-field applications, for the foreseeable future you will need data flow automation.

This is partially due to the increasing distribution of data. Today, if you go to any of the Fortune 5000 companies, you’ll find multiple types of data engines. Obviously there are traditional RDBMSs like Oracle or SQL Server, but you will also find key/value stores like Redis or Riak, and, almost certainly, one of those cool new document stores like MongoDB or Couchbase. And that’s not even considering the specialized solutions for graph, analytics, or caching. All these engines store business-critical data. They manage it across internal repositories in a wide variety of product-specific, sometimes proprietary, formats.

Besides the fact that this approach leads to multiple copies of the same data without any well-defined source of “truth”, deriving any kind of value from such dispersed stores can be technically challenging. Those who figure out how to do it without the need for constant data copying across multiple storage architectures stand to gain market leadership and certainly reap financial benefits. Need proof? Consider AWS. Their new Data Pipeline Service, announced at the recent re:Invent conference, scales from a simple piece of business logic against a small dataset all the way to sophisticated batches executing against Elastic MapReduce, RDS, or even S3.

But perhaps a better approach would be to avoid moving data altogether and instead integrate multiple sources, like the Hadoop Distributed File System (HDFS), with some kind of relational database, overlaying them with a caching layer to enable federated queries across those repos. Such an approach would allow developers to leverage these sources through a variety of batch processes as well as highly optimized, low-latency transactional workloads enabled through an in-memory data layer. (See my recent post on using HDFS for this very reason here.)

So, in response to my colleagues, I still think data locality will play a massive role in the long-term adoption of PaaS.

But, whether you think this transition is imminent or more gradual, if your organization does not have a scalable data storage strategy today that is capable of at least co-locating all of your data at rest, you risk finding yourself in the midst of sprawling ETL jobs, endlessly chasing information across multiple storage platforms while being unable to derive even a fraction of the value of your own data.

In general, Platform as a Service (PaaS) is developed by developers for developers. Of course they’re going to love it.

It enables them to focus on the nuances of their applications – not on the day-to-day pointless activities that so often take their time away from solving real problems.

The non-developers point to the abstraction of the underlying infrastructure and dynamic resource allocation as some of the core benefits of PaaS. In short, we often view PaaS as a runtime execution engine that trivializes the complex aspects of application development and deployment.

The problem with that kind of view, however, is that it focuses primarily on the run-time aspects of the platform. This may be a result of some vendors treating data services as an external concern, strapped onto the platform as an add-on, almost as an afterthought. Heroku, for example, provides only Postgres as its one “native” data service, while OpenShift does slightly better, adding MySQL and a community-supported edition of MongoDB.

Everyone would agree that add-ons play an important role in the extensibility of any PaaS solution. I would argue, however, that as the “open” and “polyglot” aspects of PaaS become the de facto standard, a more holistic view of the entire application platform, including a diverse selection of native data services, is quickly becoming a major differentiator.

Today, for example, you would not choose a PaaS without support for the most common development frameworks, or without the ability to run unmodified in the public cloud and in private data centers.

In the very same way, you should not choose a PaaS solution without integrated, native, and diversified data service support. As many of you know, I work for VMware, which initiated the open source PaaS solution called Cloud Foundry.

Right now, Cloud Foundry delivers the richest selection of native data services on the market, including MySQL, PostgreSQL, MongoDB, RabbitMQ, and a couple of different versions of Redis.

These services deliver predictable, low-latency connectivity to your data, whether your application is deployed to the public instance of Cloud Foundry operated by VMware, to an AWS-based instance operated by one of our ecosystem partners like AppFog, or to a private instance running out of your own data center. Whichever Cloud Foundry instance your application targets, a data service provisioned by Cloud Foundry will behave exactly the same.
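As a quick sketch of what this looks like in practice with the vmc client (the service and application names are hypothetical, and the exact argument order may vary between vmc versions):

# Provision a native MySQL service and bind it to an application
vmc create-service mysql my-db
vmc bind-service my-db my-app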

However, it would be naïve to expect all necessary data services to always be available natively. For just these kinds of situations, Cloud Foundry provides an open source Service Broker (yes, a service extending a service), which delivers the very same provisioning characteristics to external or legacy services that Cloud Foundry does not currently offer. The best part is that these services can be managed through the same API and benefit from the very same native integration into your application.

In short, if application mobility is important to you, please view data services as an intrinsic part of your PaaS strategy.

Add-ons are great and certainly appropriate in many cases; just make sure they don’t become a gateway drug locking your application to a specific provider.

The “high-priests” of Big Data have spoken. Hadoop Distributed File System (HDFS) is now the de facto standard platform for data storage. You may have heard this “heresy” uttered before. But, for me, it wasn’t until the recent Strata conference that I began to really understand how prevalent this opinion actually is.

Perhaps even more important is how big an impact this approach to data storage is beginning to have on the architecture of our systems.

Since the Strata conference, I’ve tried to reconcile this new role of HDFS with yet another major shift in system architecture: the increasing distinction between where data sleeps (as in where it is stored) and where data lives (as in where it is being used). Let me explain how one relates to the other, and why I actually now believe that HDFS is becoming the new, de facto standard for storing data.

HDFS Overview

HDFS is a fault-tolerant, distributed file system written entirely in Java. The core benefit of HDFS is its ability to store large files across multiple machines, commonly referred to in distributed computing as “nodes”.

Because HDFS is designed for deployment on low-cost commodity hardware, it depends on software-based data replication to achieve its reliability. Traditional file systems would require the use of RAID to accomplish this same level of data durability, but HDFS does it without any dependency on the underlying hardware. HDFS divides large files into smaller individual blocks and distributes replicated copies of these blocks across multiple nodes.
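From the client side, this is what working with HDFS looks like; a minimal sketch, assuming a configured Hadoop client and hypothetical paths:

# Copy a large local file into HDFS; it is split into blocks and replicated behind the scenes
hadoop fs -put ./clickstream.log /data/raw/clickstream.log

# Inspect how the file was broken into blocks and where the replicas landed
hadoop fsck /data/raw/clickstream.log -files -blocks -locations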

It is important to note that HDFS is not a general-purpose file system. It does not provide fast individual record lookups, and, its file access speeds are pretty slow. However, despite these shortcomings, the appeal of HDFS as a free, reliable, centralized data repository capable of expanding with organizational needs is growing.

Benefitting from the growing popularity of Hadoop, where HDFS serves as the underlying data storage, HDFS is increasingly viewed as the answer to the prevalent need for data collocation. Many feel that centralized data enables organizations to derive the maximum value from individual data sets. Because of these characteristics, organizations are increasingly willing to ignore the performance shortcomings of HDFS as a “database” and use it purely as a data repository.

Before you discredit this approach, please consider the ongoing changes taking place in on-line application architectures: specifically, the shift away from direct queries against the database and the increasing reliance on low-latency, high-speed data grids that are distributed, highly optimized, and most likely host the data in memory.

Shift in Data Access Patterns

Increasingly, the place where data is stored (database) is not the place where the application data is managed. The illustration that perhaps most accurately reflects this shift is comparing data storage to the place where data sleeps and data application to the place where data lives.

Building on this analogy, the place where data is stored does not need to be fast; it does however need to be reliable (fault-tolerant) and scalable (if I need more storage I just add more nodes).

This shift away from monolithic data stores is already visible in many of today’s Cloud-scale application architectures. Putting aside the I/O limitations and the obsessive focus on atomicity, consistency, isolation, and durability (ACID) in traditional databases, which leads to resource contention and subsequent locks, simply maintaining query execution speed as the data grows in these types of databases is physically impossible.

By contrast, new applications architected against in-memory data grids benefit from already “buffered” data, execute queries in parallel, and are able to asynchronously persist modifications to storage so that these operations do not negatively impact performance. This approach results in greater scalability of the overall application and delivers raw speed an order of magnitude beyond disk-based, traditional databases.

It is important to realize that these in-memory data grids are not dependent on the persistence mechanism and can leverage traditional databases as well as next-generation data storage platforms like HDFS.

New Data Storage Architecture

As in-memory data grids become the backbone of next-generation on-line applications, their dependency on any specific data storage technology becomes less relevant. Overall, organizations want durable, scalable and low-cost data storage, and HDFS is increasingly becoming their consolidated data storage platform of choice.

As you can imagine, this is not an all-or-nothing situation. Whatever the specific workload is – write-intensive or demanding low latency – HDFS can support these requirements with a variety of solutions. For example, an in-memory grid can be used for sub-second analytical processing of terabytes of data, while the data persisted to HDFS serves as a traditional data warehouse for back-office analytics.

Considering the relatively short life span of HDFS, its ecosystem already displays a remarkable level of maturity. Solutions like Cloudera’s open source Impala can now run against raw HDFS storage and expose it to on-line workloads through a familiar SQL interface, without the overhead of MapReduce (as incurred by Hive).
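As a hedged sketch of what an on-line query against HDFS-resident data might look like through Impala (the table and columns here are hypothetical):

# Ad hoc SQL over data sitting in HDFS, no MapReduce job involved
impala-shell -q "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id LIMIT 10;"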

The Kiji Project is another example of an open source framework building on top of HDFS to enable real-time data storage and service layer for applications. Impala and Kiji are just a few frameworks of what is likely to become a vibrant ecosystem.

Many organizations have already started to leverage HDFS’s capabilities for various non-Hadoop-related applications. At Strata, I attended a session on HDFS integration presented by Todd Lipcon from Cloudera and Sanjay Radia from Hortonworks. It was a great overview of the vibrant technological integrations of HDFS with tools like Sqoop, Flume, FUSE or WebHDFS…just to name a few.

HDFS also has a large set of native integration libraries in Java, C++, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk and many more. Additionally, HDFS has a powerful command-line and Web interface, as well as the Apache HBase project, which, when necessary, can run on top of HDFS to enable fast record-level access for large data sets.
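WebHDFS in particular makes HDFS reachable from anything that can speak HTTP. A quick sketch, assuming a NameNode on the default HTTP port and a hypothetical path:

# List a directory over the WebHDFS REST interface
curl -i "http://namenode:50070/webhdfs/v1/data/raw?op=LISTSTATUS"

# Read a file back (the NameNode redirects to a DataNode, hence -L)
curl -L "http://namenode:50070/webhdfs/v1/data/raw/clickstream.log?op=OPEN"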

Once the data is centrally located, the well-documented concept of Data Gravity, originally coined by Dave McCrory, comes into play: among many other things, it has the effect of attracting new applications, potentially resulting in a further increase in data quality and overall value to an organization.

I am not saying that all future data processing frameworks should be limited to HDFS. But, considering its prevalence in the Big Data market, its low cost and scalability, and the vibrant ecosystem of libraries and projects around it, it may be wise for organizations to start considering HDFS as their future-proof data storage platform.

We are in the midst of a drastic shift in the application development landscape. Developers entering the market today use different tools and follow different patterns.

One of the core patterns of on-line application development today is cloud-scale design. While architectures traditionally relied on ever more powerful servers, today that approach simply does not scale. We have reached the point where, in many cases, there are no servers powerful enough, or their cost would be prohibitive. Considering their unpredictable usage patterns, today’s on-line applications must also be flexible enough to address demand spikes and assure efficient service during periods of low utilization.

Increasingly, organizations are comfortable scaling capacity that way at the web and app server layers. However, as the number of application instances increases to accommodate growing demand, their data layer is often simply unable to keep up.

The overall performance of any solution is only as good as its lowest common denominator. Increasingly, that lowest common denominator of today’s on-line applications is the database.

The time of the one-size-fits-all database has expired. When faced with today’s performance requirements, we must consider the most appropriate data solution for each specific use case. To assure the necessary scale-out architecture, we must choose a database that is actually optimized for the type of workload the application requires.

Now, some believe that these data characteristics are unique to the dotcom space. I would argue, however, that right now the enterprise itself is in the midst of a data renaissance. Due to the rapid proliferation of new data sources, enterprise applications require longer storage, faster delivery, and higher availability than ever before. The lines between public and enterprise application architectures are blurring.

Any organization developing applications today should have a well-thought through approach for managing data in three specific categories:

  • Big data sets of infrequently changing content persisted in many kinds of formats
  • Volatile data to support fast delivery in on-line transactional applications
  • Flexible data platform capable of adapting to ever-changing development requirements

So, the next time someone recommends that “uber” database that does everything, don’t buy it. There simply is no solution on the market today that addresses all of these needs well.