2013-05-31

My experience with Google Compute Engine

As part of my recent solution review, I wanted to compare a few performance metrics specific to multi-node data service deployment on different clouds. This post is about my experience with Google Compute Engine (GCE) as part of that evaluation.

API & Tools

When targeting developers, an API and its surrounding tooling are an absolute must. The ability to easily manage and automate cloud resources is something that developers demand. Their usage patterns require efficiency, which, at that level, comes mainly from automation.

Here are three specific areas that set GCE apart from others. Remember, it is not that other providers do not have these (which in many cases they do not) but rather about how clean, explicit and simple GCE implementation is in these areas.

REST Interface

One of the benefits of REST as a cloud management interface is its consistent approach to provisioning and management of resources. To manage GCE, clients send authenticated requests to perform a particular action: provision network, create instances, associate disks, etc.
One of the nice GCE touches in assisting programmatic implementation is that the Cloud Console GUI exposes the REST and command-line equivalents for each operation. This allows developers to simply copy the defined operation and use it in their automation tools, removing the guesswork from the initial message format creation.

Command-line tool

gcutil is a command-line tool designed to help with management of GCE resources. Written in Python, gcutil runs natively on any UNIX-based OS or under Cygwin on Windows. The important thing to realize here is that while gcutil is a command-line tool, it still uses the same REST interface to send its commands to GCE.

One of the things that I often long for in cloud management APIs is support for multiple personas. With gcutil it is as simple as providing an existing credential file (--credentials_file). This way, separating accounts is just a runtime flag away.

What makes gcutil really user-friendly for developers however is its ability to set default values for common operations. By caching values of common commands (--cache_flag_values), gcutil can reuse arguments like --zone or --machine_type across multiple commands.

Perhaps the part that makes gcutil most unique, though, is its ability to perform each command in either synchronous or asynchronous mode. By default, gcutil waits for each command to complete before returning control. In asynchronous mode, however, gcutil returns a request id immediately after posting the request. This was a massive feature for me when testing a number of cluster node discovery strategies.
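The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the pattern, not gcutil's implementation: FakeComputeService, its methods and the "operation-N" ids are hypothetical stand-ins for a service that accepts a request, hands back an id and finishes the work later.

```python
import itertools
import time


class FakeComputeService:
    """Hypothetical in-memory stand-in for a cloud API; not the GCE API."""

    def __init__(self, ticks_to_finish=3):
        self._ids = itertools.count(1)
        self._remaining = {}  # operation id -> polls left until DONE
        self._ticks = ticks_to_finish

    def submit(self, instance_name):
        """Post a request and return an operation id right away."""
        op_id = f"operation-{next(self._ids)}"
        self._remaining[op_id] = self._ticks
        return op_id

    def status(self, op_id):
        """Each poll moves the fake operation one step closer to DONE."""
        if self._remaining[op_id] > 0:
            self._remaining[op_id] -= 1
        return "DONE" if self._remaining[op_id] == 0 else "RUNNING"


def add_instance_sync(service, name, poll_interval=0.0):
    """Synchronous mode: block until the operation reports DONE."""
    op_id = service.submit(name)
    while service.status(op_id) != "DONE":
        time.sleep(poll_interval)
    return op_id


def add_instances_async(service, names):
    """Asynchronous mode: post every request and return the ids immediately."""
    return [service.submit(name) for name in names]


def wait_all(service, op_ids, poll_interval=0.0):
    """Poll a batch of operations until all of them report DONE."""
    pending = set(op_ids)
    while pending:
        pending = {op for op in pending if service.status(op) != "DONE"}
        if pending:
            time.sleep(poll_interval)
```

In the asynchronous case, all node requests are in flight at once and the caller decides when, or whether, to wait for them, which is what makes it useful when bringing up a whole cluster.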

These features, combined with the ability to customize the result format of each command (JSON, CSV or table) and to return only the names of newly created resources, which allows piping the results of one command as input to another, make gcutil one of the best thought-through IaaS clients I've ever seen.

Speed & Flexibility

In my short experience, I found instance and disk (yes, not "volume") provisioning, as well as general instance startup on GCE, to be fast. My specific interest was the time it took to spin up, configure and terminate entire clusters of data nodes. In that specific use-case, GCE was faster than EC2, Azure or Rackspace.

The project metaphor, while somewhat awkward for me initially, quickly became a clear separation between distinct areas of work. Additionally, its integration with the advanced routing features allowed me to easily create gateway servers (aka VPN) to span clusters across local and GCE networks.

For me personally, perhaps the biggest feature was the metadata support. Besides the basic key/value pair tags, every GCE instance also supports metadata. Along with information defined by the service itself, like the instance host name, it can include user-defined data.
gcutil addinstance "node-${NODE_COUNT}" \
      --metadata="cluster:${CLUSTER_NAME}" \
      --metadata="index:${NODE_COUNT}"
      …
Instance configuration, as well as the configuration of other instances in the same project, is available in the form of a REST query against the provisioned metadata server. This metadata can also include project-level metadata.

The place where this capability really came in handy for me was node-level metadata. By simply defining a metadata value for a node index, I was able to have individual data nodes define their own unique cluster names (--metadata=name:node-0) as well as query the project-level data for the cluster name.

Custom metadata becomes especially useful when using startup scripts that execute during instance boot. Using gcutil, I was able to pass in a single local startup script using the --metadata_from_file flag and have it discover its variables from metadata parameters.
NODE_INDEX=$(curl http://metadata/computeMetadata/v1beta1/instance/attributes/index)
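The same lookup can be done from a startup script written in Python. The sketch below reuses the base URL from the curl call above; the helper names are mine, and the project-level path (project/attributes) is an assumption modeled on the instance-level one. The fetch parameter is there only so the function can be tried without a metadata server.

```python
from urllib.request import urlopen

# Base URL copied from the curl example above; the "metadata" host name
# only resolves from inside an instance.
METADATA_BASE = "http://metadata/computeMetadata/v1beta1"


def attribute_url(key, scope="instance"):
    """Build the URL of a custom attribute; scope is 'instance' or 'project'.

    The 'project' path is an assumption mirroring the instance path.
    """
    if scope not in ("instance", "project"):
        raise ValueError(f"unknown scope: {scope}")
    return f"{METADATA_BASE}/{scope}/attributes/{key}"


def read_attribute(key, scope="instance", fetch=None):
    """Return the attribute value as text.

    `fetch` is an injectable callable (url -> text) so the function can be
    exercised off-instance; by default it performs a real HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=5) as response:
                return response.read().decode("utf-8")
    return fetch(attribute_url(key, scope)).strip()
```

On an instance, read_attribute("index") would return the node index set at creation time, and read_attribute("cluster", scope="project") would read a value shared by every node in the project, assuming such a key was defined.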

Pricing

In my particular test cycles, I must have deployed close to 1000 individual instances across EC2 and GCE. Each one of these instances stayed up for a maximum of 15-20 minutes, just enough to run a set of tests on the new cluster. The part that makes GCE a lot more compelling for these kinds of use-cases is the granular pricing. Google prices its instances in one-minute increments with a 10-minute minimum, not hourly like EC2.
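To make the difference concrete, here is a small sketch that counts billed minutes under the two models described above. It deliberately ignores actual rates, which differ per provider and machine type; the function names are illustrative.

```python
import math


def billed_minutes_per_minute(runtime_minutes, minimum=10):
    """Per-minute billing with a minimum charge, as described for GCE."""
    return max(minimum, math.ceil(runtime_minutes))


def billed_minutes_hourly(runtime_minutes):
    """Hourly billing: every started hour is charged in full."""
    return 60 * max(1, math.ceil(runtime_minutes / 60))


if __name__ == "__main__":
    for runtime in (5, 20, 61):
        print(runtime,
              billed_minutes_per_minute(runtime),
              billed_minutes_hourly(runtime))
```

For a 20-minute test run, the hourly model bills 60 minutes against 20, so across 1000 short-lived instances the billed time differs by a factor of three before any difference in price per hour is considered.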

One area where GCE is perhaps not as flexible as I would like is in the area of billing. I do like the flexibility to charge individual projects to different credit cards, but would like to see a consolidated billing option there too. Also, this is the one area that is not supported by the API!

GCE seems like a fundamentally different type of IaaS, designed specifically with developers in mind. It is probably not much of a challenge to EC2 anytime soon, but over time, provided Google augments its list of service offerings, GCE’s focus on developers will pay off. Having experienced their tools first hand, I find it clear that these guys know how to run infrastructure at a massive scale without alienating developers.


2013-03-10

The Three Stages of Cloud Cost Management

No. We said three “stages”.
From a cost perspective, cloud services are divided into two distinct types – services with a monthly recurring fee like GitHub, and services based entirely on usage like AWS.
The need to manage the cost of the former is minimal. In most cases, change will only occur when you change your plan.
But, if you are leveraging the true elasticity of the cloud on any significant scale, you are most likely faced with challenges of cost management associated with the unpredictable usage patterns of your application.
In this post, we’ll explore the three basic stages of cost management for usage-based services: passive monitoring, efficiency optimization and operational decision-making...
You can read the rest of this post on Cloudability Blog, where it was originally posted on February 14, 2013.

2013-01-18

My new role as the VP of Engineering at Cloudability


I am excited to finally share that I am taking on a new role as the VP of Engineering at Cloudability.

Cloudability helps companies manage their cloud spending by analyzing costs and monitoring usage patterns across a broad variety of cloud providers and services, ranging from infrastructure providers like Amazon Web Services, Microsoft Azure and Rackspace to cloud services like Dropbox, Salesforce and GitHub.

Cloudability is based in Portland, Oregon, and is a graduate of the Portland Incubator Experiment. GigaOM has recently named Cloudability as one of the most promising cloud companies.

In my new role, I will be managing the rapidly growing engineering team and working on new innovative offerings. I look forward to once more being engaged with a local team and enjoying the vibrant startup scene of downtown Portland.

2013-01-11

Too many DBs. Where is the consolidation when you need one?

Just yesterday I read about PouchDB, which is related to both TouchDB and CouchDB. Today I see this subway map from 451 Research outlining the increasing complexities of the database landscape with all its types.


They even clarify that this was "designed to help businesses identify a shortlist of potentially suitable choices". Really?

I get the value of multiple data options, but this? This is getting silly. The funny thing is that last year I read Gartner's paper about the consolidation in this space, especially among the In-Memory Data Grids and No/New-SQL solutions. Well, can we please start that consolidation now? As one who tries to at least keep up to date on all these products and, whenever possible, go deep on at least a few leading solutions in each one of these categories, I sure am now ready for the StopBuildingAnotherDB... DB.


2012-12-13

Data locality the future but opportunity in data flow automation still massive

Earlier this week I published a post on the importance of “native” data services. These services deliver predictable, low-latency connectivity to your data regardless of the underlying infrastructure in which your application is deployed. After I published that post, however, my colleagues challenged me: while assuring locality of data is certainly important, and should be aspired to, the reality of the shifting landscape in today’s enterprise makes this a utopian notion.

As one who strives for pragmatism, I aim for a less unicorn-like approach to data provisioning. So, I admit, unless your organization starts with green-field applications, for the foreseeable future you will need data flow automation.

This is partially due to the increasing distribution of data. Today, if you go to any of the Fortune 5000 companies, you'll find multiple types of data engines. Obviously there are traditional RDBMS like Oracle or SQL Server, but you will also find key/value stores like Redis or Riak, and, almost certainly, one of those cool new document stores like MongoDB or Couchbase. And that’s not even considering the specialized solutions for graph, analytics or caching. All these engines store business critical data. They manage it across internal repositories in a wide variety of product-specific, sometimes proprietary, formats.

Besides the fact that this approach leads to multiple copies of the same data without any well-defined source of “truth”, driving any kind of value from such dispersed stores can be technically challenging. Those who figure out how to do it without the need for constant data copying across multiple storage architectures stand to gain market leadership and certainly reap financial benefits. Need proof? Consider AWS. Their new Data Pipeline Service, announced at the recent re:Invent conference, scales from a simple piece of business logic against a small dataset all the way to sophisticated batches executing against Elastic MapReduce services, RDS or even S3.

But perhaps a better approach would be to avoid moving data altogether, integrating multiple sources like Hadoop’s Distributed File System (HDFS) with some kind of relational database and overlaying them with a caching layer to enable federated queries across those repos. Such an approach would allow developers to leverage these sources through a variety of batch processes as well as highly optimized, low-latency transactional workloads enabled through an in-memory data layer. (See my recent post using HDFS for this very reason here)

So, in response to my colleagues, I still think data locality will play a massive role in the long-term adoption of PaaS. But, whether you think this transition is imminent or more gradual, if your organization does not have a scalable data storage strategy today that is capable of at least co-locating all of your data while at rest, you risk finding yourself in the midst of sprawling ETLs, endlessly chasing information across multiple storage platforms while being unable to drive even a fraction of the value of your own data.

2012-12-10

PaaS not just about runtime, data services are the next differentiator


In general, Platform as a Service (PaaS) is developed by developers for developers. Of course they’re going to love it. It enables them to focus on the nuances of their applications – not on the day-to-day pointless activities that so often take their time away from solving real problems. The non-developers point to the abstraction of underlying infrastructure and dynamic resource allocation as some of the core benefits of PaaS. In short, we often view PaaS as a runtime execution engine that trivializes the complex aspects of application development and deployment.

The problem with that kind of view, however, is that it focuses primarily on the run-time aspects of the platform. This may be a result of some vendors treating data services as an external concern, strapped onto the platform as an add-on, almost as an afterthought. Heroku, for example, provides only Postgres as their one “native” data service, while OpenShift does slightly better, adding MySQL and a community supported edition of MongoDB.

Everyone would agree that add-ons play an important role in the extensibility of any PaaS solution. I would argue, however, that as the “open” and “polyglot” aspects of PaaS become the de facto standard, a more holistic view of the entire application platform, including a diverse selection of native data services, is quickly becoming a major differentiator.

Today, for example, you would not choose PaaS without its support for most common development frameworks, or its ability to run unmodified in public cloud and in private data centers. The very same way, you should not choose a PaaS solution without an integrated, native and diversified data service support.

As many of you know, I work for VMware, which initiated the open source PaaS solution called Cloud Foundry. Right now, Cloud Foundry delivers the richest selection of native data services on the market, including MySQL, PostgreSQL, MongoDB, RabbitMQ and a couple of different versions of Redis. These services deliver predictable, low-latency connectivity to your data whether your application is deployed to the public instance of Cloud Foundry operated by VMware, an AWS instance operated by one of our ecosystem partners like AppFog, or a private instance running out of your own data center. Whichever Cloud Foundry instance your application targets, the data service provisioned by Cloud Foundry will behave exactly the same.

However, it would be naïve to expect all necessary data services to always be available natively. Just for these kinds of situations, Cloud Foundry provides an open source Service Broker (yes, service extending a service), which delivers the very same provisioning characteristics to external or legacy services, which are currently not offered by Cloud Foundry. The best part is that these services can be managed through the same API and benefit from the very same native integration into your application.

In short, if application mobility is important to you, please view data services as an intrinsic part of your PaaS strategy. Add-ons are great and certainly appropriate in many cases; just make sure they don’t become your gateway drug, locking your application to a specific provider.

2012-11-15

HDFS has won, now de facto standard for centralized data storage

The “high-priests” of Big Data have spoken. Hadoop Distributed File System (HDFS) is now the de facto standard platform for data storage. You may have heard this “heresy” uttered before. But, for me, it wasn’t until the recent Strata conference that I began to really understand how prevalent this opinion actually is. Perhaps even more important, how big of an impact this approach to data storage is beginning to have on the architecture of our systems.

Since the Strata conference, I’ve tried to reconcile this new role of HDFS with yet another major shift in system architecture: the increasing distinction between where data sleeps (as in where it is stored) and where data lives (as in where it is being used). Let me explain how one relates to the other, and why I actually now believe that HDFS is becoming the new, de facto standard for storing data.

HDFS Overview

HDFS is a fault-tolerant, distributed file system written entirely in Java. The core benefit of HDFS is in its ability to store large files across multiple machines; in distributed computing commonly referred to as “nodes”.

Because HDFS is designed for deployment on low-cost commodity hardware, it depends on software-based data partitioning to achieve its reliability. Traditional file systems would require the use of RAID to accomplish this same level of data durability, but, in HDFS’s case, it is done without dependency on the underlying hardware. HDFS divides large files into smaller individual blocks and distributes these blocks across multiple nodes.
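The block mechanics can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: the 64 MB block size and three copies per block are assumed values, the round-robin placement is a stand-in for the rack-aware policy a real cluster uses, and all function names are made up for the example. The point it shows is that durability comes from keeping several copies of each block on different nodes.

```python
import math


def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs covering the file; the last may be short."""
    count = math.ceil(file_size / block_size)
    return [
        (i * block_size, min(block_size, file_size - i * block_size))
        for i in range(count)
    ]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS placement is rack-aware; this is deliberately simplified.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )
```

With four nodes and three copies of each block, losing any single node still leaves every block readable, which is the property that RAID would otherwise have to provide at the hardware level.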


It is important to note that HDFS is not a general-purpose file system. It does not provide fast individual record lookups, and its file access speeds are pretty slow. However, despite these shortcomings, the appeal of HDFS as a free, reliable, centralized data repository capable of expanding with organizational needs is growing.

Benefitting from the growing popularity of Hadoop, where HDFS is used as the underlying data storage, HDFS is increasingly viewed as the answer to the prevalent need for data collocation. Many feel that centralized data enables organizations to derive the maximum value from individual data sets. Because of these characteristics, organizations are increasingly willing to ignore the performance shortcomings of HDFS as a “database” and use it purely as a data repository.

Before you discredit this approach, please consider the ongoing changes that are taking place in on-line application architectures, specifically the shift away from direct queries to the database and the increasing reliance on low-latency, high-speed data grids that are distributed, highly optimized, and most likely host the data in memory.

Shift in Data Access Patterns

Increasingly, the place where data is stored (database) is not the place where the application data is managed. The illustration that perhaps most accurately reflects this shift is comparing data storage to the place where data sleeps and data application to the place where data lives.

Source: VMware, Inc.
Building on this analogy, the place where data is stored does not need to be fast; it does however need to be reliable (fault-tolerant) and scalable (if I need more storage I just add more nodes).

This shift away from monolithic data stores is already visible in many of today’s Cloud-scale application architectures. Even putting aside the IO limitations of traditional databases and their obsessive focus on atomicity, consistency, isolation and durability (ACID), which leads to resource contention and subsequent locks, simply maintaining the speed of query execution as the data grows in these types of databases is physically impossible.

By contrast, new applications architected against in-memory data grids benefit from already “buffered” data, execute queries in parallel, and are able to asynchronously persist modifications to storage, so that these operations do not negatively impact their performance. This approach results in greater scalability of the overall application and delivers raw speed an order of magnitude greater than that of disk-based, traditional databases.
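The "persist asynchronously" part is usually called write-behind. Here is a toy sketch of the idea in Python, assuming a dict-like backing store; WriteBehindCache and its methods are illustrative names, not the API of any particular data grid, and a real grid would run the flush on a background thread or timer across many nodes.

```python
from collections import OrderedDict


class WriteBehindCache:
    """Toy in-memory store that persists modifications in deferred batches."""

    def __init__(self, backing_store):
        self._memory = {}
        self._dirty = OrderedDict()  # keys modified since the last flush
        self._store = backing_store  # any dict-like persistence layer

    def put(self, key, value):
        # Writes touch memory only; persistence is deferred to flush().
        self._memory[key] = value
        self._dirty[key] = value
        self._dirty.move_to_end(key)

    def get(self, key):
        # Serve from memory, falling back to the store on a miss.
        if key not in self._memory:
            self._memory[key] = self._store[key]
        return self._memory[key]

    def flush(self):
        """Persist pending modifications and return how many were written."""
        written = len(self._dirty)
        self._store.update(self._dirty)
        self._dirty.clear()
        return written
```

Because repeated writes to the same key collapse into a single entry before the flush, the slow store sees far fewer operations than the application performs, and the application itself never waits on it.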

It is important to realize that these in-memory data grids are not dependent on the persistence mechanism and can leverage traditional databases as well as next-generation data storage platforms like HDFS.

New Data Storage Architecture

As in-memory data grids become the backbone of next-generation on-line applications, their dependency on any specific data storage technology becomes less relevant. Overall, organizations want durable, scalable and low-cost data storage, and HDFS is increasingly becoming their consolidated data storage platform of choice.

As you can imagine, this is not an all-or-nothing situation. Whatever the specific workload is – write-intensive or demanding low-latency – HDFS can support these requirements with a variety of solutions. For example, an in-memory grid can be used for sub-second analytical processes of terabytes of data while persisting data to HDFS as a traditional data warehouse for back-office analytics.

Considering the relatively short life span of HDFS, its ecosystem often displays maturity. Solutions like Cloudera’s open source Impala can now run on the raw HDFS storage and expose it to on-line workloads through a familiar SQL interface without the overhead of MapReduce (as it is implemented by Hive).

The Kiji Project is another example of an open source framework building on top of HDFS to enable real-time data storage and service layer for applications. Impala and Kiji are just a few frameworks of what is likely to become a vibrant ecosystem.

Many organizations have already started to leverage HDFS’s capabilities for various, non Hadoop-related applications. At Strata, I attended a session called HDFS Integration, presented by Todd Lipcon from Cloudera and Sanjay Radia from Hortonworks. It was a great overview of the vibrant technological integrations of HDFS with tools like Sqoop, Flume, FUSE or WebHDFS…just to name a few.

HDFS also has a large set of native integration libraries in Java, C++, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk and many more. Additionally, HDFS has a powerful command-line and Web interface, as well as the Apache HBase project, which, when necessary, can run on top of HDFS and enable fast record-level access for large data sets.

Source: datagravity.org

Once the data is centrally located, there is a well-documented concept of Data Gravity originally created by Dave McCrory, which among many other things has the effect of attracting new applications and potentially resulting in further increase of the data quality and overall value to an organization.

I am not saying that all future data processing frameworks should be limited to HDFS. But, considering its prevalence in the Big Data market, low cost and scalability, combined with the vibrant ecosystem of libraries and projects, it may be wise for organizations to start considering HDFS as their future-proof data storage platform.



2012-11-07

Data-related investments shift from technology to individual skills – talent the new differentiator


Over the last decade, access to best-of-breed data technologies has become easier. This is due mainly to the increasing popularity of open source software (OSS). While this phenomenon holds true in other areas like operating systems, application servers, development frameworks or even monitoring tools, it is perhaps most prevalent in the area of data.

Not many people will argue with the fundamental role OSS plays in Big Data or NoSQL. These categories were virtually built on OSS. Just consider Hadoop, MongoDB and Redis as a small sample. What surprised me though is how these solutions have infiltrated the enterprise market and caused a drastic shift in the way organizations are now spending their money on data-related projects.
Funds allocation for data related projects
This shift away from technology and investment into skills is primarily due to two overarching technology trends:

  • Adoption of Cloud services, not software, as the predominant means of delivering applications
  • Prevalent lack of support for scale-out architecture in traditional data solutions and its dependency on proprietary hardware

Just consider this: today, whether I work for a small start-up or a large conglomerate, I have access to the same data technology that is used by the most popular companies in the world: Twitter, Facebook, Instagram, just to name a few. These new technologies have been designed from the ground up to meet the scale of on-line applications and are also available as open source. However, the issue is that many of these data solutions address one and only one specific problem. It is the skills, expertise and endless hours of experts that make these data products truly valuable to organizations. Unfortunately, in the hands of many untrained professionals, these powerful tools, while free, are useless!

There are already some high-profile examples of this shift to OSS-based data solutions in the enterprise. Disney, which recently announced their Data Management Platform, has built it on a “start-up-like” budget. This complex system uses Hadoop and NoSQL databases, coupled with some innovative API architecture that shields developers from product complexities and enables standard approaches to data management, regardless of its format.

Sears also announced last week that it was going “all-in on Hadoop”. They found their traditional (proprietary) data solution not flexible enough, and decided that to remain relevant, this old-school company had to adapt to new (Big Data) technologies. Sears is in fact so committed to this new direction that they are actually planning on re-selling their new Big Data solution as a service.

These two Fortune 100 companies are but a precursor to what we will see over the next five years as a result of this growing trend in today’s enterprise. While everyone may have access to free analytical solutions like Hadoop, open source key-value like Redis or document-based database like MongoDB, the skill and expertise needed to glue these products into differentiated solutions is hard to come by.

This is where I can see another shift that is currently impacting the popularity of OSS in the enterprise: organizations are increasingly willing to work with multiple, best-of-breed, vendors.
Willingness to work with multiple best-of-breed vendors
The traditional desire to have a single partner responsible for all issues appears to have been overridden by competitive advantages and the fear of vendor lock-in. The money saved by making these OSS choices now gives organizations the luxury of outsourcing to a variety of technology experts.

For example, when Disney needs support with a Hadoop cluster, they call Cloudera. When they have questions about a Solr or Cassandra implementation, they bring in DataStax. The interesting part of this trend is that it even applies to proprietary products that leverage OSS, as in the case of Sears leveraging Datameer’s expertise in analytical tools.

I’m sure there are many examples of where this kind of best-of-breed approach to data delivered less than optimal results. But, to remain competitive, enterprises have to build systems that scale and are able to quickly adapt to the ever-changing application demands. Right now, these systems seem to be built on top of data platforms based on open source. And, since the access to these technologies has been commoditized, it is the skill of personnel that’s becoming the true differentiator.

Unfortunately, in many cases, companies shy away from training their people in these solutions, fearing they will leave for greener pastures. That’s however a very shortsighted perspective. They really should fear what will happen if they don’t train them and they stay!

2012-10-25

Strata/Hadoop World 2012 - ETL, SQL, other immediate thoughts on Big Data


I don't want to bore you with Strata coverage; endless blogs have already done that. Instead, here are a few immediate thoughts I put together as I wait for my flight back home. I’d love to hear your perspective on these, your personal observations or thoughts on Strata and the Big Data space in general.

ETL’s no longer cool, specifically the “Extraction” and “Load” parts. It’s all about “Transformation” in real-time now. The notion of multiple copies of data, especially when schema-less data discovery is necessary, is no longer workable. This is primarily due to volumes, but also, increasingly, to data velocity. Many vendors have started addressing this challenge with in-memory solutions for instant transformation of raw data in HDFS into flexible/interactive sets. Perhaps the best example of this is Platfora (more about these guys below), although Cloudera, PayPal and MapR have presented solutions using a similar approach.

SQL is cool once more. The excitement I sensed from people when talking about Cloudera’s Impala is perhaps the best example of this point. In contrast to Hive, which has the feeling of an afterthought, the demos of SQL queries in Impala that I saw felt native, robust and just natural. This is most likely due to the fact that Impala does not use MapReduce, although I’m sure there is more to this than that. Cloudera’s real-time query management to support Impala was also impressive. The good thing is that Impala is query-compatible with Hive, and the JDBC/ODBC drivers will make the transition for many Cloudera customers fairly transparent. For more complex/custom implementations it may be worth waiting until UDF support is introduced in the next version.

Big Data still mainly for scientists. Having been at the SpringOne 2GX conference the previous week, I was looking for ways to leverage the data insight capabilities coming from Big Data analytics in on-line applications. Unfortunately, most of the demos I saw were centered on data scientist enablement: batch-based, without much support for multi-tenancy or process isolation. Again, a few companies talked about initiatives underway for closer on-line application integration.

Hadoop needs in-memory BI. In-memory indexes (based on dynamically created schema) appear to be the way forward in the ongoing struggle to leverage totally unstructured data. I saw a couple of interesting demos where robust data discovery tools combined with data domain knowledge enabled convincing results. Perhaps the most interesting of those was Platfora’s “Hadoop Data Refinery”. Putting aside the really slick HTML5-based UI that translates users’ requests to MapReduce code that is then dispatched to Hadoop, the really “cool” stuff happens underneath, where the results of these requests are loaded into an in-memory database engine called “Fractal Cache”. I could not get anyone there to tell me the underlying technology for these, but the resulting “lenses” make access to large datasets very responsive and user-friendly. This contrasts sharply with my experience with traditional tools relying on SQL and Hive, continuously re-scanning the underlying HDFS data.

Overall, Strata/Hadoop World was a testament to the increasing maturity of the Big Data space. I took a lot of notes and many of the presentations are already available on-line. I look forward to a more in depth post mortem next week.

2012-10-10

For Developers, Database Soon to go the Route of Garbage Collector


It used to be simple: there were only a few viable database providers. Most organizations made their RDBMS choice by selecting one of these providers, and they appointed a Database Administrator (DBA) to set and administer all its rules.

Sometime during the late 1990s, our approach to data started to shift. People began questioning if traditional databases were designed to perform the tasks we required of them. The one-size-fits-all approach of “Web-enabling” traditional databases led to some spectacular failures under irregular access patterns, increasing access concurrency and geographically dispersed demand.

Now, in some cases, people tried to fight this shift by vertically scaling their existing databases, acquiring increasingly more powerful machines to compensate for their inherent lack of scalability. All too soon, most realized there simply was no single powerful-enough machine to address their scalability challenges. Others, perhaps seeing the futility of this effort, were unwilling to pay the price for yet another round of hardware upgrades.

Today, the landscape looks very different. Literally hundreds of databases are available on the market, and new ones are added almost every day. Many of these have grown organically to address a very specific use-case (a pain point, as it were, where existing solutions were inadequate or too expensive). Many of the offerings (see the chart below) are based on Open Source Software (OSS), and almost all can run on commodity hardware.
Source: The 451 Group

Now, some could argue that this Wild West of data persistence will soon end due to consolidation. That’s how it always works, right?

In contrast to many other areas, database consolidation is not something we ought to expect anytime soon, at least not in a way with which we are familiar. Many of these OSS projects do not respond to traditional competitive pressures. They are managed and maintained by vibrant communities of volunteers -- not driven by market capitalization opportunities. In many cases these solutions address a very specific use-case, and can’t easily be replaced by a next-in-line competitor without losing some very specific features.

The question then becomes “Do application developers really care what technology is used to persist their data?” I would argue that the developers are very passionate about where their data lives (when it’s being used). They really do not care where that data sleeps (is stored). Increasingly, the place where the data lives is the memory, not storage.

It is for that very reason that I am convinced that databases will go the route of Garbage Collector. Let me explain. Unless you are still writing apps in C (and there are some really good reasons why you should sometimes do this), the notion of garbage collection as a means of memory management is something totally abstracted from you as a developer. Yes, some frameworks like Java and .NET do allow you to proactively trigger this process, but unless you go through some “special” steps, this is still only a request. Garbage Collector will perform that task whenever it needs to.

In the same way as GC, the database, and its unique complexities, will soon be abstracted from the developer. The developer will be interacting with a simple service already optimized with the most appropriate technology for that specific workload/data type combination.

You can already see this strategy being implemented in some forward-thinking organizations like Disney, who at recent Cassandra Summit, demonstrated their Data Management Platform (DMP).

Disney's DMP Platform
Source: Cassandra Summit
Disney's goals for DMP were to:

  • Hide operational complexity from their application developers
  • Abstract specific storage engines behind APIs – Focus on Semantics, not Technologies
  • Deliver a uniform security layer across all of their data stores
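What such an abstraction might look like from the developer's side can be sketched as a thin service that routes by data type and enforces one security check for every store. This is purely illustrative and not Disney's design: the class names, the token check and the in-memory engines are all hypothetical.

```python
class InMemoryEngine:
    """Stand-in for a real storage engine (key/value, document, and so on)."""

    def __init__(self, name):
        self.name = name
        self._rows = {}

    def write(self, key, value):
        self._rows[key] = value

    def read(self, key):
        return self._rows[key]


class DataService:
    """Routes requests by data type, so callers never name an engine."""

    def __init__(self, routes, allowed_tokens):
        self._routes = routes  # data type -> engine chosen by the platform team
        self._tokens = set(allowed_tokens)

    def _engine(self, token, data_type):
        # One security check in front of every store.
        if token not in self._tokens:
            raise PermissionError("invalid token")
        return self._routes[data_type]

    def save(self, token, data_type, key, value):
        self._engine(token, data_type).write(key, value)

    def load(self, token, data_type, key):
        return self._engine(token, data_type).read(key)
```

The application only ever says what kind of data it is saving; which engine holds it is a routing decision that can change without touching application code.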

Another inherent benefit of this kind of approach is that it creates a more centralized data catalog. The thinking here is that data has more value in a larger context and when its structure is known. Once the data access is streamlined, it is easier to start overlaying additional value-add services in run-time like analytics or notifications.

If you are not convinced this is the future, just consider the alternative: managing hundreds of different types of persistence frameworks, each with its own replication strategy, storage format and API. Which would you prefer?