Sunday, July 19, 2009

Dealing with Data During Cloudbursts

I enjoy reading Joe Weinman's posts. And today's post at GigaOM is no exception. He does a great job organizing the problem of data when considering the architecture of cloudbursting. Joe's post has prompted me to break my 'radio silence' of a few months.


In 4 1/2 Ways to Deal With Data During Cloudbursts, Joe calls out a number of architectures and discusses some of the considerations that impact the choice of a relevant scenario. I won't try to recreate the architectural strategies, but I was taken by how relevant these same constellations are when addressing some of the more advanced considerations of data clouds, peripatetic workloads and data governance.


1) Independent Clusters: This one is pretty straightforward, and Joe's characterization of "minimal communication and data-sharing requirements between the application instances running in the enterprise and cloud data centers" makes sense. The data-specific considerations in using cloud service resources tend mostly to center on providing the user with a uniform (or at least acceptable) standard of data security.


2) Remote Access to Consolidated Data: This strategy is called out for those situations in which application instances running in the cloud require access to a single-instance data store, or data store(s) which must for various reasons remain within the confines of the enterprise data center.


Notice my 'or' in the last sentence. Besides architectural requirements that require a single-instance data store, the reality of enterprise IT is that data stewardship requirements often require the authoritative datum to remain within the enterprise data center.


3) On-Demand Data Placement: Weinman points out that



...if I/O intensity and/or network latency are too high for remote access, then any needed data that isn’t already in the cloud must be placed there at the beginning of the cloudburst, and any changes must be consolidated in the enterprise store at the end of the cloudburst. The question is: “How much data needs to get where, and how quickly?”



This is clearly the right question to ask first. If a large data set must be in close proximity to the cloud service application instances, enterprise IT may need to rely on a number of tactics to reduce the delay in commencing cloud-based operation: high-bandwidth networking services, possibly made available on demand, and advanced WAN optimization technologies (e.g., data deduplication).
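To make the deduplication tactic concrete, here is a minimal sketch of the core idea: chunk the dataset, hash each chunk, and transfer only chunks the remote side has not already seen. This is my own illustrative toy, not how any particular WAN optimization product is implemented (real products use content-defined chunking, compression and protocol-level tricks as well):

```python
import hashlib

def dedup_chunks(data, chunk_size=4096, seen=None):
    """Split data into fixed-size chunks. Return the chunks whose hashes
    are not yet in 'seen' (the only ones that need to cross the WAN),
    plus the ordered manifest of hashes needed to reassemble the stream."""
    seen = set() if seen is None else seen
    to_send, manifest = [], []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)
        if digest not in seen:
            seen.add(digest)          # remote side now "knows" this chunk
            to_send.append((digest, chunk))
    return to_send, manifest
```

On a dataset with heavy internal repetition, or on a re-transfer after a cloudburst where the remote store already holds most chunks, `to_send` is a small fraction of the raw data, which is exactly why dedup helps shrink the window before cloud-based operation can commence.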


As with remote access to consolidated data, on-demand data placement may require additional measures for compliance and data stewardship, calling on the purveyors of fast file transfer or on-demand, adjustable data transport services to offer a form of 'in-flight' data mediation. Alternatively, the enterprise data center may be called on to implement dataset virtualization approaches or data masking systems in order to remain in compliance.


4) Pre-positioned Data Placement: He makes the point that pre-positioning "... adds additional costs as a full secondary storage environment and a metro or wide-area network must be deployed."


4.5) BC/DR Plus Cloudbursting: This was the point at which I chortled with recognition.


Thanks, Joe! I've been looking for the context in which to make this point for years. This has been a soapbox of mine for a long time ... almost since the notion of utility computing (now 'cloud computing') started circulating as a meme.


In addition to using cloudbursting as the premise on which to incorporate business continuity and disaster recovery costs into the calculation, I'd like to throw in at least one more, in hopes of getting this to 4 3/4 ways to deal with data + cloudbursts. Please bear with me... this is work in progress.


Data Governance, Data Stewardship and Data Residency:


Many of the issues relating to data in conjunction with cloudbursting are not new. When you stop to think about it, the 4 1/2 ways that Weinman outlines are variants of a generic data sharing problem across organizational boundaries. If we add any form of data sharing to the real cost of the enterprise data center, the issue we must address is that of Data Stewardship. It's been defined in various places, but here's one of my favorites since it places it in context with Data Governance.



Data Governance: The execution and enforcement of authority over the management of data assets and the performance of data functions.

Data Stewardship: The formalization of accountability for the management of data resources.



Data governance in the enterprise data center may require a 'complete' record to always be under the stewardship of the enterprise, and never at risk of being located in a different legal jurisdiction (e.g., the details of a financial transaction must remain in the immediate and direct control of the responsible financial institution). Examples abound, but one can point to financial and personal information which must, for compliance reasons, never leave the geographical borders of a country with stringent data protection regulation (e.g., not in that cloud-resident datastore in India or Switzerland).


In these cases, the implications of cloudbursting for data may require the addition of data masking/data obfuscation, or applications which are demonstrably proven to operate on meta-data of other kinds without jeopardizing data stewardship compliance. This particular aspect of Data Stewardship is sometimes called the Data Residency Dilemma.
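One way to picture the residency constraint is as a policy gate that every cloudburst placement decision must pass. Here is a minimal sketch; the dataset classes and jurisdiction codes below are purely illustrative assumptions on my part, not drawn from any real regulation or product:

```python
# Hypothetical governance policy: which jurisdictions each dataset
# class may reside in. Classes and region codes are made up for
# illustration only.
ALLOWED_JURISDICTIONS = {
    "customer_pii": {"DE"},           # must never leave the home country
    "telemetry":    {"DE", "US", "IN"},  # freer to roam
}

def may_burst(dataset_class, target_region):
    """Return True only if governance policy permits this dataset class
    to reside in the target jurisdiction; unknown classes are denied
    by default (fail closed)."""
    return target_region in ALLOWED_JURISDICTIONS.get(dataset_class, set())
```

The interesting design choice is the fail-closed default: a dataset class the policy has never heard of cannot be placed anywhere, which is the conservative posture data stewardship usually demands.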


Getting to 4.75 - Data Governance Plus Cloudbursting: Even if the enterprise takes responsibility for data mirroring or replication to provide Business Continuity / Disaster Recovery, it may still be constrained from using Data + Cloudbursting by the costs and constraints of data governance. The question then arises: are there services / technologies that can be provided by the *aaS supplier which can be brought to bear? To me, this appears to be a question of data center pragmatics rather than strictly an issue of recalculating the breakeven point.


There are many technologies for data sharing, some of which come into play for Data + Cloudbursting. When the solution requires extending the 'boundaries' of the enterprise in both the application and data domains (as we do with cloudbursting), the first question has usually been constructed as: Should the shared data reside inside or outside the firewall?


Elastic perimeter technologies: For cloudbursting with data 'leaving the building', elastic virtual private networks (such as CohesiveFT's VPN-Cubed, particularly their Data Center to EC2 version) address the underlying, network-oriented issues of wandering data.


Data masking & obfuscation: Conventional encryption of "data at rest" does not satisfy the safety requirements of most enterprises when data is placed outside the corporate data center. Because the data must be decrypted when "in use" by a cloud-resident application image, conventional disk or file encryption does not protect against a compromise of, or misuse by, the systems processing the data. Instead, data masking or obfuscation can suitably transform the sensitive portions (i.e., fields) while preserving the integrity of the source data that the application requires.
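As a rough illustration of field-level masking, here is a sketch using deterministic keyed hashing. The key name and field choices are my own assumptions, not any vendor's scheme. Because equal inputs yield equal tokens, joins and groupings still work in the cloud-resident application, but the tokens cannot be reversed; re-identification is possible only inside the enterprise, which keeps the key and any token-to-value mapping:

```python
import hashlib
import hmac

# Illustrative key; in practice this would be managed inside the
# enterprise data center and never shipped to the cloud.
SECRET_KEY = b"enterprise-resident-key"

def mask_field(value):
    """Deterministically tokenize a sensitive field: same input, same
    token (so equality joins survive), but the original value is not
    recoverable from the token alone."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_record(record, sensitive_fields):
    """Mask only the named fields; everything else passes through
    untouched for use by the cloud-resident application."""
    return {k: mask_field(v) if k in sensitive_fields else v
            for k, v in record.items()}
```

The trade-off is the usual one for deterministic masking: referential integrity is preserved across records, at the cost of revealing equality patterns to the cloud side.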


Meta-data & data virtualization: We're now starting to see, usually in conjunction with specific SaaS offers, data 'proxy' servers and other means that allow the enterprise to retain specific data elements 'locally resident' within the data center rather than residing 'in the clear' within a data cloud. What we can expect to see within the next year are solutions that provide this type of offer associated with Master Data Management technologies, or enhanced data-in-motion services provided by cloud service providers at all levels -- IaaS, PaaS and SaaS. The most immediate utility of these offers will be for enterprises wishing to make real use of cloudbursting.
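The 'proxy' pattern above can be sketched in a few lines: sensitive values stay in an enterprise-resident vault, the cloud sees only opaque tokens, and real values are re-substituted when results flow back inside the data center. This is a toy of my own construction to illustrate the idea, not the design of any of the products alluded to:

```python
import uuid

class DataProxy:
    """Keeps sensitive values locally resident in an enterprise-held
    vault and hands the cloud application opaque tokens; values are
    re-substituted when results return to the data center."""

    def __init__(self):
        self._vault = {}  # token -> original value, never leaves the enterprise

    def outbound(self, record, sensitive_fields):
        """Replace sensitive fields with fresh tokens before the record
        crosses the perimeter."""
        out = dict(record)
        for field in sensitive_fields:
            if field in out:
                token = "tok-" + uuid.uuid4().hex
                self._vault[token] = out[field]
                out[field] = token
        return out

    def inbound(self, record):
        """Swap tokens back to real values as results re-enter the
        data center; non-token values pass through unchanged."""
        return {k: self._vault.get(v, v) if isinstance(v, str) else v
                for k, v in record.items()}
```

Unlike the deterministic masking sketch, the per-use random tokens here reveal nothing at all to the cloud side (not even equality of values), at the price of requiring the round trip back through the proxy for any re-identification.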


---


Joe Weinman broadened the definition of the real costs of an enterprise data center and has shown clearly how cloudbursting + pre-positioned data can contribute to addressing the BC/DR costs. Like BC/DR, the enterprise data center has to consider data governance in the context of interorganizational data sharing. Cloudbursting is just one form of data sharing, and presents the innovative cloud service provider with an opportunity to provide generic solutions to data sharing governance for the enterprise.


Truth in advertising: In two of my recent entrepreneurial adventures (Univa and Safe Data Sharing) as well as two for whom I've acted as an advisor (Perspecsys and Replicus), the problems of data stewardship and anticipatory data transport (e.g. moving/replicating the dataset well in advance) all come into play.
