This article explains the new challenges in enterprise integration caused by the advent of big data, and presents some approaches to overcome them.
Big data is the latest buzzword in the industry. At the basic level, big data is just that: large amounts of data that cannot be handled by conventional systems. As hardware capacity increases, our definition of a conventional system changes, and so does the threshold of big data. Big data is therefore nothing new; only the threshold has moved. Today, the threshold for big data may be terabytes (10^12 bytes). Soon, it will be petabytes (10^15 bytes). Twenty years ago, very few systems could process gigabytes (10^9 bytes) of data in an acceptable timeframe, so gigabytes would have been the lower threshold of big data at that time.
Integration is needed when we have to connect two or more software systems together. Large enterprises will have tens or hundreds of systems to connect together. Since the number of systems involved in integration could be large, the amount of data that flows across these connections will also be large.
How does big data impact enterprise integration?
With the advent of big data processors, more and more organisations recognise the value of big data analytics, and the need to process big data within enterprises is increasing. This will lead to large amounts of data (a subset of the big data captured by the organisation) moving across integration middleware, the software that connects two or more software systems. Such volumes would overload existing integration middleware, which was designed to handle lower volumes of data. This is depicted in Figure 1.
Let us analyse Figure 1 in detail. Big data itself is not what is really interesting. Who needs a mountain of data anyway? What we want are the results of processing this data. Depending on the algorithm used to process the data, we get different results. This is where it gets interesting.
Let us suppose that an enterprise implements a big data solution. The big data processing (refer to the big data processor in the figure) solution could be a shared service across the enterprise. Due to the technical complexity and cost involved in building a big data processor, it is not possible for each division in the organisation to have its own. But due to the business value accruing from big data analysis, sooner or later, different divisions in the enterprise will want their own processing on the big data set. This can be done by moving a subset of the big data, relevant to that division, over to its systems. This is the point at which the integration systems in the enterprise will feel the impact. To enable different divisions in the organisation to “do their own stuff” with data, a subset of the big data will start moving across the integration middleware.
Overcoming the big data challenge
So how can we solve the new challenge caused by subsets of big data moving across the integration middleware? Let us look at three options:
- Buy more hardware: This means scaling vertically (moving to bigger servers) or horizontally (adding more servers). This approach may work with smaller challenges, but not with big data, since the amount of new hardware required to support the load will make the idea financially unviable.
- Buy specialised big data solutions: These solutions can be purchased and given to each division that needs a big data processor. There are very few big data processors in the market now, but we can expect more soon. This will be cheaper than buying more hardware, and can be considered, if the organisation has enough funds. Note that this approach works by avoiding the problem altogether: we avoid moving any data across the integration middleware. On the flip side, this approach will result in multiple copies of big data in the enterprise.
- Extend the middleware with data grids: This option involves extending our middleware’s capabilities by using data grids or distributed caching platforms. It attempts to overcome the big data challenge by increasing the integration middleware’s available memory, and by introducing an asynchronous link in the integration middleware’s data-persistence mechanism.
How do these techniques help us overcome the challenge? Answering that question requires a deeper understanding of the root cause of the underlying problem. A middleware solution fails at high loads due to the following issues:
- Memory overload, caused by data, threads or sockets
- Lack of system resources like threads, sockets and swap space
A distributed cache helps with the first issue, by increasing the available memory. For example, if you have ten servers with 10 GB of RAM each, a distributed cache can help you pool all that RAM and use it like local RAM, effectively giving you 100 GB. It helps with the second issue by avoiding the need for a large number of threads, or a large swap space. This is accomplished by intelligent persistence mechanisms, like write-behind caching.
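The pooling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PartitionedCache`, the per-node capacity counted in entries rather than bytes, and the hash-based placement are all hypothetical and do not reflect any particular product. Real data grids also keep backup copies and rebalance partitions, so the usable capacity is typically lower than the simple sum.

```python
import hashlib


class PartitionedCache:
    """Toy distributed cache: each 'node' is a dict with a fixed entry capacity."""

    def __init__(self, node_count, capacity_per_node):
        self.capacity_per_node = capacity_per_node
        # In a real grid each node would be a separate server process.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash ensures the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        node = self._node_for(key)
        if key not in node and len(node) >= self.capacity_per_node:
            raise MemoryError("target node is full")
        node[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

    def total_capacity(self):
        # Ten nodes of capacity 10 behave like one cache of capacity 100.
        return self.capacity_per_node * len(self.nodes)


cache = PartitionedCache(node_count=10, capacity_per_node=10)
cache.put("call-42", {"duration_s": 61})
print(cache.total_capacity())  # 100
print(cache.get("call-42"))
```

The caller sees a single `put`/`get` interface and never needs to know which node holds a key, which is what lets the combined memory be used like local RAM.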
In a write-behind cache, data that needs to be persisted is written asynchronously: the write request is accepted, and the write function returns immediately. The persistence mechanism then writes the data to the file or database in the background. This frees up the persistence threads of the middleware, increasing the scalability of the overall solution.
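A minimal sketch of this write-behind behaviour, using only the Python standard library, might look as follows. The names `WriteBehindCache`, `slow_write` and `backing_store` are hypothetical, and a plain dictionary stands in for the file or database. Production data grids add batching, retries and ordering guarantees that are omitted here.

```python
import queue
import threading
import time


class WriteBehindCache:
    """Accepts writes immediately and persists them on a background thread."""

    def __init__(self, store_write):
        self._data = {}
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        self._store_write = store_write  # hypothetical slow persistence function
        self.failed = []                 # keys whose persistence raised an error
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
        # Queue the write and return at once; the caller is not blocked.
        self._pending.put((key, value))

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def _drain(self):
        # Background loop: persist queued writes one at a time.
        while True:
            key, value = self._pending.get()
            try:
                self._store_write(key, value)
            except Exception:
                self.failed.append(key)
            finally:
                self._pending.task_done()

    def flush(self):
        # Block until every queued write has been processed.
        self._pending.join()


backing_store = {}  # stands in for a file or database


def slow_write(key, value):
    time.sleep(0.01)  # simulate slow disk or database latency
    backing_store[key] = value


cache = WriteBehindCache(slow_write)
for i in range(5):
    cache.put(f"cdr-{i}", {"seq": i})  # each call returns immediately
cache.flush()
print(len(backing_store))  # 5
```

The trade-off of this design is that data accepted by `put` but not yet persisted can be lost if the process fails, which is one reason real products replicate pending writes across nodes.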
A telecom use-case
Let us look at how the third option from the previous section, of using data grids, can be implemented. The use-case here is from a telecom scenario, and is depicted in Figure 2.
Figure 2 has conceptual similarity with Figure 1. The Network Switch is the data source here. The big data processor maps to the Mediation solution here. The Analytics Application is similar to the Data Warehouse. The Event Processor system is similar to the Fraud Management application. Let us understand the data flow in this figure.
This is the data flow for a cell phone services provider (a telecom company, or telco, as per industry parlance). Whenever a phone call is made, records called Call Detail Records are generated by the telco’s hardware, and the records get collated at the Network Switch. The Mediation system then processes these records. It performs validation, filtering, etc., and passes the records to the three systems it connects to: Fraud Management, Billing and Data Warehouse. The Billing system needs to connect to many other systems, such as CRM, Inventory and Fraud Management. For some data flows, like CRM to Billing, the volume is so high that we have to provide a direct connection between Billing and CRM. This applies to only a few use-cases (around 5 per cent result in such high volumes).
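The Mediation step described above can be sketched as a simple validate, filter and fan-out loop. Everything specific in this sketch is an assumption made for illustration: the `CallDetailRecord` fields, the validation rule, and the choice to filter out zero-duration calls are hypothetical, and plain lists stand in for the Fraud Management, Billing and Data Warehouse systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallDetailRecord:
    # Hypothetical fields; real Call Detail Records carry many more.
    caller: str
    callee: str
    duration_s: int


def is_valid(record):
    # Validation (assumed rule): both parties present, non-negative duration.
    return bool(record.caller) and bool(record.callee) and record.duration_s >= 0


def mediate(records, sinks):
    """Validate and filter records, then pass each survivor to every sink."""
    delivered = 0
    for record in records:
        if not is_valid(record):
            continue
        if record.duration_s == 0:
            continue  # filtering (assumed rule): drop unanswered calls
        for sink in sinks.values():
            sink.append(record)
        delivered += 1
    return delivered


# Lists stand in for the three downstream systems.
sinks = {"fraud_management": [], "billing": [], "data_warehouse": []}
raw_records = [
    CallDetailRecord("555-0100", "555-0101", 61),
    CallDetailRecord("", "555-0102", 30),         # invalid: no caller
    CallDetailRecord("555-0103", "555-0104", 0),  # filtered: zero duration
]
delivered = mediate(raw_records, sinks)
print(delivered)  # 1
```

Because every accepted record is delivered to all three systems, the outbound volume is a multiple of the inbound volume, which hints at why the connections downstream of Mediation come under load.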
Normal middleware, even with clustering and load balancing, cannot handle this. This is where the Middleware Infrastructure component comes into the architecture (see Figure 2). It is a separate product that provides features like local and distributed caching, load balancing, failover and recovery, with much higher scalability than that provided by standard middleware products. Some middleware infrastructure products come in the form of data grids, which support scaling to hundreds of nodes. Examples of such products are Oracle Coherence, JBoss Infinispan, IBM WebSphere eXtreme Scale and Terracotta BigMemory.
The future of big data and integration
Any solution in the technology space starts out in a niche area, and as it becomes mainstream, it gets more and more commoditised. We can expect big data solutions to follow this path in the near future with:
- The arrival of big data processing appliances
- Support for big data in cloud platforms
- Cloud-based integration platforms that are pre-packaged with middleware infrastructure
Hopefully, this article gives you a good overview of the challenges posed to integration solutions by the advent of big data in enterprises. The example discussed, which is a use-case from the telecom domain, is generic enough to be applicable to other domains. The key value that open source brings to such solutions is that we can scale out our solution with much lower financial implications, compared to commercial solutions.