Resolving Disassociated Processing of Real-Time and Historical Data in IoT

CIOReview - July 19, 2016 - By Konstantin Boudnik, Senior Director, Technology Solutions, EPAM

With the increasing pace of digital disruption, many enterprises today are focused on staying ahead with real-time data and real-time insights. Real-time analytics lets companies make proactive decisions, eliminate risks and gain a competitive advantage by reacting quickly to changing conditions, tapping into data that is always on. For example, a healthcare app could use real-time monitoring data to provide early warning signs and save patients' lives. But processing real-time data from a variety of sensors, mobile and remote devices, in combination with historical datasets that enrich the actionable insights, adds complexity to already sophisticated pipelines. There are a few ways to accommodate two data flows, which effectively require the same data to be processed twice, within a single pipeline.

The most advanced solution is to combine the batch, stream processing, and serving DB components into an in-memory data fabric. Such a system works transparently with stored and streamed data at the same time, in a transactional fashion, so the data fabric becomes the single source of truth. Alexandre Boudnik, a computer scientist who has worked on compilers, hardware emulators and testing tools for over 20 years, coined the term Iota architecture (Greek letter ι) for this design.

One example of this solution uses Apache Kafka, in combination with Apache Ignite, to serve messages and to process the streamed data together with data retained in secondary storage (Apache Cassandra, HDFS, or even a traditional RDBMS). Feature-rich in-memory data fabrics like Ignite provide exactly this kind of transactional, unified access to streamed and stored data, as sketched below.

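As a rough illustration, here is a minimal sketch of the ingest side of such a fabric, assuming a hypothetical "sensor-readings" Kafka topic, an Ignite cache named "readings" and default Ignite configuration; historical data persisted to Cassandra, HDFS or an RDBMS would be loaded into, or read through, the same cache, so consumers see one consistent view instead of reconciling two pipelines.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorIngest {
    public static void main(String[] args) {
        // Kafka consumer configuration (broker address and topic are hypothetical).
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "iot-fabric");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Start an Ignite node; the "readings" cache is the in-memory fabric that
        // both streamed events and data loaded from secondary storage end up in.
        Ignite ignite = Ignition.start();
        IgniteCache<String, Double> readings = ignite.getOrCreateCache("readings");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             IgniteDataStreamer<String, Double> streamer = ignite.dataStreamer("readings")) {
            streamer.allowOverwrite(true); // keep the latest reading per device
            consumer.subscribe(Collections.singletonList("sensor-readings"));

            while (true) {
                // Pull sensor events off Kafka and push them into the cache.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records)
                    streamer.addData(rec.key(), Double.parseDouble(rec.value()));
                streamer.flush();

                // Any consumer can now read live and historical values from one place,
                // e.g. readings.get("device-42"), instead of querying two pipelines.
            }
        }
    }
}
```
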
Another approach, adopted earlier, is known as the Lambda architecture (Greek Λ), where two intake layers handle the incoming data at different speeds and reconcile at the query point (commonly called the serving DB). While it offers better delivery SLAs, it is not free of interoperation impedance or of high operational, hardware and management costs. One particular issue is correct recovery after the failure of an intake layer. The recovery logic is frequently pushed into the client software, forcing it to become stateful and more complex as a result. Changing and deploying stateful code in a distributed system can be quite an intricate undertaking, especially when data must be reprocessed once the new code is provisioned and running. One possible optimization, sometimes dubbed the Kappa architecture (Greek κ), is to combine the batch and stream processing components into a single subsystem, which the serving DB then uses at query time. Some telecommunications companies use this kind of processing to capture sensor feeds and telemetry through Apache Kafka and pipe them into a streaming dataflow engine, such as Apache Flink, for analytics (sketched below).
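
A minimal sketch of that Kappa-style path, assuming a hypothetical "telemetry" topic carrying "deviceId,reading" strings and Flink's Kafka connector; the per-device windowed maximum stands in for whatever analytics a given operator actually runs:

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TelemetryAnalytics {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka source: telemetry captured from devices arrives as plain strings.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "telemetry-analytics");
        DataStream<String> telemetry = env.addSource(
            new FlinkKafkaConsumer<>("telemetry", new SimpleStringSchema(), props));

        telemetry
            // Parse "deviceId,reading" lines into (deviceId, reading) pairs.
            .map(new MapFunction<String, Tuple2<String, Double>>() {
                @Override
                public Tuple2<String, Double> map(String line) {
                    String[] parts = line.split(",");
                    return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
                }
            })
            // Group readings by device id.
            .keyBy(new KeySelector<Tuple2<String, Double>, String>() {
                @Override
                public String getKey(Tuple2<String, Double> event) {
                    return event.f0;
                }
            })
            // One-minute tumbling windows; take the peak reading per device and window.
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .max(1)
            // Stand-in for a real sink or serving layer.
            .print();

        env.execute("telemetry-analytics");
    }
}
```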

The more advanced solution, the Iota architecture, is the recommended one. The Iota design pattern has a number of unique properties: it eliminates the need for expertise spanning multiple programming models and platforms, reduces hardware needs, lowers data-center operational complexity, shortens the application development and deployment loop, and keeps the long-term cost of ownership low. This combination increases the data platform's ROI by shifting capital expenditure toward rapidly commoditizing computer systems and by allowing smaller development and cluster operation teams.
