Complex Data Processing Gets Simpler With Cascading

Dmitry Orekhov

Solutions Architect, EPAM Belarus

Big data processing has made incredible strides in recent years, and it would be hard to overstate the role of the MapReduce programming model in this progress. However, MapReduce, while powerful, is still almost universally regarded as complicated and difficult to use, even for professional software engineers. Indeed, developing useful applications on Hadoop with pure MapReduce isn't a trivial task. The Cascading framework, which is built on top of Hadoop MapReduce, can simplify this process significantly.

 

What’s Wrong With MapReduce?

Paradigm

MapReduce requires a programmer to think in terms of "map" and "reduce," an unintuitive programming model. It is much easier to develop complex applications if you work with a model that more easily maps to your problem domain.
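To make the "map" and "reduce" way of thinking concrete, here is a minimal in-memory sketch of a word count split into map, shuffle and reduce phases. It uses only plain Java collections and none of the Hadoop classes; the class and method names (MapReduceSketch, map, shuffle, reduce) are illustrative assumptions, not part of any framework.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the map, shuffle and reduce phases; no Hadoop classes are used.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: wires the three phases together for a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or not", "to be")));
    }
}
```

Even in this toy form, the problem ("count the words") has to be reshaped into key/value pairs before it can be solved, which is the mental overhead described above.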

Verbosity and development time

Programming a simple MapReduce job in plain Java requires writing many lines of boilerplate code (e.g., a trivial Word Count requires about 50 lines). There is plenty of Hadoop intrusiveness (Context, Writables, Exceptions, etc.) and low-level glue code. Rich data types and common operations (such as joins, projections and filters) are tedious to implement and, predictably, require custom code plus a considerable amount of development time, code review and testing.

Optimization and performance

Imagine a complex data flow that results in dozens of MapReduce jobs, each joining, filtering and aggregating data in a different way. Because of Hadoop’s distributed nature and the complexity of the MapReduce paradigm, it is hard to decide how to tune the execution of such a flow for the best performance. This task requires a huge amount of effort along with an in-depth understanding of Hadoop internals.

In addition, complex multi-stage jobs can be difficult to manage and maintain, and version mismatches may occur during deployment. Debugging, testability and IDE support are also problematic.

 

You Can Do It Simpler With…

...Pig and Hive

Because of all these issues, several application frameworks try to ease the complexity of writing MapReduce jobs by providing high-level abstractions over MapReduce. They usually provide a set of tools that users may be more familiar with. For example, Hive provides an SQL-like language to operate on your data; it was designed to appeal to a community already familiar with SQL and relational databases. Pig provides a procedural scripting language (Pig Latin) for expressing data flows, and a runtime environment where Pig Latin scripts are executed.

...Cascading

Cascading is a thin Java library for defining complex data flows on top of Hadoop and API-compatible distributions. It provides a rich query API, a query planner and a job scheduler, and it abstracts away much of the complexity of Hadoop. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files, just like other native Hadoop applications. Cascading lets the developer quickly assemble complex distributed data-processing applications and efficiently schedule them based on their dependencies without having to "think" in MapReduce. Cascading operates on top of MRv1 and MRv2 (YARN).

Cascading uses a tuple-centric data model (just like Pig and Hive) that works best when your input data can be represented using a named collection of scalar values, much like the rows of a database table. It allows you to think about your data processing workflow in terms of operations on fields, without having to worry about how to transpose this view of the world onto the key/value model of the MapReduce paradigm.
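As a rough illustration of the tuple-centric view, the sketch below models a record as a set of named fields and shows a projection by field name. NamedTuple and its methods are hypothetical stand-ins written for this article; they are not the Cascading Tuple or Fields classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tuple with named fields; not the Cascading Tuple/Fields API.
final class NamedTuple {
    private final List<String> fieldNames;
    private final List<Object> values;

    NamedTuple(List<String> fieldNames, List<Object> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        this.fieldNames = new ArrayList<>(fieldNames);
        this.values = new ArrayList<>(values);
    }

    // Look a value up by field name rather than by its position in a key/value pair.
    Object get(String field) {
        int index = fieldNames.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }

    // Projection: keep only the selected fields, in the requested order.
    NamedTuple select(String... fields) {
        List<Object> selected = new ArrayList<>();
        for (String field : fields) {
            selected.add(get(field));
        }
        return new NamedTuple(Arrays.asList(fields), selected);
    }

    List<String> fields() {
        return fieldNames;
    }

    @Override
    public String toString() {
        return fieldNames + " -> " + values;
    }

    public static void main(String[] args) {
        NamedTuple visit = new NamedTuple(
                List.of("user", "url", "durationMs"),
                List.of("alice", "/home", 1200));
        // prints [user, durationMs] -> [alice, 1200]
        System.out.println(visit.select("user", "durationMs"));
    }
}
```

The point of the model is that the code talks about "user" and "durationMs" directly, with no decision about which value is the key and which is the value.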

Cascading and Hadoop

Cascading, Hive, and Pig were developed in parallel and sometimes perform the same actions. There are a lot of questions regarding how one framework differs from the others, as well as where and which framework to use. However, a detailed point-by-point comparison of Hadoop frameworks goes beyond the scope of this article. Instead, let’s take a closer look at why we have chosen Cascading.

Hive and Pig were built to make MapReduce accessible to data analysts with limited experience in programming. Both are really great tools for ad-hoc data analysis and quick exploration of data. Both let you process data in parallel with simple commands and manipulate it easily at scale. At the same time, Hive and Pig have some shortcomings. First, neither HiveQL nor Pig Latin is Turing complete; they lack the control-flow and modularity features that are present in general-purpose programming languages, including functions, modules, loops, and branches. Because they lack code separation and sharing functionality for complex flows, you need to embed them in external procedural code. Second, in Hive or Pig anything complicated still requires User Defined Functions (UDFs) in Java. UDFs, while usually not too complex per se, still require a separate code base and, therefore, you need to maintain code in two separate languages. In addition, it would be nice to have some type checking to find errors at compile time rather than at job submission time (or even 3 hours after job submission time).

Guided by these considerations when choosing a MapReduce framework, we found that Cascading suits us best. As a Java API, Cascading is primarily suitable for developers. It allows you to build rich data analytics and data management applications, reusable frameworks and libraries, and to write unit tests, in any JVM-based language. For example, if you consider Java too verbose, you can use Scalding (a Scala library built on top of Cascading) to write more concise and clearer code. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files that you bundle with your job. Any additional operation can be implemented as a plain Java function.

 

Cascading Framework

Imagine a Stream of Fluent Data vs. Key/Value pairs

The Cascading processing model is based on a metaphor of pipes (data streams), plumbing (pipe assemblies) and filters (data operations). Pipes are created independently of the data they will process. A simple chain of pipes without forks or merges is called a branch; an interconnected set of pipe branches is called a pipe assembly. Each pipe assembly has a head and a tail, which are bound to a source tap (input data) and a sink tap (output data), respectively. Binding taps to pipes creates a flow. Any unconnected pipes or taps will cause the planner to throw exceptions.

Cascading represents all data as tuples. A tuple is a record of data, and a pipe represents a tuple stream through which tuples flow, so that operations can be performed on that stream. Tuples are composed of fields, much like a database record. Every input or output file has field names associated with it, so that field names can be used both as declarators and as selectors. Every processing element of the pipe assembly either expects the specified fields or creates them.

Data Stream transformation

As data moves through the pipe, streams may be separated or combined.

  • Split takes a single stream and sends it down multiple paths - that is, it feeds a single Pipe instance into two or more subsequent separate Pipe instances with unique branch names.
  • Merge combines two or more streams that have identical fields into a single stream. This is performed by passing two or more Pipe instances to a Merge or GroupBy pipe.
  • Join combines data from two or more streams that have different fields, based on common field values (analogous to a SQL join).
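The three stream shapes listed above can be modelled in a few lines of plain Java. In this sketch a tuple is simply a map from field name to value, and branch, merge and join are hypothetical helper names chosen for illustration; the real Cascading pipes are planned and executed on a cluster, whereas this runs in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// In-memory model of split, merge and join; a tuple is a field-name -> value map.
final class StreamShapes {

    // Helper: build a tuple from alternating field names and values.
    static Map<String, Object> tuple(Object... namesAndValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            t.put((String) namesAndValues[i], namesAndValues[i + 1]);
        }
        return t;
    }

    // Split: one branch of the same input stream, keeping only the matching tuples.
    static List<Map<String, Object>> branch(List<Map<String, Object>> stream,
                                            Predicate<Map<String, Object>> keep) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> t : stream) {
            if (keep.test(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Merge: concatenate two streams that must have identical fields.
    static List<Map<String, Object>> merge(List<Map<String, Object>> first,
                                           List<Map<String, Object>> second) {
        if (!first.isEmpty() && !second.isEmpty()
                && !first.get(0).keySet().equals(second.get(0).keySet())) {
            throw new IllegalArgumentException("merged streams must have identical fields");
        }
        List<Map<String, Object>> out = new ArrayList<>(first);
        out.addAll(second);
        return out;
    }

    // Join: inner join of two streams with different fields on a common field value.
    static List<Map<String, Object>> join(List<Map<String, Object>> left,
                                          List<Map<String, Object>> right, String field) {
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> t : right) {
            index.computeIfAbsent(t.get(field), k -> new ArrayList<>()).add(t);
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            for (Map<String, Object> r : index.getOrDefault(l.get(field), List.of())) {
                Map<String, Object> joined = new LinkedHashMap<>(l);
                joined.putAll(r);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> orders = List.of(
                tuple("userId", 1, "amount", 30),
                tuple("userId", 2, "amount", 250),
                tuple("userId", 1, "amount", 120));
        List<Map<String, Object>> users = List.of(
                tuple("userId", 1, "name", "alice"),
                tuple("userId", 2, "name", "bob"));

        List<Map<String, Object>> small = branch(orders, t -> (Integer) t.get("amount") < 100);
        List<Map<String, Object>> large = branch(orders, t -> (Integer) t.get("amount") >= 100);
        System.out.println(merge(small, large).size()); // prints 3
        System.out.println(join(large, users, "userId"));
    }
}
```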

The manner of Data Stream transformation depends on the type of Pipe.

There are six Pipe types, defined as subclasses of Pipe, for operating on tuple streams as they pass through the assemblies: Each, Merge, GroupBy, Every, CoGroup and HashJoin. Pipes may operate on individual tuples (e.g., transform or filter), on groups of related tuples (e.g., count or subtotal), or on entire streams. The following is a short overview of the pipes available in Cascading.

  • Each pipes perform operations based on the data content of individual tuples - applying functions or filters, such as conditionally replacing certain field values, removing tuples that have values outside a target range, etc. You may use Each pipes to split or branch a data stream.
  • Merge can be used to combine two or more streams into one, as long as they have the same fields. Merge emits a single stream of tuples (in random order) that contains all the tuples from all the specified input streams.
  • GroupBy groups the tuples of a stream based on the common values in a specified field. The purpose of grouping is typically to prepare a stream for processing by the Every pipe which performs aggregator and buffer operations on the groups, such as counting, totaling, or averaging values within that group.
  • Every pipe operates on a tuple stream that has been grouped. Thus, the Every class is only for use on the output of GroupBy or CoGroup, and cannot be used with the output of Each, Merge, or HashJoin.
  • CoGroup and HashJoin each perform a join on two or more streams, similar to a SQL join. The resulting output stream contains fields from all the input streams; CoGroup additionally groups that stream on the value of a specified field. The difference between them is that CoGroup is a reduce-side join, while HashJoin is a map-side join that loads tuples from the right-side pipe(s) into memory. Thus, HashJoin has some constraints regarding data size, but it is optimized to join small streams to no more than one large stream.
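The "group first, then aggregate each group" relationship between GroupBy and Every can be sketched in plain Java as follows. This is only an in-memory model under the assumption that a tuple is a map of field names to values; groupBy and every are hypothetical method names and do not reproduce the Cascading classes of the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of "group, then aggregate each group"; not the Cascading GroupBy/Every classes.
final class GroupThenAggregate {

    // GroupBy step: collect tuples that share the same value in the grouping field.
    static Map<Object, List<Map<String, Object>>> groupBy(List<Map<String, Object>> stream,
                                                          String field) {
        Map<Object, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> t : stream) {
            groups.computeIfAbsent(t.get(field), k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    // Every step: apply one aggregator to each group and emit one result per group.
    static <R> Map<Object, R> every(Map<Object, List<Map<String, Object>>> groups,
                                    Function<List<Map<String, Object>>, R> aggregator) {
        Map<Object, R> results = new LinkedHashMap<>();
        for (Map.Entry<Object, List<Map<String, Object>>> group : groups.entrySet()) {
            results.put(group.getKey(), aggregator.apply(group.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> sales = List.of(
                Map.of("region", "EU", "amount", 10),
                Map.of("region", "US", "amount", 7),
                Map.of("region", "EU", "amount", 5));

        Map<Object, List<Map<String, Object>>> byRegion = groupBy(sales, "region");

        // Count aggregator: number of tuples per group.
        Map<Object, Integer> counts = every(byRegion, List::size);
        // Sum aggregator: total of the "amount" field per group.
        Map<Object, Integer> totals = every(byRegion,
                group -> group.stream().mapToInt(t -> (Integer) t.get("amount")).sum());

        System.out.println(counts); // prints {EU=2, US=1}
        System.out.println(totals); // prints {EU=15, US=7}
    }
}
```

Note that every accepts only the grouped structure returned by groupBy, which mirrors the rule that an Every pipe can only follow a grouping pipe.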


Data transformation

Cascading allows developers to transform data (tuples) as a tuple stream passes through the pipe assemblies, for example by filtering or reorganizing it. To do this, Cascading provides the concepts of Operation and Fields.

  • An Operation takes a tuple as input, applies an operation to it and produces zero or more result tuples. Cascading provides several types for implementing the Operation interface, such as Filter, Aggregator and Function.
  • Fields may be used both to declare field names and to reference field values in a tuple.

To summarize, your code should have a tap to get input data, a tap to dump output data, and in between, some pipes where all the data processing happens.
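The summary above (a source, some pipes, a sink) can be pictured with the following miniature, which chains operations that each return zero or more results per input. MiniFlow, each, filter and run are hypothetical names invented for this sketch and plain strings stand in for tuples; none of this is the Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical miniature of "source tap -> pipe -> sink tap"; names are illustrative only.
final class MiniFlow {
    private final List<Function<String, List<String>>> operations = new ArrayList<>();

    // Function-style operation: each input produces zero or more outputs.
    MiniFlow each(Function<String, List<String>> operation) {
        operations.add(operation);
        return this;
    }

    // Filter-style operation: an input is either passed through unchanged or dropped.
    MiniFlow filter(Predicate<String> keep) {
        return each(value -> keep.test(value) ? List.of(value) : List.<String>of());
    }

    // Bind a source and a sink to the chain of operations and push every input through it.
    void run(Iterable<String> source, Consumer<String> sink) {
        for (String input : source) {
            List<String> current = List.of(input);
            for (Function<String, List<String>> operation : operations) {
                List<String> next = new ArrayList<>();
                for (String value : current) {
                    next.addAll(operation.apply(value));
                }
                current = next;
            }
            current.forEach(sink);
        }
    }

    public static void main(String[] args) {
        List<String> output = new ArrayList<>();
        new MiniFlow()
                .filter(line -> !line.trim().isEmpty())                  // drop blank lines
                .each(line -> Arrays.asList(line.trim().split("\\s+")))  // one line -> many words
                .each(word -> List.of(word.toLowerCase()))               // normalize case
                .run(List.of("Hello Cascading", "  ", "hello pipes"), output::add);
        System.out.println(output); // prints [hello, cascading, hello, pipes]
    }
}
```

The chain of operations is defined before any data is supplied, which mirrors the idea that pipes are created independently of the data they will process.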

To learn more about the programming model of Cascading please visit http://www.cascading.org/documentation.

 
