Google has contributed the Cloud Dataflow programming model and SDKs to the Apache Software Foundation as Apache Beam, and has also published an article comparing Apache Spark and Apache Beam strictly from the programming-model perspective. However, people who have only just heard the name “Apache Beam” may have trouble understanding what it is in the first place. Moreover, many of Apache Beam’s model concepts have since been incorporated into newer releases of Apache Spark. So, if you are still wondering what Beam and Spark are, and how the two differ from each other, continue reading below!
About Apache Spark
Apache Spark is an open-source cluster computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can also be described as a batch processing engine that emulates stream processing through micro-batching. Its 1.0 release came in May 2014. Apache Spark requires a cluster manager as well as a distributed storage system. For cluster management, it supports standalone mode, Hadoop YARN, and Apache Mesos. For distributed storage, it supports a wide variety of systems, such as HDFS, MapR-FS, Cassandra, Kudu, OpenStack Swift, and Amazon S3.
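To make the micro-batching idea concrete, here is a toy sketch in plain Python (not PySpark itself): an unbounded event stream is chopped into small fixed-size batches, and each batch is then processed with ordinary batch logic, which is essentially how Spark Streaming treats a stream as a series of small batch jobs.

```python
from itertools import islice

def microbatches(stream, batch_size):
    """Group an event stream into small fixed-size batches,
    mimicking how micro-batching turns a stream into batch jobs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch is handled with ordinary batch logic (here: a sum).
events = range(10)
results = [sum(batch) for batch in microbatches(events, 4)]
print(results)  # [0+1+2+3, 4+5+6+7, 8+9] -> [6, 22, 17]
```

In real Spark Streaming the batching interval is time-based rather than count-based, but the principle is the same: latency is bounded below by the batch interval.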
Apache Spark provides the user with an API centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset that is distributed over a cluster and maintained in a fault-tolerant way. Unlike MapReduce, which forces a particular linear dataflow structure, Apache Spark offers a deliberately restricted form of distributed shared memory.
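The key properties of an RDD, immutability, lazy transformations, and a recorded lineage, can be illustrated with a small plain-Python sketch (this is a conceptual toy, not PySpark; real RDDs are partitioned across a cluster and recomputed from lineage on failure):

```python
class ToyRDD:
    """A minimal read-only dataset with lazy transformations,
    illustrating the RDD idea in-memory on a single machine."""
    def __init__(self, data, ops=()):
        self._data = tuple(data)   # immutable: transformations never mutate
        self._ops = ops            # recorded "lineage" of transformations

    def map(self, f):
        # Returns a NEW dataset; nothing is computed yet.
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return ToyRDD(self._data, self._ops + (("filter", p),))

    def collect(self):
        # An action: only now is the recorded lineage actually applied.
        items = self._data
        for kind, f in self._ops:
            if kind == "map":
                items = tuple(f(x) for x in items)
            else:
                items = tuple(x for x in items if f(x))
        return list(items)

rdd = ToyRDD([1, 2, 3, 4, 5])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [4, 16]
```

The original `rdd` is untouched by the transformations; in Spark, this same lineage record is what allows lost partitions to be rebuilt instead of replicated.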
About Apache Beam
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch, and stream processing. It was first released in June 2016. In essence, it is a framework for creating pipelines, which are then executed by a runner such as Apache Spark, Apache Flink, Apache Apex, or Google Cloud Dataflow. You use one of the provided SDKs to define a pipeline and choose one of the supported runners to execute it. Beam has been called an “uber-API for big data” because of the flexibility this offers. With Apache Beam, you can create the more complex pipelines that are often needed for advanced big data processing.
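Beam's central design choice, separating the pipeline description from the engine that runs it, can be sketched in a few lines of plain Python (a conceptual toy, not the real Beam SDK, whose actual classes are `beam.Pipeline`, `PTransform`, and runner options):

```python
class Pipeline:
    """A portable description of a pipeline: an ordered list of
    named transforms, with no execution logic of its own."""
    def __init__(self):
        self.transforms = []

    def apply(self, name, fn):
        self.transforms.append((name, fn))
        return self  # allow chaining, Beam-style

class DirectRunner:
    """Executes the pipeline locally, element by element. A Spark or
    Flink runner would instead translate the same transform list into
    that engine's own primitives."""
    def run(self, pipeline, data):
        for name, fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

p = Pipeline().apply("double", lambda x: 2 * x).apply("inc", lambda x: x + 1)
print(DirectRunner().run(p, [1, 2, 3]))  # [3, 5, 7]
```

Because the pipeline object carries no engine-specific code, swapping runners changes where and how the work happens without changing the pipeline definition, which is exactly the portability Beam advertises.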
| Apache Beam | Apache Spark |
|---|---|
| An open-source framework to define and execute pipelines | An open-source cluster computing framework |
| Released in June 2016 | Released in May 2014 |
| Provides several SDKs for defining pipelines; supports several runners, including Apache Spark | Requires a cluster manager and a distributed storage system |
As explained above, Apache Beam and Apache Spark are two quite different things. Apache Beam is a framework for creating data pipelines. On the other hand, Apache Spark is a framework for cluster computing. Apache Beam has several SDKs for defining the pipelines, and supports several runners for executing the pipelines – Apache Spark is one of them, but not the only option. Meanwhile, Apache Spark itself requires a cluster manager and a distributed storage system.