The paper proposes a new abstraction called resilient distributed datasets (RDD) that enables efficient data reuse for applications. These RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning, and manipulate them using “high-level” operators like
RDD is a read-only, partitioned collection of data records, and its creation can only be made via deterministic operations on data in stable storage or other RDDs. Not all RDDs need to be materialized at all times. Information describing how the RDD has been derived from other datasets is logged and allows an RDD to be reconstructed in case of failure.
Users can control persistence and partitioning for RDDs. The users can indicate which RDDs will be reused and choose where to store them, for example in memory. The elements of an RDD can also be partitioned across machines.
An example of an RDD is as follows:
lines = spark.textFile("hdfs://...") errors = lines.filter(_.startsWith("ERROR")) errors.persist()
In the example above, the RDD is defined in the first line, fetching data from an HDFS file which is then filtered and finally persisted so it can be shared between queries. At this point no work has been done yet, the actual work is done when the user performs operations that return some value, for example:
RDDs consist of five pieces of information: a set of partitions, which are atomic pieces of the dataset; a set of dependencies on parent RDDs; a function for computing the dataset based on its parents; and metadata about its partitioning scheme and data placement.
The dependencies can have two types: narrow and wide. Narrow dependencies are those which are used by at most one partition of the child RDD, and wide dependencies are used by multiple child partitions.
map leads to a narrow dependency, it can be used for pipelined execution in a single cluster, dealing with one element at a time.
join leads to wide dependencies, it might require data from several partitions/nodes. In case of failures, narrow dependencies are easier to rebuild compared to wide dependencies.
The system runs on top of Mesos and each Spark program is executed as a separate Mesos application, with its own driver (master) and workers, and resource sharing between applications is handled by Mesos.
Whenever an action (
save, etc) is executed on an RDD, the job scheduler examines the RDD’s lineage graph to build a directed acyclic graph (DAG) of stages to execute. Tasks are then launched to compute missing partitions from each stage until the target RDD is computed.
When a task fails it is re-executed in another node as long its parents are still available. If some stages are unavailable the tasks are resubmitted in parallel to compute the missing partitions.
The users can also require that some computations are “checkpointed” to avoid waiting a long time if a long RDD has a node failure. The Spark API allows users to pass a
REPLICATE flag to
persist to checkpoint data (that is chosen by the user).