This post explains a few buzzwords in Spark, External Shuffle Service (ESS) and Remote Shuffle Service (RSS), and surveys a few existing ESS and RSS solutions for cloud use cases.
Spark has a built-in External Shuffle Service (ESS) on YARN, which runs as a long-running auxiliary service on each node manager. When spark.shuffle.service.enabled is set, Spark executors register with the ESS and connect to it via a shuffle client. Shuffle blocks are kept in the YARN node manager's local directories, so the data is preserved even after an idle executor is removed by the driver. ESS must be enabled before using dynamic allocation.
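As a minimal sketch, the configuration this implies looks roughly like the following; the application name and executor counts are illustrative placeholders, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable ESS together with dynamic allocation on YARN.
// The min/max executor counts below are placeholder values.
val spark = SparkSession.builder()
  .appName("ess-dynamic-allocation-example")
  .config("spark.shuffle.service.enabled", "true")   // executors register with the node manager's ESS
  .config("spark.dynamicAllocation.enabled", "true") // needs ESS so shuffle data outlives removed executors
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()
```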
LinkedIn published a paper, Magnet: Push-based Shuffle Service for Large-scale Data Processing, describing its optimizations on top of ESS. With its push-merge style, Magnet saves disk I/O by combining small shuffle blocks into larger ones, and it tolerates failures because the push is only best-effort. This increases the reliability of the shuffle service; as a result, Spark jobs run more stably and can achieve better performance. Magnet is now available in Spark 3.2 via SPARK-30602. One thing to notice: Magnet stores the shuffle data mapping info directly in the Spark driver, which avoids having a single point of contact (usually a server role in an RSS) to store such info, something that could become a bottleneck at scale.
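As a sketch, enabling push-based shuffle on the client side in Spark 3.2+ looks roughly like this; note that the YARN shuffle service on the node managers must also be configured to merge pushed blocks, which is not shown here:

```scala
import org.apache.spark.SparkConf

// Sketch: client-side settings for push-based (Magnet) shuffle in Spark 3.2+.
// Push-based shuffle builds on top of ESS, so both flags are needed.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.push.enabled", "true") // best-effort push of map output for merging
```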
On on-prem clusters, it is common to use Spark dynamic allocation with ESS. Spark can dynamically scale the executors down when they are no longer needed, freeing resources for other workloads. ESS works very well when the cluster size is mostly static: since the cluster is rarely scaled up or down, the shuffle data can be preserved on node managers to serve other executors. When it comes to the cloud, however, this assumption no longer stands.
On the cloud, the pricing model is pay-as-you-go. Running large-scale computations efficiently requires dynamically scaling the cluster based on actual needs, which means there are no long-running nodes available to host ESS instances. If the cluster is resized frequently, recomputing task stages causes a lot of overhead, because the shuffle data is lost along with the decommissioned nodes. This is why a disaggregated compute and storage architecture is important on the cloud.
One existing practice is to keep using YARN ESS on the cloud for Spark. This works, but less efficiently: it is an anti-pattern on the cloud because it requires long-running nodes. To achieve better efficiency and reliability, many Remote Shuffle Service (RSS) implementations are being developed.
A remote shuffle service takes this a step further by storing the shuffle data in remote storage, an outcome of the disaggregated compute and storage architecture. On the mapper side, data must be pushed to the remote RSS servers before a stage ends; on the reducer side, data must be fetched from the remote RSS servers instead of local storage. With this disaggregated architecture, the compute nodes can be highly dynamic: with Spark dynamic allocation enabled, compute nodes can be scaled up and down independently of the RSS nodes.
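For clarity, the mapper/reducer contract an RSS implies looks roughly like the trait below; every name in it is hypothetical and does not correspond to any specific RSS API:

```scala
// Hypothetical interface sketching the mapper/reducer contract of an RSS;
// the names below are made up for illustration only.
trait RemoteShuffleClient {
  // Mapper side: push a serialized shuffle block to the remote RSS servers
  // before the map stage completes.
  def pushBlock(shuffleId: Int, mapId: Long, reduceId: Int, block: Array[Byte]): Unit

  // Reducer side: fetch all blocks for a reduce partition from the remote
  // RSS servers instead of from executor-local disk.
  def fetchBlocks(shuffleId: Int, reduceId: Int): Iterator[Array[Byte]]
}
```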
Several RSS implementations exist today; here is a list with a short description of each one's characteristics:
As of now, there is no open-source solution for providing RSS on K8s. Setting up and deploying RSS servers on K8s with K8s primitives is easy; the key challenge is how to provide a reliable, cost-saving solution.
We will cover these topics in future posts.