6.5 C
New York

Amazon EMR Serverless helps bigger employee sizes to run extra compute and memory-intensive workloads

Amazon EMR Serverless means that you can run open-source huge information frameworks comparable to Apache Spark and Apache Hive with out managing clusters and servers. With EMR Serverless, you’ll be able to run analytics workloads at any scale with automated scaling that resizes assets in seconds to fulfill altering information volumes and processing necessities. EMR Serverless robotically scales assets up and down to supply simply the correct amount of capability to your software.

We’re excited to announce that EMR Serverless now gives employee configurations of 8 vCPUs with as much as 60 GB reminiscence and 16 vCPUs with as much as 120 GB reminiscence, permitting you to run extra compute and memory-intensive workloads on EMR Serverless. An EMR Serverless software internally makes use of employees to execute workloads. and you’ll configure totally different employee configurations based mostly in your workload necessities. Beforehand, the biggest employee configuration obtainable on EMR Serverless was 4 vCPUs with as much as 30 GB reminiscence. This functionality is particularly useful for the next widespread eventualities:

  • Shuffle-heavy workloads
  • Reminiscence-intensive workloads

Let’s have a look at every of those use instances and the advantages of getting bigger employee sizes.

Advantages of utilizing massive employees for shuffle-intensive workloads

In Spark and Hive, shuffle happens when information must be redistributed throughout the cluster throughout a computation. When your software performs large transformations or cut back operations comparable to be part of, groupBy, sortBy, or repartition, Spark and Hive triggers a shuffle. Additionally, each Spark stage and Tez vertex is bounded by a shuffle operation. Taking Spark for instance, by default, there are 200 partitions for each Spark job outlined by spark.sql.shuffle.partitions. Nevertheless, Spark will compute the variety of duties on the fly based mostly on the info measurement and the operation being carried out. When a large transformation is carried out on high of a giant dataset, there may very well be GBs and even TBs of information that should be fetched by all of the duties.

Shuffles are usually costly when it comes to each time and assets, and might result in efficiency bottlenecks. Subsequently, optimizing shuffles can have a big affect on the efficiency and price of a Spark job. With massive employees, extra information might be allotted to every executor’s reminiscence, which minimizes the info shuffled throughout executors. This in flip results in elevated shuffle learn efficiency as a result of extra information might be fetched regionally from the identical employee and fewer information might be fetched remotely from different employees.


To exhibit the advantages of utilizing massive employees for shuffle-intensive queries, let’s use q78 from TPC-DS, which is a shuffle-heavy Spark question that shuffles 167 GB of information over 12 Spark phases. Let’s carry out two iterations of the identical question with totally different configurations.

The configurations for Take a look at 1 are as follows:

  • Measurement of executor requested whereas creating EMR Serverless software = 4 vCPUs, 8 GB reminiscence, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 4
    • spark.executor.reminiscence = 8
    • spark.executor.situations = 48
    • Parallelism = 192 (spark.executor.situations * spark.executor.cores)

The configurations for Take a look at 2 are as follows:

  • Measurement of executor requested whereas creating EMR Serverless software = 8 vCPUs, 16 GB reminiscence, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 8
    • spark.executor.reminiscence = 16
    • spark.executor.situations = 24
    • Parallelism = 192 (spark.executor.situations * spark.executor.cores)

Let’s additionally disable dynamic allocation by setting spark.dynamicAllocation.enabled to false for each assessments to keep away from any potential noise attributable to variable executor launch occasions and hold the useful resource utilization constant for each assessments. We use Spark Measure, which is an open-source software that simplifies the gathering and evaluation of Spark efficiency metrics. As a result of we’re utilizing a set variety of executors, the entire variety of vCPUs and reminiscence requested are the identical for each the assessments. The next desk summarizes the observations from the metrics collected with Spark Measure.

. Complete Time Taken for Question in milliseconds shuffleLocalBlocksFetched shuffleRemoteBlocksFetched shuffleLocalBytesRead shuffleRemoteBytesRead shuffleFetchWaitTime shuffleWriteTime
Take a look at 1 153244 114175 5291825 3.5 GB 163.1 GB 1.9 hr 4.7 min
Take a look at 2 108136 225448 5185552 6.9 GB 159.7 GB 3.2 min 5.2 min

As seen from the desk, there’s a important distinction in efficiency attributable to shuffle enhancements. Take a look at 2, with half the variety of executors which can be twice as massive as Take a look at 1, ran 29.44% quicker, with 1.97 occasions extra shuffle information fetched regionally in comparison with Take a look at 1 for a similar question, identical parallelism, and identical combination vCPU and reminiscence assets. Subsequently, you’ll be able to profit from improved efficiency with out compromising on value or job parallelism with the assistance of enormous executors. We now have noticed comparable efficiency advantages for different shuffle-intensive TPC-DS queries comparable to q23a and q23b.


To find out if the big employees will profit your shuffle-intensive Spark functions, take into account the next:

  • Verify the Levels tab from the Spark Historical past Server UI of your EMR Serverless software. For instance, from the next screenshot of Spark Historical past Server, we are able to decide that this Spark job wrote and skim 167 GB of shuffle information aggregated throughout 12 phases, trying on the Shuffle Learn and Shuffle Write columns. In case your jobs shuffle over 50 GB of information, you could doubtlessly profit from utilizing bigger employees with 8 or 16 vCPUs or spark.executor.cores.

  • Verify the SQL / DataFrame tab from the Spark Historical past Server UI of your EMR Serverless software (just for Dataframe and Dataset APIs). While you select the Spark motion carried out, comparable to acquire, take, showString, or save, you will note an aggregated DAG for all phases separated by the exchanges. Each alternate within the DAG corresponds to a shuffle operation, and it’ll comprise the native and distant bytes and blocks shuffled, as seen within the following screenshot. If the native shuffle blocks or bytes fetched is way much less in comparison with the distant blocks or bytes fetched, you’ll be able to rerun your software with bigger employees (with 8 or 16 vCPUs or spark.executor.cores) and assessment these alternate metrics in a DAG to see if there’s any enchancment.

  • Use the Spark Measure software along with your Spark question to acquire the shuffle metrics within the Spark driver’s stdout logs, as proven within the following log for a Spark job. Overview the time taken for shuffle reads (shuffleFetchWaitTime) and shuffle writes (shuffleWriteTime), and the ratio of the native bytes fetched to the distant bytes fetched. If the shuffle operation takes greater than 2 minutes, rerun your software with bigger employees (with 8 or 16 vCPUs or spark.executor.cores) with Spark Measure to trace the development in shuffle efficiency and the general job runtime.
Time taken: 177647 ms

Scheduling mode = FIFO
Spark Context default diploma of parallelism = 192

Aggregated Spark stage metrics:
numStages => 22
numTasks => 10156
elapsedTime => 159894 (2.7 min)
stageDuration => 456893 (7.6 min)
executorRunTime => 28418517 (7.9 h)
executorCpuTime => 20276736 (5.6 h)
executorDeserializeTime => 326486 (5.4 min)
executorDeserializeCpuTime => 124323 (2.1 min)
resultSerializationTime => 534 (0.5 s)
jvmGCTime => 648809 (11 min)
shuffleFetchWaitTime => 340880 (5.7 min)
shuffleWriteTime => 245918 (4.1 min)
resultSize => 23199434 (22.1 MB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 1794288453176
recordsRead => 18696929278
bytesRead => 77354154397 (72.0 GB)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 14124240761
shuffleTotalBlocksFetched => 5571316
shuffleLocalBlocksFetched => 117321
shuffleRemoteBlocksFetched => 5453995
shuffleTotalBytesRead => 158582120627 (147.7 GB)
shuffleLocalBytesRead => 3337930126 (3.1 GB)
shuffleRemoteBytesRead => 155244190501 (144.6 GB)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 156913371886 (146.1 GB)
shuffleRecordsWritten => 13867102620

Advantages of utilizing massive employees for memory-intensive workloads

Sure sorts of workloads are memory-intensive and should profit from extra reminiscence configured per employee. On this part, we talk about widespread eventualities the place massive employees may very well be useful for operating memory-intensive workloads.

Information skew

Information skews generally happen in a number of sorts of datasets. Some widespread examples are fraud detection, inhabitants evaluation, and revenue distribution. For instance, whenever you need to detect anomalies in your information, it’s anticipated that solely lower than 1% of the info is irregular. If you wish to carry out some aggregation on high of regular vs. irregular data, 99% of the info might be processed by a single employee, which can result in that employee operating out of reminiscence. Information skews could also be noticed for memory-intensive transformations like groupBy, orderBy, be part of, window features, collect_list, collect_set, and so forth. Be a part of sorts comparable to BroadcastNestedLoopJoin and Cartesan product are additionally inherently memory-intensive and vulnerable to information skews. Equally, in case your enter information is Gzip compressed, a single Gzip file can’t be learn by a couple of job as a result of the Gzip compression sort is unsplittable. When there are a number of very massive Gzip information within the enter, your job might run out of reminiscence as a result of a single job might should learn an enormous Gzip file that doesn’t match within the executor reminiscence.

Failures attributable to information skew might be mitigated by making use of methods comparable to salting. Nevertheless, this usually requires in depth modifications to the code, which will not be possible for a manufacturing workload that failed attributable to an unprecedented information skew brought on by a sudden surge in incoming information quantity. For a less complicated workaround, you could simply need to improve the employee reminiscence. Utilizing bigger employees with extra spark.executor.reminiscence means that you can deal with information skew with out making any modifications to your software code.


With the intention to enhance efficiency, Spark means that you can cache the info frames, datasets, and RDDs in reminiscence. This allows you to reuse a knowledge body a number of occasions in your software with out having to recompute it. By default, as much as 50% of your executor’s JVM is used to cache the info frames based mostly on the property spark.reminiscence.storageFraction. For instance, in case your spark.executor.reminiscence is about to 30 GB, then 15 GB is used for cache storage that’s resistant to eviction.

The default storage degree of cache operation is DISK_AND_MEMORY. If the scale of the info body you are attempting to cache doesn’t match within the executor’s reminiscence, a portion of the cache spills to disk. If there isn’t sufficient area to write down the cached information in disk, the blocks are evicted and also you don’t get the advantages of caching. Utilizing bigger employees means that you can cache extra information in reminiscence, boosting job efficiency by retrieving cached blocks from reminiscence quite than the underlying storage.


For instance, the next PySpark job results in a skew, with one executor processing 99.95% of the info with memory-intensive aggregates like collect_list. The job additionally caches a really massive information body (2.2 TB). Let’s run two iterations of the identical job on EMR Serverless with the next vCPU and reminiscence configurations.

Let’s run Take a look at 3 with the beforehand largest potential employee configurations:

  • Measurement of executor set whereas creating EMR Serverless software = 4 vCPUs, 30 GB reminiscence, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 4
    • spark.executor.reminiscence = 27 G

Let’s run Take a look at 4 with the newly launched massive employee configurations:

  • Measurement of executor set in whereas creating EMR Serverless software = 8 vCPUs, 60 GB reminiscence, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 8
    • spark.executor.reminiscence = 54 G

Take a look at 3 failed with FetchFailedException, which resulted as a result of executor reminiscence not being adequate for the job.

Additionally, from the Spark UI of Take a look at 3, we see that the reserved storage reminiscence of the executors was totally utilized for caching the info frames.

The remaining blocks to cache had been spilled to disk, as seen within the executor’s stderr logs:

23/02/06 16:06:58 INFO MemoryStore: Is not going to retailer rdd_4_1810
23/02/06 16:06:58 WARN MemoryStore: Not sufficient area to cache rdd_4_1810 in reminiscence! (computed 134.1 MiB up to now)
23/02/06 16:06:58 INFO MemoryStore: Reminiscence use = 14.8 GiB (blocks) + 507.5 MiB (scratch area shared throughout 4 duties(s)) = 15.3 GiB. Storage restrict = 15.3 GiB.
23/02/06 16:06:58 WARN BlockManager: Persisting block rdd_4_1810 to disk as a substitute.

Round 33% of the endured information body was cached on disk, as seen on the Storage tab of the Spark UI.

Take a look at 4 with bigger executors and vCores ran efficiently with out throwing any memory-related errors. Additionally, solely about 2.2% of the info body was cached to disk. Subsequently, cached blocks of a knowledge body might be retrieved from reminiscence quite than from disk, providing higher efficiency.


To find out if the big employees will profit your memory-intensive Spark functions, take into account the next:

  • Decide in case your Spark software has any information skews by trying on the Spark UI. The next screenshot of the Spark UI reveals an instance information skew state of affairs the place one job processes a lot of the information (145.2 GB), trying on the Shuffle Learn measurement. If one or fewer duties course of considerably extra information than different duties, rerun your software with bigger employees with 60–120 G of reminiscence (spark.executor.reminiscence set wherever from 54–109 GB factoring in 10% of spark.executor.memoryOverhead).

  • Verify the Storage tab of the Spark Historical past Server to assessment the ratio of information cached in reminiscence to disk from the Measurement in reminiscence and Measurement in disk columns. If greater than 10% of your information is cached to disk, rerun your software with bigger employees to extend the quantity of information cached in reminiscence.
  • One other technique to preemptively decide in case your job wants extra reminiscence is by monitoring Peak JVM Reminiscence on the Spark UI Executors tab. If the height JVM reminiscence used is near the executor or driver reminiscence, you’ll be able to create an software with a bigger employee and configure the next worth for spark.executor.reminiscence or spark.driver.reminiscence. For instance, within the following screenshot, the utmost worth of peak JVM reminiscence utilization is 26 GB and spark.executor.reminiscence is about to 27 G. On this case, it might be useful to make use of bigger employees with 60 GB reminiscence and spark.executor.reminiscence set to 54 G.


Though massive vCPUs assist improve the locality of the shuffle blocks, there are different components concerned comparable to disk throughput, disk IOPS (enter/output operations per second), and community bandwidth. In some instances, extra small employees with extra disks may supply greater disk IOPS, throughput, and community bandwidth general in comparison with fewer massive employees. We encourage you to benchmark your workloads towards appropriate vCPU configurations to decide on the most effective configuration to your workload.

For shuffle-heavy jobs, it’s beneficial to make use of massive disks. You may connect as much as 200 GB disk to every employee whenever you create your software. Utilizing massive vCPUs (spark.executor.cores) per executor might improve the disk utilization on every employee. In case your software fails with “No area left on gadget” as a result of incapacity to suit shuffle information within the disk, use extra smaller employees with 200 GB disk.


On this submit, you realized about the advantages of utilizing massive executors to your EMR Serverless jobs. For extra details about totally different employee configurations, check with Employee configurations. Giant employee configurations can be found in all Areas the place EMR Serverless is obtainable.

Concerning the Creator

Veena Vasudevan is a Senior Accomplice Options Architect and an Amazon EMR specialist at AWS specializing in huge information and analytics. She helps prospects and companions construct extremely optimized, scalable, and safe options; modernize their architectures; and migrate their huge information workloads to AWS.

Related Articles


S'il vous plaît entrez votre commentaire!
S'il vous plaît entrez votre nom ici

Latest Articles