In a specific project, I am using Apache Spark to distribute some tasks across a cluster.
These tasks take anywhere from a minute to three hours.
I represent each task as one row in the dataset and I use mapPartitions
to send them around.
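Simplified, the setup looks something like this (a minimal sketch, not my real code: `Task`, `runTask`, and the input path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder task/result types; the real rows carry my actual task parameters.
case class Task(id: Long, params: String)
case class Result(id: Long, output: String)

// Stand-in for the real work, which takes anywhere from a minute to three hours.
def runTask(task: Task): String = s"done-${task.id}"

val spark = SparkSession.builder().appName("task-runner").getOrCreate()
import spark.implicits._

// One row per task.
val tasks = spark.read.parquet("/path/to/tasks").as[Task]

// Each partition becomes one Spark task; the executor that owns it
// runs its rows sequentially.
val results = tasks.mapPartitions { rows =>
  rows.map(task => Result(task.id, runTask(task)))
}
```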
Obviously, there are a number of ways in which this setup is suboptimal. But the particular thing I'm running into is that there seems to be no way to inform Spark that a given row is expected to take much longer to map than the others.
It's not critical for me to do this right now, and there are some indirect ways to achieve it, but I am surprised to find no method for associating some kind of weight with each partition.
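One such indirect way, for example, is to pre-balance the partitions myself by an estimated cost. A rough sketch, assuming a DataFrame `tasksDF` whose rows include (or can be given) an `estimatedMinutes` column: rank the rows by estimated cost and deal them out round-robin before mapping.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val numPartitions = 64 // whatever matches the cluster

// Note: a Window with no partitionBy pulls everything through a single
// partition to compute the ranking; acceptable here because there is only
// one small row per task.
val ranked = tasksDF
  .withColumn("rank", row_number().over(Window.orderBy(col("estimatedMinutes").desc)))
  .withColumn("bucket", col("rank") % numPartitions)

// Hash-partition on the bucket id so each partition gets a similar mix of
// cheap and expensive rows; the balance is only approximate, since distinct
// bucket values can still hash to the same partition.
val balanced = ranked.repartition(numPartitions, col("bucket"))
```

But that is exactly the kind of manual bookkeeping I would expect a weight hint on partitions (or rows) to make unnecessary.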