Christopher Philip Hebert


2025-02-11

In a particular project, I am using Apache Spark to distribute some tasks across a cluster. Each task takes anywhere from a minute to three hours. I represent each task as one row in the dataset and use mapPartitions to send them around.
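For context, here is a minimal sketch of the pattern, assuming PySpark; the input path, the task_id column, and the run_task function are hypothetical stand-ins for the real work.

```python
from pyspark.sql import SparkSession

def run_task(task):
    # Stand-in for the real work: one task, anywhere from ~1 minute to ~3 hours.
    return {"task_id": task["task_id"], "status": "done"}

def process_partition(rows):
    # Each executor walks the rows of its partition sequentially.
    for row in rows:
        yield run_task(row)

spark = SparkSession.builder.appName("task-runner").getOrCreate()
tasks = spark.read.parquet("tasks.parquet")            # one row per task (hypothetical path)
results = tasks.rdd.mapPartitions(process_partition)   # tasks run wherever their partition lands
results.collect()
```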

Obviously, there are a number of ways in which this is a suboptimal setup. But the particular thing I'm running into is that Spark offers no way to tell it that a given row is expected to take longer to map than the others.

It's not critical for me to solve this right now, and there are some indirect ways to get the effect, but I am surprised to find no method for associating some kind of weight with each partition.
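One of those indirect ways, sketched here under the assumption that each row carries a rough duration estimate (the task_id and estimate_minutes columns are hypothetical): greedily bin-pack tasks on the driver so every partition gets roughly the same total estimated work, then use partitionBy to pin each task to its assigned partition.

```python
import heapq
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weighted-partitions").getOrCreate()
tasks = spark.read.parquet("tasks.parquet")  # hypothetical input path

num_partitions = 64
estimates = tasks.select("task_id", "estimate_minutes").collect()

# Greedy longest-first bin packing: always place the next-largest task into
# the partition with the least total estimated work so far.
heap = [(0.0, p) for p in range(num_partitions)]
heapq.heapify(heap)
assignment = {}
for row in sorted(estimates, key=lambda r: -r["estimate_minutes"]):
    load, p = heapq.heappop(heap)
    assignment[row["task_id"]] = p
    heapq.heappush(heap, (load + row["estimate_minutes"], p))

# Key each task by its assigned partition; an identity partition function
# keeps the key as the partition index. mapPartitions then proceeds as before.
keyed = tasks.rdd.map(lambda row: (assignment[row["task_id"]], row))
balanced = keyed.partitionBy(num_partitions, lambda key: key)
task_rows = balanced.values()
```

The assignment dict rides along in the closure, which is fine for a modest number of tasks; with many tasks you would broadcast it instead. And of course this only balances estimated work, not actual work, which is exactly why a first-class weight hint would be nice.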