[SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk #50122
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Thanks for @yorksity who reported this error and even provided a PR for it.
This solution very different from #40883 as
BlockManagerMasterEndpoint#getLocationsAndStatus()
needed some refactoring.What changes were proposed in this pull request?
This PR fixes an error which can be manifested in the following exception:
The PR is changing
BlockManagerMasterEndpoint#getLocationsAndStatus()
.The
BlockManagerMasterEndpoint#getLocationsAndStatus()
function is giving back an optionalBlockLocationsAndStatus
which consist of 3 parts:locations
: all the locations where the block can be found (as a sequence of block manager IDs)status
: one block statuslocalDirs
: optional directory paths which can be used to read block if the block is found in the disk of an executor running on the same hostThe block (either RDD blocks, shuffle blocks or torrent blocks) can be stored in many executors with different storage levels: disk or memory.
This PR changing how the block status and the block manager ID for the
localDirs
is found to guarantee they belong together.Why are the changes needed?
Before this PR the
BlockManagerMasterEndpoint#getLocationsAndStatus()
was searching for the block status (status
) and thelocalDirs
separately. The block status actually was computed as the very first one where the block can be found. This way it can easily happen this block status was representing an in-memory block (where the disk size is 0 as it is stored in the memory) but thelocalDirs
was filled out based on a host local block instance which was stored on disk.This situation can be very frequent but only causing problems (exceptions as above) when encryption is on (spark.io.encryption.enabled=true) as for a not encrypted block the whole file containing the block is read, see
https://github.com/apache/spark/blob/branch-3.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1244
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Host local block fetching was already covered by some existing unit tests but a new unit test is provided for this exact case: "SPARK-43221: Host local block fetching should use a block status with disk size".
The number of block mangers and the order of the blocks was chosen after some experimentation as the block status order is depends on a
HashSet
, see:This test was executed with the old code too to validate the issue is reproduced:
Was this patch authored or co-authored using generative AI tooling?
No.