Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][ML][SPARK-51379] Move treeAggregate's final aggregation from driver to executor #50142

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Mar 4, 2025

What changes were proposed in this pull request?

Move treeAggregate's final aggregation from driver to executor.

ee20fbb introduced an optimization that:

Move final iteration of aggregation of RDD.treeAggregate to an executor with one partition and fetch that result to the driver

This PR tries to apply this optimization so that less memory is required for ML's treeAggregate.

Why are the changes needed?

to save driver memory

Does this PR introduce any user-facing change?

no

How was this patch tested?

ci and manually check

preparing data:

df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
for i in range(10):
    df = df.union(df)

df.count()
df.repartition(1024).write.mode("overwrite").parquet("/tmp/test_data")

training a lr

from pyspark.ml.classification import *
df = spark.read.parquet("/tmp/test_data")
lr = LogisticRegression()
model = lr.fit(df)

before:
image

after:
image

In each iteration, the data sent to driver is reduced from 136.1 KiB to 21.3 KiB

Was this patch authored or co-authored using generative AI tooling?

no

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant