1) Context
Recently, I tested Spark 3.5.1 Connect. Everything worked perfectly in local mode (using --master=local[*]).
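For reference, the local smoke test looked roughly like the snippet below; the connection URL, input path, and column name are illustrative rather than my actual job:

from pyspark.sql import SparkSession

# Connect to a locally running Spark Connect server (15002 is the default port)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.read.parquet("/tmp/sample.parquet")  # placeholder input
df = df.filter(df.value > 0)                    # arbitrary transformation
print(df.count())                               # behaves as expected in local mode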
Given the successful test, I decided to adopt Spark Connect for our production environment. However, after deploying it to a cluster, I encountered an unexpected error:
“SparkConnectGrpcException: (… omitted …) java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD”
This error occurred when I ran df.count(), even though the preceding data-processing steps had completed without any issues. In hindsight that part makes sense: transformations are lazy, so count() was the first operation that actually sent work to the executors.
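Boiled down, the failing pattern looked something like this (the server address and table name are placeholders, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()

df = spark.read.table("events")             # placeholder table
df = df.select("user_id").dropDuplicates()  # lazy: nothing is executed yet, so no error
df.count()                                  # the action triggers execution on the executors
                                            # and raises the ClassCastException above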
I was puzzled as to why this problem arose in cluster mode.
2) Problem Investigation
After some research, including browsing Jira issues, I came across a helpful resource:
https://issues.apache.org/jira/browse/SPARK-46032
Even Hyukjin Kwon, one of the maintainers of Spark, couldn’t pinpoint the exact cause of the problem.
One potential explanation I found was that “it seems the Spark Connect jar is not being loaded into the executor JVM.”
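If that explanation is correct, the jar is visible on the driver side but never makes it onto the executor classpath. As a rough sanity check (my own illustration, not a step from the ticket), the relevant settings can be read back through the Connect session; depending on the deployment they may simply come back empty:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()  # placeholder URL

# What the driver reports for classpath-related settings; the spark-connect jar being
# absent here would be consistent with it never reaching the executor JVMs.
print(spark.conf.get("spark.jars", None))
print(spark.conf.get("spark.executor.extraClassPath", None))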
3) Workaround
Prabodh suggested a solution using spark.addArtifacts():
spark.addArtifacts("/Users/<my_user_name>/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.5.1.jar")
Unfortunately, this didn’t resolve my issue.
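For completeness, the attempt looked roughly like this in context (the session URL and table name are placeholders; the jar path is the one from the suggestion above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()

# Ship the connect jar to the session as an artifact before triggering any action
spark.addArtifacts("/Users/<my_user_name>/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.5.1.jar")

df = spark.read.table("events")
df.count()  # in my case this still failed with the same ClassCastException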
Since I’m running Spark on Kubernetes, I added the spark-connect_2.12-3.5.1.jar directly to the Spark Docker image.
After doing this, the problem was resolved.
You can download the jar file here: https://repo.maven.apache.org/maven2/org/apache/spark/spark-connect_2.12/3.5.1/
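In case it helps, the image change amounted to something like the Dockerfile sketch below; the base image tag and jar directory are assumptions (apache/spark:3.5.1 and /opt/spark/jars), so adjust them to whatever your deployment already uses:

# The jar was downloaded from the Maven URL above into the Docker build context.
FROM apache/spark:3.5.1

# Placing the Spark Connect jar in the image's jars directory puts it on the classpath
# of every driver and executor JVM started from this image.
COPY spark-connect_2.12-3.5.1.jar /opt/spark/jars/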