1) Context
Recently, I tested Spark 3.5.1 Connect. Everything worked perfectly in local mode (using --master=local[*]).
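For reference, the local smoke test looked roughly like the snippet below; the connection URL, input path, and column name are illustrative rather than my actual job:

from pyspark.sql import SparkSession

# Connect to a locally running Spark Connect server (15002 is the default port)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.read.parquet("/tmp/sample.parquet")  # placeholder input
df = df.filter(df.value > 0)                    # arbitrary transformation
print(df.count())                               # behaves as expected in local mode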
Given the successful test, I decided to adopt Spark Connect for our production environment. However, after deploying it to a cluster, I encountered an unexpected error:
“SparkConnectGrpcException: (… omitted …) java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD”
This error occurred when I ran df.count(), even though the preceding data-processing steps had completed without any issues. In hindsight that part makes sense: transformations are lazy, so count() was the first operation that actually sent work to the executors.
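Boiled down, the failing pattern looked something like this (the server address and table name are placeholders, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()

df = spark.read.table("events")             # placeholder table
df = df.select("user_id").dropDuplicates()  # lazy: nothing is executed yet, so no error
df.count()                                  # the action triggers execution on the executors
                                            # and raises the ClassCastException above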
I was puzzled as to why this problem arose in cluster mode.
2) Problem Investigation
After some research, including browsing Jira issues, I came across a helpful resource:
https://issues.apache.org/jira/browse/SPARK-46032
Even Hyukjin Kwon, one of the maintainers of Spark, couldn’t pinpoint the exact cause of the problem.
One potential explanation I found was that “it seems the Spark Connect jar is not being loaded into the executor JVM.”
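If that explanation is correct, the jar is visible on the driver side but never makes it onto the executor classpath. As a rough sanity check (my own illustration, not a step from the ticket), the relevant settings can be read back through the Connect session; depending on the deployment they may simply come back empty:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()  # placeholder URL

# What the driver reports for classpath-related settings; the spark-connect jar being
# absent here would be consistent with it never reaching the executor JVMs.
print(spark.conf.get("spark.jars", None))
print(spark.conf.get("spark.executor.extraClassPath", None))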
3) Workaround
Prabodh suggested a solution using spark.addArtifacts():
spark.addArtifacts("/Users/<my_user_name>/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.5.1.jar")
Unfortunately, this didn’t resolve my issue.
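For completeness, the attempt looked roughly like this in context (the session URL and table name are placeholders; the jar path is the one from the suggestion above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()

# Ship the connect jar to the session as an artifact before triggering any action
spark.addArtifacts("/Users/<my_user_name>/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.5.1.jar")

df = spark.read.table("events")
df.count()  # in my case this still failed with the same ClassCastException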
Since I’m running Spark on Kubernetes, I added the spark-connect_2.12-3.5.1.jar directly to the Spark Docker image.
After doing this, the problem was resolved.
You can download the jar file here: https://repo.maven.apache.org/maven2/org/apache/spark/spark-connect_2.12/3.5.1/
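In case it helps, the image change amounted to something like the Dockerfile sketch below; the base image tag and jar directory are assumptions (apache/spark:3.5.1 and /opt/spark/jars), so adjust them to whatever your deployment already uses:

# The jar was downloaded from the Maven URL above into the Docker build context.
FROM apache/spark:3.5.1

# Placing the Spark Connect jar in the image's jars directory puts it on the classpath
# of every driver and executor JVM started from this image.
COPY spark-connect_2.12-3.5.1.jar /opt/spark/jars/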