SajidAlamQB
Contributor

@SajidAlamQB SajidAlamQB commented Sep 30, 2025

Description

Related to: #1170

Fixes CI e2e test failure for kedro-docker with Spark support. The test "Execute docker build and run using spark Dockerfile" was failing when PySpark attempted to convert DataFrames to Pandas using Apache Arrow. The error occurred because Arrow couldn't access sun.misc.Unsafe due to JVM restrictions.

Development notes

  • Updated kedro_docker/template/Dockerfile.spark to add JVM configuration environment variables
  • Added --add-opens flags to allow Apache Arrow access:
      • java.base/java.nio - for buffer operations
      • java.base/sun.nio.ch - for channel implementations
      • java.base/jdk.internal.misc - for access to sun.misc.Unsafe
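
For reference, here is a minimal sketch of how the three packages listed above map onto JVM flags. The package names come from the notes above; the `ALL-UNNAMED` target and the helper name `build_add_opens_flags` are assumptions for illustration, and the sketch does not show which environment variable the Dockerfile actually uses to pass the flags to the JVM.

```python
# Packages named in the development notes above.
ARROW_PACKAGES = [
    "java.base/java.nio",
    "java.base/sun.nio.ch",
    "java.base/jdk.internal.misc",
]


def build_add_opens_flags(packages, target="ALL-UNNAMED"):
    """Return a space-separated string of --add-opens JVM flags.

    The target "ALL-UNNAMED" (all classpath code) is an assumption here;
    the PR description does not state which target is used.
    """
    return " ".join(f"--add-opens={pkg}={target}" for pkg in packages)


flags = build_add_opens_flags(ARROW_PACKAGES)
print(flags)
```

The resulting string could then be handed to the JVM through whichever options variable the image uses.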

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Sajid Alam <[email protected]>
@SajidAlamQB SajidAlamQB changed the title test docker ci fix: Kedro-docker CI E2E Failures Oct 2, 2025
@SajidAlamQB SajidAlamQB self-assigned this Oct 2, 2025
@SajidAlamQB SajidAlamQB marked this pull request as ready for review October 2, 2025 14:41
Member

@deepyaman deepyaman left a comment

Thanks for addressing this long-running issue!

Member

@deepyaman deepyaman left a comment

Sorry, I actually have a question. 😅


# set Spark configuration to handle Arrow properly
ENV SPARK_SUBMIT_OPTS="-Dio.netty.tryReflectionSetAccessible=true"
ENV ARROW_PRE_0_15_IPC_FORMAT=1
Member

@deepyaman deepyaman Oct 2, 2025

This is ridiculously old, from about 6 years ago. I can't imagine anyone is still on pre-0.15 Arrow in practice. Does this really need to be set?

Member

Looking more into it, it should not be required when working with Spark 3. What version of Spark are the e2e tests using?

Contributor Author

@SajidAlamQB SajidAlamQB Oct 3, 2025

You're right that ARROW_PRE_0_15_IPC_FORMAT=1 is outdated, but after testing without it, the same sun.misc.Unsafe error returns.

Is the flag affecting more than just IPC format compatibility?

Member

Is the flag affecting more than just IPC format compatibility?

Diving into the PySpark source, I would expect it to throw an error if you're using Spark 3: https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/sql/pandas/utils.py#L58-L60

I wouldn't be surprised if this is still necessary for Spark 2; Spark 3 really pushed to be more Arrow-native.
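
To make the expectation concrete, here is a rough sketch of the kind of guard being described. It is not copied from PySpark: the function name `check_legacy_arrow_flag`, the error message, and the explicit `spark_major` parameter are all hypothetical and only illustrate the behaviour "reject the legacy flag on Spark 3, tolerate it on Spark 2".

```python
LEGACY_FLAG = "ARROW_PRE_0_15_IPC_FORMAT"


def check_legacy_arrow_flag(environ, spark_major):
    """Hypothetical guard: raise if the legacy IPC flag is set on Spark 3+.

    `environ` is any mapping of environment variables (for example
    os.environ); `spark_major` is the Spark major version as an int.
    """
    if spark_major >= 3 and environ.get(LEGACY_FLAG, "0") == "1":
        raise RuntimeError(
            f"{LEGACY_FLAG} is set, but it should not be needed on Spark {spark_major}"
        )


# On Spark 2 the flag is tolerated in this sketch, so no error is raised.
check_legacy_arrow_flag({LEGACY_FLAG: "1"}, spark_major=2)
```

If the e2e image really runs Spark 3 with the flag set and nothing like this fires, that would be a hint that the Arrow code path is not the one assumed here.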

I think it's worth confirming:

  1. Is Spark 2 used in the tests? I couldn't tell easily just from scanning the code and looking at the CI logs.
  2. If so, why is Spark 2 being used? As far as I can tell, the starter should get you set up including Spark 3.
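
One way to answer the first question would be to print the installed PySpark version from inside the test image. The snippet below is a small sketch using only the standard library; the helper names `installed_pyspark_version` and `spark_major_version` are made up for illustration, and it assumes PySpark was installed as the `pyspark` distribution.

```python
from importlib import metadata


def spark_major_version(version_string):
    """Return the major version as an int, e.g. "3.5.1" gives 3."""
    return int(version_string.split(".")[0])


def installed_pyspark_version():
    """Return the installed pyspark version string, or None if it is absent."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


version = installed_pyspark_version()
if version is None:
    print("pyspark is not installed in this environment")
else:
    print(f"pyspark {version} (major {spark_major_version(version)})")
```

Running this as a step in the e2e job would put the Spark major version directly into the CI logs.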

Contributor Author

Yeah, this is confusing. I will keep digging into this a bit more.

@SajidAlamQB SajidAlamQB marked this pull request as draft October 3, 2025 13:11