fix: Kedro-docker CI E2E Failures #1201
base: main
Signed-off-by: Sajid Alam <[email protected]>
Thanks for addressing this long-running issue!
Sorry, I actually have a question. 😅
# set Spark configuration to handle Arrow properly
ENV SPARK_SUBMIT_OPTS="-Dio.netty.tryReflectionSetAccessible=true"
ENV ARROW_PRE_0_15_IPC_FORMAT=1
This is ridiculously old, like 6 years ago. I can't imagine anyone is still on pre-0.15 Arrow in reality. Does this actually need to be set?
Looking more into it, it should not be required when working with Spark 3. What version of Spark are the e2e tests using?
You're right that ARROW_PRE_0_15_IPC_FORMAT=1 is outdated, but after testing without it the same sun.misc.Unsafe error returns.
Is the flag affecting more than just IPC format compatibility?
> Is the flag affecting more than just IPC format compatibility?
Diving into the PySpark source, I would expect it to throw an error if you're using Spark 3: https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/sql/pandas/utils.py#L58-L60
I wouldn't be surprised if this is still necessary for Spark 2; Spark 3 really pushed to be more Arrow-native.
I think it's worth confirming:
- Is Spark 2 used in the tests? I couldn't tell easily just from scanning the code and looking at the CI logs.
- If so, why is Spark 2 being used? As far as I can tell, the starter should get you set up including Spark 3.
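One way to settle the version question is to check it from inside the built image. The snippet below is only a sketch: the helper names `spark_major_version` and `legacy_arrow_flag_is_suspect` are hypothetical, and the "Spark 3 rejects the flag" rule is taken from the comment above rather than verified against the e2e environment.

```python
import os


def spark_major_version(version: str) -> int:
    """Parse the major version from a string such as '3.5.1'."""
    return int(version.split(".")[0])


def legacy_arrow_flag_is_suspect(version: str, env=None) -> bool:
    """True when ARROW_PRE_0_15_IPC_FORMAT=1 is set alongside Spark 3 or newer.

    Assumption (from the review comment): the flag only makes sense on Spark 2.
    """
    env = os.environ if env is None else env
    flag_set = env.get("ARROW_PRE_0_15_IPC_FORMAT", "0") == "1"
    return flag_set and spark_major_version(version) >= 3


if __name__ == "__main__":
    try:
        import pyspark  # only available where PySpark is installed
    except ImportError:
        print("PySpark is not installed in this environment")
    else:
        print("Spark version:", pyspark.__version__)
        print("Legacy Arrow flag suspect:",
              legacy_arrow_flag_is_suspect(pyspark.__version__))
```

Running this inside the container used by the e2e test would show which Spark major version is installed and whether the flag is set alongside it.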
Yeah, this is confusing. I will keep digging into this a bit more.
Description
Related to: #1170
Fixes CI e2e test failure for kedro-docker with Spark support. The test "Execute docker build and run using spark Dockerfile" was failing when PySpark attempted to convert DataFrames with toPandas using Apache Arrow. The error occurred because Arrow couldn't access sun.misc.Unsafe due to JVM restrictions.
Development notes
Updated kedro_docker/template/Dockerfile.spark to add JVM configuration environment variables
Added --add-opens flags to allow Apache Arrow access:
- java.base/java.nio - for buffer operations
- java.base/sun.nio.ch - for channel implementations
- java.base/jdk.internal.misc - for access to sun.misc.Unsafe
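For illustration, the three packages listed above can be turned into a JVM option string as follows. This is a sketch, not the contents of Dockerfile.spark: the helper `build_add_opens` is hypothetical, and the `ALL-UNNAMED` target is an assumption (the usual choice when the accessing code is on the classpath rather than in a named module).

```python
# Packages named in the development notes above.
ARROW_PACKAGES = [
    "java.base/java.nio",
    "java.base/sun.nio.ch",
    "java.base/jdk.internal.misc",
]


def build_add_opens(packages, target="ALL-UNNAMED"):
    """Join one --add-opens flag per package into a single option string."""
    return " ".join(f"--add-opens={pkg}={target}" for pkg in packages)


if __name__ == "__main__":
    print(build_add_opens(ARROW_PACKAGES))
```

A string like this could be handed to the JVM through the `JAVA_TOOL_OPTIONS` environment variable or Spark's `spark.driver.extraJavaOptions` setting. The excerpt here does not show which mechanism the Dockerfile actually uses.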
Checklist
- jsonschema/kedro-catalog-X.XX.json if necessary
- RELEASE.md file