feat(datasets): Enrich databricks connect error message #1039
base: main
Conversation
Any ideas why this might fail? Not sure this is caused by my PR, wdyt?
No, this isn't related to your PR. This failure started showing up on our
@star-yar, the test failures are resolved, but now there's a coverage issue which is related to the changes you made. Can you add tests to cover the behaviour?
Before implementing such a change, I wanted to share some additional learning. We should not invoke Databricks Connect here. I think the newer Databricks Connect is intended for establishing the connection once and then retrieving the already-created session.

So, in the context of a kedro project, the user should create a hook that establishes the connection via Databricks Connect. The catalog code should then always get a Spark session, never make one (meaning we remove the Databricks Connect invocation completely), relying on the user having created the connection first. So maybe we should mention this approach somewhere in the docs, and rely on the Spark session already being created rather than handling the creation case. This will impact the error message we output.

Current workflow (no Spark session pre-init):

flowchart LR
n1["kedro session creates"]
n2["dataset invoked"]
n3["dataset creates session
through databricks-connect/pyspark"]
n1 --> n2 --> n3
Suggested workflow (no Spark session pre-init):

flowchart LR
n1["kedro session creates"]
n2["dataset invoked"]
n3["dataset gets session
through pyspark*"]
n1 --> n2 --> n3
* If databricks-connect is installed, it will complain that you're trying to create a session through pyspark. We handle this by raising the error and notifying the user that they first need to create a hook that creates the session.
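For illustration, here is a minimal sketch of what such a hook could look like, assuming databricks-connect v2+ (which provides `DatabricksSession`) and kedro's `after_context_created` hook spec. The class name `SparkHooks` and the wiring are illustrative, not part of this PR:

```python
# Hypothetical hook sketch: establish the Databricks Connect session once,
# before any dataset is loaded, so catalog code only ever *gets* a session.
from databricks.connect import DatabricksSession
from kedro.framework.hooks import hook_impl


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # getOrCreate() connects using the user's Databricks configuration
        # (e.g. a CLI profile or environment variables).
        DatabricksSession.builder.getOrCreate()
```

The hook would then be registered in the project's `settings.py` via `HOOKS = (SparkHooks(),)`, mirroring what kedro's pyspark integration docs describe for plain Spark.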
Signed-off-by: Ankita Katiyar <[email protected]>
Any reflections, @merelcht?
Hi @star-yar, I'm really sorry for the late response. You raise a very good point. In fact, this is what we recommend when using Spark (without Databricks) in the docs: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#initialise-a-sparksession-using-a-hook

We've had an open issue to rewrite our Spark datasets forever (#135), but never got round to it, so the Spark-related datasets have just evolved through contributions and not with a proper architecture in mind. We're currently working on a major release, but afterwards we can put this back on our priority list. You're also more than welcome to make a contribution if you have time.
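To make the "get, don't make" idea concrete, here is a hedged sketch of what the retrieval side could look like once a hook owns session creation. The helper name `get_spark` mirrors the one mentioned in this PR's checklist, but the body is illustrative and not the actual kedro-datasets code:

```python
# Illustrative only: retrieve an existing Spark session instead of creating one.
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    spark = SparkSession.getActiveSession()
    if spark is None:
        raise RuntimeError(
            "No active Spark session found. Create one first, for example in an "
            "after_context_created hook registered in settings.py."
        )
    return spark
```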
Description
Resolves #1038
Development notes
Catches the error raised by databricks-connect and re-raises it with suggestions for resolution.
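As a rough illustration of that behaviour (not the exact implementation in this PR), the enrichment could look like the sketch below; the exception type raised by databricks-connect is not pinned down here, so a broad catch-and-re-raise is shown:

```python
# Sketch of the error-enrichment idea: catch the complaint databricks-connect
# raises when a plain pyspark session is requested, and re-raise with guidance.
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    try:
        return SparkSession.builder.getOrCreate()
    except Exception as exc:  # exact exception type depends on databricks-connect
        raise RuntimeError(
            "Could not create a Spark session. If you are using databricks-connect, "
            "establish the session up front (e.g. in an after_context_created hook) "
            "before loading Spark datasets."
        ) from exc
```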
Checklist
- Updated jsonschema/kedro-catalog-X.XX.json if necessary
- Added a description of this change in the relevant RELEASE.md file
- Added tests to cover my changes (the tests cover get_spark, so I'm marking this as done)