feat(datasets): Enrich databricks connect error message #1039

Open: star-yar wants to merge 7 commits into main from feature/extend-the-dbx-connect-error-message

@star-yar star-yar commented Mar 14, 2025

Description

Resolves #1038

Development notes

Catches the error raised by databricks-connect and re-raises it with suggestions on how to resolve it.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes (no tests exist for get_spark so I'm marking this as done)
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@star-yar star-yar force-pushed the feature/extend-the-dbx-connect-error-message branch from e990a37 to 480ac00 Compare March 14, 2025 20:04
@star-yar star-yar changed the title feat:(datasets): Enrich databricks connect error message feat(datasets): Enrich databricks connect error message Mar 14, 2025
@star-yar star-yar force-pushed the feature/extend-the-dbx-connect-error-message branch 4 times, most recently from 27ff177 to 15d06fd Compare March 14, 2025 22:03
Signed-off-by: star-yar <[email protected]>
Signed-off-by: star-yar <[email protected]>
Signed-off-by: star-yar <[email protected]>
@star-yar star-yar force-pushed the feature/extend-the-dbx-connect-error-message branch from 67a5016 to da3b0b2 Compare March 14, 2025 22:12
@star-yar star-yar marked this pull request as draft March 14, 2025 22:13
Signed-off-by: star-yar <[email protected]>
@star-yar star-yar force-pushed the feature/extend-the-dbx-connect-error-message branch from da3b0b2 to b962af8 Compare March 14, 2025 22:15
Signed-off-by: star-yar <[email protected]>
@star-yar
Author

Any ideas why this might fail? Not sure this is caused by my PR, wdyt?

@merelcht

@merelcht
Member

> Any ideas why this might fail? Not sure this is caused by my PR, wdyt?
>
> @merelcht

No, this isn't related to your PR. This failure started showing up on our main branch too. We'll have a look at fixing it.

@star-yar star-yar marked this pull request as ready for review March 17, 2025 13:36
@merelcht
Member

@star-yar, the test failures are resolved, but now there's a coverage issue which is related to the changes you made. Can you add tests to cover the behaviour?

@star-yar
Author

star-yar commented Mar 28, 2025

Before implementing such a change, I wanted to share some additional learning.

We should not invoke Databricks Connect here. I think the newer Databricks Connect is intended for establishing the connection once and then retrieving the created session via pyspark.sql.SparkSession.

So, in the context of a Kedro project, the user should create a hook that establishes the connection via Databricks Connect. The catalog code should then always get the Spark session, never create it (meaning we need to remove the Databricks Connect invocation completely), relying on the user having created the connection first.

So maybe we should mention this approach somewhere in the docs, and rely on the Spark session already existing instead of handling the creation case. This will impact the error message we output.
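
To make the hook idea concrete, here is a minimal sketch of what the user-side hook could look like. The class name `DatabricksConnectHooks` and the `session_factory` parameter are hypothetical, added only so the sketch can be exercised without a cluster; the `hook_impl` fallback exists only so the snippet imports when Kedro is not installed. I'm assuming `DatabricksSession.builder.getOrCreate()` is the right entry point for the user's setup.

```python
from typing import Any, Callable, Optional

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so this sketch still imports without Kedro installed.
    def hook_impl(func):
        return func


def _databricks_connect_session() -> Any:
    # Imported lazily so the module loads without databricks-connect installed.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()


class DatabricksConnectHooks:
    """Hypothetical hook: create the Spark session once, before any dataset runs."""

    def __init__(
        self, session_factory: Callable[[], Any] = _databricks_connect_session
    ) -> None:
        # session_factory is an assumption of this sketch; it allows swapping
        # in a fake session for local checks.
        self._session_factory = session_factory
        self.spark: Optional[Any] = None

    @hook_impl
    def after_context_created(self, context: Any) -> None:
        # Establish the connection up front; datasets later only fetch it.
        self.spark = self._session_factory()
```

The hook would be registered in the project's `settings.py` via `HOOKS = (DatabricksConnectHooks(),)`, the same way the existing PySpark docs register their Spark hook.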

Current workflow (no Spark session pre-init):

flowchart LR
    n1["kedro session creates"]
    n2["dataset invoked"]
    n3["dataset creates session
 through databricks-connect/pyspark"]
    n1 --> n2 --> n3

Suggested workflow (no Spark session pre-init):

flowchart LR
    n1["kedro session creates"]
    n2["dataset invoked"]
    n3["dataset gets session
 through pyspark*"]
    n1 --> n2 --> n3

* If databricks-connect is installed, it will complain that you're trying to create a session through pyspark. We handle this by raising an error that tells the user they first need to add a hook that creates the session.
@merelcht
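
On the dataset side, the footnote above could be sketched roughly like this. The names `get_spark_session`, `HOOK_HINT`, and the `getter` parameter are hypothetical and not the current `get_spark` code. I don't know exactly which exception databricks-connect raises here, so the sketch catches `Exception` broadly and keeps the original error as the cause.

```python
from typing import Any, Callable, Optional

# Hypothetical wording; the real message would be agreed on in review.
HOOK_HINT = (
    "Could not get a Spark session. If databricks-connect is installed, "
    "create the session first, for example in a project hook, before any "
    "Spark dataset is loaded or saved."
)


def _pyspark_get_or_create() -> Any:
    # Imported lazily so this module loads even without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.getOrCreate()


def get_spark_session(getter: Optional[Callable[[], Any]] = None) -> Any:
    """Fetch the session through pyspark; re-raise failures with a hint."""
    getter = getter or _pyspark_get_or_create
    try:
        return getter()
    except Exception as exc:
        # Chain the original error so the underlying message is not lost.
        raise RuntimeError(HOOK_HINT) from exc
```

Chaining with `from exc` keeps the databricks-connect traceback visible under the enriched message, which should help when the hint alone is not enough.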

@star-yar
Author

Any reflections @merelcht ?

@merelcht
Member

Hi @star-yar , I'm really sorry for the late response.

You raise a very good point. In fact, this is what we recommend when using spark (without databricks) in the docs: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#initialise-a-sparksession-using-a-hook

We've had an open issue to rewrite our Spark datasets forever (#135), but never got round to it, so the Spark-related datasets have just evolved through contributions rather than with a proper architecture in mind.

We're currently working on a major release, but afterwards we can put this back on our priority list. You're also more than welcome to make a contribution if you have time.

Successfully merging this pull request may close these issues.

SparkDataset failing with databricks-connect serveless cluster
3 participants