feat(datasets): SparkDataset Rewrite #1185
base: main
Conversation
Signed-off-by: Sajid Alam <[email protected]>
Digging in a bit, this feels less like a rewrite and more like a refactoring. Here are my initial thoughts:

- I've added a comment re my concern removing the `_dbfs_glob` logic. This needs to be validated carefully (perhaps Databricks improved performance of regular glob?) so we don't reintroduce a performance issue. I remember debugging this on a client project, because IIRC (it's been years) performance degrades to the point of unusability with a large number of versions.
- Will this provide the best experience with `spark-connect` and `databricks-connect`? (FWIW `databricks-connect` is a bit annoying to look into since it's not open source.) Spark 3.4 introduced Spark Connect, and Spark 4 includes major refactors to really make it part of the core (e.g. `pyspark.sql.classic` is moved to the same level as `pyspark.sql.connect`, and they inherit from the same base `DataFrame` and all; that wasn't the case before). IMO Spark Connect looks like the future of Spark, and a `SparkDataset` refresh should work seamlessly with it. Spark Connect (and Databricks Connect) are also potentially great for users who struggle with the deployment experience (e.g. needing to get code onto Databricks from local). That said, the classic experience is still likely a very common way to operate for teams working more from within Databricks.
- I like the fact that HDFS is supported through PyArrow now. If there's still concern that people may need the old, separate HDFS client (not sure there is? `hdfs` hasn't had a release in two years and doesn't support Python 3.13, for example), maybe that could be handled through some sort of fallback logic?
logger = logging.getLogger(__name__)


class SparkDatasetV2(AbstractVersionedDataset):
Hi @SajidAlamQB, in general the code looks good to me. I am not sure about the history of the changes mentioned by @deepyaman. I am testing this dataset and will update here if there are any issues.
Regarding the rewrite vs. refactor: as discussed, I think it is a good idea to have a separate version to avoid breaking changes. At the same time, we should keep in mind whether we want to continue maintaining both versions in the future.
Maybe input from @noklam on these changes would help us understand better. Thank you.
Regarding the rewrite vs refactor, as discussed I think it is a good idea to have a separate version to avoid breaking changes.
@ravi-kumar-pilla Yeah, sorry if I wasn't clear. If there need to be breaking changes, I absolutely agree it should be a separate version; I just didn't see that many potential breaking changes (removing the HDFS dependency doesn't have to be breaking). Maybe I treated the `fsspec` changes too lightly when I looked through the PR; I got the impression it was mostly bugfixing to map the correct protocols, in addition to removing the custom logic to get the protocol and such, but I didn't understand it to be an overall behavior change.
Thanks @deepyaman, you're right about the DBFS glob issue; that is a good catch, and we'll add that logic back in. Regarding refactor vs. rewrite, we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.
Yeah, of course. I think we can get the V2 "ready", and then see if it's sufficiently different that it needs to be breaking/a separate dataset.
@noklam we would also appreciate your thoughts on this.
Sorry, I don't have time to review this in detail, but I don't want to block this. A few quick questions off the top of my head:
- When should I use the `databricks`-specific dataset vs. `spark.SparkDataset` on Databricks? I recall there is already something that is only possible with the databricks one. If we are rewriting this, I think we should have a look at that.
- DBFS is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people are still using DBFS on older clusters.
- Is there a goal or are there additional things that this rewrite improves? Or is it more like refactoring?
Hey @noklam, thanks. I think the Databricks datasets are more for TABLE operations, while the SparkDataset is for FILE operations. The new V2 handles both DBFS and UC Volumes properly: it still supports `/dbfs/`, `dbfs:/`, and `/Volumes/` paths, and we apply the DBFS optimisations only when needed. This goes a bit beyond a refactor, I think; we're solving some long-standing issues: Databricks users can now actually use it, we add Spark Connect support for Spark 4.0, and users can now choose their dependencies instead of installing everything, via pyproject.toml changes. It makes the dataset more usable.
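For illustration, the three path styles mentioned here could be normalized roughly like this (a hypothetical helper sketching the idea, not the actual V2 implementation):

```python
def normalize_databricks_path(path: str) -> tuple[str, bool]:
    """Return (spark_path, is_dbfs) for common Databricks path styles."""
    if path.startswith("/dbfs/"):
        # FUSE-mounted DBFS path: rewrite to a Spark-readable URI
        return "dbfs:/" + path[len("/dbfs/"):], True
    if path.startswith("dbfs:/"):
        return path, True
    # Unity Catalog volumes (/Volumes/...) and all other paths pass
    # through unchanged and need no DBFS-specific handling
    return path, False
```

The `is_dbfs` flag is what would gate the DBFS-only optimisations, so UC Volumes paths never hit that code path.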
Description
This PR introduces `SparkDatasetV2`, an alternative, cleaner version of `SparkDataset`.
The `SparkDataset` is currently frustrating to work with for several reasons that have been outlined in #135:
- Inconsistent path parsing (a custom `split_filepath` vs. the normal `get_protocol_and_path`), leading to duplication and inconsistencies (e.g., S3 paths handle "s3a://" differently).
- Dependency issues: `spark-sparkdataset = ["kedro-datasets[spark-base,hdfs-base,s3fs-base]"]` forces all three, but HDFS is rarely used nowadays. Databricks datasets rely on SparkDataset's parsing utils, creating circular dependencies.
- Testing gaps and bugs.
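For context, the protocol/path split that the standard helper performs looks roughly like this (a simplified stand-in built on `urllib`, not kedro's actual `get_protocol_and_path` implementation):

```python
from urllib.parse import urlsplit


def protocol_and_path(filepath: str) -> tuple[str, str]:
    """Split a filepath like "s3a://bucket/key" into (protocol, path).

    Paths without a scheme are treated as local files. Real
    implementations also handle Windows drives, versioning, etc.
    """
    parts = urlsplit(filepath)
    if not parts.scheme:
        return "file", filepath
    return parts.scheme, parts.netloc + parts.path
```

Having one helper do this everywhere is what avoids the "s3a://" inconsistencies a custom `split_filepath` can introduce.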
Development notes
This PR introduces `SparkDatasetV2` to:
Dependency Restructuring
- `spark-core` with zero dependencies
- Platform bundles (`spark-local`, `spark-databricks`, `spark-emr`)
- Cloud storage bundles (`spark-s3`, `spark-gcs`, `spark-azure`)
Code Improvements for `SparkDataset`
- `TYPE_CHECKING` for lazy PySpark imports
- Standard `get_protocol_and_path` for path parsing
Now:
- Users of `kedro-datasets[spark]` will need to choose specific bundles (e.g. `spark-hdfs`)
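As a sketch of the `TYPE_CHECKING` lazy-import pattern mentioned above (the function and names here are illustrative, not the PR's actual code), PySpark types can appear in annotations without importing PySpark at runtime:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers; pyspark is never
    # imported at runtime, so this module loads without it installed.
    from pyspark.sql import DataFrame


def row_count(df: DataFrame) -> int:
    # With `from __future__ import annotations`, the annotation is a
    # plain string at runtime, so pyspark is only needed once a real
    # DataFrame is actually passed in and used.
    return df.count()
```

This is the trick that lets the dataset module import cleanly even when PySpark is not available in the environment.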
Checklist
- Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- Updated the `RELEASE.md` file