Adding python flex template to convert raw text/image objects to ArrayRecord #2626
type = Template.TemplateType.PYTHON,
displayName = "ArrayRecord Converter Job",
description =
    "The ArrayRecord_Converter Template is used to convert bulk text/image dataset in GCS into ArrayRecord Datasets in GCS"
We can probably make this template hidden for now, or change the description to say something like "use as is, no support".
I think "use as is" and "no support" is fine. However, would users still need to pull this repository locally to deploy their own flex template and run it?
I don't think we can publicly publish a new template and not support it. Per https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/maintainers-guide.md#introducing-new-templates we generally are not accepting new templates, and I don't think we should take this as a result. I'd recommend adding this as a sample (maybe in https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates) so that customers can take it and build it themselves there if needed.
def upload_to_gcs(
    bucket_name, filename, prefix='', source_dir=self._WRITE_DIR
):
  source_filename = os.path.join(source_dir, filename)
This will be expensive since you do this for every element. Better to use the existing GCS IO.
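One possible reading of this suggestion, as a minimal sketch rather than the change the PR actually needs: write the bytes through Beam's filesystem layer (`apache_beam.io.filesystems.FileSystems.create`) instead of staging a local file and uploading it per element. The names `write_object`, `InMemoryFile`, and the `opener` parameter are hypothetical and only here for illustration; the Beam import is done lazily so the helper can be exercised without Beam installed.

```python
import io
from typing import BinaryIO, Callable, Optional


def _beam_create(path: str) -> BinaryIO:
    """Open `path` for writing through Beam's filesystem layer.

    Assumption: the worker has apache_beam (with GCP extras) installed, so
    gs:// paths resolve to Beam's GCS filesystem.
    """
    from apache_beam.io.filesystems import FileSystems  # lazy import
    return FileSystems.create(path)


def write_object(
    path: str,
    payload: bytes,
    opener: Optional[Callable[[str], BinaryIO]] = None,
) -> int:
    """Write `payload` to `path` and return the number of bytes written.

    `opener` is a hypothetical injection point: it defaults to Beam's
    FileSystems.create, but a test can pass an in-memory replacement.
    """
    handle = (opener or _beam_create)(path)
    try:
        handle.write(payload)
    finally:
        # Always close, so the remote object is finalized even on error.
        handle.close()
    return len(payload)


class InMemoryFile(io.BytesIO):
    """Hypothetical stand-in for a remote file; keeps its bytes after close()."""

    def close(self) -> None:
        self.saved = self.getvalue()
        super().close()
```

Inside a `DoFn`, the same idea would mean calling something like `write_object(output_path, data)` from `process`, with no temporary local file and no storage client constructed per element.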
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #2626 +/- ##
============================================
+ Coverage 49.82% 49.92% +0.09%
+ Complexity 5245 4913 -332
============================================
Files 954 954
Lines 58411 58476 +65
Branches 6320 6333 +13
============================================
+ Hits 29106 29192 +86
+ Misses 27222 27201 -21
Partials 2083 2083
Adding a Python flex template to convert raw text/image GCS objects in bulk to ArrayRecord format. This allows ML researchers to take advantage of Grain pipelines.
input_path, input_format, and output_path.
To build and test:
Build the flex template image and job (for example):
Deploy a new Dataflow job (for example):