@PAenugula PAenugula commented Aug 7, 2025

Adds a Python Flex Template that bulk-converts raw text/image GCS objects to ArrayRecord format, so that ML researchers can use the output in Grain pipelines.

  • Allows passing in input_path, input_format, and output_path.

To build and test:
Build the flex template image and the template spec (for example):

gcloud builds submit . --tag us-central1-docker.pkg.dev/iamphani-gcp-grain/array-record-converter-base-image/my_base_image:latest --project iamphani-gcp-grain

gcloud dataflow flex-template build gs://flex-templating/main.json  --image us-central1-docker.pkg.dev/iamphani-gcp-grain/array-record-converter-base-image/my_base_image:latest --sdk-language "PYTHON" --metadata-file=metadata.json --project iamphani-gcp-grain

Deploy a new Dataflow job (for example):

gcloud dataflow flex-template run "array-record-converter-job" --template-file-gcs-location "gs://flex-templating/main.json" --region "us-central1" --staging-location "gs://flex-templating/staging"  --parameters input_path="gs://converter-datasets/input-datasets/google-top-terms/csv/1k/google_trends.csv" --parameters input_format="text" --parameters output_path="gs://converter-datasets/output-final/" --project iamphani-gcp-grain
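
For scripted runs, the deploy command above can be assembled programmatically. The following is a minimal Python sketch and is not part of this PR: the function name `flex_template_run_args` and every example value (`my-project`, `gs://my-bucket/...`) are hypothetical placeholders. Only the gcloud flags are taken from the command above.

```python
import shlex
from typing import Dict, List


def flex_template_run_args(
    job_name: str,
    template_gcs_path: str,
    region: str,
    staging_location: str,
    project: str,
    parameters: Dict[str, str],
) -> List[str]:
    """Build the argv for `gcloud dataflow flex-template run`."""
    args = [
        "gcloud", "dataflow", "flex-template", "run", job_name,
        "--template-file-gcs-location", template_gcs_path,
        "--region", region,
        "--staging-location", staging_location,
    ]
    # Each template parameter is passed as its own --parameters key=value pair,
    # mirroring the command in the PR description.
    for key, value in parameters.items():
        args += ["--parameters", f"{key}={value}"]
    args += ["--project", project]
    return args


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket, project, and paths.
    argv = flex_template_run_args(
        job_name="array-record-converter-job",
        template_gcs_path="gs://my-bucket/main.json",
        region="us-central1",
        staging_location="gs://my-bucket/staging",
        project="my-project",
        parameters={
            "input_path": "gs://my-bucket/input/data.csv",
            "input_format": "text",
            "output_path": "gs://my-bucket/output/",
        },
    )
    # Print a shell-safe command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning a list rather than a single string avoids shell-quoting problems if the arguments are later handed to `subprocess.run`.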

@PAenugula PAenugula marked this pull request as ready for review August 21, 2025 16:07
@liferoad liferoad requested review from Abacn and damccorm August 21, 2025 17:01
type = Template.TemplateType.PYTHON,
displayName = "ArrayRecord Converter Job",
description =
"The ArrayRecord_Converter Template is used to convert bulk text/image dataset in GCS into ArrayRecord Datasets in GCS"
Contributor

We can probably make this template hidden for now, or change the description to something like "use as is, no support".

Author

I think "use as is" and "no support" are fine. However, would users still need to pull this repository locally to build and deploy their own flex template and run it?

Contributor

I don't think we can publicly publish a new template and not support it. Per https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/maintainers-guide.md#introducing-new-templates we generally are not accepting new templates, and I don't think we should take this as a result. I'd recommend adding this as a sample (maybe in https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates) so that customers can take it and build it themselves there if needed.

def upload_to_gcs(
bucket_name, filename, prefix='', source_dir=self._WRITE_DIR
):
source_filename = os.path.join(source_dir, filename)
Contributor

This will be expensive, since the upload runs for every element. It would be better to use the existing GCS IO.
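
One possible reading of this suggestion, sketched in Python under stated assumptions: route the write through Apache Beam's `FileSystems` layer, which resolves `gs://` paths, instead of a hand-rolled upload helper. The names `upload_file`, `_beam_create`, and the `opener` parameter are hypothetical and do not appear in this PR. The sketch makes no claim about what the reviewer had in mind in detail.

```python
import shutil
from typing import BinaryIO, Callable


def _beam_create(path: str) -> BinaryIO:
    """Open `path` for writing via Beam's filesystem layer (handles gs:// paths)."""
    # Imported lazily so this module still loads where Beam is not installed.
    from apache_beam.io.filesystems import FileSystems

    return FileSystems.create(path)


def upload_file(
    source_path: str,
    dest_path: str,
    opener: Callable[[str], BinaryIO] = _beam_create,
) -> None:
    """Stream a local file to `dest_path` using the given opener.

    `opener` is an assumption of this sketch: it lets the destination be
    swapped for a local file handle in tests.
    """
    with open(source_path, "rb") as src, opener(dest_path) as dst:
        shutil.copyfileobj(src, dst)
```

In a pipeline, grouping elements so that each call writes a whole shard rather than a single element would also reduce the number of uploads.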


codecov bot commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.92%. Comparing base (1503db3) to head (ed3d094).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2626      +/-   ##
============================================
+ Coverage     49.82%   49.92%   +0.09%     
+ Complexity     5245     4913     -332     
============================================
  Files           954      954              
  Lines         58411    58476      +65     
  Branches       6320     6333      +13     
============================================
+ Hits          29106    29192      +86     
+ Misses        27222    27201      -21     
  Partials       2083     2083              
Components                         Coverage Δ
spanner-templates                  70.03% <ø> (+<0.01%) ⬆️
spanner-import-export              68.64% <ø> (+0.03%) ⬆️
spanner-live-forward-migration     79.09% <ø> (-0.03%) ⬇️
spanner-live-reverse-replication   76.54% <ø> (ø)
spanner-bulk-migration             88.09% <ø> (ø)
see 16 files with indirect coverage changes


3 participants