Adding python flex template to convert raw text/image objects to ArrayRecord #2626

PAenugula · 2025-08-07T18:11:42Z

Adds a Python flex template that bulk-converts raw text/image objects in GCS to ArrayRecord format, so that ML researchers can use the converted data in Grain pipelines.

The template accepts three parameters: input_path, input_format, and output_path.
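For illustration, here is a minimal sketch of how a Python entry point might read these three parameters. Only the parameter names come from this PR. The function name `parse_args`, the help strings, and the use of `parse_known_args` to pass remaining flags on to the pipeline are assumptions, and the actual main.py may handle them differently.

```python
import argparse


def parse_args(argv=None):
    """Split template parameters from the remaining pipeline flags.

    Hypothetical helper: only the three parameter names are taken from the PR.
    """
    parser = argparse.ArgumentParser(description="ArrayRecord converter (sketch)")
    parser.add_argument("--input_path", required=True,
                        help="Source object or prefix, e.g. a gs:// path")
    parser.add_argument("--input_format", required=True,
                        help="Kind of input data, e.g. 'text'")
    parser.add_argument("--output_path", required=True,
                        help="Destination prefix for the ArrayRecord files")
    # Flags the parser does not know (region, staging location, ...) are
    # returned untouched so they can be handed to the pipeline options.
    known, pipeline_args = parser.parse_known_args(argv)
    return known, pipeline_args
```

Calling `parse_args(["--input_path=gs://bucket/in.csv", "--input_format=text", "--output_path=gs://bucket/out/"])` returns the parsed parameters plus an empty list of leftover flags.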

To build and test:
Build the flex template image and the template file (for example):

gcloud builds submit . --tag us-central1-docker.pkg.dev/iamphani-gcp-grain/array-record-converter-base-image/my_base_image:latest --project iamphani-gcp-grain

gcloud dataflow flex-template build gs://flex-templating/main.json --image us-central1-docker.pkg.dev/iamphani-gcp-grain/array-record-converter-base-image/my_base_image:latest --sdk-language "PYTHON" --metadata-file=metadata.json --project iamphani-gcp-grain

Deploy a new Dataflow job (for example):

gcloud dataflow flex-template run "array-record-converter-job" --template-file-gcs-location "gs://flex-templating/main.json" --region "us-central1" --staging-location "gs://flex-templating/staging" --parameters input_path="gs://converter-datasets/input-datasets/google-top-terms/csv/1k/google_trends.csv" --parameters input_format="text" --parameters output_path="gs://converter-datasets/output-final/" --project iamphani-gcp-grain

liferoad · 2025-08-21T17:49:20Z

python/src/main/java/com/google/cloud/teleport/templates/python/ArrayRecordConverter.java

+    type = Template.TemplateType.PYTHON,
+    displayName = "ArrayRecord Converter Job",
+    description =
+        "The ArrayRecord_Converter Template is used to convert bulk text/image dataset in GCS into ArrayRecord Datasets in GCS"


We can probably make this template hidden for now, or change the description to say something like "use as is, no support".

PAenugula · 2025-08-21

I think "use as is" and "no support" is fine. However, would users still need to pull this repository locally to build and run their own flex template?

damccorm · 2025-08-21

I don't think we can publicly publish a new template and not support it. Per https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/maintainers-guide.md#introducing-new-templates we are generally not accepting new templates, so I don't think we should take this one. I'd recommend adding it as a sample instead (maybe in https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates) so that customers can take it from there and build it themselves if needed.

liferoad · 2025-08-21T21:32:52Z

python/src/main/python/arrayrecord-converter/main.py

+    def upload_to_gcs(
+        bucket_name, filename, prefix='', source_dir=self._WRITE_DIR
+    ):
+      source_filename = os.path.join(source_dir, filename)


This will be expensive, since it runs for every element. Better to use the existing GCS IO.
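As a rough illustration of this suggestion, the sketch below copies one finished local shard to its destination through Beam's `FileSystems.create`, which chooses the filesystem from the path scheme, rather than going through a separate upload helper for each element. The helper name `copy_shard`, the `open_fn` hook, and the chunk size are hypothetical and not part of this PR. Calling it once per shard (for example from a DoFn's `finish_bundle`) is also an assumption about how the pipeline could be restructured.

```python
def copy_shard(local_path, dest_path, open_fn=None, chunk_size=8 * 1024 * 1024):
    """Copy one finished local shard to dest_path through a single stream.

    Hypothetical helper. Returns the number of bytes copied.
    """
    if open_fn is None:
        # Assumption: apache_beam is installed on the worker. FileSystems
        # resolves gs:// and local paths through the same interface.
        from apache_beam.io.filesystems import FileSystems
        open_fn = FileSystems.create
    copied = 0
    dest = open_fn(dest_path)
    try:
        with open(local_path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dest.write(chunk)
                copied += len(chunk)
    finally:
        # Closing the stream is what finalizes the destination object.
        dest.close()
    return copied
```

The `open_fn` hook exists only so the helper can be exercised without Beam installed, for instance with `open_fn=lambda p: open(p, "wb")` against a temporary directory.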

codecov · 2025-08-21T21:58:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.92%. Comparing base (1503db3) to head (ed3d094).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2626      +/-   ##
============================================
+ Coverage     49.82%   49.92%   +0.09%     
+ Complexity     5245     4913     -332     
============================================
  Files           954      954              
  Lines         58411    58476      +65     
  Branches       6320     6333      +13     
============================================
+ Hits          29106    29192      +86     
+ Misses        27222    27201      -21     
  Partials       2083     2083

Components                          Coverage Δ
spanner-templates                   70.03% <ø> (+<0.01%) ⬆️
spanner-import-export               68.64% <ø> (+0.03%) ⬆️
spanner-live-forward-migration      79.09% <ø> (-0.03%) ⬇️
spanner-live-reverse-replication    76.54% <ø> (ø)
spanner-bulk-migration              88.09% <ø> (ø)
see 16 files with indirect coverage changes

pull-request-size bot added the size/L label Aug 7, 2025

PAenugula force-pushed the main branch from cc7e105 to a103ef9 Compare August 21, 2025 16:04

pull-request-size bot added size/XXL and removed size/L labels Aug 21, 2025

PAenugula marked this pull request as ready for review August 21, 2025 16:07

liferoad requested review from Abacn and damccorm August 21, 2025 17:01

liferoad reviewed Aug 21, 2025

Adding python flex template to convert raw text/image objects to ArrayRecord in GCS

ed3d094

PAenugula force-pushed the main branch from feb7a8d to ed3d094 Compare August 21, 2025 20:19

liferoad reviewed Aug 21, 2025
