Commit ed3d094

Adding python flex template to convert raw text/image objects to ArrayRecord in GCS
1 parent 1503db3 commit ed3d094

8 files changed: +3071 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -123,6 +123,7 @@ follow [GitHub's branch renaming guide](https://docs.github.com/en/repositories/
   - [Bulk Decompress Files on Cloud Storage](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20Bulk_Decompress_GCS_Files&type=code)
   - [Bulk Delete Entities in Firestore (Datastore mode)](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20Firestore_to_Firestore_Delete&type=code)
   - [Convert file formats between Avro, Parquet & CSV](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20File_Format_Conversion&type=code)
+  - [Convert text or images to ArrayRecord format](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20ArrayRecord_Converter&type=code)
   - [Streaming Data Generator](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20Streaming_Data_Generator&type=code)
   - Legacy Templates
   - [Bulk Delete Entities in Datastore [Deprecated]](https://github.com/search?q=repo%3AGoogleCloudPlatform%2FDataflowTemplates%20Datastore_to_Datastore_Delete&type=code)
Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@

Array Record Converter (Python) template
---
Batch pipeline. Reads text or image files from Cloud Storage and converts each file 1:1 into [ArrayRecord](https://github.com/google/array_record) format; a short read-back sketch follows the parameter list below.

:memo: This is a Google-provided template! Please
check [Provided templates documentation](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates)
on how to use it without having to build from sources using [Create job from template](https://console.cloud.google.com/dataflow/createjob?template=ArrayRecord_Converter).

:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates#metadata-annotations).
Do not change this file directly.

## Parameters

### Required Parameters

* **input_path** (Input file(s) in Cloud Storage): The input file pattern Dataflow reads from. Use the example file (gs://dataflow-samples/shakespeare/kinglear.txt) or enter the path to your own using the same format: gs://your-bucket/your-file.txt.
* **input_format** (Input file format): The format of the files intended for conversion. Currently supports `text` or `image`.
* **output_path** (Output Cloud Storage file prefix): Path prefix for writing output files. Ex: gs://your-bucket/arrayrecord.

### Optional Parameters

(This template has no optional parameters.)

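### Inspecting the Output

As a quick way to inspect what the job produced, records in a generated file can be read back with the same `array_record` package the pipeline uses for writing. This is a minimal sketch rather than part of the template: it assumes the `ArrayRecordReader` companion class from `array_record.python.array_record_module`, and the file name shown is hypothetical.

```python
# Sketch: inspect a generated ArrayRecord file after copying it locally, e.g.
#   gsutil cp gs://your-bucket/arrayrecord/kinglear.txt.arrayrecord /tmp/
from array_record.python.array_record_module import ArrayRecordReader

reader = ArrayRecordReader('/tmp/kinglear.txt.arrayrecord')
print('records:', reader.num_records())   # for text input, one record per line
for record in reader.read_all()[:5]:      # records are raw bytes
    print(record.decode('utf-8'))
reader.close()
```
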
## Getting Started

### Requirements

* Java 11
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=python/src/main/java/com/google/cloud/teleport/templates/python/ArrayRecordConverter.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates#templates-plugin).
Install the plugin with the following command before proceeding:

```shell
mvn clean install -pl plugins/templates-maven-plugin -am
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is to just stage the template (i.e., make it available to use) by
the `gcloud` command or the Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DstagePrefix="templates" \
  -DtemplateName="ArrayRecord_Converter" \
  -pl python \
  -am
```

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/ArrayRecord_Converter
```

The specific path should be copied, as it will be used in the following steps.

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you need valid
resources for the required parameters.

With those in place, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/ArrayRecord_Converter"

### Required
export INPUT_PATH=<input>
export INPUT_FORMAT=<text|image>
export OUTPUT_PATH=<output>

### Optional

gcloud dataflow flex-template run "array-record-converter-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "input_path=$INPUT_PATH" \
  --parameters "input_format=$INPUT_FORMAT" \
  --parameters "output_path=$OUTPUT_PATH"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run

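The same job can also be launched programmatically instead of through `gcloud`. The snippet below is a hedged sketch, not part of this commit: it assumes the `google-cloud-dataflow-client` package (module `google.cloud.dataflow_v1beta3`), and the project, region, bucket, and paths are placeholders.

```python
# Hypothetical programmatic launch of the staged Flex Template
# (assumes: pip install google-cloud-dataflow-client).
from google.cloud import dataflow_v1beta3

client = dataflow_v1beta3.FlexTemplatesServiceClient()
response = client.launch_flex_template(
    request=dataflow_v1beta3.LaunchFlexTemplateRequest(
        project_id='my-project',
        location='us-central1',
        launch_parameter=dataflow_v1beta3.LaunchFlexTemplateParameter(
            job_name='array-record-converter-job',
            container_spec_gcs_path='gs://my-bucket/templates/flex/ArrayRecord_Converter',
            parameters={
                'input_path': 'gs://my-bucket/samples/*.txt',
                'input_format': 'text',
                'output_path': 'gs://my-bucket/arrayrecord-out',
            },
        ),
    )
)
print(response.job.id, response.job.current_state)
```
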
**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export INPUT_PATH=<input>
export INPUT_FORMAT=<text|image>
export OUTPUT_PATH=<output>

### Optional

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="array-record-converter-job" \
  -DtemplateName="ArrayRecord_Converter" \
  -Dparameters="input_path=$INPUT_PATH,input_format=$INPUT_FORMAT,output_path=$OUTPUT_PATH" \
  -pl python \
  -am
```
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
/*
 * Copyright (C) 2025 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates.python;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;

/** Template class for ArrayRecordConverter in Python. */
@Template(
    name = "ArrayRecord_Converter",
    category = TemplateCategory.UTILITIES,
    type = Template.TemplateType.PYTHON,
    displayName = "ArrayRecord Converter Job",
    description =
        "The ArrayRecord_Converter template is used to convert bulk text/image datasets in GCS into ArrayRecord datasets in GCS. "
            + "An input GCS path can be passed in, and the resulting data will be uploaded to the `output_path`.",
    flexContainerName = "arrayrecord-converter",
    contactInformation = "https://cloud.google.com/support")
public interface ArrayRecordConverter {

  @TemplateParameter.GcsReadFile(
      order = 1,
      name = "input_path",
      optional = false,
      description = "Input GCS path to match all files (e.g., gs://example/*.txt)",
      helpText = "An input path in the form of a GCS path.")
  String getInputPath();

  @TemplateParameter.Text(
      order = 2,
      name = "input_format",
      optional = false,
      description = "Input format of the data; can be either text or image",
      helpText =
          "The format of the input, which can be either text or image. This job does not support other values.")
  String getInputFormat();

  @TemplateParameter.GcsWriteFolder(
      order = 3,
      name = "output_path",
      optional = false,
      description = "Output GCS path where files are generated",
      helpText = "Output path in the form of a GCS path where files will be uploaded.")
  String getOutputPath();
}
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
FROM gcr.io/dataflow-templates-base/python311-template-launcher-base

ARG WORKDIR=/template
RUN mkdir -p ${WORKDIR}
COPY main.py /template
COPY requirements.txt /template
WORKDIR ${WORKDIR}

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=requirements.txt
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=main.py

# Install dependencies to launch the pipeline
RUN pip install -U --require-hashes --no-deps -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
RUN pip download --no-cache-dir --require-hashes --no-deps --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
#
# Copyright (C) 2025 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.
#
"""A template workflow to convert raw objects (text, images) into ArrayRecord files."""

import argparse
import logging
import os
import urllib.parse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from array_record.python.array_record_module import ArrayRecordWriter
from google.cloud import storage


class ConvertToArrayRecordGCS(beam.DoFn):
  """Write a tuple consisting of a filename and records to GCS ArrayRecords."""

  _WRITE_DIR = '/tmp/'

  def process(
      self,
      element,
      path,
      write_dir=_WRITE_DIR,
      file_path_suffix='.arrayrecord',
      overwrite_extension=False,
  ):

    ## Upload a locally written file to GCS.
    def upload_to_gcs(bucket_name, filename, prefix='', source_dir=self._WRITE_DIR):
      source_filename = os.path.join(source_dir, filename)
      blob_name = os.path.join(prefix, filename)
      storage_client = storage.Client()
      bucket = storage_client.get_bucket(bucket_name)
      blob = bucket.blob(blob_name)
      blob.upload_from_filename(source_filename)

    ## Simple logic for stripping a file extension and replacing it.
    def fix_filename(filename):
      base_name = os.path.splitext(filename)[0]
      new_filename = base_name + file_path_suffix
      return new_filename

    parsed_gcs_path = urllib.parse.urlparse(path)
    bucket_name = parsed_gcs_path.hostname
    gcs_prefix = parsed_gcs_path.path.lstrip('/')

    if overwrite_extension:
      filename = fix_filename(os.path.basename(element[0]))
    else:
      filename = '{}{}'.format(os.path.basename(element[0]), file_path_suffix)

    # Write all records for this source file into a local ArrayRecord file.
    write_path = os.path.join(write_dir, filename)
    writer = ArrayRecordWriter(write_path, 'group_size:1')

    for item in element[1]:
      writer.write(bytes(item, 'utf-8'))

    writer.close()

    upload_to_gcs(bucket_name, filename, prefix=gcs_prefix)
    os.remove(os.path.join(write_dir, filename))


def run(argv=None, save_main_session=True):
  """Main entry point; defines and runs the ArrayRecord converter pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--input_path',
      dest='input_path',
      default=(
          'gs://converter-datasets/input-datasets/google-top-terms/csv/1k/*.csv'
      ),
      help='Input file to process.',
  )
  parser.add_argument(
      '--input_format',
      dest='input_format',
      default='text',
      help='Input file format.',
  )
  parser.add_argument(
      '--output_path',
      dest='output_path',
      required=True,
      help='Output destination to write results to.',
  )
  known_args, pipeline_args = parser.parse_known_args(argv)

  # We use the save_main_session option because one or more DoFn's in this
  # workflow rely on global context (e.g., a module imported at module level).
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

  # The pipeline will be run on exiting the with block.
  with beam.Pipeline(options=pipeline_options) as p:
    # TODO(iamphani): Move this out to array_record/beam/pipelines.py once it is
    # updated to support raw text.

    files = p | "Start" >> beam.Create([known_args.input_path])
    if known_args.input_format == 'text':
      parsed_files = files | "Read Text Files" >> beam.io.ReadAllFromText(
          with_filename=True
      )
    elif known_args.input_format == 'image':
      parsed_files = files | "Read Image Files" >> beam.io.ReadAllFromBinaryFiles(with_filename=True)
    else:
      raise ValueError(f"Unsupported input format: {known_args.input_format}")

    # Group records by source filename so each input file yields one ArrayRecord file.
    _ = (
        parsed_files
        | "Group" >> beam.GroupByKey()
        | "Write to ArrayRecord in GCS"
        >> beam.ParDo(
            ConvertToArrayRecordGCS(),
            known_args.output_path,
            file_path_suffix=".arrayrecord",
            overwrite_extension=False,
        )
    )


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()
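
For quick iteration on main.py before building the Flex Template container, the pipeline can also be exercised on the DirectRunner. The snippet below is a hedged sketch, not part of this commit: the bucket and paths are placeholders, and it assumes `apache-beam[gcp]`, `array_record`, and `google-cloud-storage` are installed locally.

```python
# Hypothetical local smoke test for main.py (not included in this commit).
# Paths are placeholders; output is still uploaded to GCS because the DoFn
# writes through the google-cloud-storage client.
import main  # assumes the file above is saved as main.py on the PYTHONPATH

main.run([
    '--input_path=gs://your-bucket/samples/*.txt',
    '--input_format=text',
    '--output_path=gs://your-bucket/arrayrecord-out',
    '--runner=DirectRunner',
])
```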
