This repo shows how to specify a custom container image with DataflowRunner and use packages in the pipeline that are baked into the container image.
Dataflow workers come with many pre-installed packages for the Python SDK. To demonstrate using a package that does not come pre-installed, this pipeline uses the tabulate package, which pretty-prints tabular output.
The pipeline is very simple: it reads lines from an input file, formats each line using tabulate, and writes the result to an output file (a minimal sketch of the core transform follows the examples below). An example of the input.txt file is:
John Doe 001 98034
Alice Ryan 002 67678
Bob Riley 003 23450
An example of the formatted output is:
----  ---  ---  -----
John  Doe  001  98034
----  ---  ---  -----
-----  ----  ---  -----
Alice  Ryan  002  67678
-----  ----  ---  -----
---  -----  ---  -----
Bob  Riley  003  23450
---  -----  ---  -----
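The core transform can be sketched as follows. This is a minimal sketch rather than the repo's exact code (the class name and file paths are illustrative), but it reproduces the formatting above: tabulate's default "simple" style renders a headerless row between two dashed rules.

import apache_beam as beam
from tabulate import tabulate

class FormatLine(beam.DoFn):
    def process(self, line):
        # Split the whitespace-delimited line into fields and render it as
        # its own single-row table in tabulate's default "simple" format.
        yield tabulate([line.split()])

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Format' >> beam.ParDo(FormatLine())
     | 'Write' >> beam.io.WriteToText('output'))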
I've included a setup.py file to demonstrate the use of a multi-file pipeline (a minimal sketch appears after the command). To execute this pipeline, run:
python -m tabulate_example --input $INPUT \
--output $OUTPUT \
--runner DataflowRunner \
--project $PROJECT_ID \
--region $REGION \
--staging_location $STAGING_LOCATION \
--temp_location $TEMP_LOCATION \
--experiments $EXPERIMENT \
--job_name $JOB_NAME \
--worker_harness_container_image $WORKER_HARNESS_CONTAINER_IMAGE \
--setup_file ./setup.py
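For reference, a minimal setup.py for a multi-file pipeline might look like the sketch below; the package name and version are illustrative, not the repo's actual values:

import setuptools

setuptools.setup(
    name='tabulate-example',  # illustrative name
    version='0.1.0',          # illustrative version
    packages=setuptools.find_packages(),
)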
Keep in mind that Dataflow requires Python 3.6, 3.7, or 3.8.
When running without the custom container, the job fails (as expected) with a ModuleNotFoundError, since tabulate is not installed on the default worker image.
After specifying the custom container in the PipelineOptions, the job runs successfully.
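The container image can also be set programmatically rather than on the command line. A sketch, with placeholder values throughout:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder
    region='us-central1',                 # placeholder
    temp_location='gs://my-bucket/temp',  # placeholder
    experiments=['use_runner_v2'],        # placeholder experiment
    # Placeholder image path; use the image pushed to Container Registry.
    worker_harness_container_image='gcr.io/my-project/beam-tabulate:latest',
)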
The Dockerfile for the image is below:
# Base Image
FROM apache/beam_python3.8_sdk:2.25.0
# Install dependencies
RUN pip install tabulate
This image was pushed to Container Registry.
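Building and pushing can be done with standard Docker commands; the image path here is a placeholder:

docker build -t gcr.io/$PROJECT_ID/beam-tabulate:latest .
docker push gcr.io/$PROJECT_ID/beam-tabulate:latest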