| name | python-development |
| description | Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/. |
Python Development in Apache Beam
Project Structure
Key Directories
sdks/python/ - Python SDK root
  apache_beam/ - Main Beam package
    transforms/ - Core transforms (ParDo, GroupByKey, etc.)
    io/ - I/O connectors
    ml/ - Beam ML code (RunInference, etc.)
    runners/ - Runner implementations and wrappers
    runners/worker/ - SDK worker harness
  container/ - Docker container configuration
  test-suites/ - Test configurations
  scripts/ - Utility scripts
Configuration Files
setup.py - Package configuration
pyproject.toml - Build configuration
tox.ini - Test automation
pytest.ini - Pytest configuration
ruff.toml - Linting rules
.isort.cfg - Import sorting
pyrefly.toml - Type checking
Environment Setup
Using pyenv (Recommended)
pyenv install 3.X
pyenv virtualenv 3.X beam-dev
pyenv activate beam-dev
Install in Editable Mode
cd sdks/python
pip install -e ".[gcp,test]"
Enable Pre-commit Hooks
pip install pre-commit
pre-commit install
pre-commit uninstall  # run this only to disable the hooks again
Running Tests
Unit Tests (filename: *_test.py)
pytest -v apache_beam/io/textio_test.py
pytest -v apache_beam/io/textio_test.py::TextSourceTest
pytest -v apache_beam/io/textio_test.py::TextSourceTest::test_progress
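The three invocations above differ only in how much of the pytest node ID is spelled out: file, then class, then method, joined by `::`. As a small illustration, the sketch below assembles such an ID from its parts. `pytest_node_id` is a hypothetical helper written for this guide, not something Beam or pytest provides.

```python
def pytest_node_id(path, test_class=None, test_method=None):
    """Join a test file, class, and method into a pytest node ID.

    Hypothetical helper for illustration only. Parts that are None are
    skipped, so the same function covers all three levels of selection.
    """
    parts = [path, test_class, test_method]
    return "::".join(part for part in parts if part)


# Whole file, one test class, one test method.
print(pytest_node_id("apache_beam/io/textio_test.py"))
print(pytest_node_id("apache_beam/io/textio_test.py", "TextSourceTest"))
print(pytest_node_id(
    "apache_beam/io/textio_test.py", "TextSourceTest", "test_progress"))
```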
Integration Tests (filename: *_it_test.py)
On Direct Runner
python -m pytest -o log_cli=True -o log_level=Info \
apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
--test-pipeline-options='--runner=TestDirectRunner'
On Dataflow Runner
pip install build && python -m build --sdist
python -m pytest -o log_cli=True -o log_level=Info \
apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
--test-pipeline-options='--runner=TestDataflowRunner --project=<project>
--temp_location=gs://<bucket>/tmp
--sdk_location=dist/apache-beam-2.XX.0.dev0.tar.gz
--region=us-central1'
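Because every pipeline flag travels inside a single `--test-pipeline-options` argument, shell quoting is easy to get wrong. One way to sidestep it, sketched here under the assumption that no option value contains whitespace, is to build the argument list in Python and hand it to `subprocess.run` as a list. `build_test_pipeline_options` is a hypothetical helper, and the `<project>` and `<bucket>` placeholders are left unfilled on purpose.

```python
import shlex


def build_test_pipeline_options(options):
    """Render a dict of pipeline options as one pytest argument.

    Hypothetical helper; assumes option values contain no whitespace.
    """
    flags = " ".join(f"--{name}={value}" for name, value in options.items())
    return f"--test-pipeline-options={flags}"


# Placeholders copied from the command above; replace them before running.
dataflow_options = {
    "runner": "TestDataflowRunner",
    "project": "<project>",
    "temp_location": "gs://<bucket>/tmp",
    "sdk_location": "dist/apache-beam-2.XX.0.dev0.tar.gz",
    "region": "us-central1",
}

argv = [
    "python", "-m", "pytest", "-o", "log_cli=True", "-o", "log_level=Info",
    "apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference",
    build_test_pipeline_options(dataflow_options),
]

# Show the equivalent shell command. To execute it, pass argv as a list,
# for example subprocess.run(argv, check=True), so no shell quoting is needed.
print(shlex.join(argv))
```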
Building Python SDK
Build Source Distribution
cd sdks/python
pip install build && python -m build --sdist
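The tarball name includes the current development version, so it changes over time. A minimal sketch for picking up whatever was built most recently, assuming the sdist lands in a `dist/` directory next to where the build was run (as the `--sdk_location=dist/...` example suggests); `newest_sdist` is a hypothetical helper, not part of Beam.

```python
from pathlib import Path


def newest_sdist(dist_dir="dist"):
    """Return the most recently modified .tar.gz in dist_dir, or None.

    Hypothetical helper; the dist/ location is an assumption.
    """
    directory = Path(dist_dir)
    if not directory.is_dir():
        return None
    tarballs = sorted(
        directory.glob("*.tar.gz"), key=lambda path: path.stat().st_mtime)
    return tarballs[-1] if tarballs else None


if __name__ == "__main__":
    sdist = newest_sdist()
    if sdist is None:
        print("No sdist found; run `python -m build --sdist` first.")
    else:
        # Ready to paste into a pipeline or test command line.
        print(f"--sdk_location={sdist}")
```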
Build Wheel (faster installation)
./gradlew :sdks:python:bdistPy311linux
Build and Push SDK Container Image
./gradlew :sdks:python:container:py311:docker \
-Pdocker-repository-root=gcr.io/your-project/your-name \
-Pdocker-tag=custom \
-Ppush-containers
To use this container image, supply it via --sdk_container_image.
Running Pipelines with Modified Code
pip install "/path/to/apache-beam.tar.gz[gcp]"
python my_pipeline.py \
--runner=DataflowRunner \
--sdk_location=/path/to/apache-beam.tar.gz \
--project=my_project \
--region=us-central1 \
--temp_location=gs://my-bucket/temp
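For flags such as `--runner` and `--sdk_location` to reach Beam, a script like `my_pipeline.py` typically parses only its own arguments and forwards the rest. The sketch below shows that split using only the standard library. The `--input` flag and the `split_args` name are assumptions for illustration; in a real pipeline the leftover list would be passed to `PipelineOptions(pipeline_args)` from `apache_beam.options.pipeline_options`.

```python
import argparse


def split_args(argv):
    """Separate script-specific flags from flags meant for the runner."""
    parser = argparse.ArgumentParser()
    # Hypothetical application flag; a real script defines its own.
    parser.add_argument("--input", default="input.txt")
    # parse_known_args returns unrecognized flags instead of failing on them.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args


known_args, pipeline_args = split_args([
    "--input=data.txt",
    "--runner=DataflowRunner",
    "--sdk_location=/path/to/apache-beam.tar.gz",
    "--region=us-central1",
])

print(known_args.input)
# pipeline_args now holds only the runner-related flags, in their
# original order, ready to be handed to PipelineOptions.
print(pipeline_args)
```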
Common Issues
NameError when running DoFn
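Global imports, functions, and variables defined in the main pipeline module are not serialized and shipped to the workers by default, so a DoFn that refers to them can fail with NameError at run time. To pickle the main session along with the pipeline, pass: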
--save_main_session
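A commonly used alternative is to avoid relying on module-level names at all by importing inside the function that runs on the worker. The sketch below is plain Python so it runs without Beam installed; `parse_record` is a hypothetical function of the kind you might pass to a transform such as `beam.Map`.

```python
def parse_record(line):
    """Parse one JSON line.

    The import lives inside the function, so the name `json` is resolved
    when the function executes rather than looked up in the main module's
    globals, which may not exist on a remote worker.
    """
    import json
    return json.loads(line)


# Local stand-in for what a worker would do with each element.
records = [parse_record(line) for line in ['{"id": 1}', '{"id": 2}']]
print(records)
```

Local imports suit small helper functions; for pipelines with many shared globals, `--save_main_session` or moving the code into an installable package is usually less intrusive.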
Specifying Additional Dependencies
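Pass --requirements_file=requirements.txt to install extra PyPI packages on the workers, or bake the dependencies into a custom SDK container image and supply it via --sdk_container_image.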
Test Markers
@pytest.mark.it_postcommit - Include in PostCommit test suite
Gradle Commands for Python
./gradlew :sdks:python:wordCount  # run the Python WordCount example
./gradlew :checkSetup  # check that the development environment is set up correctly
Code Quality Tools
ruff check apache_beam/
pyrefly check apache_beam/
yapf -i apache_beam/file.py
isort apache_beam/file.py