mit einem Klick
python-development
// Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/.
// Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/.
Guides YAML SDK development in Apache Beam, including environment setup, testing, and key concepts. Use when working with Beam YAML code in sdks/python/apache_beam/yaml/.
Guide on how to add and propagate new metadata fields in Apache Beam's WindowedValue, extending protos, windmill persistence, and runner interfaces to avoid metadata loss.
Guides understanding and working with Apache Beam runners (Direct, Dataflow, Flink, Spark, etc.). Use when configuring pipelines for different execution environments or debugging runner-specific issues.
Rewrite Apache Beam DoFn methods (@ProcessElement, @OnTimer, @OnWindowExpiration) to remove legacy ProcessContext or OnTimerContext usage. Use this skill when you encounter DoFn methods that use context.element(), context.output(), etc., and need to modernize them using parameter injection (@Element, @Timestamp, @Pane, OutputReceiver, MultiOutputReceiver).
Guides understanding and using the Gradle build system in Apache Beam. Use when building projects, understanding dependencies, or troubleshooting build issues.
Explains core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, and Runners. Use when learning Beam fundamentals or explaining pipeline concepts.
| name | python-development |
| description | Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/. |
sdks/python/ - Python SDK root
apache_beam/ - Main Beam package
transforms/ - Core transforms (ParDo, GroupByKey, etc.)io/ - I/O connectorsml/ - Beam ML code (RunInference, etc.)runners/ - Runner implementations and wrappersrunners/worker/ - SDK worker harnesscontainer/ - Docker container configurationtest-suites/ - Test configurationsscripts/ - Utility scriptssetup.py - Package configurationpyproject.toml - Build configurationtox.ini - Test automationpytest.ini - Pytest configurationruff.toml - Linting rules.isort.cfg - Import sortingpyrefly.toml - Type checking# Install Python
pyenv install 3.X # Use supported version from gradle.properties
# Create virtual environment
pyenv virtualenv 3.X beam-dev
pyenv activate beam-dev
cd sdks/python
pip install -e .[gcp,test]
pip install pre-commit
pre-commit install
# To disable
pre-commit uninstall
*_test.py)# Run all tests in a file
pytest -v apache_beam/io/textio_test.py
# Run tests in a class
pytest -v apache_beam/io/textio_test.py::TextSourceTest
# Run a specific test
pytest -v apache_beam/io/textio_test.py::TextSourceTest::test_progress
*_it_test.py)python -m pytest -o log_cli=True -o log_level=Info \
apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
--test-pipeline-options='--runner=TestDirectRunner'
# First build SDK tarball
pip install build && python -m build --sdist
# Run integration test
python -m pytest -o log_cli=True -o log_level=Info \
apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
--test-pipeline-options='--runner=TestDataflowRunner --project=<project>
--temp_location=gs://<bucket>/tmp
--sdk_location=dist/apache-beam-2.XX.0.dev0.tar.gz
--region=us-central1'
cd sdks/python
pip install build && python -m build --sdist
# Output: sdks/python/dist/apache-beam-X.XX.0.dev0.tar.gz
./gradlew :sdks:python:bdistPy311linux # For Python 3.11 on Linux
./gradlew :sdks:python:container:py311:docker \
-Pdocker-repository-root=gcr.io/your-project/your-name \
-Pdocker-tag=custom \
-Ppush-containers
# Container image will be pushed to: gcr.io/your-project/your-name/beam_python3.11_sdk:custom
To use this container image, supply it via --sdk_container_image.
# Install modified SDK
pip install /path/to/apache-beam.tar.gz[gcp]
# Run pipeline
python my_pipeline.py \
--runner=DataflowRunner \
--sdk_location=/path/to/apache-beam.tar.gz \
--project=my_project \
--region=us-central1 \
--temp_location=gs://my-bucket/temp
NameError when running DoFnGlobal imports, functions, and variables in the main pipeline module are not serialized by default. Use:
--save_main_session
Use --requirements_file=requirements.txt or custom containers.
@pytest.mark.it_postcommit - Include in PostCommit test suite# Run WordCount
./gradlew :sdks:python:wordCount
# Check environment
./gradlew :checkSetup
# Linting
ruff check apache_beam/
# Type checking
pyrefly check apache_beam/
# Formatting (via yapf)
yapf -i apache_beam/file.py
# Import sorting
isort apache_beam/file.py