| name | yaml-development |
| description | Guides YAML SDK development in Apache Beam, including environment setup, testing, and key concepts. Use when working with Beam YAML code in sdks/python/apache_beam/yaml/. |
YAML Development in Apache Beam
Project Structure
Key Files in sdks/python/apache_beam/yaml/
integration_tests.py - Runs integration tests defined in YAML files or using testcontainers.
main.py - Entry point for running YAML pipelines from the command line.
pipeline.schema.yaml - JSON schema defining the valid structure for Beam YAML pipelines.
standard_io.yaml - Declarations of standard IO transforms and their mappings to providers.
standard_providers.yaml - Configuration for standard providers (e.g., Java expansion services).
yaml_combine.py - Implementations for aggregation and combining operations.
yaml_io.py - Mappings and logic for IO transforms (e.g., PubSub, BigQuery, Iceberg).
yaml_join.py - Implementations for join operations.
yaml_mapping.py - Implementations for mapping operations (e.g., MapToFields).
yaml_provider.py - Manages providers (Python, Java cross-language) that implement transforms.
yaml_transform.py - Core YAML expansion logic, parsing, and translation to Beam pipelines.
Environment Setup
Because Beam YAML is implemented within the Python SDK, its environment setup is identical to Python SDK development. Refer to the python-development skill for details on using pyenv and installing in editable mode (e.g., run pip install -e "sdks/python[gcp,test]" from the repository root; the quotes keep shells such as zsh from interpreting the square brackets).
Running YAML Pipelines
Beam YAML pipelines are run through main.py in the YAML directory, invoked as the module apache_beam.yaml.main.
Using main.py directly
python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [pipeline_options]
Example: Running locally
python -m apache_beam.yaml.main \
--yaml_pipeline_file=sdks/python/apache_beam/yaml/examples/simple_filter.yaml \
--runner=DirectRunner
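As a minimal sketch of what the local run above consumes, the following Python script writes a small pipeline file and builds the matching command line. The pipeline contents, the file name example_pipeline.yaml, and the helper names build_command and write_pipeline are assumptions made for illustration; they are not taken from simple_filter.yaml.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical three-step pipeline used only for illustration.
# It is not a copy of any example file shipped with Beam.
PIPELINE_YAML = """\
pipeline:
  type: chain
  transforms:
    - type: Create
      config:
        elements:
          - {name: apple, count: 3}
          - {name: pear, count: 0}
    - type: Filter
      config:
        language: python
        keep: "count > 0"
    - type: LogForDebugging
"""


def write_pipeline(directory):
    """Write the example pipeline into `directory` and return its path."""
    path = Path(directory) / "example_pipeline.yaml"
    path.write_text(PIPELINE_YAML)
    return path


def build_command(pipeline_path, runner="DirectRunner"):
    """Return the argv list that runs a YAML pipeline file via main.py."""
    return [
        sys.executable, "-m", "apache_beam.yaml.main",
        f"--yaml_pipeline_file={pipeline_path}",
        f"--runner={runner}",
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        command = build_command(write_pipeline(tmp))
        print(" ".join(command))
        # To execute the pipeline (requires apache_beam to be installed):
        # import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the path contains spaces.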
Example: Running on Dataflow
python -m apache_beam.yaml.main \
--yaml_pipeline_file=sdks/python/apache_beam/yaml/examples/simple_filter.yaml \
--runner=DataflowRunner \
--project=my-project \
--region=us-central1 \
--temp_location=gs://my-bucket/temp
Running Tests
Unit Tests
Beam YAML has extensive unit tests covering parsing, expansion, and specific transforms. Run a whole test file, or select a single test with pytest's file::Class::test syntax:
pytest sdks/python/apache_beam/yaml/yaml_transform_test.py
pytest sdks/python/apache_beam/yaml/yaml_transform_test.py::YamlTransformTest::test_simple_pipeline
Integration Tests
Integration tests often spin up Docker containers (via testcontainers) for external services such as MongoDB, Kafka, or other databases. Use -k to select tests by name:
pytest sdks/python/apache_beam/yaml/integration_tests.py -k mongodb
Key Concepts
Providers
Beam YAML uses "providers" to find implementations for transforms requested in the YAML file.
- Inline/Python Providers: Leverage Python functions or PTransforms directly.
- Java/External Providers: Use Beam's cross-language capabilities to invoke Java transforms via an expansion service.
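The lookup idea behind providers can be sketched in plain Python as follows. This is a simplified illustration, not the code in yaml_provider.py: ToyProvider, find_provider, and the provider names are hypothetical, and the real providers carry much more state (expansion services, dependencies, and so on).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a provider: a name plus the transform types it offers."""
    name: str
    transform_types: Set[str] = field(default_factory=set)

    def provides(self, transform_type: str) -> bool:
        return transform_type in self.transform_types


def find_provider(transform_type: str, providers: List[ToyProvider]) -> ToyProvider:
    """Return the first provider offering `transform_type`, or raise ValueError."""
    for provider in providers:
        if provider.provides(transform_type):
            return provider
    raise ValueError(f"No provider found for transform type {transform_type!r}")


# Illustrative registry: one Python-backed and one Java-backed provider.
PROVIDERS = [
    ToyProvider("python-inline", {"MapToFields", "Filter"}),
    ToyProvider("java-expansion-service", {"ReadFromKafka"}),
]

if __name__ == "__main__":
    for requested in ("MapToFields", "ReadFromKafka"):
        print(requested, "->", find_provider(requested, PROVIDERS).name)
```

A transform type that no provider offers fails at lookup time, which in this sketch surfaces as a ValueError before any pipeline would run.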
Preprocessing
Before execution, a YAML pipeline is preprocessed to resolve schemas, match transforms to providers, and expand shorthand notations (like chain or source/sink composites).
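To make the shorthand expansion concrete, here is a minimal sketch of how a chain could be rewritten into a composite whose transforms name their inputs explicitly. It works on plain dictionaries and is an assumption-laden illustration: expand_chain and the generated names are hypothetical, and the real preprocessing in yaml_transform.py handles many more cases.

```python
import copy


def expand_chain(spec):
    """Rewrite a `type: chain` spec so each transform names its input explicitly.

    Specs of any other type are returned as an unmodified deep copy.
    """
    if spec.get("type") != "chain":
        return copy.deepcopy(spec)

    expanded = []
    previous_name = None
    for index, transform in enumerate(spec["transforms"]):
        transform = copy.deepcopy(transform)
        # Give unnamed transforms a generated name (hypothetical naming scheme).
        transform.setdefault("name", f"{transform['type']}_{index}")
        if previous_name is not None:
            # Each transform after the first reads from its predecessor.
            transform["input"] = previous_name
        previous_name = transform["name"]
        expanded.append(transform)
    return {"type": "composite", "transforms": expanded}


if __name__ == "__main__":
    chain = {
        "type": "chain",
        "transforms": [
            {"type": "Create", "config": {"elements": [1, 2, 3]}},
            {"type": "LogForDebugging"},
        ],
    }
    for transform in expand_chain(chain)["transforms"]:
        print(transform)
```

The function copies its input rather than mutating it, so the original spec stays available for error messages.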
Common Issues
Cross-language Failures
If a test requires a Java transform, ensure that:
- Docker is running (if using testcontainers).
- The correct expansion service is available or can be started.
- The Java environment is correctly configured (some transforms require a specific Java version, such as Java 17 or 21).
Schema Mismatches
Beam YAML relies heavily on Beam schemas. Ensure that the fields produced by one transform match the fields expected by the next. If they differ, insert an explicit mapping step (e.g., MapToFields) to rename or derive the required fields.
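The kind of check involved can be sketched with plain field-name lists. This is only an illustration of the reasoning, assuming hypothetical helpers missing_fields and apply_mapping; Beam performs its own schema handling and does not expose these functions.

```python
def missing_fields(produced, expected):
    """Return the expected field names that the upstream transform does not produce."""
    return sorted(set(expected) - set(produced))


def apply_mapping(produced, mapping):
    """Return the output field names of a rename-style mapping.

    `mapping` maps each new field name to an existing upstream field name.
    """
    unknown = sorted(set(mapping.values()) - set(produced))
    if unknown:
        raise ValueError(f"Mapping refers to fields that are not produced: {unknown}")
    return list(mapping)


if __name__ == "__main__":
    upstream = ["user_id", "ts"]
    downstream_expects = ["id", "timestamp"]

    # Without a mapping step, both expected fields are missing.
    print(missing_fields(upstream, downstream_expects))

    # A rename-style mapping closes the gap.
    mapped = apply_mapping(upstream, {"id": "user_id", "timestamp": "ts"})
    print(missing_fields(mapped, downstream_expects))
```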