yaml-development
Guides YAML SDK development in Apache Beam, including environment setup, testing, and key concepts. Use when working with Beam YAML code in sdks/python/apache_beam/yaml/.
| name | yaml-development |
| description | Guides YAML SDK development in Apache Beam, including environment setup, testing, and key concepts. Use when working with Beam YAML code in sdks/python/apache_beam/yaml/. |
Key files in sdks/python/apache_beam/yaml/:

- integration_tests.py - Runs integration tests defined in YAML files or using testcontainers.
- main.py - Entry point for running YAML pipelines from the command line.
- pipeline.schema.yaml - JSON schema defining the valid structure for Beam YAML pipelines.
- standard_io.yaml - Declarations of standard IO transforms and their mappings to providers.
- standard_providers.yaml - Configuration for standard providers (e.g., Java expansion services).
- yaml_combine.py - Implementations for aggregation and combining operations.
- yaml_io.py - Mappings and logic for IO transforms (e.g., PubSub, BigQuery, Iceberg).
- yaml_join.py - Implementations for join operations.
- yaml_mapping.py - Implementations for mapping operations (e.g., MapToFields).
- yaml_provider.py - Manages providers (Python, Java cross-language) that implement transforms.
- yaml_transform.py - Core YAML expansion logic, parsing, and translation to Beam pipelines.

Since Beam YAML is implemented within the Python SDK, the environment setup is identical to Python development. Refer to the python-development skill for details on using pyenv and installing in editable mode (e.g., run pip install -e sdks/python[gcp,test] from the root directory).
You can run Beam YAML pipelines using the main.py script in the YAML directory.
Run main.py directly:
python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [pipeline_options]
python -m apache_beam.yaml.main \
--yaml_pipeline_file=sdks/python/apache_beam/yaml/examples/simple_filter.yaml \
--runner=DirectRunner
python -m apache_beam.yaml.main \
--yaml_pipeline_file=sdks/python/apache_beam/yaml/examples/simple_filter.yaml \
--runner=DataflowRunner \
--project=my-project \
--region=us-central1 \
--temp_location=gs://my-bucket/temp
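When these invocations need to be scripted (for example from a test harness), the same command line can be assembled programmatically. The following is a minimal sketch using only the standard library; `build_yaml_main_command` is a hypothetical helper, not part of Beam, and it assumes every extra pipeline option can be written as a `--key=value` flag.

```python
import shlex
import sys


def build_yaml_main_command(pipeline_file, runner="DirectRunner", **options):
    """Return the argv list for running a YAML pipeline via apache_beam.yaml.main.

    Hypothetical helper for illustration; it only builds the command.
    """
    argv = [
        sys.executable, "-m", "apache_beam.yaml.main",
        f"--yaml_pipeline_file={pipeline_file}",
        f"--runner={runner}",
    ]
    # Extra pipeline options (project, region, temp_location, ...) are
    # assumed to map directly onto --key=value flags.
    argv.extend(f"--{key}={value}" for key, value in options.items())
    return argv


command = build_yaml_main_command(
    "sdks/python/apache_beam/yaml/examples/simple_filter.yaml",
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)
# Print the arguments after the interpreter path for inspection.
print(shlex.join(command[1:]))
```

To actually launch the pipeline, `command` could be passed to `subprocess.run(command, check=True)` from an environment where the Beam Python SDK is installed.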
Beam YAML has extensive unit tests covering parsing, expansion, and specific transforms.
# Run all tests in a file
pytest sdks/python/apache_beam/yaml/yaml_transform_test.py
# Run a specific test
pytest sdks/python/apache_beam/yaml/yaml_transform_test.py::YamlTransformTest::test_simple_pipeline
Integration tests often spin up Docker containers (via testcontainers) for external services like MongoDB, Kafka, or databases.
# Run integration tests matching a specific keyword (e.g., mongodb)
pytest sdks/python/apache_beam/yaml/integration_tests.py -k mongodb
Beam YAML uses "providers" to find implementations for transforms requested in the YAML file.
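The lookup idea can be pictured with a toy model. This is a conceptual sketch only: `ToyProvider` and `find_provider` are hypothetical names, not the classes in yaml_provider.py, and the rule "first available provider wins" is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a provider; not Beam's actual Provider class."""
    name: str
    transforms: set = field(default_factory=set)
    is_available: bool = True

    def provides(self, transform_type):
        # A provider is usable only if it is available and implements the type.
        return self.is_available and transform_type in self.transforms


def find_provider(transform_type, providers):
    """Return the first available provider that implements transform_type."""
    for provider in providers:
        if provider.provides(transform_type):
            return provider
    raise ValueError(f"No provider found for transform type {transform_type!r}")


providers = [
    ToyProvider("python", {"MapToFields", "Filter"}),
    # Simulates a cross-language provider whose expansion service is unreachable.
    ToyProvider("java-expansion-service", {"ReadFromKafka"}, is_available=False),
]
chosen = find_provider("MapToFields", providers)
print(chosen.name)
```

In this toy model, requesting `ReadFromKafka` raises a `ValueError` because its only provider is marked unavailable, which mirrors the kind of failure seen when a transform type cannot be matched to a usable implementation.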
Before execution, a YAML pipeline is preprocessed to resolve schemas, match transforms to providers, and expand shorthand notations (like chain or source/sink composites).
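To make the shorthand expansion concrete, here is a simplified sketch of how a chain might be rewritten so that each transform names its predecessor as input. It is not the logic in yaml_transform.py: `expand_chain`, the generated `Type_index` names, and the `composite` output type are assumptions chosen for illustration, and the transform configs are placeholder values.

```python
import copy


def expand_chain(spec):
    """Give each transform in a chain an explicit input pointing at its predecessor."""
    if spec.get("type") != "chain":
        return copy.deepcopy(spec)
    expanded = {"type": "composite", "transforms": []}
    previous_name = None
    for index, transform in enumerate(spec["transforms"]):
        # Copy so the caller's spec is left unmodified.
        transform = copy.deepcopy(transform)
        # Assumption: unnamed transforms get a generated name so they can be referenced.
        transform.setdefault("name", f"{transform['type']}_{index}")
        if previous_name is not None:
            # Keep any explicitly declared input; otherwise wire to the predecessor.
            transform.setdefault("input", previous_name)
        expanded["transforms"].append(transform)
        previous_name = transform["name"]
    return expanded


chain = {
    "type": "chain",
    "transforms": [
        {"type": "ReadFromCsv", "config": {"path": "/path/to/input*.csv"}},
        {"type": "Filter", "config": {"language": "python", "keep": "col1 > 0"}},
        {"type": "WriteToJson", "config": {"path": "/path/to/output.json"}},
    ],
}
for transform in expand_chain(chain)["transforms"]:
    print(transform["name"], "<-", transform.get("input"))
```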
If a test requires a Java transform, ensure that:
YAML relies heavily on Beam schemas. Ensure that fields produced by a transform match the fields expected by the next transform. Use explicit mapping if necessary.
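A quick way to reason about a mismatch is to compare field names on both sides. The sketch below uses plain Python, not Beam APIs; `missing_fields` and `rename_mapping` are hypothetical helpers, and the field names are invented. The resulting `{new_name: source_field}` dict resembles what might go under a MapToFields `fields` entry, though that shape should be checked against the Beam YAML documentation.

```python
def missing_fields(produced, expected):
    """Return expected field names that the upstream transform does not produce."""
    return sorted(set(expected) - set(produced))


def rename_mapping(renames, passthrough):
    """Build a {new_name: source_field} dict describing an explicit mapping."""
    # Passthrough fields keep their name; renames override or add entries.
    mapping = {name: name for name in passthrough}
    mapping.update(renames)
    return mapping


# Invented field names for illustration.
upstream_fields = ["user_id", "amount", "ts"]
downstream_expects = ["id", "amount"]

# Before mapping: the downstream transform expects "id", which is not produced.
print(missing_fields(upstream_fields, downstream_expects))

# After an explicit mapping that renames user_id to id and keeps amount.
fields = rename_mapping({"id": "user_id"}, passthrough=["amount"])
print(missing_fields(fields.keys(), downstream_expects))
```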