| name | gcp-spark |
| description | Develops and executes Spark code on Dataproc Clusters and Serverless.
Reads and writes data using BigLake Iceberg catalogs, BigQuery and Spanner.
Debugs execution failures.
Use when:
- Writing Spark ETL pipelines on GCP.
- Training ML models or running inference with Spark on GCP.
- Managing Spark clusters, jobs, batches, and interactive sessions.
Don't use when:
- Writing generic Python scripts that don't use Spark.
- Performing simple SQL queries that can be done directly in BigQuery.
|
| license | Apache-2.0 |
| metadata | {"version":"v2","publisher":"google"} |
Spark on Dataproc
[!IMPORTANT] You MUST ALWAYS follow the Task Execution Workflow when writing
Spark code.
Task Execution Workflow
- Understand schemas: ALWAYS use the
@skill:discovering-gcp-data-assets
skill or resources/schema_direct_inspection.md to understand input and
output schemas. Include the schema in your thought process BEFORE generating
any code. Do NOT guess column names.
- Generate Spark code:
  - Output Format: ALWAYS generate code in Python Notebooks
    (.ipynb) format. Generate scripts (.py) only if explicitly requested.
  - Read and Write data: ALWAYS refer to
    resources/read_write_data.md when reading or writing data.
  - ML Tasks: Refer to the
    @skill:ml-best-practices skill and
    resources/ml_tasks.md when generating ML code.
  - Spark Optimizations: ALWAYS refer to
    resources/spark_optimizations.md when generating Spark code and apply
    optimizations wherever applicable.
- Verify schema before write: ALWAYS verify that the dataframe schema and the
destination schema match. Use
df.printSchema() to inspect the dataframe schema, and
refer to the @skill:discovering-gcp-data-assets skill or
resources/schema_direct_inspection.md to verify the destination schema.
- Compile code before executing: For notebooks, first convert them to a Python
script using
jupyter nbconvert --to script your-notebook.ipynb, then
compile the code using python3 -m py_compile your-notebook.py.
- Execute script: ONLY when generating a
.py script, refer to
resources/gcloud_dataproc.md for the command that executes the generated code
on Dataproc. This DOES NOT apply when generating notebooks.
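The schema verification and compile steps above can be sketched in plain Python. This is a minimal illustration, not part of the skill's resources: the helper names `schema_mismatches` and `compiles_cleanly` and the example column names are hypothetical. It assumes the dataframe schema is taken from `df.dtypes` (a list of `(name, type)` pairs in PySpark) and that the destination schema has been collected into the same shape by whichever inspection method you used.

```python
import os
import py_compile
import tempfile


def schema_mismatches(df_dtypes, dest_dtypes):
    """Return readable differences between two lists of (name, type) pairs."""
    src = dict(df_dtypes)
    dest = dict(dest_dtypes)
    problems = []
    for name in sorted(dest.keys() - src.keys()):
        problems.append(f"missing in dataframe: {name}")
    for name in sorted(src.keys() - dest.keys()):
        problems.append(f"not in destination: {name}")
    for name in sorted(src.keys() & dest.keys()):
        if src[name] != dest[name]:
            problems.append(f"type mismatch on {name}: {src[name]} != {dest[name]}")
    return problems


def compiles_cleanly(script_path):
    """Byte-compile a script, like `python3 -m py_compile`; return (ok, error)."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            # Write the bytecode to a throwaway location to avoid side effects.
            py_compile.compile(
                str(script_path), cfile=os.path.join(tmp, "out.pyc"), doraise=True
            )
            return True, ""
        except py_compile.PyCompileError as exc:
            return False, str(exc)


# Hypothetical schemas: in a real job, df_dtypes would come from df.dtypes.
df_dtypes = [("order_id", "bigint"), ("amount", "double"), ("ts", "string")]
dest_dtypes = [("order_id", "bigint"), ("amount", "double"), ("ts", "timestamp")]

for problem in schema_mismatches(df_dtypes, dest_dtypes):
    print(problem)
```

A non-empty result from `schema_mismatches` is a signal to cast or rename columns before writing. The comparison is by exact type string, so nested or parameterized types may need normalizing first.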
Common Mistakes Checklist
[!CAUTION] Verify this checklist to avoid common mistakes.
Before submitting a job, verify:
IAM Requirements
The Dataproc service account needs:
- roles/dataproc.worker: Job execution
- roles/biglake.admin: Iceberg table management
- roles/bigquery.jobUser: Query materialization
- roles/storage.objectUser: Read/write GCS
- roles/spanner.databaseUser: Spanner writes
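One way to check the role list above is to compare it against an IAM policy document. The sketch below assumes the policy has already been loaded into a dict with the standard `bindings` structure (for example, from JSON exported by `gcloud projects get-iam-policy`). The function name `missing_roles`, the service account address, and the sample policy are hypothetical. It only sees direct bindings in the policy it is given, so roles granted on individual resources or through groups would not show up.

```python
# Roles listed in the checklist above.
REQUIRED_ROLES = {
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
}


def missing_roles(policy, member, required=REQUIRED_ROLES):
    """Return the required roles that `member` is not directly bound to."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return sorted(set(required) - granted)


# Hypothetical service account and policy, for illustration only.
member = "serviceAccount:spark-sa@example-project.iam.gserviceaccount.com"
policy = {
    "bindings": [
        {"role": "roles/dataproc.worker", "members": [member]},
        {"role": "roles/bigquery.jobUser", "members": [member]},
        {"role": "roles/storage.objectUser", "members": ["user:someone@example.com"]},
    ]
}

for role in missing_roles(policy, member):
    print(f"missing: {role}")
```

Roles that a given job does not exercise can be dropped from `required` by passing a smaller set.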
Spark resource management
Refer to resources/gcloud_dataproc.md for detailed guidelines on managing
Spark clusters, jobs, batches, and interactive sessions.