
gcp-spark

Name: Gcp Spark
Author: gemini-cli-extensions

Develops and executes Spark code on Dataproc Clusters and Serverless. Reads and writes data using BigLake Iceberg catalogs, BigQuery and Spanner. Debugs execution failures.

Use when:
- Writing Spark ETL pipelines on GCP.
- Training or running inference with ML models with Spark on GCP.
- Managing Spark clusters, jobs, batches, and interactive sessions.

Don't use when:
- Writing generic Python scripts that don't use Spark.
- Performing simple SQL queries that can be done directly in BigQuery.


stars:73

forks:10

updated:May 14, 2026 at 23:41

SKILL.md


name	gcp-spark
description	Develops and executes Spark code on Dataproc Clusters and Serverless. Reads and writes data using BigLake Iceberg catalogs, BigQuery and Spanner. Debugs execution failures. Use when: - Writing Spark ETL pipelines on GCP. - Training or running inference with ML models with spark on GCP. - Managing Spark clusters, jobs, batches, and interactive sessions. Don't use when: - Writing generic Python scripts that don't use Spark. - Performing simple SQL queries that can be done directly in BigQuery.
license	Apache-2.0
metadata	{"version":"v2","publisher":"google"}

Spark on Dataproc

[!IMPORTANT] You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.

Task Execution Workflow

1. Understand schemas: ALWAYS use the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to understand the input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
2. Generate Spark code:
   - Output format: ALWAYS generate code in Python notebook (.ipynb) format. Generate scripts (.py) only if explicitly requested.
   - Read and write data: ALWAYS refer to resources/read_write_data.md when reading or writing data.
   - ML tasks: Refer to the @skill:ml-best-practices skill and resources/ml_tasks.md when generating ML code.
   - Spark optimizations: ALWAYS refer to resources/spark_optimizations.md when generating Spark code, and apply optimizations wherever applicable.
3. Verify schema before write: ALWAYS verify that the DataFrame schema and the destination schema match. Use df.printSchema() for the DataFrame schema, and refer to the @skill:discovering-gcp-data-assets skill or resources/schema_direct_inspection.md to verify the destination schema.
4. Compile code before executing: For notebooks, first convert them to a Python script using jupyter nbconvert --to script your-notebook.ipynb, then compile the result using python3 -m py_compile your-notebook.py.
5. Execute script: ONLY when generating a .py script, refer to resources/gcloud_dataproc.md for the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.
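The compile step of the workflow can be approximated in plain Python when jupyter is not at hand. The sketch below is not the nbconvert plus py_compile pipeline named in the workflow; it is a standard-library stand-in that assumes an nbformat 4 notebook (a JSON file with a top-level "cells" list) and only checks syntax. The function names notebook_code and syntax_errors are hypothetical.

```python
import json
from pathlib import Path


def notebook_code(nb_path):
    """Concatenate the code cells of a notebook into one Python source string."""
    nb = json.loads(Path(nb_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores a cell's source as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        # Cell magics (%%sql, %%bash, ...) make the whole cell non-Python: skip it.
        if text.lstrip().startswith("%%"):
            continue
        # Line magics and shell escapes are not valid Python: comment them out.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in text.splitlines()
        ]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def syntax_errors(nb_path):
    """Return syntax error messages; an empty list means the code compiles."""
    try:
        compile(notebook_code(nb_path), str(nb_path), "exec")
    except SyntaxError as err:
        return [f"{err.filename}:{err.lineno}: {err.msg}"]
    return []
```

Calling syntax_errors("your-notebook.ipynb") returns an empty list when every code cell parses. Like py_compile, this catches syntax errors only; it does not detect missing imports or wrong column names, and the reported line number refers to the concatenated source rather than to a specific cell.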

Common Mistakes Checklist

[!CAUTION] Verify every item on this checklist to avoid common mistakes.

Before submitting a job, verify:

- All imports present: col, when, lit, etc. from pyspark.sql.functions.
- vector_to_array imported from the correct module: use from pyspark.ml.functions import vector_to_array (NOT pyspark.sql.functions).
- DataFrame schema matches the target Iceberg table: verify with df.printSchema() before writing.
- CSV files read with header and inferSchema: without these options, the header row becomes data and all columns are strings.
- Avoid toPandas(): converting a PySpark DataFrame to pandas by calling toPandas() can lead to out-of-memory errors. It is acceptable only for building visualizations in Spark 3.5.
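Two of the checklist items lend themselves to small helpers. The following is a minimal sketch, not part of the skill itself: read_csv_with_header and schema_mismatches are hypothetical names, spark is assumed to be an existing SparkSession, and the schema check compares lists of (column, type) pairs in the shape that df.dtypes returns.

```python
def read_csv_with_header(spark, path):
    """Read a CSV so the first row becomes column names and types are inferred."""
    return (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(path)
    )


def schema_mismatches(df_dtypes, target_dtypes):
    """Describe differences between two lists of (column, type) pairs."""
    actual = dict(df_dtypes)
    expected = dict(target_dtypes)
    problems = []
    # Every expected column must exist with the expected type.
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"type mismatch for {name}: {actual[name]} != {dtype}")
    # Columns the target does not know about are reported as well.
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems
```

One possible use is schema_mismatches(df.dtypes, target_dtypes), where target_dtypes comes from whichever schema inspection method the workflow prescribes; an empty list means names and types agree. The comparison ignores column order and nullability, so it complements df.printSchema() rather than replacing it.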

IAM Requirements

The Dataproc service account needs:

roles/dataproc.worker: Job execution
roles/biglake.admin: Iceberg table management
roles/bigquery.jobUser: Query materialization
roles/storage.objectUser: Read/write GCS
roles/spanner.databaseUser: Spanner writes
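As an illustration, the role list above can be turned into gcloud commands. This sketch only builds the command strings and does not run them. It assumes the roles are granted at the project level, which the source does not specify; the project ID and service account email in the usage note are placeholders.

```python
import shlex

# Roles listed in the IAM Requirements section.
REQUIRED_ROLES = [
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
]


def iam_binding_commands(project_id, service_account_email):
    """Build one `gcloud projects add-iam-policy-binding` command per role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ])
        for role in REQUIRED_ROLES
    ]
```

For example, iam_binding_commands("my-project", "spark-sa@my-project.iam.gserviceaccount.com") yields five commands that can be reviewed before anyone runs them. Project-level grants are the broadest option; a narrower scope may be more appropriate for some of these roles.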

Spark resource management

Refer to resources/gcloud_dataproc.md for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.

related-skills.json

same repository

bigquery-data-transfer-service.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

Discovers and inspects BigQuery Data Transfer Service (DTS) configurations. Use this to identify existing ingestion pipelines and extract datasource or transfer config metadata for data pipelines. Use when a user asks for ingestion scenarios while building or managing data pipelines or when a user asks to "ingest" or "add" data that may already be managed by a DTS transfer.


dataform-bigquery.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

Expertise in generating clean, correct, and efficient Dataform pipeline code for BigQuery ELT. Use this when creating or modifying Dataform pipelines, actions, or source declarations, when Dataform, SQLX, or BigQuery are mentioned in a transformation, when data needs to be ingested from GCS into BigQuery via Dataform, or when setting up a new Dataform project or configuring workflow_settings.yaml.


dbt-bigquery.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

Expert guidance for creating, modifying, and optimizing dbt pipelines for BigQuery. Use this skill whenever the user asks to generate or modify a dbt model or project. Activate this skill when the user:
- Creates, modifies, or troubleshoots dbt models or pipelines
- Needs to optimize SQL within a dbt project
- Is setting up a new dbt project or configuring an existing one


discovering-gcp-data-assets.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

Finds and inspects data assets within Google Cloud. Relevant when any of the following conditions are true:
1. The user request involves finding, exploring, or inspecting data assets in Google Cloud, such as:
   - BigQuery datasets, tables, or views
   - BigLake catalog or tables
   - Spanner instances, databases or tables
   - etc.
2. You need to retrieve the schema, metadata, or governance policies for a GCP data asset.
3. You have a keyword or topic (e.g., "sales data") but lack the specific table or resource ID.
4. You are attempting to find data using `bq ls`, as this skill offers a superior approach.

Don't use when:
- Assets are outside Google Cloud


gcp-dataflow.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

Provides guidance for writing, packaging and executing Apache Beam pipelines on GCP using Cloud Dataflow. Use when: - Creating an Apache Beam Dataflow pipeline. - Creating a Google Flex Template.


gcp-pipeline-orchestration.md

from "gemini-cli-extensions/data-agent-kit-starter-pack"

This skill helps the agent generate or update orchestration pipeline definitions for Google Cloud Composer, either to initialize an orchestration pipeline or to update an existing definition that orchestrates data pipelines such as dbt pipelines, notebooks, Spark jobs, Dataform, Python scripts, or inline BigQuery SQL queries. It also helps deploy and trigger orchestration pipelines.


package.json

"author": "gemini-cli-extensions"

"repository": "gemini-cli-extensions/data-agent-kit-starter-pack"


