| name | scaffold-connector |
| description | Build a new OpenMetadata connector from scratch — scaffold JSON Schema, Python boilerplate, and AI context using schema-first architecture with code generation across Python, Java, TypeScript, and auto-rendered UI forms. |
| user-invocable | true |
| argument-hint | [connector name or description] |
| allowed-tools | ["Bash","Read","Write","Edit","Glob","Grep","Agent"] |
| hooks | {"SessionStart":"Load the OpenMetadata connector standards before starting:\nRead the standards at ${CLAUDE_SKILL_DIR}/standards/main.md\n"} |
OpenMetadata Connector Building Skill
When to Activate
When a user asks to build, create, add, or scaffold a new connector, source, or integration for OpenMetadata.
Core Insight
One JSON Schema definition cascades through 6 layers: Python Pydantic models, Java models, UI forms (RJSF auto-render), API validation, test fixtures, and documentation. Define the schema once — everything else is generated or guided.
Workflow: 10 Phases
Phase 0: ENVIRONMENT — Set Up Python Dev Environment
Before running any make or python commands, set up the environment from the repo root:
python3.11 -m venv env
source env/bin/activate
make install_dev generate
Always activate before running commands: source env/bin/activate
Phase 1: SCAFFOLD — Generate Boilerplate
Run the scaffold CLI to collect inputs and generate files:
source env/bin/activate
metadata scaffold-connector
Interactive mode collects: connector name, service type, connection type, auth types, capabilities, docs URL, SDK package, API endpoints, implementation notes, Docker image, container port.
Non-interactive mode:
metadata scaffold-connector \
--name my_db \
--service-type database \
--connection-type sqlalchemy \
--scheme "mydb+pymydb" \
--auth-types basic \
--capabilities metadata lineage usage profiler \
--docs-url "https://docs.example.com/api" \
--sdk-package "mydb-sdk" \
--docker-image "mydb/mydb:latest" \
--docker-port 5432
Output: a JSON Schema, a test connection JSON, Python files, and CONNECTOR_CONTEXT.md (an AI working document). SQLAlchemy database connectors get concrete code templates; all other connectors get skeleton files with pointers to reference connectors.
CONNECTOR_CONTEXT.md handling: The scaffold generates CONNECTOR_CONTEXT.md in the connector directory as a working document for any AI tool (Claude Code, Cursor, Codex, Copilot, Windsurf). It is gitignored — it stays local and is never committed to the repo. No cleanup needed.
Phase 2: CLASSIFY — Understand the Source
The scaffold classifies along 3 dimensions. Verify the choices:
Dimension 1 — Service Type (determines directory + base class):
| Service Type | Base Class | Reference |
|---|---|---|
| database | CommonDbSourceService | mysql/ |
| dashboard | DashboardServiceSource | metabase/ |
| pipeline | PipelineServiceSource | airflow/ |
| messaging | MessagingServiceSource | kafka/ |
| mlmodel | MlModelServiceSource | mlflow/ |
| storage | StorageServiceSource | s3/ |
| search | SearchServiceSource | elasticsearch/ |
| api | ApiServiceSource | rest/ |
Dimension 2 — Connection Type (database only):
sqlalchemy → BaseConnection[Config, Engine] + SQLAlchemy dialect
rest_api → get_connection() + custom REST client (ref: salesforce/)
sdk_client → get_connection() + vendor SDK wrapper
Dimension 3 — Capabilities (determines extra files):
metadata (always), lineage, usage, profiler, stored_procedures, data_diff
Read the source-type-specific standard at ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md for detailed patterns.
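The three dimensions can be checked mechanically before moving on. The sketch below copies the values from the lists above into plain Python data; the function name check_classification is hypothetical and is not part of the scaffold CLI.

```python
from typing import Iterable, Optional, Tuple

# Dimension 1: service type -> (base class, reference connector directory).
SERVICE_TYPES = {
    "database": ("CommonDbSourceService", "mysql/"),
    "dashboard": ("DashboardServiceSource", "metabase/"),
    "pipeline": ("PipelineServiceSource", "airflow/"),
    "messaging": ("MessagingServiceSource", "kafka/"),
    "mlmodel": ("MlModelServiceSource", "mlflow/"),
    "storage": ("StorageServiceSource", "s3/"),
    "search": ("SearchServiceSource", "elasticsearch/"),
    "api": ("ApiServiceSource", "rest/"),
}
# Dimension 2: only meaningful for database connectors.
CONNECTION_TYPES = {"sqlalchemy", "rest_api", "sdk_client"}
# Dimension 3: metadata is always present.
CAPABILITIES = {
    "metadata", "lineage", "usage", "profiler", "stored_procedures", "data_diff",
}


def check_classification(
    service_type: str,
    connection_type: Optional[str] = None,
    capabilities: Iterable[str] = ("metadata",),
) -> Tuple[str, str]:
    """Return (base_class, reference_dir) or raise ValueError naming the bad choice."""
    if service_type not in SERVICE_TYPES:
        raise ValueError(f"unknown service type: {service_type!r}")
    if service_type == "database" and connection_type not in CONNECTION_TYPES:
        raise ValueError(f"unknown connection type: {connection_type!r}")
    requested = set(capabilities)
    unknown = requested - CAPABILITIES
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    if "metadata" not in requested:
        raise ValueError("the metadata capability is always required")
    return SERVICE_TYPES[service_type]
```

For example, check_classification("database", "sqlalchemy", ["metadata", "lineage"]) returns the base class and reference directory to study for a SQLAlchemy database connector.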
Phase 3: RESEARCH — API/SDK Discovery
Read the CONNECTOR_CONTEXT.md generated by the scaffold. Then research the source's API/SDK.
If you can dispatch sub-agents (Claude Code): Launch a connector-researcher agent:
Agent: openmetadata-skills:connector-researcher
Prompt: "Research {source_name} for an OpenMetadata {service_type} connector.
Find: API docs, auth methods, key endpoints, pagination, rate limits, SDK packages."
If you cannot dispatch sub-agents: Perform the research yourself using WebSearch and WebFetch.
Phase 4: IMPLEMENT — Fill in the TODO Items
The scaffold generates files with # TODO markers. Read the relevant standards before implementing:
${CLAUDE_SKILL_DIR}/standards/connection.md — Connection patterns
${CLAUDE_SKILL_DIR}/standards/patterns.md — Error handling, pagination, auth
${CLAUDE_SKILL_DIR}/standards/performance.md — Pagination, lookup optimization, anti-patterns
${CLAUDE_SKILL_DIR}/standards/memory.md — Memory management, streaming, OOM prevention
${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md — Service-specific patterns
SQLAlchemy database: Templates are mostly complete. Customize _get_client() if needed.
Non-SQLAlchemy: Study the reference connector, then implement each skeleton file.
Critical for JSON Schema:
- Make auth fields (username, password, token) required when the service needs authentication by default. If omitting a field means an opaque 401 at runtime, make it required so the UI validates upfront.
- Include SSL/TLS config (verifySSL + sslConfig $ref) for any connector that communicates over HTTPS; enterprise deployments use internal CAs.
- SSL must be wired end-to-end: schema → connection.py (resolve with get_verify_ssl_fn) → client.py (session.verify = verify_ssl). Missing wiring triggers a SonarQube Security Review failure.
- See ${CLAUDE_SKILL_DIR}/standards/schema.md for the $ref patterns and required fields guidance.
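As a rough illustration of the required-fields guidance, the sketch below models a connection schema fragment as a Python dict and reports which required fields a config is missing. The hostPort property and the missing_required helper are hypothetical, and the two $ref targets are placeholders: the real paths come from standards/schema.md.

```python
# Illustrative connection schema fragment, not a real OpenMetadata schema file.
CONNECTION_SCHEMA = {
    "type": "object",
    "properties": {
        "hostPort": {"type": "string"},  # hypothetical field for this example
        "username": {"type": "string"},
        "password": {"type": "string", "format": "password"},
        # Placeholders: copy the actual $ref targets from standards/schema.md.
        "verifySSL": {"$ref": "<verifySSL ref from standards/schema.md>"},
        "sslConfig": {"$ref": "<sslConfig ref from standards/schema.md>"},
    },
    # Auth fields are required because the service rejects anonymous access.
    "required": ["hostPort", "username", "password"],
}


def missing_required(schema: dict, config: dict) -> list:
    """Return required property names that are absent or empty in config."""
    return [
        name for name in schema.get("required", [])
        if config.get(name) in (None, "")
    ]
```

With password omitted from the config, missing_required reports it immediately, which mirrors what upfront form validation gives the user instead of a 401 at ingestion time.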
Critical for Pydantic API models (models.py):
- Always set model_config = ConfigDict(populate_by_name=True) when using Field(alias=...); without this, constructing instances with Python attribute names raises ValidationError.
- See ${CLAUDE_SKILL_DIR}/standards/code_style.md for the full pattern.
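The difference is easiest to see side by side. This sketch assumes Pydantic v2; the Folder and StrictFolder models and the folderName alias are made up for illustration and do not come from any connector.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class StrictFolder(BaseModel):
    # No populate_by_name: only the alias "folderName" is accepted as input.
    folder_name: str = Field(alias="folderName")


class Folder(BaseModel):
    # Accept both the API alias and the Python attribute name.
    model_config = ConfigDict(populate_by_name=True)
    folder_name: str = Field(alias="folderName")


# API payloads use the alias, which works for both models.
from_api = Folder.model_validate({"folderName": "reports"})

# Python code (tests, fixtures) can use the attribute name on Folder only.
from_code = Folder(folder_name="reports")

try:
    StrictFolder(folder_name="reports")
    strict_error = None
except ValidationError as exc:
    # The alias "folderName" is reported as a missing field.
    strict_error = exc
```

The failure on StrictFolder tends to surface first in unit tests, where instances are built by hand rather than parsed from API responses.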
Critical for non-database connectors (client.py):
- Every list endpoint MUST implement pagination if the API supports it. Check the API docs.
- Missing pagination causes silent data loss — only the first page is ingested.
- Build dicts for repeated lookups (e.g., folder path → folder name) instead of iterating lists.
- See ${CLAUDE_SKILL_DIR}/standards/performance.md for correct patterns and anti-patterns.
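A minimal sketch of both points, assuming a cursor-style API whose responses carry an items list and a next cursor. The response shape, fake_fetch, and the sample data are all invented for the example; real APIs may paginate by offset, page number, or link headers instead.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], Dict[str, Any]]) -> Iterator[dict]:
    """Yield every item across all pages, following the cursor until it runs out."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next")
        if not cursor:
            break


# Fake two-page API standing in for a real HTTP client call.
_FAKE_PAGES = {
    None: {"items": [{"id": 1, "folder": "/a"}, {"id": 2, "folder": "/b"}], "next": "p2"},
    "p2": {"items": [{"id": 3, "folder": "/a"}], "next": None},
}


def fake_fetch(cursor: Optional[str]) -> Dict[str, Any]:
    return _FAKE_PAGES[cursor]


FOLDERS = [{"path": "/a", "name": "Analytics"}, {"path": "/b", "name": "Billing"}]

# Build the lookup once; each entity then resolves its folder with a dict
# access instead of scanning the FOLDERS list again.
folder_name_by_path = {folder["path"]: folder["name"] for folder in FOLDERS}

dashboards = [
    {"id": item["id"], "folder_name": folder_name_by_path.get(item["folder"])}
    for item in paginate(fake_fetch)
]
```

Stopping after the first fake_fetch call would return only ids 1 and 2 with no error, which is the silent data loss described above.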
Critical for storage connectors and any connector that reads files:
- Never .read() entire files without a size check; doing so causes OOM on production instances.
- Use framework streaming readers (metadata/readers/dataframe/) for data files.
- Delete large objects with del after processing and call gc.collect().
- See ${CLAUDE_SKILL_DIR}/standards/memory.md for correct patterns.
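The size check and chunked reading can be sketched with plain file I/O. This example does not use the framework readers in metadata/readers/dataframe/; the helper names and the 10 MiB threshold are assumptions chosen for illustration.

```python
import os
from typing import Iterator

MAX_IN_MEMORY_BYTES = 10 * 1024 * 1024  # hypothetical threshold, tune per deployment


def read_small_file(path: str, limit: int = MAX_IN_MEMORY_BYTES) -> bytes:
    """Read a whole file only after confirming it fits under the size limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, above the {limit}-byte limit")
    with open(path, "rb") as handle:
        return handle.read()


def iter_chunks(path: str, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Stream a file in fixed-size chunks so memory use stays bounded."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

For object stores the same idea applies: check the object's reported size from its metadata before downloading the body.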
Critical for lineage:
- Never use wildcard table_name="*" in search queries; this links every table in a database to each entity, producing incorrect lineage.
- If the source doesn't provide table-level info, skip lineage and document the limitation.
- See ${CLAUDE_SKILL_DIR}/standards/lineage.md for correct patterns.
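One way to enforce the wildcard rule is a small guard in front of whatever lineage lookup the connector performs. The function below is a hypothetical helper, not an OpenMetadata API; it only filters the table names the source reported.

```python
from typing import List, Optional


def resolve_upstream_tables(source_tables: Optional[List[str]]) -> List[str]:
    """Keep only concrete table names; an empty result means lineage is skipped."""
    if not source_tables:
        return []
    # Drop empty names and anything containing a wildcard rather than
    # letting it match every table in the database.
    return [name for name in source_tables if name and "*" not in name]
```

When the result is empty, the connector should emit no lineage for that entity and the limitation should be noted in the connector docs.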
Phase 5: REGISTER — Integration Points
Read ${CLAUDE_SKILL_DIR}/standards/registration.md for detailed instructions. Summary:
| Step | File | Change |
|---|---|---|
| 1 | openmetadata-spec/.../entity/services/{serviceType}Service.json | Add to type enum + connection oneOf |
| 2 | openmetadata-ui/.../utils/{ServiceType}ServiceUtils.tsx | Import schema + add switch case |
| 3 | openmetadata-ui/.../locale/languages/ | Add i18n display name keys |
Phase 6: GENERATE & FORMAT — Run Code Generation and Formatting
This step is mandatory — always run it before committing. Ensure the Python environment is set up:
source env/bin/activate
pip install -e ".[dev]" 2>/dev/null || make install_dev
make generate
mvn clean install -pl openmetadata-spec
cd openmetadata-ui/src/main/resources/ui && yarn parse-schema
cd /path/to/repo/root
make py_format
mvn spotless:apply
If make py_format fails: The most common cause is missing dev dependencies. Run make install_dev first, then retry.
Never skip formatting — unformatted code will fail CI.
Phase 7: VALIDATE — Run Static Analysis and Checklist
Run the static analyzer as a self-check before submitting:
python skills/connector-review/scripts/analyze_connector.py {service_type} {name}
Fix any issues it reports. Then verify the full checklist:
[ ] JSON Schema: validates, $ref resolves, supports* flags correct
[ ] JSON Schema: auth fields required when service mandates authentication
[ ] JSON Schema: SSL/TLS config included for HTTPS connectors
[ ] Code gen: make generate + mvn install + yarn parse-schema succeed
[ ] Connection: creates client, test_connection passes all steps
[ ] Source: create() validates config type, ServiceSpec is discoverable
[ ] Pydantic models: populate_by_name=True on all aliased models
[ ] Client: all list endpoints paginate (check API docs for pagination support)
[ ] Client: dict lookups in prepare(), not list iteration per entity
[ ] Lineage: no wildcard table_name="*" — skip if no table-level info available
[ ] Tests: unit + connection integration + metadata integration pass (no empty stubs)
[ ] Formatting: make py_format + mvn spotless:apply pass with no changes
[ ] Cleanup: CONNECTOR_CONTEXT.md is gitignored (verify it's not staged)
[ ] Cleanup: no leftover TODO scaffolding comments
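The last checklist item is easy to automate. The sketch below walks a connector directory and lists remaining # TODO markers in Python files; find_todo_markers is a hypothetical helper and is separate from analyze_connector.py, whose checks are not described here.

```python
import os
from typing import List, Tuple


def find_todo_markers(root: str) -> List[Tuple[str, int, str]]:
    """Return (path, line number, stripped line) for each '# TODO' in .py files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if "# TODO" in line:
                        hits.append((path, lineno, line.strip()))
    return hits
```

Running it against the connector directory before committing should return an empty list.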
Phase 8: TEST LOCALLY — Deploy and Test in the UI
Build everything and bring up a full local OpenMetadata stack with Docker:
Full build (first time or after Java/UI changes):
./docker/run_local_docker.sh -m ui -d mysql -s false -i true -r true
Fast rebuild (ingestion-only changes, ~2-3 minutes):
./docker/run_local_docker.sh -m ui -d mysql -s true -i true -r false
Once services are up (~3-5 minutes):
- Open http://localhost:8585
- Go to Settings → Services → {Your Service Type}
- Click Add New Service and select your connector
- Configure connection details and click Test Connection
- If test passes, run metadata ingestion to verify entities are created
Other service URLs:
Tear down: cd docker/development && docker compose down -v
Troubleshooting:
- Connector not in dropdown → check service schema registration, rebuild without -s true
- Test connection fails → check test_fn keys match test connection JSON step names
- Container logs: docker compose -f docker/development/docker-compose.yml logs ingestion
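The test_fn mismatch can be caught before deploying. This sketch assumes the test connection JSON has a top-level steps list whose entries carry a name field; verify that shape against the generated file. The helper name and the step names in the usage note are illustrative.

```python
import json
from typing import Dict, List, Tuple


def step_name_mismatches(
    test_connection_json: str, test_fn: Dict[str, object]
) -> Tuple[List[str], List[str]]:
    """Return (steps with no test_fn entry, test_fn keys with no matching step)."""
    definition = json.loads(test_connection_json)
    step_names = {step["name"] for step in definition.get("steps", [])}
    fn_names = set(test_fn)
    return sorted(step_names - fn_names), sorted(fn_names - step_names)
```

For a definition with steps CheckAccess and GetTables and a test_fn keyed by CheckAccess and GetSchemas, the result is (["GetTables"], ["GetSchemas"]), pointing at both the unimplemented step and the misspelled key.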
Phase 9: CREATE PR — Submit with Quality Summary
When creating a PR for the connector, include the review summary in the PR description so reviewers see the quality assessment upfront:
analysis=$(python skills/connector-review/scripts/analyze_connector.py {service_type} {name} --json)
gh pr create --title "feat(ingestion): Add {Name} {service_type} connector" --body "$(cat <<'EOF'
## Summary
- New {service_type} connector for {Name}
- Capabilities: {list capabilities}
## Test plan
- [ ] Unit tests pass (`pytest ingestion/tests/unit/topology/{service_type}/test_{name}.py`)
- [ ] Integration tests pass
- [ ] Local Docker test: connector appears in UI, test connection passes
## Connector Quality Review
**Verdict**: {VERDICT} | **Score**: {SCORE}/10
| Category | Score |
|----------|-------|
| Schema & Registration | X/10 |
| Connection & Auth | X/10 |
| Source, Topology & Performance | X/10 |
| Test Quality | X/10 |
| Code Quality & Style | X/10 |
**Blockers**: 0 | **Warnings**: {count} | **Suggestions**: {count}
<details>
<summary>Static analysis output</summary>
{paste analyze_connector.py output here}
</details>
🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
The quality summary gives maintainers confidence about the connector's state without needing to review every file manually.
Standards Reference
All standards are in ${CLAUDE_SKILL_DIR}/standards/:
| Standard | Content |
|---|---|
| main.md | Architecture overview, connector anatomy, service types |
| patterns.md | Error handling, logging, pagination, auth, filters |
| testing.md | Unit test patterns, integration tests, pytest style |
| code_style.md | Python style, JSON Schema conventions, naming |
| schema.md | Connection schema patterns, $ref usage, test connection JSON |
| connection.md | BaseConnection vs function patterns, SSL, client wrapper |
| service_spec.md | DefaultDatabaseSpec vs BaseSpec |
| registration.md | Service enum, UI utils, i18n |
| performance.md | Pagination, batching, rate limiting |
| memory.md | Memory management, streaming, OOM prevention |
| lineage.md | Lineage extraction methods, dialect mapping, query logs |
| sql.md | SQLAlchemy patterns, URL building, auth, multi-DB |
| source_types/*.md | Service-type-specific patterns |
References
Architecture guides in ${CLAUDE_SKILL_DIR}/references/:
| Reference | Content |
|---|---|
| architecture-decision-tree.md | Service type, connection type, base class selection |
| connection-type-guide.md | SQLAlchemy vs REST API vs SDK client |
| capability-mapping.md | Capabilities by service type, schema flags, generated files |