| name | hugging-face-datasets |
| description | Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows. |
Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.
Integration with HF MCP Server
- Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
- Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting
Version
2.1.0
Dependencies
- huggingface_hub
- duckdb (for SQL queries)
- datasets (for pushing query results to Hub)
- json (built-in)
- time (built-in)
Core Capabilities
1. Dataset Lifecycle Management
- Initialize: Create new dataset repositories with proper structure
- Configure: Store detailed configuration including system prompts and metadata
- Stream Updates: Add rows efficiently without downloading entire datasets
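One way to add rows without downloading the dataset is to write each batch as a new, uniquely named JSONL file in the repository. The sketch below illustrates that idea only; it is not necessarily how scripts/dataset_manager.py does it. The names build_shard and append_rows and the data/rows-... path layout are assumptions made for this example. The upload uses HfApi.upload_file from huggingface_hub.

```python
import json
import time
import uuid


def build_shard(rows, prefix="data"):
    """Serialize rows to JSONL bytes and choose a unique path for the new shard."""
    if not isinstance(rows, list) or not rows or not all(isinstance(r, dict) for r in rows):
        raise ValueError("rows must be a non-empty list of JSON objects")
    payload = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n"
    # Timestamp plus a random suffix so concurrent writers do not collide.
    # This naming scheme is hypothetical, not taken from the skill's scripts.
    name = f"{prefix}/rows-{int(time.time())}-{uuid.uuid4().hex[:8]}.jsonl"
    return name, payload.encode("utf-8")


def append_rows(repo_id, rows, token=None):
    """Upload one new shard; files already in the repo are never downloaded."""
    # Imported lazily so build_shard can be used and tested offline.
    from huggingface_hub import HfApi

    path_in_repo, payload = build_shard(rows)
    HfApi(token=token).upload_file(
        path_or_fileobj=payload,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
        commit_message=f"Add {len(rows)} rows",
    )
    return path_in_repo
```

Because every call creates a separate file, appends never rewrite earlier data. The trade-off is many small files, which may need occasional compaction.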
2. SQL-Based Dataset Querying (NEW)
Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:
- Direct Queries: Run SQL on datasets using the hf:// protocol
- Schema Discovery: Describe dataset structure and column types
- Data Sampling: Get random samples for exploration
- Aggregations: Count, histogram, unique values analysis
- Transformations: Filter, join, reshape data with SQL
- Export & Push: Save results locally or push to new Hub repos
3. Multi-Format Dataset Support
Supports diverse dataset types through a template system:
- Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
- Text Classification: Sentiment analysis, intent detection, topic classification
- Question-Answering: Reading comprehension, factual QA, knowledge bases
- Text Completion: Language modeling, code completion, creative writing
- Tabular Data: Structured data for regression/classification tasks
- Custom Formats: Flexible schema definition for specialized needs
4. Quality Assurance Features
- JSON Validation: Ensures data integrity during uploads
- Batch Processing: Efficient handling of large datasets
- Error Recovery: Graceful handling of upload failures and conflicts
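The validation and batching steps above can be sketched with the standard library alone. This is a minimal illustration under assumptions: parse_rows, batched, and the default batch size of 1000 are hypothetical and not taken from the skill's scripts.

```python
import json


def parse_rows(rows_json):
    """Parse a --rows_json style string and check that it is a list of objects."""
    try:
        rows = json.loads(rows_json)
    except json.JSONDecodeError as exc:
        # Surface the parser's position so the caller can locate the problem.
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(rows, list):
        raise ValueError("Expected a JSON array of row objects")
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"Row {i} is not a JSON object")
    return rows


def batched(rows, batch_size=1000):
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

Validating before any upload starts means a malformed payload fails fast instead of leaving a partially written dataset.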
Usage Instructions
The skill includes two Python scripts:
scripts/dataset_manager.py - Dataset creation and management
scripts/sql_manager.py - SQL-based dataset querying and transformation
Prerequisites
- huggingface_hub library: uv add huggingface_hub
- duckdb library (for SQL): uv add duckdb
- datasets library (for pushing): uv add datasets
- HF_TOKEN environment variable set to a token with write access
- Activate the virtual environment: source .venv/bin/activate
SQL Dataset Querying (sql_manager.py)
Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset, and to private datasets when a token is supplied.
Quick Start
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"
python scripts/sql_manager.py describe --dataset "cais/mmlu"
python scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5
python scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
SQL Query Syntax
Use data as the table name in your SQL - it gets replaced with the actual hf:// path:
SELECT * FROM data LIMIT 10
SELECT * FROM data WHERE subject='nutrition'
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC
-- DuckDB lists are 1-indexed, while the MMLU answer column is 0-based
SELECT question, choices[answer + 1] AS correct_answer FROM data
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')
-- The 'g' option replaces every match; without it only the first newline is removed
SELECT regexp_replace(question, '\n', '', 'g') AS cleaned FROM data
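The substitution of the data placeholder could look roughly like the sketch below. How sql_manager.py actually performs the replacement is not shown in this document, so bind_table and its regex are assumptions for illustration.

```python
import re


def bind_table(sql, path, placeholder="data"):
    """Replace `FROM data` / `JOIN data` with a quoted path, leaving other uses alone."""
    pattern = re.compile(rf"\b(FROM|JOIN)\s+{re.escape(placeholder)}\b", re.IGNORECASE)
    # Quote the path as a SQL string literal, doubling any embedded single quotes.
    quoted = "'" + path.replace("'", "''") + "'"
    # A function replacement avoids backslash-escape handling in the path.
    return pattern.sub(lambda m: f"{m.group(1)} {quoted}", sql)


query = bind_table(
    "SELECT * FROM data WHERE subject='nutrition' LIMIT 10",
    "hf://datasets/cais/mmlu@~parquet/default/train/*.parquet",
)
```

Matching only after FROM or JOIN keeps a column or string value that happens to be called data intact. A string literal containing the words "from data" would still be rewritten, so a production version would need a real SQL tokenizer.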
Common Operations
1. Explore Dataset Structure
python scripts/sql_manager.py describe --dataset "cais/mmlu"
python scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
2. Filter and Transform
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"
python scripts/sql_manager.py transform \
--dataset "cais/mmlu" \
--select "subject, COUNT(*) as cnt" \
--group-by "subject" \
--order-by "cnt DESC" \
--limit 10
3. Create Subsets and Push to Hub
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition'" \
--push-to "username/mmlu-nutrition-subset" \
--private
python scripts/sql_manager.py transform \
--dataset "ibm/duorc" \
--config "ParaphraseRC" \
--select "question, answers" \
--where "LENGTH(question) > 50" \
--push-to "username/duorc-long-questions"
4. Export to Local Files
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition'" \
--output "nutrition.parquet" \
--format parquet
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data LIMIT 100" \
--output "sample.jsonl" \
--format jsonl
5. Working with Dataset Configs/Splits
python scripts/sql_manager.py query \
--dataset "ibm/duorc" \
--config "ParaphraseRC" \
--sql "SELECT * FROM data LIMIT 5"
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--split "test" \
--sql "SELECT COUNT(*) FROM data"
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--split "*" \
--sql "SELECT * FROM data LIMIT 10"
6. Raw SQL with Full Paths
For complex queries or joining datasets:
python scripts/sql_manager.py raw --sql "
SELECT a.*, b.*
FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
ON a.id = b.id
LIMIT 100
"
Python API Usage
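# Assumes scripts/ is on the import path (for example, when run from inside scripts/)
from sql_manager import HFDatasetSQL
sql = HFDatasetSQL()
# Run a query; `data` stands for the dataset's table
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")
# Inspect the schema, sample rows, count matches, and view a value distribution
schema = sql.describe("cais/mmlu")
samples = sql.sample("cais/mmlu", n=5, seed=42)
count = sql.count("cais/mmlu", where="subject='nutrition'")
dist = sql.histogram("cais/mmlu", "subject")
# Aggregate without writing the full SQL statement
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)
# Push a filtered subset to a new Hub repository
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)
# Export query results locally, then clean up
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
sql.close()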
HF Path Format
DuckDB uses the hf:// protocol to access datasets:
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
Examples:
hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet
The @~parquet revision is shorthand for the Hub's refs/convert/parquet branch, which holds the Parquet files the Hub auto-converts from a dataset's original format.
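The path format above is simple enough to build with one helper. The function name hf_parquet_path and its defaults are assumptions for this sketch; only the path layout comes from the format shown above.

```python
def hf_parquet_path(dataset_id, config="default", split="train", revision="~parquet"):
    """Build an hf:// glob for a dataset's Parquet files."""
    dataset_id = dataset_id.strip("/")
    if not dataset_id:
        raise ValueError("dataset_id must not be empty")
    return f"hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet"


# Reproduces the two example paths listed above.
mmlu_train = hf_parquet_path("cais/mmlu")
duorc_test = hf_parquet_path("ibm/duorc", config="ParaphraseRC", split="test")
# Passing "*" as the split produces a glob that spans every split directory.
mmlu_all = hf_parquet_path("cais/mmlu", split="*")
```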
Useful DuckDB SQL Functions
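LENGTH(column)                       -- string length
regexp_replace(col, '\n', '', 'g')   -- remove every newline ('g' replaces all matches)
regexp_matches(col, 'pattern')       -- true if the pattern matches
LOWER(col), UPPER(col)               -- case conversion
choices[1]                           -- first list element (DuckDB lists are 1-indexed)
array_length(choices)                -- number of list elements
unnest(choices)                      -- expand a list into rows
COUNT(*), SUM(col), AVG(col)         -- aggregates
GROUP BY col HAVING condition        -- filter groups after aggregation
USING SAMPLE 10                      -- random sample of 10 rows
USING SAMPLE 10 (RESERVOIR, 42)      -- reservoir sample with seed 42
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)  -- row number within each partition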
Dataset Creation (dataset_manager.py)
Recommended Workflow
1. Discovery (Use HF MCP Server):
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
2. Creation (Use This Skill):
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
3. Content Management (Use This Skill):
python scripts/dataset_manager.py quick_setup \
--repo_id "your-username/dataset-name" \
--template classification
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json "$(cat your_qa_data.json)"
Template-Based Data Structures
1. Chat Template (--template chat)
{
"messages": [
{"role": "user", "content": "Natural user request"},
{"role": "assistant", "content": "Response with tool usage"},
{"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
],
"scenario": "Description of use case",
"complexity": "simple|intermediate|advanced"
}
2. Classification Template (--template classification)
{
"text": "Input text to be classified",
"label": "classification_label",
"confidence": 0.95,
"metadata": {"domain": "technology", "language": "en"}
}
3. QA Template (--template qa)
{
"question": "What is the question being asked?",
"answer": "The complete answer",
"context": "Additional context if needed",
"answer_type": "factual|explanatory|opinion",
"difficulty": "easy|medium|hard"
}
4. Completion Template (--template completion)
{
"prompt": "The beginning text or context",
"completion": "The expected continuation",
"domain": "code|creative|technical|conversational",
"style": "description of writing style"
}
5. Tabular Template (--template tabular)
{
"columns": [
{"name": "feature1", "type": "numeric", "description": "First feature"},
{"name": "target", "type": "categorical", "description": "Target variable"}
],
"data": [
{"feature1": 123, "target": "class_a"},
{"feature1": 456, "target": "class_b"}
]
}
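A row can be checked against these templates before upload. The sketch below is illustrative only: which fields count as required is an assumption (the minimal QA example in this document supplies just question and answer), and validate_row is a hypothetical helper, not part of dataset_manager.py.

```python
# Required fields per template. These sets are an assumption for illustration;
# the skill's real validation rules may differ.
REQUIRED_FIELDS = {
    "chat": ("messages",),
    "classification": ("text", "label"),
    "qa": ("question", "answer"),
    "completion": ("prompt", "completion"),
    "tabular": ("columns", "data"),
}
VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_row(row, template):
    """Return a list of problems found in row; an empty list means it passed."""
    if template not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown template: {template}")
    if not isinstance(row, dict):
        return ["row must be a JSON object"]
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[template]
        if name not in row
    ]
    if template == "chat" and isinstance(row.get("messages"), list):
        # Every message needs a role and content, and the role must be known.
        for i, message in enumerate(row["messages"]):
            if not isinstance(message, dict) or not {"role", "content"}.issubset(message):
                problems.append(f"messages[{i}] needs role and content")
            elif message["role"] not in VALID_ROLES:
                problems.append(f"messages[{i}] has unknown role: {message['role']}")
    elif template == "chat" and "messages" in row:
        problems.append("messages must be a list")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a batch at once.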
Advanced System Prompt Template
For high-quality training data generation:
You are an AI assistant expert at using MCP tools effectively.
## MCP SERVER DEFINITIONS
[Define available servers and tools]
## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]
## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]
## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
Example Categories & Templates
The skill includes diverse training examples beyond just MCP usage:
Available Example Sets:
training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
diverse_training_examples.json - Broader scenarios including:
- Educational Chat - Explaining programming concepts, tutorials
- Git Workflows - Feature branches, version control guidance
- Code Analysis - Performance optimization, architecture review
- Content Generation - Professional writing, creative brainstorming
- Codebase Navigation - Legacy code exploration, systematic analysis
- Conversational Support - Problem-solving, technical discussions
Using Different Example Sets:
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/training_examples.json)"
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/diverse_training_examples.json)"
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
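Where jq is unavailable, the same concatenation can be done in Python. This sketch assumes each example file holds a top-level JSON array, as the jq expression above implies. concat_rows and merge_files are hypothetical helper names.

```python
import json
from pathlib import Path


def concat_rows(*row_lists):
    """Concatenate several lists of rows, preserving order."""
    merged = []
    for rows in row_lists:
        if not isinstance(rows, list):
            raise ValueError("each example set must be a JSON array")
        merged.extend(rows)
    return merged


def merge_files(*paths):
    """Load each JSON file and concatenate the arrays they contain."""
    return concat_rows(
        *(json.loads(Path(p).read_text(encoding="utf-8")) for p in paths)
    )
```

The merged list can then be serialized with json.dumps and passed to --rows_json in place of the jq output.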
Commands Reference
List Available Templates:
python scripts/dataset_manager.py list_templates
Quick Setup (Recommended):
python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
Manual Setup:
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
View Dataset Statistics:
python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
Error Handling
- Repository exists: Script will notify and continue with configuration
- Invalid JSON: Clear error message with parsing details
- Network issues: Automatic retry for transient failures
- Token permissions: Validation before operations begin
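Automatic retry for transient failures is commonly implemented with exponential backoff. The sketch below shows that general pattern. The function name, the retry count, and the choice of exceptions to retry are assumptions and may not match the scripts' behavior.

```python
import random
import time


def with_retries(operation, attempts=4, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt; a little jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

Only the exception types listed in retry_on are retried, so errors such as invalid JSON or missing token permissions still fail immediately.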
Combined Workflow Examples
Example 1: Create Training Subset from Existing Dataset
python scripts/sql_manager.py describe --dataset "cais/mmlu"
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
--push-to "username/mmlu-medical-subset" \
--private
Example 2: Transform and Reshape Data
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT question, choices[answer + 1] as correct_answer, subject FROM data" \
--push-to "username/mmlu-qa-format"
Example 3: Merge Multiple Dataset Splits
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--split "*" \
--output "mmlu_all.parquet"
Example 4: Quality Filtering
python scripts/sql_manager.py query \
--dataset "rajpurkar/squad" \
--sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
--push-to "username/squad-filtered"
Example 5: Create Custom Training Dataset
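# Step 1: export the source questions
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
--output "nutrition_source.jsonl" \
--format jsonl
# Step 2: process nutrition_source.jsonl with your own pipeline (for example, add answers) and save the result as processed_data.json
# Step 3: create the repository and upload the processed rows
python scripts/dataset_manager.py init --repo_id "username/nutrition-training"
python scripts/dataset_manager.py add_rows \
--repo_id "username/nutrition-training" \
--template qa \
--rows_json "$(cat processed_data.json)"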
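The step between the export and the upload, turning nutrition_source.jsonl into processed_data.json, is left to the user. A minimal sketch of that step follows. It assumes the exported records carry question and subject fields, as the SELECT above requests. answer_fn is a hypothetical callback standing in for whatever produces the answers.

```python
import json


def jsonl_to_qa_rows(lines, answer_fn):
    """Map exported JSONL records onto the QA template, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        rows.append({
            "question": record["question"],
            "answer": answer_fn(record),  # supplied by the caller's own pipeline
            "context": f"Subject: {record['subject']}",
        })
    return rows


def convert_file(src, dst, answer_fn):
    """Read a JSONL export and write a JSON array that add_rows can consume."""
    with open(src, encoding="utf-8") as handle:
        rows = jsonl_to_qa_rows(handle, answer_fn)
    with open(dst, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, ensure_ascii=False, indent=2)
    return len(rows)
```

For example, convert_file("nutrition_source.jsonl", "processed_data.json", my_answer_fn) would produce the file the final add_rows command reads, where my_answer_fn is whatever function generates an answer for a record.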