| name | investigate-dataset |
| description | Investigate datasets from HuggingFace, CSV, or JSON files to understand their structure, fields, and data quality. Trigger whenever you need to explore or inspect a dataset yourself without using pre-written scripts. |
Investigate Dataset
This workflow helps you explore and understand datasets used in evaluations. It covers HuggingFace datasets, CSV files, and JSON/JSONL files.
Key Concepts
For detailed information on Inspect's dataset types (datasets.Dataset vs inspect_ai.dataset.Dataset), the hf_dataset() pipeline, caching behaviour, and test utilities, see references/inspect-dataset-patterns.md.
Common Patterns in Evals
Evals typically define:
DATASET_PATH: HuggingFace repo path (e.g., "qiaojin/PubMedQA")
DATASET_REVISION: Optional git revision/tag for reproducibility
record_to_sample(): Function converting raw records to Sample objects
Prerequisites
- Access to the evaluation code to find dataset configuration
- Python environment with
datasets, pandas, and inspect_ai installed
Steps
1. Identify the Dataset Source
Look for these patterns in the evaluation code:
DATASET_PATH = "org/dataset-name"
DATASET_REVISION = "v1.0"
hf_dataset(path=DATASET_PATH, name="subset", split="train", ...)
csv_dataset("path/to/file.csv", ...)
load_csv_dataset("https://example.com/file.csv", eval_name="myeval", ...)
json_dataset("path/to/file.json", ...)
load_json_dataset("https://example.com/file.jsonl", eval_name="myeval", ...)
2. Load the Raw Dataset
For investigation, load the raw data directly (not through Inspect's sample_fields transformation). Use standard datasets.load_dataset() for HuggingFace, pd.read_csv() for CSV, or pd.read_json() for JSON/JSONL. For gated datasets, ensure HF_TOKEN is set or run huggingface-cli login.
3. Explore Structure and Quality
Use standard pandas/datasets methods to explore:
- Schema:
ds.features (HF) or df.dtypes (pandas)
- Shape:
len(ds), ds.column_names (HF) or df.info(), df.columns (pandas)
- Sample data:
ds[:3] (HF) or df.head() (pandas)
- Missing values: Check for
None, empty strings, empty lists
- Duplicates: Check ID uniqueness if an ID field exists
- Value distributions:
value_counts() for categorical columns, length stats for text fields
For converting an Inspect Dataset (which has no .to_pandas()) to a DataFrame, see references/inspect-dataset-patterns.md.
4. Understand the Sample Conversion
Look at the record_to_sample function to understand how raw data maps to Inspect samples. Key questions:
- Which fields become
input? Are they combined/formatted?
- What is the
target format? (letter, text, JSON, etc.)
- Are there
choices for multiple choice?
- What goes into
metadata?
- Are any records filtered out?
5. Test the Inspect Loading Pipeline
See references/inspect-dataset-patterns.md for the pattern to load through Inspect's hf_dataset() and verify sample conversion works correctly.
Quick Reference Commands
uv run python -c "from datasets import load_dataset_builder; b = load_dataset_builder('org/name'); print(b.info)"
uv run python -c "from datasets import get_dataset_config_names; print(get_dataset_config_names('org/name'))"
uv run python -c "from datasets import load_dataset; print(load_dataset('org/name', split=None).keys())"
Caching and Troubleshooting
For cache locations (HuggingFace native, Inspect AI, Inspect Evals), force re-download commands, and test utilities, see references/inspect-dataset-patterns.md.
- Gated dataset: Run
huggingface-cli login or set HF_TOKEN
- Rate limited: The
hf_dataset wrapper in inspect_evals.utils.huggingface has built-in retry with backoff
- Large dataset: Use
streaming=True or split="train[:1000]" for sampling
- Missing revision: Check the dataset's "Files and versions" tab on HuggingFace