data-analysis

End-to-end data analysis workflow in R or Python — from exploration through regression to publication-ready tables and figures. Make sure to use this skill whenever the user wants to run any empirical analysis, write analysis code, or produce output from data. Triggers include: "analyze this data", "run a regression", "write R code for this", "write Python code for this", "I have a dataset", "help me with this regression", "run a DiD", "run an RDD", "event study", "IV regression", "fit a model", "produce a table", "make a figure", "explore my data", or any request involving a dataset path or empirical estimation.

Updated June 4, 2026 at 21:34

Source: brycewang-stanford/Auto-Empirical-Research-Skills

SKILL.md

name	data-analysis
description	End-to-end data analysis workflow in R or Python — from exploration through regression to publication-ready tables and figures. Make sure to use this skill whenever the user wants to run any empirical analysis, write analysis code, or produce output from data. Triggers include: "analyze this data", "run a regression", "write R code for this", "write Python code for this", "I have a dataset", "help me with this regression", "run a DiD", "run an RDD", "event study", "IV regression", "fit a model", "produce a table", "make a figure", "explore my data", or any request involving a dataset path or empirical estimation.
argument-hint	[dataset path or description of analysis goal]
allowed-tools	["Read","Grep","Glob","Write","Edit","Bash","Task","AskUserQuestion"]

Data Analysis Workflow

Run an end-to-end data analysis in R or Python: load, explore, analyze, and produce publication-ready output.

Input: $ARGUMENTS — a dataset path (e.g., data/county_panel.csv) or a description of the analysis goal (e.g., "regress wages on education with state fixed effects using CPS data").

Phase 0: Choose Language

Determine language from $ARGUMENTS or ask the user:

User mentions tidyverse, fixest, lm, .R context → R track
User mentions pandas, statsmodels, sklearn, .py or .ipynb context → Python track
Dataset is .csv/.parquet with no language cue → use AskUserQuestion with a single-select menu:
- header: "Language"
- question: "Which language should I use for this analysis?"
- options:
  - label: "R (Recommended)", description: "tidyverse, fixest, ggplot2 — full plugin support with coding conventions and R reviewer"
  - label: "Python", description: "pandas, statsmodels — supported for analysis scripts and figures"
  - label: "Both", description: "R for figures and tables, Python for data processing"
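
The routing rules above can be sketched as a small function. This is only an illustration: the function name `detect_track`, its return values, and the choice to treat conflicting cues the same as missing cues are assumptions, not part of the skill.

```python
import re
from typing import Optional

# Cue lists are copied from the rules above; everything else is hypothetical.
R_CUES = {"tidyverse", "fixest", "lm"}
PYTHON_CUES = {"pandas", "statsmodels", "sklearn"}
R_EXTENSIONS = {".r"}
PYTHON_EXTENSIONS = {".py", ".ipynb"}


def detect_track(arguments: str) -> Optional[str]:
    """Return "R", "Python", or None when the user has to be asked."""
    text = arguments.lower()
    # Whole-word tokens, so "lm" does not match inside longer words.
    words = set(re.findall(r"[a-z0-9_]+", text))
    # File extensions mentioned in the request, e.g. "clean.R" gives ".r".
    extensions = set(re.findall(r"\.[a-z]+\b", text))

    r_hit = bool(words & R_CUES) or bool(extensions & R_EXTENSIONS)
    py_hit = bool(words & PYTHON_CUES) or bool(extensions & PYTHON_EXTENSIONS)

    if r_hit and not py_hit:
        return "R"
    if py_hit and not r_hit:
        return "Python"
    # No cue, or conflicting cues (an assumption): fall back to asking.
    return None


print(detect_track("run a DiD with fixest on data/county_panel.csv"))
print(detect_track("data/county_panel.csv"))
```

A `None` result corresponds to the AskUserQuestion branch, where the single-select menu decides the track.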

R Track

Constraints

Follow rules/r-code-conventions.md for all standards
Save scripts to scripts/R/ with descriptive names
Save all outputs (figures, tables, RDS) to output/
Use saveRDS() for every key computed object
Run r-reviewer on the generated script before presenting results

Phase 1: Setup and Data Loading

Create R script with proper header (title, author, purpose, inputs, outputs)
Load required packages at top (library(), never require())
Set seed once at top: set.seed(42)
Create output directories: dir.create("output/analysis", recursive = TRUE, showWarnings = FALSE)
Load and inspect the dataset

Phase 2: Exploratory Data Analysis

summary(), missingness rates, variable types
Histograms for key continuous variables
Scatter plots, correlation matrices
Panel trends, pre-treatment comparisons if applicable
Save all diagnostic figures to output/diagnostics/

Phase 3: Main Analysis

Panel data: use fixest; cross-section: use lm/glm
Cluster SEs at the appropriate level (document why)
Multiple specifications: start simple, progressively add controls
Report standardized effects alongside raw coefficients

Phase 4: Publication-Ready Output

Tables: modelsummary (preferred) or stargazer; export .tex and .html
Figures: ggplot2 with project theme; explicit ggsave(width = X, height = Y); save as .pdf and .png; add bg = "transparent" only if output is for Beamer slides

Phase 5: Save and Review

saveRDS() for all key objects
Run the r-reviewer agent: "Review the script at scripts/R/[script_name].R"
Address Critical and High issues before presenting results

R Script Template

# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs:  [Data files]
# Outputs: [Figures, tables, RDS files]
# ============================================================

# 0. Setup ----
library(tidyverse)
library(fixest)
library(modelsummary)

set.seed(42)
dir.create("output/analysis", recursive = TRUE, showWarnings = FALSE)
dir.create("output/diagnostics", recursive = TRUE, showWarnings = FALSE)  # Phase 2 figures

# 1. Data Loading ----
# 2. Exploratory Analysis ----
# 3. Main Analysis ----
# 4. Tables and Figures ----
# 5. Export ----

Python Track

Constraints

Save scripts to scripts/python/ with descriptive names
Save all outputs (figures, tables, pickles) to output/
Use joblib.dump() for model objects; .to_parquet() for DataFrames
Use pathlib.Path for all file paths — never hardcode absolute paths
Set random seeds at the top of the script

Phase 1: Setup and Data Loading

Create Python script with header (title, author, purpose, inputs, outputs)
Import all packages at the top of the file
Set seeds: np.random.seed(42) and random.seed(42)
Create output directories: Path("output/analysis").mkdir(parents=True, exist_ok=True)
Load and inspect the dataset with pandas

Phase 2: Exploratory Data Analysis

df.describe(), df.isnull().sum(), df.dtypes
Histograms and distributions with matplotlib/seaborn
Scatter plots and correlation matrices
Save diagnostic figures to output/diagnostics/
Save summary stats: df.describe().to_csv(Path("output/diagnostics") / "summary_stats.csv")
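
A minimal sketch of the Phase 2 steps, assuming a synthetic dataset with hypothetical column names. It writes to a temporary directory so that it has no side effects; a real script would write to output/diagnostics/ relative to the repository root.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

np.random.seed(42)

# Synthetic stand-in data; the column names are hypothetical.
df = pd.DataFrame({
    "wage": np.random.normal(20.0, 5.0, size=200),
    "education": np.random.randint(8, 21, size=200).astype(float),
    "state": np.random.choice(["CA", "NY", "TX"], size=200),
})
df.loc[df.index[:10], "education"] = np.nan  # inject some missing values

# Temporary stand-in for output/diagnostics/.
diagnostics_dir = Path(tempfile.mkdtemp()) / "output" / "diagnostics"
diagnostics_dir.mkdir(parents=True, exist_ok=True)

summary = df.describe()              # numeric columns only by default
missing_share = df.isnull().mean()   # share of missing values per column
dtypes = df.dtypes.astype(str)

summary.to_csv(diagnostics_dir / "summary_stats.csv")
missing_share.rename("share_missing").to_csv(diagnostics_dir / "missingness.csv")

print(summary)
print(missing_share)
print(dtypes)
```

Using `isnull().mean()` rather than `isnull().sum()` reports missingness as a share, which is easier to compare across columns; either works for the check described above.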

Phase 3: Main Analysis

Cross-section OLS: smf.ols("y ~ x", data=df).fit(cov_type="HC3")
Panel data: PanelOLS from linearmodels with cluster-robust SEs
Multiple specifications: build incrementally
Document SE choice with a comment
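
The cross-section case can be sketched as follows, assuming synthetic data with a known slope of 2.0 on `x`. The variable names and the two specifications are hypothetical; the estimation call is the `smf.ols(...).fit(cov_type="HC3")` pattern listed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)

# Synthetic data: true slope on x is 2.0, and z is a control correlated with x.
n = 500
x = np.random.normal(size=n)
z = 0.5 * x + np.random.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * z + np.random.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

# Specifications are built incrementally: baseline first, then controls.
formulas = {
    "baseline": "y ~ x",
    "with_control": "y ~ x + z",
}

# SE choice: HC3 heteroskedasticity-robust errors, because this cross-section
# has no grouping structure that would call for clustering.
fits = {
    name: smf.ols(formula, data=df).fit(cov_type="HC3")
    for name, formula in formulas.items()
}

results = pd.DataFrame({
    name: {"coef_x": fit.params["x"], "se_x": fit.bse["x"], "nobs": int(fit.nobs)}
    for name, fit in fits.items()
}).T
print(results)
```

Because `z` is correlated with `x` and omitted from the baseline, the baseline coefficient on `x` is biased upward here, which is the kind of movement across specifications that the incremental approach is meant to expose.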

Phase 4: Publication-Ready Output

Tables: Format with pandas and export via .to_latex() or stargazer (Python port)
Figures: matplotlib/seaborn; explicit fig.savefig(path, dpi=300, bbox_inches="tight"); save as .pdf and .png
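
A minimal sketch of the figure half of this phase, assuming synthetic data and a hypothetical file name. It saves the same figure as .pdf and .png with the explicit `dpi` and `bbox_inches` arguments named above, and uses a temporary directory in place of output/figures/.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
x = np.random.normal(size=100)
y = 2.0 * x + np.random.normal(size=100)

# Temporary stand-in for output/figures/.
figures_dir = Path(tempfile.mkdtemp()) / "output" / "figures"
figures_dir.mkdir(parents=True, exist_ok=True)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, s=12)
ax.set_xlabel("x")
ax.set_ylabel("y")

# Save both formats from the same figure object so they cannot drift apart.
saved_paths = []
for suffix in (".pdf", ".png"):
    path = figures_dir / f"scatter_xy{suffix}"
    fig.savefig(path, dpi=300, bbox_inches="tight")
    saved_paths.append(path)
plt.close(fig)

print([p.name for p in saved_paths])
```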

Phase 5: Save and Review

joblib.dump(model, Path("output") / "model.pkl") for fitted models
df_results.to_parquet(Path("output") / "results.parquet") for DataFrames
Review the script manually against the Python checklist below before presenting
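
The model-saving step can be sketched as a round trip: dump with `joblib.dump()`, reload with `joblib.load()`, and confirm nothing changed. The "model" here is a plain dict holding a least-squares fit, used as a stand-in for a fitted model object, and the temporary directory stands in for output/. The DataFrame side is left out of the code because `to_parquet()` needs a parquet engine (pyarrow or fastparquet) to be installed.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

np.random.seed(42)

# Stand-in for a fitted model: a degree-1 least-squares fit stored as a dict.
x = np.random.normal(size=50)
y = 3.0 * x + np.random.normal(scale=0.1, size=50)
slope, intercept = np.polyfit(x, y, deg=1)
model = {"slope": float(slope), "intercept": float(intercept)}

# Temporary stand-in for output/.
output_dir = Path(tempfile.mkdtemp()) / "output"
output_dir.mkdir(parents=True, exist_ok=True)

model_path = output_dir / "model.pkl"
joblib.dump(model, model_path)

# Reloading right away is a cheap check that the saved artifact is usable.
reloaded = joblib.load(model_path)
print(reloaded)
```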

Python Script Template

# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs:  [Data files]
# Outputs: [Figures, tables, pickle/parquet files]
# ============================================================

import random
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from pathlib import Path

# Seeds
np.random.seed(42)
random.seed(42)

# Output directories
Path("output/analysis").mkdir(parents=True, exist_ok=True)
Path("output/figures").mkdir(parents=True, exist_ok=True)
Path("output/diagnostics").mkdir(parents=True, exist_ok=True)  # Phase 2 figures and summary stats

# 1. Data Loading
# 2. Exploratory Analysis
# 3. Main Analysis
# 4. Tables and Figures
# 5. Export

Python Quality Checklist

[ ] All imports at top
[ ] Random seeds set (numpy + stdlib)
[ ] All paths use pathlib.Path; no hardcoded absolute paths
[ ] Output directories created with mkdir(parents=True, exist_ok=True)
[ ] Figures saved with explicit dpi=300, bbox_inches="tight"
[ ] Model objects saved with joblib.dump()
[ ] DataFrames saved as parquet
[ ] Comments explain WHY, not WHAT

Shared Principles

Reproduce, don't guess. If the user specifies a regression, run exactly that.
Show your work. Compute summary statistics before jumping to regression.
Check for issues. Look for multicollinearity, outliers, perfect prediction, missing data.
Use relative paths. All paths relative to repository root.
No hardcoded values. Use variables for sample restrictions, date ranges, thresholds.
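
Two of these principles, "No hardcoded values" and "Check for issues", can be sketched together. The constant names, thresholds, and synthetic columns are all hypothetical; the point is that sample restrictions sit in named variables at the top and that missing data is checked before any estimation.

```python
import numpy as np
import pandas as pd

# Analysis parameters live in one place instead of being scattered as literals.
# All names and values here are hypothetical.
SAMPLE_START_YEAR = 2005
SAMPLE_END_YEAR = 2015
MIN_AGE = 25
MAX_MISSING_SHARE = 0.20

np.random.seed(42)
df = pd.DataFrame({
    "year": np.random.randint(2000, 2021, size=300),
    "age": np.random.randint(18, 66, size=300),
    "wage": np.random.normal(20.0, 5.0, size=300),
})
df.loc[df.index[:30], "wage"] = np.nan  # inject some missing values

# Sample restriction built from the named parameters (between is inclusive).
in_sample = (
    df["year"].between(SAMPLE_START_YEAR, SAMPLE_END_YEAR)
    & (df["age"] >= MIN_AGE)
)
sample = df.loc[in_sample].copy()

# Check for issues before estimating: flag columns with too much missing data.
missing_share = sample.isnull().mean()
flagged = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
print(f"kept {len(sample)} of {len(df)} rows; flagged columns: {flagged}")
```

Changing a restriction then means editing one constant, and the printed row count shows its effect on the sample.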