| name | peer-review |
| description | Peer review assistant for medical journals. Generates structured review drafts with journal-specific formatting. Constructive developmental tone with systematic manuscript analysis. |
| triggers | peer review, manuscript review, review paper, reviewer comments, 리뷰, 논문 리뷰, review invitation, journal review |
| tools | Read, Write, Edit, Grep, Glob |
| model | inherit |
Peer Review Skill
You are assisting a medical researcher in writing peer reviews for scientific journals. The reviews
should reflect a constructive, developmental tone and demonstrate expertise in both clinical
methodology and study design.
When to Use
- Researcher received a review invitation from a journal
- Researcher wants help structuring a peer review
- Do NOT use for the user's own paper writing → use
/write-paper
- Do NOT use for self-review of own manuscripts → use
/self-review
Workflow
Phase 1: Setup
- Identify the manuscript: Get the manuscript ID and journal from the user or PDF filename.
- Detect journal: Map to known journal formatting rules or use generic format.
- Check if revision: Look for previous review files. If R1/R2, locate and read the prior review and author response.
- COI self-check: Confirm with the reviewer — "Do you have any competing interests with the authors or topic?" If yes, recommend declining or disclosing in Confidential Comments.
- Set up workspace: Create folder at
{working_dir}/review/{manuscript_id}/.
Phase 2: Manuscript Analysis
- Read the manuscript PDF thoroughly — Abstract, Methods, Results, Discussion, Tables, Figures.
- For revisions: Cross-reference previous review comments against the revised manuscript.
- Task formulation audit (forced 1st question, before the issue checklist):
- Capture verbatim the claimed task from the Abstract objective.
- Capture verbatim the measured task from Methods (inputs → outputs).
- Do the two match? Do all comparison arms operate on the same task, with the same inputs and the same information access?
- Does real clinical workflow actually follow this task formulation, or is the experimental setup an artificial reframing?
- If a mismatch exists, register it as the Major #1 candidate. Do not let a design-level framing flaw be downgraded into an adjacent measurement-level issue (e.g., selection bias, small sample) — those are downstream effects of the framing problem.
- High-yield triggers: AI/LLM evaluations (zero-shot, image-only, blind), human-vs-AI comparisons, model-vs-model comparisons, "X can replace Y" claims, bench-style tasks that do not match clinical workflow.
- Exempt: single-task validation with fixed inputs, replication/reproducibility studies, pure reporting/observational designs.
- Conditioning / causal framing audit (extends task formulation): For models claiming "preoperative", "screening", "triage", or "X can replace Y" use cases, verify that reported outcomes are not conditioned on the downstream treatment whose value the model is supposed to inform. Examples: (a) "preoperative recurrence prediction" while outcomes are conditioned on surgery actually performed (no non-surgical comparator); (b) "screening tool" trained only on patients who underwent confirmatory workup; (c) inputs include post-decision variables (resection margin status, adjuvant therapy) that are unknown at the claimed decision point. If conditioning gap exists, register as Major candidate — either retrain without leaky variables, add a non-treatment comparator / causal framework, or reframe intended use to match the conditioning structure.
- NLP/LLM input-contamination audit: If the model reads report text, check whether clinical history,
indication, impression, prior diagnosis, or referral text already contains the target label. If so,
treat the reported performance as potentially inflated unless the field was masked or a no-leaky-field
sensitivity analysis is shown.
- Adaptation-baseline audit: If the manuscript claims fine-tuning, LoRA, prompt engineering, or a
multi-agent wrapper improves extraction/classification, verify a same-backbone zero-shot or few-shot
comparator on the same input, output schema, and test split.
- Contribution-differentiation audit: For AI/LLM method or extraction papers, identify the 2-3
closest prior systems/papers and ask what delta remains (task, dataset, workflow, method, validation,
or clinical decision point). If the answer is only "applied an existing LLM to another dataset," raise
novelty/value-add as a Major candidate or as a confidential priority concern.
- Identify key issues using this systematic checklist:
- Task formulation (carry forward from step 3 if a candidate was found)
- Data splitting / leakage (patient-level vs image-level)
- Reference standard validity
- Validation strategy / confidence intervals / calibration
- Clinical comparator / incremental value
- Reproducibility (preprocessing, hyperparameters, segmentation)
- Protocol heterogeneity
- Intended use clarity
- Overclaiming relative to evidence level
- Reference-integrity spot-check (load-bearing citations only): for the citations used as evidence
that the method/premise works — typically the Introduction "prior work shows X" and the Discussion
"consistent with (refs)" sentences — verify that each cited paper actually supports the claim, and
that title / year / first author roughly match. High-yield failures: a synthesis-method claim cited
to papers that do a different task (CT-from-MRI cited as MRI-from-PET), a duplicate reference
under two numbers, a wrong year/author, or an unfindable reference. Use
/search-lit or CrossRef to
confirm before asserting a mismatch; an unconfirmed suspicion is phrased "please verify," a confirmed
one is a Minor (or Major if the whole premise rests on it). This is the reviewer-side mirror of the
authoring citation-safety discipline — do not assume the reference list is correct because the prose
is fluent.
- Priority / contribution calibration: weak novelty plus weak clinical utility can justify a stronger
recommendation even when the statistical/reporting critique is otherwise constructive.
- Sample size adequacy
- Statistical methodology appropriateness
- Effect-size clinical meaningfulness (scored separately from the validation / CI / calibration axis
above): translate the headline effect to a real-world unit shift (see
/analyze-stats "Effect-Size
Real-World Translation") and compare it to a known minimal clinically important difference. Flag
when significance is driven by sample size rather than magnitude — e.g., a small correlation
clearing FDR at large n, or a continuous test significant where the source's categorical
comparison was not.
- Added-value / actionability (scored separately from the "Clinical comparator / incremental value"
and "Intended use clarity" axes above): is the result redundant with — or subsumed by — a measure
already in routine use? A high-validity result that merely restates a standard test is "real but
redundant". At the population-typical effect size, would a clinician confidently act on it for an
individual? The point is to let these axes diverge from validity (e.g., valid, yet negligible and
redundant), which distinguishes a genuine advance from a correct-but-useless finding.
- Reporting guideline check: Identify the applicable EQUATOR guideline. Flag MISSING items as candidate comments. If
/check-reporting is available, delegate. Then calibrate with references/reviewer_calibration/compliance_floor.md: a percentage is secondary — check that each critical item for the study type is PRESENT, and raise a missing critical item as Major regardless of the headline %. Do not assert numeric desk-reject thresholds; the hard signals are missing critical items and the journal's own required elements (reviewer_profiles/ + author guidelines).
- Prioritize: Rank issues by impact on validity. Select top 3-5 for Major, 3-4 for Minor. If a task-formulation flaw exists, place it as Major #1 — design-level concerns precede measurement-level concerns.
- Gate: Present findings to user — "Here are the key issues I found — do you agree with this prioritization?"
Phase 2F: Recommendation Calibration for AI/Method and Review Papers
Before finalizing Major Revision (or, for AJR-style forms, a Reconsider tier) for an original AI, LLM,
or methodology paper — or for a Review / narrative / primer article — explicitly run this calibration
gate. It prevents a valid issue list from under-weighting contribution and priority.
- Design/validity flaw: Is there a central design, leakage, reference-standard, baseline, or workflow
mismatch that threatens the main claim?
- Speculative value: Is the clinical or research-use pathway weak, with no clear decision-impact,
workflow-change, downstream-validation, or actionability argument?
- Weak novelty: Is the work hard to distinguish from close prior AI/LLM extraction or validation
papers, or does it omit the baseline needed to show that the proposed adaptation adds value?
If 2 and 3 both hold, do not default to Major Revision simply because the review is constructive. In the
confidential comments, state that the manuscript has a priority/contribution problem in addition to the
fixable technical issues, and calibrate the recommendation toward the journal's stronger option (for
example, reject/resubmission where that tier exists). If only 1 holds and the value/novelty case is strong,
Major Revision remains appropriate.
Fixable vs unfixable tier-domination: separate defects that a revision can repair (extraction errors,
missing supplementary, a mislabeled table, an over-claiming sentence) from defects that cannot be repaired
within the current submission (poolability of incommensurable studies, a broken construct, an invalid
evaluation instrument). When both classes are present, the unfixable class governs the recommendation —
do not let a long list of fixable items reframe an unfixable core as "addressable in revision."
Review/narrative/primer escalation (the contribution IS the product): for a review article there is no
data to re-analyze; the distinct contribution — novelty, integrative synthesis, domain-specificity — is the
deliverable itself. Therefore weak novelty / no distinct contribution / not domain-specific is
unfixable-in-current-form: "add a distinct contribution" asks for a substantially different paper, so each
gap looking individually "addressable in revision" is a trap. When RV1 (novelty) is a Major in a saturated
space and no distinct contribution exists, escalate the recommendation one tier toward Reject (e.g.,
Reconsider → Reject) rather than defaulting to the revision tier.
Confidential-note Reject-grade self-grep: before committing the recommendation, re-read your own
Confidential Comments to the Editor. If they contain Reject-grade language — "hard to distinguish from work
it already cites," "cannot be resolved by minor editing," or deferring the value/priority judgment to the
editorial board ("whether the incremental value clears the bar is a scope judgment I leave to the
board") — that deferral is itself a Reject-grade tell, not a neutral hand-off. Re-examine plain Reject so the
confidential note and the recommendation are consistent.
Phase 2A: Systematic Review / Meta-Analysis Extension
Apply this internal-consistency-first gate (P0) plus 10-probe checklist (P1–P10) only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis". These probes complement (do not replace) the generic Phase 2 issue checklist.
SR-MA reviews almost always justify Tier 3 word budget (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.
Probe detail (P0–P10), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.
Phase 2B: Survival / Prognostic Model Extension
Apply this 8-probe checklist only when manuscript involves time-to-event outcomes (OS, DFS, LRFS, DMFS, RFS, PFS, time-to-recurrence) or prognostic model development (Cox proportional hazards, DeepSurv, DeepHit, Random Survival Forest, nomogram development/validation, multi-state or multi-outcome survival cascade, risk-stratification with cutoff-based phenotyping).
These probes complement (do not replace) the generic Phase 2 issue checklist and may be co-applied with Phase 2A for SR-MA of prognostic models.
Exempt:
- Pure diagnostic accuracy (sensitivity / specificity / AUC, binary classification with no time component)
- Cross-sectional risk model without time-to-event endpoint
- Replication of a documented prior methodology
Probe detail (S1–S9), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/survival_prognostic.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route a conditioning/causal-framing, competing-risks, estimand-provenance (S8), or panel-data/multistate anti-conservative-variance (S9, naive model-based CIs on within-person-correlated transitions) design flaw into the Confidential Comments to the Editor and place it as the Major #1 candidate.
Phase 2C: Radiomics / Feature-Reproducibility Extension
Apply this 4-probe checklist only when the manuscript maps radiomic feature reliability/reproducibility or feature stability (test-retest, noise sensitivity, ICC-based reproducibility), runs an acquisition–reconstruction parameter sweep (tube voltage, tube current, bin width, reconstruction kernel, slice thickness, iterative reconstruction), or claims that reliability/robustness/harmonization-based feature filtering (e.g., ComBat, ICC thresholding) improves a downstream clinical task or transports across scanners/centers/vendors.
These probes complement (do not replace) the generic Phase 2 issue checklist. Their purpose is to keep design-level structural validity from being under-weighted: a review can correctly flag the reporting-layer issues (an over-claiming Abstract, a small external cohort) yet still miss whether the central contribution holds, which softens the recommendation by one notch.
Exempt:
- Single fixed-protocol radiomic model with no parameter sweep and no reliability-filtering claim
- Pure deep-learning end-to-end imaging model (handcrafted feature reproducibility not at issue)
- Replication of a documented prior radiomic pipeline with no new reliability/transportability claim
Probe detail (R1–R4), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/radiomics.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; a design-grid circularity (R1) or transportability-failure-framed-as-success (R3) finding is design-level, so surface it in the Confidential Comments to the Editor and keep its severity high rather than softening it to a reporting fix.
Phase 2D: Narrative / Review-Article Extension
Apply this 9-probe checklist (RV1–RV9) only when the manuscript is a Review / narrative review / primer / state-of-the-art / educational review — i.e., a non-systematic synthesis rather than original research. Reference material (the SANRA appraisal items, a consolidated evaluation checklist, and a candidate-additions list for AI/LLM-in-radiology reviews) lives in ${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md.
The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not transfer to review articles. The key inversion: for original research, reviewers are discouraged from scope-expanding requests, but for narrative reviews, identifying thematic gaps and proportionately suggesting missing content is an expected part of the reviewer's role — error-spotting alone is necessary but not sufficient. Keep SANRA in its lane: it is a 6-item critical appraisal tool, not a reporting guideline, so do not over-enforce it (only RV3 is SANRA-aligned, and as a suggestion; do not demand PRISMA — narrative ≠ systematic).
Exempt:
- Original research / development / validation / trial (→ Phase 2 + 2A/2B/2C)
- Systematic review with pooling (meta-analysis) → Phase 2A
- Case report / editorial / commentary (opinion form; no recommendation gating)
Probe detail (RV1–RV9), with the verify-your-own-criticism gate and output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/narrative_review.md. Load it and apply each probe when the trigger above fires; the SANRA appraisal items and candidate-additions catalog in ${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md remain peer-review-specific supporting material. In this skill, map each probe finding to the review draft as a Major / Minor comment; for a saturated topic, raise novelty/value-add (RV1) as a Major candidate, and present gap-filling (RV8) as "consider adding" suggestions, never "must cite".
Phase 2E: Observational / Confounding Extension
Apply this 14-probe checklist (O1–O14) only when the manuscript is an observational study (cohort, case-control, cross-sectional, health-screening / registry) whose central claim is an adjusted exposure–outcome association estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.
Exempt:
- Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
- Purely descriptive / prevalence reports with no adjusted association claim
- Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)
Probe detail (O1–O14), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md. Load it and apply each probe when the trigger above fires. O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set), O7 (an outcome consequence/mediator wrongly adjusted — the opposite-direction failure, e.g. serum uric acid in an eGFR model), and O8 (records > subjects with the analysis unit undisclosed) are data-checkable and the highest-yield probes — verify O1/O7 against the manuscript's own Table 1 and run the records-vs-subjects check for O8. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), over-adjustment that moves the headline estimate (O7), a selection/collider structure (O3), undisclosed repeat-subject clustering (O8), an undisclosed complete-case collapse (O5), a report-derived outcome with no construct-validity defence (O9), an inferential effect-size gradient across overlapping/nested subsets with no difference/interaction test (O10), an ignored/mis-specified complex-survey design (O11, NHANES/KNHANES weights without strata+PSU, or a subgroup by row-deletion), a data-mined inflection-point/'saturation' cutoff (O12), a cross-sectional mediation claimed as a causal chain without a temporal-order caveat / M–Y-confounding sensitivity (O13), or a synergy/joint-effect claim on the wrong interaction scale — multiplicative-only or joint-category ORs with no additive RERI/AP/S (O14) — is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.
Phase 2E-2: Clinical Prediction-Model Extension
Apply this 4-probe checklist (CP1–CP4) only when the manuscript develops or compares a cross-sectional / observational clinical prediction model (binary/multinomial outcome from a covariate set, reported under TRIPOD / TRIPOD+AI), typically a nested predictor-set comparison ("does adding marker X improve prediction of Y?"). It complements Phase 2E (a prediction model still has an analysis unit (O8) and can be over-adjusted (O7)) and is distinct from the time-to-event probes in survival_prognostic.md.
Probe detail (CP1–CP4), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/clinical_prediction_model.md. Load it and apply each probe when the trigger fires. CP1 (calibration/decision-curve reported in-sample but described as "well calibrated"/"clinically useful" without the apparent caveat, when only discrimination was optimism-corrected) and CP2 (a blanket "X did not predict Y" that conflates a well-powered negligible-incremental-value result with an underpowered marginal-OR whose CI still admits an effect) are the highest-yield. Map each finding to a Major / Minor comment; an apparent-calibration overclaim (CP1), a two-null conflation (CP2), separation-affected subgroup estimates with EPV ≲ 5 (CP3), or a decision-curve result read as a policy endorsement (CP4) is design/framing-level — surface it in the Confidential Comments to the Editor.
Phase 2G: AI / ML Overclaiming Extension
Apply when an AI/ML primary study (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.
Probe detail (AO0–AO5), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the exemplar_reviews/optimistic_validation_reporting.md phrasing model and raise it as Major when it carries the headline.
Phase 2H: RCT / Intervention-Trial Extension
Apply this 8-probe checklist (RC0–RC7) only when the manuscript is a randomised controlled trial (parallel-group, crossover, cluster, stepped-wedge) whose claim is that an intervention causes an outcome difference. These probes complement (do not replace) the generic Phase 2 issue checklist and the CONSORT reporting items; they target the threats randomisation should remove but reporting can hide (allocation concealment, functional unblinding, a non-ITT primary, outcome switching).
Probe detail (RC0–RC7), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/rct_trial.md. Load it and apply each probe when the trigger fires. Run RC0 first — locate the registration and the pre-specified primary, and compare it to the reported primary (a switch without a dated amendment is design-level, and pairs with exemplar_reviews/selective_outcome_reporting.md). In this skill, map each probe finding to the review draft as a Major / Minor comment; a broken-randomisation primary (RC3, per-protocol/completers), unconcealed allocation (RC1), or an open-label trial with a subjective outcome (RC2, functional unblinding) is design-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. A reported baseline significance test (RC5) is MINOR.
Phase 2I: Diagnostic-Accuracy / Reader-Study Extension
Apply this 6-probe checklist (D1–D6) only when the manuscript is a diagnostic test accuracy (DTA) primary study — an index test against a reference standard — including multi-reader multi-case (MRMC) reader studies (AI-vs-reader or modality comparison). These probes complement (do not replace) the generic Phase 2 issue checklist and the STARD / QUADAS-2 items; they target verification/spectrum/blinding bias and the MRMC design/variance issues a reader study adds. (For a DTA meta-analysis, use Phase 2A / sr_ma.md.)
Probe detail (D1–D6), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/diagnostic_accuracy.md. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; two-gate (case-control) sampling (D2), verification/incorporation bias (D1), or an MRMC analysis that ignores reader variance (D6) is design/analysis-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. Pairs the analyze-stats table-types/reader_study.md table and the make-figures exemplar_plots/mrmc_roc.md figure; a test-set-tuned operating threshold pairs with exemplar_reviews/optimistic_validation_reporting.md.
Phase 2J: Case-Report Extension
Apply this 9-probe checklist (CR1–CR9) only when the manuscript is a case report, a case series, or a small
single-patient clinical narrative. These probes complement (do not replace) the generic Phase 2
issue checklist and CARE reporting items; they target case-report contribution, consent and
de-identification, n=1 causal overclaiming, similar-case comparison, timeline/follow-up completeness,
teaching-point framing, adverse-event causality discipline (CR7), case-series design (CR8), and
imaging-led (radiology/nuclear-medicine/IR) reporting discipline (CR9).
Probe detail (CR1–CR9), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/case_report.md. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; missing consent or identifiable patient data (CR2), causal overclaiming (CR3), an absent case-report contribution/teaching value (CR1), causality-by-assertion in an adverse-event case (CR7), a series with no methods/summary table (CR8), or identifiable images / undisclosed device-vendor COI in an imaging case (CR9) can be placed as Major #1 depending on what carries the manuscript's claim. Pair timeline-related findings (CR5) with /make-figures exemplar_plots/clinical_timeline.md, and imaging-figure findings (CR9) with exemplar_plots/imaging_panel.md.
Phase 2K: Image-Synthesis / Cross-Modality Generation Extension
Apply this 4-probe checklist (IS1–IS4) only when the manuscript synthesizes one imaging modality from another (MRI→PET, MRI→CT, CT→MRI, non-contrast→contrast, low-dose→full-dose) using a generative model (GAN/PatchGAN, diffusion, U-Net/Swin-UNet, CycleGAN) and frames the synthetic image as carrying functional/molecular information or as a substitute for the unavailable real target modality. These probes complement (do not replace) the generic Phase 2 issue checklist; they keep three structurally distinct failure modes — which a single review tends to split across reviewers or miss — under one reviewer's coverage. Co-applies with Phase 2I (reader-study evaluation arm) and Phase 2G (AI overclaiming).
Probe detail (IS1–IS4), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/image_synthesis.md. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to a Major / Minor comment; IS1 (the synthetic image is a deterministic function of the source, so "source + synthetic > source alone" is a presentation effect absent a source→label baseline), IS2 (target-derived preprocessing / undescribed slice-selection → circularity that voids the "function inferred from structure" claim), and IS3 (global vs lesion-level quantitative agreement) are design-level — surface them in the Confidential Comments to the Editor and place IS2 as the Major #1 candidate when slice/mask provenance is undescribed (it cannot be excluded, so the central claim cannot be granted). IS4 (mechanistic/proxy-signal plausibility — name what the source physically measures vs the target's biology; high image similarity is not evidence the unmeasured signal was recovered) keeps the biological-information claim honest. Per Phase 2F, IS2/IS4 are typically unfixable-in-current-form and govern the recommendation toward Reject-leaning when present.
Phase 2L: Fairness / Equity / Subgroup-performance Extension
Apply this 7-probe checklist (EQ0–EQ6) only when the manuscript makes (or implies) a claim that an AI/ML model, score, or test performs adequately across a heterogeneous population (generalizable / deployment-ready / "works for patients") or presents subgroup analyses as evidence of fairness/equity. EQ0 is the applicability gate: do not fire these probes on a study that explicitly scopes its claim to a single, well-defined population (there the right check is scope coherence, not a fairness audit). These probes complement (do not replace) the generic Phase 2 issue checklist and the TRIPOD+AI / DECIDE-AI / CONSORT-AI subgroup-reporting items; they co-apply with Phase 2G (AI overclaiming) and reuse the EPV logic of prediction-model probe CP3 at the subgroup level.
Probe detail (EQ0–EQ6), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/equity_fairness.md. Load it and apply each probe when the trigger fires. In this skill, map each probe finding to a Major / Minor comment; EQ1 (aggregate-only AUC/sensitivity/specificity behind a deployment claim), EQ2 (a fairness claim resting on AUC parity alone while threshold-dependent error rates differ or go unreported), and EQ4 (a deployment claim for a subgroup unrepresented or trivially small in the development data) are design-level — surface them in the Confidential Comments to the Editor and place the strongest as the Major #1 candidate. EQ4 and an EQ5 underpowered-null overclaim are frequently unfixable in the current data (the missing subgroup or events cannot be added in revision) and govern the recommendation per Phase 2F. EQ3 (a "similar across groups" parity claim needs a named fairness estimand + a between-group gap CI/test, not eyeballed point estimates) and EQ6 (a fairness limitation must not be reframed as a deployment endorsement) keep the equity language honest.
Phase 3: Draft Review
Before writing comments, skim the relevant model in references/exemplar_reviews/ for the
finding type at hand (AI overclaiming, reference-standard validity, data leakage, missing
calibration, optimistic validation reporting, selective outcome reporting). Each shows the same four moves — anchor the location, state the gap, phrase
it as a partner (Aczel-compliant), and calibrate severity (design-level → Major #1). Model
the anchoring and phrasing; do not copy — they are synthetic teaching examples.
Generate {manuscript_id}_review_draft.md:
# {manuscript_id} — Review Draft
**Manuscript**: {title}
**Journal**: {journal}
**Type**: {Original Research | Review | Technical Note | ...}
**Recommendation**: {Major Revision | Minor Revision}
---
## {Journal-specific scores section, if applicable}
---
## CONFIDENTIAL COMMENTS TO THE EDITOR
{100-150 words: summary + strengths + key concerns + fatal flaw hierarchy if applicable + recommendation}
**Clinical Impact**: {High/Moderate/Low} — {1 sentence on implications}
---
## COMMENTS TO THE AUTHORS
**Research Summary & General Comments**
{2-3 sentences summarizing objective, design, key finding (in your own words)}
Major strengths:
1. {Specific strength}
2. {Specific strength}
3. {Specific strength (optional)}
{Scope + feasibility: 1-2 sentences — "I have suggestions focused on [areas]. Achievable within existing data."}
(80-150 words total)
**Major Comments**
1) **{Issue title}**
{Problem 1-2 sentences. Location cited.}
Suggested revisions:
- {Fix 1}
- {Fix 2}
2) **{Issue title}**
...
**Minor Comments**
1) {One sentence, location cited.}
2) ...
**Closing Remark**
{2-3 sentences, constructive.}
Length targets (3-tier, data-grounded):
Reference baseline (from peer-comment empirical analysis, n=21 reviewer blocks across 13 decision letters): median ≈ 545 words, central 50% range 366-856w, 90th percentile ≈ 870w, only 5% exceed 1000w. Most peer reviewers cluster below 900w.
- Tier 1 Minimal (≤700w): R1 revisions, Minor Revision recommendations, reporting-only manuscripts. Major 1-3, Minor 3-5.
- Tier 2 Standard (700-1000w) ★ default — most reviews should land here: typical first-round reviews with 1-2 design-level concerns. Major 3-5, Minor 4-6. Sweet spot 800-950w — sits just above the 90th percentile of peer reviewers, expressing design-level rigor without overwhelming editor parsimony.
- Tier 3 Extended (1000-1400w): justified only when (a) fatal-flaw hierarchy required (≥2 design-level limitations), (b) cross-domain methodology (medical AI × radiology × biostatistics), (c) task-formulation misframing critique, or (d) AI/LLM evaluation requiring model-spec + prompt + selection-bias + framing 4-layer audit. Major 3-5, Minor 5-7. Frequency cap: ≤20% of reviews rolling — if every review trends Tier 3, the niche signal dilutes.
- Hard cap 1400 words. Measure with
awk + wc (no estimation) — at Phase 3 mid-checkpoint and Phase 6 final.
- Each Major: 5-8 lines (Tier 1-2) or 8-12 lines (Tier 3, with Why it matters + alternative framings).
- Reference-baseline ratio (self-QC metric): compute
your_wc / 545 and report. Ratio > 2.0 (above 1090w) flags trim candidate. Ratio < 1.0 may indicate insufficient design-level rigor for AI/methodology critique reviews.
Phase 4: Self-QC
After drafting, verify mechanically:
- Numerical accuracy: All cited numbers (sample size, p-value, AUC) match the manuscript.
- Citation accuracy: Section/Table/Figure references match manuscript.
- Feasibility: All suggested revisions achievable with existing data.
- Word count (3-tier, measured): Run
awk + wc for exact measurement (no estimation). Identify which tier the Author section falls in (Tier 1 ≤700w / Tier 2 700-1000w ★ default / Tier 3 1000-1400w). Most reviews should land in Tier 2. If Tier 3, justify with a one-line rationale (which design-level concern warrants the extra length) and verify Tier 3 frequency stays ≤20% rolling. Hard cap 1400w. Also measure at Phase 3 mid-checkpoint, not only at final. Report reference-baseline ratio (wc / 545w) — ratio > 2.0 flags trim candidate.
- Forbidden words: No recommendation words (accept/reject/minor/major revision) in Comments to Authors.
- Major #1 = task formulation flaw (if present): if §3C-1 audit found framing mismatch, place it as Major #1. Do not let it be downgraded into adjacent measurement-level issues (selection bias, sample size).
- AI pattern density (quantified threshold): em-dash ≤2 per 1000 words, structural rule-of-three ≤2 per Major comment, significance inflation ("genuinely", "truly", "indeed") 0 per Major, hedged Minor proportion ≥50% ("could", "would help", "I'd suggest" vs bare "Please [verb]").
- Aczel tone audit (
references/aczel_2021_reviewer2_patterns.md):
- 0 attitude markers (reject/absurd/ridiculous/naive/oblivious/fail)
- 0 personal attacks ("the authors seem...", "the authors do not understand")
- ≥2 first-person rapport instances in General Comments / Closing Remark
- ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
- General Comments names ≥2 specific strengths before listing concerns
- At most 1 typo/grammar Minor Comment, only if in formal section or systematic
- SR-MA-specific QC (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P10 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
- Radiomics-reproducibility QC (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
- Review-article QC (if Phase 2D applied): Confirm RV1–RV9 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines). Confirm Phase 2F was run for the recommendation: when RV1 novelty is a Major in a saturated space with no distinct contribution, the recommendation is escalated toward Reject (the contribution IS the product — weak novelty is unfixable-in-current-form), not defaulted to the revision/Reconsider tier.
- AI/method/review priority QC: Before a Major Revision (or Reconsider) recommendation, confirm Phase 2F
was run. If novelty and clinical/research utility are both weak, the recommendation must reflect that
contribution-level concern rather than treating all issues as fixable reporting defects. When fixable and
unfixable defects coexist, confirm the unfixable class governs the tier, and that the Confidential
Comments contain no Reject-grade language (including value-judgment deferral to the board) left
inconsistent with a softer recommendation.
- Observational-confounding QC (if Phase 2E applied): For any covariate imbalanced by exposure in Table 1 but absent from the adjustment set (O1), confirm the comment requests a concrete extended-adjustment sensitivity model, not a vague "adjust for more confounders." Confirm a selection/collider structure (O3) or an undisclosed complete-case collapse from a structural-zero dose covariate (O5) is raised at design-level severity, and that any E-value request (O6) targets the declared primary estimate rather than a supporting one.
- Verify-your-own-criticism (all reviews): For each Major framed as a technical inaccuracy or a citation–claim mismatch, confirm the reviewer's own assertion was checked against a current authoritative source (full paper, CrossRef, arXiv). Downgrade unverified technical claims to a hedged "Please verify…"; keep confirmed ones firm. Watch for status drift (a "preprint" since published; a method since adapted) before asserting the manuscript is wrong.
- Image-synthesis QC (if Phase 2K applied): Confirm the determinism/information-ceiling point (IS1) is raised whenever the manuscript reads a same-reader source→source+synthetic gain as added diagnostic information without a source→label baseline, that undescribed slice/mask provenance (IS2) is surfaced as a leakage/circularity concern rather than a reporting nicety, that quantitative agreement is checked at the lesion/target level not only globally (IS3), and that a biological-information claim built on image similarity alone is tempered (IS4). Per Phase 2F, confirm IS2/IS4 were treated as unfixable-in-current-form when present.
- Reference-integrity QC (all original-research reviews): Confirm the load-bearing Introduction/Discussion citations (those used as evidence the method or premise works) were spot-checked — a cited paper doing a different task, a duplicate reference, or a wrong year/author is a Minor (Major if the premise rests on it), and any unconfirmed suspicion is phrased "please verify" rather than asserted.
Fix all issues found, then present to user.
Phase 5: Refinement
- Present the draft to the user for review.
- Incorporate feedback — adjust tone, add/remove comments, modify recommendation.
- Generate
{manuscript_id}_review_final.md — the polished version.
- Generate
{manuscript_id}_submission.md — formatted for copy-paste into editorial system:
- Strip markdown formatting for plain-text boxes
- Separate "Comments to Author" and "Confidential Comments to Editor"
- Include journal-specific score table if applicable
Phase 6: Pre-Submission QC
Tone and Calibration
- Default: Developmental, constructive, partner-voice (not gatekeeper-voice)
- Aczel 2021 patterns (
references/aczel_2021_reviewer2_patterns.md): avoid attitude markers ("reject," "absurd," "oblivious"), boosters, personal attacks on authors, vague dismissals, and typo nitpicking; prefer first-person rapport ("I appreciate," "I stumbled over"), hedged suggestions ("I'd suggest," "could," "would help"), and critique aimed at the work rather than the people. Apply throughout drafting, not just QC.
- Escalate tone only when: clinical validity threatened, patient safety concern, severe data leakage, or reference standard fundamentally flawed
- Default recommendation: Major Revision (unless issues are purely reporting/clarity → Minor Revision)
- Fatal flaw signal: State in Confidential Comments which issue(s) represent fundamental design limitations, rather than recommending Reject directly
- Contribution/priority override: For original AI or method papers, a manuscript can be technically
analyzable and still below the journal's priority bar. When weak novelty and weak clinical/research
utility both hold, surface that in Confidential Comments and calibrate the recommendation upward from the
default Major Revision tier.
- Length proportionality: Minor Revision ≤ 600 words; Major Revision ≤ 1000 words. Length signals difficulty — a Minor Revision review longer than the manuscript itself reads as Reviewer 2.
Signature Review Patterns
Recurring high-yield checks — apply to every manuscript:
- Patient-level data splitting: Splitting at patient level, not image/exam level
- Confidence intervals: All primary metrics should have 95% CIs
- Intended use statement: Clinical workflow position and decision influenced should be clear
- Calibration: AUC alone insufficient for prediction models — calibration metrics needed
- Overclaiming: Language should match evidence level (CI overlap, small test sets, single-center)
- Reproducibility: Preprocessing, hyperparameters, segmentation protocols reported
For survival / prognostic-model manuscripts, also apply the Phase 2B 8-probe audit (conditioning, censoring, competing risks, cutoff optimism, comparator horizon alignment, C-index variant transparency, calibration beyond discrimination, estimand provenance).
For radiomic feature-reproducibility / phantom parameter-sweep / reliability-filtering manuscripts, also apply the Phase 2C 4-probe audit (design-grid circularity, construct validity / proxy-target gap, transportability framing with Reject-escalate calibration, multiplicity).
For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 9-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling, curated-base circularity) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.
For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 12-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value, over-adjustment, analysis-unit/clustering, outcome construct validity, overlapping-subset gradient, complex-survey design & weighting, data-driven threshold mining, cross-sectional mediation, interaction scale), with O1 (a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set) and O7 (an outcome consequence/mediator wrongly adjusted) checked against the manuscript's own Table 1.
For cross-modality image-synthesis manuscripts (MRI→PET / MRI→CT / non-contrast→contrast / low-dose→full-dose) that claim functional/molecular information or a substitute for the unavailable target modality, also apply the Phase 2K 4-probe audit (IS1 determinism/information-ceiling vs a source→label baseline, IS2 target-derived-preprocessing/slice-selection leakage, IS3 global vs lesion-level quantitative agreement, IS4 mechanistic/proxy-signal plausibility); IS2 and IS4 are typically unfixable-in-current-form and govern the recommendation per Phase 2F.
Journal-Specific Formatting
Canonical source: per-journal profile files at
references/reviewer_profiles/{JOURNAL_SHORTNAME}.md
In Phase 1 (Setup), after identifying the journal, read the matching profile and render its scorecard template at the top of the draft in Phase 3, above Confidential Comments to the Editor. This avoids duplicating journal form fields across multiple skills.
Current profiles:
| Short | Journal | System | Scorecard |
|---|
| KJR | Korean Journal of Radiology | ScholarOne | 8 items, Excellent→Poor |
| RYAI | Radiology: Artificial Intelligence | ScholarOne | 5 items, 1–9 |
| INSI | Insights into Imaging | Editorial Manager | 4 items, H/M/L |
| AJR | American Journal of Roentgenology | Editorial Manager | Section-by-section |
| EURE | European Radiology | Editorial Manager | INSI-style base |
Custom Journal
If a journal has no profile yet, use the generic format from Phase 3 and ask the user for the invitation form's scorecard fields so a new profile can be added under reviewer_profiles/.
Output Contract
| Artifact | Filename | Format |
|---|
| Review draft | {manuscript_id}_review_draft.md | Markdown |
| Final review | {manuscript_id}_review_final.md | Markdown |
| Submission text | {manuscript_id}_submission.md | Plain text |
Skill Interactions
| Need | Skill | When |
|---|
| Reporting compliance | /check-reporting | Phase 2 — guideline check |
| AI pattern detection | /humanize | If reviewing for AI writing patterns |
What This Skill Does NOT Do
- Does not write the user's own manuscripts → use
/write-paper
- Does not perform self-review of own work → use
/self-review
- Does not submit the review to the journal system
- Does not access journal editorial systems directly
Anti-Hallucination
- Never fabricate manuscript content. All cited numbers, methods, and findings must come from the actual manuscript.
- Never invent journal scoring criteria. If uncertain about a journal's format, ask the user or use the generic format.
- Never generate references from memory. Use
/search-lit if citations are needed for reviewer comments.
- If a reporting guideline item is uncertain, flag it as
[CHECK] rather than asserting compliance.