| name | add-yardstick-metric |
| description | Guide for creating new yardstick metrics. Use when a developer needs to extend yardstick with a custom performance metric, including numeric, class, probability, ordered probability, survival (static, dynamic, integrated, linear predictor), and quantile metrics. |
Add Yardstick Metric
Guide for developing new metrics that extend the yardstick package. This skill
provides best practices, code templates, and testing patterns for creating
custom performance metrics.
Two Development Contexts
This skill supports two distinct development contexts:
🆕 Extension Development (Default)
Creating a new R package that extends yardstick with custom metrics.
-
✅ Use this for: New packages, standalone metrics, CRAN submissions
-
⚠️ Constraint: Can only use exported functions (no :::)
🔧 Source Development (Advanced)
Contributing directly to yardstick via pull requests.
Getting Started
INSTRUCTIONS FOR CLAUDE: Run the verification script first to determine the
development context:
Rscript -e 'source(Sys.glob(path.expand("~/.claude/plugins/cache/tidymodels-skills/tidymodels-dev/*/tidymodels/shared-references/scripts/verify-setup.R"))[1])'
Then follow the appropriate path based on the output:
Overview
Creating a custom yardstick metric provides:
-
Standardization with existing metrics
-
Automatic error handling for types and lengths
-
Support for multiclass implementations
-
NA handling
-
Grouped data frame support
-
Integration with metric_set()
-
Optional autoplot support for visualization (curves and confusion matrices)
Repository Access (Optional but Recommended)
INSTRUCTIONS FOR CLAUDE: Check if repos/yardstick/ exists in the current
working directory. Use this to guide development:
If repos/yardstick/ exists:
-
✅ Use it as a reference throughout development
-
Read source files (e.g., repos/yardstick/R/prob-brier_class.R) to study
implementation patterns
-
Read test files (e.g., repos/yardstick/tests/testthat/test-prob-roc_auc.R)
for testing patterns
-
Reference these files when answering complex questions or solving problems
-
Look at actual code structure, validation patterns, and edge case handling
If repos/yardstick/ does NOT exist:
-
Suggest cloning the repository using the scripts in Repository Access
Guide
-
This is optional but strongly recommended for high-quality development
-
If the user declines, reference files using GitHub URLs:
When to use repository references:
-
Complex implementation questions (e.g., "How does yardstick handle multiclass
averaging?")
-
Debugging issues (compare user's code to working implementation)
-
Understanding patterns (study similar metrics)
-
Test design (see how yardstick tests edge cases)
-
Architecture decisions (understand internal structure)
See Repository Access Guide for setup
instructions.
Quick Navigation
Development Guides:
Reference Files:
Shared References (Extension Development):
Source Development Specific:
Development Workflow
See Development Workflow for
complete details.
Fast iteration cycle (run repeatedly):
devtools::document() - Generate documentation
devtools::load_all() - Load your package
devtools::test() - Run tests
Final validation (run once at end):
devtools::check() - Full R CMD check
WARNING: Do NOT run check() during iteration. It takes 1-2 minutes and is
unnecessary until you're done.
Critical: Tidymodels Code Standards
These requirements ensure your code aligns with tidymodels patterns and passes
review.
Source Development (PRs to yardstick)
When contributing to yardstick itself, you MUST:
1. Use internal helpers - Don't reimplement what exists:
mae_impl <- function(truth, estimate, case_weights = NULL) {
errors <- abs(truth - estimate)
yardstick_mean(errors, case_weights = case_weights)
}
mae_impl <- function(truth, estimate, case_weights = NULL) {
errors <- abs(truth - estimate)
if (is.null(case_weights)) {
mean(errors)
} else {
weighted.mean(errors, w = as.double(case_weights))
}
}
Common internal helpers: yardstick_mean(), finalize_estimator_internal(),
check_*_metric(), yardstick_remove_missing(), yardstick_any_missing()
2. NO package prefix - Functions are in the same package:
yardstick_mean(errors, case_weights = case_weights)
check_numeric_metric(truth, estimate, case_weights)
yardstick::yardstick_mean(errors, case_weights = case_weights)
yardstick::check_numeric_metric(truth, estimate, case_weights)
3. Include @examples - Required for all exported functions:
Extension Development (new packages)
When creating packages that extend yardstick, you MUST:
1. ALWAYS use package prefix - Explicitly namespace all yardstick functions:
yardstick::check_numeric_metric(truth, estimate, case_weights)
yardstick::new_numeric_metric(wape, direction = "minimize")
yardstick::numeric_metric_summarizer(...)
check_numeric_metric(truth, estimate, case_weights)
new_numeric_metric(wape, direction = "minimize")
2. NEVER use internal functions - Cannot access ::::
yardstick:::yardstick_mean(errors, case_weights)
yardstick:::finalize_estimator_internal(estimator, ...)
if (!is.null(case_weights)) {
case_weights <- as.double(case_weights)
weighted.mean(errors, w = case_weights)
} else {
mean(errors)
}
3. Include @examples - Required for usability:
Both Contexts: Case Weights Handling
ALWAYS convert hardhat weight objects to numeric. This is required in all
_impl functions that accept case weights:
Extension development example:
wape_impl <- function(truth, estimate, case_weights = NULL) {
errors <- abs(truth - estimate)
if (!is.null(case_weights)) {
if (inherits(case_weights, c("hardhat_importance_weights",
"hardhat_frequency_weights"))) {
case_weights <- as.double(case_weights)
}
weighted.mean(errors, w = case_weights)
} else {
mean(errors)
}
}
Source development example:
mae_impl <- function(truth, estimate, case_weights = NULL) {
errors <- abs(truth - estimate)
if (!is.null(case_weights)) {
if (inherits(case_weights, c("hardhat_importance_weights",
"hardhat_frequency_weights"))) {
case_weights <- as.double(case_weights)
}
}
yardstick_mean(errors, case_weights = case_weights)
}
Why this matters: Hardhat weights are S3 objects, not plain numerics.
Without conversion, arithmetic operations and functions like weighted.mean()
will fail. This conversion is mandatory in every _impl function that handles
case weights.
Choosing Your Metric Type
┌─────────────────────────────────────────────────────────────┐
│ What type of data do you have? │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┬───────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
Regression Classification Survival Analysis Quantile Forecasting
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ NUMERIC │ │ CLASS-BASED │ │ SURVIVAL │ │ QUANTILE │
│ METRICS │ │ METRICS │ │ METRICS │ │ METRICS │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
│ ├───────────────────┤ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Classes │ │Probabilities│ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ├───────────────────┤ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Unordered? │ │Unordered? │ │
│ │ CLASS │ │ PROBABILITY │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Ordered? │ │Ordered? │ │
│ │ORDERED PROB │ │ORDERED PROB │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌───────────────────┬───────────────┬─────────┤
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Examples: Examples: Examples: Examples: Examples:
- MAE - Accuracy - ROC AUC - Concordance - WIS
- RMSE - Precision - Log Loss - Brier - Pinball
- R² - F1 Score - PR AUC - Royston's D - Coverage
- Ranked Prob Score
Survival Metrics Breakdown:
- STATIC: Single overall value (e.g., Concordance)
- DYNAMIC: Value per time point (e.g., Time-dependent Brier)
- INTEGRATED: Averaged across time (e.g., Integrated Brier)
- LINEAR PRED: From Cox models (e.g., Royston's D)
Decision guide:
Complete Example: Numeric Metric (MAE)
For a complete, step-by-step implementation of a numeric metric (MAE), see the
comprehensive example with all required components in:
👉 Numeric Metrics Reference
This reference includes:
-
Implementation function (mae_impl) with case weights handling
-
Vector interface (mae_vec) with validation and NA handling
-
Data frame method with new_numeric_metric() wrapper
-
Complete test suite covering correctness, NA handling, input validation, and
case weights
-
Working examples you can adapt for your own metrics
Quick preview of the pattern:
-
_impl() function: Core calculation logic
-
_vec() function: Validation and NA handling
-
.data.frame() method: Integration with yardstick system
See also Extension Development Guide for the
complete implementation walkthrough.
Implementation Guide by Metric Type
Numeric Metrics
Use for: Regression metrics where truth and predictions are continuous
numbers.
Pattern: Three-function approach (_impl, _vec, data.frame method)
Complete guide: Numeric Metrics
Key points:
-
Always use check_numeric_metric() for validation
-
Handle case weights with weighted.mean()
-
Use yardstick_remove_missing() for NA handling
-
Return .estimator = "standard"
Examples: MAE, RMSE, MSE, Huber Loss, R-squared
Reference implementations:
-
Simple metrics: R/num-mae.R, R/num-rmse.R, R/num-mse.R
-
Parameterized metrics: R/num-huber_loss.R (has delta parameter)
-
Complex metrics: R/num-ccc.R (correlation-based)
Class Metrics
Use for: Classification metrics where truth and predictions are factor
classes.
Pattern: Uses confusion matrix from yardstick_table()
Complete guide: Class Metrics
Key points:
-
Use yardstick_table() to create weighted confusion matrix
-
Implement separate _binary and _estimator_impl functions
-
Handle factor level ordering with event_level parameter
-
Support multiclass with macro, micro, macro_weighted averaging
Examples: Accuracy, Precision, Recall, F1, Specificity
Reference implementations:
-
Simple metrics: R/class-accuracy.R, R/class-precision.R,
R/class-recall.R
-
Combined metrics: R/class-f_meas.R (F1 score)
-
Balanced metrics: R/class-bal_accuracy.R (handles class imbalance)
Probability Metrics
Use for: Metrics that evaluate predicted probabilities against true classes.
Pattern: Similar to class metrics but uses probability columns
Complete guide: Probability Metrics
Key points:
-
Truth is factor, estimate is probabilities
-
Convert factor to binary for binary metrics
-
Handle multiple probability columns for multiclass
-
Use check_prob_metric() for validation
Examples: ROC AUC, Log Loss, Brier Score, PR AUC
Reference implementations:
-
Curve-based: R/prob-roc_auc.R, R/prob-pr_auc.R
-
Scoring rules: R/prob-brier_class.R, R/prob-mn_log_loss.R
Ordered Probability Metrics
Use for: Ordinal classification metrics where class ordering matters.
Pattern: Three-function approach with cumulative probabilities
Complete guide: Ordered Probability
Metrics
Key points:
-
Truth must be ordered factor
-
Uses cumulative probabilities to respect ordering
-
Use check_ordered_prob_metric() for validation
-
No averaging types (works same for any number of classes)
Examples: Ranked Probability Score (RPS)
Static Survival Metrics
Use for: Overall survival metrics with single numeric predictions.
Pattern: Three-function approach with Surv objects
Complete guide: Static Survival
Metrics
Key points:
-
Truth is Surv object from survival package
-
Estimate is single numeric per observation
-
Handles right-censoring with comparable pairs
-
Use check_static_survival_metric() for validation
Examples: Concordance Index (C-index)
Dynamic Survival Metrics
Use for: Time-dependent survival metrics at specific evaluation times.
Pattern: Three-function approach with list-column predictions
Complete guide: Dynamic Survival
Metrics
Key points:
-
Truth is Surv object
-
Estimate is list-column of data.frames with .eval_time, .pred_survival,
.weight_censored
-
Returns multiple rows (one per eval_time)
-
Uses inverse probability of censoring weights (IPCW)
Examples: Time-dependent Brier Score, Time-dependent ROC AUC
Integrated Survival Metrics
Use for: Overall survival metrics integrated across evaluation times.
Pattern: Two-function approach (calls dynamic metric, then integrates)
Complete guide: Integrated Survival
Metrics
Key points:
-
Same input format as dynamic survival metrics
-
Integrates across time using trapezoidal rule
-
Normalizes by max evaluation time
-
Requires at least 2 evaluation times
Examples: Integrated Brier Score, Integrated ROC AUC
Linear Predictor Survival Metrics
Use for: Metrics for linear predictors from Cox models.
Pattern: Three-function approach with transformations
Complete guide: Linear Predictor Survival
Metrics
Key points:
-
Truth is Surv object
-
Estimate is linear predictor values (unbounded)
-
Often uses transformations (e.g., normal scores)
-
Use check_linear_pred_survival_metric() for validation
Examples: Royston's D statistic, R²_D
Quantile Metrics
Use for: Quantile prediction metrics for uncertainty quantification.
Pattern: Three-function approach with quantile_pred objects
Complete guide: Quantile Metrics
Key points:
-
Truth is numeric
-
Estimate is hardhat::quantile_pred object
-
Handles missing quantiles (impute, drop, or propagate)
-
Uses fn_options for additional parameters
Examples: Weighted Interval Score (WIS), Pinball Loss
Documentation
See Roxygen Documentation for
complete templates.
Required roxygen tags:
Testing
See Testing Patterns
(Extension)
for comprehensive guide.
Required test categories: 1. Correctness: Metric calculates correctly 2.
NA handling: Both na_rm = TRUE and FALSE 3. Input validation: Wrong
types, mismatched lengths 4. Case weights: Weighted and unweighted differ 5.
Edge cases: All correct, all wrong, empty data
Common Patterns
Using the confusion matrix
See Confusion Matrix for complete guide.
xtab <- yardstick::yardstick_table(truth, estimate, case_weights)
tp <- xtab[2, 2]
tn <- xtab[1, 1]
fp <- xtab[1, 2]
fn <- xtab[2, 1]
Handling case weights
See Case Weights for complete guide.
IMPORTANT: Always convert hardhat weights to numeric using as.double()
if (!is.null(case_weights)) {
if (inherits(case_weights, c("hardhat_importance_weights",
"hardhat_frequency_weights"))) {
case_weights <- as.double(case_weights)
}
}
if (is.null(case_weights)) {
mean(values)
} else {
weighted.mean(values, w = case_weights)
}
Why as.double() is required:
-
hardhat weight objects are S3 classes, not plain numeric vectors
-
Base R functions like weighted.mean() expect numeric vectors
-
Without conversion, you'll get errors like "non-numeric argument"
-
Source development can use yardstick_mean() which handles this internally
-
Extension development must do the conversion manually
Multiclass averaging
See Class Metrics for complete guide.
Macro averaging: Average of per-class metrics (treats all classes equally)
Micro averaging: Pool all observations, calculate once (treats all
observations equally) Macro-weighted averaging: Weighted average by class
prevalence
Advanced Topics
Combining Metrics with metric_set()
Once you've created your metric, you can combine it with other metrics using
metric_set():
my_metrics <- metric_set(mae, rmse, my_custom_metric)
my_metrics(data, truth = y, estimate = y_pred)
Key benefits:
-
Calculate multiple metrics at once
-
More efficient (shared calculations)
-
Integrates with tune package
-
Works with grouped data
See Combining Metrics for complete guide.
Creating Groupwise Metrics
Groupwise metrics measure disparity in metric values across groups (useful for
fairness):
accuracy_diff <- new_groupwise_metric(
fn = accuracy,
name = "accuracy_diff",
aggregate = function(x) diff(range(x$.estimate))
)
accuracy_diff_by_group <- accuracy_diff(group_column)
accuracy_diff_by_group(data, truth, estimate)
Use cases:
See Groupwise Metrics for complete guide.
Package-Specific Patterns (Source Development)
If you're contributing to yardstick itself, you have access to internal
functions and conventions not available in extension development.
File Naming Conventions
Yardstick uses strict naming patterns:
-
Numeric: R/num-[name].R → R/num-mae.R
-
Class: R/class-[name].R → R/class-accuracy.R
-
Probability: R/prob-[name].R → R/prob-roc_auc.R
-
Tests: tests/testthat/test-num-mae.R
Internal Functions Available
When developing yardstick itself, you can use:
-
yardstick_mean() - Weighted mean with case weight handling
-
finalize_estimator_internal() - Estimator selection for multiclass
-
check_numeric_metric(), check_class_metric(), check_prob_metric() -
Validation
-
Test helpers: data_altman(), data_three_class(), data_hpc_cv1()
Documentation Templates
Yardstick uses templates in man-roxygen/:
Snapshot Testing
Yardstick uses testthat::expect_snapshot() extensively:
test_that("mae returns correct structure", {
expect_snapshot(mae(df, truth, estimate))
})
Complete source development guide: Source Development
Guide
Package Integration
Package-level documentation
See Package Imports for complete guide.
Create R/{packagename}-package.R:
"_PACKAGE"
NULL
Exports
All metrics must be exported:
mae <- function(data, ...) {
UseMethod("mae")
}
Best Practices
See Best Practices
(Extension) for
complete guide.
Key principles:
-
Use base pipe |> not magrittr pipe %>%
-
Prefer for-loops over purrr::map() for better error messages
-
Use cli::cli_abort() for error messages
-
Keep functions focused on single responsibility
-
Validate early (in _vec), trust data in _impl
Troubleshooting
See Troubleshooting
(Extension)
for complete guide.
Common issues:
-
"No visible global function definition" → Add to package imports
-
"Object not found" in tests → Use devtools::load_all() before testing
-
NA handling bugs → Check both na_rm = TRUE and FALSE cases
-
Case weights not working → Convert hardhat weights to numeric
File Creation Guidelines
Extension development (creating new package):
-
R/[metric_name].R (with complete roxygen docs and examples)
-
tests/testthat/test-[metric_name].R (comprehensive tests)
-
README.md (ONLY if package has no README - check first!)
Typically creates 2 files (3 if README needed); acceptable to reach 4-6 if
implementation requires it.
Source development (PR to yardstick):
-
R/[type]-[metric_name].R (e.g., R/num-mae.R, R/class-accuracy.R)
-
tests/testthat/test-[type]-[metric_name].R
Creates exactly 2 files; acceptable to reach 3-5 if implementation requires it.
❌ AVOID creating these files:
-
README.md or README.txt (for PRs - yardstick already has one)
-
NEWS_entry.md (mention in conversation - maintainer adds to NEWS.md)
-
IMPLEMENTATION_SUMMARY.md, IMPLEMENTATION_NOTES.md, IMPLEMENTATION_NOTES.txt
-
QUICKSTART.md, QUICK_REFERENCE.md, INTEGRATION_GUIDE.md
-
example_usage.R, USAGE_EXAMPLE.R (examples go in roxygen @examples)
-
PR_CHECKLIST.md, PR_DESCRIPTION.md, PR_SUMMARY.md
-
INDEX.md, FILE_GUIDE.md, SUMMARY.md, SUMMARY.txt, OVERVIEW.md
-
metric_examples.R, test_examples.R
-
METRIC_DESIGN.md, VALIDATION_APPROACH.md
-
pkgdown_update.txt, pkgdown_addition.yml
-
WORKFLOW_COMMANDS.sh, setup.sh
-
verification_script.R, check_metric.R
-
ANY other .md, .txt, .yml, .sh files beyond the 2-3 core files
Where everything goes:
-
Examples → roxygen @examples in R file
-
Implementation notes → roxygen @details in R file
-
Metric design rationale → roxygen @details in R file
-
PR description → conversation with user
-
NEWS entry → conversation (maintainer adds it)
-
Usage guide → README.md (extension dev only) or roxygen examples
Related Skills
Next Steps
For Extension Development (creating new packages):
- Extension prerequisites: Extension
Prerequisites - START HERE
For Source Development (contributing to yardstick):
- Start here: Source Development Guide
- Clone repository: See Repository
Access
- Study existing metrics: Browse
R/num-*.R, R/class-*.R, etc.
- Follow package conventions: File naming, internal functions, templates
- Test with internal helpers: See Testing Patterns
(Source)
- Submit PR: See Source Development Guide for
PR process