| Interview duration | 45-60 min | Standard moderated session | Keep guides scoped to fit |
| Usability sample (qualitative) | 5-8 users | Uncovers ~85% of frequent issues | Do not over-recruit before first findings |
| Usability sample (quantitative) | ≥30 users | Statistical validity for benchmarks | Required for SUS/NPS/task-completion benchmarking |
| Benchmark precision (±20%) | 20 users | Rough directional benchmark | Acceptable for early-stage internal comparison |
| Benchmark precision (±10%) | ~80 users | Reliable benchmark comparison | Recommended for cross-release or competitor benchmarking |
| Benchmark precision (±5%) | ~320 users | High-precision benchmark | Required for published reports or regulatory claims |
| Usability-only sample | 5-6 users | Small focused tests | Use for fast evaluative studies |
| Focus group | 6-8 per group | Discussion balance | Avoid larger groups |
| Diary study | 10-15 participants | Longitudinal signal | Use only when behavior unfolds over time |
| Tasks per usability session | 3-4 max | Avoids priming and fatigue | Exceeding 4 risks earlier tasks biasing later task paths |
| Task completion | ≥78% (industry avg); >92% top quartile | Usability success baseline | Investigate if below 78%; target >92% for best-in-class UX |
| SUS | >68 (avg); >70 good; >85 excellent | Perceived usability scale | SUS 80+ correlates with ~100% task completion |
| SEQ | >5.5/7 (avg) | Post-task ease rating | Investigate tasks scoring below average |
| NPS (consumer software) | >+21 (industry avg) | Loyalty benchmark | Context-dependent; NPS is a score (−100 to +100), not a percentage; compare within vertical |
| AI transcription accuracy | 95–98% (clear audio) | Automated transcription reliability | Verify against source for accented/noisy audio; drops below 90% for non-native speakers |
| AI theme extraction agreement | 80–85% vs expert coders | First-pass coding reliability | Always human-review the 15–20% gap; AI misses context-dependent nuance |
| AI researcher adoption | 80% of researchers | AI is baseline in research workflows (Maze 2026) | Design for AI-augmented workflows; ensure human judgment on interpretation |
| AI synthesis time reduction | up to 80% | Qualitative coding acceleration | AI handles transcription/initial coding; researcher owns interpretation and synthesis |
| AI moderation pilot | 2-3 self-runs + 5-10 participant sessions | Pre-scale validation | Pilot yourself 2-3 times, then review 5-10 real sessions before launching AI-moderated interviews at scale |
| UEQ (User Experience Questionnaire) | 26 items, −3 to +3 scale | Pragmatic + hedonic UX quality with public benchmarks | Use alongside SUS for richer quality assessment; compare against UEQ benchmark dataset |
| Research strategic adoption | 22% of orgs (up from 8% in 2025) | Research essential to all business strategy levels (Maze 2026) | Frame research as strategic asset; design for org-wide research integration |
| Synthetic-real split | 80/20 | Rapid hypothesis via synthetic, deep insight via human | Use synthetic for iterations/screening/hypothesis; reserve human interviews for emotional depth, edge cases, cultural nuance |
| CASTLE (workplace UX) | 6 dimensions | Cognitive load, Advanced feature usage, Satisfaction, Task efficiency, Learnability, Errors | Use instead of SUS/HEART for compulsory workplace software where users cannot choose the product |
| Calibration | 3+ studies | Minimum evidence to adjust method weights | Do not recalibrate before this |
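The three benchmark-precision rows (20 → ±20%, 80 → ±10%, 320 → ±5%) follow the binomial margin-of-error formula, where halving the margin requires quadrupling the sample. A minimal sketch, assuming a 95% confidence level (z ≈ 1.96) and the worst-case proportion p = 0.5 — illustrative defaults, not values stated in the table:

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Approximate binomial margin of error for a sample of n users.

    Assumes 95% confidence (z = 1.96) and worst-case p = 0.5;
    both are illustrative defaults, not taken from the table.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Halving the margin of error requires quadrupling the sample:
for n in (20, 80, 320):
    print(f"n={n:3d} -> +/-{margin_of_error(n):.1%}")
```

Running this yields roughly ±22%, ±11%, and ±5.5%, matching the table's ±20/±10/±5 tiers as rounded rules of thumb.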
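The SUS row (>68 average) uses the standard 10-item scoring rule: odd-numbered items contribute (rating − 1), even-numbered items contribute (5 − rating), and the 0–40 sum is scaled by 2.5 onto a 0–100 range. A minimal scoring sketch:

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 ratings.

    Standard SUS scoring: odd-numbered items (index 0, 2, ...) add
    (rating - 1), even-numbered items add (5 - rating); the 0-40 sum
    is scaled by 2.5 to the familiar 0-100 range.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5
```

A fully neutral respondent (all 3s) scores 50, well below the 68 average benchmark, which is why "above the midpoint" is not the same as "above average" for SUS.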