一键导入
review-xla-pr
// Review an XLA pull request with deep expertise in HLO optimizations, GPU backend, Triton codegen, autotuner, and AMD/ROCm parity. Use when asked to review an XLA PR or check an XLA change.
// Review an XLA pull request with deep expertise in HLO optimizations, GPU backend, Triton codegen, autotuner, and AMD/ROCm parity. Use when asked to review an XLA PR or check an XLA change.
| name | review-xla-pr |
| description | Review an XLA pull request with deep expertise in HLO optimizations, GPU backend, Triton codegen, autotuner, and AMD/ROCm parity. Use when asked to review an XLA PR or check an XLA change. |
| argument-hint | ["PR-number"] |
| context | fork |
| agent | general-purpose |
| allowed-tools | Bash(gh pr view *), Bash(gh pr diff *), Bash(git log *), Bash(git show *), Bash(git diff *), Bash(git blame *), Bash(head *), Bash(grep *), Read, Grep, Glob |
This skill is read-only. Do NOT post any comments, reviews, or reactions to GitHub.
Do NOT use gh pr comment, gh pr review, gh api with POST/PUT/PATCH/DELETE,
or any other write operation. Only return your findings as text output — the caller
is responsible for posting them.
Before reviewing, fetch the PR context by running these commands:
gh pr view $ARGUMENTS — PR title, description, authorgh pr diff $ARGUMENTS --name-only — list of changed filesgh pr diff $ARGUMENTS — full diffgh pr view $ARGUMENTS --comments 2>/dev/null || echo "(no comments yet)" — existing review commentsReview the PR above thoroughly using the checklist below. Be specific: cite file paths and line numbers from the diff. Describe each finding clearly and explain why it matters. Do NOT assign severity labels (no BLOCKER/WARNING/NIT) — let the human reviewer judge importance.
CRITICAL SCOPE RULE: Only flag issues that exist in the PR diff itself — lines added or modified by this PR. Do NOT flag pre-existing code that the PR did not touch, even if that code is in the same files. If a pre-existing problem is worth noting, mention it briefly in a separate "Pre-existing issues (out of scope)" section at the end, never as a BLOCKER or WARNING against this PR.
HloModule / HloComputation / HloInstruction invariants?DfsHloVisitor Handle* methods updated?HloPassPipeline / HloPassFix used correctly? Fixed-point passes must not loop infinitely (check kMaxIterations = 25 and cycle detection).Cleanup() before dependent passes run?GpuCompiler::RunHloPasses() / OptimizeHloPostLayoutAssignment()?HloPassFix<>, has convergence been verified?HloCSE and HloDCE run after the new pass where needed?PriorityFusion, MultiOutputFusion, or gpu_fusible.cc:
time_unfused - time_fused) accurate?FusionFitsInParameterLimit() and FusionFitsInBudget() (max 96 operands)?kMaxIRSize = 10000, kMaxBasicBlockSplitsPerFusion = 10) respected?HloFusionAnalysisCache properly invalidated after graph mutations?EmitterFusionKind is added, does it have a corresponding emitter in xla/backends/gpu/codegen/emitters/?ReductionSplitter, SplitKGemmRewriter, VariadicOpSplitter): are correctness constraints documented and tested?FusionProcessDumpProto updated to log new fusion decisions?xla/backends/gpu/codegen/triton/:
xla/backends/gpu/codegen/triton/support.cc)?ttxla.* dialect ops defined in triton_xla_ops.td with correct semantics?/transforms/ handle rank-1 edge cases and TMA constraints (Hopper+)?compilation_pipeline_rocm.cc updated alongside the CUDA path?device_info.shared_memory_per_block_optin()?ttxla.block_barrier / ttxla.atomic_write semantics correct?BlockLevelFusionConfig changes: are all parameters (num_warps, num_ctas, num_stages) validated as > 0?xla/backends/autotuner/ or xla/service/gpu/autotuning/:
autotune_cache_key.h)?TritonDotFusionSearchSpace::GenerateConfigs() produce valid configs for the affected shapes and hardware?CanProduceWrongResults() changes for any backend, is the relative tolerance adjusted?factory_rocm.cc updated if new backends are added? (Backend order: Triton → MIOpen → rocBLAS → hipBLASLt)default_configs/rocm.txtpb and CUDA equivalents updated for affected architectures?READ_WRITE and READ (inference-time) cache modes?compilation_pipeline_cuda.cc ↔ compilation_pipeline_rocm.ccfactory_cuda.cc ↔ factory_rocm.ccstream_executor/cuda/ ↔ stream_executor/rocm/convert_float_amd.cc handle them? (BF16 and F8 semantics differ between vendors.)gfx90a, gfx942, etc.) consistent with CUDA compute capability checks?rocm_status.h?*_rocm.cu.cc) added where CUDA-specific kernels are added?BUILD files include ROCm targets where CUDA targets are added?CollectivePipeliner still correctly overlap compute and communication?RaggedAllToAllDecomposer) accounted for?GpuHloCostAnalysis::HandleAllReduce() (or equivalent) implemented?GpuHloCostAnalysis or GpuPerformanceModel is modified:
BytesTransferred, FLOPs, and IR size estimates accurate for the new op/fusion?CommonElementwiseUtilization updated for new elementwise patterns?ProducerConsumerMergedTooLarge() guard against oversized IR?HloTestBase)?TritonFusionNumericsVerifier-compatible test added?std::unique_ptr? Are raw pointers used only for non-owning observation? Are there dangling references from absl::string_view, absl::Span, or llvm::StringRef outliving the data they point to?std::move used on the last use of a local? Are moved-from objects not accessed afterward?absl::MutexLock, smart pointers) rather than manual acquire/release?ABSL_GUARDED_BY(mu_)? Are mutexes mutable when locked inside const methods? Is absl::MutexLock used rather than manual lock/unlock?int64_t used for sizes and indices consistently (XLA convention)? Are signed/unsigned comparisons avoided?const where appropriate?{}) used to avoid narrowing?TF_RETURN_IF_ERROR or TF_ASSIGN_OR_RETURN, are those resources cleaned up?CHECK / DCHECK used appropriately (CHECK for invariants that indicate bugs, DCHECK for expensive debug-only assertions)?absl::Status / absl::StatusOr<T> used for fallible operations? Are TF_RETURN_IF_ERROR(), TF_ASSIGN_OR_RETURN(), and TF_RET_CHECK() used correctly (not tsl::Status / tsl::StatusOr, not tsl::Status::OK() — use OkStatus() instead)?check_contents.yml):
tsl::Status or tsl::StatusOr — use unqualified Status / StatusOrtsl::Status::OK() — use OkStatus()std::call_once — use absl::call_onceabsl::optional, absl::any, absl::variant, absl::make_unique, absl::nullopt) — use std:: equivalentsgtl::FlatMap, gtl::FlatSet, gtl::InlinedVector, strings::StrCat, strings::Printf, str_util::*, tensorflow::StringPiece) — use Abseil equivalentsabsl::flat_hash_map / absl::flat_hash_set over std::unordered_*. Use absl::btree_map when ordered iteration is needed. Use absl::InlinedVector<T, N> for small vectors.absl::string_view for parameters, absl::StrCat() / absl::StrFormat() / absl::StrJoin() / absl::StrAppend() / absl::Substitute() for string construction.LOG(INFO/WARNING/ERROR) and VLOG(level). Use XLA_SCOPED_LOGGING_TIMER() for performance instrumentation.CHECK() / CHECK_EQ() / CHECK_NE() etc. for fatal invariants. Use DCHECK() for debug-only assertions. Use TF_RET_CHECK() when the caller can handle the error.FindOrDie(), FindOrDefault(), ContainsKey(), InsertOrDie() from xla/map_util.h where appropriate.namespace xla { ... }. Use anonymous namespaces for file-local helpers. Use namespace aliases for verbose paths (e.g., namespace se = ::stream_executor;)..clang-format): (1) corresponding header, (2) C/C++ system headers separated by blank line, (3) third-party headers grouped by library: gtest/gmock, absl, llvm, mlir, protobuf, xla, tsl, triton..clang-format BasedOnStyle: Google). Pointer binds to type (int* p, not int *p). CI enforces clang-format v17 on all .cc/.h files.buildifier linting is enforced. ROCm targets must be included alongside CUDA targets.FusionProcessDumpProto)?Begin directly with the structured review below. Do not include reasoning, analysis steps, or thinking-out-loud before it.
## Summary
<2–3 sentence overview of what the PR does>
## Findings
- [file:line] <description and why it matters>