| name | mixed-precision |
| description | Use FP16/BF16 mixed precision to accelerate training and reduce memory. Use when optimizing GPU performance. |
| metadata | {"category":"tooling","trigger-keywords":"training,gpu,memory,speed,precision,fp16,bf16","applicable-stages":"10,12","priority":"5","version":"1.0","author":"researchclaw","references":"Micikevicius et al., Mixed Precision Training, ICLR 2018","code-template":"import torch\n\n# model, optimizer, criterion and dataloader are defined by the caller\nscaler = torch.cuda.amp.GradScaler()\nfor inputs, target in dataloader:\n    optimizer.zero_grad()\n    # Run only the forward pass and loss under autocast\n    with torch.cuda.amp.autocast():\n        output = model(inputs)\n        loss = criterion(output, target)\n    # Scale the loss, then step and update through the scaler\n    scaler.scale(loss).backward()\n    scaler.step(optimizer)\n    scaler.update()\n"} |
Mixed Precision Training Best Practice
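Use torch.cuda.amp for automatic mixed precision (AMP). Recent PyTorch releases deprecate this namespace in favor of torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda"), which fill the same role:
- Wrap the forward pass and loss computation in torch.cuda.amp.autocast(); call backward() outside the context
- Use GradScaler for loss scaling with FP16 so that small gradients do not underflow to zero; BF16 usually does not need it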
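- Prefer BF16 over FP16 on Ampere+ GPUs (RTX 3xxx, A100, RTX 4xxx) by passing dtype=torch.bfloat16 to autocast(); BF16 keeps the FP32 exponent range, so it is far less prone to overflow and underflow
- Watch for NaN gradients: GradScaler skips the optimizer step when it finds inf/NaN gradients and lowers the scale, but if NaNs persist, reduce the learning rate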
- Do NOT use AMP with custom CUDA kernels unless they have been tested under autocast with the dtype you plan to train in
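
To make the loss-scaling and BF16 points above concrete, here is a minimal standard-library sketch (no PyTorch required) of what happens to a very small gradient in each format. The helper names `to_fp16` and `to_bf16` and the value `tiny_grad` are hypothetical, chosen only for illustration. The BF16 helper emulates the format by truncating a float32, which is a simplification; real hardware typically rounds to nearest. The scale 65536 matches GradScaler's default initial scale.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (FP16) and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding.

    Assumption: truncation instead of round-to-nearest, for simplicity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8        # hypothetical gradient, below FP16's smallest subnormal (~6e-8)
loss_scale = 65536.0    # 2**16, GradScaler's default initial scale

# Without scaling, the gradient is lost entirely in FP16.
unscaled_fp16 = to_fp16(tiny_grad)

# Scaling first moves the value into FP16's representable range ...
scaled_fp16 = to_fp16(tiny_grad * loss_scale)
# ... and unscaling in full precision recovers it, up to FP16 rounding error.
recovered = scaled_fp16 / loss_scale

# BF16 shares FP32's 8-bit exponent, so the same gradient survives unscaled,
# at the cost of fewer mantissa bits.
bf16_value = to_bf16(tiny_grad)

print("FP16, unscaled:       ", unscaled_fp16)
print("FP16, scaled/unscaled:", recovered)
print("BF16, unscaled:       ", bf16_value)
```

The FP16 value without scaling underflows to zero, which is the failure GradScaler is meant to prevent. The BF16 value stays nonzero but is less precise, which is the trade-off behind preferring BF16 where the hardware supports it.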