| name | action-quantization-behavior-cloning |
| title | Understanding Behavior Cloning with Action Quantization: Regret Bounds and Quantizer Design |
| version | 0.0.3 |
| engine | skillxiv-v0.0.3-claude-opus-4.6 |
| license | MIT |
| url | https://arxiv.org/abs/2603.20538 |
| keywords | ["Behavior Cloning","Action Quantization","Imitation Learning","Theoretical Bounds","Quantizer Design"] |
| description | Establish regret bounds for behavior cloning with discretized actions combining statistical error and quantization error terms. Prove smoothness requirements for safe quantizer design, show that learning-based quantizers fail these requirements, and propose model-based augmentation to reduce error dependence from H² to H. |
Understanding Behavior Cloning with Action Quantization
Research Question
Why do some action discretization schemes fail catastrophically in behavior cloning despite low in-distribution quantization error? What are the theoretical limits on BC with quantized actions?
Analytical Instrument: Regret Decomposition
Decompose regret into two independent error sources:
Statistical Component: Uncertainty from finite samples: H√(log|Π|/n)
- Standard BC sample complexity
- Depends on policy class size |Π| and horizon H
Quantization Component: Discretization error propagation: H²·εq
- Key insight: Quadratic scaling in horizon H
- Comes from compounding state divergence across steps
- Even small per-step quantization error εq explodes over long horizons
Regret Bound: R(n) = H√(log|Π|/n) + H²·εq
The quadratic term reveals why quantization is more damaging than sample complexity.
Controls & Theoretical Framework
Probabilistic Incremental Input-to-State Stability (P-IISS): Formal condition for dynamics robustness under action perturbations. Captures when small action errors remain localized (don't compound exponentially).
Relaxed Total Variation Continuity (RTVC): Characterizes when quantized policies maintain smooth decision boundaries. Proves that general learning-based quantizers violate this requirement.
Binning-Based Quantizers Satisfy RTVC: Uniform discretization (binning) naturally preserves smoothness; learned quantizers optimize in-distribution error at expense of smoothness guarantees.
Key Findings
Finding 1: Quantization Dominates at Long Horizons
For H=100, εq=0.01: quantization term (100²·0.01=100) dominates statistical term (100·√(0.001)≈3). Improving sample complexity via more data is futile if quantizer is poor.
Finding 2: Smoothness Requirement is Non-Negotiable
Learned quantizers that minimize training error but violate RTVC suffer exponential error amplification. No amount of better data collection fixes this.
Finding 3: Model-Based Augmentation Reduces Dependency
By learning auxiliary transition model to keep rollouts in-distribution, quantization error reduces from H²·εq to H·εq—dramatic improvement from quadratic to linear scaling.
Implications for Practitioners
-
Use Binning, Not Learned Quantizers: Even if learned quantizer has lower training error, binning provides safety guarantees critical for deployment.
-
Horizon Matters: Short-horizon tasks (H<20) tolerate poor quantizers; long-horizon tasks (H>50) require careful quantizer design.
-
Dynamics Stability is Critical: Check system stability (P-IISS property) before deployment. Marginal/unstable dynamics + quantization = failure.
-
Model Augmentation ROI: If rollouts frequently go out-of-distribution, training auxiliary transition model is worth 2×+ error reduction.
-
Sample Size Strategies:
- For short horizons: increase samples (improves statistical term)
- For long horizons: improve quantizer design (fixes exponential scaling)
Information-Theoretic Limits
Lower bounds prove that regret decomposition (statistical + quantization) is unavoidable—no algorithm achieves better rates. However, practitioners can optimize:
- Quantizer smoothness (reduce εq via careful discretization)
- Horizon length (design tasks to require fewer steps)
- Dynamics stability (prefer deterministic or low-variance environments)