# TD3+BC Algorithm for Offline RL
## The Foundation: Understanding Standard TD3+BC
TD3+BC, introduced by Fujimoto and Gu (2021), addresses the distributional shift problem in offline RL. It starts with the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm—an online actor-critic method—and adds a behavioral cloning (BC) term.
The actor (policy) is trained not only to maximize the Q-value (the "RL objective") but also to stay close to the actions in the dataset (the "imitation objective"). The resulting actor loss is a weighted combination:

\[ \mathcal{L}(\pi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[Q\bigl(s, \pi(s)\bigr)\right] + \alpha\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\bigl(\pi(s) - a\bigr)^{2}\right] \]
The hyperparameter \(\alpha\) is critical:
- If \(\alpha\) is too high, the algorithm reverts to pure BC and cannot improve upon the dataset.
- If \(\alpha\) is too low, the policy is free to exploit OOD actions, leading to the very distributional shift and value overestimation it aims to avoid.
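As a concrete illustration, the weighted actor objective can be sketched in a few lines of NumPy. The function name and toy inputs are hypothetical, and a real implementation would compute this loss inside an autograd framework so gradients flow back into the policy network:

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha):
    """Sketch of a TD3+BC-style actor loss (hypothetical helper).

    Follows the convention used in the text: a higher `alpha` puts more
    weight on the behavioral-cloning (BC) term, pulling the policy
    toward the dataset actions.
    """
    rl_term = -np.mean(q_values)  # maximize Q  ->  minimize -Q
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return rl_term + alpha * bc_term
```

With `alpha = 0` this reduces to the pure TD3 actor objective; as `alpha` grows, the squared-error term dominates and the policy collapses toward behavioral cloning.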
## Our Custom Variant: Four Key Enhancements
Our research group developed a customized implementation of TD3+BC designed to improve stability and performance on medium-quality datasets. We introduced four specific modifications:
- Dynamic Hyperparameter Selection: Instead of a fixed \(\alpha\) for all environments, we tune the BC weight based on the dataset's characteristics.
- BC-Weight Annealing: We start with a higher \(\alpha\) (stronger imitation) early in training to stabilize the policy, and gradually decay it. This allows the agent to adhere to the data initially and then slowly prioritize maximizing returns (RL) as the Q-function becomes more accurate.
- State Normalization: We apply standard scaler normalization (\(z = \frac{x - \mu}{\sigma}\)) to all input states. This ensures that the optimizer treats all state features equally, preventing features with large magnitudes from dominating the learning process.
- Episode Filtering for Medium Datasets: Medium datasets are noisy by definition, containing both good and bad trajectories, and training on the failed trajectories is harmful. Our `filter_topk_by_return()` function calculates the total return for every episode and discards the bottom 50% for all medium-quality datasets. This significantly improves the signal-to-noise ratio of the training data.
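The BC-weight annealing described above can be sketched with a simple linear schedule. The start and end values here are illustrative, not our actual settings:

```python
def annealed_alpha(step, total_steps, alpha_start=2.5, alpha_end=0.5):
    """Linearly decay the BC weight alpha over training (illustrative values).

    Early training: alpha near alpha_start -> strong imitation, stable policy.
    Late training:  alpha near alpha_end   -> RL objective dominates.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)
```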
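State normalization is straightforward once the dataset statistics are known. A minimal sketch (the function name is ours; `eps` guards against constant features with zero variance):

```python
import numpy as np

def normalize_states(states, eps=1e-8):
    """Z-score normalization z = (x - mu) / sigma over the whole dataset.

    The statistics are computed once from the offline dataset and then
    reused for every state fed to the actor and critic.
    """
    mu = states.mean(axis=0)
    sigma = states.std(axis=0)
    return (states - mu) / (sigma + eps), mu, sigma
```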
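The episode-filtering step can be sketched as follows. The data layout, a list of per-episode reward sequences, is a simplifying assumption; our actual `filter_topk_by_return()` operates on full transition tuples:

```python
def filter_topk_by_return(episode_rewards, keep_fraction=0.5):
    """Keep the top `keep_fraction` of episodes, ranked by total return.

    `episode_rewards`: list of per-episode reward sequences (simplified layout).
    Returns the surviving episodes in their original dataset order.
    """
    returns = [sum(r) for r in episode_rewards]
    ranked = sorted(range(len(episode_rewards)),
                    key=returns.__getitem__, reverse=True)
    k = max(1, int(len(episode_rewards) * keep_fraction))
    keep = sorted(ranked[:k])  # restore original ordering
    return [episode_rewards[i] for i in keep]
```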
## Experimental Results: Custom vs. Default TD3+BC
These enhancements were not just theoretical. We compared our customized implementation against a default d3rlpy implementation (`default.py`) that uses fixed hyperparameters. The results, presented in Table II of our paper, show marked differences.
On medium-quality datasets, where robustness is critical, our custom variant achieved substantial performance gains:
| Environment | Default TD3+BC | Custom TD3+BC | Improvement |
|---|---|---|---|
| Swimmer-Medium | \(3.63 \pm 16.82\) | \(\mathbf{235.21} \pm 5.60\) | +6381% |
| Walker2D-Medium | \(273.30 \pm 6.36\) | \(\mathbf{6163.22} \pm 38.00\) | +2155% |
| Pusher-Medium | \(-46.51 \pm 2.05\) | \(\mathbf{-34.80} \pm 1.35\) | +25% |
| Reacher-Medium | \(-11.24 \pm 2.66\) | \(\mathbf{-7.97} \pm 0.90\) | +29% |
These results validate that simple algorithmic adjustments, specifically filtering poor data and normalizing inputs, can dramatically improve the performance of offline RL baselines.