Benchmarking Offline RL Algorithms
After implementing and benchmarking four distinct offline RL algorithms (Behavioral Cloning, our custom TD3+BC, IQL, and CQL) across eight MuJoCo environments and two dataset qualities (expert and medium), with 10 random seeds each, our study yielded several practical insights.
In practice, the "best" algorithm is context-dependent. Our findings, summarized in Table III of our paper, reveal clear patterns.
TABLE III from our report, showing the final mean return (\(\pm\) std. dev.) for all four algorithms across all tasks. Best results per row are highlighted.
Lesson 1: On Expert Data, Simplicity Wins
Our finding here is simple: on expert-level datasets, where trajectories are already near-optimal, plain Behavioral Cloning (BC) was the most consistent algorithm.
- HalfCheetah-Expert: BC achieved a return of 7758.72, while our TD3+BC scored 270.00 and IQL scored -1.
- Walker2D-Expert: BC achieved 1034.11.
Standard RL algorithms (TD3+BC, CQL) often degraded performance on expert datasets. This is likely because the "imitation" constraint in these algorithms is imperfect, and the "RL" component attempts to improve upon a policy that is already optimal, effectively introducing noise.
Takeaway: If you trust your data source completely, simple imitation is often a very strong baseline.
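To make the point concrete, BC involves no reward signal at all: it is plain supervised regression from states to the demonstrator's actions. The following sketch uses a linear policy fit in closed form on a toy dataset; the function name and data layout are illustrative, not our actual implementation (which uses a neural network policy).

```python
import numpy as np

def behavioral_cloning_fit(states, actions, reg=1e-6):
    """Fit a linear policy a = s @ W by ridge-regularized least squares.

    Behavioral cloning is plain supervised learning on (state, action)
    pairs; no reward or value function is involved.
    """
    d = states.shape[1]
    # Closed-form ridge solution: W = (S^T S + reg*I)^-1 S^T A
    W = np.linalg.solve(states.T @ states + reg * np.eye(d),
                        states.T @ actions)
    return W

# Toy "expert" dataset: the demonstrator acts as a = 2 * s.
rng = np.random.default_rng(0)
S = rng.normal(size=(256, 3))
A = 2.0 * S
W = behavioral_cloning_fit(S, A)  # recovers approximately 2 * I
```

Because there is no RL component, nothing can perturb a near-optimal demonstrator, which is exactly why BC is so stable on expert data.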
Lesson 2: On Medium Data, Processing Matters
On "Medium" datasets—collected from a policy trained only partially—the story changes. These datasets contain a mix of successful and failed actions. Here, pure BC fails because it copies the mistakes alongside the successes.
Our custom TD3+BC, which filters out the bottom 50% of trajectories, showed clear improvements:
- HalfCheetah-Medium: It scored 11972.09, significantly outperforming the expert baseline (implying the "Medium" dataset actually contained high-return segments that the agent successfully stitched together).
- Walker2D-Medium: It achieved 6163.22, outperforming the default implementations significantly.
Lesson 3: IQL is the Stability King
Implicit Q-Learning (IQL) was rarely the absolute highest scorer, but it was the most stable across all tasks. It never crashed catastrophically.
- InvertedPendulum: IQL achieved a perfect score of 1000.00.
- Reacher-Medium: It was the top performer.
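IQL's stability comes largely from its expectile-regression value update, which avoids querying the Q-function at out-of-distribution actions. A minimal sketch of the asymmetric loss (standalone here, not wired into a full training loop):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss used in IQL's value-function update.

    diff = q_target - v_pred. With tau > 0.5, positive errors (the value
    network under-estimating) are weighted more heavily than negative
    ones, so V(s) tracks an upper expectile of the Q-distribution rather
    than its mean.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

loss_over = expectile_loss(np.array([1.0]), tau=0.7)   # under-estimation
loss_under = expectile_loss(np.array([-1.0]), tau=0.7)  # over-estimation
```

At tau = 0.5 this reduces to ordinary (scaled) mean squared error; pushing tau toward 1 makes the value estimate increasingly optimistic about the best in-dataset actions.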
Summary of Algorithm Performance
- Behavioral Cloning (BC):
* Strengths: Unbeatable computational efficiency. High performance on expert data.
* Weaknesses: Fails completely on noisy or mixed data. No ability to improve beyond the demonstrator.
- TD3+BC (Custom):
* Strengths: Excellent on medium/mixed data when combined with trajectory filtering. Can stitch together sub-optimal parts into a policy that outperforms any single trajectory in the dataset.
* Weaknesses: Can be unstable on expert data, where the RL component can add noise and degrade a near-perfect policy.
- Implicit Q-Learning (IQL):
* Strengths: Consistent and stable across medium-quality datasets. Its expectile regression mechanism identifies and extracts value from mixed data. Showed low variance across runs.
* Weaknesses: Offers limited benefit on expert data where there is no "advantage" to weight.
- Conservative Q-Learning (CQL):
* Strengths: Theoretically robust and provides a "safe" lower-bound value.
* Weaknesses: In practice, it was often pessimistic and difficult to tune. It was rarely competitive on either expert or medium tasks in our benchmark, suggesting its practical utility may be limited to specific safety-critical applications or datasets with poor coverage.
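CQL's pessimism comes from an explicit regularizer added to the TD loss. The sketch below shows the sampled-action form of that penalty; the function name and array shapes are illustrative, not our benchmark code.

```python
import numpy as np

def cql_penalty(q_sampled, q_data):
    """CQL's conservative regularizer (sampled-action form).

    Pushes down Q-values over broadly sampled actions (via log-sum-exp)
    while pushing up Q-values of the actions actually in the dataset.
    Adding this term to the TD loss yields a lower-bound value estimate.
    q_sampled: (batch, n_actions) Q-values at sampled actions.
    q_data:    (batch,) Q-values at the dataset's actions.
    """
    # Numerically stable log-sum-exp over the action axis.
    m = q_sampled.max(axis=1, keepdims=True)
    logsumexp = m.squeeze(1) + np.log(np.exp(q_sampled - m).sum(axis=1))
    return np.mean(logsumexp - q_data)

penalty = cql_penalty(np.zeros((2, 4)), np.zeros(2))  # = log(4)
```

The penalty's weight is a sensitive hyperparameter: set too high, it explains the chronic pessimism we observed, since all Q-values get suppressed regardless of data quality.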
Final Conclusion
Our research reinforces a central lesson of modern offline RL: there is no single best algorithm. Success requires matching the algorithm's philosophy to the dataset's characteristics.
- If you have expert data, start with BC.
- If you have medium/mixed data, a well-engineered TD3+BC (like our custom variant) or IQL is a good choice.
- Data preprocessing is not optional. Filtering low-return trajectories and normalizing states can yield substantial performance gains.
Offline RL requires data engineering and tuning as well as algorithmic design.
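As one last concrete example of that data engineering, state normalization in the offline setting is computed once over the static dataset, since there is no environment interaction to gather fresh statistics. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def normalize_states(states, eps=1e-3):
    """Z-score normalization over the full offline dataset.

    The mean/std are computed once from the fixed dataset; the same
    statistics must be reused when normalizing states at evaluation
    time, or the policy will see a shifted input distribution.
    """
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps  # eps guards constant dimensions
    return (states - mean) / std, mean, std

rng = np.random.default_rng(1)
raw = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))
normed, mu, sigma = normalize_states(raw)
```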