SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
Published as an arXiv preprint · ICML 2026 submission (under review), 2026
SciAgentGym is an in-progress benchmark project for evaluating scientific AI agents in realistic research workflows.
Current status: Submitted to ICML 2026 (under review).
Motivation
General-purpose agent benchmarks often fall short for scientific workflows, where reproducibility, evidence quality, and multi-step reasoning are all critical.
SciAgentGym evaluates agents in settings that more closely resemble practical research tasks.
Planned evaluation dimensions
- Task completion quality: whether final outputs satisfy scientific task requirements
- Process reliability: whether intermediate steps and tool calls are coherent and stable
- Reproducibility: whether key results can be regenerated under fixed protocols
- Evidence grounding: whether conclusions are supported by verifiable artifacts
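As a rough illustration of the reproducibility dimension, the sketch below reruns a task under a fixed seed and compares a hash of each result. It is only one possible reading of "regenerated under fixed protocols": the helper names (`result_fingerprint`, `is_reproducible`) and the toy task are hypothetical and are not taken from the SciAgentGym codebase.

```python
import hashlib
import json
import random
from typing import Any, Callable


def result_fingerprint(result: Any) -> str:
    """Hash a JSON-serializable result; sorted keys make it order-independent."""
    canonical = json.dumps(result, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(task: Callable[[int], Any], seed: int, n_runs: int = 3) -> bool:
    """Return True if every run with the same seed yields an identical result."""
    fingerprints = {result_fingerprint(task(seed)) for _ in range(n_runs)}
    return len(fingerprints) == 1


def toy_task(seed: int) -> dict:
    """Hypothetical stand-in for an agent run: a seeded random 'measurement'."""
    rng = random.Random(seed)
    return {"estimate": round(rng.gauss(0.0, 1.0), 6), "n_samples": 100}


if __name__ == "__main__":
    print(is_reproducible(toy_task, seed=0))
```

Exact hash equality is a deliberately strict criterion; a real protocol might instead compare numeric results within a tolerance.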
Benchmark design direction
- multi-step tasks rather than single-turn QA alone
- explicit scoring rubrics for both process and final output
- support for ablation-style evaluation of tool-use and planning modules
- structured logs for auditability and error analysis
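To make the rubric and logging ideas above concrete, here is a minimal sketch of a per-step log record serialized as JSON Lines, together with a weighted score that combines process and final-output quality. All field names, the `StepRecord` type, and the weighting scheme are assumptions made for illustration; the actual SciAgentGym log schema and rubric may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class StepRecord:
    """One agent step; all field names are illustrative assumptions."""
    step: int
    tool: str
    arguments: dict
    ok: bool  # whether the tool call succeeded
    observation: str


def to_jsonl(records: List[StepRecord]) -> str:
    """Serialize a trajectory as JSON Lines, one step per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def rubric_score(records: List[StepRecord], final_correct: bool,
                 process_weight: float = 0.5) -> float:
    """Weighted sum of a process score (fraction of successful steps)
    and a final-output score (1.0 if correct, else 0.0)."""
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must be in [0, 1]")
    process = sum(r.ok for r in records) / len(records) if records else 0.0
    final = 1.0 if final_correct else 0.0
    return process_weight * process + (1.0 - process_weight) * final


if __name__ == "__main__":
    trajectory = [
        StepRecord(0, "search", {"query": "x"}, True, "hit"),
        StepRecord(1, "calc", {}, False, "error"),
    ]
    print(to_jsonl(trajectory))
    print(rubric_score(trajectory, final_correct=True))  # 0.5 * 0.5 + 0.5 * 1.0 = 0.75
```

Keeping the process and final-output terms separate also supports ablation-style comparisons, since either term can be inspected on its own by setting `process_weight` to 0 or 1.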
Notes
For now, this publication page records only the current submission status.
I will update it with links to the public preprint and code, along with complete experimental details, after release.
Citation
@article{abs-2602-12984,
author = {Yujiong Shen and
Yajie Yang and
Zhiheng Xi and
Binze Hu and
Huayu Sha and
Jiazheng Zhang and
Qiyuan Peng and
Junlin Shang and
Jixuan Huang and
Yutao Fan and
Jingqi Tong and
Shihan Dou and
Ming Zhang and
Lei Bai and
Zhenfei Yin and
Tao Gui and
Xingjun Ma and
Qi Zhang and
Xuanjing Huang and
Yu{-}Gang Jiang},
title = {SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in {LLM} Agents},
journal = {CoRR},
volume = {abs/2602.12984},
year = {2026},
url = {https://doi.org/10.48550/arXiv.2602.12984},
doi = {10.48550/ARXIV.2602.12984},
eprinttype = {arXiv},
eprint = {2602.12984},
biburl = {https://dblp.org/rec/journals/corr/abs-2602-12984.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}