SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
Published in arXiv preprint · ICML 2026 submission (under review), 2026
SciAgentGym benchmarks multi-step scientific tool use by LLM agents, covering 1,780 tools and long-horizon workflows. It reports systematic failures on extended trajectories and introduces SciForge, a data synthesis method for improving tool-use training. Current manuscript status: ICML 2026 submission (under review).