Publications

This page collects my current papers and preprints. I am especially interested in rigorous evaluation, medical benchmarks, and scientific-intelligence systems for research workflows.

Papers & Preprints

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

arXiv preprint · ICML 2026 submission (under review), 2026

Yujiong Shen*, Yajie Yang*, Zhiheng Xi*, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

SciAgentGym benchmarks multi-step scientific tool use in LLM agents across 1,780 tools and long-horizon workflows. Evaluations show that current agents fail systematically as trajectories lengthen, and the paper introduces SciForge, a data-synthesis method that improves tool-use training.

OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment

arXiv preprint, 2026

Ming Zhang*, Kexin Tan*, Yueyuan Huang*, Yujiong Shen, Chunchun Ma, Li Ju, Xinran Zhang, Yuhui Wang, Wenqing Jing, Jingyi Deng, Huayu Sha, Binze Hu, Jingqi Tong, Changhao Jiang, Yage Geng, Yuankai Ying, Yue Zhang, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

OpenNovelty builds an evidence-grounded agent pipeline for scholarly novelty assessment. Instead of giving opaque yes/no judgments, it retrieves related literature, compares contribution claims against full texts, and produces verifiable novelty reports with explicit evidence snippets and citations.

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

ACL 2026 submission (under review), 2025

Ming Zhang*, Yujiong Shen*, Jingyi Deng*, Yuhui Wang*, Huayu Sha, Kexin Tan, Qiyuan Peng, Yue Zhang, Junzhe Wang, Shichun Liu, Yueyuan Huang, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

LLMEval-Fair proposes a dynamic evaluation framework that samples unseen test sets from a large question bank, combining contamination-resistant curation with anti-cheating design. A longitudinal study of frontier models shows that this yields a more reliable picture of progress than static leaderboards.

LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Published in Findings of EMNLP 2025, 2025

Ming Zhang*, Yujiong Shen*, Zelin Li*, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

LLMEval-Med is a physician-validated clinical benchmark built from real-world electronic health records and expert-designed scenarios. It addresses the weaknesses of existing medical LLM evaluations by moving beyond exam-style questions toward realistic clinical reasoning and checklist-based expert assessment.