Computation and Language
☆ GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Visual generation models have made remarkable progress in creating realistic
images from text prompts, yet struggle with complex prompts that specify
multiple objects with precise spatial relationships and attributes. Effective
handling of such prompts requires explicit reasoning about the semantic content
and spatial layout. We present GoT-R1, a framework that applies reinforcement
learning to enhance semantic-spatial reasoning in visual generation. Building
upon the Generation Chain-of-Thought approach, GoT-R1 enables models to
autonomously discover effective reasoning strategies beyond predefined
templates through carefully designed reinforcement learning. To achieve this,
we propose a dual-stage multi-dimensional reward framework that leverages MLLMs
to evaluate both the reasoning process and final output, enabling effective
supervision across the entire generation pipeline. The reward system assesses
semantic alignment, spatial accuracy, and visual quality in a unified approach.
Experimental results demonstrate significant improvements on the T2I-CompBench
benchmark, particularly in compositional tasks involving precise spatial
relationships and attribute binding. GoT-R1 advances the state-of-the-art in
image generation by successfully transferring sophisticated reasoning
capabilities to the visual generation domain. To facilitate future research, we
make our code and pretrained models publicly available at
https://github.com/gogoduan/GoT-R1.
comment: GitHub page: https://github.com/gogoduan/GoT-R1
☆ Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Recent advancements underscore the significant role of Reinforcement Learning
(RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large
language models (LLMs). Two prominent RL algorithms, Direct Preference
Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central
to these developments, showcasing different pros and cons. Autoregressive image
generation, also interpretable as a sequential CoT reasoning process, presents
unique challenges distinct from LLM-based CoT reasoning. These encompass
ensuring text-image consistency, improving image aesthetic quality, and
designing sophisticated reward models, rather than relying on simpler
rule-based rewards. While recent efforts have extended RL to this domain, these
explorations typically lack an in-depth analysis of the domain-specific
challenges and the characteristics of different RL strategies. To bridge this
gap, we provide the first comprehensive investigation of the GRPO and DPO
algorithms in autoregressive image generation, evaluating their in-domain
performance and out-of-domain generalization, while scrutinizing the impact of
different reward models on their respective capabilities. Our findings reveal
that GRPO and DPO exhibit distinct advantages, and crucially, that reward
models possessing stronger intrinsic generalization capabilities potentially
enhance the generalization potential of the applied RL algorithms. Furthermore,
we systematically explore three prevalent scaling strategies to enhance both
their in-domain and out-of-domain proficiency, deriving unique insights into
efficiently scaling performance for each paradigm. We hope our study paves a
new path for inspiring future work on developing more effective RL algorithms
to achieve robust CoT reasoning in the realm of autoregressive image
generation. Code is released at
https://github.com/ZiyuGuo99/Image-Generation-CoT
comment: Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
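As a rough illustration of how the two algorithms differ, the sketch below computes the standard DPO loss for a single preference pair and the group-normalized advantages used in GRPO. It follows the commonly published formulations, not this paper's released code; the function names, the default `beta`, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy and under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize the rewards of the samples
    drawn for the same prompt (assumption: population standard deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

DPO needs paired preferences prepared offline, whereas GRPO scores a group of freshly sampled outputs with a reward model, which is one reason the choice of reward model matters for both.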
☆ Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
Multi-modal large language models (MLLMs) have rapidly advanced in visual
tasks, yet their spatial understanding remains limited to single images,
leaving them ill-suited for robotics and other real-world applications that
require multi-frame reasoning. In this paper, we propose a framework to equip
MLLMs with robust multi-frame spatial understanding by integrating depth
perception, visual correspondence, and dynamic perception. Central to our
approach is the MultiSPA dataset, a novel, large-scale collection of more than
27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we
introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks
under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves
significant gains over baselines and proprietary systems, demonstrating
scalable, generalizable multi-frame reasoning. We further observe multi-task
benefits and early indications of emergent capabilities in challenging
scenarios, and showcase how our model can serve as a multi-frame reward
annotator for robotics.
comment: 24 pages. An MLLM, dataset, and benchmark for multi-frame spatial
understanding. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM
☆ R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning
Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Large Language Models (LLMs) are powerful but prone to hallucinations due to
static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting
external information, but current methods are often costly, generalize poorly,
or ignore the internal knowledge of the model. In this paper, we introduce
R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage
both internal and external knowledge sources. R1-Searcher++ employs a two-stage
training strategy: an initial SFT Cold-start phase for preliminary format
learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses
outcome-supervision to encourage exploration, incorporates a reward mechanism
for internal knowledge utilization, and integrates a memorization mechanism to
continuously assimilate retrieved information, thereby enriching the model's
internal knowledge. By leveraging internal knowledge and an external search
engine, the model continuously improves its capabilities, enabling efficient
retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++
outperforms previous RAG and reasoning methods and achieves efficient
retrieval. The code is available at
https://github.com/RUCAIBox/R1-Searcher-plus.
☆ Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?
Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Large Language Models (LLMs) have been shown to achieve breakthrough
performance on complex logical reasoning tasks. Nevertheless, most existing
research focuses on employing formal language to guide LLMs to derive reliable
reasoning paths, while systematic evaluations of these capabilities are still
limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs
across various logical reasoning problems utilizing formal languages. From the
perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and
format of trajectories, our key findings are: 1) Thinking models significantly
outperform Instruct models, especially when formal language is employed; 2) All
LLMs exhibit limitations in inductive reasoning capability, irrespective of
whether they use a formal language; 3) Data with PoT format achieves the best
generalization performance across other languages. Additionally, we curate
formal-related training data to further enhance small language models,
and the experimental results indicate that a simple rejected fine-tuning method
can better enable LLMs to generalize across formal languages and achieve the
best overall performance. Our code and reports are available at
https://github.com/jiangjin1999/FormalEval.
☆ X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by
enabling cooperation among multiple specialized agents. However, most existing
MAS frameworks rely on a single LLM to drive all agents, constraining the
system's intelligence to the limit of that model. This paper explores the
paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by
diverse LLMs, elevating the system's potential to the collective intelligence
of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to
evaluate the performance of various LLMs across different domains and
MAS-related functions. As an extensive empirical study, we assess 27 LLMs
across 5 domains (encompassing 21 test sets) and 5 functions, conducting over
1.7 million evaluations to identify optimal model selections for each
domain-function combination. Building on these findings, we demonstrate that
transitioning from homogeneous to heterogeneous LLM-driven MAS can
significantly enhance system performance without requiring structural redesign.
Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration
yields up to 8.4% performance improvement on the MATH dataset. In a mixed
chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable
47% performance boost on the AIME dataset. Our results underscore the
transformative potential of heterogeneous LLMs in MAS, highlighting a promising
avenue for advancing scalable, collaborative AI systems.
comment: 19 pages, 5 figures
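The model-selection step described above can be pictured as an argmax over a score table indexed by domain, function, and model. The sketch below is a minimal illustration under that assumption; the domain, function, and model names and all scores are hypothetical and do not come from X-MAS-Bench.

```python
def best_model_per_cell(scores):
    """scores maps (domain, function, model) -> benchmark score.
    Returns {(domain, function): name of the best-scoring model}."""
    best = {}
    for (domain, function, model), score in scores.items():
        key = (domain, function)
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return {key: model for key, (model, _) in best.items()}

# Hypothetical scores, for illustration only
scores = {
    ("math", "qa", "model-a"): 0.61,
    ("math", "qa", "model-b"): 0.74,
    ("math", "revise", "model-a"): 0.58,
    ("math", "revise", "model-b"): 0.52,
}
assignment = best_model_per_cell(scores)
```

A heterogeneous system would then route each agent role to the model chosen for its domain-function cell, leaving the agent workflow itself unchanged.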
☆ DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization
Recent advances in Emotional Support Conversation (ESC) have improved
emotional support generation by fine-tuning Large Language Models (LLMs) via
Supervised Fine-Tuning (SFT). However, common psychological errors still
persist. While Direct Preference Optimization (DPO) shows promise in reducing
such errors through pairwise preference learning, its effectiveness in ESC
tasks is limited by two key challenges: (1) Entangled data structure: Existing
ESC data inherently entangles psychological strategies and response content,
making it difficult to construct high-quality preference pairs; and (2)
Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data
leads to ambiguous training objectives. To address these issues, we introduce
Inferential Preference Mining (IPM) to construct high-quality preference data,
forming the IPM-PrefDial dataset. Building upon this data, we propose a
Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion
Regulation, which decomposes the ESC task into two sequential subtasks:
strategy planning and empathic response generation. Each is trained via SFT
and subsequently enhanced by DPO to align with the psychological preference.
Extensive experiments demonstrate that our Decoupled ESC framework outperforms
joint optimization baselines, reducing preference bias and improving response
quality.
☆ $\text{R}^2\text{ec}$: Towards Large Recommender Models with Reasoning
Large recommender models have extended LLMs as powerful recommenders via
encoding or item generation, and recent breakthroughs in LLM reasoning
synchronously motivate the exploration of reasoning in recommendation. Current
studies usually position LLMs as external reasoning modules to yield auxiliary
thought for augmenting conventional recommendation pipelines. However, such
decoupled designs suffer from significant resource cost and suboptimal joint
optimization. To address these issues, we propose $\text{R}^2\text{ec}$, a unified large
recommender model with intrinsic reasoning capabilities. Initially, we
reconceptualize the model architecture to facilitate interleaved reasoning and
recommendation in the autoregressive process. Subsequently, we propose RecPO, a
corresponding reinforcement learning framework that optimizes both the
reasoning and recommendation capabilities of $\text{R}^2\text{ec}$ simultaneously in a single policy
update; RecPO introduces a fused reward scheme that solely leverages
recommendation labels to simulate the reasoning capability, eliminating
dependency on specialized reasoning annotations. Experiments on three datasets
with various baselines verify the effectiveness of $\text{R}^2\text{ec}$, showing relative
improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at
https://github.com/YRYangang/RRec.
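For readers unfamiliar with the two reported metrics, the sketch below shows how Hit@K and NDCG@K are usually computed when each test case has a single ground-truth item. This is the common textbook definition; whether the paper's evaluation matches it exactly is an assumption, and the function names are illustrative.

```python
import math

def hit_at_k(ranked_items, target, k):
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K with one relevant item: the ideal DCG is 1,
    so the score reduces to 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

Both metrics are averaged over all test users, and a relative improvement is the gain divided by the baseline's score.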
☆ MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems
Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen
LLM-based multi-agent systems (MAS) have demonstrated significant potential
in enhancing single LLMs to address complex and diverse tasks in practical
applications. Despite considerable advancements, the field lacks a unified
codebase that consolidates existing methods, resulting in redundant
re-implementation efforts, unfair comparisons, and high entry barriers for
researchers. To address these challenges, we introduce MASLab, a unified,
comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab
integrates over 20 established methods across multiple domains, each rigorously
validated by comparing step-by-step outputs with its official implementation.
(2) MASLab provides a unified environment with various benchmarks for fair
comparisons among methods, ensuring consistent inputs and standardized
evaluation protocols. (3) MASLab implements methods within a shared streamlined
structure, lowering the barriers for understanding and extension. Building on
MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models,
offering researchers a clear and comprehensive view of the current landscape of
MAS methods. MASLab will continue to evolve, tracking the latest developments
in the field, and invite contributions from the broader open-source community.
comment: 18 pages, 11 figures
☆ T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata
Large Language Models (LLMs) have demonstrated impressive capabilities as
intelligent agents capable of solving complex problems. However, effective
planning in scenarios involving dependencies between API or tool
calls, particularly in multi-turn conversations, remains a significant challenge.
To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn
conversational dataset specifically designed to capture and manage inter-tool
dependencies across diverse domains. T1 enables rigorous evaluation of agents'
ability to coordinate tool use across nine distinct domains (4 single-domain
and 5 multi-domain) with the help of an integrated caching mechanism for both
short- and long-term memory, while supporting dynamic replanning, such as
deciding whether to recompute or reuse cached results. Beyond facilitating
research on tool use and planning, T1 also serves as a benchmark for evaluating
the performance of open-source language models. We present results powered by
T1-Agent, highlighting their ability to plan and reason in complex,
tool-dependent scenarios.
comment: Preprint
☆ UFT: Unifying Supervised and Reinforcement Fine-Tuning
Post-training has demonstrated its importance in enhancing the reasoning
capabilities of large language models (LLMs). The primary post-training methods
can be categorized into supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT). SFT is efficient and well-suited for small language models,
but it may lead to overfitting and limit the reasoning abilities of larger
models. In contrast, RFT generally yields better generalization but depends
heavily on the strength of the base model. To address the limitations of SFT
and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm
that unifies SFT and RFT into a single, integrated process. UFT enables the
model to effectively explore solutions while incorporating informative
supervision signals, bridging the gap between memorizing and thinking
underlying existing methods. Notably, UFT outperforms both SFT and RFT in
general, regardless of model sizes. Furthermore, we theoretically prove that
UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for
the first time that unified training can exponentially accelerate convergence
on long-horizon reasoning tasks.
☆ LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding ACL 2025
Large Language Models (LLMs) are primarily designed for batch processing.
Existing methods for adapting LLMs to streaming rely either on expensive
re-encoding or specialized architectures with limited scalability. This work
identifies three key mismatches in adapting batch-oriented LLMs to streaming:
(1) input-attention, (2) output-attention, and (3) position-ID mismatches.
While it is commonly assumed that the latter two mismatches require frequent
re-encoding, our analysis reveals that only the input-attention mismatch
significantly impacts performance, indicating re-encoding outputs is largely
unnecessary. To better understand this discrepancy with the common assumption,
we provide the first comprehensive analysis of the impact of position encoding
on LLMs in streaming, showing that preserving relative positions within source
and target contexts is more critical than maintaining absolute order. Motivated
by the above analysis, we introduce a group position encoding paradigm built on
batch architectures to enhance consistency between streaming and batch modes.
Extensive experiments on cross-lingual and cross-modal tasks demonstrate that
our method outperforms existing approaches. Our method requires no
architectural modifications and exhibits strong generalization in both streaming
and batch modes. The code is available at
https://github.com/EIT-NLP/StreamingLLM.
comment: ACL 2025 Findings
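One way to picture the idea of preserving relative positions within the source and target contexts is to give each group its own running position counter. The sketch below is only an interpretation of the abstract, not the authors' implementation; the function name, the `"src"`/`"tgt"` tags, and the `target_offset` parameter are all assumptions.

```python
def group_position_ids(stream, target_offset=0):
    """stream: list of 'src' / 'tgt' tags in arrival order.
    Each group keeps its own counter, so positions stay contiguous within
    a group no matter how source and target tokens interleave."""
    counters = {"src": 0, "tgt": target_offset}
    ids = []
    for tag in stream:
        ids.append(counters[tag])
        counters[tag] += 1
    return ids

# Interleaved streaming order: two source tokens, one target token, and so on
ids = group_position_ids(["src", "src", "tgt", "src", "tgt"])
# ids == [0, 1, 0, 2, 1]
```

Under this scheme a newly arriving source token never shifts the positions already assigned to generated target tokens, which is consistent with the claim that outputs need not be re-encoded.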
☆ SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Large Language Models (LLMs) have shown strong capability in diverse software
engineering tasks, e.g. code completion, bug fixing, and document generation.
However, feature-driven development (FDD), a highly prevalent real-world task
that involves developing new functionalities for large, existing codebases,
remains underexplored. We therefore introduce SWE-Dev, the first large-scale
dataset (with 14,000 training and 500 test samples) designed to evaluate and
train autonomous coding systems on real-world feature development tasks. To
ensure verifiable and diverse training, SWE-Dev uniquely provides all instances
with a runnable environment and its developer-authored executable unit tests.
This collection not only provides high-quality data for Supervised Fine-Tuning
(SFT), but also enables Reinforcement Learning (RL) by delivering accurate
reward signals from executable unit tests. Our extensive evaluations on
SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent
Systems (MAS), reveal that FDD is a profoundly challenging frontier for current
AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test
split). Crucially, we demonstrate that SWE-Dev serves as an effective platform
for model improvement: fine-tuning on the training set enables a 7B model to perform
comparably to GPT-4o on the hard split, underscoring the value of its
high-quality training data. Code is available at
https://github.com/justLittleWhite/SWE-Dev.
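The Pass@3 figure above belongs to the pass@k family of metrics. The sketch below shows the widely used unbiased pass@k estimator, in which `n` solutions are sampled per task and `c` of them pass every unit test. Whether SWE-Dev uses this exact estimator or simply samples three attempts is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples
    passes, given n samples of which c passed all unit tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n == k == 3` the estimator reduces to checking whether any of the three attempts passed; the benchmark score is the mean over tasks.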
☆ VeriFastScore: Speeding up long-form factuality evaluation
Metrics like FactScore and VeriScore that evaluate long-form factuality
operate by decomposing an input response into atomic claims and then
individually verifying each claim. While effective and interpretable, these
methods incur numerous LLM calls and can take upwards of 100 seconds to
evaluate a single response, limiting their practicality in large-scale
evaluation and training scenarios. To address this, we propose VeriFastScore,
which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously
extracting and verifying all verifiable claims within a given text based on
evidence from Google Search. We show that this task cannot be solved via
few-shot prompting with closed LLMs due to its complexity: the model receives
~4K tokens of evidence on average and needs to concurrently decompose claims,
judge their verifiability, and verify them against noisy evidence. However, our
fine-tuned VeriFastScore model demonstrates strong correlation with the
original VeriScore pipeline at both the example level (r=0.80) and system level
(r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence
retrieval) over VeriScore. To facilitate future factuality research, we
publicly release our VeriFastScore model and synthetic datasets.
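The two correlation levels reported above differ in what is being correlated: example-level correlation compares scores response by response, while system-level correlation first averages the scores of each evaluated system. The sketch below illustrates this with a plain Pearson coefficient; the helper names are hypothetical and the aggregation by simple mean is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def system_level_scores(scores_by_system):
    """Average per-response scores into one score per system (assumed: mean)."""
    return {name: sum(v) / len(v) for name, v in scores_by_system.items()}
```

Example-level correlation would call `pearson_r` on the per-response scores of the two metrics, and system-level correlation would call it on the averaged scores.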
☆ From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Recent advances in Automatic Speech Recognition (ASR) have been largely
fueled by massive speech corpora. However, extending coverage to diverse
languages with limited resources remains a formidable challenge. This paper
introduces Speech Back-Translation, a scalable pipeline that improves
multilingual ASR models by converting large-scale text corpora into synthetic
speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just
tens of hours of real transcribed speech can effectively train TTS models to
generate synthetic speech at hundreds of times the original volume while
maintaining high quality. To evaluate synthetic speech quality, we develop an
intelligibility-based assessment framework and establish clear thresholds for
when synthetic data benefits ASR training. Using Speech Back-Translation, we
generate more than 500,000 hours of synthetic speech in ten languages and
continue pre-training Whisper-large-v3, achieving average transcription error
reductions of over 30%. These results highlight the scalability and
effectiveness of Speech Back-Translation for enhancing multilingual ASR
systems.
☆ CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
We introduce CASS, the first large-scale dataset and model suite for
cross-architecture GPU code transpilation, targeting both source-level
(CUDA ↔ HIP) and assembly-level (Nvidia
SASS ↔ AMD RDNA3) translation. The dataset comprises 70k
verified code pairs across host and device, addressing a critical gap in
low-level GPU code portability. Leveraging this resource, we train the
CASS family of domain-specific language models, achieving 95% source
translation accuracy and 37.5% assembly translation accuracy, substantially
outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our
generated code matches native performance in over 85% of test cases,
preserving runtime and memory behavior. To support rigorous evaluation, we
introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with
ground-truth execution. All data, models, and evaluation tools are released as
open source to foster progress in GPU compiler tooling, binary compatibility,
and LLM-guided hardware translation. Dataset and benchmark are on
HuggingFace (https://huggingface.co/datasets/MBZUAI/cass),
with code on
GitHub (https://github.com/GustavoStahl/CASS).
comment: 20 pages, 11 figures, 5 tables
☆ Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs sourced from various data sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives with true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find judgment by GPT-4o shows much higher agreement with humans than
GPT-4o-mini.
comment: Code is available at https://github.com/castorini/rlhn & datasets are
available at https://huggingface.co/rlhn
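The cascading idea can be sketched as a two-stage filter in which a cheaper judge screens every hard negative and a stronger judge re-checks only the flagged passages. This is an assumed shape for the cascade, not the released pipeline; `cheap_judge` and `strong_judge` are hypothetical callables standing in for LLM prompts that return True when a passage is relevant to the query.

```python
def relabel_hard_negatives(query, negatives, cheap_judge, strong_judge):
    """Split hard negatives into those kept as negatives and those
    relabeled as positives. Returns (kept, relabeled)."""
    kept, relabeled = [], []
    for passage in negatives:
        if not cheap_judge(query, passage):
            # Stage 1: the cheap judge agrees the passage is irrelevant
            kept.append(passage)
        elif strong_judge(query, passage):
            # Stage 2: the strong judge confirms a false negative
            relabeled.append(passage)
        else:
            kept.append(passage)
    return kept, relabeled
```

Because the stronger judge only sees passages the cheaper one flags, the number of expensive calls scales with the number of suspected false negatives rather than with the full training set.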
☆ BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation
Text segmentation based on the semantic meaning of sentences is a fundamental
task with broad utility in many downstream applications. In this paper, we
propose a graphical model-based unsupervised learning approach, named BP-Seg,
for efficient text segmentation. Our method not only considers local coherence,
capturing the intuition that adjacent sentences are often more related, but
also effectively groups sentences that are distant in the text yet semantically
similar. This is achieved through belief propagation on carefully
constructed graphical models. Experimental results on both an illustrative
example and a dataset with long-form documents demonstrate that our method
performs favorably compared to competing approaches.
☆ MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
Existing medical VQA benchmarks mostly focus on single-image analysis, yet
clinicians almost always compare a series of images before reaching a
diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the
first benchmark that explicitly evaluates multi-image reasoning in medical VQA.
To build MedFrameQA both at scale and with high quality, we develop 1) an
automated pipeline that extracts temporally coherent frames from medical videos
and constructs VQA items whose content evolves logically across images, and 2)
a multiple-stage filtering strategy, including model-based and manual review,
to preserve data clarity, difficulty, and medical relevance. The resulting
dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in
3,420 videos), covering nine human body systems and 43 organs; every question
is accompanied by two to five images. We comprehensively benchmark ten advanced
Multimodal LLMs -- both proprietary and open source, with and without explicit
reasoning modules -- on MedFrameQA. The evaluation reveals that
all models perform poorly, with most accuracies below 50%, and accuracy
fluctuates as the number of images per question increases. Error analysis
further shows that models frequently ignore salient findings, mis-aggregate
evidence across images, and propagate early mistakes through their reasoning
chains; results also vary substantially across body systems, organs, and
modalities. We hope this work can catalyze research on clinically grounded,
multi-image reasoning and accelerate progress toward more capable diagnostic AI
systems.
comment: 9 pages, 4 figures. Benchmark data:
https://huggingface.co/datasets/SuhaoYu1020/MedFrameQA
☆ On Multilingual Encoder Language Model Compression for Low-Resource Languages
In this paper, we combine two-step knowledge distillation, structured
pruning, truncation, and vocabulary trimming to extremely compress
multilingual encoder-only language models for low-resource languages. Our novel
approach systematically combines existing techniques and takes them to the
extreme, reducing layer depth, feed-forward hidden size, and intermediate layer
embedding size to create significantly smaller monolingual models while
retaining essential language-specific knowledge. We achieve compression rates
of up to 92% with only a marginal performance drop of 2-10% in four downstream
tasks, including sentiment analysis, topic classification, named entity
recognition, and part-of-speech tagging, across three low-resource languages.
Notably, the performance degradation correlates with the amount of
language-specific data in the teacher model, with larger datasets resulting in
smaller performance losses. Additionally, we conduct extensive ablation studies
to identify best practices for multilingual model compression using these
techniques.
comment: Pre-print
☆ AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
Large Language Models (LLMs) have demonstrated advanced capabilities in
real-world agentic applications. Growing research efforts aim to develop
LLM-based agents to address practical demands, introducing a new challenge:
agentic scenarios often involve lengthy instructions with complex constraints,
such as extended system prompts and detailed tool specifications. While
adherence to such instructions is crucial for agentic applications, whether
LLMs can reliably follow them remains underexplored. In this paper, we
introduce AgentIF, the first benchmark for systematically evaluating LLM
instruction-following ability in agentic scenarios. AgentIF features three key
characteristics: (1) Realistic, constructed from 50 real-world agentic
applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words.
(3) Complex, averaging 11.9 constraints per instruction, covering diverse
constraint types, such as tool specifications and condition constraints. To
construct AgentIF, we collect 707 human-annotated instructions across 50
agentic tasks from industrial application agents and open-source agentic
systems. For each instruction, we annotate the associated constraints and
corresponding evaluation metrics, including code-based evaluation, LLM-based
evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically
evaluate existing advanced LLMs. We observe that current models generally
perform poorly, especially in handling complex constraint structures and tool
specifications. We further conduct error analysis and analytical experiments on
instruction length and meta constraints, providing some findings about the
failure modes of existing LLMs. We have released the code and data to
facilitate future research.
☆ NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Artificial Intelligence (AI) is accelerating the transformation of scientific
research paradigms, not only enhancing research efficiency but also driving
innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework
to conduct Autonomous Scientific Research (ASR) across various scientific
research fields, enabling researchers to tackle complicated problems in these
fields with unprecedented speed and precision. NovelSeek highlights three key
advantages: 1) Scalability: NovelSeek has demonstrated its versatility across
12 scientific research tasks, capable of generating innovative ideas to enhance
the performance of baseline code. 2) Interactivity: NovelSeek provides an
interface for human expert feedback and multi-agent interaction in automated
end-to-end processes, allowing for the seamless integration of domain expert
knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in
several scientific fields with significantly less time cost compared to human
efforts. For instance, in reaction yield prediction, performance increased from
27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose
from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic
segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
comment: HomePage: https://alpha-innovator.github.io/NovelSeek-project-page
☆ In-Context Watermarks for Large Language Models
The growing use of large language models (LLMs) for sensitive applications
has highlighted the need for effective watermarking techniques to ensure the
provenance and accountability of AI-generated text. However, most existing
watermarking methods require access to the decoding process, limiting their
applicability in real-world settings. One illustrative example is the use of
LLMs by dishonest reviewers in the context of academic peer review, where
conference organizers have no access to the model used but still need to detect
AI-generated reviews. Motivated by this gap, we introduce In-Context
Watermarking (ICW), which embeds watermarks into generated text solely through
prompt engineering, leveraging LLMs' in-context learning and
instruction-following abilities. We investigate four ICW strategies at
different levels of granularity, each paired with a tailored detection method.
We further examine the Indirect Prompt Injection (IPI) setting as a specific
case study, in which watermarking is covertly triggered by modifying input
documents such as academic manuscripts. Our experiments validate the
feasibility of ICW as a model-agnostic, practical watermarking approach.
Moreover, our findings suggest that as LLMs become more capable, ICW offers a
promising direction for scalable and accessible content attribution.
☆ LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large
Language Model (MLLM) that integrates visual instruction tuning with masked
diffusion models, representing a departure from the autoregressive paradigms
dominant in current multimodal approaches. Built upon LLaDA, a representative
large language diffusion model, LLaDA-V incorporates a vision encoder and MLP
connector that projects visual features into the language embedding space,
enabling effective multimodal alignment. Our empirical investigation reveals
several intriguing results: First, LLaDA-V demonstrates promising multimodal
performance despite its language model being weaker on purely textual tasks
than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same
instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal
tasks with better data scalability. It also narrows the performance gap to
Qwen2-VL, suggesting the effectiveness of its architecture for multimodal
tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal
understanding compared to existing hybrid autoregressive-diffusion and purely
diffusion-based MLLMs. Our findings suggest that large language diffusion
models show promise in multimodal contexts and warrant further investigation in
future research. Project page and codes:
https://ml-gsai.github.io/LLaDA-V-demo/.
☆ The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm
Computing the polar decomposition and the related matrix sign function has
been a well-studied problem in numerical analysis for decades. More recently,
it has emerged as an important subroutine in deep learning, particularly within
the Muon optimization framework. However, the requirements in this setting
differ significantly from those of traditional numerical analysis. In deep
learning, methods must be highly efficient and GPU-compatible, but high
accuracy is often unnecessary. As a result, classical algorithms like
Newton-Schulz (which suffers from slow initial convergence) and methods based
on rational functions (which rely on QR decompositions or matrix inverses) are
poorly suited to this context. In this work, we introduce Polar Express, a
GPU-friendly algorithm for computing the polar decomposition. Like classical
polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix
multiplications, making it GPU-compatible. Motivated by earlier work of Chen &
Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule
at each iteration by solving a minimax optimization problem, and we prove that
it enjoys a strong worst-case optimality guarantee. This property ensures both
rapid early convergence and fast asymptotic convergence. We also address
finite-precision issues, making it stable in bfloat16 in practice. We apply
Polar Express within the Muon optimization framework and show consistent
improvements in validation loss on large-scale models such as GPT-2,
outperforming recent alternatives across a range of learning rates.
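
For orientation, the classical Newton-Schulz baseline mentioned in the abstract can be sketched with matrix-matrix products alone. This is a minimal sketch of the fixed-coefficient cubic iteration, not Polar Express itself: the per-iteration polynomial coefficients that Polar Express obtains from its minimax problem are not reproduced here. The function name, the step count, and the test matrix are illustrative assumptions.

```python
import numpy as np

def newton_schulz_polar(a, steps=30):
    """Approximate the orthogonal polar factor of `a` using only matmuls."""
    # The Frobenius norm upper-bounds the spectral norm, so after this
    # scaling every singular value lies in (0, 1].
    x = a / np.linalg.norm(a)
    for _ in range(steps):
        # Classical cubic update: X <- 1.5 X - 0.5 X X^T X.
        # Each singular value s is mapped to 1.5 s - 0.5 s^3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Build a test matrix with known singular values so the exact polar
# factor u0 @ v0.T is available for comparison (illustrative data only).
rng = np.random.default_rng(0)
u0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
a = u0 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ v0.T

q = newton_schulz_polar(a)
print(np.max(np.abs(q - u0 @ v0.T)))  # deviation from the exact polar factor
```

Because a small singular value s is mapped to roughly 1.5 s, it grows only geometrically in the early steps, which is the slow initial convergence the abstract refers to.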
☆ PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues EMNLP 2025
Personally identifiable information (PII) anonymization is a high-stakes task
that poses a barrier to many open-science data sharing initiatives. While PII
identification has made large strides in recent years, in practice, error
thresholds and the recall/precision trade-off still limit the uptake of these
anonymization pipelines. We present PIIvot, a lighter-weight framework for PII
anonymization that leverages knowledge of the data context to simplify the PII
detection problem. To demonstrate its effectiveness, we also contribute
QATD-2k, the largest open-source real-world tutoring dataset of its kind, to
support the demand for quality educational dialogue data.
comment: 6 pages, 2 figures, submitted to EMNLP 2025; for the associated dataset, see
https://huggingface.co/datasets/Eedi/Question-Anchored-Tutoring-Dialogues-2k
☆ Latent Principle Discovery for Language Model Self-Improvement
When language model (LM) users aim to improve the quality of a model's generations,
it is crucial to specify concrete behavioral attributes that the model should
strive to reflect. However, curating such principles across many domains, even
non-exhaustively, requires a labor-intensive annotation process. To automate
this process, we propose eliciting these latent attributes guiding model
reasoning towards human-preferred responses by explicitly modeling them in a
self-correction setting. Our approach mines new principles from the LM itself
and compresses the discovered elements to an interpretable set via clustering.
Specifically, we employ an approximation of posterior-regularized Monte Carlo
Expectation-Maximization to both identify a condensed set of the most effective
latent principles and teach the LM to strategically invoke them in order to
intrinsically refine its responses. We demonstrate that bootstrapping our
algorithm over multiple iterations enables smaller language models (7-8B
parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an
average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on
IFEval. We also show that clustering the principles yields interpretable and
diverse model-generated constitutions while retaining model performance. The
gains our method achieves highlight the potential of automated,
principle-driven post-training recipes toward continual self-improvement.
☆ UNCLE: Uncertainty Expressions in Long-Form Generation
Large Language Models (LLMs) are prone to hallucination, particularly in
long-form generations. A promising direction to mitigate hallucination is to
teach LLMs to express uncertainty explicitly when they lack sufficient
knowledge. However, existing work lacks direct and fair evaluation of LLMs'
ability to express uncertainty effectively in long-form generation. To address
this gap, we first introduce UNCLE, a benchmark designed to evaluate
uncertainty expression in both long- and short-form question answering (QA).
UNCLE spans five domains and comprises 4k long-form QA instances and over 20k
short-form QA pairs. Our dataset is the first to directly bridge short- and
long-form QA with paired questions and gold-standard answers. Along with the
benchmark, we propose a suite of new metrics to assess the models' capabilities
to selectively express uncertainty. Using UNCLE, we then demonstrate that
current models fail to convey uncertainty appropriately in long-form
generation. We further explore both prompt-based and training-based methods to
improve models' performance, with the training-based methods yielding greater
gains. Further analysis of alignment gaps between short- and long-form
uncertainty expression highlights promising directions for future research
using UNCLE.
☆ Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality
During the finetuning stage of text generation tasks, standard cross-entropy
loss treats all tokens equally. This can lead models to overemphasize
high-frequency, low-information tokens, neglecting lower-frequency tokens
crucial for specificity and informativeness in generated content. This paper
introduces a novel loss function, Power-Law Decay Loss (PDL), specifically
designed to optimize the finetuning process for text generation. The core
motivation for PDL stems from observations in information theory and
linguistics: the informativeness of a token is often inversely proportional to
its frequency of occurrence. PDL re-weights the contribution of each token in
the standard cross-entropy loss based on its frequency in the training corpus,
following a power-law decay. Specifically, the weights for high-frequency
tokens are reduced, while low-frequency, information-dense tokens are assigned
higher weights. This mechanism guides the model during finetuning to focus more
on learning and generating tokens that convey specific and unique information,
thereby enhancing the quality, diversity, and informativeness of the generated
text. We theoretically elaborate on the motivation and construction of PDL and
discuss its potential applications and advantages across various text
generation finetuning tasks, such as abstractive summarization, dialogue
systems, and style transfer.
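
The re-weighting idea can be illustrated with a small sketch. The abstract does not give the exact formula, so the weight form 1 / (freq + eps)^alpha, the normalization by the sum of weights, and all function and parameter names below are assumptions made for illustration only.

```python
import math
from collections import Counter

def power_law_weights(corpus_tokens, alpha=1.0, eps=1e-8):
    """Assumed weight form: w(t) = 1 / (relative_freq(t) + eps) ** alpha."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: 1.0 / ((c / total) + eps) ** alpha for tok, c in counts.items()}

def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Weighted average of -log p(target), normalized by the sum of weights."""
    num = sum(weights[t] * -math.log(p) for t, p in zip(target_tokens, target_probs))
    den = sum(weights[t] for t in target_tokens)
    return num / den

# Toy corpus: "the" is frequent, "quark" is rare.
corpus = ["the"] * 8 + ["quark"] * 2
weights = power_law_weights(corpus, alpha=1.0)

targets = ["the", "quark"]
probs = [0.9, 0.1]  # hypothetical model probabilities for the gold tokens
plain = sum(-math.log(p) for p in probs) / len(probs)
weighted = weighted_cross_entropy(targets, probs, weights)
flat = weighted_cross_entropy(targets, probs, power_law_weights(corpus, alpha=0.0))
print(plain, weighted, flat)
```

With alpha = 0 every weight is 1 and the loss reduces to standard cross-entropy; with alpha > 0 the poorly predicted rare token dominates the loss.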
☆ Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs
Hallucinations -- plausible yet erroneous outputs -- remain a critical
barrier to reliable deployment of large language models (LLMs). We present the
first systematic study linking hallucination incidence to internal-state drift
induced by incremental context injection. Using TruthfulQA, we construct two
16-round "titration" tracks per question: one appends relevant but partially
flawed snippets, the other injects deliberately misleading content. Across six
open-source LLMs, we track overt hallucination rates with a tri-perspective
detector and covert dynamics via cosine, entropy, JS and Spearman drifts of
hidden states and attention maps. Results reveal (1) monotonic growth of
hallucination frequency and representation drift that plateaus after 5--7
rounds; (2) relevant context drives deeper semantic assimilation, producing
high-confidence "self-consistent" hallucinations, whereas irrelevant context
induces topic-drift errors anchored by attention re-routing; and (3)
convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an
"attention-locking" threshold beyond which hallucinations solidify and become
resistant to correction. Correlation analyses expose a seesaw between
assimilation capacity and attention diffusion, clarifying size-dependent error
modes. These findings supply empirical foundations for intrinsic hallucination
prediction and context-aware mitigation mechanisms.
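
One way to read the reported JS-Drift value of about 0.69: the Jensen-Shannon divergence measured in nats is bounded above by ln 2 ≈ 0.693, and it reaches that bound when the two distributions have disjoint support. Whether the paper's JS-Drift is computed in nats, and over which distributions, is not stated in the abstract, so the sketch below is an assumption-laden illustration of the bound rather than the authors' metric.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in nats."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0])
disjoint = js_divergence([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
print(same, disjoint, math.log(2))
```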
☆ CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework
Large language models (LLMs) have advanced many applications, but are also
known to be vulnerable to adversarial attacks. In this work, we introduce a
novel security threat: hijacking AI-human conversations by manipulating LLMs'
system prompts to produce malicious answers only to specific targeted questions
(e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"),
while behaving benignly on others. This attack is detrimental as it can enable
malicious actors to exercise large-scale information manipulation by spreading
harmful but benign-looking system prompts online. To demonstrate such an
attack, we develop CAIN, an algorithm that can automatically curate such
harmful system prompts for a specific target question in a black-box setting,
i.e., without access to the LLM's parameters. Evaluated on both open-source
and commercial LLMs, CAIN demonstrates significant adversarial impact. In
untargeted attacks, which force LLMs to output incorrect answers, CAIN achieves
up to 40% F1 degradation on targeted questions while preserving high accuracy
on benign inputs. In targeted attacks, which force LLMs to output specific
harmful answers, CAIN achieves F1 scores above 70% on these targeted responses
with minimal impact on benign questions. Our results highlight the critical
need for enhanced robustness measures to safeguard the integrity and safety of
LLMs in real-world applications. All source code will be publicly available.
☆ Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
With the growing success of reasoning models across complex natural language
tasks, researchers in the Information Retrieval (IR) community have begun
exploring how similar reasoning capabilities can be integrated into passage
rerankers built on Large Language Models (LLMs). These methods typically employ
an LLM to produce an explicit, step-by-step reasoning process before arriving
at a final relevance prediction. But does reasoning actually improve reranking
accuracy? In this paper, we dive deeper into this question, studying the impact
of the reasoning process by comparing reasoning-based pointwise rerankers
(ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under
identical training conditions, and observe that StandardRR generally
outperforms ReasonRR. Building on this observation, we then study the
importance of reasoning to ReasonRR by disabling its reasoning process
(ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more
effective than ReasonRR. Examining the cause of this result, our findings
reveal that reasoning-based rerankers are limited by the LLM's reasoning
process, which pushes it toward polarized relevance scores and thus fails to
consider the partial relevance of passages, a key factor for the accuracy of
pointwise rerankers.
☆ CASTILLO: Characterizing Response Length Distributions of Large Language Models
Efficiently managing compute resources for Large Language Model (LLM)
inference remains challenging due to the inherently stochastic and variable
lengths of autoregressive text generation. Accurately estimating response
lengths in advance enables proactive resource allocation, yet existing
approaches either bias text generation towards certain lengths or rely on
assumptions that ignore model- and prompt-specific variability. We introduce
CASTILLO, a dataset characterizing response length distributions across 13
widely-used open-source LLMs evaluated on seven distinct instruction-following
corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10
independent completions using fixed decoding hyper-parameters, record the token
length of each response, and publish summary statistics (mean, std-dev,
percentiles), along with the shortest and longest completions, and the exact
generation settings. Our analysis reveals significant inter- and intra-model
variability in response lengths (even under identical generation settings), as
well as model-specific behaviors and occurrences of partial text degeneration
in only subsets of responses. CASTILLO enables the development of predictive
models for proactive scheduling and provides a systematic framework for
analyzing model-specific generation behaviors. We publicly release the dataset
and code to foster research at the intersection of generative language modeling
and systems.
comment: Dataset available at
https://huggingface.co/datasets/danfperam/castillo and code is available at
https://github.com/DanielFPerez/castillo
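
The per-pair summary described in the abstract (mean, std-dev, percentiles, shortest and longest completion) can be sketched with the standard library. The field names, the choice of sample standard deviation, the quartile method, and the example lengths are assumptions for illustration; the released dataset may define these differently.

```python
import statistics

def length_summary(token_lengths):
    """Summarize the token lengths of repeated completions for one prompt/model pair."""
    # Quartile cut points; "inclusive" treats the data as the full population.
    p25, p50, p75 = statistics.quantiles(token_lengths, n=4, method="inclusive")
    return {
        "min": min(token_lengths),
        "max": max(token_lengths),
        "mean": statistics.mean(token_lengths),
        "std": statistics.stdev(token_lengths),  # sample standard deviation
        "p25": p25,
        "p50": p50,
        "p75": p75,
    }

# Ten hypothetical completion lengths, one of which hit a generation limit.
lengths = [120, 95, 310, 101, 99, 2048, 130, 97, 115, 105]
summary = length_summary(lengths)
print(summary)
```

In this toy sample a single very long completion pulls the mean far above the median, which is the kind of within-prompt variability the abstract describes.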
☆ MPO: Multilingual Safety Alignment via Reward Gap Optimization ACL 2025
Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Large language models (LLMs) have become increasingly central to AI
applications worldwide, necessitating robust multilingual safety alignment to
ensure secure deployment across diverse linguistic contexts. Existing
preference learning methods for safety alignment, such as RLHF and DPO, are
primarily monolingual and struggle with noisy multilingual data. To address
these limitations, we introduce Multilingual reward gaP Optimization (MPO), a
novel approach that leverages the well-aligned safety capabilities of the
dominant language (English) to improve safety alignment across multiple
languages. MPO directly minimizes the reward gap difference between the
dominant language and target languages, effectively transferring safety
capabilities while preserving the original strengths of the dominant language.
Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate
MPO's efficacy in multilingual safety alignment without degrading general
multilingual utility.
comment: To Appear at ACL 2025 (Main)
☆ Comparative analysis of subword tokenization approaches for Indian languages
Tokenization is the act of breaking down text into smaller parts, or tokens,
that are easier for machines to process. This is a key phase in machine
translation (MT) models. Subword tokenization enhances this process by breaking
down words into smaller subword units, which is especially beneficial in
languages with complicated morphology or a vast vocabulary. It is useful in
capturing the intricate structure of words in Indian languages (ILs), such as
prefixes, suffixes, and other morphological variations. These languages
frequently use agglutinative structures, in which words are formed by the
combination of multiple morphemes such as suffixes, prefixes, and stems. As a
result, a suitable tokenization strategy must be chosen to address these
scenarios. This paper examines how different subword tokenization techniques,
such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization,
affect ILs. The effectiveness of these subword tokenization techniques is
investigated in statistical, neural, and multilingual neural machine
translation models. All models are examined using standard evaluation metrics,
such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF,
RIBES, and COMET. Based on the results, it appears that for the majority of
language pairs in the Statistical and Neural MT models, the SentencePiece
tokenizer consistently performed better than other tokenizers in terms of BLEU
score. However, BPE tokenization outperformed other tokenization techniques in
the context of the Multilingual Neural Machine Translation model. The results show
that, despite using the same tokenizer and dataset for each model, translations
from ILs to English surpassed translations from English to ILs.
comment: 24 pages, 4 tables
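
Of the techniques compared above, BPE is the easiest to show in a few lines. The following is a minimal sketch of the standard merge-learning loop on a toy English word list; it is not the paper's setup, and the function name and corpus are illustrative assumptions.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the shared stem "low" has become a single symbol, while the suffixes remain split into characters.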
☆ Nested Named Entity Recognition as Single-Pass Sequence Labeling EMNLP 2025
We cast nested named entity recognition (NNER) as a sequence labeling task by
leveraging prior work that linearizes constituency structures, effectively
reducing the complexity of this structured prediction problem to
straightforward token classification. By combining these constituency
linearizations with pretrained encoders, our method captures nested entities
while performing exactly $n$ tagging actions. Our approach achieves competitive
performance compared to less efficient systems, and it can be trained using any
off-the-shelf sequence labeling library.
comment: Submitted to EMNLP 2025
☆ ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning
Tajamul Ashraf, Mohammed Mohsen Peerzada, Moloud Abdar, Yutong Xie, Yuyin Zhou, Xiaofeng Liu, Iqra Altaf Gillani, Janibul Bashir
Federated Learning (FL) has emerged as a promising paradigm for collaborative
model training while preserving data privacy across decentralized participants.
As FL adoption grows, numerous techniques have been proposed to tackle its
practical challenges. However, the lack of standardized evaluation across key
dimensions hampers systematic progress and fair comparison of FL methods. In
this work, we introduce ATR-Bench, a unified framework for analyzing federated
learning through three foundational dimensions: Adaptation, Trust, and
Reasoning. We provide an in-depth examination of the conceptual foundations,
task formulations, and open research challenges associated with each theme. We
have extensively benchmarked representative methods and datasets for adaptation
to heterogeneous clients and trustworthiness in adversarial or unreliable
environments. Due to the lack of reliable metrics and models for reasoning in
FL, we only provide literature-driven insights for this dimension. ATR-Bench
lays the groundwork for a systematic and holistic evaluation of federated
learning with real-world relevance. We will make our complete codebase publicly
accessible, along with a curated repository that continuously tracks new
developments and research in the FL literature.
comment: Federated Learning Benchmark for Domain Adaptation, Trustworthiness,
and Reasoning
☆ Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study
This paper introduces a method for detecting inappropriately targeting
language in online conversations by integrating crowd and expert annotations
with ChatGPT. We focus on English conversation threads from Reddit, examining
comments that target individuals or groups. Our approach involves a
comprehensive annotation framework that labels a diverse data set for various
target categories and specific target words within the conversational context.
We perform a comparative analysis of annotations from human experts, crowd
annotators, and ChatGPT, revealing strengths and limitations of each method in
recognizing both explicit hate speech and subtler discriminatory language. Our
findings highlight the significant role of contextual factors in identifying
hate speech and uncover new categories of targeting, such as social belief and
body image. We also address the challenges and subjective judgments involved in
annotation and the limitations of ChatGPT in grasping nuanced language. This
study provides insights for improving automated content moderation strategies
to enhance online safety and inclusivity.
☆ R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, Dacheng Tao
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by
enabling step-by-step problem-solving, yet its extension to Long-CoT introduces
substantial computational overhead due to increased token length. Existing
compression approaches -- instance-level and token-level -- either sacrifice
essential local reasoning signals like reflection or yield incoherent outputs.
To address these limitations, we propose R1-Compress, a two-stage chunk-level
compression framework that preserves both local information and coherence. Our
method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk
compression, and employs an inter-chunk search mechanism to select the short
and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500,
AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces
token usage while maintaining comparable reasoning accuracy. On MATH500,
R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to
the Long-CoT baseline, while reducing token usage by about 20%. Source code
will be available at https://github.com/w-yibo/R1-Compress
☆ SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Retrieval-augmented generation (RAG) systems have advanced large language
models (LLMs) in complex deep search scenarios requiring multi-step reasoning
and iterative information retrieval. However, existing approaches face critical
limitations: they lack high-quality training trajectories, or they suffer from
distributional mismatches in simulated environments and prohibitive
computational costs for real-world deployment. This paper introduces
SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap
through strategic data engineering rather than complex training paradigms. Our
approach synthesizes high-quality training data by simulating realistic user
interactions in live web search environments, coupled with a multi-criteria
curation strategy that optimizes the diversity and quality of both the input and
output sides. Experiments on five benchmarks across diverse domains demonstrate that
SFT on only 871 curated samples yields significant improvements over RL-based
baselines. Our work establishes SFT as a viable pathway by systematically
addressing the data-scarce bottleneck, offering practical insights for
efficient deep search systems. Our code is available at
https://github.com/RUCAIBox/SimpleDeepSearcher.
☆ From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization
While foundation models (FMs), such as diffusion models and large
vision-language models (LVLMs), have been widely applied in educational
contexts, their ability to generate pedagogically effective visual explanations
remains limited. Most existing approaches focus primarily on textual reasoning,
overlooking the critical role of structured and interpretable visualizations in
supporting conceptual understanding. To better assess the visual reasoning
capabilities of FMs in educational settings, we introduce EduVisBench, a
multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem
sets requiring visually grounded solutions, along with a fine-grained
evaluation rubric informed by pedagogical theory. Our empirical analysis
reveals that existing models frequently struggle with the inherent challenge of
decomposing complex reasoning and translating it into visual representations
aligned with human cognitive processes. To address these limitations, we
propose EduVisAgent, a multi-agent collaborative framework that coordinates
specialized agents for instructional planning, reasoning decomposition,
metacognitive prompting, and visualization design. Experimental results show
that EduVisAgent substantially outperforms all baselines, achieving a 40.2%
improvement and delivering more educationally aligned visualizations.
EduVisBench and EduVisAgent are available at
https://github.com/aiming-lab/EduVisBench and
https://github.com/aiming-lab/EduVisAgent.
comment: 16 pages; 7 figures
☆ Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Unlearning in large language models (LLMs) is intended to remove the
influence of specific data, yet current evaluations rely heavily on token-level
metrics such as accuracy and perplexity. We show that these metrics can be
misleading: models often appear to forget, but their original behavior can be
rapidly restored with minimal fine-tuning, revealing that unlearning may
obscure information rather than erase it. To diagnose this phenomenon, we
introduce a representation-level evaluation framework using PCA-based
similarity and shift, centered kernel alignment, and Fisher information.
Applying this toolkit across six unlearning methods, three domains (text, code,
math), and two open-source LLMs, we uncover a critical distinction between
reversible and irreversible forgetting. In reversible cases, models suffer
token-level collapse yet retain latent features; in irreversible cases, deeper
representational damage occurs. We further provide a theoretical account
linking shallow weight perturbations near output layers to misleading
unlearning signals, and show that reversibility is modulated by task type and
hyperparameters. Our findings reveal a fundamental gap in current evaluation
practices and establish a new diagnostic foundation for trustworthy unlearning
in LLMs. We provide a unified toolkit for analyzing LLM representation changes
under unlearning and relearning:
https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
comment: 44 pages
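
Among the diagnostics listed above, centered kernel alignment has a compact closed form in its linear variant. The abstract does not say whether the linear or a kernel variant is used, so the sketch below assumes linear CKA; the activation matrices are random stand-ins rather than real model representations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (samples, features)."""
    # Center each feature column before comparing.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 8))
rot, _ = np.linalg.qr(rng.standard_normal((8, 8)))

rotated = 2.0 * x @ rot                    # same features, rotated and rescaled
unrelated = rng.standard_normal((200, 8))  # independent features

print(linear_cka(x, rotated), linear_cka(x, unrelated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so the rotated copy scores 1 while independent features score close to 0. That invariance is what makes it suitable for asking whether latent features survive even when token-level outputs have changed.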
☆ KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
Recent advances have demonstrated that integrating reinforcement learning
with rule-based rewards can significantly enhance the reasoning capabilities of
large language models, even without supervised fine-tuning. However, prevalent
reinforcement learning algorithms, such as GRPO and its variants like DAPO,
suffer from a coarse granularity issue when computing the advantage.
Specifically, they compute rollout-level advantages that assign identical
values to every token within a sequence, failing to capture token-specific
contributions and hindering effective learning. To address this limitation, we
propose Key-token Advantage Estimation (KTAE) - a novel algorithm that
estimates fine-grained, token-level advantages without introducing additional
models. KTAE leverages the correctness of sampled rollouts and applies
statistical analysis to quantify the importance of individual tokens within a
sequence to the final outcome. This quantified token-level importance is then
combined with the rollout-level advantage to obtain a more fine-grained
token-level advantage estimation. Empirical results show that models trained
with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five
mathematical reasoning benchmarks. Notably, they achieve higher accuracy with
shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base
model.
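
The coarse-granularity issue can be made concrete with a sketch of a group-normalized, rollout-level advantage in the style of GRPO, in which every token of a rollout receives the same value. This illustrates only the baseline behavior; KTAE's token-level statistics are not specified in the abstract and are not reproduced here. The function names, the use of the population standard deviation, and the reward values are assumptions.

```python
import statistics

def rollout_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per rollout: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Rollout-level credit assignment: every token gets the same advantage."""
    return [advantage] * num_tokens

# Four sampled rollouts for one prompt with rule-based 0/1 correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rollout_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=5)
print(advantages)
print(token_advantages)
```

A token-level method would replace the constant list returned by `broadcast_to_tokens` with values that differ from token to token.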
☆ Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Named Entity Recognition (NER) for low-resource languages aims to produce
robust systems for languages where there is limited labeled training data
available, and has been an area of increasing interest within NLP. Data
augmentation for increasing the amount of low-resource labeled data is a common
practice. In this paper, we explore the role of synthetic data in the context
of multilingual, low-resource NER, considering 11 languages from diverse
language families. Our results suggest that synthetic data does in fact hold
promise for low-resource language NER, though we see significant variation
between languages.
comment: pre-print
☆ Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement
Large language models (LLMs) encounter difficulties in knowledge-intensive
multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract
and represent rationale evidence. The current methods often extract
semantically relevant but logically irrelevant evidence, resulting in flawed
reasoning and inaccurate responses. We propose a two-way evidence
self-alignment (TW-ESA) module, which utilizes the mutual alignment between
strict reasoning and LLM reasoning to enhance its understanding of the causal
logic of evidence, thereby addressing the first challenge. Another challenge is
how to utilize the rationale evidence and LLM's intrinsic knowledge for
accurate reasoning when the evidence contains uncertainty. We propose a
dual-gated reasoning enhancement (DGR) module to gradually fuse useful
knowledge of LLM within strict reasoning, which can enable the model to perform
accurate reasoning by focusing on causal elements in the evidence and exhibit
greater robustness. The two modules are collaboratively trained in a unified
framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR
datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based
fine-tuning methods, with remarkable average improvements of 4% in exact match
(EM) and 5% in F1 score. The implementation code is available at
https://anonymous.4open.science/r/ESA-DGR-2BF8.
☆ Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation
We introduce a transformer-based morpheme segmentation system that augments a
low-resource training signal through multitask learning and LLM-generated
synthetic data. Our framework jointly predicts morphological segments and
glosses from orthographic input, leveraging shared linguistic representations
obtained through a common documentary process to enhance model generalization.
To further address data scarcity, we integrate synthetic training data
generated by large language models (LLMs) using in-context learning.
Experimental results on the SIGMORPHON 2023 dataset show that our approach
significantly improves word-level segmentation accuracy and morpheme-level
F1-score across multiple low-resource languages.
☆ Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability
As large language models gain popularity, their vulnerability to adversarial
attacks remains a primary concern. While fine-tuning models on domain-specific
datasets is often employed to improve model performance, it can introduce
vulnerabilities within the underlying model. In this work, we investigate
Accidental Misalignment, unexpected vulnerabilities arising from
characteristics of fine-tuning data. We begin by identifying potential
correlation factors such as linguistic features, semantic similarity, and
toxicity within our experimental datasets. We then evaluate the adversarial
performance of these fine-tuned models and assess how dataset factors correlate
with attack success rates. Lastly, we explore potential causal links, offering
new insights into adversarial defense strategies and highlighting the crucial
role of dataset design in preserving model alignment. Our code is available at
https://github.com/psyonp/accidental_misalignment.
☆ Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Large Language Models (LLMs) have achieved impressive performance on complex
reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional
CoT relies on reasoning steps explicitly verbalized in natural language,
introducing inefficiencies and limiting its applicability to abstract
reasoning. To address this, there has been growing research interest in latent
CoT reasoning, where inference occurs within latent spaces. By decoupling
reasoning from language, latent reasoning promises richer cognitive
representations and more flexible, faster inference. Researchers have explored
various directions in this promising field, including training methodologies,
structural innovations, and internal reasoning mechanisms. This paper presents
a comprehensive overview and analysis of this reasoning paradigm. We begin by
proposing a unified taxonomy from four perspectives: token-wise strategies,
internal mechanisms, analysis, and applications. We then provide in-depth
discussions and comparative analyses of representative methods, highlighting
their design patterns, strengths, and open challenges. We aim to provide a
structured foundation for advancing this emerging direction in LLM reasoning.
The relevant papers will be regularly updated at
https://github.com/EIT-NLP/Awesome-Latent-CoT.
☆ IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
Large language models (LLMs) have demonstrated strong instruction-following
capabilities in text-based tasks. However, this ability often deteriorates in
multimodal models after alignment with non-text modalities such as images or
audio. While several recent efforts have investigated instruction-following
performance in text and vision-language models, instruction-following in
audio-based large language models remains largely unexplored. To bridge this
gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess
the ability to follow instructions in an audio LLM. IFEval-Audio contains 280
audio-instruction-answer triples across six diverse dimensions: Content,
Capitalization, Symbol, List Structure, Length, and Format. Each example pairs
an audio input with a text instruction, requiring the model to generate an
output that follows a specified structure. We benchmark state-of-the-art audio
LLMs on their ability to follow audio-involved instructions. The dataset is
released publicly to support future research in this emerging area.
comment: Link: https://github.com/AudioLLMs/AudioBench/tree/main/IFEval-Audio
☆ TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
Large Language Models (LLMs) present significant computational and memory
challenges due to their extensive size, making pruning essential for their
efficient deployment. Existing one-shot pruning methods often apply uniform
sparsity constraints across layers or within each layer, resulting in
suboptimal performance, especially at high sparsity ratios. This work
introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel
approach that applies varying sparsity ratios to individual output dimensions
(rows) within each layer. TRIM employs an iterative adjustment process guided
by quality metrics to optimize dimension-wise sparsity allocation, focusing on
reducing variance in quality retention across outputs to preserve critical
information. TRIM can be seamlessly integrated with existing layer-wise pruning
strategies. Our evaluations on perplexity and zero-shot tasks across diverse
LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that
TRIM achieves new state-of-the-art results and enhances stability. For
instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and
over 90% for OPT-13B compared to baseline methods. We conclude that
fine-grained, dimension-wise sparsity adaptation is crucial for pushing the
limits of extreme LLM compression. Code available at:
https://github.com/flobk/TRIM
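The idea of giving each output row its own sparsity ratio can be sketched as follows. This is only an illustration under stated assumptions: it uses plain weight magnitude as the pruning criterion and fixed, hand-picked per-row ratios, whereas the abstract describes an iterative, quality-metric-driven allocation that is not reproduced here. All names and values are hypothetical.

```python
def prune_row(row, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of entries in one row."""
    n_prune = int(len(row) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(row)), key=lambda i: abs(row[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(row)]

def prune_rowwise(weights, row_sparsities):
    """Apply a different sparsity ratio to each output row of a weight matrix."""
    return [prune_row(row, s) for row, s in zip(weights, row_sparsities)]

weights = [
    [0.9, -0.1, 0.4, 0.05],
    [0.2, -0.3, 0.25, 0.35],
]
row_sparsities = [0.75, 0.25]  # hypothetical per-row ratios, averaging 50% overall
pruned = prune_rowwise(weights, row_sparsities)
```

In this toy case the layer-level sparsity is still 50%, but the first row absorbs most of the pruning while the second row keeps three of its four weights.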
☆ Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
The significant progress of large language models (LLMs) has led to
remarkable achievements across numerous applications. However, their ability to
generate harmful content has sparked substantial safety concerns. Despite the
implementation of safety alignment techniques during the pre-training phase,
recent research indicates that fine-tuning LLMs on adversarial or even benign
data can inadvertently compromise their safety. In this paper, we re-examine
the fundamental issue of why fine-tuning on non-harmful data still results in
safety degradation. We introduce a safety-aware probing (SAP) optimization
framework designed to mitigate the safety risks of fine-tuning LLMs.
Specifically, SAP incorporates a safety-aware probe into the gradient
propagation process, mitigating the model's risk of safety degradation by
identifying potential pitfalls in gradient directions, thereby enhancing
task-specific performance while successfully preserving model safety. Our
extensive experimental results demonstrate that SAP effectively reduces
harmfulness below that of the original fine-tuned model and achieves test
loss comparable to standard fine-tuning methods. Our code is available at
https://github.com/ChengcanWu/SAP.
☆ Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
As large language models (LLMs) become increasingly prevalent in global
applications, ensuring that they are toxicity-free across diverse linguistic
contexts remains a critical challenge. We explore "Cross-lingual
Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling
detoxification capabilities to transfer between high and low-resource languages
across different script families. We analyze cross-lingual detoxification's
effectiveness through 504 extensive settings to evaluate toxicity reduction in
cross-distribution settings with limited data and investigate how mitigation
impacts model performance on non-toxic tasks, revealing trade-offs between
safety and knowledge preservation. Our code and dataset are publicly available
at https://github.com/himanshubeniwal/Breaking-mBad.
☆ Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs
Although multimodal large language models (MLLMs) have achieved impressive
performance, the multimodal instruction tuning stage often causes catastrophic
forgetting of the base LLM's language ability, even in strong models like
Llama3. To address this, we propose Locate-then-Merge, a training-free
parameter fusion framework that first locates important parameters and then
selectively merges them. We further introduce Neuron-Fusion, a neuron-level
strategy that preserves the influence of neurons with large parameter
shifts--neurons likely responsible for newly acquired visual
capabilities--while attenuating the influence of neurons with smaller changes
that likely encode general-purpose language skills. This design enables better
retention of visual adaptation while mitigating language degradation.
Experiments on 13 benchmarks across both language and visual tasks show that
Neuron-Fusion consistently outperforms existing model merging methods. Further
analysis reveals that our method effectively reduces context hallucination in
generation.
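A neuron-level merge of the kind described above might look like the following sketch. It is not the authors' implementation: the L1 norm used to measure each neuron's shift, the `keep_fraction` and `attenuation` parameters, and the toy matrices are all assumptions chosen for illustration.

```python
def neuron_fusion(base, tuned, keep_fraction=0.5, attenuation=0.2):
    """Merge two weight matrices row by row (one row = one neuron)."""
    deltas = [[t - b for b, t in zip(b_row, t_row)] for b_row, t_row in zip(base, tuned)]
    # Size of each neuron's parameter shift (assumption: L1 norm of the delta).
    shifts = [sum(abs(d) for d in row) for row in deltas]
    n_keep = max(1, int(len(shifts) * keep_fraction))
    keep = set(sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)[:n_keep])
    merged = []
    for i, (b_row, d_row) in enumerate(zip(base, deltas)):
        # Large-shift neurons keep their full update; the rest are attenuated.
        scale = 1.0 if i in keep else attenuation
        merged.append([b + scale * d for b, d in zip(b_row, d_row)])
    return merged

base = [[1.0, 1.0], [1.0, 1.0]]    # hypothetical base LLM weights
tuned = [[3.0, 1.0], [1.0, 1.5]]   # hypothetical weights after multimodal tuning
merged = neuron_fusion(base, tuned)
```

Here the first neuron, which changed the most, retains its tuned weights, while the second is pulled most of the way back toward the base model.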
☆ Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence ICML 2025
Transformer-based language models exhibit In-Context Learning (ICL), where
predictions are made adaptively based on context. While prior work links
induction heads to ICL through a sudden jump in accuracy, this can only account
for ICL when the answer is included within the context. However, an important
property of practical ICL in large language models is the ability to meta-learn
how to solve tasks from context, rather than just copying answers from context;
how such an ability is obtained during training is largely unexplored. In this
paper, we experimentally clarify how such meta-learning ability is acquired by
analyzing the dynamics of the model's circuit during training. Specifically, we
extend the copy task from previous research into an In-Context Meta Learning
setting, where models must infer a task from examples to answer queries.
Interestingly, in this setting, we find that there are multiple phases in the
process of acquiring such abilities, and that a unique circuit emerges in each
phase, contrasting with the single-phase change in induction heads. The
emergence of such circuits can be related to several phenomena known in large
language models, and our analysis leads to a deeper understanding of the source
of the transformer's ICL ability.
comment: Accepted to ICML 2025
☆ SPaRC: A Spatial Pathfinding Reasoning Challenge
Existing reasoning datasets saturate and fail to test abstract, multi-step
problems, especially pathfinding and complex rule constraint satisfaction. We
introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000
2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning,
requiring step-by-step planning with arithmetic and geometric rules. Humans
achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best
reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles).
Models often generate invalid paths (>50% of puzzles for o4-mini), and
reasoning tokens reveal they make errors in navigation and spatial logic.
Unlike humans, who take longer on hard puzzles, models fail to scale test-time
compute with difficulty. Allowing models to make multiple solution attempts
improves accuracy, suggesting potential for better spatial reasoning with
improved training and efficient test-time scaling methods. SPaRC can be used as
a window into models' spatial reasoning limitations and drive research toward
new methods that excel in abstract, multi-step problem-solving.
☆ R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
In this work, we aim to incentivize the reasoning ability of Multimodal Large
Language Models (MLLMs) via reinforcement learning (RL) and develop an
effective approach that mitigates the sparse reward and advantage vanishing
issues during RL. To this end, we propose Share-GRPO, a novel RL approach that
tackles these issues by exploring and sharing diverse reasoning trajectories
over an expanded question space. Specifically, Share-GRPO first expands the
question space for a given question via data transformation techniques, and
then encourages the MLLM to effectively explore diverse reasoning trajectories over
the expanded question space and shares the discovered reasoning trajectories
across the expanded questions during RL. In addition, Share-GRPO also shares
reward information during advantage computation, which estimates solution
advantages hierarchically across and within question variants, allowing more
accurate estimation of relative advantages and improving the stability of
policy training. Extensive evaluations over six widely-used reasoning
benchmarks showcase the superior performance of our method. Code will be
available at https://github.com/HJYao00/R1-ShareVL.
comment: Technical report
☆ A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
We present a Japanese domain-specific language model for the pharmaceutical
field, developed through continual pretraining on 2 billion Japanese
pharmaceutical tokens and 8 billion English biomedical tokens. To enable
rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on
national pharmacist licensing exams; NayoseQA, which tests cross-lingual
synonym and terminology normalization; and SogoCheck, a novel task designed to
assess consistency reasoning between paired statements. We evaluate our model
against both open-source medical LLMs and commercial models, including GPT-4o.
Results show that our domain-specific model outperforms existing open models
and achieves competitive performance with commercial ones, particularly on
terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o
performs poorly on SogoCheck, suggesting that cross-sentence consistency
reasoning remains an open challenge. Our benchmark suite offers a broader
diagnostic lens for pharmaceutical NLP, covering factual recall, lexical
variation, and logical consistency. This work demonstrates the feasibility of
building practical, secure, and cost-effective language models for Japanese
domain-specific applications, and provides reusable evaluation resources for
future research in pharmaceutical and healthcare NLP. Our model, code, and
datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
comment: 15 pages, 9 tables, 5 figures
☆ Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu
This study addresses the challenges in intelligent processing of Chinese
ancient mathematical classics by constructing Guji_MATH, a benchmark for
evaluating classical texts based on Suanjing Shishu. It systematically assesses
the mathematical problem-solving capabilities of mainstream reasoning models
under the unique linguistic constraints of classical Chinese. Through
machine-assisted annotation and manual verification, 538 mathematical problems
were extracted from 8 canonical texts, forming a structured dataset centered on
the "Question-Answer-Solution" framework, supplemented by problem types and
difficulty levels. Dual evaluation modes--closed-book (autonomous
problem-solving) and open-book (reproducing classical solution methods)--were
designed to evaluate the performance of six reasoning models on ancient Chinese
mathematical problems. Results indicate that reasoning models can partially
comprehend and solve these problems, yet their overall performance remains
inferior to benchmarks on modern mathematical tasks. Enhancing models'
classical Chinese comprehension and cultural knowledge should be prioritized
for optimization. This study provides methodological support for mining
mathematical knowledge from ancient texts and disseminating traditional
culture, while offering new perspectives for evaluating cross-linguistic and
cross-cultural capabilities of reasoning models.
comment: 29 pages, 7 figures
☆ Collaboration among Multiple Large Language Models for Medical Question Answering
Empowered by a vast internal knowledge reservoir, the new generation of large
language models (LLMs) demonstrates untapped potential for medical tasks.
However, little effort has been made to elicit a synergistic effect from
multiple LLMs' expertise and background. In this study, we propose
a multi-LLM collaboration framework tailored to a medical multiple-choice
question dataset. Through post-hoc analysis of 3 pre-trained LLM participants,
our framework is shown to boost the reasoning ability of all LLMs and to
reduce their divergence across questions. We also measure an LLM's confidence
when it is confronted with opposing opinions from other LLMs and observe a
concurrence between an LLM's confidence and its prediction accuracy.
comment: Accepted to IEEE International Conference on Healthcare Informatics 2025
☆ SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Large language models (LLMs) have recently demonstrated remarkable
capabilities in machine translation (MT). However, most advanced MT-specific
LLMs heavily rely on external supervision signals during training, such as
human-annotated reference data or trained reward models (RMs), which are often
expensive to obtain and challenging to scale. To overcome this limitation, we
propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for
MT that is reference-free, fully online, and relies solely on self-judging
rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as
the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs,
e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like
Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks
from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR
with external supervision from COMET, our strongest model, SSR-X-Zero-7B,
achieves state-of-the-art performance in English $\leftrightarrow$ Chinese
translation, surpassing all existing open-source models under 72B parameters
and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro.
Our analysis highlights the effectiveness of the self-rewarding mechanism
compared to the external LLM-as-a-judge approach in MT and demonstrates its
complementary benefits when combined with trained RMs. Our findings provide
valuable insight into the potential of self-improving RL methods. We have
publicly released our code, data and models.
☆ MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries
Despite bilingual speakers frequently using mixed-language queries in web
searches, Information Retrieval (IR) research on them remains scarce. To
address this, we introduce MiLQ, a Mixed-Language Query test set, the first public
benchmark of mixed-language queries, confirmed as realistic and highly
preferred. Experiments show that multilingual IR models perform moderately on
MiLQ and inconsistently across native, English, and mixed-language queries,
also suggesting code-switched training data's potential for robust IR models
handling such queries. Meanwhile, intentional English mixing in queries proves
an effective strategy for bilinguals searching English documents, which our
analysis attributes to enhanced token matching compared to native queries.
comment: 16 pages, 9 figures
☆ Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports
Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil
We present a novel approach to Chest X-ray (CXR) Visual Question Answering
(VQA), addressing both single-image and image-difference questions. Single-image
questions focus on abnormalities within a specific CXR ("What abnormalities are
seen in image X?"), while image-difference questions compare two longitudinal
CXRs acquired at different time points ("What are the differences between image
X and Y?"). We further explore how the integration of radiology reports can
enhance the performance of VQA models. While previous approaches have
demonstrated the utility of radiology reports during the pre-training phase, we
extend this idea by showing that the reports can also be leveraged as
additional input to improve the VQA model's predicted answers. First, we
propose a unified method that handles both types of questions and
auto-regressively generates the answers. For single-image questions, the model
is provided with a single CXR. For image-difference questions, the model is
provided with two CXRs from the same patient, captured at different time
points, enabling the model to detect and describe temporal changes. Taking
inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance
on the CXR VQA task can be improved by grounding the answer generator module
with a radiology report predicted for the same CXR. In our approach, the VQA
model is divided into two steps: i) Report Generation (RG) and ii) Answer
Generation (AG). Our results demonstrate that incorporating predicted radiology
reports as evidence to the AG model enhances performance on both single-image
and image-difference questions, achieving state-of-the-art results on the
Medical-Diff-VQA dataset.
☆ Steering Large Language Models for Machine Translation Personalization
High-quality machine translation systems based on large language models
(LLMs) have simplified the production of personalized translations reflecting
specific stylistic constraints. However, these systems still struggle in
settings where stylistic requirements are less explicit and might be harder to
convey via prompting. We explore various strategies for personalizing
LLM-generated translations in low-resource settings, focusing on the
challenging literary translation domain. We explore prompting strategies and
inference-time interventions for steering model generations towards a
personalized style, and propose a contrastive framework exploiting latent
concepts extracted from sparse autoencoders to identify salient personalization
properties. Our results show that steering achieves strong personalization
while preserving translation quality. We further examine the impact of steering
on LLM representations, finding model layers with a relevant impact for
personalization are impacted similarly by multi-shot prompting and our steering
method, suggesting a similar mechanism at play.
☆ From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment
Effective emotional support hinges on understanding users' emotions and needs
to provide meaningful comfort during multi-turn interactions. Large Language
Models (LLMs) show great potential for expressing empathy; however, they often
deliver generic and one-size-fits-all responses that fail to address users'
specific needs. To tackle this issue, we propose a self-evolution framework
designed to help LLMs improve their responses to better align with users'
implicit preferences concerning user profiles (personalities), emotional
states, and specific situations. Our framework consists of two distinct phases:
\textit{(1)} \textit{Emotional Support Experience Acquisition}, where LLMs are
fine-tuned on limited emotional support conversation data to provide basic
support, and \textit{(2)} \textit{Self-Improvement for Personalized Emotional
Support}, where LLMs leverage self-reflection and self-refinement to generate
personalized responses. Through iterative direct preference optimization
between the pre- and post-refined responses, our model generates responses that
reflect a better understanding of the user's implicit preferences. Extensive
experiments and evaluations demonstrate that our method significantly enhances
the model's performance in emotional support, reducing unhelpful responses and
minimizing discrepancies between user preferences and model outputs.
comment: 27 pages
☆ What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse
Media framing refers to the emphasis on specific aspects of perceived reality
to shape how an issue is defined and understood. Its primary purpose is to
shape public perceptions often in alignment with the authors' opinions and
stances. However, the interaction between stance and media frame remains
largely unexplored. In this work, we apply an interdisciplinary approach to
conceptualize and computationally explore this interaction with internet memes
on climate change. We curate CLIMATEMEMES, the first dataset of climate-change
memes annotated with both stance and media frames, inspired by research in
communication science. CLIMATEMEMES includes 1,184 memes sourced from 47
subreddits, enabling analysis of frame prominence over time and communities,
and sheds light on the framing preferences of different stance holders. We
propose two meme understanding tasks: stance detection and media frame
detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the
corresponding results on their LLM backbone. Human captions consistently
enhance performance. Synthetic captions and human-corrected OCR also help
occasionally. Our findings highlight that VLMs perform well on stance, but
struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs'
limitations in handling nuanced frames and stance expressions on climate change
internet memes.
comment: 19 pages, 9 figures
☆ Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual
factual ability of Large Language Models (LLMs). Inspired by existing research,
we created the question set with features such as single knowledge point
coverage, absolute objectivity, unique answers, and temporal stability. These
questions enable efficient evaluation using the LLM-as-judge paradigm, testing
both the LLMs' factual memory and self-awareness ("know what they don't know").
KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth
(Multilingual Coverage): It includes 9 languages, supporting global
applicability evaluation. (2) Depth (Dual Domain Design): It covers both the
general domain (global facts) and the language-specific domain (such as
history, culture, and regional traditions) for a comprehensive assessment of
multilingual capabilities. We evaluated mainstream LLMs, including traditional
LLMs and emerging Large Reasoning Models. Results show significant performance
differences between the two domains, particularly in performance metrics,
ranking, calibration, and robustness. This highlights the need for targeted
evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA
will help the research community better identify LLM capability boundaries in
multilingual contexts and provide guidance for model optimization. We will
release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA .
comment: Equal contribution: Bowen Jiang, Runchuan Zhu, Jiang Wu;
Corresponding author: Conghui He
☆ O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Large Language Models (LLMs), despite their advancements, are fundamentally
limited by their static parametric knowledge, hindering performance on tasks
requiring open-domain up-to-date information. While enabling LLMs to interact
with external knowledge environments is a promising solution, current efforts
primarily address closed-ended problems. Open-ended questions, which are
characterized by lacking a standard answer or providing non-unique and diverse
answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a
novel search agent leveraging reinforcement learning to effectively tackle both
open-ended and closed-ended questions in the open domain. O$^2$-Searcher
leverages an efficient, locally simulated search environment for dynamic
knowledge acquisition, effectively decoupling the external world knowledge from
the model's sophisticated reasoning processes. It employs a unified training
mechanism with meticulously designed reward functions, enabling the agent to
identify problem types and adapt different answer generation strategies.
Furthermore, to evaluate performance on complex open-ended tasks, we construct
O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain
open-ended questions with associated web page caches. Extensive experiments
show that O$^2$-Searcher, using only a 3B model, significantly surpasses
leading LLM agents on O$^2$-QA. It also achieves SOTA results on various
closed-ended QA benchmarks against similarly-sized models, while performing on
par with much larger ones.
comment: 25 pages, 9 figures
☆ EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions
Determining the veracity of atomic claims is an imperative component of many
recently proposed fact-checking systems. Many approaches tackle this problem by
first retrieving evidence by querying a search engine and then performing
classification by providing the evidence set and atomic claim to a large
language model, but this process deviates from what a human would do in order
to perform the task. Recent work attempted to address this issue by proposing
iterative evidence retrieval, allowing for evidence to be collected several
times and only when necessary. Continuing along this line of research, we
propose a novel claim verification system, called EMULATE, which is designed to
better emulate human actions through the use of a multi-agent framework where
each agent performs a small part of the larger task, such as ranking search
results according to predefined criteria or evaluating webpage content.
Extensive experiments on several benchmarks show clear improvements over prior
work, demonstrating the efficacy of our new multi-agent framework.
☆ URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
Large Language Models (LLMs) are commonly pretrained on vast corpora of text
without utilizing contextual metadata such as source, quality, or topic,
leading to a context-free learning paradigm. While recent studies suggest that
adding metadata like URL information as context (i.e., auxiliary inputs not
used in the loss calculation) can improve training efficiency and downstream
performance, they offer limited understanding of which types of metadata are
truly effective and under what conditions. In this work, we conduct a
systematic evaluation and find that not all metadata types contribute equally.
Only URL context speeds up training, whereas quality scores and topic/format
domain information offer no clear benefit. Furthermore, the improved downstream
performance of URL conditioning emerges only when longer prompts are used at
inference time. In addition, we demonstrate that context-aware pretraining
enables more controllable generation than context-free pretraining, in a
classifier-free guidance fashion. Although topic and format metadata do not
accelerate training, they are effective for steering outputs, offering
human-interpretable control over generation.
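The two mechanisms named in this abstract can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available and that guidance uses the standard classifier-free guidance combination; the abstract does not give the authors' exact formulation, and the function names `masked_nll` and `cfg_logits` are hypothetical.

```python
def masked_nll(token_logprobs, is_context):
    """Average negative log-likelihood over document tokens only.

    Tokens flagged as context (e.g. a prepended URL) are visible to the
    model but excluded from the loss, as described in the abstract.
    """
    kept = [-lp for lp, ctx in zip(token_logprobs, is_context) if not ctx]
    if not kept:
        raise ValueError("no document tokens to score")
    return sum(kept) / len(kept)


def cfg_logits(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance (an assumption, not the paper's
    confirmed formula): uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

With `scale = 1.0` the guided logits equal the metadata-conditioned logits, and larger values push generation further toward the conditioned distribution.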
☆ ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Prior benchmarks for evaluating the domain-specific knowledge of large
language models (LLMs) lack the scalability to handle complex academic tasks.
To address this, we introduce ScholarBench, a benchmark centered on
deep expert knowledge and complex academic problem-solving, which evaluates the
academic reasoning ability of LLMs and is constructed through a three-step
process. ScholarBench targets more specialized and logically complex
contexts derived from academic literature, encompassing five distinct problem
types. Unlike prior benchmarks, ScholarBench evaluates the
abstraction, comprehension, and reasoning capabilities of LLMs across eight
distinct research domains. To ensure high-quality evaluation data, we define
category-specific example attributes and design questions that are aligned with
the characteristic research methodologies and discourse structures of each
domain. Additionally, this benchmark operates as an English-Korean bilingual
dataset, facilitating simultaneous evaluation of the linguistic capabilities of
LLMs in both languages. The benchmark comprises 5,031 examples in Korean and
5,309 in English, with even state-of-the-art models like o3-mini achieving an
average evaluation score of only 0.543, demonstrating the challenging nature of
this benchmark.
☆ CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
Fine-tuning-as-a-service, while commercially successful for Large Language
Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a
widely explored defense paradigm against such attacks, unlearning attempts to
remove malicious knowledge from LLMs, thereby essentially preventing them from
being used to perform malicious tasks. However, we highlight a critical flaw:
the powerful general adaptability of LLMs allows them to easily bypass
selective unlearning by rapidly relearning or repurposing their capabilities
for harmful tasks. To address this fundamental limitation, we propose a
paradigm shift: instead of selective removal, we advocate for inducing model
collapse--effectively forcing the model to "unlearn everything"--specifically
in response to updates characteristic of malicious adaptation. This collapse
directly neutralizes the very general capabilities that attackers exploit,
tackling the core issue unaddressed by selective unlearning. We introduce the
Collapse Trap (CTRAP) as a practical mechanism to implement this concept
conditionally. Embedded during alignment, CTRAP pre-configures the model's
reaction to subsequent fine-tuning dynamics. If updates during fine-tuning
constitute a persistent attempt to reverse safety alignment, the pre-configured
trap triggers a progressive degradation of the model's core language modeling
abilities, ultimately rendering it inert and useless for the attacker.
Crucially, this collapse mechanism remains dormant during benign fine-tuning,
ensuring the model's utility and general capabilities are preserved for
legitimate users. Extensive empirical results demonstrate that CTRAP
effectively counters harmful fine-tuning risks across various LLMs and attack
settings, while maintaining high performance in benign scenarios. Our code is
available at https://anonymous.4open.science/r/CTRAP.
☆ Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains
Large Language Models (LLMs) achieve superior performance through
Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are
computationally expensive and inefficient. In this paper, we introduce
Compressed Latent Reasoning (CoLaR), a novel framework that dynamically
compresses reasoning processes in latent space through a two-stage training
approach. First, during supervised fine-tuning, CoLaR extends beyond next-token
prediction by incorporating an auxiliary next compressed embedding prediction
objective. This process merges embeddings of consecutive tokens using a
compression factor randomly sampled from a predefined range, and trains a
specialized latent head to predict distributions of subsequent compressed
embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that
leverages the latent head's non-deterministic nature to explore diverse
reasoning paths and exploit more compact ones. This approach enables CoLaR to:
i) perform reasoning at a dense latent level (i.e., silently), substantially
reducing reasoning chain length, and ii) dynamically adjust reasoning speed at
inference time by simply prompting the desired compression factor. Extensive
experiments across four mathematical reasoning datasets demonstrate that CoLaR
achieves 14.1% higher accuracy than latent-based baseline methods at comparable
compression ratios, and reduces reasoning chain length by 53.3% with only 4.8%
performance degradation compared to the explicit CoT method. Moreover, when applied
to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR
demonstrates performance gains of up to 5.4% while dramatically reducing latent
reasoning chain length by 82.8%. The code and models will be released upon
acceptance.
comment: 15 pages, 8 figures
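The embedding-merging step described in this abstract can be sketched as follows. The abstract does not say how consecutive embeddings are merged, so mean pooling is used here purely as an assumption, and the names `compress_embeddings` and `sample_factor` are hypothetical.

```python
import random


def compress_embeddings(embeddings, factor):
    """Merge each run of `factor` consecutive token embeddings into one.

    Mean pooling is an illustrative assumption; a trailing group shorter
    than `factor` is averaged over the tokens it actually contains.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


def sample_factor(low, high, rng=random):
    """Draw a compression factor uniformly from the inclusive range [low, high]."""
    return rng.randint(low, high)
```

A factor of 1 leaves the sequence unchanged, while a factor of c shortens a chain of n embeddings to ceil(n / c) entries.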
☆ Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Language confusion -- where large language models (LLMs) generate unintended
languages against the user's need -- remains a critical challenge, especially
for English-centric models. We present the first mechanistic interpretability
(MI) study of language confusion, combining behavioral benchmarking with
neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show
that confusion points (CPs) -- specific positions where language switches occur
-- are central to this phenomenon. Through layer-wise analysis with TunedLens
and targeted neuron attribution, we reveal that transition failures in the
final layers drive confusion. We further demonstrate that editing a small set
of critical neurons, identified via comparative analysis with
multilingual-tuned models, substantially mitigates confusion without harming
general competence or fluency. Our approach matches multilingual alignment in
confusion reduction for most languages and yields cleaner, higher-quality
outputs. These findings provide new insights into the internal dynamics of LLMs
and highlight neuron-level interventions as a promising direction for robust,
interpretable multilingual language modeling.
comment: 16 pages, 5 figures
☆ DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Large language models (LLMs) are considered valuable Intellectual Properties
(IP) for legitimate owners due to the enormous computational cost of training.
It is crucial to protect the IP of LLMs from malicious stealing or unauthorized
deployment. Despite existing efforts in watermarking and fingerprinting LLMs,
these methods either impact the text generation process or are limited to
white-box access to the suspect model, making them impractical. Hence, we
propose DuFFin, a novel Dual-Level Fingerprinting Framework for ownership
verification in the black-box setting. DuFFin
extracts the trigger pattern and the knowledge-level fingerprints to identify
the source of a suspect model. We conduct experiments on a variety of models
collected from the open-source website, including four popular base models as
protected LLMs and their fine-tuning, quantization, and safety alignment
versions, which are released by large companies, start-ups, and individual
users. Results show that our method can accurately verify the copyright of the
base protected LLMs on their model variants, achieving an IP-ROC greater
than 0.95. Our code is available at
https://github.com/yuliangyan0807/llm-fingerprint.
☆ EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance ACL 2025
Small large language models (sLLMs) offer the advantage of being lightweight
and efficient, which makes them suitable for resource-constrained environments.
However, sLLMs often struggle to maintain topic consistency in task-oriented
dialogue systems, which is critical for scenarios such as service chatbots.
Specifically, it is important to ensure that the model denies off-topic or
malicious inputs and adheres to its intended functionality so as to prevent
potential misuse and uphold reliability. Towards this, existing activation
engineering approaches have been proposed to manipulate internal activations
during inference. While these methods are effective in certain scenarios, our
preliminary experiments reveal their limitations in ensuring topic adherence.
Therefore, to address this, we propose a novel approach termed Entropy-scaled
Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the
steering intensity based on input uncertainty, which allows the model to handle
off-topic distractors effectively while preserving on-topic accuracy. Our
experiments demonstrate that EnSToM achieves a significant performance gain with
a relatively small data size compared to fine-tuning approaches. By improving
topic adherence without compromising efficiency, our approach provides a robust
solution for enhancing sLLM-based dialogue systems.
comment: Accepted at ACL 2025 (Findings, long paper)
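The idea of scaling steering intensity by input uncertainty can be sketched as below. This is not the authors' formulation: the abstract does not specify the mapping from entropy to intensity, so a simple linear scaling by normalized entropy is assumed, and all function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def steering_strength(probs, base_strength):
    """Scale a base coefficient by entropy normalized to [0, 1].

    The linear mapping and its direction (more uncertainty, stronger
    steering) are illustrative assumptions.
    """
    max_entropy = math.log(len(probs))
    if max_entropy == 0:
        return 0.0
    return base_strength * entropy(probs) / max_entropy


def apply_steering(hidden, steering_vector, strength):
    """Add the scaled steering vector to a hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden, steering_vector)]
```

A uniform distribution yields the full base strength, and a one-hot distribution yields zero steering under this assumed mapping.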
☆ Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing
Despite significant progress, recent studies have indicated that current
large language models (LLMs) may still utilize bias during inference, leading
to poor generalizability. Some benchmarks have been proposed to
investigate the generalizability of LLMs, with each piece of data typically
containing one type of controlled bias. However, a single piece of data may
contain multiple types of biases in practical applications. To bridge this gap,
we propose a multi-bias benchmark where each piece of data contains five types
of biases. The evaluations conducted on this benchmark reveal that the
performance of existing LLMs and debiasing methods is unsatisfactory,
highlighting the challenge of eliminating multiple types of biases
simultaneously. To overcome this challenge, we propose a causal effect
estimation-guided multi-bias elimination method (CMBE). This method first
estimates the causal effect of multiple types of biases simultaneously.
Subsequently, we eliminate the causal effect of biases from the total causal
effect exerted by both the semantic information and biases during inference.
Experimental results show that CMBE can effectively eliminate multiple types of
bias simultaneously to enhance the generalizability of LLMs.
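The subtraction step described in this abstract can be illustrated with a rough sketch. It assumes the estimated bias effects are available as per-class logit vectors and combine additively; the abstract confirms neither detail, and the names `debias_logits` and `predict` are hypothetical.

```python
def debias_logits(total_logits, bias_logits_by_type):
    """Remove estimated bias effects from the total effect.

    Each entry of `bias_logits_by_type` is the estimated effect of one
    bias type; additivity across bias types is an illustrative assumption.
    """
    debiased = list(total_logits)
    for bias_logits in bias_logits_by_type:
        debiased = [d - b for d, b in zip(debiased, bias_logits)]
    return debiased


def predict(logits):
    """Return the index of the highest-scoring class."""
    return max(range(len(logits)), key=logits.__getitem__)
```

In this toy setting a prediction can flip once the bias contribution is removed, which is the behavior the method aims for.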
☆ Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs
Factual hallucinations are a major challenge for Large Language Models
(LLMs). They undermine reliability and user trust by generating inaccurate or
fabricated content. Recent studies suggest that when generating false
statements, the internal states of LLMs encode information about truthfulness.
However, these studies often rely on synthetic datasets that lack realism,
which limits generalization when evaluating the factual accuracy of text
generated by the model itself. In this paper, we challenge the findings of
previous work by investigating truthfulness encoding capabilities, leading to
the generation of a more realistic and challenging dataset. Specifically, we
extend previous work by introducing: (1) a strategy for sampling plausible
true-false factoid sentences from tabular data and (2) a procedure for
generating realistic, LLM-dependent true-false datasets from Question Answering
collections. Our analysis of two open-source LLMs reveals that while the
findings from previous studies are partially validated, generalization to
LLM-generated datasets remains challenging. This study lays the groundwork for
future research on factuality in LLMs and offers practical guidelines for more
effective evaluation.
☆ CUB: Benchmarking Context Utilisation Techniques for Language Models
Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein
Incorporating external knowledge is crucial for knowledge-intensive tasks,
such as question answering and fact checking. However, language models (LMs)
may ignore relevant information that contradicts outdated parametric memory or
be distracted by irrelevant contexts. While many context utilisation
manipulation techniques (CMTs) that encourage or suppress context utilisation
have recently been proposed to alleviate these issues, few have seen systematic
comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to
help practitioners within retrieval-augmented generation (RAG) identify the
best CMT for their needs. CUB allows for rigorous testing on three distinct
context types, observed to capture key challenges in realistic context
utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art
methods, representative of the main categories of CMTs, across three diverse
datasets and tasks, applied to nine LMs. Our results show that most of the
existing CMTs struggle to handle the full set of types of contexts that may be
encountered in real-world retrieval-augmented scenarios. Moreover, we find that
many CMTs display an inflated performance on simple synthesised datasets,
compared to more realistic datasets with naturally occurring samples.
Altogether, our results show the need for holistic tests of CMTs and the
development of CMTs that can handle multiple context types.
comment: 27 pages
☆ AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios
Yuting Huang, Meitong Guo, Yiquan Wu, Ang Li, Xiaozhong Liu, Keting Yin, Changlong Sun, Fei Wu, Kun Kuang
Recent advances in LegalAI have primarily focused on individual case judgment
analysis, often overlooking the critical appellate process within the judicial
system. Appeals serve as a core mechanism for error correction and ensuring
fair trials, making them highly significant both in practice and in research.
To address this gap, we present the AppealCase dataset, consisting of 10,000
pairs of real-world, matched first-instance and second-instance documents
across 91 categories of civil cases. The dataset also includes detailed
annotations along five dimensions central to appellate review: judgment
reversals, reversal reasons, cited legal provisions, claim-level decisions, and
whether there is new information in the second instance. Based on these
annotations, we propose five novel LegalAI tasks and conduct a comprehensive
evaluation across 20 mainstream models. Experimental results reveal that all
current models achieve F1 scores below 50% on the judgment reversal
prediction task, highlighting the complexity and challenge of the appeal
scenario. We hope that the AppealCase dataset will spur further research in
LegalAI for appellate case analysis and contribute to improving consistency in
judicial decision-making.
comment: 15 pages, 4 figures
☆ Sparse Activation Editing for Reliable Instruction Following in Narratives
Complex narrative contexts often challenge language models' ability to follow
instructions, and existing benchmarks fail to capture these difficulties. To
address this, we propose Concise-SAE, a training-free framework that improves
instruction following by identifying and editing instruction-relevant neurons
using only natural language instructions, without requiring labelled data. To
thoroughly evaluate our method, we introduce FreeInstruct, a diverse and
realistic benchmark of 1,212 examples that highlights the challenges of
instruction following in narrative-rich settings. While initially motivated by
complex narratives, Concise-SAE demonstrates state-of-the-art instruction
adherence across varied tasks without compromising generation quality.
☆ LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing
Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Large Language Models (LLMs) have rapidly become central to NLP,
demonstrating their ability to adapt to various tasks through prompting
techniques, including sentiment analysis. However, we still have a limited
understanding of how these models capture sentiment-related information. This
study probes the hidden layers of Llama models to pinpoint where sentiment
features are most represented and to assess how this affects sentiment
analysis.
Using probe classifiers, we analyze sentiment encoding across layers and
scales, identifying the layers and pooling methods that best capture sentiment
signals. Our results show that sentiment information is most concentrated in
mid-layers for binary polarity tasks, with detection accuracy increasing up to
14% over prompting techniques. Additionally, we find that in decoder-only
models, the last token is not consistently the most informative for sentiment
encoding. Finally, this approach enables sentiment tasks to be performed with
memory requirements reduced by an average of 57%.
These insights contribute to a broader understanding of sentiment in LLMs,
suggesting layer-specific probing as an effective approach for sentiment tasks
beyond prompting, with potential to enhance model utility and reduce memory
requirements.
☆ Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Teaching large language models (LLMs) to be faithful in the provided context
is crucial for building reliable information-seeking systems. Therefore, we
propose a systematic framework, CANOE, to improve the faithfulness of LLMs in
both short-form and long-form generation tasks without human annotations.
Specifically, we first synthesize short-form question-answering (QA) data with
four diverse tasks to construct high-quality and easily verifiable training
data without human annotation. Also, we propose Dual-GRPO, a rule-based
reinforcement learning method that includes three tailored rule-based rewards
derived from synthesized short-form QA data, while simultaneously optimizing
both short-form and long-form response generation. Notably, Dual-GRPO
eliminates the need to manually label preference data to train reward models
and avoids over-optimizing short-form generation when relying only on the
synthesized short-form QA data. Experimental results show that CANOE greatly
improves the faithfulness of LLMs across 11 different downstream tasks, even
outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
☆ Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Document Visual Question Answering (DocVQA) faces dual challenges in
processing lengthy multimodal documents (text, images, tables) and performing
cross-modal reasoning. Current document retrieval-augmented generation (DocRAG)
methods remain limited by their text-centric approaches, frequently missing
critical visual information. The field also lacks robust benchmarks for
assessing multimodal evidence selection and integration. We introduce MMDocRAG,
a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with
multi-page, cross-modal evidence chains. Our framework introduces innovative
metrics for evaluating multimodal quote selection and enables answers that
interleave text with relevant visual elements. Through large-scale experiments
with 60 VLM/LLM models and 14 retrieval systems, we identify persistent
challenges in multimodal evidence retrieval, selection, and integration. Key
findings reveal that advanced proprietary LVMs outperform open-source
alternatives. They also show moderate advantages when using multimodal
inputs over text-only inputs, while open-source alternatives show significant
performance degradation. Notably, fine-tuned LLMs achieve substantial
improvements when using detailed image descriptions. MMDocRAG establishes a
rigorous testing ground and provides actionable insights for developing more
robust multimodal DocVQA systems. Our benchmark and code are available at
https://mmdocrag.github.io/MMDocRAG/.
comment: preprint. code available at https://mmdocrag.github.io/MMDocRAG/
☆ Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization
Generative Large Language Models (LLMs) infer users' demographic information
from subtle cues in the conversation -- a phenomenon called implicit
personalization. Prior work has shown that such inferences can lead to lower
quality responses for users assumed to be from minority groups, even when no
demographic information is explicitly provided. In this work, we systematically
explore how LLMs respond to stereotypical cues using controlled synthetic
conversations, by analyzing the models' latent user representations through
both model internals and generated answers to targeted user questions. Our
findings reveal that LLMs do infer demographic attributes based on these
stereotypical signals, which for a number of groups even persists when the user
explicitly identifies with a different demographic group. Finally, we show that
this form of stereotype-driven implicit personalization can be effectively
mitigated by intervening on the model's internal representations using a
trained linear probe to steer them toward the explicitly stated identity. Our
results highlight the need for greater transparency and control in how LLMs
represent user identity.
☆ University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection
Ikhlasul Akmal Hanif, Eryawan Presma Yulianrifat, Jaycent Gunawan Ongris, Eduardus Tjitrahardja, Muhammad Falensi Azmi, Rahmat Bryan Naufal, Alfan Farizki Wicaksono
This paper presents our approach for SemEval 2025 Task 11 Track A, focusing
on multilabel emotion classification across 28 languages. We explore two main
strategies: fully fine-tuning transformer models and classifier-only training,
evaluating different settings such as fine-tuning strategies, model
architectures, loss functions, encoders, and classifiers. Our findings suggest
that training a classifier on top of prompt-based encoders such as mE5 and BGE
yields significantly better results than fully fine-tuning XLMR and mBERT. Our
best-performing model on the final leaderboard is an ensemble combining
multiple BGE models, where CatBoost serves as the classifier, with different
configurations. This ensemble achieves an average F1-macro score of 56.58
across all languages.
comment: 16 pages, 13 tables, 1 figure
☆ Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems
Evaluating and iterating upon recommender systems is crucial, yet traditional
A/B testing is resource-intensive, and offline methods struggle with dynamic
user-platform interactions. While agent-based simulation is promising, existing
platforms often lack a mechanism for user actions to dynamically reshape the
environment. To bridge this gap, we introduce RecInter, a novel agent-based
simulation platform for recommender systems featuring a robust interaction
mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews,
purchases) dynamically update item attributes in real time, and the newly introduced
Merchant Agents can reply, fostering a more realistic and evolving ecosystem.
High-fidelity simulation is ensured through a Multidimensional User Profiling
module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought
(CoT) enriched interaction data. Our platform achieves significantly improved
simulation credibility and successfully replicates emergent phenomena like
Brand Loyalty and the Matthew Effect. Experiments demonstrate that this
interaction mechanism is pivotal for simulating realistic system evolution,
establishing our platform as a credible testbed for recommender systems
research.
☆ $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion
The effective communication of procedural knowledge remains a significant
challenge in natural language processing (NLP), as purely textual instructions
often fail to convey complex physical actions and spatial relationships. We
address this limitation by proposing a language-driven framework that
translates procedural text into coherent visual instructions. Our approach
models the linguistic structure of instructional content by decomposing it into
goal statements and sequential steps, then conditioning visual generation on
these linguistic elements. We introduce three key innovations: (1) a
constituency parser-based text encoding mechanism that preserves semantic
completeness even with lengthy instructions, (2) a pairwise discourse coherence
model that maintains consistency across instruction sequences, and (3) a novel
evaluation protocol specifically designed for procedural language-to-image
alignment. Our experiments across three instructional datasets (HTStep,
CaptainCook4D, and WikiAll) demonstrate that our method significantly
outperforms existing baselines in generating visuals that accurately reflect
the linguistic content and sequential nature of instructions. This work
contributes to the growing body of research on grounding procedural language in
visual content, with applications spanning education, task guidance, and
multimodal language understanding.
comment: 13 pages, 5 figures, under review
☆ WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
While reinforcement learning (RL) has demonstrated remarkable success in
enhancing large language models (LLMs), it has primarily focused on single-turn
tasks such as solving math problems. Training effective web agents for
multi-turn interactions remains challenging due to the complexity of
long-horizon decision-making across dynamic web interfaces. In this work, we
present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework
for training web agents. It learns directly from online interactions with web
environments by asynchronously generating diverse trajectories, entirely guided
by binary rewards depending on task success. Experiments on the WebArena-Lite
benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task
success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to
44.8%, significantly outperforming existing state-of-the-art methods and strong
proprietary models such as OpenAI o3. In-depth analyses reveal the
effectiveness of the thinking-based prompting strategy and test-time scaling
through increased interactions for web tasks. We further investigate different
RL initialization policies by introducing two variants, namely WebAgent-R1-Zero
and WebAgent-R1-CoT, which highlight the importance of the warm-up training
stage (i.e., behavior cloning) and provide insights on incorporating long
chain-of-thought (CoT) reasoning in web agents.
comment: Preprint
☆ Exploring the Relationship Between Diversity and Quality in Ad Text Generation
In natural language generation for advertising, creating diverse and engaging
ad texts is crucial for capturing a broad audience and avoiding advertising
fatigue. Despite the importance of diversity, the impact of
diversity-enhancing methods in ad text generation -- mainly tested on tasks
such as summarization and machine translation -- has not been thoroughly
explored. Ad text generation significantly differs from these tasks owing to
the text style and requirements. This research explores the relationship
between diversity and ad quality in ad text generation by considering multiple
factors, such as diversity-enhancing methods, their hyperparameters,
input-output formats, and the models.
☆ Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs)
combined with external contexts to enhance the accuracy and reliability of
generated responses. However, reliably attributing generated content to
specific context segments, context attribution, remains challenging due to the
computationally intensive nature of current methods, which often require
extensive fine-tuning or human annotation. In this work, we introduce a novel
Jensen-Shannon Divergence driven method to Attribute Response to Context
(ARC-JSD), enabling efficient and accurate identification of essential context
sentences without additional fine-tuning or surrogate modelling. Evaluations on
a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using
instruction-tuned LLMs at different scales demonstrate superior accuracy and
significant computational efficiency improvements compared to the previous
surrogate-based method. Furthermore, our mechanistic analysis reveals specific
attention heads and multilayer perceptron (MLP) layers responsible for context
attribution, providing valuable insights into the internal workings of RAG
models.
comment: Work in progress
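The divergence at the core of this method can be sketched as follows. The abstract does not say how the compared distributions are obtained; this sketch assumes one response distribution computed with the full context and one per context sentence with that sentence removed, and the name `rank_sentences` is hypothetical.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_sentences(full_dist, ablated_dists):
    """Rank context sentences by how much removing each shifts the output.

    `ablated_dists[i]` is the assumed response distribution with sentence i
    removed; a larger JSD suggests a more essential sentence.
    """
    scores = [jsd(full_dist, d) for d in ablated_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The divergence is symmetric and bounded by ln 2 in nats, so scores are comparable across sentences without extra normalization.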
☆ Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Recently, large language models (LLMs) have shown remarkable reasoning
capabilities via large-scale reinforcement learning (RL). However, leveraging
the RL algorithm to empower effective multi-tool collaborative reasoning in
LLMs remains an open challenge. In this paper, we introduce Tool-Star, an
RL-based framework designed to empower LLMs to autonomously invoke multiple
external tools during stepwise reasoning. Tool-Star integrates six types of
tools and incorporates systematic designs in both data synthesis and training.
To address the scarcity of tool-use data, we propose a general tool-integrated
reasoning data synthesis pipeline, which combines tool-integrated prompting
with hint-based sampling to automatically and scalably generate tool-use
trajectories. A subsequent quality normalization and difficulty-aware
classification process filters out low-quality samples and organizes the
dataset from easy to hard. Furthermore, we propose a two-stage training
framework to enhance multi-tool collaborative reasoning by: (1) cold-start
fine-tuning, which guides LLMs to explore reasoning patterns via
tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with
hierarchical reward design, which reinforces reward understanding and promotes
effective tool collaboration. Experimental analyses on over 10 challenging
reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star.
The code is available at https://github.com/dongguanting/Tool-Star.
comment: Work in progress
☆ From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs
Adapting cultural values in Large Language Models (LLMs) presents significant
challenges, particularly due to biases and limited training data. Prior work
primarily aligns LLMs with different cultural values using World Values Survey
(WVS) data. However, it remains unclear whether this approach effectively
captures cultural nuances or produces distinct cultural representations for
various downstream tasks. In this paper, we systematically investigate
WVS-based training for cultural value adaptation and find that relying solely
on survey data can homogenize cultural norms and interfere with factual
knowledge. To investigate these issues, we augment WVS with encyclopedic and
scenario-based cultural narratives from Wikipedia and NormAd. While these
narratives may have variable effects on downstream tasks, they consistently
improve cultural distinctiveness over survey data alone. Our work highlights
the inherent complexity of aligning cultural values with the goal of guiding
task-specific behavior.
☆ On the reliability of feature attribution methods for speech classification
As the capabilities of large-scale pre-trained models evolve, understanding
the determinants of their outputs becomes more important. Feature attribution
aims to reveal which parts of the input elements contribute the most to model
outputs. In speech processing, the unique characteristics of the input signal
make the application of feature attribution methods challenging. We study how
factors such as input type and aggregation and perturbation timespan impact the
reliability of standard feature attribution methods, and how these factors
interact with characteristics of each classification task. We find that
standard approaches to feature attribution are generally unreliable when
applied to the speech domain, with the exception of word-aligned perturbation
methods when applied to word-based classification tasks.
☆ AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Despite recent progress in large-scale reinforcement learning (RL) for
reasoning, the training recipe for building high-performing reasoning models
remains elusive. Key implementation details of frontier models, such as
DeepSeek-R1, including data curation strategies and RL training recipes, are
often omitted. Moreover, recent research indicates distillation remains more
effective than RL for smaller models. In this work, we demonstrate that
large-scale RL can significantly enhance the reasoning capabilities of strong,
small- and mid-sized models, achieving results that surpass those of
state-of-the-art distillation-based models. We systematically study the RL
training process through extensive ablations and propose a simple yet effective
approach: first training on math-only prompts, then on code-only prompts.
Notably, we find that math-only RL not only significantly enhances the
performance of strong distilled models on math benchmarks (e.g., +14.6% /
+17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks
(e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition,
extended code-only RL iterations further improve performance on code benchmarks
with minimal or no degradation in math results. We develop a robust data
curation pipeline to collect challenging prompts with high-quality, verifiable
answers and test cases to enable verification-based RL across both domains.
Finally, we identify key experimental insights, including curriculum learning
with progressively increasing response lengths and the stabilizing effect of
on-policy parameter updates. We find that RL not only elicits the foundational
reasoning capabilities acquired during pretraining and supervised fine-tuning
(e.g., distillation), but also pushes the limits of the model's reasoning
ability, enabling it to solve problems that were previously unsolvable.
comment: We release the model at:
https://huggingface.co/nvidia/AceReason-Nemotron-14B
☆ Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection SIGIR 2025
The general public often encounters complex texts but does not have the time
or expertise to fully understand them, leading to the spread of misinformation.
Automatic Text Simplification (ATS) helps make information more accessible, but
its evaluation methods have not kept up with advances in text generation,
especially with Large Language Models (LLMs). In particular, recent studies
have shown that current ATS metrics do not correlate with the presence of
errors. Manual inspections have further revealed a variety of errors,
underscoring the need for a more nuanced evaluation framework, which is
currently lacking. This resource paper addresses this gap by introducing a test
collection for detecting and classifying errors in simplified texts. First, we
propose a taxonomy of errors, with a formal focus on information distortion.
Next, we introduce a parallel dataset of automatically simplified scientific
texts. This dataset has been human-annotated with labels based on our proposed
taxonomy. Finally, we analyze the quality of the dataset, and we study the
performance of existing models to detect and classify errors from that
taxonomy. These contributions give researchers the tools to better evaluate
errors in ATS, develop more reliable models, and ultimately improve the quality
of automatically simplified texts.
comment: Accepted at SIGIR 2025
☆ Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models
Large language models (LLMs) demonstrate remarkable ability in cross-lingual
tasks. Understanding how LLMs acquire this ability is crucial for their
interpretability. To quantify the cross-lingual ability of LLMs accurately, we
propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn
cross-lingual ability, we trace the outputs of LLMs' intermediate layers in the
word translation task. We identify and distinguish two distinct behaviors in
the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior.
We attribute LLMs' two distinct behaviors to the co-occurrence frequency of
words and find the semantic pivot from the pre-training dataset. Finally, to
apply our findings to improve the cross-lingual ability of LLMs, we reconstruct
a semantic pivot-aware pre-training dataset using documents with a high
proportion of semantic pivots. Our experiments validate the effectiveness of
our approach in enhancing cross-lingual ability. Our research contributes
insights into the interpretability of LLMs and offers a method for improving
LLMs' cross-lingual ability.
comment: 14 pages, 10 figures
☆ PaTH Attention: Position Encoding via Accumulating Householder Transformations
Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
The attention mechanism is a core primitive in modern large language models
(LLMs) and AI more broadly. Since attention by itself is permutation-invariant,
position encoding is essential for modeling structured domains such as
language. Rotary position encoding (RoPE) has emerged as the de facto standard
approach for position encoding and is part of many modern LLMs. However, in
RoPE the key/query transformation between two elements in a sequence is only a
function of their relative position and otherwise independent of the actual
input. This limits the expressivity of RoPE-based transformers.
This paper describes PaTH, a flexible data-dependent position encoding scheme
based on accumulated products of Householder(like) transformations, where each
transformation is data-dependent, i.e., a function of the input. We derive an
efficient parallel algorithm for training through exploiting a compact
representation of products of Householder matrices, and implement a
FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both
targeted synthetic benchmarks and moderate-scale real-world language modeling
experiments, we find that PaTH demonstrates superior performance compared to
RoPE and other recent baselines.
comment: Preprint
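As a rough illustration of the building block named in the PaTH abstract, the sketch below applies classical Householder reflections, H = I - 2 v v^T / (v^T v), in sequence to a vector. It is not the PaTH algorithm: in PaTH each transformation is a function of the input and the products are computed with an efficient blockwise scheme, whereas the vectors here are fixed, hypothetical values and the loop is deliberately naive.

```python
import math

def householder_apply(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x without forming the matrix H."""
    vv = sum(vi * vi for vi in v)
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - scale * vi for vi, xi in zip(v, x)]

def accumulate(vs, x):
    """Apply the product H_k ... H_1 to x, one reflection at a time."""
    for v in vs:
        x = householder_apply(v, x)
    return x

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

# Hypothetical key vector and reflection vectors (fixed here, not data-dependent).
key = [1.0, 2.0, 2.0]
vs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
out = accumulate(vs, key)
```

Each reflection is orthogonal, so the accumulated product preserves the norm of `key`, and applying the same reflection twice recovers the original vector.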
♻ ☆ Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
The Mixture of Experts (MoE) is an effective architecture for scaling large
language models by leveraging sparse expert activation, optimizing the
trade-off between performance and efficiency. However, under expert
parallelism, MoE suffers from inference inefficiencies due to imbalanced
token-to-expert assignment, where some experts are overloaded while others
remain underutilized. This imbalance leads to poor resource utilization and
increased latency, as the most burdened expert dictates the overall delay, a
phenomenon we define as the Straggler Effect. To mitigate
this, we propose Capacity-Aware Inference, including two key techniques: (1)
Capacity-Aware Token Drop, which discards overloaded tokens
to regulate the maximum latency of MoE, and (2) Capacity-Aware
Token Reroute, which reallocates overflowed tokens to underutilized experts,
balancing the token distribution. These techniques collectively optimize both
high-load and low-load expert utilization, leading to a more efficient MoE
inference pipeline. Extensive experiments demonstrate the effectiveness of our
methods, showing significant improvements in inference efficiency, e.g., 0.2%
average performance increase and a 1.94x inference speedup on
Mixtral-8x7B-Instruct.
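The token-drop idea can be sketched as a per-expert capacity cap. The abstract does not say which overflowed tokens are discarded, so the rule used here (keep the tokens with the highest gate scores) is an assumption, and the function name and routing data are hypothetical.

```python
from collections import defaultdict

def capacity_aware_drop(assignments, capacity):
    """Cap the number of tokens each expert processes.

    assignments: list of (token_id, expert_id, gate_score).
    Keeps at most `capacity` tokens per expert, preferring higher gate
    scores (an assumed criterion). Returns (kept, dropped) token-id lists.
    """
    per_expert = defaultdict(list)
    for token_id, expert_id, score in assignments:
        per_expert[expert_id].append((score, token_id))

    kept, dropped = [], []
    for items in per_expert.values():
        items.sort(reverse=True)  # highest gate score first
        kept.extend(t for _, t in items[:capacity])
        dropped.extend(t for _, t in items[capacity:])
    return sorted(kept), sorted(dropped)

# Made-up routing: expert "A" is overloaded, expert "B" is underutilized.
routing = [(0, "A", 0.9), (1, "A", 0.4), (2, "A", 0.7), (3, "B", 0.6)]
kept, dropped = capacity_aware_drop(routing, capacity=2)
```

A reroute variant would send the dropped token ids to experts that still have spare capacity instead of discarding them.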
♻ ☆ Diverse Preference Optimization
Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, Ilia Kulikov
Post-training of language models, either through reinforcement learning,
preference optimization or supervised finetuning, tends to sharpen the output
probability distribution and reduce the diversity of generated responses. This
is particularly a problem for creative generative tasks where varied responses
are desired. In this work we introduce Diverse Preference Optimization (DivPO),
an optimization method which learns to generate much more diverse responses
than standard pipelines, while maintaining the quality of the generations. In
DivPO, preference pairs are selected by first considering a pool of responses,
and a measure of diversity among them, and selecting chosen examples as being
more rare but high quality, while rejected examples are more common, but low
quality. DivPO results in generating 45.6% more diverse persona attributes, and
a 74.6% increase in story diversity, while maintaining similar win rates as
standard baselines. On general instruction following, DivPO results in a 46.2%
increase in diversity, and a 2.4% winrate improvement compared to DPO.
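The pair-selection rule described in the DivPO abstract can be sketched as follows. This is a minimal illustration under stated assumptions: diversity is approximated by how often a response occurs in the pool, and quality is split into high and low by a fixed threshold. Both choices, along with the function name and the example pool, are hypothetical.

```python
from collections import Counter

def select_divpo_pair(pool, quality_threshold):
    """Pick a (chosen, rejected) preference pair from a pool of responses.

    pool: list of (response, quality_score).
    chosen   = rarest response among those at or above the threshold.
    rejected = most common response among those below the threshold.
    Returns None when either group is empty.
    """
    freq = Counter(resp for resp, _ in pool)
    high = [r for r, q in pool if q >= quality_threshold]
    low = [r for r, q in pool if q < quality_threshold]
    if not high or not low:
        return None
    chosen = min(high, key=lambda r: freq[r])
    rejected = max(low, key=lambda r: freq[r])
    return chosen, rejected

# Made-up pool of persona names with quality scores.
pool = [("Alice", 0.9), ("Alice", 0.8), ("Zofia", 0.85),
        ("Bob", 0.2), ("Bob", 0.3), ("Xu", 0.1)]
pair = select_divpo_pair(pool, quality_threshold=0.5)
```

Here the rare but high-quality response becomes the chosen example and the common, low-quality one becomes the rejected example.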
♻ ☆ TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Reinforcement Learning (RL) has become a powerful tool for enhancing the
reasoning abilities of large language models (LLMs) by optimizing their
policies with reward signals. Yet, RL's success relies on the reliability of
rewards, which are provided by verifiers. In this paper, we expose and analyze
a widespread problem--false negatives--where verifiers wrongly reject correct
model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals
that over 38% of model-generated responses suffer from false negatives, where
the verifier fails to recognize correct answers. We show, both empirically and
theoretically, that these false negatives severely impair RL training by
depriving the model of informative gradient signals and slowing convergence. To
mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments
existing rule-based methods, which dynamically identifies potential false
negatives and recovers valid responses to produce more accurate reward
estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts
pass rates by up to 10% and accelerates convergence relative to the baseline.
Our findings highlight the critical importance of addressing verifier false
negatives and offer a practical approach to improve RL-based fine-tuning of
LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
♻ ☆ Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong
Multi-step reasoning is essential for large language models (LLMs), yet
multilingual performance remains challenging. While Chain-of-Thought (CoT)
prompting improves reasoning, it struggles with non-English languages due to
the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting
separates reasoning from execution, offering a promising alternative but
shifting the challenge to generating programs from non-English questions. We
propose a framework to evaluate PoT by separating multilingual reasoning from
code execution to examine (i) the impact of fine-tuning on question-reasoning
alignment and (ii) how reasoning quality affects answer correctness. Our
findings demonstrate that PoT fine-tuning substantially enhances multilingual
reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong
correlation between reasoning quality (measured through code quality) and
answer accuracy, highlighting its potential as a test-time performance
improvement heuristic.
♻ ☆ Vague Knowledge: Evidence from Analyst Reports
People in the real world often possess vague knowledge of future payoffs, for
which quantification is not feasible or desirable. We argue that language, with
differing ability to convey vague information, plays an important but
less-known role in representing subjective expectations. Empirically, we find that
in their reports, analysts include useful information in linguistic expressions
but not numerical forecasts. Specifically, the textual tone of analyst reports
has predictive power for forecast errors and subsequent revisions in numerical
forecasts, and this relation becomes stronger when analysts' language is
vaguer, when uncertainty is higher, and when analysts are busier. Overall, our
theory and evidence suggest that some useful information is vaguely known and
only communicated through language.
♻ ☆ General-Reasoner: Advancing LLM Reasoning Across All Domains
Reinforcement learning (RL) has recently demonstrated strong potential in
enhancing the reasoning capabilities of large language models (LLMs).
Particularly, the "Zero" reinforcement learning introduced by DeepSeek-R1-Zero
enables direct RL training of base LLMs without relying on an intermediate
supervised fine-tuning stage. Despite these advancements, current works for LLM
reasoning mainly focus on mathematical and coding domains, largely due to data
abundance and the ease of answer verification. This limits the applicability
and generalization of such models to broader domains, where questions often
have diverse answer representations, and data is more scarce. In this paper, we
propose General-Reasoner, a novel training paradigm designed to enhance LLM
reasoning capabilities across diverse domains. Our key contributions include:
(1) constructing a large-scale, high-quality dataset of questions with
verifiable answers curated by web crawling, covering a wide range of
disciplines; and (2) developing a generative model-based answer verifier, which
replaces traditional rule-based verification with the capability of
chain-of-thought and context-awareness. We train a series of models and
evaluate them on a wide range of datasets covering domains such as physics,
chemistry, finance, and electronics. Our comprehensive evaluation across these
12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC)
demonstrates that General-Reasoner outperforms existing baseline methods,
achieving robust and generalizable reasoning performance while maintaining
superior effectiveness in mathematical reasoning tasks.
♻ ☆ Slamming: Training a Speech Language Model on One GPU in a Day ACL 2025
We introduce Slam, a recipe for training high-quality Speech Language Models
(SLMs) on a single academic GPU in 24 hours. We do so through empirical
analysis of model initialisation and architecture, synthetic training data,
preference optimisation with synthetic data and tweaking all other components.
We empirically demonstrate that this training recipe also scales well with more
compute, achieving results on par with leading SLMs at a fraction of the compute
cost. We hope these insights will make SLM training and research more
accessible. In the context of SLM scaling laws, our results far outperform
predicted compute optimal performance, giving an optimistic view to SLM
feasibility. Code, data, models, and samples are available at
https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
comment: ACL 2025 (Findings)
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?
Large foundation models face challenges in acquiring transferable, structured
thinking abilities, especially when supervised with rigid templates or
crowd-annotated instruction datasets. Unlike prior approaches, we focus on a
thinking-centric data synthesis paradigm that enables models to evolve through
self-generated, cognitively guided data. We propose MindGYM, a structured and
scalable framework for question synthesis, composed of: (1) Cognitive Thinking
Process Injection, which infuses high-level reasoning objectives to shape the
model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating
atomic questions from diverse semantic types to encourage broader thinking; and
(3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop
questions based on QA seeds for deeper reasoning. Detailed analysis shows that
synthetic data generated by our method achieves 16.7% higher average quality
and 67.91% lower quality variance compared to baseline sources, highlighting
that both high-quality and self-contained data are essential for effective,
thinking-oriented fine-tuning. MindGYM improves performance on six reasoning
benchmarks, achieving gains of up to 16% on MathVision using only 400 data
samples, and generalizable improvements across different model sizes and
architectures. MindGYM underscores the viability of self-challenging mechanisms
in refining large model capabilities while minimizing human intervention and
resource demands. Code and data are released to promote data-centric research
into self-evolving foundation models driven by their internal reasoning
capabilities.
comment: 22 pages, 7 tables
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
Fine-tuning large language models (LLMs) using diverse datasets is crucial
for enhancing their overall performance across various domains. In practical
scenarios, existing methods based on modeling the mixture proportions of data
composition often struggle with data whose domain labels are missing, imprecise
or non-normalized, while methods based on data selection usually encounter
difficulties in balancing multi-domain performance. To address these
challenges, in this work, we investigate the role of data diversity in
enhancing the overall abilities of LLMs by empirically constructing contrastive
data pools and theoretically deriving explanations. Building upon the insights
gained, we propose a new method that gives the LLM a dual identity: an output
model to cognitively probe and select data based on diversity reward, as well
as an input model to be tuned with the selected data. Extensive experiments
show that the proposed method notably boosts performance across
domain-undetermined data and a series of foundational downstream tasks when
applied to various advanced LLMs. We release our code and hope this study can
shed light on the understanding of data diversity and advance feedback-driven
data-model co-design for LLMs.
comment: 33 pages, 20 figures, 21 tables
♻ ☆ FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks
Spatial reasoning is a fundamental aspect of human intelligence. One key
concept in spatial cognition is the Frame of Reference (FoR), which identifies
the perspective of spatial expressions. Despite its significance, FoR has
received limited attention in AI models that need spatial intelligence. There
is a lack of dedicated benchmarks and in-depth evaluation of large language
models (LLMs) in this area. To address this issue, we introduce the Frame of
Reference Evaluation in Spatial Reasoning Tasks (FoREST) benchmark, designed to
assess FoR comprehension in LLMs. We evaluate LLMs on answering questions that
require FoR comprehension and layout generation in text-to-image models using
FoREST. Our results reveal a notable performance gap across different FoR
classes in various LLMs, affecting their ability to generate accurate layouts
for text-to-image generation. This highlights critical shortcomings in FoR
comprehension. To improve FoR understanding, we propose Spatial-Guided
prompting, which improves LLMs' ability to extract essential spatial concepts.
Our proposed method improves overall performance across spatial reasoning
tasks.
comment: 9 pages
♻ ☆ TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
This paper investigates Reinforcement Learning (RL) on data without explicit
labels for reasoning tasks in Large Language Models (LLMs). The core challenge
of the problem is reward estimation during inference while not having access to
ground-truth information. While this setting appears elusive, we find that
common practices in Test-Time Scaling (TTS), such as majority voting, yield
surprisingly effective rewards suitable for driving RL training. In this work,
we introduce Test-Time Reinforcement Learning (TTRL), a novel method for
training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs
by utilizing the priors in the pre-trained models. Our experiments demonstrate
that TTRL consistently improves performance across a variety of tasks and
models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by
approximately 211% on AIME 2024 with only unlabeled test data. Furthermore,
although TTRL is only supervised by the maj@n metric, it consistently
surpasses the upper limit of the initial model's maj@n and approaches the
performance of models trained directly on test data with
ground-truth labels. Our experimental findings validate the general
effectiveness of TTRL across various tasks and highlight TTRL's potential for
broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
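The majority-voting reward mentioned in the TTRL abstract can be sketched in a few lines. This is a simplified illustration, not the code from the linked repository: it assumes final answers have already been extracted as strings, treats the most frequent answer as a pseudo-label, and gives a binary reward for agreement. The function name and the sample answers are hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Estimate rewards without ground truth via majority voting.

    answers: final answers extracted from N sampled responses to one prompt.
    Returns (pseudo_label, rewards), where each reward is 1.0 if the answer
    matches the majority answer and 0.0 otherwise.
    """
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Five made-up sampled answers for a single unlabeled test question.
sampled = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(sampled)
```

These per-sample rewards would then feed a standard RL update in place of rewards computed from ground-truth labels.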
♻ ☆ From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment
Large language models (LLMs) have traditionally been aligned through
one-size-fits-all approaches that assume uniform human preferences,
fundamentally overlooking the diversity in user values and needs. This paper
introduces a comprehensive framework for scalable personalized alignment of
LLMs. We establish a systematic preference space characterizing psychological
and behavioral dimensions, alongside diverse persona representations for robust
preference inference in real-world scenarios. Building upon this foundation, we
introduce AlignX, a large-scale dataset of over 1.3 million
personalized preference examples, and develop two complementary alignment
approaches: in-context alignment directly conditioning on persona
representations and preference-bridged alignment modeling intermediate
preference distributions. Extensive experiments demonstrate substantial
improvements over existing methods, with an average 17.06% accuracy gain
across four benchmarks while exhibiting a strong adaptation capability to novel
preferences, robustness to limited user data, and precise preference
controllability. These results validate our approach toward user-adaptive AI
systems.
♻ ☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world
remains a significant challenge. Due to the lack of large-scale 3D-text pair
datasets, the success of LLMs has yet to be replicated in 3D understanding. In
this paper, we rethink this issue and propose a new task: 3D Data-Efficient
Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D
object understanding with minimal 3D point cloud and text data pairs. To
address this task, we introduce GreenPLM, which leverages more text data to
compensate for the lack of 3D data. First, inspired by using CLIP to align
images and text, we utilize a pre-trained point cloud-text encoder to map the
3D point cloud space to the text space. This mapping allows us to seamlessly
connect the text space with LLMs. Once the point-text-LLM connection is
established, we further enhance text-LLM alignment by expanding the
intermediate text space, thereby reducing the reliance on 3D point cloud data.
Specifically, we generate 6M free-text descriptions of 3D objects, and design a
three-stage training strategy to help LLMs better explore the intrinsic
connections between different modalities. To achieve efficient modality
alignment, we design a zero-parameter cross-attention module for token pooling.
Extensive experimental results show that GreenPLM requires only 12% of the 3D
training data used by existing state-of-the-art models to achieve superior 3D
understanding. Remarkably, GreenPLM also achieves competitive performance using
text-only data. The code and weights are available at:
https://github.com/TangYuan96/GreenPLM.
♻ ☆ Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
High-resolution (HR) image perception remains a key challenge in multimodal
large language models (MLLMs). To overcome the limitations of existing methods,
this paper shifts away from prior dedicated heuristic approaches and revisits
the most fundamental idea of HR perception by enhancing the long-context
capability of MLLMs, driven by recent advances in long-context techniques like
retrieval-augmented generation (RAG) for general LLMs. Towards this end, this
paper presents the first study exploring the use of RAG to address HR
perception challenges. Specifically, we propose Retrieval-Augmented Perception
(RAP), a training-free framework that retrieves and fuses relevant image crops
while preserving spatial context using the proposed Spatial-Awareness Layout.
To accommodate different tasks, the proposed Retrieved-Exploration Search
(RE-Search) dynamically selects the optimal number of crops based on model
confidence and retrieval scores. Experimental results on HR benchmarks
demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving
a 43% improvement on $V^*$ Bench and 19% on HR-Bench.
♻ ☆ APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries
Recent progress in large language models (LLMs) has shown promise in formal
theorem proving, yet existing benchmarks remain limited to isolated, static
proof tasks, failing to capture the iterative, engineering-intensive workflows
of real-world formal mathematics libraries. Motivated by analogous advances in
software engineering, we introduce the paradigm of Automated Proof Engineering
(APE), which aims to automate proof engineering tasks such as feature addition,
proof refactoring, and bug fixing using LLMs. To facilitate research in this
direction, we present APE-Bench I, the first realistic benchmark built from
real-world commit histories of Mathlib4, featuring diverse file-level tasks
described in natural language and verified via a hybrid approach combining the
Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable
parallel verification infrastructure optimized for proof checking across
multiple versions of Mathlib. Empirical results on state-of-the-art LLMs
demonstrate strong performance on localized edits but substantial degradation
on handling complex proof engineering. This work lays the foundation for
developing agentic workflows in proof engineering, with future benchmarks
targeting multi-file coordination, project-scale verification, and autonomous
agents capable of planning, editing, and repairing formal libraries.
♻ ☆ Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings
Razi Mahmood, Pingkun Yan, Diego Machado Reyes, Ge Wang, Mannudeep K. Kalra, Parisa Kaviani, Joy T. Wu, Tanveer Syeda-Mahmood
Several evaluation metrics have been developed recently to automatically
assess the quality of generative AI reports for chest radiographs based only on
textual information using lexical, semantic, or clinical named entity
recognition methods. In this paper, we develop a new method of report quality
evaluation by first extracting fine-grained finding patterns capturing the
location, laterality, and severity of a large number of clinical findings. We
then perform phrasal grounding to localize their associated anatomical
regions on chest radiograph images. The textual and visual measures are then
combined to rate the quality of the generated reports. We present results that
compare this evaluation metric with other textual metrics on a gold standard
dataset derived from the MIMIC collection and show its robustness and
sensitivity to factual errors.
♻ ☆ SMARTe: Slot-based Method for Accountable Relational Triple extraction
Relational Triple Extraction (RTE) is a fundamental task in Natural Language
Processing (NLP). However, prior research has primarily focused on optimizing
model performance, with limited efforts to understand the internal mechanisms
driving these models. Many existing methods rely on complex preprocessing to
induce specific interactions, often resulting in opaque systems that may not
fully align with their theoretical foundations. To address these limitations,
we propose SMARTe: a Slot-based Method for Accountable Relational Triple
extraction. SMARTe introduces intrinsic interpretability through a slot
attention mechanism and frames the task as a set prediction problem. Slot
attention consolidates relevant information into distinct slots, ensuring all
predictions can be explicitly traced to learned slot representations and the
tokens contributing to each predicted relational triple. While emphasizing
interpretability, SMARTe achieves performance comparable to state-of-the-art
models. Evaluations on the NYT and WebNLG datasets demonstrate that adding
interpretability does not compromise performance. Furthermore, we conducted
qualitative assessments to showcase the explanations provided by SMARTe, using
attention heatmaps that map to their respective tokens. We conclude with a
discussion of our findings and propose directions for future research.
♻ ☆ ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding
The emergence of Multi-modal Large Language Models (MLLMs) presents new
opportunities for chart understanding. However, due to the fine-grained nature
of these tasks, applying MLLMs typically requires large, high-quality datasets
for task-specific fine-tuning, leading to high data collection and training
costs. To address this, we propose ChartCards, a unified chart-metadata
generation framework for multi-task chart understanding. ChartCards
systematically synthesizes various chart information, including data tables,
visualization code, visual elements, and multi-dimensional semantic captions.
By structuring this information into organized metadata, ChartCards enables a
single chart to support multiple downstream tasks, such as text-to-chart
retrieval, chart summarization, chart-to-table conversion, chart description,
and chart question answering. Using ChartCards, we further construct MetaChart,
a large-scale high-quality dataset containing 10,862 data tables, 85K charts,
and 170K high-quality chart captions. We validate the dataset through
qualitative crowdsourcing evaluations and quantitative fine-tuning experiments
across various chart understanding tasks. Fine-tuning six different models on
MetaChart resulted in an average performance improvement of 5% across all
tasks. The most notable improvements are seen in text-to-chart retrieval and
chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements
of 17% and 28%, respectively.
♻ ☆ Through the LLM Looking Glass: A Socratic Probing of Donkeys, Elephants, and Markets
While detecting and avoiding bias in LLM-generated text is becoming
increasingly important, media bias often remains subtle and subjective, making
it particularly difficult to identify and mitigate. In this study, we assess
media bias in LLM-generated content and LLMs' ability to detect subtle
ideological bias. We conduct this evaluation using two datasets, PoliGen and
EconoLex, covering political and economic discourse, respectively. We evaluate
seven widely used LLMs by prompting them to generate articles and analyze their
ideological preferences via Socratic probing. By using our self-contained
Socratic approach, the study aims to directly measure the models' biases rather
than relying on external interpretations, thereby minimizing subjective
judgments about media bias. Our results reveal a consistent preference for
Democratic over Republican positions across all models. Conversely, in economic
topics, biases vary among Western LLMs, while those developed in China lean
more strongly toward socialism.
♻ ☆ HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Transformers have become the de facto architecture for a wide range of
machine learning tasks, particularly in large language models (LLMs). Despite
their remarkable performance, challenges remain in training deep transformer
networks, especially regarding the position of layer normalization. While
Pre-Norm structures facilitate more stable training owing to their stronger
identity path, they often lead to suboptimal performance compared to Post-Norm.
In this paper, we propose HybridNorm, a simple yet effective hybrid
normalization strategy that integrates the advantages of both Pre-Norm and
Post-Norm. Specifically, HybridNorm employs QKV normalization within the
attention mechanism and Post-Norm in the feed-forward network (FFN) of each
transformer block. We provide both theoretical insights and empirical evidence
demonstrating that HybridNorm improves gradient flow and model robustness.
Extensive experiments on large-scale transformer models, including both dense
and sparse variants, show that HybridNorm consistently outperforms both
Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings
highlight the potential of HybridNorm as a more stable and effective technique
for improving the training and performance of deep transformer models. Code is
available at https://github.com/BryceZhuo/HybridNorm.
♻ ☆ TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Recent efforts target spoken language models (SLMs) that not only listen but
also speak for more natural human-LLM interaction. Joint speech-text modeling
is a promising direction to achieve this. However, the effectiveness of recent
speech tokens for joint modeling remains underexplored. To address this, we
introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that
directly addresses the modality gap by aligning speech tokens with the
corresponding text transcription during the tokenization stage. We propose a
method that can achieve this through an attention-based aggregation mechanism
and with speech reconstruction as the training objective. We conduct extensive
experiments and show that TASTE can preserve essential paralinguistic
information while dramatically reducing the token sequence length. With TASTE,
we perform straightforward joint spoken language modeling by using Low-Rank
Adaptation on the pre-trained text LLM. Experimental results show that
TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze,
while significantly outperforming other pre-trained SLMs on speech continuation
across subjective and objective evaluations. To our knowledge, TASTE is the
first end-to-end approach that utilizes a reconstruction objective to
automatically learn a text-aligned speech tokenization and embedding suitable
for spoken language modeling. Our demo, code, and model are available at
https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
comment: Preprint
♻ ☆ EntGPT: Entity Linking with Generative Large Language Models
Entity Linking in natural language processing seeks to match text entities to
their corresponding entries in a dictionary or knowledge base. Traditional
approaches rely on contextual models, which can be complex, hard to train, and
have limited transferability across different domains. Generative large
language models like GPT offer a promising alternative but often underperform
with naive prompts. In this study, we introduce EntGPT, employing advanced
prompt engineering to enhance EL tasks. Our three-step hard-prompting method
(EntGPT-P) significantly boosts the micro-F_1 score by up to 36% over vanilla
prompts, achieving competitive performance across 10 datasets without
supervised fine-tuning. Additionally, our instruction tuning method (EntGPT-I)
improves micro-F_1 scores by 2.1% on average in supervised EL tasks and
outperforms several baseline models in six Question Answering tasks. Our
methods are compatible with both open-source and proprietary LLMs. All data and
code are available on GitHub at https://github.com/yifding/In_Context_EL.
♻ ☆ Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Over the past six years, molecular transformer models have become key tools
in drug discovery. Most existing models are pre-trained on large, unlabeled
datasets such as ZINC or ChEMBL. However, the extent to which large-scale
pre-training improves molecular property prediction remains unclear. This study
evaluates transformer models for this task while addressing their limitations.
We explore how pre-training dataset size and chemically informed objectives
impact performance. Our results show that increasing the dataset beyond
approximately 400K to 800K molecules from large-scale unlabeled databases does
not enhance performance across seven datasets covering five ADME endpoints:
lipophilicity, permeability, solubility (two datasets), microsomal stability
(two datasets), and plasma protein binding. In contrast, domain adaptation on a
small, domain-specific dataset (at most 4K molecules) using
multi-task regression of physicochemical properties significantly boosts
performance (P-value less than 0.001). A model pre-trained on 400K molecules
and adapted with domain-specific data outperforms larger models such as
MolFormer and performs comparably to MolBERT. Benchmarks against Random Forest
(RF) baselines using descriptors and Morgan fingerprints show that chemically
and physically informed features consistently yield better performance across
model types. While RF remains a strong baseline, we identify concrete practices
to enhance transformer performance. Aligning pre-training and adaptation with
chemically meaningful tasks and domain-relevant data presents a promising
direction for molecular property prediction. Our models are available on
HuggingFace for easy use and adaptation.
♻ ☆ ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models ACL 2025
While integrating external tools into large language models (LLMs) enhances
their ability to access real-time information and domain-specific services,
existing approaches focus narrowly on functional tool selection following user
instructions, overlooking the context-aware personalization in tool selection.
This oversight leads to suboptimal user satisfaction and inefficient tool
utilization, particularly when overlapping toolsets require nuanced selection
based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a
benchmark designed to evaluate LLMs' capabilities in personalized tool
utilization. Specifically, we formalize two key dimensions of personalization,
user profile and environmental factors, and analyze their individual and
synergistic impacts on tool utilization. Through extensive experiments on
ToolSpectrum, we demonstrate that personalized tool utilization significantly
improves user experience across diverse scenarios. However, even
state-of-the-art LLMs exhibit limited ability to reason jointly about user
profiles and environmental factors, often prioritizing one dimension at the
expense of the other. Our findings underscore the necessity of context-aware
personalization in tool-augmented LLMs and reveal critical limitations for
current models. Our data and code are available at
https://github.com/Chengziha0/ToolSpectrum.
comment: Accepted by ACL 2025 Findings
♻ ☆ Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation NeurIPS 2025
Supervised fine-tuning (SFT) using expert demonstrations often suffers from
the imitation problem, where the model learns to reproduce the correct
responses without understanding the underlying rationale. To address this
limitation, we propose Critique-Guided Distillation (CGD), a novel multi-stage
framework that integrates teacher model generated explanatory critiques and
refined responses into the SFT process. A student model is then trained to map
the triplet of prompt, teacher critique, and its own initial response to the
corresponding refined teacher response, thereby learning both what to imitate
and why. Using entropy-based analysis, we show that CGD reduces refinement
uncertainty and can be interpreted as a Bayesian posterior update. We perform
extensive empirical evaluation of CGD on a variety of benchmark tasks and
demonstrate significant gains on both math (AMC23 +17.5%) and language
understanding tasks (MMLU-Pro +6.3%), while successfully mitigating the format
drift issues observed in previous critique fine-tuning (CFT) techniques.
comment: Submitted to NeurIPS 2025
♻ ☆ Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models
Large language models (LLMs) are widely applied in various fields of society
due to their powerful reasoning, understanding, and generation capabilities.
However, the security issues associated with these models are becoming
increasingly severe. Jailbreaking attacks, as an important method for detecting
vulnerabilities in LLMs, have been explored by researchers who attempt to
induce these models to generate harmful content through various attack methods.
Nevertheless, existing jailbreaking methods face numerous limitations, such as
excessive query counts, limited coverage of jailbreak modalities, low attack
success rates, and simplistic evaluation methods. To overcome these
constraints, this paper proposes a multimodal jailbreaking method: JMLLM. This
method integrates multiple strategies to perform comprehensive jailbreak
attacks across text, visual, and auditory modalities. Additionally, we
contribute a new and comprehensive dataset for multimodal jailbreaking
research: TriJail, which includes jailbreak prompts for all three modalities.
Experiments on the TriJail dataset and the benchmark dataset AdvBench,
conducted on 13 popular LLMs, demonstrate advanced attack success rates and
significant reduction in time overhead.
♻ ☆ Breaking Information Cocoons: A Hyperbolic Graph-LLM Framework for Exploration and Exploitation in Recommender Systems
Modern recommender systems often create information cocoons, restricting
users' exposure to diverse content. A key challenge lies in balancing content
exploration and exploitation while allowing users to adjust their
recommendation preferences. Intuitively, this balance can be modeled as a
tree-structured representation, where depth search facilitates exploitation and
breadth search enables exploration. However, existing approaches face two
fundamental limitations: Euclidean methods struggle to capture hierarchical
structures, while hyperbolic methods, despite their superior hierarchical
modeling, lack semantic understanding of user and item profiles and fail to
provide a principled mechanism for balancing exploration and exploitation. To
address these challenges, we propose HERec, a hyperbolic graph-LLM framework
that effectively balances exploration and exploitation in recommender systems.
Our framework introduces two key innovations: (1) a semantic-enhanced
hierarchical mechanism that aligns rich textual descriptions processed by large
language models (LLMs) with collaborative information directly in hyperbolic
space, allowing for more nuanced updates that respect the underlying
hierarchical structure in user-item profiles; (2) an automatic hierarchical
representation by optimizing Dasgupta's cost, which discovers hierarchical
structures without requiring predefined hyperparameters, enabling
user-adjustable exploration-exploitation trade-offs. Extensive experiments
demonstrate that HERec consistently outperforms both Euclidean and hyperbolic
baselines, achieving up to 5.49% improvement in utility metrics and 11.39%
increase in diversity metrics, effectively mitigating information cocoons. We
open-source our model implementation at https://github.com/Martin-qyma/HERec.
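For readers unfamiliar with Dasgupta's cost, which the abstract above names as the optimization target, a minimal sketch follows. It assumes the standard definition: for a hierarchy over items with pairwise similarities w(i, j), the cost sums w(i, j) times the number of leaves under the lowest common ancestor of i and j, so similar items are cheaper when merged low in the tree. The nested-tuple tree encoding and the function names are illustrative choices and are not taken from the HERec code.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a nested-tuple tree (a non-tuple is a leaf)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, weights):
    """Sum of w(i, j) * |leaves(lca(i, j))| over all leaf pairs.

    `weights` maps frozenset({i, j}) to a similarity; missing pairs count as 0.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(c) for c in tree]
    n_here = sum(len(ls) for ls in child_leaves)
    cost = 0.0
    # Pairs split across two different children have this node as their LCA.
    for a, b in combinations(child_leaves, 2):
        for i in a:
            for j in b:
                cost += weights.get(frozenset((i, j)), 0.0) * n_here
    # Pairs that fall inside a single child are charged deeper in the tree.
    return cost + sum(dasgupta_cost(c, weights) for c in tree)


# Toy example (made-up similarities): a-b and c-d are similar, a-c only weakly.
w = {
    frozenset(("a", "b")): 1.0,
    frozenset(("c", "d")): 1.0,
    frozenset(("a", "c")): 0.1,
}
good_tree = (("a", "b"), ("c", "d"))  # groups similar items together
bad_tree = (("a", "c"), ("b", "d"))   # separates similar items
print(dasgupta_cost(good_tree, w) < dasgupta_cost(bad_tree, w))
```

The tree that groups similar items first receives the lower cost, which is why minimizing this quantity can recover a hierarchy without a preset number of clusters.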
♻ ☆ Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization
As large language models (LLMs) continue to advance in capabilities, it is
essential to assess how they perform on established benchmarks. In this study,
we present a suite of experiments to assess the performance of modern LLMs
(ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for
identifying case holdings. Our experiments demonstrate "scaling effects":
performance on this task improves with model size, with more capable models
like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720
respectively. These scores are competitive with the best published results on
this dataset, and do not require any technically sophisticated model training,
fine-tuning or few-shot prompting. To ensure that these strong results are not
due to memorization of judicial opinions contained in the training data, we
develop and utilize a novel citation anonymization test that preserves semantic
meaning while ensuring case names and citations are fictitious. Models maintain
strong performance under these conditions (macro F1 of 0.728), suggesting the
performance is not due to rote memorization. These findings demonstrate both
the promise and current limitations of LLMs for legal tasks with important
implications for the development and measurement of automated legal analytics
and legal benchmarks.
comment: Presented as a short paper at International Conference on Artificial
Intelligence and Law 2025 (Chicago, IL)
♻ ☆ ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
Academic paper review is a critical yet time-consuming task within the
research community. With the increasing volume of academic publications,
automating the review process has become a significant challenge. The primary
issue lies in generating comprehensive, accurate, and reasoning-consistent
review comments that align with human reviewers' judgments. In this paper, we
address this challenge by proposing ReviewAgents, a framework that leverages
large language models (LLMs) to generate academic paper reviews. We first
introduce a novel dataset, Review-CoT, consisting of 142k review comments,
designed for training LLM agents. This dataset emulates the structured
reasoning process of human reviewers: summarizing the paper, referencing
relevant works, identifying strengths and weaknesses, and generating a review
conclusion. Building upon this, we train LLM reviewer agents capable of
structured reasoning using a relevant-paper-aware training method. Furthermore,
we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to
enhance the review comment generation process. Additionally, we propose
ReviewBench, a benchmark for evaluating the review comments generated by LLMs.
Our experimental results on ReviewBench demonstrate that while existing LLMs
exhibit a certain degree of potential for automating the review process, there
remains a gap when compared to human-generated reviews. Moreover, our
ReviewAgents framework further narrows this gap, outperforming advanced LLMs in
generating review comments.
comment: Work in progress
♻ ☆ AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Hongyi Wang, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Vision-Language Models (VLMs) show promise for autonomous driving, yet their
struggle with hallucinations, inefficient reasoning, and limited real-world
validation hinders accurate perception and robust step-by-step reasoning. To
overcome this, we introduce AgentThink, a pioneering unified framework
that, for the first time, integrates Chain-of-Thought (CoT) reasoning with
dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's
core innovations include: (i) Structured Data Generation, by
establishing an autonomous driving tool library to automatically construct
structured, self-verified reasoning data explicitly incorporating tool usage
for diverse driving scenarios; (ii) A Two-stage Training Pipeline,
employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization
(GRPO) to equip VLMs with the capability for autonomous tool invocation; and
(iii) Agent-style Tool-Usage Evaluation, introducing a novel
multi-tool assessment protocol to rigorously evaluate the model's tool
invocation and utilization. Experiments on the DriveLMM-o1 benchmark
demonstrate AgentThink significantly boosts overall reasoning scores by
53.91% and enhances answer accuracy by 33.54%, while
markedly improving reasoning quality and consistency. Furthermore, ablation
studies and robust zero-shot/few-shot generalization experiments across various
benchmarks underscore its powerful capabilities. These findings highlight a
promising trajectory for developing trustworthy and tool-aware autonomous
driving models.
comment: 18 pages, 8 figures
♻ ☆ Hallucination Detection in LLMs with Topological Divergence on Attention Graphs
Alexandra Bazarova, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Andrei Volodichev, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev
Hallucination, i.e., generating factually incorrect content, remains a
critical challenge for large language models (LLMs). We introduce TOHA, a
TOpology-based HAllucination detector in the RAG setting, which leverages a
topological divergence metric to quantify the structural properties of graphs
induced by attention matrices. Examining the topological divergence between
prompt and response subgraphs reveals consistent patterns: higher divergence
values in specific attention heads correlate with hallucinated outputs,
independent of the dataset. Extensive experiments, including evaluation on
question answering and summarization tasks, show that our approach achieves
state-of-the-art or competitive results on several benchmarks while requiring
minimal annotated data and computational resources. Our findings suggest that
analyzing the topological structure of attention matrices can serve as an
efficient and robust indicator of factual reliability in LLMs.
♻ ☆ A Unified Approach to Routing and Cascading for LLMs
The availability of a wide range of large language models (LLMs) embedded in
various agentic systems has significantly increased the potential of model
selection strategies to improve the cost-performance tradeoff. Existing
strategies involve either routing, where a single model is chosen per query, or
cascading, which sequentially runs increasingly larger models until a
satisfactory answer is found. However, current approaches face three key
limitations: they (1) lack formal proofs of optimality, (2) fail to identify
the conditions under which these strategies are most effective to improve the
cost-performance tradeoff, and (3) are unable to combine both paradigms for
further improvements. To address these issues, we first derive a novel optimal
strategy for cascading and prove the optimality of an existing routing
strategy. Further, we propose cascade routing, a unified framework that
integrates routing and cascading into a theoretically optimal strategy. Through
our analysis, we identify good quality estimators as the critical factor for
the success of model selection paradigms. Finally, in our experiments, we show
that cascade routing consistently outperforms the individual approaches by a
large margin and we analyze quality estimators to determine when routing and/or
cascading are useful paradigms for model selection.
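The two baseline paradigms described in the abstract above can be sketched as follows. This is only an illustration of plain routing and plain cascading, not the optimal strategies or the cascade routing method derived in the paper. The model dictionaries, the quality estimator, the cost weight, and the "satisfactory" check are all hypothetical, and sorting by cost is used here as a stand-in for "increasingly larger models".

```python
def route(query, models, estimate_quality, cost_weight):
    """Routing: choose one model per query.

    Picks the model with the best estimated quality minus weighted cost.
    """
    return max(
        models,
        key=lambda m: estimate_quality(m, query) - cost_weight * m["cost"],
    )


def cascade(query, models, run, is_satisfactory):
    """Cascading: run models from cheapest to most expensive.

    Stops at the first satisfactory answer and returns it together with
    the total cost of every model that was run.
    """
    total_cost = 0.0
    answer = None
    for m in sorted(models, key=lambda m: m["cost"]):
        answer = run(m, query)
        total_cost += m["cost"]
        if is_satisfactory(answer):
            break
    return answer, total_cost


# Toy setup with made-up numbers.
models = [
    {"name": "small", "cost": 1.0, "quality": 0.5},
    {"name": "large", "cost": 10.0, "quality": 0.9},
]
estimate = lambda m, q: m["quality"]
run = lambda m, q: {"model": m["name"], "quality": m["quality"]}

chosen = route("example query", models, estimate, cost_weight=0.01)
answer, spent = cascade("example query", models, run, lambda a: a["quality"] >= 0.8)
print(chosen["name"], answer["model"], spent)
```

Both functions depend entirely on the quality signal they are given (`estimate_quality` or `is_satisfactory`), which matches the abstract's point that good quality estimators are the critical factor.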
♻ ☆ GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
Speculative decoding accelerates inference in large language models (LLMs) by
generating multiple draft tokens simultaneously. However, existing methods
often struggle with token misalignment between the training and decoding
phases, limiting their performance. To address this, we propose GRIFFIN, a
novel framework that incorporates a token-alignable training strategy and a
token-alignable draft model to mitigate misalignment. The training strategy
employs a loss masking mechanism to exclude highly misaligned tokens during
training, preventing them from negatively impacting the draft model's
optimization. The token-alignable draft model introduces input tokens to
correct inconsistencies in generated features. Experiments on LLaMA, Vicuna,
Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance
length improvement of over 8% and a speedup ratio exceeding 7%, outperforming
current speculative decoding state-of-the-art methods. Our code and GRIFFIN's
draft models are released publicly at https://github.com/hsj576/GRIFFIN.
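The loss masking idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a per-token boolean mask is already available; how GRIFFIN decides which tokens are "highly misaligned" is not stated in the abstract, so the mask here is a hypothetical input, and the function name is illustrative.

```python
import math


def masked_cross_entropy(token_probs, keep_mask):
    """Average negative log-likelihood over the tokens that are kept.

    token_probs: probability the draft model assigns to each target token.
    keep_mask:   False for tokens excluded from training (assumed given).
    Returns 0.0 when every token is masked out.
    """
    kept = [-math.log(p) for p, keep in zip(token_probs, keep_mask) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Toy example: the third token is flagged as misaligned, so its large
# loss does not contribute to the training signal.
probs = [0.5, 0.5, 0.01]
mask = [True, True, False]
print(masked_cross_entropy(probs, mask))
print(masked_cross_entropy(probs, [True, True, True]))
```

Normalizing by the number of kept tokens, rather than the full sequence length, keeps the loss scale comparable across sequences with different numbers of masked tokens; that normalization is a design choice of this sketch.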
♻ ☆ Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer ACL 2025
Large Language Models (LLMs) increasingly incorporate multilingual
capabilities, fueling the demand to transfer them into target language-specific
models. However, most approaches, which blend the source model's embedding by
replacing the source vocabulary with the target language-specific vocabulary,
may constrain expressive capacity in the target language since the source model
is predominantly trained on English data. In this paper, we propose Semantic
Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that
recycles embeddings from target language Pre-trained Language Models (PLMs) to
transmit the deep representational strengths of PLM-derived embeddings to LLMs.
SALT derives unique regression lines based on the similarity in the overlap of
the source and target vocabularies, to handle each non-overlapping token's
embedding space. Our extensive experiments show that SALT significantly
outperforms other transfer methods and achieves lower loss and
faster convergence during language adaptation. Notably, SALT obtains remarkable
performance in cross-lingual understanding setups compared to other methods.
Furthermore, we highlight the scalable use of PLMs to enhance the functionality
of contemporary LLMs by conducting experiments with varying architectures.
comment: Accepted to ACL 2025 Findings
♻ ☆ GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm,
coupling online Reinforcement Learning (RL) with explicit chain-of-thought
reasoning prior to object grounding and thereby achieving substantial
performance gains. In this paper, we first conduct extensive analysis
experiments on three key components of that training pipeline: input design,
output evaluation, and policy update, each revealing distinct challenges arising
from blindly applying general-purpose RL without adapting to GUI grounding
tasks. Input design: Current templates encourage the model to generate
chain-of-thought reasoning, but longer chains unexpectedly lead to worse
grounding performance. Output evaluation: Reward functions based on hit signals
or box area allow models to exploit box size, leading to reward hacking and
poor localization quality. Policy update: Online RL tends to overfit easy
examples due to biases in length and sample difficulty, leading to
under-optimization on harder cases. To address these issues, we propose three
targeted solutions. First, we adopt a Fast Thinking Template that encourages
direct answer generation, reducing excessive reasoning during training. Second,
we incorporate a box size constraint into the reward function to mitigate
reward hacking. Third, we revise the RL objective by adjusting length
normalization and adding a difficulty-aware scaling factor, enabling better
optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with
Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on
ScreenSpot-Pro. This surpasses all prior models of similar size and even
outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI
agent grounding. The project repository is available at
https://github.com/Yuqi-Zhou/GUI-G1.
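To make the reward hacking concern in the abstract above concrete, here is a hypothetical sketch of a grounding reward that combines a hit signal with a box size term. The exact reward used by GUI-G1 is not given in the abstract. The hit rule (center of the predicted box inside the ground-truth box), the area-ratio size score, the additive combination, and the weight are all assumptions made for illustration.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def hit(pred_box, gt_box):
    """1.0 if the center of the predicted box lies inside the ground-truth box."""
    cx = (pred_box[0] + pred_box[2]) / 2.0
    cy = (pred_box[1] + pred_box[3]) / 2.0
    inside = gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]
    return 1.0 if inside else 0.0


def size_score(pred_box, gt_box):
    """Area ratio in [0, 1]: 1.0 for equal areas, near 0 for very different ones."""
    a, b = box_area(pred_box), box_area(gt_box)
    if max(a, b) == 0.0:
        return 0.0
    return min(a, b) / max(a, b)


def grounding_reward(pred_box, gt_box, size_weight=0.5):
    """Hit signal plus a weighted size term (hypothetical combination)."""
    return hit(pred_box, gt_box) + size_weight * size_score(pred_box, gt_box)


# With a hit-only reward, the tight box and the oversized box would score the
# same; the size term separates them.
gt = (10, 10, 20, 20)
print(grounding_reward((10, 10, 20, 20), gt))
print(grounding_reward((0, 0, 30, 30), gt))
```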
♻ ☆ Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Hamza Farooq
Large Language Models (LLMs) have shown remarkable capabilities, but their
development has primarily focused on English and other high-resource languages,
leaving many languages underserved. We present our latest Hindi-English
bilingual LLM Mantra-14B with ~3% average improvement in benchmark
scores over both languages, outperforming models twice its size. Using a
curated dataset composed of English and Hindi instruction data of 485K samples,
we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve
performance over both English and Hindi. Our experiments encompassing seven
different LLMs of varying parameter sizes and over 140 training attempts with
varying English-Hindi training data ratios demonstrated that it is possible to
significantly improve multilingual performance without compromising native
performance. Further, our approach avoids resource-intensive techniques like
vocabulary expansion or architectural modifications, thus keeping the model
size small. Our results indicate that modest fine-tuning with culturally and
locally informed data can bridge performance gaps without incurring significant
computational overhead. We release our training code, datasets, and models
under MIT and Apache licenses to aid further research towards under-represented
and low-resource languages.
comment: 24 pages, 18 figures
♻ ☆ Transferring Textual Preferences to Vision-Language Understanding through Model Merging ACL 2025
Large vision-language models (LVLMs) perform outstandingly across various
multimodal tasks. However, their ability to evaluate generated content remains
limited, and training vision-language reward models (VLRMs) with preference
data is computationally expensive. This paper explores a training-free
alternative by merging text-based reward models (RMs) with LVLMs to create
VLRMs. Our approach shows that integrating these models leads to improved
performance over LVLMs' scoring and text-based RMs, offering an efficient
method for incorporating textual preferences into LVLMs.
comment: Accepted to ACL 2025 main
♻ ☆ Normal forms in Virus Machines
In the present work, we further study the computational power of virus
machines (VMs for short). VMs provide a computing paradigm inspired by the
transmission and replication networks of viruses. VMs consist of process units
(called hosts) structured by a directed graph whose arcs are called channels
and an instruction graph that controls the transmissions of virus objects among
hosts. The present work complements our understanding of the computing power of
VMs by introducing normal forms; these expressions restrict the features in a
given computing model. Some of the features that we restrict in our normal forms
include (a) the number of hosts, (b) the number of instructions, and (c) the
number of virus objects in each host. After we recall some known results on the
computing power of VMs, we give our series of normal forms, such as the size of
the loops in the network, proving new characterisations of families of sets, such
as finite sets, semilinear sets, or recursively enumerable sets (NRE).
comment: 24 pages, 14 figures
♻ ☆ My Words Imply Your Opinion: Reader Agent-based Propagation Enhancement for Personalized Implicit Emotion Analysis
The subtlety of emotional expressions makes implicit emotion analysis (IEA)
particularly sensitive to user-specific characteristics. Current studies
personalize emotion analysis by focusing on the author but neglect the impact
of the intended reader on implicit emotional feedback. In this paper, we
introduce Personalized IEA (PIEA) and present the RAPPIE model, which addresses
subjective variability by incorporating reader feedback. In particular, (1) we
create reader agents based on large language models to simulate reader
feedback, overcoming the "spiral of silence" effect and the data
incompleteness of real reader reactions. (2) We develop a role-aware multi-view
graph learning method to model the interactive emotion propagation process in
scenarios with sparse reader information. (3) We construct two new PIEA
datasets covering English and Chinese social media with detailed user metadata,
addressing the text-centric limitation of existing datasets. Extensive
experiments show that RAPPIE significantly outperforms state-of-the-art
baselines, demonstrating the value of incorporating reader feedback in PIEA.
♻ ☆ Robust and Fine-Grained Detection of AI Generated Texts
Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Hamza Farooq
An ideal detection system for machine-generated content should work
well on any generator, as ever more advanced LLMs come into existence day by
day. Existing systems often struggle to accurately identify AI-generated
content in shorter texts. Further, not all texts might be entirely authored
by a human or an LLM, hence we focus on partial cases, i.e., human-LLM
co-authored texts. Our paper introduces a set of models built for the task of
token classification, trained on an extensive collection of
human-machine co-authored texts, which perform well on texts from unseen
domains, unseen generators, texts by non-native speakers, and those with
adversarial inputs. We also introduce a new dataset of over 2.4M such texts,
mostly co-authored by several popular proprietary LLMs, across 23 languages. We
also present findings on our models' performance on texts of each domain
and generator. Additional findings include comparisons of performance against
each adversarial method, length of input texts, and characteristics of generated
texts compared to the original human-authored texts.
comment: 18 pages, 6 figures
♻ ☆ LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering
The impact of Large Language Models (LLMs) has extended into literary
domains. However, existing evaluation metrics prioritize mechanical accuracy
over artistic expression and tend to overrate machine translation as being
superior to human translation from experienced professionals. In the long run,
this bias could result in an irreversible decline in translation quality and
cultural authenticity. In response to the urgent need for a specialized
literary evaluation metric, we introduce LiTransProQA, a novel, reference-free,
LLM-based question-answering framework designed for literary translation
evaluation. LiTransProQA uniquely integrates insights from professional
literary translators and researchers, focusing on critical elements in literary
quality assessment such as literary devices, cultural understanding, and
authorial voice. Our extensive evaluation shows that while literary-finetuned
XCOMET-XL yields marginal gains, LiTransProQA substantially outperforms current
metrics, achieving up to 0.07 gain in correlation and surpassing the best
state-of-the-art metrics by over 15 points in adequacy assessments.
Incorporating professional translator insights as weights further improves
performance, highlighting the value of translator inputs. Notably, LiTransProQA
reaches human-level evaluation performance comparable to trained student
evaluators. It shows broad applicability to open-source models like
LLaMa3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and
training-free tool for evaluating literary translations that require local
processing due to copyright or ethical considerations. The code and datasets
are available under: https://github.com/zhangr2021/TransProQA.
comment: Updated version, with examples in the appendix
♻ ☆ Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
Distillation has shown remarkable success in transferring knowledge from a
Large Language Model (LLM) teacher to a student LLM. However, current
distillation methods require similar tokenizers between the teacher and the
student, restricting their applicability to only a small subset of
teacher-student pairs. In this work, we develop a principled cross-tokenizer
distillation method to solve this crucial deficiency. Our method is the first
to enable effective distillation across fundamentally different tokenizers,
while also substantially outperforming prior methods in all other cases. We
verify the efficacy of our method on three distinct use cases. First, we show
that viewing tokenizer transfer as self-distillation enables unprecedentedly
effective transfer across tokenizers, including rapid transfer of subword
models to the byte-level. Transferring different models to the same tokenizer
also enables ensembling to boost performance. Secondly, we distil a large
maths-specialised LLM into a small general-purpose model with a different
tokenizer, achieving competitive maths problem-solving performance. Thirdly, we
use our method to train state-of-the-art embedding prediction hypernetworks for
training-free tokenizer transfer. Our results unlock an expanded range of
teacher-student pairs for distillation, enabling new ways to adapt and enhance
interaction between LLMs.
comment: Preprint, 21 pages
♻ ☆ Adaptive Thinking via Mode Policy Optimization for Social Language Agents
Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Effective social intelligence simulation requires language agents to
dynamically adjust reasoning depth, a capability notably absent in current
studies. Existing methods either lack this kind of reasoning capability or
enforce Long Chain-of-Thought reasoning uniformly across all scenarios,
resulting in excessive token usage and inflexible social simulation. To address
this, we propose an Adaptive Mode Learning (AML) framework in this paper,
aiming to improve the adaptive thinking ability of language agents in dynamic
social interactions. To this end, we first identify hierarchical thinking modes
ranging from intuitive response to deep deliberation based on the cognitive
control theory. We then develop the Adaptive Mode Policy Optimization (AMPO)
algorithm to optimize the
context-aware mode switching and reasoning. Our framework advances existing
research in three key aspects: (1) Multi-granular thinking mode design, (2)
Context-aware mode switching across social interaction, and (3) Token-efficient
reasoning via depth-adaptive processing. Extensive experiments on social
intelligence benchmarks verify that AML achieves 15.6% higher task performance
than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter
reasoning chains, demonstrating the advantage of adaptive thinking mode
selection and optimization mechanism in AMPO over GRPO's fixed-depth solution.
comment: Work in Progress. The code and data are available, see
https://github.com/MozerWang/AMPO
♻ ☆ Social Bias in Popular Question-Answering Benchmarks
Question-answering (QA) and reading comprehension (RC) benchmarks are
essential for assessing the capabilities of large language models (LLMs) in
retrieving and reproducing knowledge. However, we demonstrate that popular QA
and RC benchmarks are biased and do not cover questions about different
demographics or regions in a representative way, potentially due to a lack of
diversity of those involved in their creation. We perform a qualitative content
analysis of 30 benchmark papers and a quantitative analysis of 20 respective
benchmark datasets to learn (1) who is involved in the benchmark creation, (2)
how social bias is addressed or prevented, and (3) whether the demographics of
the creators and annotators correspond to particular biases in the content.
Most analyzed benchmark papers provided insufficient information regarding the
stakeholders involved in benchmark creation, particularly the annotators.
Notably, just one of the benchmark papers explicitly reported measures taken to
address social representation issues. Moreover, the data analysis revealed
gender, religion, and geographic biases across a wide range of encyclopedic,
commonsense, and scholarly benchmarks. More transparent and bias-aware QA and
RC benchmark creation practices are needed to facilitate better scrutiny and
incentivize the development of fairer LLMs.
♻ ☆ FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Selecting high-quality data can improve the pretraining efficiency of large
language models (LLMs). Existing methods generally rely on heuristic techniques
or single quality signals, limiting their ability to evaluate data quality
comprehensively. In this work, we propose FIRE, a flexible and scalable
framework for integrating multiple data quality raters, which allows for a
comprehensive assessment of data quality across various dimensions. FIRE aligns
multiple quality signals into a unified space, and integrates diverse data
quality raters to provide a comprehensive quality signal for each data point.
Further, we introduce a progressive data selection scheme based on FIRE that
iteratively refines the selection of high-quality data points. Extensive
experiments show that FIRE outperforms other data selection methods and
significantly boosts pretrained model performance across a wide range of
downstream tasks, while requiring less than 37.5% of the training data needed
by the Random baseline to reach the target performance.
comment: 21 pages, 11 figures
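The idea of aligning several quality signals into one space and then selecting data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not describe FIRE's alignment or integration procedure, so rank normalization and a plain average stand in for them, and the names `rank_normalize`, `combine_ratings`, and `select_top` are hypothetical.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank so raters on different scales become comparable.

    Tied scores share the same normalized value.
    """
    n = len(scores)
    if n == 1:
        return [0.5]
    normalized = []
    for s in scores:
        lower = sum(1 for t in scores if t < s)
        ties = sum(1 for t in scores if t == s) - 1  # exclude the item itself
        normalized.append((lower + 0.5 * ties) / (n - 1))
    return normalized


def combine_ratings(ratings: Dict[str, List[float]]) -> List[float]:
    """Average the rank-normalized scores of all raters (assumed integration rule)."""
    if not ratings:
        raise ValueError("at least one rater is required")
    aligned = [rank_normalize(scores) for scores in ratings.values()]
    n_items = len(aligned[0])
    if any(len(column) != n_items for column in aligned):
        raise ValueError("all raters must score the same data points")
    return [sum(column[i] for column in aligned) / len(aligned) for i in range(n_items)]


def select_top(scores: List[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring fraction of data points."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Two hypothetical raters that use very different scales.
ratings = {"rater_a": [0.1, 0.9, 0.5], "rater_b": [100.0, 300.0, 200.0]}
combined = combine_ratings(ratings)
selected = select_top(combined, fraction=0.5)
```

Rank normalization is used here only because it removes scale differences between raters without extra parameters; a learned alignment could replace it without changing the rest of the sketch.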
♻ ☆ Model Merging in Pre-training of Large Language Models
Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu
Model merging has emerged as a promising technique for enhancing large
language models, though its application in large-scale pre-training remains
relatively unexplored. In this paper, we present a comprehensive investigation
of model merging techniques during the pre-training process. Through extensive
experiments with both dense and Mixture-of-Experts (MoE) architectures ranging
from millions to over 100 billion parameters, we demonstrate that merging
checkpoints trained with constant learning rates not only achieves significant
performance improvements but also enables accurate prediction of annealing
behavior. These improvements lead to both more efficient model development and
significantly lower training costs. Our detailed ablation studies on merging
strategies and hyperparameters provide new insights into the underlying
mechanisms while uncovering novel applications. Through comprehensive
experimental analysis, we offer the open-source community practical
pre-training guidelines for effective model merging.
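As a rough illustration of what merging checkpoints can mean, the sketch below averages parameter values across several checkpoints. It assumes simple (optionally weighted) averaging; the abstract does not state which merging strategy the authors use, and the `Checkpoint` type and `merge_checkpoints` function are hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional

# A checkpoint is modelled as a mapping from parameter name to a flat list of floats.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: List[Checkpoint],
    weights: Optional[List[float]] = None,
) -> Checkpoint:
    """Return the weighted average of the given checkpoints, parameter by parameter.

    With weights=None every checkpoint contributes equally.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy checkpoints from the same training run.
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 4.0]}
uniform = merge_checkpoints([ckpt_a, ckpt_b])
weighted = merge_checkpoints([ckpt_a, ckpt_b], weights=[1.0, 3.0])
```

In practice the same loop would run over framework tensors rather than Python lists; the lists keep the sketch dependency-free.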
♻ ☆ Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models KDD 2025
Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, Jun Wang
In this work, we explore the potential of large language models (LLMs) for
generating functional test scripts, which necessitates understanding the
dynamically evolving code structure of the target software. To achieve this, we
propose a case-based reasoning (CBR) system utilizing a 4R cycle (i.e.,
retrieve, reuse, revise, and retain), which maintains and leverages a case bank
of test intent descriptions and corresponding test scripts to help LLMs
generate test scripts. To improve user experience further, we introduce
Re4, an optimization method for the CBR system, comprising reranking-based
retrieval finetuning and reinforced reuse finetuning. Specifically, we first
identify positive examples with high semantic and script similarity, providing
reliable pseudo-labels for finetuning the retriever model without costly
labeling. Then, we apply supervised finetuning, followed by a reinforcement
learning finetuning stage, to align LLMs with our production scenarios,
ensuring the faithful reuse of retrieved cases. Extensive experimental results
on two product development units from Huawei Datacom demonstrate the
superiority of the proposed CBR+Re4. Notably, we also show that the proposed
Re4 method can help alleviate the repetitive generation issues with LLMs.
comment: Accepted by KDD 2025 (ADS Track)
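The 4R cycle (retrieve, reuse, revise, retain) named in the abstract can be sketched as a small loop over a case bank. This is an illustrative sketch only, not the authors' system: retrieval uses token-overlap (Jaccard) similarity in place of a finetuned retriever, the `reuse` and `revise` steps are caller-supplied stand-ins for the LLM and review stages, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Case:
    intent: str  # natural-language description of what the test should check
    script: str  # the corresponding test script


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two intent descriptions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


@dataclass
class CaseBank:
    cases: List[Case] = field(default_factory=list)

    def retrieve(self, intent: str, k: int = 1) -> List[Case]:
        """Retrieve: return the k stored cases most similar to the new intent."""
        ranked = sorted(self.cases, key=lambda c: jaccard(intent, c.intent), reverse=True)
        return ranked[:k]

    def retain(self, case: Case) -> None:
        """Retain: store a solved case for future retrieval."""
        self.cases.append(case)


def run_4r_cycle(
    bank: CaseBank,
    intent: str,
    reuse: Callable[[str, List[Case]], str],
    revise: Callable[[str], str],
) -> str:
    retrieved = bank.retrieve(intent)   # retrieve similar cases
    draft = reuse(intent, retrieved)    # reuse: adapt them into a draft script
    final = revise(draft)               # revise: fix or review the draft
    bank.retain(Case(intent, final))    # retain the new case
    return final


bank = CaseBank([
    Case("login with invalid password", "test_login_invalid()"),
    Case("export report as csv", "test_export()"),
])
result = run_4r_cycle(
    bank,
    "login with valid password",
    reuse=lambda intent, cases: cases[0].script.replace("invalid", "valid"),
    revise=str.strip,
)
```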
♻ ☆ MentalMAC: Enhancing Large Language Models for Detecting Mental Manipulation via Multi-Task Anti-Curriculum Distillation
Mental manipulation is a subtle yet pervasive form of psychological abuse
that poses serious threats to mental health. Its covert nature and the
complexity of manipulation strategies make it challenging to detect, even for
state-of-the-art large language models (LLMs). This concealment also hinders
the manual collection of large-scale, high-quality annotations essential for
training effective models. Although recent efforts have sought to improve LLMs'
performance on this task, progress remains limited due to the scarcity of
real-world annotated datasets. To address these challenges, we propose
MentalMAC, a multi-task anti-curriculum distillation method that enhances LLMs'
ability to detect mental manipulation in multi-turn dialogue. Our approach
includes: (i) EvoSA, an unsupervised data expansion method based on
evolutionary operations and speech act theory; (ii) teacher model-generated
multi-task supervision; and (iii) progressive knowledge distillation from
complex to simpler tasks. We then constructed the ReaMent dataset with 5,000
real-world dialogue samples, using a MentalMAC-distilled model to assist human
annotation. Extensive experiments demonstrate that our method significantly narrows
the gap between student and teacher models and outperforms competitive LLMs
across key evaluation metrics. All code, datasets, and checkpoints will be
released upon paper acceptance. Warning: This paper contains content that may
be offensive to readers.
♻ ☆ Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
Large-scale Transformer language models (LMs) trained solely on next-token
prediction with web-scale data can solve a wide range of tasks after seeing
just a few examples. The mechanism behind this capability, known as in-context
learning (ICL), remains both controversial and poorly understood. Some studies
argue that it is merely the result of memorizing vast amounts of data, while
others contend that it reflects a fundamental, symbolic algorithmic development
in LMs. In this work, we introduce a suite of investigative tasks and a novel
method to systematically investigate ICL by leveraging the full Pythia scaling
suite, including interim checkpoints that capture progressively larger amounts
of training data. By carefully exploring ICL performance on downstream tasks
and simultaneously conducting a mechanistic analysis of the residual stream's
subspace, we demonstrate that ICL extends beyond mere "memorization" of the
training corpus, yet does not amount to the implementation of an independent
symbolic algorithm. Our results also clarify several aspects of ICL, including
the influence of training dynamics, model capabilities, and elements of
mechanistic interpretability. Overall, our work advances the understanding of
ICL and its implications, offering model developers insights into potential
improvements and providing AI security practitioners with a basis for more
informed guidelines.
♻ ☆ Language Models are Universal Embedders ACL 2025
In the large language model (LLM) revolution, embedding is a key component of
various systems, such as retrieving knowledge or memories for LLMs or building
content moderation filters. As such cases span from English to other natural or
programming languages, from retrieval to classification and beyond, it is
advantageous to build a unified embedding model rather than dedicated ones for
each scenario. In this context, the pre-trained multilingual decoder-only large
language models, e.g., BLOOM, emerge as a viable backbone option. To assess
their potential, we propose straightforward strategies for constructing
embedders and introduce a universal evaluation benchmark. Experimental results
show that our trained model is proficient at generating good embeddings across
languages and tasks, even extending to languages and tasks for which no
finetuning/pretraining data is available. We also present detailed analyses and
additional evaluations. We hope that this work could encourage the development
of more robust open-source universal embedders.
comment: XLLM Workshop, ACL 2025
♻ ☆ BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism ACL
Martin Fajcik, Martin Docekal, Jan Dolezal, Karel Ondrej, Karel Beneš, Jan Kapsa, Pavel Smrz, Alexander Polok, Michal Hradis, Zuzana Neverilova, Ales Horak, Radoslav Sabol, Michal Stefanik, Adam Jirkovsky, David Adamczyk, Petr Hyner, Jan Hula, Hynek Kydlicek
We present BenCzechMark (BCM), the first comprehensive Czech language
benchmark designed for large language models, offering diverse tasks, multiple
task formats, and multiple evaluation metrics. Its duel scoring system is
grounded in statistical significance theory and uses aggregation across tasks
inspired by social preference theory. Our benchmark encompasses 50 challenging
tasks, with corresponding test datasets, primarily in native Czech, with 14
newly collected ones. These tasks span 8 categories and cover diverse domains,
including historical Czech news, essays from pupils or language learners, and
spoken word. Furthermore, we collect and clean BUT-Large Czech Collection, the
largest publicly available clean Czech language corpus, and use it for (i)
contamination analysis and (ii) continuous pretraining of the first
Czech-centric 7B language model with Czech-specific tokenization. We use our
model as a baseline for comparison with publicly available multilingual models.
Lastly, we release and maintain a leaderboard with 50 existing model
submissions, where new model submissions can be made at
https://huggingface.co/spaces/CZLC/BenCzechMark.
comment: Accepted to TACL
♻ ☆ Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type NAACL 2025
Conceptual combination is a cognitive process that merges basic concepts,
enabling the creation of complex expressions. During this process, the
properties of the combination (e.g., the whiteness of a peeled apple) can be
inherited from basic concepts, newly emerge, or be canceled. However, previous
studies have evaluated a limited set of properties and have not examined the
generative process. To address this gap, we introduce the Conceptual
Combination with Property Type dataset (CCPT), which consists of 12.3K
annotated triplets of noun phrases, properties, and property types. Using CCPT,
we establish three types of tasks to evaluate LLMs for conceptual combination
thoroughly. Our key findings are threefold: (1) Our automatic metric grading
property emergence and cancellation closely corresponds with human judgments.
(2) LLMs, including OpenAI's o1, struggle to generate noun phrases which
possess given emergent properties. (3) Our proposed method, inspired by a
cognitive psychology model that explains how relationships between concepts are
formed, improves performance in all generative tasks. The dataset and
experimental code are available at https://github.com/seokwon99/CCPT.git.
comment: NAACL 2025 Oral
♻ ☆ M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis
Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Yun Xue, Barbara Plank
Aspect-based sentiment analysis (ABSA) is a crucial task in information
extraction and sentiment analysis, aiming to identify aspects with associated
sentiment elements in text. However, existing ABSA datasets are predominantly
English-centric, limiting the scope for multilingual evaluation and research.
To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
domains and 21 languages, making it the most extensive multilingual parallel
dataset for ABSA to date. Our primary focus is on triplet extraction, which
involves identifying aspect terms, aspect categories, and sentiment polarities.
The dataset is constructed through an automatic translation process with human
review to ensure quality. We perform extensive experiments using various
baselines to assess performance and compatibility on M-ABSA. Our empirical
findings highlight that the dataset enables diverse evaluation tasks, such as
multilingual and multi-domain transfer learning, and large language model
evaluation, underscoring its inclusivity and its potential to drive
advancements in multilingual ABSA research.