Computation and Language
☆ Are Large Reasoning Models Interruptible?
Large Reasoning Models (LRMs) excel at complex reasoning but are
traditionally evaluated in static, "frozen world" settings: model responses are
assumed to be instantaneous, and the context of a request is presumed to be
immutable over the duration of the response. While generally true for
short-term tasks, the "frozen world" assumption breaks down in modern reasoning
tasks such as assistive programming, where models may take hours to think
through problems and code may change dramatically from the time the model
starts thinking to the model's final output. In this work, we challenge the
frozen world assumption and evaluate LRM robustness under two realistic dynamic
scenarios: interruptions, which test the quality of the model's partial outputs
on a limited budget, and dynamic context, which tests model adaptation to
in-flight changes. Across mathematics and programming benchmarks that require
long-form reasoning, static evaluations consistently overestimate robustness:
even state-of-the-art LRMs, which achieve high accuracy in static settings, can
fail unpredictably when interrupted or exposed to changing context, with
performance dropping by up to 60% when updates are introduced late in the
reasoning process. Our analysis further reveals several novel failure modes,
including reasoning leakage, where models fold the reasoning into their final
answer when interrupted; panic, where under time pressure models abandon
reasoning entirely and return incorrect answers; and self-doubt, where
performance degrades while incorporating updated information.
comment: Project Page: https://dynamic-lm.github.io/
☆ Demystifying Reinforcement Learning in Agentic Reasoning
Recently, the emergence of agentic RL has showcased that RL could also
effectively improve the agentic reasoning ability of LLMs, yet the key design
principles and optimal practices remain unclear. In this work, we conduct a
comprehensive and systematic investigation to demystify reinforcement learning
in agentic reasoning from three key perspectives: data, algorithm, and
reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic
trajectories with real end-to-end tool-use trajectories yields a far stronger
SFT initialization; high-diversity, model-aware datasets sustain exploration
and markedly improve RL performance. (ii) Exploration-friendly techniques are
crucial for agentic RL: clip higher, overlong reward shaping, and maintaining
adequate policy entropy can all improve training efficiency.
(iii) A deliberative strategy with fewer tool calls outperforms frequent tool
calls or verbose self-reasoning, improving tool efficiency and final accuracy.
Together, these simple practices consistently enhance agentic reasoning and
training efficiency, achieving strong results on challenging benchmarks with
smaller models, and establishing a practical baseline for future agentic RL
research. Beyond these empirical insights, we further contribute a
high-quality, real end-to-end agentic SFT dataset along with a high-quality RL
dataset, and demonstrate the effectiveness of our insights in boosting the
agentic reasoning ability of LLMs across four challenging benchmarks, including
AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes,
4B-sized models could also achieve superior agentic reasoning performance
compared to 32B-sized models. Code and models:
https://github.com/Gen-Verse/Open-AgentRL
comment: Code and models: https://github.com/Gen-Verse/Open-AgentRL
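The "clip higher" technique named in insight (ii) is commonly described as an asymmetric variant of the PPO clipped surrogate, in which the upper clipping bound is looser than the lower one so that low-probability tokens can still gain probability mass. The following is a minimal sketch under that assumption; the function name and the epsilon values are illustrative and are not taken from the abstract.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token PPO-style clipped objective with asymmetric bounds.

    ratio     : pi_new(token) / pi_old(token)
    advantage : estimated advantage for the token
    eps_low, eps_high : illustrative clipping ranges (assumption);
                        "clip higher" means eps_high > eps_low.
    """
    # Clip the probability ratio into [1 - eps_low, 1 + eps_high].
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (min) combination, as in the standard PPO surrogate.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage and a ratio of 1.25, the symmetric setting (`eps_high=0.2`) caps the objective at 1.2, while the looser upper bound leaves the update unclipped, which is the sense in which the technique is exploration-friendly.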
☆ QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for
large language models (LLMs). While RL is essential for LLMs' reasoning
capabilities, it is resource-intensive, requiring substantial GPU memory and
long rollout durations. QeRL addresses these issues by combining NVFP4
quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL
while reducing memory overhead. Beyond efficiency, our findings show that
quantization noise increases policy entropy, enhancing exploration, and
enabling the discovery of better strategies during RL. To further optimize
exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism,
which dynamically adjusts noise during training. Experiments demonstrate that
QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is
the first framework to enable RL training of a 32B LLM on a single H100 80GB
GPU, while delivering overall speedups for RL training. It also achieves faster
reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while
matching the performance of full-parameter fine-tuning on mathematical
benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These
results establish QeRL as an efficient and effective framework for RL training
in LLMs.
comment: Code is available at https://github.com/NVlabs/QeRL
☆ When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou
Although Large Language Model (LLM)-based agents are increasingly used in
financial trading, it remains unclear whether they can reason and adapt in live
markets, as most studies test models instead of agents, cover limited periods
and assets, and rely on unverified data. To address these gaps, we introduce
Agent Market Arena (AMA), the first lifelong, real-time benchmark for
evaluating LLM-based trading agents across multiple markets. AMA integrates
verified trading data, expert-checked news, and diverse agent architectures
within a unified trading framework, enabling fair and continuous comparison
under real conditions. It implements four agents, including InvestorAgent as a
single-agent baseline, TradeAgent and HedgeFundAgent with different risk
styles, and DeepFundAgent with memory-based reasoning, and evaluates them
across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and
Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets
demonstrate that agent frameworks display markedly distinct behavioral
patterns, spanning from aggressive risk-taking to conservative decision-making,
whereas model backbones contribute less to outcome variation. AMA thus
establishes a foundation for rigorous, reproducible, and continuously evolving
evaluation of financial reasoning and trading intelligence in LLM-based agents.
☆ Scaling Language-Centric Omnimodal Representation Learning NeurIPS 2025
Recent multimodal embedding approaches leveraging multimodal large language
models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising
results, yet the underlying reasons behind their superiority remain
underexplored. This work argues that a crucial advantage of MLLM-based
approaches stems from implicit cross-modal alignment achieved during generative
pretraining, where the language decoder learns to exploit multimodal signals
within a shared representation space for generating unimodal outputs. Through
analysis of anisotropy and kernel similarity structure, we empirically confirm
that latent alignment emerges within MLLM representations, allowing CL to serve
as a lightweight refinement stage. Leveraging this insight, we propose a
Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive
experiments across diverse backbones and benchmarks demonstrate its
effectiveness, achieving state-of-the-art performance across modalities.
Furthermore, we identify a Generation-Representation Scaling Law (GRSL),
showing that the representational capabilities gained through contrastive
refinement scale positively with the MLLM's generative capabilities. This
suggests that improving generative abilities is an effective paradigm
for enhancing representation quality. We provide a theoretical explanation of
GRSL, which formally links the MLLM's generative quality to the upper bound on
its representation performance, and validate it on a challenging, low-resource
visual-document retrieval task, showing that continual generative pretraining
before CL can further enhance the potential of a model's embedding
capabilities. Codes, models, and resources are available at
https://github.com/LCO-Embedding/LCO-Embedding.
comment: NeurIPS 2025
☆ Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
A key challenge in applying reinforcement learning (RL) to diffusion large
language models (dLLMs) lies in the intractability of their likelihood
functions, which are essential for the RL objective, necessitating
corresponding approximation in each training step. While existing methods
approximate the log-likelihoods by their evidence lower bounds (ELBOs) via
customized Monte Carlo (MC) sampling, the forward computational graphs of all
MC samples need to be retained for the gradient computation of non-linear terms
in the RL objective, resulting in significant memory overhead. This constraint
restricts feasible sample sizes, leading to imprecise likelihood approximations
and ultimately distorting the RL objective. To overcome this limitation, we
propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient
RL algorithm that maximizes a specially constructed lower bound of the
ELBO-based objective. This lower bound is carefully designed to satisfy two key
properties: (1) Linearity: it is formulated in a linear sum where each term
depends only on a single MC sample, thereby enabling gradient accumulation
across samples and ensuring constant memory usage; (2) Equivalence: both the
value and gradient of this lower bound are equal to those of the ELBO-based
objective in on-policy training, making it also an effective approximation for
the original RL objective. These properties allow BGPO to adopt a large MC
sample size, resulting in more accurate likelihood approximations and improved
RL objective estimation, which in turn leads to enhanced performance.
Experiments show that BGPO significantly outperforms previous RL algorithms for
dLLMs in math problem solving, code generation, and planning tasks.
☆ FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection
Financial markets face growing threats from misinformation that can trigger
billions in losses in minutes. Most existing approaches lack transparency in
their decision-making and provide limited attribution to credible sources. We
introduce FinVet, a novel multi-agent framework that integrates two
Retrieval-Augmented Generation (RAG) pipelines with external fact-checking
through a confidence-weighted voting mechanism. FinVet employs adaptive
three-tier processing that dynamically adjusts verification strategies based on
retrieval confidence, from direct metadata extraction to hybrid reasoning to
full model-based analysis. Unlike existing methods, FinVet provides
evidence-backed verdicts, source attribution, confidence scores, and explicit
uncertainty flags when evidence is insufficient. Experimental evaluation on the
FinFact dataset shows that FinVet achieves an F1 score of 0.85, a 10.4%
improvement over the best individual pipeline (the fact-check pipeline) and a
37% improvement over standalone RAG approaches.
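The confidence-weighted voting step described above can be pictured with a small sketch. The abstract does not specify how FinVet weights or thresholds votes, so everything here is an assumption made for illustration: the labels, the `min_share` threshold, and the rule of summing confidences per label are all hypothetical.

```python
from collections import defaultdict

def weighted_vote(verdicts, min_share=0.5):
    """Combine (label, confidence) pairs from several pipelines.

    verdicts  : list of (label, confidence), confidence in [0, 1]
    min_share : hypothetical threshold (assumption); below it the
                result is flagged as uncertain instead of forced.
    Returns (label, share), where share is the winning label's
    fraction of the total confidence mass.
    """
    totals = defaultdict(float)
    for label, confidence in verdicts:
        totals[label] += confidence
    total = sum(totals.values())
    if total == 0:
        return "uncertain", 0.0
    label, score = max(totals.items(), key=lambda item: item[1])
    share = score / total
    return (label if share >= min_share else "uncertain"), share
```

For example, `weighted_vote([("false", 0.9), ("true", 0.4), ("false", 0.6)])` returns the label `"false"`, whereas three disagreeing pipelines with equal confidence fall below the threshold and yield an explicit uncertainty flag.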
☆ ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
In recent years, the research focus of large language models (LLMs) and
agents has shifted increasingly from demonstrating novel capabilities to
complex reasoning and tackling challenging tasks. However, existing evaluations
focus mainly on math/code contests or general tasks, while existing
multi-domain academic benchmarks lack sufficient reasoning depth, leaving the
field without a rigorous benchmark for high-level reasoning. To fill this gap,
we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs
and agents to acquire and reason over academic knowledge. It consists of 50
expert-annotated academic problems across five high-reasoning domains,
including computer science, economics, law, mathematics, and philosophy. All
questions are sourced from top-tier publications in recent years and undergo
rigorous annotation and quality control to ensure they are both challenging and
answerable. We conduct systematic evaluations of over 10 mainstream LLMs and
agents. The results show that most LLMs scored below 20 points, with even the
cutting-edge GPT-5 achieving only 16 points. While agents achieved higher
scores, none exceeded 40 points. This demonstrates the current capability gap
between LLMs and agents in super-intelligent academic research tasks and
highlights the challenges of Acadreason.
☆ Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Inference-time scaling enhances the reasoning ability of a language model
(LM) by extending its chain-of-thought (CoT). However, existing approaches
typically generate the entire reasoning chain in a single forward pass, which
often leads to CoT derailment, i.e., the reasoning trajectory drifting off
course due to compounding errors. This problem is particularly severe for
smaller LMs with long CoTs due to their limited capacity. To address this, we
analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning
and execution steps. Our analysis reveals that most reasoning errors stem from
incorrect planning. Motivated by this observation, we propose Multi-Path Plan
Aggregation (MPPA), a framework that augments single-pass reasoning with plan
exploration and aggregation. Following a variable interval schedule based on
the token position, MPPA generates multiple candidate plans and aggregates them
into a refined planning step. To maintain efficiency, we adopt a minimal design
in which the base LM serves as the primary policy, while a lightweight LoRA
module implements the plan aggregation policy. We further observe that
outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K
tokens). To overcome this, we introduce online Step-DPO, a process-level
preference optimization scheme that leverages Twisted Sequential Monte Carlo
(TSMC) to provide scalable stepwise supervision using small LMs. This yields
more efficient training, improved stability, and higher accuracy. Extensive
experiments on challenging math, science, and logical reasoning benchmarks
demonstrate that, with only 10% SFT data and 5% of preference pairs, our method
outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward
RL baseline across multiple base models and tasks.
☆ StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Human writers often begin their stories with an overarching mental scene,
where they envision the interactions between characters and their environment.
Inspired by this creative process, we propose a novel approach to long-form
story generation, termed hybrid bottom-up long-form story generation, using
multi-agent simulations. In our method, agents interact within a dynamic
sandbox environment, where their behaviors and interactions with one another
and the environment generate emergent events. These events form the foundation
for the story, enabling organic character development and plot progression.
Unlike traditional top-down approaches that impose rigid structures, our hybrid
bottom-up approach allows for the natural unfolding of events, fostering more
spontaneous and engaging storytelling. The system is capable of generating
stories exceeding 10,000 words while maintaining coherence and consistency,
addressing some of the key challenges faced by current story generation models.
We achieve state-of-the-art performance across several metrics. This approach
offers a scalable and innovative solution for creating dynamic, immersive
long-form stories that evolve organically from agent-driven interactions.
comment: Project: https://storyboxproject.github.io
☆ LLM-Oriented Token-Adaptive Knowledge Distillation
Knowledge distillation (KD) is a key technique for compressing large-scale
language models (LLMs), yet prevailing logit-based methods typically employ
static strategies that are misaligned with the dynamic learning process of
student models. These methods typically treat all tokens indiscriminately and
apply a single, fixed temperature, resulting in suboptimal knowledge transfer.
To address these limitations, we propose LLM-Oriented Token-Adaptive Knowledge
Distillation (AdaKD), a novel framework that adapts the distillation process to
the real-time learning state of each token. AdaKD consists of two synergistic
modules driven by a unified token difficulty metric. First, our Loss-Driven
Adaptive Token Focusing (LATF) module dynamically adjusts the distillation
focus by monitoring the student's learning stability, concentrating
computational resources on the most valuable tokens at each training phase.
Second, we introduce Inverse Difficulty Temperature Scaling (IDTS), a
counterintuitive yet effective token-level temperature strategy. It employs low
temperatures for difficult tokens for targeted error correction, and high
temperatures for easy tokens to encourage students to learn from the teacher's
complete and smooth output distribution, thereby enhancing generalization. As a
plug-and-play framework, AdaKD can consistently improve the performance of
various distillation methods on multiple model architectures and benchmarks.
comment: 15 pages, 4 figures
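The idea behind Inverse Difficulty Temperature Scaling can be sketched as a per-token temperature that decreases as difficulty increases. The abstract does not give the difficulty metric or the mapping, so the linear interpolation, the `t_min`/`t_max` values, and the assumption that difficulty is already normalized to [0, 1] are all hypothetical.

```python
import math

def token_temperature(difficulty, t_min=0.5, t_max=2.0):
    """Map difficulty in [0, 1] to a temperature: harder -> lower.

    The linear mapping and the bounds are illustrative assumptions.
    """
    d = min(max(difficulty, 0.0), 1.0)
    return t_max - d * (t_max - t_min)

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def token_kd_loss(teacher_logits, student_logits, difficulty):
    """KL(teacher || student) for one token at its own temperature."""
    t = token_temperature(difficulty)
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

At low temperature the teacher distribution is sharper, so the loss concentrates on the top token (targeted correction); at high temperature it is smoother, exposing more of the teacher's full distribution.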
☆ Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
The success of Transformer language models is widely credited to their
dot-product attention mechanism, which interweaves a set of key design
principles: mixing information across positions (enabling multi-token
interactions), sequence-dependent activations (where attention weights adapt to
each input), a specific mathematical form (dot-product similarities plus
softmax weighting), and coupling of queries and keys to evolving hidden states
(grounding attention in the current layer). However, the necessity of each of
these principles remains largely untested. In this work, we systematically
deconstruct attention by designing controlled variants that selectively relax
these principles, applied both uniformly across all layers and in hybrid
architectures where only some layers retain standard attention. Our empirical
analysis reveals that mechanisms for mixing tokens are indispensable, as their
absence collapses models to near-random behavior, while the exact mathematical
form and sequence dependency can be substantially relaxed, especially when
preserved in just a subset of layers. Surprisingly, even variants that fail in
isolation can achieve robust performance when interleaved with standard
attention, highlighting a cooperative effect. These findings deepen our
understanding of what truly underpins attention's effectiveness and open new
avenues for simplifying language models without sacrificing performance.
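For reference, the "specific mathematical form" mentioned above (dot-product similarities plus softmax weighting) can be written in a few lines. This is a minimal single-head sketch without masking or learned projections. The `uniform_mixing` function is a hypothetical example of relaxing sequence dependence while keeping token mixing; it is not claimed to be one of the paper's variants.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no mask)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # input-dependent mixing weights
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def uniform_mixing(values, n_queries):
    """Hypothetical relaxed variant: fixed, equal weights per position."""
    n = len(values)
    mean = [sum(v[j] for v in values) / n for j in range(len(values[0]))]
    return [list(mean) for _ in range(n_queries)]
```

When every query is the zero vector, all scores are equal and standard attention reduces to the uniform-mixing case, which makes the difference between "mixing across positions" and "sequence-dependent weights" concrete.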
☆ SemCSE-Multi: Multifaceted and Decodable Embeddings for Aspect-Specific and Interpretable Scientific Domain Mapping
We propose SemCSE-Multi, a novel unsupervised framework for generating
multifaceted embeddings of scientific abstracts, evaluated in the domains of
invasion biology and medicine. These embeddings capture distinct, individually
specifiable aspects in isolation, thus enabling fine-grained and controllable
similarity assessments as well as adaptive, user-driven visualizations of
scientific domains. Our approach relies on an unsupervised procedure that
produces aspect-specific summarizing sentences and trains embedding models to
map semantically related summaries to nearby positions in the embedding space.
We then distill these aspect-specific embedding capabilities into a unified
embedding model that directly predicts multiple aspect embeddings from a
scientific abstract in a single, efficient forward pass. In addition, we
introduce an embedding decoding pipeline that decodes embeddings back into
natural language descriptions of their associated aspects. Notably, we show
that this decoding remains effective even for unoccupied regions in
low-dimensional visualizations, thus offering vastly improved interpretability
in user-centric settings.
☆ MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models
Low-Rank Adaptation (LoRA) has emerged as one of the most widely used
parameter-efficient fine-tuning (PEFT) methods for adapting large language
models (LLMs) to downstream tasks. While highly effective in single-task
settings, it struggles to efficiently leverage inter-task knowledge in complex
multi-task learning scenarios, often requiring substantial task-specific data
to achieve optimal performance. To address this limitation, we introduce
MeTA-LoRA, a two-stage optimization framework that significantly improves data
efficiency in multi-task adaptation. In the first stage, task-specific LoRA
adapters are learned using only a few samples from each involved dataset,
enabling rapid adaptation without large-scale supervision. In the second stage,
the shared LoRA adapter is updated by aggregating gradients from multiple tasks
to promote knowledge transfer across tasks, further reducing data usage by
leveraging common patterns. In both multi-task learning and multilingual
learning scenarios, our method matches or surpasses the performance of
traditional full-data LoRA fine-tuning approaches, while using significantly
less task-specific data.
☆ REGENT: Relevance-Guided Attention for Entity-Aware Multi-Vector Neural Re-Ranking SIGIR
Current neural re-rankers often struggle with complex information needs and
long, content-rich documents. The fundamental issue is not computational; it is
intelligent content selection: identifying what matters in lengthy,
multi-faceted texts. While humans naturally anchor their understanding around
key entities and concepts, neural models process text within rigid token
windows, treating all interactions as equally important and missing critical
semantic signals. We introduce REGENT, a neural re-ranking model that mimics
human-like understanding by using entities as a "semantic skeleton" to guide
attention. REGENT integrates relevance guidance directly into the attention
mechanism, combining fine-grained lexical matching with high-level semantic
reasoning. This relevance-guided attention enables the model to focus on
conceptually important content while maintaining sensitivity to precise term
matches. REGENT achieves new state-of-the-art performance on three challenging
datasets, providing up to 108% improvement over BM25 and consistently
outperforming strong baselines including ColBERT and RankVicuna. To our
knowledge, this is the first work to successfully integrate entity semantics
directly into neural attention, establishing a new paradigm for entity-aware
information retrieval.
comment: To be published in: Proceedings of the 2025 Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval in the
Asia Pacific Region (SIGIR-AP 2025)
☆ QDER: Query-Specific Document and Entity Representations for Multi-Vector Document Re-Ranking SIGIR
Neural IR has advanced through two distinct paths: entity-oriented approaches
leveraging knowledge graphs and multi-vector models capturing fine-grained
semantics. We introduce QDER, a neural re-ranking model that unifies these
approaches by integrating knowledge graph semantics into a multi-vector model.
QDER's key innovation lies in its modeling of query-document relationships:
rather than computing similarity scores on aggregated embeddings, we maintain
individual token and entity representations throughout the ranking process,
performing aggregation only at the final scoring stage - an approach we call
"late aggregation." We first transform these fine-grained representations
through learned attention patterns, then apply carefully chosen mathematical
operations for precise matches. Experiments across five standard benchmarks
show that QDER achieves significant performance gains, with improvements of 36%
in nDCG@20 over the strongest baseline on TREC Robust 2004 and similar
improvements on other datasets. QDER particularly excels on difficult queries,
achieving an nDCG@20 of 0.70 where traditional approaches fail completely
(nDCG@20 = 0.0), setting a foundation for future work in entity-aware
retrieval.
comment: Published in: Proceedings of the 48th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2025)
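The nDCG@20 figures quoted above follow the usual definition of normalized discounted cumulative gain. A minimal sketch is given below, assuming the linear-gain form (some evaluation tools use an exponential gain, 2^rel - 1, instead) and assuming the input list contains the graded relevance of every judged document in ranked order; the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=20):
    """DCG of the ranking divided by DCG of the ideal reordering.

    Returns 0.0 when no relevant document exists, matching the
    nDCG@20 = 0.0 case of a ranking that fails completely.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in decreasing relevance scores 1.0, and moving the only relevant document from rank 1 to rank 2 lowers the score to 1 / log2(3).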
☆ Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models
Many in-silico simulations of human survey responses with large language
models (LLMs) focus on generating closed-ended survey responses, whereas LLMs
are typically trained to generate open-ended text instead. Previous research
has used a diverse range of methods for generating closed-ended survey
responses with LLMs, and a standard practice remains to be identified. In this
paper, we systematically investigate the impact that various Survey Response
Generation Methods have on predicted survey responses. We present the results
of 32 million simulated survey responses across 8 Survey Response Generation
Methods, 4 political attitude surveys, and 10 open-weight language models. We
find significant differences between the Survey Response Generation Methods in
both individual-level and subpopulation-level alignment. Our results show that
Restricted Generation Methods perform best overall, and that reasoning output
does not consistently improve alignment. Our work underlines the significant
impact that Survey Response Generation Methods have on simulated survey
responses, and we develop practical recommendations on the application of
Survey Response Generation Methods.
☆ LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings
Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the
model's link-prediction ability by removing or inserting triples. A recent
black-box method has attempted to incorporate textual and structural
information to enhance attack performance. However, it is unable to generate
human-readable explanations, and exhibits poor generalizability. In the past
few years, large language models (LLMs) have demonstrated powerful capabilities
in text comprehension, generation, and reasoning. In this paper, we propose
LLMAtKGE, a novel LLM-based framework that selects attack targets and generates
human-readable explanations. To provide the LLM with sufficient factual context
under limited input constraints, we design a structured prompting scheme that
explicitly formulates the attack as multiple-choice questions while
incorporating KG factual evidence. To address the context-window limitation and
hesitation issues, we introduce semantics-based and centrality-based filters,
which compress the candidate set while preserving high recall of
attack-relevant information. Furthermore, to efficiently integrate both
semantic and structural information into the filter, we precompute high-order
adjacency and fine-tune the LLM with a triple classification task to enhance
filtering performance. Experiments on two widely used knowledge graph datasets
demonstrate that our attack outperforms the strongest black-box baselines and
provides explanations via reasoning, while showing competitive performance
compared with white-box methods. Comprehensive ablation and case studies
further validate its capability to generate explanations.
comment: 13 pages
☆ Bag of Tricks for Subverting Reasoning-based Safety Guardrails
Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu
Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs),
such as deliberative alignment, have shown strong defense against jailbreak
attacks. By leveraging LRMs' reasoning ability, these guardrails help the
models to assess the safety of user inputs before generating final responses.
The powerful reasoning ability can analyze the intention of the input query and
will refuse to assist once it detects the harmful intent hidden by the
jailbreak methods. Such guardrails have shown a significant boost in defense,
such as the near-perfect refusal rates on the open-source gpt-oss series.
Unfortunately, we find that these powerful reasoning-based guardrails can be
extremely vulnerable to subtle manipulation of the input prompts, and once
hijacked, can lead to even more harmful results. Specifically, we first uncover
a surprisingly fragile aspect of these guardrails: simply adding a few template
tokens to the input prompt can successfully bypass the seemingly powerful
guardrails and lead to explicit and harmful responses. To explore further, we
introduce a bag of jailbreak methods that subvert the reasoning-based
guardrails. Our attacks span white-, gray-, and black-box settings and range
from effortless template manipulations to fully automated optimization. Along
with the potential for scalable implementation, these methods also achieve
alarmingly high attack success rates (e.g., exceeding 90% across 5 different
benchmarks on the gpt-oss series, for both locally hosted models and online API
services). Evaluations across various leading open-source LRMs confirm that
these vulnerabilities are systemic, underscoring the urgent need for stronger
alignment techniques for open-sourced LRMs to prevent malicious misuse. Code is
open-sourced at https://chenxshuo.github.io/bag-of-tricks.
comment: OpenAI Red-teaming Challenge Winner and Oral Presentation
☆ Culturally-Aware Conversations: A Framework & Benchmark for LLMs EMNLP
Existing benchmarks that measure cultural adaptation in LLMs are misaligned
with the actual challenges these models face when interacting with users from
diverse cultural backgrounds. In this work, we introduce the first framework
and benchmark designed to evaluate LLMs in realistic, multicultural
conversational settings. Grounded in sociocultural theory, our framework
formalizes how linguistic style - a key element of cultural communication - is
shaped by situational, relational, and cultural context. We construct a
benchmark dataset based on this framework, annotated by culturally diverse
raters, and propose a new set of desiderata for cross-cultural evaluation in
NLP: conversational framing, stylistic sensitivity, and subjective correctness.
We evaluate today's top LLMs on our benchmark and show that these models
struggle with cultural adaptation in a conversational setting.
comment: To appear at the 4th HCI + NLP Workshop @ EMNLP
☆ Invisible Languages of the LLM Universe
Large Language Models are trained on massive multilingual corpora, yet this
abundance masks a profound crisis: of the world's 7,613 living languages,
approximately 2,000 languages with millions of speakers remain effectively
invisible in digital ecosystems. We propose a critical framework connecting
empirical measurements of language vitality (real-world demographic strength)
and digitality (online presence) with postcolonial theory and epistemic
injustice to explain why linguistic inequality in AI systems is not incidental
but structural. Analyzing data across all documented human languages, we
identify four categories: Strongholds (33%, high vitality and digitality),
Digital Echoes (6%, high digitality despite declining vitality), Fading Voices
(36%, low on both dimensions), and critically, Invisible Giants (27%, high
vitality but near-zero digitality) - languages spoken by millions yet absent
from the LLM universe. We demonstrate that these patterns reflect continuities
from colonial-era linguistic hierarchies to contemporary AI development,
constituting what we term digital epistemic injustice. Our analysis reveals
that English dominance in AI is not a technical necessity but an artifact of
power structures that systematically exclude marginalized linguistic knowledge.
We conclude with implications for decolonizing language technology and
democratizing access to AI benefits.
☆ Information-Preserving Reformulation of Reasoning Traces for Antidistillation
Recent advances in Large Language Models (LLMs) show that extending the
length of reasoning chains significantly improves performance on complex tasks.
While revealing these reasoning traces helps users better follow, verify, and
learn from the model's problem-solving process, it also makes them highly
vulnerable to unauthorized distillation. To mitigate this risk, proprietary
model providers often adopt aggressive protection strategies, such as replacing
detailed reasoning with brief summaries, which deprive users of valuable
intermediate information. To address this trade-off, we propose PART, an
information-preserving antidistillation reformulation of reasoning traces.
Motivated by the difference between how humans understand reasoning traces and
how LLMs exploit them for supervised fine-tuning, we design a simple but
effective two-step reformulation: removing self-talk behaviors and reordering
sub-conclusions. A small auxiliary model is trained to perform this
reformulation, incurring minimal computational overhead. Extensive experiments
demonstrate that PART consistently disrupts distillation across student models
of different sizes and types on various reasoning benchmarks. For instance,
when training on reformulated traces, even the performance of a large 32B
student model decreases from 54.17 to 46.88 on AIME 2024, corresponding to a
13.5% degradation.
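The 13.5% figure above is a relative drop, not a difference in points. A minimal arithmetic check, using only the two scores quoted in the abstract (the helper name is illustrative):

```python
def relative_drop(before: float, after: float) -> float:
    """Relative degradation in percent: the share of the original score that was lost."""
    return (before - after) / before * 100.0

# Scores quoted in the abstract (32B student model, AIME 2024).
before, after = 54.17, 46.88
print(f"absolute drop: {before - after:.2f} points")            # 7.29 points
print(f"relative drop: {relative_drop(before, after):.1f}%")    # 13.5%
```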
☆ An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification SP 2025
We propose a novel neural architecture named TextGraphFuseGAT, which
integrates a pretrained transformer encoder (PhoBERT) with Graph Attention
Networks for token-level classification tasks. The proposed model constructs a
fully connected graph over the token embeddings produced by PhoBERT, enabling
the GAT layer to capture rich inter-token dependencies beyond those modeled by
sequential context alone. To further enhance contextualization, a
Transformer-style self-attention layer is applied on top of the graph-enhanced
embeddings. The final token representations are passed through a classification
head to perform sequence labeling. We evaluate our approach on three Vietnamese
benchmark datasets: PhoNER-COVID19 for named entity recognition in the COVID-19
domain, PhoDisfluency for speech disfluency detection, and VietMed-NER for
medical-domain NER. VietMed-NER is the first Vietnamese medical spoken NER
dataset, featuring 18 entity types collected from real-world medical speech
transcripts and annotated with the BIO tagging scheme. Its specialized
vocabulary and domain-specific expressions make it a challenging benchmark for
token-level classification models. Experimental results show that our method
consistently outperforms strong baselines, including transformer-only and
hybrid neural models such as BiLSTM + CNN + CRF, confirming the effectiveness
of combining pretrained semantic features with graph-based relational modeling
for improved token classification across multiple domains.
comment: 11 pages, 1 figure. Submitted to VLSP 2025 and reviewed
☆ Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models
The detection of sophisticated hallucinations in Large Language Models (LLMs)
is hampered by a "Detection Dilemma": methods probing internal states
(Internal State Probing) excel at identifying factual inconsistencies but fail
on logical fallacies, while those verifying externalized reasoning
(Chain-of-Thought Verification) show the opposite behavior. This schism creates
a task-dependent blind spot: Chain-of-Thought Verification fails on
fact-intensive tasks like open-domain QA where reasoning is ungrounded, while
Internal State Probing is ineffective on logic-intensive tasks like
mathematical reasoning where models are confidently wrong. We resolve this with
a unified framework that bridges this critical gap. However, unification is
hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse
symbolic reasoning chains lack signals directly comparable to fine-grained
internal states, and the Representational Alignment Barrier, a deep-seated
mismatch between their underlying semantic spaces. To overcome these, we
introduce a multi-path reasoning mechanism to obtain more comparable,
fine-grained signals, and a segment-aware temporalized cross-attention module
to adaptively fuse these now-aligned representations, pinpointing subtle
dissonances. Extensive experiments on three diverse benchmarks and two leading
LLMs demonstrate that our framework consistently and significantly outperforms
strong baselines. Our code is available: https://github.com/peach918/HalluDet.
☆ ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
While Large Language Models (LLMs) excel at algorithmic code generation, they
struggle with front-end development, where correctness is judged on rendered
pixels and interaction. We present ReLook, an agentic, vision-grounded
reinforcement learning framework that empowers an agent to close a robust
generate-diagnose-refine loop by invoking a multimodal LLM (MLLM) as a tool.
During training, the agent uses the MLLM-in-the-loop both as a visual
critic (scoring code with screenshots) and as a source of actionable,
vision-grounded feedback; a strict zero-reward rule for invalid renders anchors
renderability and prevents reward hacking. To prevent behavioral collapse, we
introduce Forced Optimization, a strict acceptance rule that admits only
improving revisions, yielding monotonically better trajectories. At inference,
we decouple the critic and run a lightweight, critic-free self-edit cycle,
keeping latency comparable to base decoding while retaining most of the gains.
Across three widely used benchmarks, ReLook consistently outperforms strong
baselines in vision-grounded front-end code generation, highlighting the
benefits of agentic perception, visual rewards, and training-inference
decoupling.
☆ Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Text preprocessing is a fundamental component of Natural Language Processing,
involving techniques such as stopword removal, stemming, and lemmatization to
prepare text as input for further processing and analysis. Despite the
context-dependent nature of the above techniques, traditional methods usually
ignore contextual information. In this paper, we investigate the idea of using
Large Language Models (LLMs) to perform various preprocessing tasks, due to
their ability to take context into account without requiring extensive
language-specific annotated resources. Through a comprehensive evaluation on
web-sourced data, we compare LLM-based preprocessing (specifically stopword
removal, lemmatization and stemming) to traditional algorithms across multiple
text classification tasks in six European languages. Our analysis indicates
that LLMs are capable of replicating traditional stopword removal,
lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%,
respectively. Additionally, we show that ML algorithms trained on texts
preprocessed by LLMs achieve an improvement of up to 6% with respect to the
$F_1$ measure compared to traditional techniques. Our code, prompts, and
results are publicly available at
https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.
comment: Accepted in WI-IAT 2025. Pre-camera-ready version
☆ GenCNER: A Generative Framework for Continual Named Entity Recognition IJCNN 2025
Traditional named entity recognition (NER) aims to classify text mentions
into pre-defined entity types. Continual Named Entity Recognition (CNER) is
introduced since entity categories are continuously increasing in various
real-world scenarios. However, existing continual learning (CL) methods for NER
face challenges of catastrophic forgetting and semantic shift of the non-entity
type. In this paper, we propose GenCNER, a simple but effective Generative
framework for CNER to mitigate the above drawbacks. Specifically, we skillfully
convert the CNER task into a sustained entity triplet sequence generation problem
and utilize a powerful pre-trained seq2seq model to solve it. Additionally, we
design a type-specific confidence-based pseudo labeling strategy along with
knowledge distillation (KD) to preserve learned knowledge and alleviate the
impact of label noise at the triplet level. Experimental results on two
benchmark datasets show that our framework outperforms previous
state-of-the-art methods in multiple CNER settings, and achieves the smallest
gap compared with non-CL results.
comment: Accepted by IJCNN 2025
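The abstract does not spell out the type-specific confidence-based pseudo labeling. As a rough sketch of the general idea only, one can assume each entity type gets its own threshold (here the median confidence of the old model's predictions for that type) and only predictions at or above it are kept as pseudo labels. The tuple layout, the median rule, and all names below are assumptions, not the paper's method.

```python
from collections import defaultdict
from statistics import median

def type_thresholds(predictions):
    """Per-type threshold: median confidence of the predictions of that type (assumed rule)."""
    by_type = defaultdict(list)
    for _start, _end, entity_type, confidence in predictions:
        by_type[entity_type].append(confidence)
    return {t: median(c) for t, c in by_type.items()}

def select_pseudo_labels(predictions):
    """Keep (start, end, type) triplets whose confidence reaches their type's threshold."""
    thresholds = type_thresholds(predictions)
    return [(s, e, t) for s, e, t, c in predictions if c >= thresholds[t]]

# Hypothetical old-model predictions: (start, end, entity_type, confidence).
old_model_predictions = [
    (0, 1, "PER", 0.9),
    (3, 4, "PER", 0.5),
    (6, 6, "PER", 0.7),
    (8, 9, "ORG", 0.6),
]
print(select_pseudo_labels(old_model_predictions))
```

A per-type threshold keeps rarely predicted or low-confidence types from being filtered out entirely by a single global cut-off.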
☆ Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content ECAI2025
Generative large language models (LLMs) have become central to everyday life,
producing human-like text across diverse domains. A growing body of research
investigates whether these models also exhibit personality- and
demographic-like characteristics in their language. In this work, we introduce
a novel, data-driven methodology for assessing LLM personality without relying
on self-report questionnaires, applying instead automatic personality and
gender classifiers to model replies on open-ended questions collected from
Reddit. Comparing six widely used models to human-authored responses, we find
that LLMs systematically express higher Agreeableness and lower Neuroticism,
reflecting cooperative and stable conversational tendencies. Gendered language
patterns in model text broadly resemble those of human writers, though with
reduced variation, echoing prior findings on automated agents. We contribute a
new dataset of human and model responses, along with large-scale comparative
analyses, shedding new light on the topic of personality and demographic
patterns of generative AI.
comment: ECAI2025 (Identity-Aware AI workshop)
☆ Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation
Community Notes, the crowd-sourced misinformation governance system on X
(formerly Twitter), enables users to flag misleading posts, attach contextual
notes, and vote on their helpfulness. However, our analysis of 30.8K
health-related notes reveals significant latency, with a median delay of 17.6
hours before the first note receives a helpfulness status. To improve
responsiveness during real-world misinformation surges, we propose CrowdNotes+,
a unified framework that leverages large language models (LLMs) to augment
Community Notes for faster and more reliable health misinformation governance.
CrowdNotes+ integrates two complementary modes: (1) evidence-grounded note
augmentation and (2) utility-guided note automation, along with a hierarchical
three-step evaluation that progressively assesses relevance, correctness, and
helpfulness. We instantiate the framework through HealthNotes, a benchmark of
1.2K helpfulness-annotated health notes paired with a fine-tuned helpfulness
judge. Experiments on fifteen LLMs reveal an overlooked loophole in current
helpfulness evaluation, where stylistic fluency is mistaken for factual
accuracy, and demonstrate that our hierarchical evaluation and LLM-augmented
generation jointly enhance factual precision and evidence utility. These
results point toward a hybrid human-AI governance model that improves both the
rigor and timeliness of crowd-sourced fact-checking.
☆ Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification
Surveys provide valuable insights into public opinion and behavior, but their
execution is costly and slow. Large language models (LLMs) have been proposed
as a scalable, low-cost substitute for human respondents, but their outputs are
often biased and yield invalid estimates. We study the interplay between
synthesis methods that use LLMs to generate survey responses and rectification
methods that debias population estimates, and explore how human responses are
best allocated between them. Using two panel surveys with questions on
nutrition, politics, and economics, we find that synthesis alone introduces
substantial bias (24-86%), whereas combining it with rectification reduces bias
below 5% and increases effective sample size by up to 14%. Overall, we
challenge the common practice of using all human responses for fine-tuning,
showing that under a fixed budget, allocating most to rectification results in
far more effective estimation.
comment: 19 pages, 4 figures, 9 tables
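The abstract does not say which rectification estimators were studied. One common form, in the spirit of prediction-powered inference, adds a bias correction estimated on a small paired subset to the mean of the synthetic responses. The sketch below assumes that form; the function name and the toy numbers are hypothetical.

```python
from statistics import mean

def rectified_mean(synthetic_all, synthetic_paired, human_paired):
    """Synthetic mean plus a bias correction estimated from paired human answers.

    synthetic_all    : LLM-generated answers for the full (large) sample
    synthetic_paired : LLM-generated answers for respondents who also answered
    human_paired     : the matching human answers, in the same order
    """
    if len(synthetic_paired) != len(human_paired):
        raise ValueError("paired samples must have the same length")
    correction = mean(h - s for h, s in zip(human_paired, synthetic_paired))
    return mean(synthetic_all) + correction

# Toy data: the synthetic answers overshoot the human answers by about 0.2.
synthetic_all = [0.8, 0.7, 0.9, 0.8]
estimate = rectified_mean(synthetic_all, [0.8, 0.7], [0.6, 0.5])
print(f"synthetic-only estimate: {mean(synthetic_all):.2f}")
print(f"rectified estimate:      {estimate:.2f}")
```

Under this form, human responses spent on the paired subset directly shrink the bias of the final estimate, which is one way to read the budget-allocation finding above.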
☆ KnowRL: Teaching Language Models to Know What They Know
Truly reliable AI requires more than simply scaling up knowledge; it demands
the ability to know what it knows and when it does not. Yet recent research
shows that even the best LLMs misjudge their own competence in more than one in
five cases, making any response born of such internal uncertainty impossible to
fully trust. Inspired by self-improvement reinforcement learning techniques
that require minimal data, we present a simple but powerful framework, KnowRL,
that strengthens a model's internal understanding of its own feasibility
boundaries, enabling safer and more responsible behaviour. Our framework
combines two components: (i) introspection, where the model generates and
classifies tasks it judges feasible or infeasible, and (ii) consensus-based
rewarding, where stability of self-knowledge assessment is reinforced through
internal agreement. By using internally generated data, this design strengthens
consistency in self-knowledge and entirely avoids costly external supervision.
In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved
self-knowledge, validated by both intrinsic self-consistency and extrinsic
benchmarking. With nothing more than a small seed set and no external
supervision, our method drove gains as high as 28% in accuracy and 12% in F1,
outperforming baselines in just a few iterations. Our framework essentially
unlocks the untapped capacity of LLMs to self-improve their knowledge
awareness, opening the door to reliable, more accountable AI and safer
deployment in critical applications. Owing to its simplicity and independence
from external effort, we encourage applying this reliability-enhancing process
to all future models.
comment: 14 pages, 7 figures
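The consensus-based reward is described only at a high level. A minimal sketch, assuming the reward is the share of repeated self-assessments that agree with the majority label (the function name and label strings are hypothetical):

```python
from collections import Counter

def consensus_reward(judgements):
    """Return the majority label and the fraction of samples that agree with it."""
    if not judgements:
        raise ValueError("need at least one judgement")
    label, count = Counter(judgements).most_common(1)[0]
    return label, count / len(judgements)

# Four sampled self-assessments of the same task.
samples = ["feasible", "feasible", "infeasible", "feasible"]
print(consensus_reward(samples))  # ('feasible', 0.75)
```

A reward of 1.0 then means the model's self-assessment is fully stable across samples, and values near 0.5 indicate an unstable judgement for a binary label.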
☆ DocReward: A Document Reward Model for Structuring and Stylizing
Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei
Recent advances in agentic workflows have enabled the automation of tasks
such as professional document generation. However, they primarily focus on
textual quality, neglecting visual structure and style, which are crucial for
readability and engagement. This gap arises mainly from the absence of suitable
reward models to guide agentic workflows toward producing documents with
stronger structural and stylistic quality. To address this, we propose
DocReward, a document reward model that evaluates documents based on their
structure and style. We construct a multi-domain dataset DocPair of 117K paired
documents, covering 32 domains and 267 document types, each including a high-
and low-professionalism document with identical content but different structure
and style. This enables the model to evaluate professionalism comprehensively,
and in a textual-quality-agnostic way. DocReward is trained using the
Bradley-Terry loss to score documents, penalizing predictions that contradict
the annotated ranking. To assess the performance of reward models, we create a
test dataset containing document bundles ranked by well-educated human
evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6
and 19.4 percentage points, respectively, demonstrating its superiority over
baselines. In an extrinsic evaluation of document generation, DocReward
achieves a significantly higher win rate of 60.8%, compared to GPT-5's 37.7%
win rate, demonstrating its utility in guiding generation agents toward
producing human-preferred documents.
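For reference, the standard pairwise Bradley-Terry loss on two scalar scores can be written as below. This is the textbook form, not DocReward's training code; the scores would come from the reward model, which is not shown.

```python
import math

def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred document wins.

    Under Bradley-Terry, P(preferred beats rejected) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably for either sign.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(bradley_terry_loss(2.0, 0.0))  # small loss: ranking agrees with the annotation
print(bradley_terry_loss(0.0, 2.0))  # large loss: ranking contradicts the annotation
```

The loss grows roughly linearly with the margin when the predicted order contradicts the annotated one, which is the penalty the abstract refers to.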
☆ Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies
Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen
Social deduction games like Werewolf combine language, reasoning, and
strategy, providing a testbed for studying natural language and social
intelligence. However, most studies reduce the game to LLM-based self-play,
yielding templated utterances and anecdotal cases that overlook the richness of
social gameplay. Evaluation further relies on coarse metrics such as survival
time or subjective scoring due to the lack of quality reference data. To
address these gaps, we curate a high-quality, human-verified multimodal
Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens,
and 15 rule variants. Based on this dataset, we propose a novel
strategy-alignment evaluation that leverages the winning faction's strategies
as ground truth in two stages: 1) Speech evaluation, formulated as
multiple-choice-style tasks that assess whether the model can adopt appropriate
stances across five dimensions of social ability; and 2) Decision evaluation,
which assesses the model's voting choices and opponent-role inferences. This
framework enables a fine-grained evaluation of models' linguistic and reasoning
capabilities, while capturing their ability to generate strategically coherent
gameplay. Our experiments show that state-of-the-art LLMs vary widely in
performance, with roughly half remaining below 0.50, revealing clear gaps in
deception and counterfactual reasoning. We hope our dataset further inspires
research on language, reasoning, and strategy in multi-agent interaction.
comment: 34 pages, 32 figures
☆ Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning ACL
Although large language models excel across many tasks, they can memorise
training data and thereby expose private or copyrighted text. Most defences
target the pre-training stage, leaving memorisation during fine-tuning,
especially for domain adaptation and instruction tuning, poorly understood. We
fine-tune Pythia, Llama3, and Mistral models spanning 1.4B-70B parameters on
common evaluation datasets and track verbatim memorisation throughout training.
We find that memorisation increases dramatically in the first few epochs, often
significantly before either validation perplexity or evaluation performance is
optimised. We use a simple but effective n-gram memorisation score which
reliably precedes verbatim memorisation; using it as an early-stopping
criterion mitigates memorisation with minimal performance loss. Further, we
introduce an n-gram-aware loss regulariser and show that it reduces
memorisation across all model families tested by up to 40% while minimising
evaluation performance trade-offs when compared to an existing memorisation
mitigation strategy. These results yield practical, scalable insights into
memorisation dynamics during language model fine-tuning.
comment: Accepted to Transactions of the ACL (TACL), 2025. 15 pages, 6
figures, 3 tables
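The abstract does not define its n-gram memorisation score. One plausible reading, assumed here, is the fraction of n-grams in a model continuation that also appear in the corresponding training text, checked against a threshold for early stopping. The value of n, the threshold, and the function names are illustrative.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_memorisation_score(generated, reference, n=4):
    """Fraction of n-grams in `generated` that also occur in `reference`."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)

def should_stop(score, threshold=0.5):
    """Early-stopping check: stop once partial overlap reaches the threshold."""
    return score >= threshold

reference = "the cat sat on the mat today".split()
generated = "the cat sat on a mat today".split()
score = ngram_memorisation_score(generated, reference, n=3)
print(score, should_stop(score))  # 0.4 False
```

Because partial n-gram overlap rises before whole sequences are reproduced verbatim, a score of this kind can act as an early warning, which matches the role the abstract assigns to it.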
☆ Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Reinforcement learning (RL) has emerged as a crucial approach for enhancing
the capabilities of large language models. However, in Mixture-of-Experts (MoE)
models, the routing mechanism often introduces instability, even leading to
catastrophic RL training collapse. We analyze the training-inference
consistency of MoE models and identify a notable discrepancy in routing
behaviors between the two phases. Moreover, even under identical conditions,
the routing framework can yield divergent expert selections across repeated
forward passes. To address this foundational inconsistency, we propose Rollout
Routing Replay (R3), a method that records routing distributions from the
inference engine and replays them during training. R3 significantly reduces
training-inference policy KL divergence and mitigates extreme discrepancies
without compromising training speed. Extensive experiments on various settings
confirm that R3 succeeds in stabilizing RL training, preventing collapse and
outperforming methods such as GSPO and TIS. We believe this work can offer a
new solution for stabilizing RL in MoE models.
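To illustrate why replaying routing decisions can help, the toy sketch below shows a small numerical discrepancy between inference-time and training-time router logits flipping the top-k expert selection, and how reusing the recorded selection removes the mismatch. It assumes that only the selected expert indices are replayed and that gate weights are recomputed from the training logits; the abstract does not confirm either detail, and all names are hypothetical.

```python
import math

def top_k(logits, k):
    """Indices of the k largest logits, in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])

def gate_weights(logits, experts):
    """Softmax restricted to the selected experts."""
    m = max(logits[i] for i in experts)
    exps = {i: math.exp(logits[i] - m) for i in experts}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def route(train_logits, k, replayed_experts=None):
    """Route with the training router, optionally forcing the recorded expert set."""
    experts = replayed_experts if replayed_experts is not None else top_k(train_logits, k)
    return gate_weights(train_logits, experts)

# The same token seen by two engines with slightly different router logits.
inference_logits = [2.0, 1.0, 1.1, -1.0]
train_logits = [2.0, 1.1, 1.0, -1.0]

recorded = top_k(inference_logits, k=2)                           # recorded during rollout
print(sorted(route(train_logits, k=2)))                           # [0, 1]: differs from rollout
print(sorted(route(train_logits, k=2, replayed_experts=recorded)))  # [0, 2]: matches rollout
```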
☆ LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
incorporating external knowledge. While traditional retrieval focuses on
relevance, RAG's effectiveness depends on the utility of retrieved passages,
i.e., the usefulness in facilitating the generation of an accurate and
comprehensive answer. Existing studies often treat utility as a generic
attribute, ignoring the fact that different LLMs may benefit differently from
the same passage due to variations in internal knowledge and comprehension
ability. In this work, we introduce and systematically investigate the notion
of LLM-specific utility. Through large-scale experiments across multiple
datasets and LLMs, we demonstrate that human-annotated passages are not optimal
for LLMs and that ground-truth utilitarian passages are not transferable across
different LLMs. These findings highlight the necessity of adopting the
LLM-specific utility in RAG research. Our findings indicate that some
human-annotated passages are not ground-truth utilitarian passages for specific
LLMs, partially due to the varying readability of queries and passages for
LLMs, a tendency for which perplexity is a key metric. Based on these findings,
we propose a benchmarking procedure for LLM-specific utility judgments. We
evaluate existing utility judgment methods on six datasets and find that while
verbalized methods using pseudo-answers perform robustly, LLMs struggle to
assess utility effectively, failing to reject all passages for known queries and
to select truly useful ones for unknown queries.
comment: 13 pages, 9 figures
☆ Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap ICASSP 2026
Contrastive audio-language pretraining yields powerful joint representations,
yet a persistent audio-text modality gap limits the benefits of coupling
multimodal encoders with large language models (LLMs). We present
Diffusion-Link, a diffusion-based modality-bridging module that generatively
maps audio embeddings into the text-embedding distribution. The module is
trained at the output embedding from the frozen multimodal encoder and
implemented as a lightweight network with three residual MLP blocks. To assess
the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on
Automatic Audio Captioning (AAC); to our knowledge, this is the first
application of diffusion-based modality bridging to AAC. We report two results.
(1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link
reduces the modality gap more than prior diffusion-based methods and shows
a collective migration of audio embeddings toward the text distribution. (2)
Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline
achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised
captioning without external knowledge, with relative gains up to 52.5% and
7.5%, respectively. These findings show that closing the modality gap is
pivotal for effective coupling between multimodal encoders and LLMs, and
diffusion-based modality bridging offers a promising direction beyond
knowledge-retrieval-centric designs. Code will be released upon acceptance at
https://github.com/DevKiHyun/Diffusion-Link.
comment: 5 pages. Submitted to IEEE ICASSP 2026
☆ Do LLMs "Feel"? Emotion Circuits Discovery and Control
Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, Xiuying Chen
As the demand for emotional intelligence in large language models (LLMs)
grows, a key challenge lies in understanding the internal mechanisms that give
rise to emotional expression and in controlling emotions in generated text.
This study addresses three core questions: (1) Do LLMs contain context-agnostic
mechanisms shaping emotional expression? (2) What form do these mechanisms
take? (3) Can they be harnessed for universal emotion control? We first
construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit
comparable internal states across emotions. Subsequently, we extract
context-agnostic emotion directions that reveal consistent, cross-context
encoding of emotion (Q1). We identify neurons and attention heads that locally
implement emotional computation through analytical decomposition and causal
analysis, and validate their causal roles via ablation and enhancement
interventions. Next, we quantify each sublayer's causal influence on the
model's final emotion representation and integrate the identified local
components into coherent global emotion circuits that drive emotional
expression (Q2). Directly modulating these circuits achieves 99.65%
emotion-expression accuracy on the test set, surpassing prompting- and
steering-based methods (Q3). To our knowledge, this is the first systematic
study to uncover and validate emotion circuits in LLMs, offering new insights
into interpretability and controllable emotional intelligence.
comment: 19 pages, 8 figures, 8 tables. Code and dataset available at
https://github.com/Aurora-cx/EmotionCircuits-LLM
☆ Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications
Individuals with intellectual disabilities often have difficulties in
comprehending complex texts. While many text-to-image models prioritize
aesthetics over accessibility, it is not clear how visual illustrations relate
to text simplifications (TS) generated from them. This paper presents a
structured vision-language model (VLM) prompting framework for generating
accessible images from simplified texts. We designed five prompt templates,
i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level
Detail, and Grid Layout, each following distinct spatial arrangements while
adhering to accessibility constraints such as object count limits, spatial
separation, and content restrictions. Using 400 sentence-level simplifications
from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and
ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template
effectiveness with CLIPScores, and Phase 2 involved human annotation of
generated images across ten visual styles by four accessibility experts.
Results show that the Basic Object Focus prompt template achieved the highest
semantic alignment, indicating that visual minimalism enhances language
accessibility. Expert evaluation further identified Retro style as the most
accessible and Wikipedia as the most effective data source. Inter-annotator
agreement varied across dimensions, with Text Simplicity showing strong
reliability and Image Quality proving more subjective. Overall, our framework
offers practical guidelines for accessible content generation and underscores
the importance of structured prompting in AI-generated visual accessibility
tools.
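CLIPScore, as commonly defined, is a rescaled, clipped cosine similarity between an image embedding and a text embedding, with rescaling weight w = 2.5. The sketch below applies that formula to precomputed vectors; the toy embeddings are made up, and the CLIP encoders that would produce real ones are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_embedding, text_embedding, w=2.5):
    """CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_embedding, text_embedding), 0.0)

# Toy 2-d embeddings standing in for CLIP outputs.
print(clip_score([1.0, 0.0], [1.0, 0.0]))   # 2.5 (perfect alignment)
print(clip_score([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0 (negative similarity is clipped)
```

Averaging this score over the images generated from each prompt template gives a simple way to rank templates by semantic alignment, which is how Phase 1 is described above.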
☆ FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks EMNLP 2025
Current approaches to embodied AI tend to learn policies from expert
demonstrations. However, without a mechanism to evaluate the quality of
demonstrated actions, they are limited to learning from optimal behaviour, or
they risk replicating errors and inefficiencies. While reinforcement learning
offers one alternative, the associated exploration typically results in
sacrificing data efficiency. This work explores how agents trained with
imitation learning can learn robust representations from both optimal and
suboptimal demonstrations when given access to constructive language feedback
as a means to contextualise different modes of behaviour. We directly provide
language feedback embeddings as part of the input sequence into a
Transformer-based policy, and optionally complement the traditional next action
prediction objective with auxiliary self-supervised learning objectives for
feedback prediction. We test our approach on a range of embodied
Vision-and-Language tasks in our custom BabyAI-XGen environment and show
significant improvements in agents' compositional generalisation abilities and
robustness, suggesting that our data-efficient method allows models to
successfully convert suboptimal behaviour into learning opportunities. Overall,
our results suggest that language feedback is a competitive and intuitive
alternative to intermediate scalar rewards for language-specified embodied
tasks.
comment: EMNLP 2025 Findings
☆ Are Large Language Models Effective Knowledge Graph Constructors?
Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown
promise in reducing hallucinations in large language models (LLMs). However,
constructing high-quality KGs remains difficult, requiring accurate information
extraction and structured representations that support interpretability and
downstream utility. Existing LLM-based approaches often focus narrowly on
entity and relation extraction, limiting coverage to sentence-level contexts or
relying on predefined schemas. We propose a hierarchical extraction framework
that organizes information at multiple levels, enabling the creation of
semantically rich and well-structured KGs. Using state-of-the-art LLMs, we
extract and construct knowledge graphs and evaluate them comprehensively from
both structural and semantic perspectives. Our results highlight the strengths
and shortcomings of current LLMs in KG construction and identify key challenges
for future work. To advance research in this area, we also release a curated
dataset of LLM-generated KGs derived from research papers on children's mental
well-being. This resource aims to foster more transparent, reliable, and
impactful applications in high-stakes domains such as healthcare.
☆ Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov
Recent work has shown that narrow finetuning can produce broadly misaligned
LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these
findings were limited to finetuning and activation steering, leaving out
in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find
that it does: across three datasets, three frontier models produce broadly
misaligned responses at rates between 2% and 17% given 64 narrow in-context
examples, and up to 58% with 256 examples. We also examine mechanisms of EM by
eliciting step-by-step reasoning (while leaving in-context examples unchanged).
Manual analysis of the resulting chain-of-thought shows that 67.5% of
misaligned traces explicitly rationalize harmful outputs by adopting a reckless
or dangerous "persona", echoing prior results on finetuning-induced EM.
☆ ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models
We present Entropic Mutual-Information Geometry Large-Language Model
Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training
that jointly improves reasoning, alignment and robustness by treating an
organisation's policies/principles as directions to move on a model's
information manifold. Our single-loop trainer combines Group-Relative Policy
Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought
(CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information
(SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn
optimal-transport regulariser on hidden-state distributions to bound geometry
drift. We also introduce InfoNCE metrics that specialise to a standard MI lower
bound under matched negatives to measure how strongly a model's CoT encodes
these policies. These metrics include a Sufficiency Index (SI) that enables the
selection and creation of principles that maximise downstream performance prior
to training. In our experiments using small (1B) LLMs, high-SI principles
predict steadier training dynamics and improved benchmark performance over GRPO
ablations. Our information-geometry analysis of trained models validates
desirable structural change in the manifold. These results support our
hypothesis that reasoning, alignment, and robustness are projections of a
single information-geometric objective, and that models trained using ENIGMA
demonstrate principled reasoning without the use of a reward model, offering a
path to trusted capability.
comment: 52 pages, 10 figures
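For reference, the "standard MI lower bound" that InfoNCE-style objectives are usually tied to can be written as below. This is the textbook form, not the paper's own metric definition; the critic $f$, the batch of $N$ candidates (one matched positive $y^{+}$ among $y_1, \dots, y_N$), and the symbol names are assumptions made for illustration.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[\log
      \frac{\exp f(x, y^{+})}{\sum_{j=1}^{N} \exp f(x, y_{j})}\right],
\qquad
I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}
```

A lower loss therefore certifies a larger lower bound on the mutual information, capped at $\log N$.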
☆ Towards Real-Time Fake News Detection under Evidence Scarcity
Fake news detection becomes particularly challenging in real-time scenarios,
where emerging events often lack sufficient supporting evidence. Existing
approaches often rely heavily on external evidence and therefore struggle to
generalize under evidence scarcity. To address this issue, we propose
Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time
fake news detection that dynamically adapts its decision-making process
according to the assessed sufficiency of available evidence. EASE introduces a
sequential evaluation mechanism comprising three independent perspectives: (1)
Evidence-based evaluation, which assesses evidence and incorporates it into
decision-making only when the evidence is sufficiently supportive; (2)
Reasoning-based evaluation, which leverages the world knowledge of large
language models (LLMs) and applies them only when their reliability is
adequately established; and (3) Sentiment-based fallback, which integrates
sentiment cues when neither evidence nor reasoning is reliable. To enhance the
accuracy of evaluation processes, EASE employs instruction tuning with pseudo
labels to guide each evaluator in justifying its perspective-specific knowledge
through interpretable reasoning. Furthermore, the expert modules integrate the
evaluators' justified assessments with the news content to enable
evaluation-aware decision-making, thereby enhancing overall detection accuracy.
Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news
for evaluating model generalization on emerging news with limited evidence.
Extensive experiments demonstrate that EASE not only achieves state-of-the-art
performance across multiple benchmarks, but also significantly improves
generalization to real-time news. The code and dataset are available:
https://github.com/wgyhhhh/EASE.
☆ Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
Psychometric tests are increasingly used to assess psychological constructs
in large language models (LLMs). However, it remains unclear whether these
tests -- originally developed for humans -- yield meaningful results when
applied to LLMs. In this study, we systematically evaluate the reliability and
validity of human psychometric tests for three constructs: sexism, racism, and
morality. We find moderate reliability across multiple item and prompt
variations. Validity is evaluated through both convergent (i.e., testing
theory-based inter-test correlations) and ecological approaches (i.e., testing
the alignment between test scores and behavior in real-world downstream
tasks). Crucially, we find that psychometric test scores do not align with, and
in some cases even negatively correlate with, model behavior in downstream
tasks, indicating low ecological validity. Our results highlight that
systematic evaluation of psychometric tests is essential before interpreting their
scores. They also suggest that psychometric tests designed for humans cannot be
applied directly to LLMs without adaptation.
☆ Attacks by Content: Automated Fact-checking is an AI Security Issue EMNLP 2025
When AI agents retrieve and reason over external documents, adversaries can
manipulate the data they receive to subvert their behaviour. Previous research
has studied indirect prompt injection, where the attacker injects malicious
instructions. We argue that injection of instructions is not necessary to
manipulate agents - attackers could instead supply biased, misleading, or false
information. We term this an attack by content. Existing defenses, which focus
on detecting hidden commands, are ineffective against attacks by content. To
defend themselves and their users, agents must critically evaluate retrieved
information, corroborating claims with external evidence and evaluating source
trustworthiness. We argue that this is analogous to an existing NLP task,
automated fact-checking, which we propose to repurpose as a cognitive
self-defense tool for agents.
comment: Accepted to EMNLP 2025
☆ XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression EMNLP 2025
Large Language Models (LLMs) have demonstrated remarkable capabilities across
diverse natural language processing tasks. However, their extensive memory
requirements, particularly due to KV cache growth during long-text
understanding and generation, present significant challenges for deployment in
resource-constrained environments. Quantization has emerged as a promising
solution to reduce memory consumption while preserving historical information.
We propose XQuant, a training-free and plug-and-play framework that achieves
ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key
innovations: a computationally negligible data-free calibration method and
cross-layer KV cache compression, enabling quantization to sub-1.4 bits.
Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant
outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by
achieving lower bit-width while maintaining superior performance, establishing
a better trade-off between memory efficiency and model accuracy.
comment: To be published in The 2025 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2025)
☆ CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis
Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li
Depression is a pressing global public health issue, yet publicly available
Chinese-language resources for risk detection remain scarce and are mostly
limited to binary classification. To address this limitation, we release
CNSocialDepress, a benchmark dataset for depression risk detection from Chinese
social media posts. The dataset contains 44,178 texts from 233 users, within
which psychological experts annotated 10,306 depression-related segments.
CNSocialDepress provides binary risk labels together with structured
multi-dimensional psychological attributes, enabling interpretable and
fine-grained analysis of depressive signals. Experimental results demonstrate
its utility across a wide range of NLP tasks, including structured
psychological profiling and fine-tuning of large language models for depression
detection. Comprehensive evaluations highlight the dataset's effectiveness and
practical value for depression risk identification and psychological analysis,
thereby providing insights to mental health applications tailored for
Chinese-speaking populations.
☆ A Theorem-Proving-Based Evaluation of Neural Semantic Parsing
Graph-matching metrics such as Smatch are the de facto standard for
evaluating neural semantic parsers, yet they capture surface overlap rather
than logical equivalence. We reassess evaluation by pairing graph-matching with
automated theorem proving. We compare two approaches to building parsers:
supervised fine-tuning (T5-Small/Base) and few-shot in-context learning
(GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs
using graph-matching, bidirectional entailment between source and target
formulas with a first-order logic theorem prover, and well-formedness. Across
settings, we find that models performing well on graph-matching often fail to
produce logically equivalent formulas. Normalization reduces incidental target
variability, improves well-formedness, and strengthens logical adequacy. Error
analysis shows performance degrades with increasing formula complexity and with
coordination, prepositional phrases, and passive voice; the dominant failures
involve variable binding and indexing, and predicate naming. These findings
highlight limits of graph-based metrics for reasoning-oriented applications and
motivate logic-sensitive evaluation and training objectives together with
simplified, normalized target representations. All code and data for our
experiments are publicly available.
comment: Accepted to BlackboxNLP 2025
☆ Fairness Metric Design Exploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models
Battemuulen Naranbat, Seyed Sahand Mohammadi Ziabari, Yousuf Nasser Al Husaini, Ali Mohammed Mansoor Alsahag
Ensuring fairness in natural language processing for moral sentiment
classification is challenging, particularly under cross-domain shifts where
transformer models are increasingly deployed. Using the Moral Foundations
Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC), this work
evaluates BERT and DistilBERT in a multi-label setting with in-domain and
cross-domain protocols. Aggregate performance can mask disparities: we observe
pronounced asymmetry in transfer, with Twitter->Reddit degrading micro-F1 by
14.9% versus only 1.5% for Reddit->Twitter. Per-label analysis reveals fairness
violations hidden by overall scores; notably, the authority label exhibits
Demographic Parity Differences of 0.22-0.23 and Equalized Odds Differences of
0.40-0.41. To address this gap, we introduce the Moral Fairness Consistency
(MFC) metric, which quantifies the cross-domain stability of moral foundation
detection. MFC shows strong empirical validity, achieving a perfect negative
correlation with Demographic Parity Difference (rho = -1.000, p < 0.001) while
remaining independent of standard performance metrics. Across labels, loyalty
demonstrates the highest consistency (MFC = 0.96) and authority the lowest (MFC
= 0.78). These findings establish MFC as a complementary, diagnosis-oriented
metric for fairness-aware evaluation of moral reasoning models, enabling more
reliable deployment across heterogeneous linguistic contexts.
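For readers unfamiliar with the two fairness measures quoted above, the following is a minimal sketch of Demographic Parity Difference and Equalized Odds Difference for a single binary label. It uses the common textbook definitions; the function names, the use of the source domain as the group attribute, and the toy data are assumptions for illustration, not the authors' code. The MFC metric itself is not reproduced here because the abstract does not define it.

```python
def positive_rate(preds):
    """Share of positive predictions; 0.0 for an empty list."""
    return sum(preds) / len(preds) if preds else 0.0


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    by_group = {}
    for pred, group in zip(y_pred, groups):
        by_group.setdefault(group, []).append(pred)
    rates = [positive_rate(p) for p in by_group.values()]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the between-group gaps in TPR and in FPR."""
    gaps = []
    for target in (1, 0):  # target=1 gives TPR, target=0 gives FPR
        by_group = {}
        for true, pred, group in zip(y_true, y_pred, groups):
            if true == target:
                by_group.setdefault(group, []).append(pred)
        rates = [positive_rate(p) for p in by_group.values()]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


# Toy data (hypothetical): one moral-foundation label, two source domains.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["twitter"] * 4 + ["reddit"] * 4

dpd = demographic_parity_difference(y_pred, groups)
eod = equalized_odds_difference(y_true, y_pred, groups)
```

Both measures are 0 when the groups are treated identically and grow toward 1 as the disparity increases.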
☆ WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent
LLM-brained web agents offer powerful capabilities for web automation but
face a critical cost-performance trade-off. The challenge is amplified by web
agents' inherently complex prompts that include goals, action histories, and
environmental states, leading to degraded LLM ensemble performance. To address
this, we introduce WebRouter, a novel query-specific router trained from an
information-theoretic perspective. Our core contribution is a cost-aware
Variational Information Bottleneck (ca-VIB) objective, which learns a
compressed representation of the input prompt while explicitly penalizing the
expected operational cost. Experiments on five real-world websites from the
WebVoyager benchmark show that WebRouter reduces operational costs by a
striking 87.8% compared to a GPT-4o baseline, while incurring only a 3.8%
accuracy drop.
☆ The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Large language models (LLMs) can correctly answer "When was Einstein born?"
yet fail to provide the same date when writing about Einstein's life, revealing
a fundamental inconsistency in how models access factual knowledge across task
complexities. While models display impressive accuracy on factual
question-answering benchmarks, the reliability gap between simple and complex
queries remains poorly understood, eroding their trustworthiness. In this work,
we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a
controlled evaluation framework that compares LLMs' answers to the same factual
questions asked (a) in isolation (short) vs. (b) integrated into complex
queries (long). Looking at 16 LLMs across 600 queries, we find a systematic
misalignment of answers to the corresponding short and long queries. We further
uncover position-dependent accuracy loss and momentum effects where consecutive
correct or incorrect answers create self-reinforcing patterns. Through
mechanistic analysis, we find that aligned facts activate overlapping model
internals, and that metrics based on mechanistic similarity can predict
short-long answer alignment with up to 78% accuracy. Our work establishes
factual consistency over query complexity as an important aspect of LLMs'
trustworthiness and challenges current evaluation practices, which implicitly
assume that good performance for simple factual queries implies reliability in
more complex knowledge-seeking tasks too.
☆ Domain-Specific Data Generation Framework for RAG Adaptation
Retrieval-Augmented Generation (RAG) combines the language understanding and
reasoning power of large language models (LLMs) with external retrieval to
enable domain-grounded responses. Effectively adapting RAG systems to
domain-specific settings requires specialized, context-rich training data
beyond general-purpose question-answering. Here, we propose RAGen, a scalable
and modular framework for generating domain-grounded question-answer-context
(QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces
these QAC triples by identifying key concepts in documents, generating diverse
questions guided by Bloom's Taxonomy-inspired principles, and pairing them with
precise answers extracted from relevant contexts. RAGen supports multiple RAG
adaptation strategies, including the optimization of key components such as the
LLM, retriever, and embedding model. Its modular pipeline features
semantic chunking, hierarchical concept extraction, and multi-chunk retrieval,
along with the introduction of curated distractor contexts to promote robust
reasoning. Designed for scalability, RAGen efficiently handles large and
evolving document corpora without redundant processing, making it especially
suitable for dynamically evolving domains such as scientific research and
enterprise knowledge bases.
☆ Discursive Circuits: How Do Language Models Understand Discourse Relations? EMNLP 2025
Which components in transformer language models are responsible for discourse
understanding? We hypothesize that sparse computational graphs, termed
discursive circuits, control how models process discourse relations. Unlike
simpler tasks, discourse relations involve longer spans and complex reasoning.
To make circuit discovery feasible, we introduce a task called Completion under
Discourse Relation (CuDR), where a model completes a discourse given a
specified relation. To support this task, we construct a corpus of minimal
contrastive pairs tailored for activation patching in circuit discovery.
Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model)
recover discourse understanding in the English PDTB-based CuDR task. These
circuits generalize well to unseen discourse frameworks such as RST and SDRT.
Further analysis shows lower layers capture linguistic features such as lexical
semantics and coreference, while upper layers encode discourse-level
abstractions. Feature utility is consistent across frameworks (e.g.,
coreference supports Expansion-like relations).
comment: Accepted to EMNLP 2025 (Main Conference); 9 pages, 8 figures, 5
tables (20 pages, 12 figures, 14 tables including references and appendices)
☆ Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem
Vision-language models (VLMs) often produce chain-of-thought (CoT)
explanations that sound plausible yet fail to reflect the underlying decision
process, undermining trust in high-stakes clinical use. Existing evaluations
rarely catch this misalignment, prioritizing answer accuracy or adherence to
formats. We present a clinically grounded framework for chest X-ray visual
question answering (VQA) that probes CoT faithfulness via controlled text and
image modifications across three axes: clinical fidelity, causal attribution,
and confidence calibration. In a reader study (n=4), evaluator-radiologist
correlations fall within the observed inter-radiologist range for all axes,
with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate
alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone
($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows
that answer accuracy and explanation quality are decoupled, acknowledging
injected cues does not ensure grounding, and text cues shift explanations more
than visual cues. While some open-source models match final answer accuracy,
proprietary models score higher on attribution (25.0% vs. 1.4%) and often on
fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to
evaluate beyond final answer accuracy.
☆ Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?
Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang
Recent advances in large language models (LLMs) have demonstrated remarkable
capabilities in reasoning and tool utilization. However, the generalization of
tool-augmented reinforcement learning (RL) across diverse domains remains
underexplored. In this work, we investigate the cross-domain generalization of
an LLM agent equipped with a code interpreter tool, which is exclusively
trained on mathematical problem-solving tasks. Despite the restricted training
domain, we evaluate the agent's performance across several distinct reasoning
domains. The results reveal that RL-based tool usage learned from mathematical
tasks can be effectively transferred to complex tasks in other domains,
enabling strong task performance and high token efficiency. To facilitate this
cross-domain transfer, we propose a Tool Generalization Reinforcement Learning
(TGRL) framework designed to promote domain-agnostic learning and skill
migration, encompassing: (i) a standardized tool interface that abstracts
domain-specific nuances through consistent formatting and explicit termination,
fostering transferable invocation patterns; (ii) a dual-component reward system
that decomposes rewards to incentivize generalizable behaviors like tool
efficiency and reasoning abstraction, ensuring alignment and robustness across
domain shifts; and (iii) an XML-based prompt template that separates thinking,
tool calls, and responses to encourage modular, domain-invariant planning and
coherent multi-turn interactions. Extensive experiments across diverse
benchmarks validate our approach, achieving state-of-the-art performance and
highlighting the cross-domain potential of Tool RL for LLM reasoning.
☆ EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
With the rise of reasoning language models and test-time scaling methods as a
paradigm for improving model performance, substantial computation is often
required to generate multiple candidate sequences from the same prompt. This
enables exploration of different reasoning paths toward the correct solution
but allocates the same compute budget to each prompt. Grounded in the
assumption that different prompts carry different degrees of complexity, and
thus different computation needs, we propose EAGer, a training-free generation
method that leverages model uncertainty through token-wise entropy distribution
to reduce redundant computation and concurrently improve overall performance.
EAGer allows branching to multiple reasoning paths only in the presence of
high-entropy tokens, and then reallocates the saved compute budget to the
instances where exploration of alternative paths is most needed. We find that
across multiple open-source models on complex reasoning benchmarks such as AIME
2025, EAGer can reallocate the budget without accessing target labels,
achieving the best efficiency-performance trade-off in terms of reasoning
length and Pass@k. When target labels are accessible, EAGer generates up to 65%
fewer tokens (hence saving compute) and achieves up to 37% improvement in
Pass@k compared to Full Parallel Sampling.
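The branching rule described above can be illustrated with a minimal sketch: compute the Shannon entropy of the next-token distribution and branch only when it exceeds a threshold. The function names, the threshold value, and the toy distributions are assumptions for illustration; the abstract does not specify how the threshold is chosen or how the saved budget is reallocated.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_branch(probs, threshold):
    """Branch into several continuations only at high-entropy tokens."""
    return token_entropy(probs) > threshold


# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # model is nearly certain
uncertain = [0.25, 0.25, 0.25, 0.25]  # model is maximally unsure

THRESHOLD = 1.0  # assumed value, chosen only for this example

branch_on_confident = should_branch(confident, THRESHOLD)
branch_on_uncertain = should_branch(uncertain, THRESHOLD)
```

In this toy case only the uniform distribution triggers a branch, so extra samples are spent where the model is unsure instead of being spread evenly across every prompt.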
☆ ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces ICML 2025
Large output spaces, also referred to as Extreme multilabel classification
(XMC), is a setting that arises, e.g., in large-scale tagging and
product-to-product recommendation, and is characterized by the number of labels
ranging from hundreds of thousands to millions. This means that the linear
classification head, usually only a tiny fraction of the overall model, turns
into the main driver for compute and memory demand. Current state-of-the-art
XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we
show can be unstable, and inefficient in terms of memory usage and
computational overhead. Meanwhile, existing low-precision methods typically
retain higher precision for the classification layer. In this work, we propose
ELMO, a pure low-precision training framework for XMC models using BFloat16 and
Float8 data types. By leveraging Kahan summation and stochastic rounding, we
demonstrate that XMC models can be effectively trained entirely in Float8,
without relying on single-precision master weights or tensor scaling.
Low-precision training, combined with our proposed memory optimizations --
gradient fusion and chunking -- enables significant reductions in GPU memory
usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of
GPU memory, compared to the 39.7 GiB required by the optimized SOTA method,
Renee, without compromising accuracy.
comment: Accepted to ICML 2025
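The two numerical techniques named above, Kahan summation and stochastic rounding, can be sketched as follows. This is a generic illustration in ordinary Python double precision on a uniform rounding grid, which is an assumption made for readability; it is not the ELMO implementation and does not model the BFloat16 or Float8 formats. All function and variable names are hypothetical.

```python
import math
import random


def naive_sum(values):
    """Plain left-to-right accumulation; small addends can be lost."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carry the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # re-inject the error lost so far
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability
    proportional to the distance from the lower grid point."""
    lower = math.floor(x / step) * step
    frac = (x - lower) / step
    return lower + step if rng.random() < frac else lower


# Each 1e-16 addend is below half an ulp of 1.0, so naive addition drops it.
values = [1.0] + [1e-16] * 10_000

# Stochastic rounding is unbiased: the mean of many roundings approaches x.
rng = random.Random(0)
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(20_000)]
mean_rounded = sum(samples) / len(samples)
```

The naive loop returns exactly 1.0 here, while the compensated sum recovers the accumulated 1e-12. This is the kind of small-update loss that becomes severe when weights are stored in very low precision.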
☆ Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages
Hate speech poses a serious threat to social cohesion and individual
well-being, particularly on social media, where it spreads rapidly. While
research on hate speech detection has progressed, it remains largely focused on
English, resulting in limited resources and benchmarks for low-resource
languages. Moreover, many of these languages have multiple linguistic
varieties, a factor often overlooked in current approaches. At the same time,
large language models require substantial amounts of data to perform reliably,
a requirement that low-resource languages often cannot meet. In this work, we
address these gaps by compiling a meta-collection of hate speech datasets for
European Spanish, standardised with unified labels and metadata. This
collection is based on a systematic analysis and integration of existing
resources, aiming to bridge the data gap and support more consistent and
scalable hate speech detection. We extended this collection by translating it
into European Portuguese and into a Galician standard that is more convergent
with Spanish and another Galician variant that is more convergent with
Portuguese, creating aligned multilingual corpora. Using these resources, we
establish new benchmarks for hate speech detection in Iberian languages. We
evaluate state-of-the-art large language models in zero-shot, few-shot, and
fine-tuning settings, providing baseline results for future research. Moreover,
we perform a cross-lingual analysis with our target languages. Our findings
underscore the importance of multilingual and variety-aware approaches in hate
speech detection and offer a foundation for improved benchmarking in
underrepresented European languages.
☆ One Size Does Not Fit All: Exploring Variable Thresholds for Distance-Based Multi-Label Text Classification
Distance-based unsupervised text classification is a method within text
classification that leverages the semantic similarity between a label and a
text to determine label relevance. This method provides numerous benefits,
including fast inference and adaptability to expanding label sets, as opposed
to zero-shot, few-shot, and fine-tuned neural networks that require re-training
in such cases. In multi-label distance-based classification and information
retrieval algorithms, thresholds are required to determine whether a text
instance is "similar" to a label or query. Similarity between a text and label
is determined in a dense embedding space, usually generated by state-of-the-art
sentence encoders. Multi-label classification complicates matters, as a text
instance can have multiple true labels, unlike in multi-class or binary
classification, where each instance is assigned only one label. We expand upon
previous literature on this underexplored topic by thoroughly examining and
evaluating the ability of sentence encoders to perform distance-based
classification. First, we perform an exploratory study to verify whether the
semantic relationships between texts and labels vary across models, datasets,
and label sets by conducting experiments on a diverse collection of realistic
multi-label text classification (MLTC) datasets. We find that similarity
distributions show statistically significant differences across models,
datasets and even label sets. We propose a novel method for optimizing
label-specific thresholds using a validation set. Our label-specific
thresholding method achieves an average improvement of 46% over normalized 0.5
thresholding and outperforms uniform thresholding approaches from previous work
by an average of 14%. Additionally, the method demonstrates strong performance
even with limited labeled examples.
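A minimal sketch of label-specific thresholding on a validation set is given below. It assumes a grid search that maximizes per-label F1, which is one plausible choice and not necessarily the authors' optimization procedure; the function names, the candidate grid, and the toy similarity scores are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, y_true, candidates):
    """Return the first candidate threshold with the highest validation F1."""
    best, best_f1 = candidates[0], -1.0
    for thr in candidates:
        preds = [1 if s >= thr else 0 for s in scores]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best, best_f1 = thr, score
    return best


def fit_label_thresholds(similarities, gold, candidates):
    """Tune one threshold per label from validation similarities."""
    return {
        label: best_threshold(similarities[label], gold[label], candidates)
        for label in similarities
    }


# Hypothetical validation data: text-label similarities for four texts.
similarities = {
    "sports": [0.90, 0.80, 0.30, 0.20],
    "politics": [0.35, 0.30, 0.15, 0.10],  # lower similarity scale overall
}
gold = {
    "sports": [1, 1, 0, 0],
    "politics": [1, 1, 0, 0],
}
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

thresholds = fit_label_thresholds(similarities, gold, candidates)
```

In this toy example a uniform 0.5 cut-off would miss every "politics" instance, whereas the per-label search settles on a much lower threshold for that label.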
☆ TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code
Large language models (LLMs) have shown remarkable proficiency in code
generation tasks across various programming languages. However, their outputs
often contain subtle but critical vulnerabilities, posing significant risks
when deployed in security-sensitive or mission-critical systems. This paper
introduces TypePilot, an agentic AI framework designed to enhance the security
and robustness of LLM-generated code by leveraging strongly typed and
verifiable languages, using Scala as a representative example. We evaluate the
effectiveness of our approach in two settings: formal verification with the
Stainless framework and general-purpose secure code generation. Our experiments
with leading open-source LLMs reveal that while direct code generation often
fails to enforce safety constraints, as does naive prompting for more secure
code, our type-focused agentic pipeline substantially mitigates input
validation and injection vulnerabilities. The results demonstrate the potential
of structured, type-guided LLM workflows to advance the state of the art in
trustworthy automated code generation for high-assurance domains.
☆ $How^{2}$: How to learn from procedural How-to questions
An agent facing a planning problem can use answers to how-to questions to
reduce uncertainty and fill knowledge gaps, helping it solve both current and
future tasks. However, their open-ended nature, where valid answers to "How do
I X?" range from executable actions to high-level descriptions of X's
sub-goals, makes them challenging for AI agents to ask, and for AI experts to
answer, in ways that support efficient planning. We introduce $How^{2}$, a
memory agent framework that enables agents to ask how-to questions, store the
answers, and reuse them for lifelong learning in interactive environments. We
evaluate our approach in Plancraft, a Minecraft crafting environment, where
agents must complete an assembly task by manipulating inventory items. Using
teacher models that answer at varying levels of abstraction, from executable
action sequences to high-level subgoal descriptions, we show that lifelong
learning agents benefit most from answers that are abstracted and decoupled
from the current state. $How^{2}$ offers a way for LLM-based agents to improve
their planning capabilities over time by asking questions in interactive
environments.
☆ Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Current approaches for strengthening LLM reasoning tend to introduce a
training bias toward human-like reasoning trajectories. In step-wise preference
optimization, in particular, dependence on human or higher-capacity model
annotations for intermediate steps limits exploration of alternative,
non-human-like reasoning paths and thus constrains achievable performance.
Furthermore, through a small-scale pilot study, we observed that in
approximately 75% of cases, the model's first erroneous step occurs after the
lowest-confidence point. This suggests that guiding the model at its
lowest-confidence point before an error provides more accurate supervision than
locating the first explicit error. In this paper, we propose Confidence-Guided
Reasoning Path Preference Optimization (CGPO), a method that leverages a
confidence signal to identify points of maximal uncertainty in the model's
reasoning process and applies self-generated, non-human-like reasoning-path
guidance to mitigate trajectory drift. Our experiments span diverse models
applied to both code and mathematical reasoning tasks. The results show that,
with the same amount of training data, our method using data generated by a
small model can achieve better performance in most cases compared with
approaches using data generated by a strong model or human-annotated data.
comment: 13 pages
☆ VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu
Recent advances in large audio language models (LALMs) have greatly enhanced
multimodal conversational systems. However, existing benchmarks remain limited
-- they are mainly English-centric, rely on synthetic speech, and lack
comprehensive, discriminative evaluation across multiple dimensions. To address
these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality
Chinese benchmark built entirely on real human speech. VCB Bench evaluates
LALMs from three complementary perspectives: instruction following (including
speech-level control beyond text commands), knowledge understanding (general
knowledge, reasoning, and daily dialogue), and robustness (stability under
perturbations in content, environment, and speaker traits). Experiments on
representative LALMs reveal notable performance gaps and highlight future
directions for improvement. VCB Bench provides a reproducible and fine-grained
evaluation framework, offering standardized methodology and practical insights
for advancing Chinese voice conversational models.
comment: 20 pages, 5 figures
☆ Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui
Autoregressive (AR) models remain the standard for natural language
generation but still suffer from high latency due to strictly sequential
decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream,
mitigate this by generating in parallel, yet they suffer from two core
limitations: information loss, as predictive distributions for non-finalized
tokens are discarded at each step, and premature commitment, where local
decisions are made without sufficient global coordination. We introduce Latent
Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a
Predictive Feedback Loop. The first stage maintains masked positions as
distributional mixtures of predicted tokens and the mask embedding, allowing
the model to establish more globally consistent beliefs. The second stage
progressively finalizes confident tokens while retaining uncertain ones for
iterative feedback. KL-divergence dynamics provide a principled and reliable
criterion for convergence and early stopping. Experiments across coding
(HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that
LRD improves accuracy while delivering speedups of up to 10.6x, making it a
strong and versatile alternative for parallel sequence generation.
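
The two ingredients named in the abstract above (masked positions kept as a mixture of predicted-token embeddings and the mask embedding, and a KL-divergence convergence check) can be illustrated with a minimal sketch. The mixing coefficient `alpha`, the function names, and the choice to compare successive predictive distributions are assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List


def soft_embedding(probs: List[float], token_embs: List[List[float]],
                   mask_emb: List[float], alpha: float) -> List[float]:
    """Blend the mask embedding with the expected token embedding.

    alpha is a hypothetical mixing weight: 0 keeps the pure mask embedding,
    1 uses only the probability-weighted average of token embeddings.
    """
    dim = len(mask_emb)
    expected = [sum(p * emb[d] for p, emb in zip(probs, token_embs))
                for d in range(dim)]
    return [(1.0 - alpha) * m + alpha * e for m, e in zip(mask_emb, expected)]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


# Toy vocabulary of two tokens with 2-dimensional embeddings.
token_embs = [[1.0, 0.0], [0.0, 1.0]]
mask_emb = [0.0, 0.0]
mixed = soft_embedding([0.5, 0.5], token_embs, mask_emb, alpha=1.0)

# A small KL between successive steps would suggest the beliefs have settled.
step_kl = kl_divergence([0.5, 0.5], [0.5, 0.5])
```

In this sketch a refinement loop would stop once `step_kl` falls below a chosen threshold; the threshold value is likewise left open by the abstract.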
☆ Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks
Wenya Xie, Qingying Xiao, Yu Zheng, Xidong Wang, Junying Chen, Ke Ji, Anningzhe Gao, Prayag Tiwari, Xiang Wan, Feng Jiang, Benyou Wang
The rise of large language models (LLMs) has transformed healthcare by
offering clinical guidance, yet their direct deployment to patients poses
safety risks due to limited domain expertise. To mitigate this, we propose
repositioning LLMs as clinical assistants that collaborate with experienced
physicians rather than interacting with patients directly. We conduct a
two-stage inspiration-feedback survey to identify real-world needs in clinical
workflows. Guided by this, we construct DoctorFLAN, a large-scale Chinese
medical dataset comprising 92,000 Q&A instances across 22 clinical tasks and 27
specialties. To evaluate model performance in doctor-facing applications, we
introduce DoctorFLAN-test (550 single-turn Q&A items) and DotaBench (74
multi-turn conversations). Experimental results with over ten popular LLMs
demonstrate that DoctorFLAN notably improves the performance of open-source
LLMs in medical contexts, facilitating their alignment with physician workflows
and complementing existing patient-oriented models. This work contributes a
valuable resource and framework for advancing doctor-centered medical LLM
development.
☆ LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models
Joint logical-numerical reasoning remains a major challenge for language
models, yet existing datasets rely on fixed rule sets and offer limited control
over task complexity, constraining their generalizability for evaluation and
training. We present LogiNumSynth, a flexible natural language problem
synthesizer that synthesizes tasks requiring proficiency in joint logical
reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g.,
arithmetic computation). LogiNumSynth supports fine-grained control over
reasoning world richness, logical reasoning depth, and the complexity of
numerical computations, enabling flexible data synthesis across difficulty
levels. We demonstrate three key contributions: (1) Synthesizer -- synthesizing
fully controllable joint reasoning tasks over natural language; (2) Evaluation
& Process Analysis -- evaluating both process accuracy and answer accuracy; (3)
Targeted Training -- using synthesized data to enhance LLMs' reasoning
performance. Experiments with multiple LLMs highlight persistent weaknesses in
logical-numerical reasoning, showing that LogiNumSynth can serve as both a
diagnostic tool and a source of targeted supervision for advancing integrated
reasoning skills.
comment: 30 pages, 3 figures
☆ Automating Structural Engineering Workflows with Large Language Model Agents
We introduce MASSE, the first Multi-Agent System for Structural
Engineering, effectively integrating large language model (LLM)-based agents
with real-world engineering workflows. Structural engineering is a fundamental
yet traditionally stagnant domain, with core workflows remaining largely
unchanged for decades despite its substantial economic impact and global market
size. Recent advancements in LLMs have significantly enhanced their ability to
perform complex reasoning, long-horizon planning, and precise tool utilization
-- capabilities well aligned with structural engineering tasks such as
interpreting design codes, executing load calculations, and verifying
structural capacities. We present a proof-of-concept showing that most
real-world structural engineering workflows can be fully automated through a
training-free LLM-based multi-agent system. MASSE enables immediate deployment
in professional environments, and our comprehensive validation on real-world
case studies demonstrates that it can reduce expert workload from approximately
two hours to mere minutes, while enhancing both reliability and accuracy in
practical engineering scenarios.
comment: Code: https://github.com/DelosLiang/masse
☆ DND: Boosting Large Language Models with Dynamic Nested Depth
We introduce Dynamic Nested Depth (DND), a novel method that improves
performance for off-the-shelf LLMs by selecting critical tokens to reprocess in
a nested depth manner. Specifically, at the end of the given transformer layer,
DND identifies more critical tokens with a router and feeds them back for an
extra round of processing, effectively ``reviewing'' difficult tokens while
avoiding redundant computation for easier ones. The dynamic selection mechanism
is tailored for precise control via two novel strategies: a router controlling
loss to enhance token selection distinguishability, and a threshold control
scheme to ensure selection stability. We demonstrate the effectiveness of DND
by directly integrating it into pre-trained dense and MoE models during a
post-training phase. On diverse benchmarks, this approach boosts the
performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by
0.87%, all with a minimal parameter and computing increase.
comment: TL;DR: We introduce Dynamic Nested Depth (DND), an efficient paradigm
that adaptively identifies critical tokens and selectively deepens their
computation via nested re-processing
☆ ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
Large language models (LLMs) are increasingly under scrutiny for perpetuating
identity-based discrimination in high-stakes domains such as hiring,
particularly against people with disabilities (PwD). However, existing research
remains largely Western-centric, overlooking how intersecting forms of
marginalization--such as gender and caste--shape experiences of PwD in the
Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring
scenarios spanning diverse disability, gender, nationality, and caste profiles.
To capture subtle intersectional harms and biases, we introduce ABLEIST
(Ableism, Inspiration, Superhumanization, and Tokenism), a set of five
ableism-specific and three intersectional harm metrics grounded in disability
studies literature. Our results reveal significant increases in ABLEIST harms
towards disabled candidates--harms that many state-of-the-art models failed to
detect. These harms were further amplified by sharp increases in intersectional
harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates,
highlighting critical blind spots in current safety tools and the need for
intersectional safety evaluations of frontier models in high-stakes domains
like hiring.
comment: 28 pages, 11 figures, 16 tables. In submission
☆ DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
Wei-Chieh Huang, Henry Peng Zou, Yaozu Wu, Dongyuan Li, Yankai Chen, Weizhi Zhang, Yangning Li, Angelo Zangari, Jizhou Guo, Chunyu Miao, Liancheng Fang, Langzhou He, Renhe Jiang, Philip S. Yu
Deep research frameworks have shown promising capabilities in synthesizing
comprehensive reports from web sources. While deep research possesses
significant potential to address complex issues through planning and research
cycles, existing frameworks lack sufficient evaluation procedures
and stage-specific protections. They typically treat evaluation as exact match
accuracy of question-answering, but overlook crucial aspects of report quality
such as credibility, coherence, breadth, depth, and safety. This oversight may
result in hazardous or malicious sources being integrated into the final
report. To address these issues, we introduce DEEPRESEARCHGUARD, a
comprehensive framework featuring four-stage safeguards with open-domain
evaluation of references and reports. We assess performance across multiple
metrics, e.g., defense success rate and over-refusal rate, and five key report
dimensions. In the absence of a suitable safety benchmark, we introduce
DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation
spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash,
DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success
rate improvement of 18.16% while reducing over-refusal rate by 6%. The input
guard provides the most substantial early-stage protection by filtering out
obvious risks, while the plan and research guards enhance citation discipline
and source credibility. Through extensive experiments, we show that
DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware
defenses that effectively block harmful content propagation, while
systematically improving report quality without excessive over-refusal rates.
The code can be found via https://github.com/Jasonya/DeepResearchGuard.
☆ A Survey on Agentic Multimodal Large Language Models
Huanjin Yao, Ruifei Zhang, Jiaxing Huang, Jingyi Zhang, Yibo Wang, Bo Fang, Ruolin Zhu, Yongcheng Jing, Shunyu Liu, Guanbin Li, Dacheng Tao
With the recent emergence of revolutionary autonomous agentic systems,
the research community is witnessing a significant shift from traditional static,
passive, and domain-specific AI agents toward more dynamic, proactive, and
generalizable agentic AI. Motivated by the growing interest in agentic AI and
its potential trajectory toward AGI, we present a comprehensive survey on
Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we
explore the emerging paradigm of agentic MLLMs, delineating their conceptual
foundations and distinguishing characteristics from conventional MLLM-based
agents. We establish a conceptual framework that organizes agentic MLLMs along
three fundamental dimensions: (i) Agentic internal intelligence functions as
the system's commander, enabling accurate long-horizon planning through
reasoning, reflection, and memory; (ii) Agentic external tool invocation,
whereby models proactively use various external tools to extend their
problem-solving capabilities beyond their intrinsic knowledge; and (iii)
Agentic environment interaction further situates models within virtual or
physical environments, allowing them to take actions, adapt strategies, and
sustain goal-directed behavior in dynamic real-world scenarios. To further
accelerate research in this area for the community, we compile open-source
training frameworks as well as training and evaluation datasets for developing agentic
MLLMs. Finally, we review the downstream applications of agentic MLLMs and
outline future research directions for this rapidly evolving field. To
continuously track developments in this rapidly evolving field, we will also
actively update a public repository at
https://github.com/HJYao00/Awesome-Agentic-MLLMs.
☆ Secret-Protected Evolution for Differentially Private Synthetic Text Generation
Text data has become extremely valuable for large language models (LLMs) and
may even lead to artificial general intelligence (AGI). A lot of high-quality text
in the real world is private and cannot be freely used due to privacy concerns.
Therefore, differentially private (DP) synthetic text generation has been
proposed, aiming to produce high-utility synthetic data while protecting
sensitive information. However, existing DP synthetic text generation imposes
uniform guarantees that often overprotect non-sensitive content, resulting in
substantial utility loss and computational overhead. Therefore, we propose
Secret-Protected Evolution (SecPE), a novel framework that extends private
evolution with secret-aware protection. Theoretically, we show that SecPE
satisfies $(\mathrm{p}, \mathrm{r})$-secret protection, constituting a
relaxation of Gaussian DP that enables tighter utility-privacy trade-offs,
while also substantially reducing computational complexity relative to baseline
methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE
consistently achieves lower Fréchet Inception Distance (FID) and higher
downstream task accuracy than GDP-based Aug-PE baselines, while requiring less
noise to attain the same level of protection. Our results highlight that
secret-aware guarantees can unlock more practical and effective
privacy-preserving synthetic text generation.
☆ Revisiting Model Interpolation for Efficient Reasoning
Model merging, typically on Instruct and Thinking models, has shown
remarkable performance for efficient reasoning. In this paper, we
systematically revisit the simplest merging method that interpolates two
weights directly. Particularly, we observe that model interpolation follows a
three-stage evolutionary paradigm with distinct behaviors on the reasoning
trajectory. These dynamics provide a principled guide for navigating the
performance-cost trade-off. Empirical results demonstrate that a strategically
interpolated model surprisingly surpasses sophisticated model merging baselines
on both efficiency and effectiveness. We further validate our findings with
extensive ablation studies on model layers, modules, and decoding strategies.
Ultimately, this work demystifies model interpolation and offers a practical
framework for crafting models with precisely targeted reasoning capabilities.
Code is available at https://github.com/wutaiqiang/MI.
comment: 14 pages, 6 figures, 7 tables. Work in progress
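
The "simplest merging method that interpolates two weights directly" can be written down in a few lines. The sketch below assumes both checkpoints share the same parameter names and shapes and represents each parameter as a flat list of floats; the names `interpolate` and `lam` are illustrative and not taken from the paper's code.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate(instruct: Weights, thinking: Weights, lam: float) -> Weights:
    """Return (1 - lam) * instruct + lam * thinking, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if instruct.keys() != thinking.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Weights = {}
    for name, w_instruct in instruct.items():
        w_thinking = thinking[name]
        if len(w_instruct) != len(w_thinking):
            raise ValueError(f"shape mismatch for parameter {name}")
        merged[name] = [(1.0 - lam) * a + lam * b
                        for a, b in zip(w_instruct, w_thinking)]
    return merged


# Toy checkpoints with a single two-element parameter.
instruct = {"layer.weight": [0.0, 2.0]}
thinking = {"layer.weight": [2.0, 4.0]}
merged = interpolate(instruct, thinking, lam=0.5)
```

Sweeping `lam` from 0 to 1 moves the merged model from the Instruct weights to the Thinking weights, which is the axis along which the abstract reports its three-stage behavior.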
☆ Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT)
as a key method to adapt pre-trained models to domain-specific tasks such as
mathematical reasoning. However, standard SFT uniformly penalizes all tokens,
neglecting that only a small subset of critical tokens determines reasoning
correctness. This uniform supervision often causes reduced output diversity and
limited generalization. We propose Critical Token Fine-tuning (CFT), a simple
yet effective approach that updates only tokens identified as functionally
indispensable via counterfactual perturbations. By focusing gradient signals on
these decisive reasoning steps while preserving the diversity of non-critical
tokens, CFT can enhance both generation and diversity. Extensive experiments on
five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical
reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of
tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time
scaling through improved sampling diversity and provides a stronger
initialization for reinforcement learning, sustaining performance gains in
later training stages while maintaining higher entropy for better exploration.
These results highlight CFT as a practical and general framework for efficient
and robust LLM fine-tuning.
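
As a rough illustration of restricting the training signal to critical tokens, the sketch below averages the negative log-likelihood over only those positions flagged as critical. It assumes the critical mask has already been computed (the counterfactual-perturbation step is not shown), and the function name is hypothetical.

```python
from typing import List


def critical_token_loss(token_log_probs: List[float],
                        critical_mask: List[bool]) -> float:
    """Mean negative log-likelihood over critical positions only.

    Non-critical tokens contribute nothing, so they receive no gradient
    from this loss term.
    """
    if len(token_log_probs) != len(critical_mask):
        raise ValueError("one mask entry is required per token")
    selected = [-lp for lp, keep in zip(token_log_probs, critical_mask) if keep]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


# Three tokens, of which the first and third are marked critical.
loss = critical_token_loss([-1.0, -2.0, -3.0], [True, False, True])
```

Standard SFT corresponds to a mask that is true everywhere; the abstract reports that fewer than 12% of tokens are kept under CFT.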
☆ RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection
Hate speech remains prevalent in human society and continues to evolve in its
forms and expressions. Modern advances in the internet and online anonymity
accelerate its rapid spread and complicate its detection. However, hate speech
datasets exhibit diverse characteristics primarily because they are constructed
from different sources and platforms, each reflecting different linguistic
styles and social contexts. Despite this diversity, prior studies on hate
speech detection often rely on fixed methodologies without adapting to
data-specific features. We introduce RV-HATE, a detection framework designed to
account for the dataset-specific characteristics of each hate speech dataset.
RV-HATE consists of multiple specialized modules, where each module focuses on
distinct linguistic or contextual features of hate speech. The framework
employs reinforcement learning to optimize weights that determine the
contribution of each module for a given dataset. A voting mechanism then
aggregates the module outputs to produce the final decision. RV-HATE offers two
primary advantages: (1) it improves detection accuracy by tailoring the
detection process to dataset-specific attributes, and (2) it also provides
interpretable insights into the distinctive features of each dataset.
Consequently, our approach effectively addresses implicit hate speech and
achieves superior performance compared to conventional static methods. Our code
is available at https://github.com/leeyejin1231/RV-HATE.
comment: 10 pages, 9 figures, 12 tables
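
The aggregation step described above, where per-dataset weights set each module's contribution to a final vote, might look like the following sketch. It assumes every module emits a hate-speech probability and that the weights have already been learned (the reinforcement learning step is not shown); the normalized weighted average and the 0.5 threshold are illustrative choices, not details confirmed by the abstract.

```python
from typing import List


def weighted_vote(module_probs: List[float], weights: List[float],
                  threshold: float = 0.5) -> bool:
    """Combine module probabilities with non-negative weights.

    Returns True (hate speech) when the normalized weighted score
    reaches the threshold.
    """
    if len(module_probs) != len(weights) or not weights:
        raise ValueError("one weight is required per module")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for w, p in zip(weights, module_probs)) / total
    return score >= threshold


probs = [0.9, 0.2, 0.1]
# With equal weights the two sceptical modules outvote the first one.
equal = weighted_vote(probs, [1.0, 1.0, 1.0])
# Trusting the first module more, as learned weights might, flips the decision.
tuned = weighted_vote(probs, [3.0, 1.0, 1.0])
```

Inspecting the learned weights for a dataset is one way the interpretability claim in the abstract could be realized, since a large weight points to the features that matter for that dataset.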
☆ Judge Before Answer: Can MLLM Discern the False Premise in Question?
Multimodal large language models (MLLMs) have witnessed astonishing
advancements in recent years. Despite these successes, MLLMs remain vulnerable
to false premise problems. However, existing benchmarks targeting this issue
are limited in scope: they often lack fine-grained categorization, exhibit
insufficient coverage, and thus fail to provide a rigorous evaluation of the
ability of models to recognize false premises. To bridge this gap, we introduce
a fully automated pipeline for constructing a comprehensive benchmark of false
premise questions. Our method systematically categorizes the premises into
three main types and thirteen subtypes according to the abilities required to
identify the premises, resulting in the JBA dataset. Results show current MLLMs
still struggle with false premise recognition. Building upon this benchmark, we
further propose a recognition enhancement framework tailored to strengthen the
robustness of MLLMs to detect false premises. Extensive experiments demonstrate
that models trained with our framework achieve significant improvements in
false premise recognition.
☆ KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification
Toxic content has become an increasingly critical social issue with the rapid
expansion of online communication. While numerous studies have explored methods for
detecting and detoxifying such content, most have focused primarily on English,
leaving low-resource languages underrepresented. Consequently, Large Language
Models (LLMs) often struggle to identify and neutralize toxic expressions in
these languages. This challenge becomes even more pronounced when users employ
obfuscation techniques to evade detection systems. Therefore, we propose
KOTOX: Korean Toxic Dataset for deobfuscation and detoxification to
address this issue. We categorize various obfuscation approaches based on
linguistic characteristics of Korean and define a set of transformation rules
grounded in real-world examples. Using these rules, we construct three dataset
versions (easy, normal, and hard) representing different levels of obfuscation
difficulty. This is the first dataset that simultaneously supports
deobfuscation and detoxification for the Korean language. We expect it to
facilitate better understanding and mitigation of obfuscated toxic content in
LLMs for low-resource languages. Our code and data are available at
https://github.com/leeyejin1231/KOTOX.
comment: 25 pages, 5 figures, 25 tables
☆ Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Reasoning ability has become a defining capability of Large Language Models
(LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as
a key paradigm to enhance it. However, RLVR training often suffers from policy
entropy collapse, where the policy becomes overly deterministic, hindering
exploration and limiting reasoning performance. While entropy regularization is
a common remedy, its effectiveness is highly sensitive to the fixed
coefficient, making it unstable across tasks and models. In this work, we
revisit entropy regularization in RLVR and argue that its potential has been
largely underestimated. Our analysis shows that (i) tasks of varying difficulty
demand distinct exploration intensities, and (ii) balanced exploration may
require the policy entropy to be maintained within a moderate range below its
initial level. Therefore, we propose Adaptive Entropy Regularization (AER)--a
framework that dynamically balances exploration and exploitation via three
components: difficulty-aware coefficient allocation, initial-anchored target
entropy, and dynamic global coefficient adjustment. Experiments on multiple
mathematical reasoning benchmarks show that AER consistently outperforms
baselines, improving both reasoning accuracy and exploration capability.
comment: 16 pages, 4 figures
☆ Punctuation-aware treebank tree binarization
This article presents a curated resource and evaluation suite for
punctuation-aware treebank binarization. Standard binarization pipelines drop
punctuation before head selection, which alters constituent shape and harms
head-child identification. We release (1) a reproducible pipeline that
preserves punctuation as sibling nodes prior to binarization, (2) derived
artifacts and metadata (intermediate @X markers, reversibility signatures,
alignment indices), and (3) an accompanying evaluation suite covering
head-child prediction, round-trip reversibility, and structural compatibility
with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware
preprocessing improves head prediction accuracy from 73.66% (Collins rules)
and 86.66% (MLP) to 91.85% with the same classifier, and achieves competitive
alignment against CCGbank derivations. All code, configuration files, and
documentation are released to enable replication and extension to other
corpora.
☆ The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
Bias in large language models (LLMs) remains a persistent challenge,
manifesting in stereotyping and unfair treatment across social groups. While
prior research has primarily focused on individual models, the rise of
multi-agent systems (MAS), where multiple LLMs collaborate and communicate,
introduces new and largely unexplored dynamics in bias emergence and
propagation. In this work, we present a comprehensive study of stereotypical
bias in MAS, examining how internal specialization, underlying LLMs and
inter-agent communication protocols influence bias robustness, propagation, and
amplification. We simulate social contexts where agents represent different
social groups and evaluate system behavior under various interaction and
adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are
generally less robust than single-agent systems, with bias often emerging early
through in-group favoritism. However, cooperative and debate-based
communication can mitigate bias amplification, while more robust underlying
LLMs improve overall system stability. Our findings highlight critical factors
shaping fairness and resilience in multi-agent LLM systems.
comment: 15 pages, 19 figures, Preprint. Under review
☆ End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study
We present a reproducibility study of the state-of-the-art neural
architecture for sequence labeling proposed by Ma and Hovy
(2016). The original BiLSTM-CNN-CRF model combines
character-level representations via Convolutional Neural Networks (CNNs),
word-level context modeling through Bi-directional Long Short-Term Memory
networks (BiLSTMs), and structured prediction using Conditional Random Fields
(CRFs). This end-to-end approach eliminates the need for hand-crafted features
while achieving excellent performance on named entity recognition (NER) and
part-of-speech (POS) tagging tasks. Our implementation successfully reproduces
the key results, achieving 91.18% F1-score on CoNLL-2003 NER and demonstrating
the model's effectiveness across sequence labeling tasks. We provide a detailed
analysis of the architecture components and release an open-source PyTorch
implementation to facilitate further research.
☆ Evaluating Language Models' Evaluations of Games
Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths
Reasoning is not just about solving problems -- it is also about evaluating
which problems are worth solving at all. Evaluations of artificial intelligence
(AI) systems have primarily focused on problem solving, historically by studying how
models play games such as chess and Go. In this paper, we advocate for a new
paradigm that assesses AI systems' evaluation of games. First, we introduce a
formalism for evaluating such evaluations. We then leverage a large-scale
dataset of over 100 novel board games and over 450 human judgments to compare
evaluations produced by modern language and reasoning models against those of
people and symbolic computational agents. We consider two kinds of evaluative
queries: assessing the payoff (or fairness) and the funness of games. These
queries span two dimensions relevant to the design of evaluations of AI
evaluations: how complex a query is to compute and how difficult a query is to
quantify. Our results show that reasoning models are generally more aligned to
people in their evaluations of games than non-reasoning language models.
However, we observe a non-monotonic relationship: as models get closer to
game-theoretic optimal, their fit to human data weakens. We also observe more
"jaggedness" across models for assessing funness, in line with the greater
difficulty of quantifying this query. Across queries and games, reasoning
models show highly variable and unpredictable resource usage when assessing
queries, pointing to the importance of imbuing more resource-rational
meta-reasoning in language and reasoning models.
comment: Pre-print
☆ GapDNER: A Gap-Aware Grid Tagging Model for Discontinuous Named Entity Recognition IJCNN 2025
In biomedical fields, one named entity may consist of a series of
non-adjacent tokens and overlap with other entities. Previous methods recognize
discontinuous entities by connecting entity fragments or internal tokens, which
face challenges of error propagation and decoding ambiguity due to the wide
variety of span or word combinations. To address these issues, we deeply
explore discontinuous entity structures and propose an effective Gap-aware grid
tagging model for Discontinuous Named Entity Recognition, named GapDNER. Our
GapDNER innovatively applies representation learning on the context gaps
between entity fragments to resolve decoding ambiguity and enhance
discontinuous NER performance. Specifically, we treat the context gap as an
additional type of span and convert span classification into a token-pair grid
tagging task. Subsequently, we design two interactive components to
comprehensively model token-pair grid features from both intra- and inter-span
perspectives. The intra-span regularity extraction module employs the biaffine
mechanism along with linear attention to capture the internal regularity of
each span, while the inter-span relation enhancement module utilizes
criss-cross attention to obtain semantic relations among different spans. At
the inference stage of entity decoding, we assign a directed edge to each
entity fragment and context gap, then use the BFS algorithm to search for all
valid paths from the head to tail of grids with entity tags. Experimental
results on three datasets demonstrate that our GapDNER achieves new
state-of-the-art performance on discontinuous NER and exhibits remarkable
advantages in recognizing complex entity structures.
comment: Accepted by IJCNN 2025
☆ Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong
Training student models on synthetic data generated by strong teacher models
is a promising way to distill the capabilities of teachers. However, recent
studies show that stronger models are not always optimal teachers, revealing a
mismatch between teacher outputs and student learnability. To address this
issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis
strategy that operates under a new ``Route then Generate'' paradigm to create
data tailored to each student model, enabling it to learn more effectively.
Specifically, PerSyn first assigns each prompt to its optimal teacher via a
query-level router that jointly considers student learnability and teacher
response quality. Each teacher then synthesizes data only for its assigned
prompts, making the process more efficient than the conventional ``Generate
then Select'' paradigm, where all teachers must generate parallel responses for
the entire prompt set before constructing the final dataset. Extensive
experiments across different model families and scales demonstrate that PerSyn
consistently achieves superior or comparable performance to all baselines in
instruction tuning and math reasoning settings. Further analysis verifies the
effectiveness of PerSyn and offers extra insights to propel future research.
comment: 19 pages, 10 figures
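
The "Route then Generate" idea, in which each prompt is assigned to a single teacher before any data is synthesized, can be sketched as below. The sketch assumes a router has already produced one score per (prompt, teacher) pair that combines student learnability and teacher response quality; how that score is computed is not shown, and all names are hypothetical.

```python
from typing import Dict, List


def route_prompts(scores: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Assign each prompt to its highest-scoring teacher.

    scores maps prompt -> {teacher name -> router score}. The result maps
    teacher name -> prompts that teacher should generate responses for.
    """
    assignments: Dict[str, List[str]] = {}
    for prompt, teacher_scores in scores.items():
        if not teacher_scores:
            raise ValueError(f"no teacher scores for prompt {prompt!r}")
        best_teacher = max(teacher_scores, key=teacher_scores.get)
        assignments.setdefault(best_teacher, []).append(prompt)
    return assignments


router_scores = {
    "prompt-1": {"teacher-A": 0.9, "teacher-B": 0.2},
    "prompt-2": {"teacher-A": 0.1, "teacher-B": 0.8},
}
plan = route_prompts(router_scores)
```

Each teacher then generates only for its own list, so the number of generations equals the number of prompts rather than prompts times teachers, which is the efficiency argument made against "Generate then Select".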
☆ ADVICE: Answer-Dependent Verbalized Confidence Estimation
Recent progress in large language models (LLMs) has enabled them to express
their confidence in natural language, enhancing transparency and reliability.
However, they are often overconfident, the cause of which
remains poorly understood. In this work, we conduct a detailed analysis of the
dynamics underlying verbalized confidence and identify answer-independence as a
key factor, defined as the model's failure to condition confidence on its own
answer. To address this, we propose ADVICE (Answer-Dependent Verbalized
Confidence Estimation), a fine-tuning framework that facilitates
answer-grounded confidence estimation. Extensive experiments show that ADVICE
substantially improves confidence calibration while preserving task
performance. Further analyses confirm that ADVICE strengthens
answer-groundedness, leading to more balanced and well-calibrated confidence
distributions. Our findings shed light on the origin of overconfidence and
establish a framework for more trustworthy confidence verbalization.
☆ LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System EMNLP2025
Yu Chao, Siyu Lin, xiaorong wang, Zhu Zhang, Zihan Zhou, Haoyu Wang, Shuo Wang, Jie Zhou, Zhiyuan Liu, Maosong Sun
We introduce LLM x MapReduce-V3, a hierarchically modular agent system
designed for long-form survey generation. Building on the prior work, LLM x
MapReduce-V2, this version incorporates a multi-agent architecture where
individual functional components, such as skeleton initialization, digest
construction, and skeleton refinement, are implemented as independent
model-context-protocol (MCP) servers. These atomic servers can be aggregated
into higher-level servers, creating a hierarchically structured system. A
high-level planner agent dynamically orchestrates the workflow by selecting
appropriate modules based on their MCP tool descriptions and the execution
history. This modular decomposition facilitates human-in-the-loop intervention,
affording users greater control and customization over the research process.
Through a multi-turn interaction, the system precisely captures the intended
research perspectives to generate a comprehensive skeleton, which is then
developed into an in-depth survey. Human evaluations demonstrate that our
system surpasses representative baselines in both content depth and length,
highlighting the strength of MCP-based modular planning.
comment: Accepted by EMNLP2025 System Demonstration
☆ Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks SC
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL)
systems, enabling non-expert users to query industrial databases using natural
language. While test-time scaling strategies have shown promise in LLM-based
solutions, their effectiveness in real-world applications, especially with the
latest reasoning models, remains uncertain. In this work, we benchmark six
lightweight, industry-oriented test-time scaling strategies and four LLMs,
including two reasoning models, evaluating their performance on the BIRD
Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference
latency and token consumption, providing insights relevant for practical system
deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot
demonstrations consistently enhance performance for both general-purpose and
reasoning-focused LLMs. However, introducing additional workflow steps yields
mixed results, and base model selection plays a critical role. This work sheds
light on the practical trade-offs between accuracy, efficiency, and complexity
when deploying Text2SQL systems.
comment: Accepted at COLM 2025 SCALR Workshop
♻ ☆ Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability
Enhancing the mathematical reasoning capabilities of LLMs has garnered
significant attention in both the mathematical and computer science
communities. Recent works have made substantial progress in both Natural
Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the
potential of pure Reinforcement Learning (RL) methods on base models. However,
RL approaches struggle to impart new capabilities not presented in the base
model, highlighting the need to integrate more knowledge like FL into NL math
reasoning effectively. Yet, this integration is challenging due to inherent
disparities in problem structure and reasoning format between NL and FL. To
address these challenges, we introduce **NL-FL HybridReasoning (NFL-HR)**, an
end-to-end framework designed to incorporate the FL expert into NL math
problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL
Problem Alignment method, which reformulates the Question-Answering (QA)
problems in NL as existence theorems in FL. Subsequently, the Mixed Problem
Input technique we provide enables the FL reasoner to handle both QA and
existence problems concurrently. Lastly, we mitigate the NL and FL output
format gap in reasoning through an LLM-based Answer Extraction mechanism.
Comprehensive experiments demonstrate that the NFL-HR framework achieves
**89.80%** and **84.34%** accuracy rates on the MATH-500 and the AMC
benchmarks, surpassing the NL baseline by **4.60%** and **4.82%**,
respectively. Notably, some problems resolved by our framework remain unsolved
by the NL baseline model even under a larger number of trials.
♻ ☆ Part-of-speech tagging for Nagamese Language using CRF
This paper investigates part-of-speech tagging, an important task in Natural
Language Processing (NLP) for the Nagamese language. The Nagamese language,
a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily
as a means of communication in trade between the Nagas and people from Assam in
northeast India. A substantial amount of work in part-of-speech tagging has
been done for resource-rich languages like English, Hindi, etc. However, no
work has been done in the Nagamese language. To the best of our knowledge, this
is the first attempt at part-of-speech tagging for the Nagamese Language. The
aim of this work is to identify the part-of-speech for a given sentence in the
Nagamese language. An annotated corpus of 16,112 tokens is created, and a
machine learning technique known as Conditional Random Fields (CRF) is applied
to it. Using CRF, an overall tagging accuracy of 85.70% is achieved, with
precision and recall of 86% and an F1-score of 85%.
Keywords. Nagamese, NLP, part-of-speech, machine learning, CRF.
comment: 8 pages
♻ ☆ Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate ICML
While multi-agent debate has been proposed as a promising strategy for
improving AI reasoning ability, we find that debate can sometimes be harmful
rather than helpful. Prior work has primarily focused on debates within
homogeneous groups of agents, whereas we explore how diversity in model
capabilities influences the dynamics and outcomes of multi-agent interactions.
Through a series of experiments, we demonstrate that debate can lead to a
decrease in accuracy over time - even in settings where stronger (i.e., more
capable) models outnumber their weaker counterparts. Our analysis reveals that
models frequently shift from correct to incorrect answers in response to peer
reasoning, favoring agreement over challenging flawed reasoning. We perform
additional experiments investigating various potential contributing factors to
these harmful shifts - including sycophancy, social conformity, and model and
task type. These results highlight important failure modes in the exchange of
reasons during multi-agent debate, suggesting that naive applications of debate
may cause performance degradation when agents are neither incentivised nor
adequately equipped to resist persuasive but incorrect reasoning.
comment: ICML MAS Workshop 2025
♻ ☆ Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework
We present Multi-Scale Manifold Alignment (MSMA), an information-geometric
framework that decomposes LLM representations into local, intermediate, and
global manifolds and learns cross-scale mappings that preserve geometry and
information. Across GPT-2, BERT, RoBERTa, and T5, we observe consistent
hierarchical patterns and find that MSMA improves alignment metrics under
multiple estimators (e.g., relative KL reduction and MI gains with statistical
significance across seeds). Controlled interventions at different scales yield
distinct and architecture-dependent effects on lexical diversity, sentence
structure, and discourse coherence. While our theoretical analysis relies on
idealized assumptions, the empirical results suggest that multi-objective
alignment offers a practical lens for analyzing cross-scale information flow
and guiding representation-level control.
♻ ☆ Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Multimodal models have achieved remarkable progress in recent years.
Nevertheless, they continue to exhibit notable limitations in spatial
understanding and reasoning, the very capability that anchors artificial
general intelligence in the physical world. With the recent release of GPT-5,
allegedly the most powerful AI model to date, it is timely to examine where the
leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path
toward spatial intelligence. We first propose a holistic taxonomy of spatial
tasks that unifies existing benchmarks and a standardized protocol for the fair
evaluation of state-of-the-art proprietary and open-source models across eight
key benchmarks, at a cost exceeding ten billion total tokens. Our empirical
study then reveals that (1) GPT-5 demonstrates unprecedented strength in
spatial intelligence (SI), yet (2) still falls significantly short of human
performance across a broad spectrum of SI-tasks. Moreover, we (3) show that
SI-tasks expose greater model capability deficiency than non-SI tasks, to the
extent that (4) proprietary models do not exhibit a decisive advantage when
facing the most difficult ones. In addition, we conduct a qualitative
evaluation across a diverse set of scenarios that are intuitive for humans, yet
fail even the most advanced multimodal models.
♻ ☆ Empirical Investigation of Latent Representational Dynamics in Large Language Models: A Manifold Evolution Perspective
This paper introduces the Dynamical Manifold Evolution Theory (DMET), a
conceptual framework that models large language model (LLM) generation as a
continuous trajectory evolving on a low-dimensional semantic manifold. The
theory characterizes latent dynamics through three interpretable metrics: state
continuity ($C$), attractor compactness ($Q$), and topological persistence
($P$). Together, these capture the smoothness, stability, and structure of
representation evolution. Empirical analyses across multiple Transformer
architectures reveal consistent links between these latent dynamics and text
quality: smoother trajectories correspond to greater fluency, and richer
topological organization correlates with enhanced coherence. Different models
exhibit distinct dynamical regimes, reflecting diverse strategies of semantic
organization in latent space. Moreover, decoding parameters such as temperature
and top-$p$ shape these trajectories in predictable ways, defining a balanced
region that harmonizes fluency and creativity. As a phenomenological rather
than first-principles framework, DMET provides a unified and testable
perspective for interpreting, monitoring, and guiding LLM behavior, offering
new insights into the interplay between internal representation dynamics and
external text generation quality.
♻ ☆ Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot EMNLP25
In-Context Learning (ICL) is an essential emergent ability of Large Language
Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars
of ICL to enhance the reasoning capability, especially in mathematics tasks.
However, given the continuous advancement of model capabilities, it remains
unclear whether CoT exemplars still benefit recent, stronger models in such
tasks. Through systematic experiments, we find that for recent strong models
such as the Qwen2.5 series, adding traditional CoT exemplars does not improve
reasoning performance compared to Zero-Shot CoT. Instead, their primary
function is to align the output format with human expectations. We further
investigate the effectiveness of enhanced CoT exemplars, constructed using
answers from advanced models such as Qwen2.5-Max and
DeepSeek-R1. Experimental results indicate that these enhanced
exemplars still fail to improve the model's reasoning performance. Further
analysis reveals that models tend to ignore the exemplars and focus primarily
on the instructions, leading to no observable gain in reasoning ability.
Overall, our findings highlight the limitations of the current ICL+CoT
framework in mathematical reasoning, calling for a re-examination of the ICL
paradigm and the definition of exemplars.
comment: EMNLP25-findings camera_ready, 19 pages, 22 figures
♻ ☆ ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
We present ViDRiP-LLaVA, the first large multimodal model (LMM) in
computational pathology that integrates three distinct image scenarios,
including single patch images, automatically segmented pathology video clips,
and manually segmented pathology videos. This integration closely mirrors the
natural diagnostic process of pathologists. By generating detailed histological
descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA
bridges visual narratives with diagnostic reasoning. Central to our approach is
the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific
chain-of-thought instructional pairs sourced from educational histopathology
videos on YouTube. Although high-quality data is critical for enhancing
diagnostic reasoning, its creation is time-intensive and limited in volume. To
overcome this challenge, we transfer knowledge from existing single-image
instruction datasets to train on weakly annotated, keyframe-extracted clips,
followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes
a new benchmark in pathology video analysis and offers a promising foundation
for future AI systems that support clinical decision-making through integrated
visual and diagnostic reasoning. Our code, data, and model are publicly
available at: https://github.com/QuIIL/ViDRiP-LLaVA.
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning performance of large
language models (LLMs), particularly on mathematics and programming tasks.
Similar to how traditional RL helps agents explore and learn new strategies,
RLVR is believed to enable LLMs to continuously self-improve, thus acquiring
novel reasoning abilities beyond those of the corresponding base models. In
this study we critically examine the current state of RLVR by systematically
probing the reasoning capability boundaries of RLVR-trained LLMs across various
model families, RL algorithms, and math, coding, and visual reasoning
benchmarks, using pass@k at large k values as the evaluation metric.
Surprisingly, we find that the current training setup does not elicit
fundamentally new reasoning patterns. While RLVR-trained models outperform
their base models at small k (e.g., k = 1), the base models achieve a higher
pass@k score when k is large. Coverage and perplexity analyses show that the
observed reasoning abilities originate from and are bounded by the base model.
Treating the base model as an upper bound, our quantitative analysis shows that
six popular RLVR algorithms perform similarly and remain far from optimal in
leveraging the potential of the base model. By contrast, we find that
distillation can introduce new reasoning patterns from the teacher and
genuinely expand the model's reasoning capabilities. Overall, our findings
suggest that current RLVR methods have not yet realized the potential of RL to
elicit truly novel reasoning abilities in LLMs. This highlights the need for
improved RL paradigms, such as continual scaling and multi-turn
agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
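The pass@k metric used in the RLVR abstract above can be illustrated with the widely used unbiased estimator from the code-generation literature: given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not state which estimator the authors use, so this is a sketch under that assumption; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generations (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved only twice in 256 samples: rare at k = 1, likely at k = 128.
small_k = pass_at_k(n=256, c=2, k=1)
large_k = pass_at_k(n=256, c=2, k=128)
```

With these hypothetical counts, pass@1 is below 0.01 while pass@128 is roughly 0.75, which shows how a model with low single-sample accuracy on a problem can still count as covering it when k is large.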
♻ ☆ Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study
Large-scale web-scraped text corpora used to train general-purpose AI models
often contain harmful demographic-targeted social biases, creating a regulatory
need for data auditing and developing scalable bias-detection methods. Although
prior work has investigated biases in text datasets and related detection
methods, these studies remain narrow in scope. They typically focus on a single
content type (e.g., hate speech), cover limited demographic axes, overlook
biases affecting multiple demographics simultaneously, and analyze limited
techniques. Consequently, practitioners lack a holistic understanding of the
strengths and limitations of recent large language models (LLMs) for automated
bias detection. In this study, we present a comprehensive evaluation framework
aimed at English texts to assess the ability of LLMs in detecting
demographic-targeted social biases. To align with regulatory requirements, we
frame bias detection as a multi-label task using a demographic-focused
taxonomy. We then conduct a systematic evaluation with models across scales and
techniques, including prompting, in-context learning, and fine-tuning. Using
twelve datasets spanning diverse content types and demographics, our study
demonstrates the promise of fine-tuned smaller models for scalable detection.
However, our analyses also expose persistent gaps across demographic axes and
multi-demographic targeted biases, underscoring the need for more effective and
scalable auditing frameworks.
comment: 18 pages
♻ ☆ LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs ACL 2025
Long-context modeling has drawn more and more attention in the area of Large
Language Models (LLMs). Continual training with long-context data has become the
de facto method to equip LLMs with the ability to process long inputs. However,
it still remains an open challenge to measure the quality of long-context
training data. To address this issue, we propose a Long-context data selection
framework with Attention-based Dependency Measurement (LADM), which can
efficiently identify high-quality long-context data from a large-scale,
multi-domain pre-training corpus. LADM leverages the retrieval capabilities of
the attention mechanism to capture contextual dependencies, ensuring a
comprehensive quality measurement of long-context data. Experimental results
show that our LADM framework significantly boosts the performance of LLMs on
multiple long-context tasks with only 1B tokens for continual training.
comment: ACL 2025, our code is available at https://github.com/ZNLP/LADM
♻ ☆ Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization
Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Qirui Mi, Guoqing Liu, Zexu Sun, Mengyue Yang, Dong Li, Weiyu Ma, Ning Yang, Jian Zhao, Jianye Hao, Haifeng Zhang, Jun Wang
Self-Refinement refers to a model's ability to revise its own responses to
produce improved outputs. This capability can also serve as a fundamental
mechanism for Self-Improvement, for example, by reconstructing datasets with
refined results to enhance intrinsic model performance. However, our
comprehensive experiments reveal that large language models (LLMs) show no
clear evidence of inherent Self-Refinement and may even experience response
quality degradation after Self-Refinement. To address this issue, we propose
EVOLVE, a simple and effective framework for eliciting and tracking the
evolution of Self-Refinement through iterative training. We first explore
optimization methods during training to activate the model's Self-Refinement
capability. Then, at inference, we investigate various generation strategies to
further enhance and utilize Self-Refinement while supplying the necessary data
for training. Through synergistic optimization of training and inference
stages, we continually evolve the model's Self-Refinement ability, enabling it
to better refine its own responses. Moreover, we demonstrate the potential of
leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic
model abilities. Experiments show that the evolved Self-Refinement ability
enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3%
length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on
Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks,
improving performance on mathematical reasoning benchmarks such as GSM8K and
MATH.
♻ ☆ NeMo: Needle in a Montage for Video-Language Understanding
Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang
Recent advances in video large language models (VideoLLMs) call for new
evaluation protocols and benchmarks for complex temporal reasoning in
video-language understanding. Inspired by the needle in a haystack test widely
used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed
to assess VideoLLMs' critical reasoning capabilities, including long-context
recall and temporal grounding. To generate video question answering data for
our task, we develop a scalable automated data generation pipeline that
facilitates high-quality data synthesis. Built upon the proposed pipeline, we
present NeMoBench, a video-language benchmark centered on our task.
Specifically, our full set of NeMoBench features 31,378 automatically generated
question-answer (QA) pairs from 13,486 videos with various durations ranging
from seconds to hours. Experiments demonstrate that our pipeline can reliably
and automatically generate high-quality evaluation data, enabling NeMoBench to
be continuously updated with the latest videos. We evaluate 20 state-of-the-art
models on our benchmark, providing extensive results and key insights into
their capabilities and limitations. Our project page is available at:
https://lavi-lab.github.io/NeMoBench.
♻ ☆ On the Interplay between Musical Preferences and Personality through the Lens of Language ECAI2025
Music serves as a powerful reflection of individual identity, often aligning
with deeper psychological traits. Prior research has established correlations
between musical preferences and personality, while separate studies have
demonstrated that personality is detectable through linguistic analysis. Our
study bridges these two research domains by investigating whether individuals'
musical preferences leave traces in their spontaneous language through the lens
of the Big Five personality traits (Openness, Conscientiousness, Extroversion,
Agreeableness, and Neuroticism). Using a carefully curated dataset of over
500,000 text samples from nearly 5,000 authors with reliably identified musical
preferences, we build advanced models to assess personality characteristics.
Our results reveal significant personality differences across fans of five
musical genres. We release resources for future research at the intersection of
computational linguistics, music psychology and personality analysis.
comment: ECAI2025 (Identity-Aware AI workshop)
♻ ☆ Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study
Counter-speech (CS) is a key strategy for mitigating online Hate Speech (HS),
yet defining the criteria to assess its effectiveness remains an open
challenge. We propose a novel computational framework for CS effectiveness
classification, grounded in linguistics, communication and argumentation
concepts. Our framework defines six core dimensions - Clarity, Evidence,
Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness - which we use to
annotate 4,214 CS instances from two benchmark datasets, resulting in a novel
linguistic resource released to the community. In addition, we propose two
classification strategies, multi-task and dependency-based, achieving strong
results (0.94 and 0.96 average F1 respectively on both expert- and user-written
CS), outperforming standard baselines, and revealing strong interdependence
among dimensions.
♻ ☆ From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones
Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
Does RL teach LLMs genuinely new skills, or does it merely activate existing
ones? This question lies at the core of ongoing debates about the role of RL in
LLM post-training. On one side, strong empirical results can be achieved with
RL even without preceding supervised finetuning; on the other, critics argue
that RL contributes little beyond reweighting existing reasoning strategies.
This work provides concrete evidence that LLMs can acquire genuinely new skills
during RL by composing existing ones, mirroring one of the central mechanisms
by which humans acquire new cognitive skills. To mitigate data contamination
and other confounding factors, and to allow precise control over task
complexity, we develop a synthetic framework for our investigation.
Specifically, we define a skill as the ability to infer the output of a string
transformation function f(x) given x. When an LLM has already learned f and g
prior to RL, our experiments reveal that RL enables it to learn unseen
compositions of them, h(x)=g(f(x)). Further, this compositional ability
generalizes to more difficult problems such as compositions of >2 functions
unseen during RL training. Surprisingly, our experiments show that
compositional skill acquired on a source task transfers to a different target
task. This transfer happens even without compositional training on the target,
requiring only prior knowledge of the target's atomic skills. Our qualitative
analysis shows that RL fundamentally changes the reasoning behaviors of the
models. In contrast, next-token training with the same data yields none of
these findings. Our systematic experiments provide fresh insights into LLM
learning, suggesting the value of first building base models with basic skills,
then using RL to incentivize advanced, generalizable skills for complex
problems.
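The notion of a skill as a string transformation, and of composition h(x)=g(f(x)), can be made concrete with a small sketch. The two transformations below (`reverse` and `rotate_left`) and the `compose` helper are hypothetical stand-ins chosen for illustration; the abstract does not list the functions used in the actual synthetic framework.

```python
from typing import Callable

StrFn = Callable[[str], str]


def reverse(x: str) -> str:
    """Hypothetical atomic skill f: reverse the string."""
    return x[::-1]


def rotate_left(x: str) -> str:
    """Hypothetical atomic skill g: move the first character to the end."""
    return x[1:] + x[:1]


def compose(*fns: StrFn) -> StrFn:
    """compose(g, f)(x) == g(f(x)); the rightmost function is applied first."""
    def composed(x: str) -> str:
        for fn in reversed(fns):
            x = fn(x)
        return x
    return composed


# h(x) = g(f(x)) with f = reverse and g = rotate_left.
h = compose(rotate_left, reverse)

# A deeper composition of three functions, analogous to the ">2 functions" case.
deeper = compose(reverse, rotate_left, reverse)
```

Predicting `h("abcd")` requires applying both atomic skills in the right order, and order matters: `compose(reverse, rotate_left)` gives a different result on the same input.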
♻ ☆ WebThinker: Empowering Large Reasoning Models with Deep Research Capability NeurIPS 2025
Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, Zhicheng Dou
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate
impressive long-horizon reasoning capabilities. However, their reliance on
static internal knowledge limits their performance on complex,
knowledge-intensive tasks and hinders their ability to produce comprehensive
research reports requiring synthesis of diverse web information. To address
this, we propose WebThinker, a deep research agent that empowers LRMs to
autonomously search the web, navigate among web pages, and draft reports during
the reasoning process. WebThinker integrates a Deep Web Explorer module,
enabling LRMs to dynamically search, navigate, and extract information from the
web when encountering knowledge gaps. It also employs an Autonomous
Think-Search-and-Draft strategy, allowing the model to seamlessly interleave
reasoning, information gathering, and report writing in real time. To further
enhance research tool utilization, we introduce an RL-based training strategy
via iterative online Direct Preference Optimization (DPO). Extensive
experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and
scientific report generation tasks (Glaive) demonstrate that WebThinker
significantly outperforms existing methods and strong proprietary systems. Our
approach enhances LRM reliability and applicability in complex scenarios,
paving the way for more capable and versatile deep research systems. The code
is available at https://github.com/RUC-NLPIR/WebThinker.
comment: Accepted by NeurIPS 2025
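For reference, the standard Direct Preference Optimization objective that the WebThinker abstract above builds on can be written as follows. This is the generic DPO loss, not the authors' exact iterative online variant, whose details the abstract does not give. Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference policy, $(x, y_w, y_l)$ a prompt with preferred and dispreferred responses, $\sigma$ the logistic function, and $\beta$ a temperature-like hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In an iterative online setting, one would typically regenerate the preference pairs in $\mathcal{D}$ from the current policy between rounds; how WebThinker constructs those pairs is not specified in the abstract.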
♻ ☆ Formalizing Style in Personal Narratives
Personal narratives are stories authors construct to make meaning of their
experiences. Style, the distinctive way authors use language to express
themselves, is fundamental to how these narratives convey subjective
experiences. Yet there is a lack of a formal framework for systematically
analyzing these stylistic choices. We present a novel approach that formalizes
style in personal narratives as patterns in the linguistic choices authors make
when communicating subjective experiences. Our framework integrates three
domains: functional linguistics establishes language as a system of meaningful
choices, computer science provides methods for automatically extracting and
analyzing sequential patterns, and these patterns are linked to psychological
observations. Using language models, we automatically extract linguistic
features such as processes, participants, and circumstances. We apply our
framework to hundreds of dream narratives, including a case study on a war
veteran with post-traumatic stress disorder. Analysis of his narratives
uncovers distinctive patterns, particularly how verbal processes dominate over
mental ones, illustrating the relationship between linguistic choices and
psychological states.
♻ ☆ The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech EMNLP 2025
We present the first large-scale computational study of political
delegitimization discourse (PDD), defined as symbolic attacks on the normative
validity of political entities. We curate and manually annotate a novel
Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches
(1993-2023), Facebook posts (2018-2021), and leading news outlets, of which
1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for
intensity, incivility, target type, and affective framing. We introduce a
two-stage classification pipeline combining finetuned encoder models and
decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary
PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization
characteristics. Applying this classifier to longitudinal and cross-platform
data, we see a marked rise in PDD over three decades, higher prevalence on
social media versus parliamentary debate, greater use by male than female
politicians, and stronger tendencies among right-leaning actors - with
pronounced spikes during election campaigns and major political events. Our
findings demonstrate the feasibility and value of automated PDD analysis for
understanding democratic discourse.
comment: EMNLP 2025
♻ ☆ Agentic large language models improve retrieval-based radiology question answering
Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Clinical decision-making in radiology increasingly benefits from artificial
intelligence (AI), particularly through large language models (LLMs). However,
traditional retrieval-augmented generation (RAG) systems for radiology question
answering (QA) typically rely on single-step retrieval, limiting their ability
to handle complex clinical reasoning tasks. Here we propose radiology Retrieval
and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to
improve diagnostic accuracy, factual consistency, and clinical reliability of
LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse
architectures, parameter scales (0.5B to >670B), and training paradigms
(general-purpose, reasoning-optimized, clinically fine-tuned), using 104
expert-curated radiology questions from previously established RSNA-RadioQA and
ExtendedQA datasets. To assess generalizability, we additionally tested on an
unseen internal dataset of 65 real-world radiology board examination questions.
RaR significantly improved mean diagnostic accuracy over zero-shot prompting
and conventional online RAG. The greatest gains occurred in small-scale models,
while very large models (>200B parameters) demonstrated minimal changes (<2%
improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%)
and retrieved clinically relevant context in 46% of cases, substantially aiding
factual grounding. Even clinically fine-tuned models showed gains from RaR
(e.g., MedGemma-27B), indicating that retrieval remains beneficial despite
embedded domain knowledge. These results highlight the potential of RaR to
enhance factuality and diagnostic accuracy in radiology QA, warranting future
studies to validate its clinical utility. All datasets, code, and the full
RaR framework are publicly available to support open research and clinical
translation.
♻ ☆ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
The progress of AI is bottlenecked by the quality of evaluation, making
powerful LLM-as-a-Judge models a core solution. The efficacy of these judges
depends on their chain-of-thought reasoning, creating a critical need for
methods that can effectively optimize this reasoning process. In this work, we
introduce J1, a reinforcement learning framework for teaching LLM judges to
think before making decisions. Our core contribution lies in converting all
judgment tasks for non-verifiable and verifiable prompts into a unified format
with verifiable rewards, enabling direct optimization of evaluation quality
while mitigating positional bias. We then use RL to train thinking-judges at
scales of 8B, 32B, and 70B and show that they obtain state-of-the-art
performance across multiple benchmarks. In particular, J1-Qwen-32B, our
multitasked pointwise and pairwise judge, also outperforms o1-mini, o3, and a
much larger 671B DeepSeek-R1 on some benchmarks, while only training on
synthetic data. Through comprehensive ablations of pairwise, pointwise, and
multitask J1 variants, we demonstrate the effectiveness of our approach across
seed prompts, reward strategies, and training recipes. Qualitative analysis
reveals that J1 develops systematic evaluation strategies, including dynamic
criteria generation, reference answer creation, iterative self-correction of
initial assessments, and feedback generation for low-quality responses.
comment: 10 pages, 13 tables, 14 figures
♻ ☆ LIDDIA: Language-based Intelligent Drug Discovery Agent EMNLP 2025
Drug discovery is a long, expensive, and complex process, relying heavily on
human medicinal chemists, who can spend years searching the vast space of
potential therapies. Recent advances in artificial intelligence for chemistry
have sought to expedite individual drug discovery tasks; however, there remains
a critical need for an intelligent agent that can navigate the drug discovery
process. Towards this end, we introduce LIDDIA, an autonomous agent capable of
intelligently navigating the drug discovery process in silico. By leveraging
the reasoning capabilities of large language models, LIDDIA serves as a
low-cost and highly-adaptable tool for autonomous drug discovery. We
comprehensively examine LIDDIA, demonstrating that (1) it can generate
molecules meeting key pharmaceutical criteria on over 70% of 30 clinically
relevant targets, (2) it intelligently balances exploration and exploitation in
the chemical space, and (3) it identifies one promising novel candidate on
AR/NR3C4, a critical target for both prostate and breast cancers. Code and
dataset are available at https://github.com/ninglab/LIDDiA
comment: EMNLP 2025 Main Conference
♻ ☆ When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
With the growing accessibility and wide adoption of large language models,
concerns about their safety and alignment with human values have become
paramount. In this paper, we identify a concerning phenomenon:
Reasoning-Induced Misalignment (RIM), in which misalignment emerges when
reasoning capabilities are strengthened, particularly when specific types of
reasoning patterns are introduced during inference or training. Beyond
reporting this vulnerability, we provide the first mechanistic account of its
origins. Through representation analysis, we discover that specific attention
heads facilitate refusal by reducing their attention to CoT tokens, a mechanism
that modulates the model's rationalization process during inference. During
training, we find significantly higher activation entanglement between
reasoning and safety in safety-critical neurons than in control neurons,
particularly after fine-tuning with those identified reasoning patterns. This
entanglement strongly correlates with catastrophic forgetting, providing a
neuron-level explanation for RIM.
♻ ☆ dInfer: An Efficient Inference Framework for Diffusion Language Models
Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
Diffusion-based large language models (dLLMs) have emerged as a promising
alternative to autoregressive (AR) LLMs, leveraging denoising-based generation
to enable inherent parallelism. Although more and more open-source dLLMs are
emerging, their widespread adoption remains constrained by the lack of a
standardized and efficient inference framework. We present dInfer, an efficient
and extensible framework for dLLM inference. dInfer decomposes the inference
pipeline into four modular components--model, diffusion iteration manager,
decoding strategy, and KV-cache manager--and integrates novel algorithms for
each component alongside system-level optimizations. Through this combination
of algorithmic innovations and system enhancements, dInfer achieves substantial
efficiency gains without compromising output quality on LLaDA-MoE. At batch
size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800
tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to
prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while
maintaining similar model performance. Even compared to the AR model (with a
comparable number of activation parameters and performance) QWen2.5-3B, which
is highly optimized with the latest vLLM inference engine, dInfer still
delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced
at https://github.com/inclusionAI/dInfer.
♻ ☆ Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets
Achieving both accuracy and diverse reasoning remains challenging for Large
Language Models (LLMs) in complex domains like mathematics. A key bottleneck is
evaluating intermediate reasoning steps to guide generation without costly
human annotations. To address this, we first introduce a novel Process Reward
Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a
similarity-based data augmentation technique, effectively capturing step-level
reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks
(GFlowNets) to operate at the reasoning step level. Unlike traditional
reinforcement learning focused on maximizing a single reward, GFlowNets
naturally sample diverse, high-quality solutions proportional to their rewards,
as measured by our PRM. Empirical evaluation shows strong improvements in both
accuracy and solution diversity on challenging mathematical benchmarks (e.g.,
+2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective
generalization to unseen datasets (+9.4% absolute on SAT MATH). Furthermore,
we benchmark our PRM against existing open-source reward models, demonstrating
superior alignment with reasoning quality and more consistent guidance for
downstream generation. Our work demonstrates the potential of PRM-guided,
step-level GFlowNets for developing more robust and versatile mathematical
reasoning in LLMs.
♻ ☆ LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering EMNLP 2025
The impact of Large Language Models (LLMs) has extended into literary
domains. However, existing evaluation metrics for literature prioritize
mechanical accuracy over artistic expression and tend to overrate machine
translation as being superior to human translation from experienced
professionals. In the long run, this bias could result in an irreversible
decline in translation quality and cultural authenticity. In response to the
urgent need for a specialized literary evaluation metric, we introduce
LITRANSPROQA, a novel, reference-free, LLM-based question-answering framework
designed for literary translation evaluation. LITRANSPROQA integrates humans in
the loop to incorporate insights from professional literary translators and
researchers, focusing on critical elements in literary quality assessment such
as literary devices, cultural understanding, and authorial voice. Our extensive
evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains,
LITRANSPROQA substantially outperforms current metrics, achieving up to 0.07
gain in correlation and surpassing the best state-of-the-art metrics by over 15
points in adequacy assessments. Incorporating professional translator insights
as weights further improves performance, highlighting the value of translator
inputs. Notably, LITRANSPROQA reaches an adequacy performance comparable to
trained linguistic student evaluators, though it still falls behind experienced
professional translators. LITRANSPROQA shows broad applicability to open-source
models like LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an
accessible and training-free tool for evaluating literary translations that
require local processing due to copyright or ethical considerations.
comment: Accepted as a main paper at EMNLP 2025. CR version
♻ ☆ References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation
Conversational query reformulation (CQR) has become indispensable for
improving retrieval in dialogue-based applications. However, existing
approaches typically rely on reference passages for optimization, which are
impractical to acquire in real-world scenarios. To address this limitation, we
introduce a novel reference-free preference optimization framework DualReform
that generates pseudo reference passages from commonly-encountered
conversational datasets containing only queries and responses. DualReform
attains this goal through two key innovations: (1) response-based inference,
where responses serve as proxies to infer pseudo reference passages, and (2)
response refinement via the dual-role of CQR, where a CQR model refines
responses based on the shared objectives between response refinement and CQR.
Despite not relying on reference passages, DualReform achieves 96.9--99.1% of
the retrieval accuracy attainable only with reference passages and surpasses
the state-of-the-art method by up to 31.6%.
♻ ☆ Test-Time Alignment for Large Language Models via Textual Model Predictive Control
Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Aligning Large Language Models (LLMs) with human preferences through
finetuning is resource-intensive, motivating lightweight alternatives at test
time. We address test-time alignment through the lens of sequential decision
making, a perspective that reveals two fundamental challenges. When actions are
defined at the token level, as in guided decoding, alignment suffers from the
curse of horizon. Conversely, when actions are at the response level, as in
traditional iterative refinement, the curse of dimensionality emerges. To
resolve this trade-off, we draw inspiration from Model Predictive Control (MPC)
in control theory to propose Textual Model Predictive Control (TMPC), a novel
predictive planning framework adapted for aligning LLMs at inference time. A
key limitation of standard MPC is its reliance on predefined, hard segment
boundaries, which are often absent in text generation. TMPC overcomes this by
introducing two principles inspired by hierarchical reinforcement learning: (1)
Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to
retrospectively identify high-reward intermediate outputs as subgoals. This
allows the framework to discover meaningful, task-specific planning steps
(e.g., a sentence in machine translation or a bug fix in code generation). (2)
Subgoal-Conditioned Re-Generation, where these identified subgoals are used to
guide subsequent planning iterations. By conditioning on these proven,
high-quality subgoals, TMPC ensures stable improvement by building upon
previously validated successes. TMPC is evaluated on three tasks with distinct
segmentation properties: discourse-level translation, long-form response
generation, and program synthesis. The results demonstrate that TMPC
consistently improves performance, highlighting its generality.
comment: Preprint. Code will be released at Plan2Align GitHub link:
https://github.com/NYCU-RL-Bandits-Lab/Plan2Align
♻ ☆ On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions
Layer normalization (LN) is an essential component of modern neural networks.
While many alternative techniques have been proposed, none of them have
succeeded in replacing LN so far. The latest suggestion in this line of
research is a dynamic activation function called Dynamic Tanh (DyT). Although
it is empirically well-motivated and appealing from a practical point of view,
it lacks a theoretical foundation. In this work, we shed light on the
mathematical relationship between LN and dynamic activation functions. In
particular, we derive DyT from the LN variant RMSNorm, and show that a
well-defined decoupling in derivative space as well as an approximation are
needed to do so. By applying the same decoupling procedure directly in function
space, we are able to omit the approximation and obtain the exact element-wise
counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit
(DyISRU). We demonstrate numerically that DyISRU reproduces the normalization
effect on outliers more accurately than DyT does.
comment: Revision and Simplification (starting point RMSNorm instead of LN)
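As a rough numerical illustration of the two operations the abstract relates, the sketch below applies RMSNorm and DyT to a vector containing one outlier. It assumes the commonly used forms RMSNorm(x)_i = x_i / sqrt(mean(x^2) + eps) and DyT(x)_i = tanh(alpha * x_i), with the learnable gain and shift omitted; the value of `alpha` and the function names are illustrative choices, and DyISRU is not reproduced because its formula is not given in the abstract.

```python
import math

def rms_norm(x, eps=1e-8):
    """RMSNorm without a learnable gain: divide by the root mean square of x."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in x]

def dyt(x, alpha=0.5):
    """Dynamic Tanh without gain/shift: element-wise tanh(alpha * x).

    alpha is learnable in practice; 0.5 is an arbitrary illustrative value.
    """
    return [math.tanh(alpha * v) for v in x]

# A vector with one large outlier in the last position.
x = [0.1, -0.2, 0.3, 10.0]
normed = rms_norm(x)   # depends on the whole vector through its RMS
squashed = dyt(x)      # acts on each element independently
```

RMSNorm couples every element to the statistics of the whole vector, whereas DyT is purely element-wise; both bound the outlier, but by different amounts.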
♻ ☆ Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
Xin Liu, Runsong Zhao, Pengcheng Huang, Xinyu Liu, Junyi Xiao, Chunyang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu
Context compression presents a promising approach for accelerating large
language model (LLM) inference by compressing long contexts into compact
representations. Current context compression methods predominantly rely on
autoencoding tasks to train context-agnostic compression tokens to compress
contextual semantics. While autoencoding tasks enable compression tokens to
acquire compression capabilities, compression via autoencoding tasks creates a
fundamental mismatch: the models are optimized for reconstruction, which diverges
from actual downstream tasks, thereby weakening the features more beneficial
for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel
method that shifts from autoencoding-task-based compression to an architecture
that is equipped with this compression capability a priori. Instead of
training models to compress contexts through autoencoding tasks, SAC directly
selects so-called anchor tokens from the original context and aggregates
contextual information into their key-value (KV) representations. By deriving
representations directly from the contextual tokens, SAC eliminates the need
for autoencoding training. To ensure compression performance while directly
leveraging anchor tokens, SAC incorporates two key designs: (1) anchor
embeddings that enable the compressor to identify critical tokens, and (2)
bidirectional attention modification that allows anchor tokens to capture
information from the entire context. Experimental results demonstrate that SAC
consistently outperforms existing context compression methods across various
compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves
1 EM improvement at 5x compression over strong baselines, with increasing
advantages at higher compression ratios.
comment: 18 pages, 9 figures
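One possible reading of the bidirectional attention modification is a causal mask in which the rows belonging to anchor tokens are opened up to the entire context. The sketch below builds such a boolean mask; the function name, the mask convention (True means "may attend"), and the way anchor positions are supplied are assumptions for illustration, not details taken from the paper.

```python
def build_anchor_mask(seq_len, anchor_positions):
    """Return a seq_len x seq_len boolean mask (True = query i may attend to key j).

    Ordinary tokens keep the causal pattern (j <= i); anchor tokens may attend
    to every position, including later ones.
    """
    anchors = set(anchor_positions)
    if any(p < 0 or p >= seq_len for p in anchors):
        raise ValueError("anchor positions must lie in [0, seq_len)")
    return [
        [(j <= i) or (i in anchors) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Four tokens, with the token at position 1 chosen as an anchor.
mask = build_anchor_mask(4, [1])
```

In this example row 1 is fully True, so the anchor can aggregate information from all four positions, while rows 0, 2, and 3 remain causal.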
♻ ☆ Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
Large Language Models (LLMs) have significantly advanced text generation
capabilities, including tasks like summarization, often producing coherent and
fluent outputs. However, faithfulness to source material remains a significant
challenge due to the generation of hallucinations. While extensive research
focuses on detecting and reducing these inaccuracies, less attention has been
paid to the positional distribution of hallucination within generated text,
particularly in long outputs. In this work, we investigate where hallucinations
occur in LLM-based long response generation, using long document summarization
as a key case study. Focusing on the challenging setting of long context-aware
long response generation, we find a consistent and concerning phenomenon:
hallucinations tend to concentrate disproportionately in the latter parts of
the generated long response. To understand this bias, we explore potential
contributing factors related to the dynamics of attention and decoding over
long sequences. Furthermore, we investigate methods to mitigate this positional
hallucination, aiming to improve faithfulness specifically in the concluding
segments of long outputs.
comment: 22 tables, 8 figures
♻ ☆ Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents EMNLP 2025
Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly
In this paper, we introduce Spotlight, a novel paradigm for information
extraction that produces concise, engaging narratives by highlighting the most
compelling aspects of a document. Unlike traditional summaries, which
prioritize comprehensive coverage, spotlights selectively emphasize intriguing
content to foster deeper reader engagement with the source material. We
formally differentiate spotlights from related constructs and support our
analysis with a detailed benchmarking study using new datasets curated for this
work. To generate high-quality spotlights, we propose a two-stage approach:
fine-tuning a large language model on our benchmark data, followed by alignment
via Direct Preference Optimization (DPO). Our comprehensive evaluation
demonstrates that the resulting model not only identifies key elements with
precision but also enhances readability and boosts the engagement value of the
original document.
comment: Paper accepted in EMNLP 2025 Main Conference (Full Paper)
♻ ☆ VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Aligning Vision-Language Models (VLMs) with safety standards is essential to
mitigate risks arising from their multimodal complexity, where integrating
vision and language unveils subtle threats beyond the reach of conventional
safeguards. Inspired by the insight that reasoning across modalities is key to
preempting intricate vulnerabilities, we propose a novel direction for VLM
safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce
VLMGuard-R1, a proactive framework that refines user inputs through a
reasoning-guided rewriter, dynamically interpreting text-image interactions to
deliver refined prompts that bolster safety across diverse VLM architectures
without altering their core parameters. To achieve this, we devise a
three-stage reasoning pipeline to synthesize a dataset that trains the rewriter
to infer subtle threats, enabling tailored, actionable responses over generic
refusals. Extensive experiments across three benchmarks with five VLMs reveal
that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1
achieves a remarkable 43.59% increase in average safety across five models on
the SIUO benchmark.
♻ ☆ QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory
Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng
Generative LLMs have achieved remarkable success in various industrial
applications, owing to their promising In-Context Learning capabilities.
However, the issue of long context in complex tasks poses a significant barrier
to their wider adoption, manifested in two main aspects: (i) The excessively
long context leads to high costs and inference delays. (ii) A substantial
amount of task-irrelevant information introduced by long contexts exacerbates
the "lost in the middle" problem. Existing methods compress context by removing
redundant tokens using metrics such as self-information or PPL, which is
inconsistent with the objective of retaining the most important tokens when
conditioning on a given query. In this study, we introduce information
bottleneck theory (IB) to model the problem, offering a novel perspective that
thoroughly addresses the essential properties required for context compression.
Additionally, we propose a cross-attention-based approach to approximate mutual
information in IB, which can be flexibly replaced with suitable alternatives in
different scenarios. Extensive experiments on four datasets demonstrate that
our method achieves a 25% increase in compression rate compared to the
state-of-the-art, while maintaining question answering performance. In
particular, the context compressed by our method even outperforms the full
context in some cases.
♻ ☆ Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
Scientific diagrams are vital tools for communicating structured knowledge
across disciplines. However, they are often published as static raster images,
losing symbolic semantics and limiting reuse. While Multimodal Large Language
Models (MLLMs) offer a pathway to bridging vision and structure, existing
methods lack semantic control and structural interpretability, especially on
complex diagrams. We propose Draw with Thought (DwT), a training-free framework
that guides MLLMs to reconstruct diagrams into editable mxGraph XML code
through cognitively-grounded Chain-of-Thought reasoning. DwT enables
interpretable and controllable outputs without model fine-tuning by dividing
the task into two stages: Coarse-to-Fine Planning, which handles perceptual
structuring and semantic specification, and Structure-Aware Code Generation,
enhanced by format-guided refinement. To support evaluation, we release
Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard
XML annotations. Extensive experiments across eight MLLMs show that our
approach yields high-fidelity, semantically aligned, and structurally valid
reconstructions, with human evaluations confirming strong alignment in both
accuracy and visual aesthetics, offering a scalable solution for converting
static visuals into executable representations and advancing machine
understanding of scientific graphics.
comment: 10 pages, 5 figures, accepted to appear in the Proceedings of the
33rd ACM International Conference on Multimedia (MM '25)
♻ ☆ TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao
Reinforcement learning (RL) has emerged as an effective paradigm for
enhancing model reasoning. However, existing RL methods like GRPO often rely on
unstructured self-sampling to fit scalar rewards, producing inefficient
rollouts that fail to capture transferable problem-solving strategies. To
address these limitations, we propose TemplateRL, a structured
template-guided RL framework that augments policy optimization with explicit
template guidance. Our approach first constructs a problem-solving template
library via MCTS on a small seed set, then seamlessly integrates this
high-level structured guidance into RL training. By guiding rollout generation
to align with proven template structures, TemplateRL significantly improves
high-quality trajectory hit rates while reducing ineffective exploration. This
structure-guided design steers the policy toward validated strategic patterns,
stabilizing training dynamics and enhancing RL sampling efficiency. Notably,
the explicit template library is interpretable, editable, and supports online
updates, enabling continuous updates during both training and inference.
Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on
AIME and 41% on AMC, with superior stability on weak models and remarkable
cross-domain generalization, highlighting its potential for broader tasks.
♻ ☆ Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
We propose FSPO (Fair Sequence Policy Optimization), a sequence-level
reinforcement learning method for LLMs that enforces length-fair clipping on
the importance-sampling (IS) weight. We study RL methods with sequence-level IS
and identify a mismatch when PPO/GRPO-style clipping is transplanted to
sequences: a fixed clip range systematically reweights short vs. long
responses, distorting the optimization direction. FSPO introduces a simple
remedy: we clip the sequence log-IS ratio with a band that scales as
$\sqrt{L}$. Theoretically, we formalize length fairness via a Length
Reweighting Error (LRE) and prove that small LRE yields a cosine directional
guarantee between the clipped and true updates. Empirically, FSPO flattens clip
rates across length bins, stabilizes training, and outperforms baselines across
model sizes and evaluation datasets, with the largest gains on the
Qwen3-8B-Base model.
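The length-fair clipping idea can be sketched as follows, assuming a symmetric band of half-width `base_eps * sqrt(L)` around zero in log space. The constant `base_eps`, the symmetric form, and the function names are illustrative assumptions; the abstract states only that the band scales as $\sqrt{L}$.

```python
import math

def clip_log_ratio_fixed(log_ratio, eps=0.2):
    """Baseline: clip the sequence log-IS ratio to a fixed band [-eps, eps]."""
    return max(-eps, min(eps, log_ratio))

def clip_log_ratio_length_fair(log_ratio, length, base_eps=0.2):
    """Clip the sequence log-IS ratio to a band whose half-width grows as sqrt(length).

    base_eps is a hypothetical per-sequence constant, not a value from the paper.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    band = base_eps * math.sqrt(length)
    return max(-band, min(band, log_ratio))

# The same log-ratio of 0.5 on a short (L=4) and a long (L=100) response.
short_clipped = clip_log_ratio_length_fair(0.5, length=4)    # band half-width 0.4
long_clipped = clip_log_ratio_length_fair(0.5, length=100)   # band half-width 2.0
```

With a fixed band both responses would be clipped to 0.2 regardless of length; with the length-scaled band only the short response is clipped here, since a sequence-level log-ratio is a sum of L token-level terms and is expected to spread out as the response grows.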
♻ ☆ SaraCoder: Orchestrating Semantic and Structural Cues for Resource-Optimized Repository-Level Code Completion
Xiaohan Chen, Zhongying Pan, Quan Feng, Yu Tian, Shuqun Yang, Mengru Wang, Lina Gong, Yuxia Geng, Piji Li, Xiang Chen
Despite Retrieval-Augmented Generation improving code completion, traditional
retrieval methods struggle with information redundancy and a lack of diversity
within limited context windows. To solve this, we propose a resource-optimized
retrieval augmentation method, SaraCoder. It maximizes information diversity
and representativeness in a limited context window, significantly boosting the
accuracy and reliability of repository-level code completion. Its core
Hierarchical Feature Optimization module systematically refines candidates by
distilling deep semantic relationships, pruning exact duplicates, assessing
structural similarity with a novel graph-based metric that weighs edits by
their topological importance, and reranking results to maximize both relevance
and diversity. Furthermore, an External-Aware Identifier Disambiguator module
accurately resolves cross-file symbol ambiguity via dependency analysis.
Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated
benchmarks demonstrate that SaraCoder outperforms existing baselines across
multiple programming languages and models. Our work proves that systematically
refining retrieval results across multiple dimensions provides a new paradigm
for building more accurate and resource-optimized repository-level code
completion systems.
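The "prune exact duplicates, then rerank for relevance and diversity" steps can be illustrated with a generic stand-in. The sketch below uses Maximal Marginal Relevance with token-level Jaccard similarity, which is a swapped-in technique and not the paper's graph-based structural metric; the function names, the weight `lam`, and the example snippets and scores are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity over whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedupe_and_rerank(candidates, k, lam=0.7):
    """candidates: list of (snippet, relevance). Returns up to k snippets.

    Step 1 drops exact duplicates, keeping the highest relevance seen.
    Step 2 greedily picks the item maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items).
    """
    best_rel = {}
    for snippet, rel in candidates:
        if snippet not in best_rel or rel > best_rel[snippet]:
            best_rel[snippet] = rel
    pool = list(best_rel.items())
    selected = []
    while pool and len(selected) < k:
        def mmr(item):
            snippet, rel = item
            redundancy = max((jaccard(snippet, s) for s, _ in selected), default=0.0)
            return lam * rel - (1.0 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return [snippet for snippet, _ in selected]

retrieved = [
    ("def add(a, b): return a + b", 0.90),
    ("def add(a, b): return a + b", 0.90),        # exact duplicate
    ("def add(a, b): return a + b # sum", 0.85),  # near duplicate
    ("class Stack: pass", 0.60),
]
context = dedupe_and_rerank(retrieved, k=2)
```

Under these scores the near-duplicate is passed over in favor of the less relevant but more diverse `Stack` snippet, which is the behavior a limited context window benefits from.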
♻ ☆ Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang
The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often
limited by the use of randomly or manually selected examples. These examples
fail to account for both model-specific knowledge distributions and the
intrinsic complexity of the tasks, resulting in suboptimal and unstable model
performance. To address this, we propose a novel framework inspired by the
pedagogical principle of "tailored teaching with balanced difficulty". We
reframe prompt selection as a prompt curriculum design problem: constructing a
well-ordered set of training examples that align with the model's current
capabilities. Our approach integrates two complementary signals: (1)
model-perceived difficulty, quantified through prediction disagreement in an
active learning setup, capturing what the model itself finds challenging; and
(2) intrinsic sample complexity, which measures the inherent difficulty of each
question-image pair independently of any model. By jointly analyzing these
signals, we develop a difficulty-balanced sampling strategy that ensures the
selected prompt examples are diverse across both dimensions. Extensive
experiments conducted on five challenging benchmarks and multiple popular
Multimodal Large Language Models (MLLMs) demonstrate that our method yields
substantial and consistent improvements and greatly reduces performance
discrepancies caused by random sampling, providing a principled and robust
approach for enhancing multimodal reasoning.
♻ ☆ LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin
Large Language Models (LLMs) demonstrate their reasoning ability through
chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may
limit the ability to revisit and refine earlier tokens in a holistic manner,
which can also lead to inefficient exploration for diverse solutions. In this
paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning
framework that unifies the expressiveness of continuous latent representation
with the iterative refinement capabilities of latent diffusion models for an
existing LLM. We first construct a structured latent reasoning space using a
Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of
thought tokens, preserving semantic information and interpretability while
offering compact but expressive representations. Subsequently, we utilize a
latent diffusion model that learns to denoise a block of latent thought tokens
with a blockwise bidirectional attention mask, enabling longer horizon and
iterative refinement with adaptive test-time compute. This design allows
efficient parallel generation of diverse reasoning trajectories, allowing the
model to plan and revise the reasoning process holistically. We conduct
evaluations on a suite of mathematical reasoning and planning benchmarks.
Empirical results show that LaDiR consistently improves accuracy, diversity,
and interpretability over existing autoregressive, diffusion-based, and latent
reasoning methods, revealing a new paradigm for text reasoning with latent
diffusion.
♻ ☆ Personality Editing for Language Models through Adjusting Self-Referential Queries
Large Language Models (LLMs) are integral to applications such as
conversational agents and content creation, where precise control over a
model's personality is essential for maintaining tone, consistency, and user
engagement. However, prevailing prompt-based or fine-tuning approaches either
lack robustness or demand large-scale training data, making them costly and
impractical. In this paper, we present PALETTE (Personality Adjustment by LLM
SElf-TargeTed quEries), a novel method for personality editing in LLMs. Our
approach introduces adjustment queries, where self-referential statements
grounded in psychological constructs are treated analogously to factual
knowledge, enabling direct editing of personality-related responses. Unlike
fine-tuning, PALETTE requires only 12 editing samples to achieve substantial
improvements in personality alignment across personality dimensions.
Experimental results from both automatic and human evaluations demonstrate that
our method enables more stable and well-balanced personality control in LLMs.
comment: 22 pages, 5 figures, 26 tables
♻ ☆ Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation AACL 2025
Language models have demonstrated remarkable performance on complex
multi-step reasoning tasks. However, their evaluation has been predominantly
confined to high-resource languages such as English. In this paper, we
introduce a manually translated Bangla multi-step reasoning dataset derived
from the English Reveal dataset, featuring both binary and non-binary question
types. We conduct a controlled evaluation of English-centric and Bangla-centric
multilingual small language models on the original dataset and our translated
version to compare their ability to exploit relevant reasoning steps to produce
correct answers. Our results show that, in comparable settings, reasoning
context is beneficial for more challenging non-binary questions, but models
struggle to employ relevant Bangla reasoning steps effectively. We conclude by
exploring how reasoning steps contribute to models' predictions, highlighting
different trends across models and languages.
comment: Submitted to BLP workshop @ IJCNLP-AACL 2025
♻ ☆ PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs EMNLP 2025
Vocabulary acquisition poses a significant challenge for second-language (L2)
learners, especially when learning typologically distant languages such as
English and Korean, where phonological and structural mismatches complicate
vocabulary learning. Recently, large language models (LLMs) have been used to
generate keyword mnemonics by leveraging similar keywords from a learner's
first language (L1) to aid in acquiring L2 vocabulary. However, most methods
still rely on direct IPA-based phonetic matching or employ LLMs without
phonological guidance. In this paper, we present PhoniTale, a novel
cross-lingual mnemonic generation system that performs IPA-based phonological
adaptation and syllable-aware alignment to retrieve an L1 keyword sequence and
uses LLMs to generate verbal cues. We evaluate PhoniTale through automated
metrics and a short-term recall test with human participants, comparing its
output to human-written and prior automated mnemonics. Our findings show that
PhoniTale consistently outperforms previous automated approaches and achieves
quality comparable to human-written mnemonics.
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning ICCV 2025
Human infants rapidly develop visual reasoning skills from minimal input,
suggesting that developmentally inspired pretraining could significantly
enhance the efficiency of vision-language models (VLMs). Although recent
efforts have leveraged infant-inspired datasets like SAYCam, existing
evaluation benchmarks remain misaligned--they are either too simplistic,
narrowly scoped, or tailored for large-scale pretrained models. Additionally,
training exclusively on infant data overlooks the broader, diverse input from
which infants naturally learn. To address these limitations, we propose
BabyVLM, a novel framework comprising comprehensive in-domain evaluation
benchmarks and a synthetic training dataset created via child-directed
transformations of existing datasets. We demonstrate that VLMs trained with our
synthetic dataset achieve superior performance on BabyVLM tasks compared to
models trained solely on SAYCam or on general-purpose data of the same size as SAYCam.
BabyVLM thus provides a robust, developmentally aligned evaluation tool and
illustrates how compact models trained on carefully curated data can generalize
effectively, opening pathways toward data-efficient vision-language learning
paradigms.
comment: Accepted to ICCV 2025
♻ ☆ DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework
While cognitive diagnosis (CD) effectively assesses students' knowledge
mastery from structured test data, applying it to real-world teacher-student
dialogues presents two fundamental challenges. Traditional CD models lack a
suitable framework for handling dynamic, unstructured dialogues, and it's
difficult to accurately extract diagnostic semantics from lengthy dialogues. To
overcome these hurdles, we propose DiaCDM, an innovative model. We've adapted
the initiation-response-evaluation (IRE) framework from educational theory to
design a diagnostic framework tailored for dialogue. We also developed a unique
graph-based encoding method that integrates teacher questions with relevant
knowledge components to capture key information more precisely. To our
knowledge, this is the first exploration of cognitive diagnosis in a dialogue
setting. Experiments on three real-world dialogue datasets confirm that DiaCDM
not only significantly improves diagnostic accuracy but also enhances the
results' interpretability, providing teachers with a powerful tool for
assessing students' cognitive states. The code is available at
https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.
♻ ☆ DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie
Large language models (LLMs) have substantially advanced machine translation
(MT), yet their effectiveness in translating web novels remains unclear.
Existing benchmarks rely on surface-level metrics that fail to capture the
distinctive traits of this genre. To address these gaps, we introduce DITING,
the first comprehensive evaluation framework for web novel translation,
assessing narrative and cultural fidelity across six dimensions: idiom
translation, lexical ambiguity, terminology localization, tense consistency,
zero-pronoun resolution, and cultural safety, supported by over 18K
expert-annotated Chinese-English sentence pairs. We further propose AgentEval,
a reasoning-driven multi-agent evaluation framework that simulates expert
deliberation to assess translation quality beyond lexical overlap, achieving
the highest correlation with human judgments among seven tested automatic
metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation
dataset of 300 sentence pairs annotated with error labels and scalar quality
scores. Comprehensive evaluation of fourteen open, closed, and commercial
models reveals that Chinese-trained LLMs surpass larger foreign counterparts,
and that DeepSeek-V3 delivers the most faithful and stylistically coherent
translations. Our work establishes a new paradigm for exploring LLM-based web
novel translation and provides public resources to advance future research.
♻ ☆ Training and Evaluating with Human Label Variation: An Empirical Study
Human label variation (HLV) challenges the standard assumption that a
labelled instance has a single ground truth, instead embracing the natural
variation in human annotation to train and evaluate models. While various
training methods and metrics for HLV have been proposed, it is still unclear
which methods and metrics perform best in what settings. We propose new
evaluation metrics for HLV leveraging fuzzy set theory. Since these newly
proposed metrics are differentiable, we in turn experiment with employing
these metrics as training objectives. We conduct an extensive study over 6 HLV
datasets testing 14 training methods and 6 evaluation metrics. We find that
training on either disaggregated annotations or soft labels performs best
across metrics, outperforming training using the proposed training objectives
with differentiable metrics. We also show that our proposed soft micro F1 score
is one of the best metrics for HLV data.
comment: 27 pages, 7 figures. Accepted to CL. Pre-MIT Press publication
version. Fixed PO-JSD values on the MFRC dataset. Completely redid the
empirical meta-evaluation, added more related work, and other minor edits
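A minimal sketch of what a fuzzy-set-based soft micro F1 could look like. The abstract does not give the definition, so the choices here are assumptions: soft labels are treated as fuzzy membership degrees, the fuzzy intersection is the element-wise minimum, and counts are pooled over all instances and classes. The function name soft_micro_f1 is hypothetical.

```python
def soft_micro_f1(predictions, targets):
    """Micro-averaged F1 over soft labels, using min as the fuzzy intersection.

    predictions, targets: lists of per-instance lists of class memberships
    in [0, 1]. This is an illustrative definition, not the paper's.
    """
    tp = fp = fn = 0.0
    for pred, target in zip(predictions, targets):
        for p, t in zip(pred, target):
            tp += min(p, t)          # soft true positive mass
            fp += max(p - t, 0.0)    # predicted mass not supported by annotators
            fn += max(t - p, 0.0)    # annotator mass the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

Under these assumptions the score is 1 when the predicted and annotated distributions coincide and 0 when they share no mass.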
♻ ☆ Chronological Passage Assembling in RAG framework for Temporal Question Answering
Long-context question answering over narrative tasks is challenging because
correct answers often hinge on reconstructing a coherent timeline of events
while preserving contextual flow in a limited context window.
Retrieval-augmented generation (RAG) methods aim to address this challenge by
selectively retrieving only necessary document segments. However, narrative
texts possess unique characteristics that limit the effectiveness of these
existing approaches. Specifically, understanding narrative texts requires more
than isolated segments, as the broader context and sequential relationships
between segments are crucial for comprehension. To address these limitations,
we propose ChronoRAG, a novel RAG framework specialized for narrative texts.
This approach focuses on two essential aspects: refining dispersed document
information into coherent and structured passages and preserving narrative flow
by explicitly capturing and maintaining the temporal order among retrieved
passages. We empirically demonstrate the effectiveness of ChronoRAG through
experiments on the NarrativeQA and GutenQA datasets, showing substantial
improvements in tasks requiring both factual identification and comprehension
of complex sequential relationships, underscoring that reasoning over temporal
order is crucial in resolving narrative QA.
comment: 15 pages, 4 figures
♻ ☆ Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification
Large language models (LLMs) increasingly generate natural language
rationales to enhance interpretability, but these often contain logical errors,
label mismatches, and domain-specific misalignments. Directly using such
rationales as supervision risks propagating noise and undermining training
stability. To address this challenge, we introduce Self-Filtered Distillation,
a framework specifically tailored for patent classification, which treats
LLM-generated rationales as trust signals rather than ground-truth supervision.
The framework employs selective distillation guided by three unsupervised trust
metrics: (1) Self-Consistency, which measures the stability of LLM-generated
rationales across multiple generations; (2) Class Entailment Alignment, which
assesses semantic coherence with patent-specific class definitions; and (3) LLM
Agreement Scoring, which validates rationale-label plausibility. These metrics
are integrated into a unified trust score that primarily weights training
samples while optionally filtering out extremely low-trust cases, enabling
reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used
benchmark for patent classification, show that our method outperforms
label-based learning and conventional distillation in accuracy, stability, and
interpretability, establishing a reliable paradigm for leveraging
reasoning-aware trust indicators in patent analytics.
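A minimal sketch of how three trust metrics might be folded into one score that weights samples and optionally drops very low-trust ones. The equal weights, the 0.2 cutoff, and the names trust_score and sample_weight are assumptions for illustration; the abstract does not specify the combination rule.

```python
def trust_score(self_consistency, entailment, agreement,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three metrics in [0, 1] into one score (assumed weighted mean)."""
    parts = (self_consistency, entailment, agreement)
    return sum(w * p for w, p in zip(weights, parts))


def sample_weight(score, min_trust=0.2):
    """Use the trust score as a loss weight; drop samples below the cutoff."""
    return 0.0 if score < min_trust else score


# Hypothetical metric values for one training sample.
score = trust_score(0.9, 0.6, 0.3)
weight = sample_weight(score)
```

In a training loop, each sample's loss would be multiplied by its weight, so low-trust rationales contribute less and filtered ones contribute nothing.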
♻ ☆ Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Large language models (LLMs) are expected to be helpful, harmless, and
honest. In different alignment scenarios, such as safety, confidence, and
general preference alignment, binary preference data collection and reward
modeling are resource-intensive but play a central role in transferring human
preferences. In this work, we explore using the similarity between sampled
generations and reference answers as a supplementary reward function for
alignment. When unary reference answers are available, such similarity-based
rewards can circumvent the need for binary preference data and explicit reward
modeling. We introduce RefAlign, a versatile REINFORCE-style alignment
algorithm that does not rely on reward or reference models. RefAlign utilizes
language generation evaluation metrics, such as BERTScore, between sampled
generations and reference answers as surrogate rewards. Beyond general
preference optimization, RefAlign can be naturally extended to diverse
scenarios, including safety and confidence alignment, by combining
similarity-based rewards with task-specific objectives. Across multiple
scenarios, RefAlign achieves performance comparable to prior alignment methods
while operating without binary preference data or reward models. The code is
available at https://github.com/mzhaoshuai/RefAlign.
comment: The code is at https://github.com/mzhaoshuai/RefAlign
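A minimal sketch of similarity-to-reference rewards feeding a REINFORCE-style update. Token-overlap F1 stands in for BERTScore so the example has no dependencies, and the mean-reward baseline is an assumption; neither is taken from the RefAlign code. The function names are hypothetical.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1, a simple stand-in for a metric such as BERTScore."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reinforce_weights(samples, reference):
    """Reward each sample by similarity, then subtract the mean as a baseline."""
    rewards = [token_f1(s, reference) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Hypothetical sampled generations and a single (unary) reference answer.
samples = ["the cat sat on the mat", "dogs bark loudly"]
weights = reinforce_weights(samples, "the cat sat on the mat")
```

Each weight would multiply the log-probability of its sample in the policy-gradient loss, so generations closer to the reference are reinforced and the others are pushed down.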
♻ ☆ Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation NeurIPS 2025
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs)
combined with external contexts to enhance the accuracy and reliability of
generated responses. However, reliably attributing generated content to
specific context segments, context attribution, remains challenging due to the
computationally intensive nature of current methods, which often require
extensive fine-tuning or human annotation. In this work, we introduce a novel
Jensen-Shannon Divergence driven method to Attribute Response to Context
(ARC-JSD), enabling efficient and accurate identification of essential context
sentences without additional fine-tuning, gradient-calculation or surrogate
modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA,
Hotpot QA, and Musique, using instruction-tuned LLMs at different scales
demonstrate superior accuracy and significant computational efficiency
improvements compared to the previous surrogate-based method. Furthermore, our
mechanistic analysis reveals specific attention heads and multilayer perceptron
(MLP) layers responsible for context attribution, providing valuable insights
into the internal workings of RAG models and how they affect RAG behaviours.
Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
comment: Best Paper Award at COLM 2025 XLLM-Reason-Plan Workshop; Accepted at
NeurIPS 2025 Mechanistic Interpretability Workshop
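A minimal sketch of a JSD-based attribution score. It assumes, without confirmation from the abstract, that each context sentence is scored by the Jensen-Shannon divergence between the model's output distribution with the full context and with that sentence removed. The distributions are hypothetical toy values, not model outputs.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical next-token distributions over a 3-token vocabulary.
full_context = [0.7, 0.2, 0.1]
ablated = {
    "sentence_0": [0.68, 0.21, 0.11],  # removal barely changes the output
    "sentence_1": [0.2, 0.5, 0.3],     # removal changes the output a lot
}
scores = {name: jsd(full_context, dist) for name, dist in ablated.items()}
most_influential = max(scores, key=scores.get)
```

The sentence whose removal shifts the distribution most gets the highest score. Only forward passes are needed, which is consistent with the abstract's claim of avoiding fine-tuning and gradient calculation.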
♻ ☆ Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm EMNLP2025
Selecting high-quality and diverse training samples from extensive datasets
plays a crucial role in reducing training overhead and enhancing the
performance of Large Language Models (LLMs). However, existing studies fall
short in assessing the overall value of selected data, focusing primarily on
individual quality, and struggle to strike an effective balance between
ensuring diversity and minimizing data point traversals. Therefore, this paper
introduces a novel choice-based sample selection framework that shifts the
focus from evaluating individual sample quality to comparing the contribution
value of different samples when incorporated into the subset. Thanks to the
advanced language understanding capabilities of LLMs, we utilize LLMs to
evaluate the value of each option during the selection process. Furthermore, we
design a greedy sampling process where samples are incrementally added to the
subset, thereby improving efficiency by eliminating the need for exhaustive
traversal of the entire dataset under a limited budget. Extensive experiments
demonstrate that data selected by our method not only surpasses the
performance of the full dataset but also achieves competitive results with
strong recent methods, while requiring fewer selections. Moreover, we
validate our approach on a larger medical dataset, highlighting its practical
applicability in real-world applications. Our code and data are available at
https://github.com/BIRlz/comperative_sample_selection.
comment: EMNLP2025
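A minimal sketch of choice-based greedy selection: at each step a few candidates are offered and the one judged most valuable, given the current subset, is added. In the paper the judge is an LLM; here choose is a hypothetical toy function, and offering the first few remaining samples as options is an assumption.

```python
def greedy_select(pool, budget, choose, num_options=3):
    """Grow a subset one sample at a time, letting `choose` compare options."""
    selected = []
    remaining = list(pool)
    while remaining and len(selected) < budget:
        options = remaining[:num_options]  # assumed option-sampling rule
        pick = choose(selected, options)
        selected.append(pick)
        remaining.remove(pick)
    return selected


def prefer_new_initial(selected, options):
    """Toy chooser: prefer an option whose first letter is not yet covered."""
    seen = {s[0] for s in selected}
    for option in options:
        if option[0] not in seen:
            return option
    return options[0]


pool = ["apple", "avocado", "banana", "blueberry", "cherry"]
subset = greedy_select(pool, budget=3, choose=prefer_new_initial)
```

The loop makes one choice call per added sample, so the number of comparisons scales with the budget rather than with the size of the dataset.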
♻ ☆ DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Enhancing LLMs with the ability to actively search external knowledge is
crucial for complex and real-world tasks. Current approaches either rely on
prompting to elicit the model's innate agent capabilities, or suffer from
performance ceilings and collapse when applying RL to complex interactive
tasks, leaving their true agentic potential untapped. To address this, we
introduce Dynamic-filter Sequence-level Policy
Optimization (DSPO), an improved RL algorithm designed for robust
agent training through sequence-level optimization and dynamic sample
filtering. We train our model purely through RL to interleave multi-turn search
and reasoning, obviating the need for supervised demonstration data. Across
multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable
previous work by 34.1%, and even outperforms the 14B model from
previous work in complex multihop QA such as HotpotQA by nearly 9%
relative, maintaining exceptional training stability.
♻ ☆ Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents
The advancement of LLMs and their accessibility have triggered renewed
interest in multi-agent reinforcement learning as robust and adaptive
frameworks for dynamically changing environments. This paper introduces
RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First,
we develop the Decider RL-agent, which learns to dynamically select an ensemble
of small size ($m_i$) among $N$ LLMs ($m_i \ll N$) for incoming queries from a
user-defined downstream task $i$, by maximizing both error-diversity and
reasoning-performance of the selected ensemble through iterative updates of
task-adaptive rewards and policy. Second, to enable effective fusion of
dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns
to resolve reasoning conflicts from different LLMs and dynamically adapts to
different ensemble teams composed by the Decider Agent for different downstream
tasks. Third, we introduce the focal diversity metric to better model the error
correlations among multiple LLMs, further improving the generalization
performance of the Decider Agent, which actively prunes the ensemble
combinations. By focal diversity, we enhance performance across tasks by
effectively promoting reward-aware and policy-adaptive ensemble selection and
inference fusion. Extensive evaluations on five benchmarks show that RL-Focal
achieves a performance improvement of 8.48% with an ensemble of small size
compared to the best individual LLM in a pool and offers stronger robustness.
Code is available at https://github.com/sftekin/rl-focal
♻ ☆ ARM: Adaptive Reasoning Model NeurIPS 2025
While large reasoning models demonstrate strong performance on complex tasks,
they lack the ability to adjust reasoning token usage based on task difficulty.
This often leads to the "overthinking" problem -- excessive and unnecessary
reasoning -- which, although potentially mitigated by human intervention to
control the token budget, still fundamentally contradicts the goal of achieving
fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a
reasoning model capable of adaptively selecting appropriate reasoning formats
based on the task at hand. These formats include three efficient ones -- Direct
Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To
train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy
Optimization (GRPO), which addresses the format collapse issue in traditional
GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by
an average of 30%, and up to 70%, while maintaining performance comparable to
the model that relies solely on Long CoT. Furthermore, not only does it improve
inference efficiency through reduced token generation, but it also brings a 2x
speedup in training. In addition to the default Adaptive Mode, ARM supports two
additional reasoning modes: 1) Instruction-Guided Mode, which allows users to
explicitly specify the reasoning format via special tokens -- ideal when the
appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode,
which aggregates the outputs of the three efficient formats and resorts to Long
CoT in case of disagreement, prioritizing performance with higher token usage.
comment: NeurIPS 2025 (Spotlight)
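A minimal sketch of the Consensus-Guided Mode described above. It assumes that "disagreement" means the three efficient formats do not all return the same answer; the abstract does not say whether a majority would suffice. The function name and the callable fallback are hypothetical.

```python
def consensus_guided(efficient_answers, long_cot_fn):
    """Return the shared answer if all efficient formats agree.

    efficient_answers: answers from Direct Answer, Short CoT, and Code.
    long_cot_fn: zero-argument callable that runs the costlier Long CoT,
    invoked only when the efficient formats disagree.
    """
    if len(set(efficient_answers)) == 1:
        return efficient_answers[0]
    return long_cot_fn()


# Hypothetical answers; the lambda stands in for a Long CoT call.
agreed = consensus_guided(["4", "4", "4"], lambda: "5")
fallback = consensus_guided(["4", "4", "3"], lambda: "5")
```

Passing the fallback as a callable means the expensive format is only run when needed, which matches the trade-off the abstract describes: better performance at the cost of higher token usage.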
♻ ☆ Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, Xingyue Huang, Yuan Sui, Zhangdie Yuan, Yuqicheng Zhu, Xianglong Hu, Yuan He, Farimah Poursafaei, Michael Bronstein, Andreas Vlachos
Forecasting future links is a central task in temporal graph (TG) reasoning,
requiring models to leverage historical interactions to predict upcoming ones.
Traditional neural approaches, such as temporal graph neural networks, achieve
strong performance but lack explainability and cannot be applied to unseen
graphs without retraining. Recent studies have begun to explore using large
language models (LLMs) for graph reasoning, but most of them are constrained to
static graphs or small synthetic TGs and lack the evaluation of the quality of
reasoning traces generated by LLMs. In this work, we present Reasoning-Enhanced
Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that
fine-tunes LLMs to perform explainable link forecasting on real-world TGs.
ReaL-TG uses outcome-based reward to encourage models to self-explore reasoning
strategies from graph structure and to produce explanations that directly
justify their predictions. To enable evaluation on LLM-generated reasoning
traces, we propose a new evaluation protocol combining ranking metrics with an
LLM-as-a-Judge system that assesses both the quality of reasoning and the
impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning
Qwen3-4B under our framework, show that it outperforms much larger frontier
LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality
explanations confirmed by both the LLM judge and human evaluation.
♻ ☆ Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models ICCV 2025
Multimodal large language models (MLLMs) hold considerable promise for
applications in healthcare. However, their deployment in safety-critical
settings is hindered by two key limitations: (i) sensitivity to prompt design,
and (ii) a tendency to generate incorrect responses with high confidence. As
clinicians may rely on a model's stated confidence to gauge the reliability of
its predictions, it is especially important that when a model expresses high
confidence, it is also highly accurate. We introduce Prompt4Trust, the first
reinforcement learning (RL) framework for prompt augmentation targeting
confidence calibration in MLLMs. A lightweight LLM is trained to produce
context-aware auxiliary prompts that guide a downstream task MLLM to generate
responses in which the expressed confidence more accurately reflects predictive
accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically
prioritizes aspects of calibration most critical for safe and trustworthy
clinical decision-making. Beyond improvements driven by this clinically
motivated calibration objective, our proposed method also improves task
accuracy, achieving state-of-the-art medical visual question answering (VQA)
performance on the PMC-VQA benchmark, which is composed of multiple-choice
questions spanning diverse medical imaging modalities. Moreover, our framework
trained with a small downstream task MLLM showed promising zero-shot
generalization to larger MLLMs in our experiments, suggesting the potential for
scalable calibration without the associated computational costs. This work
demonstrates the potential of automated yet human-aligned prompt engineering
for improving the trustworthiness of MLLMs in safety-critical settings. Our
codebase can be found at https://github.com/xingbpshen/prompt4trust.
comment: Accepted to ICCV 2025 Workshop CVAMD
♻ ☆ SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Diffusion large language models (dLLMs) are emerging as an efficient
alternative to autoregressive models due to their ability to decode multiple
tokens in parallel. However, aligning dLLMs with human preferences or
task-specific rewards via reinforcement learning (RL) is challenging because
their intractable log-likelihood precludes the direct application of standard
policy gradient methods. While prior work uses surrogates like the evidence
lower bound (ELBO), these one-sided approximations can introduce significant
policy gradient bias. To address this, we propose the Sandwiched Policy
Gradient (SPG) that leverages both an upper and a lower bound of the true
log-likelihood. Experiments show that SPG significantly outperforms baselines
based on ELBO or one-step estimation. Specifically, SPG improves the accuracy
over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500,
18.4% in Countdown and 27.0% in Sudoku.
♻ ☆ Steering LLMs for Formal Theorem Proving
Recent advances in automated theorem proving use Large Language Models (LLMs)
to translate informal mathematical statements into formal proofs. However,
informal cues are often ambiguous or lack strict logical structure, making it
hard for models to interpret them precisely. While existing methods achieve
strong performance, little is known about how LLMs internally represent
informal cues, or how these influence proof generation. To address this, we
explore activation steering, an inference-time intervention that
identifies linear directions in residual activations associated with informal
reasoning traces and adjusts them to improve proof construction without
fine-tuning. This mechanism also yields interpretable information about how
reasoning is internally encoded in the activation space of LLMs. We test our
method for generating formal proofs from already-formalized theorems. Our
contributions are twofold: (1) a novel activation-based intervention for
guiding proof synthesis in LLMs; and (2) demonstration that this intervention
improves performance under two decoding strategies (sampling and best-first
search) without any further training.
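A minimal sketch of activation steering on plain Python lists. It assumes the steering direction is the difference between mean activations with and without an informal cue, a common construction that the abstract does not confirm for this paper. The vectors, the scale alpha, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(with_cue, without_cue):
    """Difference of means between two sets of residual activations."""
    mu_pos = mean_vector(with_cue)
    mu_neg = mean_vector(without_cue)
    return [a - b for a, b in zip(mu_pos, mu_neg)]


def steer(activation, direction, alpha=1.0):
    """Shift one activation vector along the direction at inference time."""
    return [h + alpha * d for h, d in zip(activation, direction)]


# Hypothetical 2-dimensional activations.
direction = steering_direction([[1.0, 2.0], [3.0, 2.0]], [[0.0, 1.0], [0.0, 1.0]])
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

No weights change: the intervention only adds a fixed vector to activations during the forward pass, which is why it needs no fine-tuning.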
♻ ☆ Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing NeurIPS 2025
Model editing enables targeted updates to the knowledge of large language
models (LLMs) with minimal retraining. Among existing approaches,
locate-then-edit methods constitute a prominent paradigm: they first identify
critical layers, then compute residuals at the final critical layer based on
the target edit, and finally apply least-squares-based multi-layer updates via
residual distribution. While these methods are empirically effective, we identify a
counterintuitive failure mode: residual distribution, a core mechanism in these
methods, introduces weight shift errors that undermine editing precision.
Through theoretical and empirical analysis, we show that such errors increase
with the distribution distance, batch size, and edit sequence length,
ultimately leading to inaccurate or suboptimal edits. To address this, we
propose the Boundary Layer UpdatE
(BLUE) strategy to enhance locate-then-edit methods. Sequential batch editing
experiments on three LLMs and two datasets demonstrate that BLUE not only
delivers an average performance improvement of 35.59%, significantly advancing
the state of the art in model editing, but also enhances the preservation of
LLMs' general capabilities. Our code is available at
https://github.com/xpq-tech/BLUE.
comment: NeurIPS 2025
♻ ☆ When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge? SC
The deployment of large language models (LLMs) like ChatGPT and Gemini has
shown their powerful natural language generation capabilities. However, these
models can inadvertently learn and retain sensitive information and harmful
content during training, raising significant ethical and legal concerns. To
address these issues, machine unlearning has been introduced as a potential
solution. While existing unlearning methods take into account the specific
characteristics of LLMs, they often suffer from high computational demands,
limited applicability, or the risk of catastrophic forgetting. To address these
limitations, we propose a lightweight behavioral unlearning framework based on
Retrieval-Augmented Generation (RAG) technology. By modifying the external
knowledge base of RAG, we simulate the effects of forgetting without directly
interacting with the unlearned LLM. We approach the construction of unlearned
knowledge as a constrained optimization problem, deriving two key components
that underpin the effectiveness of RAG-based unlearning. This RAG-based
approach is particularly effective for closed-source LLMs, where existing
unlearning methods often fail. We evaluate our framework through extensive
experiments on both open-source and closed-source models, including ChatGPT,
Gemini, Llama-2-7b-chat, and PaLM 2. The results demonstrate that our approach
meets five key unlearning criteria: effectiveness, universality, harmlessness,
simplicity, and robustness. Meanwhile, this approach can extend to multimodal
large language models and LLM-based agents.
comment: 16 pages, 9 figures, 13 tables. To appear in IEEE Transactions on
Dependable and Secure Computing (TDSC), 2025
♻ ☆ Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models
Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman
Although Large Language Models (LLMs) perform well in general fields, they
exhibit a confidence distortion problem on multi-choice question-answering
(MCQA), particularly as the number of answer choices increases. Specifically,
on MCQA with many choices, LLMs suffer from under-confidence in correct
predictions and over-confidence in incorrect ones, leading to a substantially
degraded performance. To solve this problem, we propose Self-ensemble in this
work. Our method splits the choices into several groups and ensembles LLM
predictions across these groups to reach a final decision. The advantage of
Self-ensemble is its plug-and-play nature, where it can be integrated into
existing LLM architectures based on a designed attention mask and positional
encoding, without requiring labeled datasets for parameter tuning. Experimental
results on three LLMs and datasets demonstrate that Self-ensemble
comprehensively addresses the confidence distortion problem of LLMs,
outperforming standard inference as well as baseline methods.
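A minimal sketch of a group-then-combine procedure for many-choice questions. It is a simplification: the abstract describes integrating the method into the model through an attention mask and positional encoding, which this sketch does not reproduce. The two-round scheme, the toy scores, and the function names are assumptions.

```python
def split_into_groups(choices, group_size):
    """Partition the answer choices into consecutive groups."""
    return [choices[i:i + group_size] for i in range(0, len(choices), group_size)]


def self_ensemble(choices, group_size, pick_best):
    """Pick a winner within each small group, then decide among the winners."""
    winners = [pick_best(group) for group in split_into_groups(choices, group_size)]
    return pick_best(winners)


# Hypothetical per-choice scores standing in for LLM predictions.
toy_scores = {"A": 0.1, "B": 0.4, "C": 0.2, "D": 0.9, "E": 0.3, "F": 0.5}


def pick_by_score(options):
    return max(options, key=toy_scores.get)


answer = self_ensemble(list(toy_scores), group_size=2, pick_best=pick_by_score)
```

Each call to pick_best sees only a few options, which is the point of the split: the model never has to rank all choices at once.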