Computation and Language
☆ TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
Process Reward Models (PRMs) have recently emerged as a powerful framework
for enhancing the reasoning capabilities of large reasoning models (LRMs),
particularly in the context of test-time scaling (TTS). However, their
potential for supervising LRMs on tabular reasoning domains remains
underexplored. Through detailed empirical analyses, we identify that existing
PRMs, though widely adopted for supervising text-only reasoning steps, struggle
with table-specific operations such as sub-table retrieval and schema
interaction, leading to critical performance bottlenecks. To address this
limitation, we propose TaTToo, a novel table-grounded PRM framework that (i)
reasons explicitly over tabular reasoning steps and (ii) integrates tool-based
verification to provide precise reward supervision. Concretely, we first design
a scalable data curation pipeline that constructs over 60k high-quality
step-level annotations by integrating table verification rationales with
tool-based executions. Building on the collected data, we train TaTToo with a
dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use
reasoning patterns, followed by reinforcement learning with tool-grounded
reward shaping to align our model with table-based verification. We provide a
comprehensive evaluation of the policy improvement induced by our newly
designed PRM. Across 5 challenging tabular reasoning benchmarks covering
numerical reasoning, fact-checking, and data analysis, TaTToo improves
downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines
such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong
generalizability across diverse TTS strategies.
☆ Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
Large language model (LLM) agents increasingly rely on external tools such as
search engines to solve complex, multi-step problems, and reinforcement
learning (RL) has become a key paradigm for training them. However, the
trajectories of search agents are structurally heterogeneous, where variations
in the number, placement, and outcomes of search calls lead to fundamentally
different answer directions and reward distributions. Standard policy gradient
methods, which use a single global baseline, suffer from what we identify and
formalize as cross-stratum bias-an "apples-to-oranges" comparison of
heterogeneous trajectories. This cross-stratum bias distorts credit assignment
and hinders exploration of complex, multi-step search strategies. To address
this, we propose Stratified GRPO, whose central component, Stratified Advantage
Normalization (SAN), partitions trajectories into homogeneous strata based on
their structural properties and computes advantages locally within each
stratum. This ensures that trajectories are evaluated only against their true
peers. Our analysis proves that SAN eliminates cross-stratum bias, yields
conditionally unbiased unit-variance estimates inside each stratum, and retains
the global unbiasedness and unit-variance properties enjoyed by standard
normalization, resulting in a more pure and scale-stable learning signal. To
improve practical stability under finite-sample regimes, we further linearly
blend SAN with the global estimator. Extensive experiments on diverse
single-hop and multi-hop question-answering benchmarks demonstrate that
Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3
points, achieving higher training rewards, greater training stability, and more
effective search policies. These results establish stratification as a
principled remedy for structural heterogeneity in RL for LLM search agents.
☆ TokenChain: A Discrete Speech Chain via Semantic Token Modeling ICASSP
Machine Speech Chain, simulating the human perception-production loop, proves
effective in jointly improving ASR and TTS. We propose TokenChain, a fully
discrete speech chain coupling semantic-token ASR with a two-stage TTS: an
autoregressive text-to-semantic model co-trained with ASR and a
masked-generative semantic-to-acoustic model for synthesis only. End-to-end
feedback across the text interface is enabled with straight-through
argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight
averaging. Ablations examine optimal temperature schedules for in- and
cross-domain transfer. Evaluation reveals TokenChain surpasses baseline
accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with
stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by
31% on TED-LIUM with minimal forgetting, showing that chain learning remains
effective with token interfaces and models.
comment: 5 pages, 3 figures. Submitted to IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) 2026
☆ Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
This paper introduces a framework for relation extraction (RE) that enhances
both accuracy and explainability. The framework has two key components: (i) a
reasoning mechanism that formulates relation extraction as a series of
text-processing steps inspired by cognitive science, and (ii) an optimization
process driven by reinforcement learning (RL) with a novel reward function
designed to improve both task accuracy and explanation quality. We call our
approach CogRE. Our framework addresses the lack of supervision for
language-based explanations in traditional RE by promoting outputs that include
important relation keywords. These keywords are drawn from a high-quality
dictionary that is automatically constructed using an LLM. We evaluate our
approach for the task of one-shot RE using two LLMs and two RE datasets. Our
experiments show that CogRE improves explanation quality by addressing two
common failure patterns in one-shot RE: poor attention focus and limited
one-shot learning capability. For example, our cognitive-structured reasoning
with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing
prior reasoning-based designs. Optimizing this approach with RL using our
reward further improves performance by +23.46% (absolute). Finally, human
evaluation shows that our best model generates relational keywords closely
aligned with gold labels, increasing human explanation quality ratings by 54%
(relative).
comment: Working in process
☆ Latent Speech-Text Transformer
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Auto-regressive speech-text models are typically pre-trained on a large
number of interleaved sequences of text tokens and raw speech encoded as speech
tokens using vector quantization. These models have demonstrated
state-of-the-art performance in speech-to-speech understanding and generation
benchmarks, together with promising scaling laws, primarily enabled by the
representational alignment between text and speech. Nevertheless, they suffer
from shortcomings, partly owing to the disproportionately longer sequences of
speech tokens in contrast to textual tokens. This results in a large compute
imbalance between modalities during pre-training as well as during inference,
and a potential hindrance to effectively aligning speech and text, ultimately
translating to several orders of magnitude slower scaling laws. We introduce
the Latent Speech-Text Transformer (LST), which makes pre-training speech-text
models more data-efficient by dynamically and inexpensively aggregating speech
tokens into latent speech patches. These patches serve as higher-level units
that can either align with corresponding textual units to aid capability
transfer or even encapsulate common speech sequences like silences to be more
compute-efficient. We show that LST outperforms vanilla approaches on
speech-to-speech as well as text-to-text benchmarks in both data- and
compute-controlled settings, the former indicating more effective
representational alignment and the latter indicating steeper scaling laws for
speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute
gain in speech accuracy under compute-controlled training and 5.3% under
data-controlled training, while also improving text performance. We will
release our models, code, and the evaluation data to facilitate further
research.
comment: 16 pages, 13 figures
☆ BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
Real-time speech assistants are becoming increasingly popular for ensuring
improved accessibility to information. Bengali, being a low-resource language
with a high regional dialectal diversity, has seen limited progress in
developing such systems. Existing systems are not optimized for real-time use
and focus only on standard Bengali. In this work, we present BanglaTalk, the
first real-time speech assistance system for Bengali regional dialects.
BanglaTalk follows the client-server architecture and uses the Real-time
Transport Protocol (RTP) to ensure low-latency communication. To address
dialectal variation, we introduce a dialect-aware ASR system, BRDialect,
developed by fine-tuning the IndicWav2Vec model in ten Bengali regional
dialects. It outperforms the baseline ASR models by 12.41-33.98% on the
RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of
24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low
bandwidth usage and minimal end-to-end delay make the system both
cost-effective and interactive for real-time use cases, enabling inclusive and
accessible speech technology for the diverse community of Bengali speakers.
☆ RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Xue Liu, Irwin King, Philip S. Yu
Large language models (LLMs) show the promise in supporting scientific
research implementation, yet their ability to generate correct and executable
code remains limited. Existing works largely adopt one-shot settings, ignoring
the iterative and feedback-driven nature of realistic workflows of scientific
research development. To address this gap, we present RECODE-H, a benchmark of
102 tasks from research papers and repositories that evaluates LLM agents
through multi-turn interactions with LLM-simulated human feedback. It includes
structured instructions,unit tests, and a five-level feedback hierarchy to
reflect realistic researcher-agent collaboration. We further present
ReCodeAgent, a framework that integrates feedback into iterative code
generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4,
DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer
feedback, while also highlighting ongoing challenges in the generation of
complex research code. RECODE-H establishes a foundation for developing
adaptive, feedback-driven LLM agents in scientific research implementation
comment: Code and dataset are available at github.com/ChunyuMiao98/RECODE
☆ Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
A key component of in-context reasoning is the ability of language models
(LMs) to bind entities for later retrieval. For example, an LM might represent
"Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann"
when asked "Who loves pie?" Prior research on short lists of bound entities
found strong evidence that LMs implement such retrieval via a positional
mechanism, where "Ann" is retrieved based on its position in context. In this
work, we find that this mechanism generalizes poorly to more complex settings;
as the number of bound entities in context increases, the positional mechanism
becomes noisy and unreliable in middle positions. To compensate for this, we
find that LMs supplement the positional mechanism with a lexical mechanism
(retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism
(retrieving "Ann" through a direct pointer). Through extensive experiments on
nine models and ten binding tasks, we uncover a consistent pattern in how LMs
mix these mechanisms to drive model behavior. We leverage these insights to
develop a causal model combining all three mechanisms that estimates next token
distributions with 95% agreement. Finally, we show that our model generalizes
to substantially longer inputs of open-ended text interleaved with entity
groups, further demonstrating the robustness of our findings in more natural
settings. Overall, our study establishes a more complete picture of how LMs
bind and retrieve entities in-context.
☆ VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
The Key-Value (KV) cache introduces substantial memory overhead during large
language model (LLM) inference. Although existing vector quantization (VQ)
methods reduce KV cache usage and provide flexible representational capacity
across bit-widths, they suffer severe performance degradation at ultra-low
bit-widths due to key cache outliers that hinder effective codebook
utilization. To address this challenge, we propose VecInfer, a novel VQ method
for aggressive KV cache compression while enabling efficient inference. By
applying smooth and Hadamard transformations, VecInfer suppresses outliers in
the key cache, enabling the codebook to comprehensively cover the original data
distribution and thereby reducing quantization difficulty. To facilitate
efficient deployment, we design an optimized CUDA kernel that fuses computation
with dequantization to minimize memory access overhead. Extensive evaluations
demonstrate that VecInfer consistently outperforms existing quantization
baselines across both long-context understanding and mathematical reasoning
tasks. With only 2-bit quantization, VecInfer achieves performance comparable
to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in
large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in
single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
☆ RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
LLMs are powerful generators of synthetic data, which are used for training
smaller, specific models. This is especially valuable for low-resource
languages, where human-labelled data is scarce but LLMs can still produce
high-quality text. However, LLMs differ in how useful their outputs are for
training. Selecting the best LLM as a generator is challenging because
extrinsic evaluation requires costly human annotations (which are often
unavailable for low-resource languages), while intrinsic metrics correlate
poorly with downstream performance. We introduce Round robin Synthetic data
Evaluation (RoSE), a proxy metric for selecting the best LLM generator without
human test sets. RoSE trains a small model on the outputs of a candidate
generator (LLM) and then evaluates it on generated synthetic examples from all
other candidate LLMs. The final RoSE score is the mean performance of this
small model. Across six LLMs, eleven languages, and three tasks (sentiment,
topic, intent), RoSE identifies the optimal generator more often than any other
intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within
0.76 percentage points of the optimal generator baseline. This result is
measured in terms of downstream performance, obtained by training a small model
on the chosen generator's outputs (optimal vs. proxy metric selected) and
evaluating it on human-labelled test data. Additionally, RoSE is the only
metric to achieve a positive correlation with performance on human test data.
comment: 16 pages
☆ CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits
Diffusion large language models (dLLMs) generate text through iterative
denoising steps, achieving parallel decoding by denoising only high-confidence
positions at each step. However, existing approaches often repetitively remask
tokens due to initially low confidence scores, leading to redundant iterations
and limiting overall acceleration. Through the analysis of dLLM decoding
traces, we observe that the model often determines the final prediction for a
token several steps before the decoding step. To leverage this historical
information and avoid redundant steps, we introduce the concept of Trace
Credit, which quantifies each token's convergence potential by accumulating
historical logits. Furthermore, we propose CreditDecoding, a training-free
parallel decoding algorithm that accelerates the confidence convergence of
correct but underconfident tokens by fusing current logits with Trace Credit.
This process significantly reduces redundant iterations and enhances decoding
robustness. On eight benchmarks, CreditDecoding achieves a 5.48 times speedup
and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times
speedup with a 0.15 performance improvement over LLaDA-MoE-Instruct.
Importantly, CreditDecoding scales effectively to long sequences and is
orthogonal to mainstream inference optimizations, making it a readily
integrable and versatile solution.
comment: 18 pages,8 figures,4 tables
☆ Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer
Tokenization defines the foundation of multilingual language models by
determining how words are represented and shared across languages. However,
existing methods often fail to support effective cross-lingual transfer because
semantically equivalent words are assigned distinct embeddings. For example, "I
eat rice" in English and "Ina cin shinkafa" in Hausa are typically mapped to
different vocabulary indices, preventing shared representations and limiting
cross-lingual generalization. We introduce parallel tokenizers. This new
framework trains tokenizers monolingually and then aligns their vocabularies
exhaustively using bilingual dictionaries or word-to-word translation, ensuring
consistent indices for semantically equivalent words. This alignment enforces a
shared semantic space across languages while naturally improving fertility
balance. To assess their effectiveness, we pretrain a transformer encoder from
scratch on thirteen low-resource languages and evaluate it on sentiment
analysis, hate speech detection, emotion classification, and sentence embedding
similarity. Across all tasks, models trained with parallel tokenizers
outperform conventional multilingual baselines, confirming that rethinking
tokenization is essential for advancing multilingual representation
learning--especially in low-resource settings.
comment: 18 pages, 25 tables, 7 figures
☆ Taxonomy of User Needs and Actions
The growing ubiquity of conversational AI highlights the need for frameworks
that capture not only users' instrumental goals but also the situated,
adaptive, and social practices through which they achieve them. Existing
taxonomies of conversational behavior either overgeneralize, remain
domain-specific, or reduce interactions to narrow dialogue functions. To
address this gap, we introduce the Taxonomy of User Needs and Actions (TUNA),
an empirically grounded framework developed through iterative qualitative
analysis of 1193 human-AI conversations, supplemented by theoretical review and
validation across diverse contexts. TUNA organizes user actions into a
three-level hierarchy encompassing behaviors associated with information
seeking, synthesis, procedural guidance, content creation, social interaction,
and meta-conversation. By centering user agency and appropriation practices,
TUNA enables multi-scale evaluation, supports policy harmonization across
products, and provides a backbone for layering domain-specific taxonomies. This
work contributes a systematic vocabulary for describing AI use, advancing both
scholarly understanding and practical design of safer, more responsive, and
more accountable conversational systems.
☆ Influence Functions for Efficient Data Selection in Reasoning
Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows
that a small amount of high-quality data can outperform massive datasets. Yet,
what constitutes "quality" remains ill-defined. Existing reasoning methods rely
on indirect heuristics such as problem difficulty or trace length, while
instruction-tuning has explored a broader range of automated selection
strategies, but rarely in the context of reasoning. We propose to define
reasoning data quality using influence functions, which measure the causal
effect of individual CoT examples on downstream accuracy, and introduce
influence-based pruning, which consistently outperforms perplexity and
embedding-based baselines on math reasoning within a model family.
☆ Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
Large Language Models (LLMs) are prone to hallucination, the generation of
plausible yet factually incorrect statements. This work investigates the
intrinsic, architectural origins of this failure mode through three primary
contributions.First, to enable the reliable tracing of internal semantic
failures, we propose \textbf{Distributional Semantics Tracing (DST)}, a unified
framework that integrates established interpretability techniques to produce a
causal map of a model's reasoning, treating meaning as a function of context
(distributional semantics). Second, we pinpoint the model's layer at which a
hallucination becomes inevitable, identifying a specific \textbf{commitment
layer} where a model's internal representations irreversibly diverge from
factuality. Third, we identify the underlying mechanism for these failures. We
observe a conflict between distinct computational pathways, which we interpret
using the lens of dual-process theory: a fast, heuristic \textbf{associative
pathway} (akin to System 1) and a slow, deliberate \textbf{contextual pathway}
(akin to System 2), leading to predictable failure modes such as
\textit{Reasoning Shortcut Hijacks}. Our framework's ability to quantify the
coherence of the contextual pathway reveals a strong negative correlation
($\rho = -0.863$) with hallucination rates, implying that these failures are
predictable consequences of internal semantic weakness. The result is a
mechanistic account of how, when, and why hallucinations occur within the
Transformer architecture.
☆ The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models NeurIPS 2025
Distilling the thinking traces of a Large Language Model (LLM) with reasoning
capabilities into a smaller model has been proven effective. Yet, there is a
scarcity of work done on how model performances scale with the quantity of
distillation data. In this work, we study the scaling trend of distilling
competitive coding skills on two small non-reasoning LLMs. We validate the
hypothesis that there is a $\textit{valley of code reasoning}$: downstream
performance on competitive coding first drops as data quantity increases, then
it steadily increases in a sharper-than-log-linear fashion. Having identified
the trend, we further fine-tune the models at two different distillation stages
on the same data to ground conclusions on their respective learning phases. We
learn that across stages in the low and medium-low data regimes, small models
benefit significantly from easier coding questions than from harder ones. We
also find that, surprisingly, the correctness of outputs in training data makes
no difference to distillation outcomes. Our work represents a step forward in
understanding the training dynamics of code reasoning distillation outside
intuition
comment: NeurIPS 2025 Workshop on Deep Learning for Code (DL4C), Project page:
https://collinear.ai/valley-of-reasoning
☆ The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
The objectives that Large Language Models (LLMs) implicitly optimize remain
dangerously opaque, making trustworthy alignment and auditing a grand
challenge. While Inverse Reinforcement Learning (IRL) can infer reward
functions from behaviour, existing approaches either produce a single,
overconfident reward estimate or fail to address the fundamental ambiguity of
the task (non-identifiability). This paper introduces a principled auditing
framework that re-frames reward inference from a simple estimation task to a
comprehensive process for verification. Our framework leverages Bayesian IRL to
not only recover a distribution over objectives but to enable three critical
audit capabilities: (i) Quantifying and systematically reducing
non-identifiability by demonstrating posterior contraction over sequential
rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics
that expose spurious shortcuts and identify out-of-distribution prompts where
the inferred objective cannot be trusted; and (iii) Validating policy-level
utility by showing that the refined, low-uncertainty reward can be used
directly in RLHF to achieve training dynamics and toxicity reductions
comparable to the ground-truth alignment process. Empirically, our framework
successfully audits a detoxified LLM, yielding a well-calibrated and
interpretable objective that strengthens alignment guarantees. Overall, this
work provides a practical toolkit for auditors, safety teams, and regulators to
verify what LLMs are truly trying to achieve, moving us toward more trustworthy
and accountable AI.
comment: Preprint
☆ Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
Reinforcement Learning from Human Feedback (RLHF) aligns Large Language
Models (LLMs) with human preferences, yet the underlying reward signals they
internalize remain hidden, posing a critical challenge for interpretability and
safety. Existing approaches attempt to extract these latent incentives using
Inverse Reinforcement Learning (IRL), but treat all preference pairs equally,
often overlooking the most informative signals: those examples the extracted
reward model misclassifies or assigns nearly equal scores, which we term
\emph{failures}. We introduce a novel \emph{failure-aware} IRL algorithm that
focuses on misclassified or difficult examples to recover the latent rewards
defining model behaviors. By learning from these failures, our failure-aware
IRL extracts reward functions that better reflect the true objectives behind
RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines
across multiple metrics when applied to LLM detoxification, without requiring
external classifiers or supervision. Crucially, failure-aware IRL yields
rewards that better capture the true incentives learned during RLHF, enabling
more effective re-RLHF training than standard IRL. This establishes
failure-aware IRL as a robust, scalable method for auditing model alignment and
reducing ambiguity in the IRL process.
comment: Preprint
☆ Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability
Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi
Language model post-training has enhanced instruction-following and
performance on many downstream tasks, but also comes with an often-overlooked
cost on tasks with many possible valid answers. We characterize three
desiderata for conditional distributional modeling: in-context steerability,
valid output space coverage, and distributional alignment, and document across
three model families how current post-training can reduce these properties. In
particular, we disambiguate between two kinds of in-context learning: ICL for
eliciting existing underlying knowledge or capabilities, and in-context
steerability, where a model must use in-context information to override its
priors and steer to a novel data generating distribution. To better evaluate
and improve these desiderata, we introduce Spectrum Suite, a large-scale
resource compiled from >40 data sources and spanning >90 tasks requiring models
to steer to and match diverse distributions ranging from varied human
preferences to numerical distributions and more. We find that while current
post-training techniques help elicit underlying capabilities and knowledge,
they hurt models' ability to flexibly steer in-context. To mitigate these
issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite
to improve steerability and distributional coverage. We find that Spectrum
Tuning often improves over pretrained models and their instruction-tuned
counterparts, enhancing steerability, spanning more of the output space, and
improving distributional alignment on held-out datasets.
☆ ASPO: Asymmetric Importance Sampling Policy Optimization
Recent Large Language Model (LLM) post-training methods rely on token-level
clipping mechanisms during Reinforcement Learning (RL). However, we identify a
fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance
Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to
unbalanced token weighting for positive and negative tokens. This mismatch
suppresses the update of low-probability tokens while over-amplifying already
high-probability ones. To address this, we propose Asymmetric Importance
Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy
that flips the IS ratios of positive-advantage tokens, aligning their update
direction with the learning dynamics of negative ones. AIS further incorporates
a soft dual-clipping mechanism to stabilize extreme updates while maintaining
gradient flow. Comprehensive experiments on coding and mathematical reasoning
benchmarks demonstrate that ASPO significantly mitigates premature convergence,
improves training stability, and enhances final performance over strong
GRPO-based baselines. Our analysis provides new insights into the role of
token-level weighting in OSRL and highlights the critical importance of
correcting IS in LLM RL. The code and models of ASPO are available at
https://github.com/wizard-III/Archer2.0.
☆ MixReasoning: Switching Modes to Think
Reasoning models enhance performance by tackling problems in a step-by-step
manner, decomposing them into sub-problems and exploring long chains of thought
before producing an answer. However, applying extended reasoning to every step
introduces substantial redundancy, as sub-problems vary widely in difficulty
and complexity: a small number of pivotal steps are genuinely challenging and
decisive for the final answer, while many others only involve straightforward
revisions or simple computations. Therefore, a natural idea is to endow
reasoning models with the ability to adaptively respond to this variation,
rather than treating all steps with the same level of elaboration. To this end,
we propose MixReasoning, a framework that dynamically adjusts the depth of
reasoning within a single response. The resulting chain of thought then becomes
a mixture of detailed reasoning on difficult steps and concise inference on
simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning
shortens reasoning length and substantially improves efficiency without
compromising accuracy.
☆ CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li
Large Language Models (LLMs) have achieved remarkable success across a wide
range of natural language processing tasks. However, Chinese LLMs face unique
challenges, primarily due to the dominance of unstructured free text and the
lack of structured representations in Chinese corpora. While existing
benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly
English-centric and fail to address the unique linguistic characteristics of
Chinese, lacking structured datasets essential for robust evaluation. To
address these challenges, we present a Comprehensive Benchmark for Evaluating
Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese
Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million
aligned text pairs, each consisting of unstructured text coupled with one or
more corresponding triples, alongside a total of 15 million triples spanning
four critical domains. The core contributions of CDTP are threefold: (i)
enriching Chinese corpora with high-quality structured information; (ii)
enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii)
supporting multi-task fine-tuning to assess generalization and robustness
across scenarios, including Knowledge Graph Completion, Triple-to-Text
generation, and Question Answering. Furthermore, we conduct rigorous
evaluations through extensive experiments and ablation studies to assess the
effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark.
To support reproducible research, we offer an open-source codebase and outline
potential directions for future investigations based on our insights.
☆ Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
Recent studies employing Large Language Models (LLMs) to test the Argument
from the Poverty of the Stimulus (APS) have yielded contrasting results across
syntactic phenomena. This paper investigates the hypothesis that
characteristics of the stimuli used in recent studies, including lexical
ambiguities and structural complexities, may confound model performance. A
methodology is proposed for re-evaluating LLM competence on syntactic
prediction, focusing on GPT-2. This involves: 1) establishing a baseline on
previously used (both filtered and unfiltered) stimuli, and 2) generating a
new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5
Pro Preview) guided by linguistically-informed templates designed to mitigate
identified confounds. Our preliminary findings indicate that GPT-2 demonstrates
notably improved performance on these refined PG stimuli compared to baselines,
suggesting that stimulus quality significantly influences outcomes in
surprisal-based evaluations of LLM syntactic competency.
comment: Presented at https://brigap-workshop.github.io/ Information to be
updated upon publication of proceedings
☆ MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation
Qin Dong, Yuntian Tang, Heming Jia, Yunhang Shen, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Shaohui Lin
Low-Rank Adaptation (LoRA) has emerged as a dominant method in
Parameter-Efficient Fine-Tuning (PEFT) for large language models, which
augments the transformer layer with one down-projection $A$ and one
up-projection $B$. However, LoRA's reliance on a single down-projection matrix
($A$) creates a representational bottleneck, as this solitary feature extractor
is inherently insufficient for capturing the diverse signals required by
complex tasks. This motivates our architectural shift to focus on enriching the
feature adaptation to improve the downstream task adaptation ability. We
propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a
multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is
asymmetrically shared across layers to ensure parameter efficiency. In MASA,
these specialized experts capture diverse features, which are then integrated
by a single, layer-specific $B$-matrix. The effectiveness and versatility of
our method are validated through a comprehensive suite of experiments spanning
multi-domain generalization, single-domain specialization, and multi-task
reasoning. For example, on the MMLU benchmark, MASA achieves an average
accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative
improvement of 1.84%) with comparable learnable parameters of 0.52%.
comment: 14 pages, 5 figures
☆ Deterministic Legal Retrieval: An Action API for Querying the SAT-Graph RAG
The Structure-Aware Temporal Graph RAG (SAT-Graph RAG) addresses core
limitations of standard Retrieval-Augmented Generation in the legal domain by
providing a verifiable knowledge graph that models hierarchical structure,
temporal evolution, and causal events of legal norms. However, a critical gap
remains: how to reliably query this structured knowledge without sacrificing
its deterministic properties. This paper introduces the SAT-Graph API, a formal
query execution layer centered on canonical actions-atomic, composable, and
auditable primitives that isolate probabilistic discovery from deterministic
retrieval. These actions enable: (i) high-precision hybrid search; (ii) robust
reference resolution; (iii) point-in-time version retrieval; and (iv) auditable
causal tracing. We demonstrate how planner-guided agents can decompose complex
queries into Directed Acyclic Graphs (DAGs) of these actions. This two-layer
architecture transforms retrieval from an opaque black box to a transparent,
auditable process, directly addressing Explainable AI (XAI) requirements for
high-stakes domains.
☆ Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
Recent studies probing the Argument from the Poverty of the Stimulus (APS)
have applied Large Language Models (LLMs) to test the learnability of complex
syntax through surprisal-based metrics. However, divergent conclusions raise
questions concerning the insights these metrics offer. While Wilcox et al.
(2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate
that models successfully generalise knowledge of filler-gap dependencies, Lan
et al. (2024) used a Difference-in-Differences (DiD) metric and found that
models largely fail on parasitic gaps (PGs). This paper argues that the direct
minimal pair approach offers greater diagnostic transparency. We demonstrate
this by generating a full 8-permutation paradigm of refined PG stimuli and
evaluating the GPT-2 model used in previous studies with a systematic
Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across
all four tested conditions, indicating robust knowledge of filler-gap licensing
principles even in complex PG environments. This finding, which contrasts with
the more ambiguous results from DiD-style metrics, suggests that the choice of
evaluation metric is critical for assessing an LLM's syntactic competence.
comment: Presented at the https://brigap-workshop.github.io/ Information to be
updated after publication of proceedings
☆ Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
Large Language Models (LLMs) are increasingly applied to complex tasks that
require extended reasoning. In such settings, models often benefit from diverse
chains-of-thought to arrive at multiple candidate solutions. This requires two
competing objectives: to inject enough stochasticity to explore multiple
reasoning chains, and to ensure sufficient accuracy and quality in each path.
Existing works pursue the first objective by increasing exploration at highly
uncertain steps with higher temperature or larger candidate token sets, while
others improve reliability by rejecting samples with low confidence
post-generation, implying that low confidence correlates with low answer
quality. These two lines of thought are in conflict, as they conflate different
sources of uncertainty. To resolve this, we argue that the decoding rule should
be calibrated by correctness, not confidence alone. We should sample from
tokens with higher estimated correctness, and reduce sampling where expected
correctness is low. We propose simple strategies that achieve this goal:
Greedy-Threshold makes sampling greedy at very low confidence steps.
Calibrated-TopK and Calibrated-epsilon set truncation threshold based on
estimated rank-wise correctness. Together, our findings challenge prevailing
heuristics about decoding under uncertainty and show gains across math and
general reasoning benchmarks.
☆ LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language
Owing to their reasoning capabilities, large language models (LLMs) have been
evaluated on planning tasks described in natural language. However, LLMs have
largely been tested on planning domains without constraints. In order to deploy
them in real-world settings where adherence to constraints, in particular
safety constraints, is critical, we need to evaluate their performance on
constrained planning tasks. We introduce LexiCon -- a natural language-based
(Lexi) constrained (Con) planning benchmark, consisting of a suite of
environments, that can be used to evaluate the planning capabilities of LLMs in
a principled fashion. The core idea behind LexiCon is to take existing planning
environments and impose temporal constraints on the states. These constrained
problems are then translated into natural language and given to an LLM to
solve. A key feature of LexiCon is its extensibility. That is, the set of
supported environments can be extended with new (unconstrained) environment
generators, for which temporal constraints are constructed automatically. This
renders LexiCon future-proof: the hardness of the generated planning problems
can be increased as the planning capabilities of LLMs improve. Our experiments
reveal that the performance of state-of-the-art LLMs, including reasoning
models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of
the planning tasks increases.
☆ Probing the Difficulty Perception Mechanism of Large Language Models
Large language models (LLMs) are increasingly deployed on complex reasoning
tasks, yet little is known about their ability to internally evaluate problem
difficulty, which is an essential capability for adaptive reasoning and
efficient resource allocation. In this work, we investigate whether LLMs
implicitly encode problem difficulty in their internal representations. Using a
linear probe on the final-token representations of LLMs, we demonstrate that
the difficulty level of math problems can be linearly modeled. We further
locate the specific attention heads of the final Transformer layer: these
attention heads have opposite activation patterns for simple and difficult
problems, thus achieving perception of difficulty. Our ablation experiments
prove the accuracy of the location. Crucially, our experiments provide
practical support for using LLMs as automatic difficulty annotators,
potentially substantially reducing reliance on costly human labeling in
benchmark construction and curriculum learning. We also uncover that there is a
significant difference in entropy and difficulty perception at the token level.
Our study reveals that difficulty perception in LLMs is not only present but
also structurally organized, offering new theoretical insights and practical
directions for future research.
☆ MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Conducting contamination-free evaluation of mathematical capabilities can be
difficult for two reasons: models may memorize a test set once it is made
public, and current mathematical benchmarks are prone to overfitting due to
having limited diversity of symbols and rules, coupled with closed-ended
answers. This paper proposes a method to leverage these shortcomings as useful
features to a construct dynamic, counterfactual benchmark, which can be used to
both reveal overfitting and measure true reasoning. We demonstrate this via
MatheMagic, which generates math test instances with the interpretations of
numbers and operators altered, yet has automatically verifiable answers. Test
instances are randomly seeded and constructed at test time to evaluate a
model's induction or deduction capability, offering stability, extensibility,
comparability, and robustness to overfitting. Our experiments find that models
solve deduction more easily than induction, but they revert to standard math.
Further analysis reveals that math-adapted models fail to exhibit a general
"skill" of reasoning, and fine-tuning on induction tasks generalizes poorly.
☆ EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that
uses two scoring methods (log-probabilities and direct ratings) plus a
model-as-judge peer review to evaluate moral alignment in 20 large language
models. We assess models on the World Values Survey (55 countries, 19 topics)
and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL,
top models align closely with survey responses (Pearson's r approximately 0.90
on WVS). Yet we find a clear regional difference: Western regions average
r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap),
indicating consistent regional bias. Our framework adds three parts: (1) two
scoring methods for all models to enable fair comparison, (2) a structured
chain-of-thought protocol with self-consistency checks, and (3) a
model-as-judge peer review that flags 348 conflicts using a data-driven
threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39,
both p<.001), supporting automated quality checks. These results show real
progress toward culture-aware AI while highlighting open challenges for use
across regions.
☆ Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
Cultural evaluation of large language models has become increasingly
important, yet current benchmarks often reduce culture to static facts or
homogeneous values. This view conflicts with anthropological accounts that
emphasize culture as dynamic, historically situated, and enacted in practice.
To analyze this gap, we introduce a four-part framework that categorizes how
benchmarks frame culture, such as knowledge, preference, performance, or bias.
Using this lens, we qualitatively examine 20 cultural benchmarks and identify
six recurring methodological issues, including treating countries as cultures,
overlooking within-culture diversity, and relying on oversimplified survey
formats. Drawing on established anthropological methods, we propose concrete
improvements: incorporating real-world narratives and scenarios, involving
cultural communities in design and validation, and evaluating models in context
rather than isolation. Our aim is to guide the development of cultural
benchmarks that go beyond static recall tasks and more accurately capture the
responses of the models to complex cultural situations.
comment: 12 pages; 2 figures; First two author contributed equally
☆ Prompt reinforcing for long-term planning of large language models
Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić
Large language models (LLMs) have achieved remarkable success in a wide range
of natural language processing tasks and can be adapted through prompting.
However, they remain suboptimal in multi-turn interactions, often relying on
incorrect early assumptions and failing to track user goals over time, which
makes such tasks particularly challenging. Prior works in dialogue systems have
shown that long-term planning is essential for handling interactive tasks. In
this work, we propose a prompt optimisation framework inspired by reinforcement
learning, which enables such planning to take place by only modifying the task
instruction prompt of the LLM-based agent. By generating turn-by-turn feedback
and leveraging experience replay for prompt rewriting, our proposed method
shows significant improvement in multi-turn tasks such as text-to-SQL and
task-oriented dialogue. Moreover, it generalises across different LLM-based
agents and can leverage diverse LLMs as meta-prompting agents. This warrants
future research in reinforcement learning-inspired parameter-free optimisation
methods.
☆ Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods
Transformers' quadratic computational complexity limits their scalability
despite remarkable performance. While linear attention reduces this to linear
complexity, pre-training such models from scratch remains, in most cases,
prohibitively expensive. Recent post-training linearisation methods convert
pre-trained Transformers to linear models efficiently, often using hybrid
approaches that combine linear attention with sliding-window softmax. We
identify a critical flaw: existing hybrid methods inadvertently bypass the
linear component, relying almost entirely on SWA. Component-level diagnostics
reveal this previously undetected behaviour stems from overlooked evaluation
practices on common-sense benchmarks. We propose three solutions to ensure
balanced component usage: (i) inference-time hybridisation of linear-only
conversions with sliding-window softmax; (ii) HedgeCATs, combining
attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled
Sliding-window Dropout (SSD), which stochastically suppresses the softmax
branch during training to prevent component collapse. Our methods maintain
computational efficiency while recovering most base model performance and
ensuring genuine linear attention adoption, restoring the validity of
performance attributions in hybrid conversions.
☆ The fragility of "cultural tendencies" in LLMs
In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large
language models (LLMs), when prompted in different languages, display
culturally specific tendencies. They report that the two models (i.e., GPT and
ERNIE) respond in more interdependent and holistic ways when prompted in
Chinese, and more independent and analytic ways when prompted in English. LSZ
attribute these differences to deep-seated cultural patterns in the models,
claiming that prompt language alone can induce substantial cultural shifts.
While we acknowledge the empirical patterns they observed, we find their
experiments, methods, and interpretations problematic. In this paper, we
critically re-evaluate the methodology, theoretical framing, and conclusions of
LSZ. We argue that the reported "cultural tendencies" are not stable traits but
fragile artifacts of specific models and task design. To test this, we
conducted targeted replications using a broader set of LLMs and a larger number
of test items. Our results show that prompt language has minimal effect on
outputs, challenging LSZ's claim that these models encode grounded cultural
beliefs.
☆ Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
Large language models (LLMs) increasingly support applications that rely on
extended context, from document processing to retrieval-augmented generation.
While their long-context capabilities are well studied for reasoning and
retrieval, little is known about their behavior in safety-critical scenarios.
We evaluate LLMs' sensitivity to harmful content under extended context,
varying type (explicit vs. implicit), position (beginning, middle, end),
prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens).
Across harmful content categories such as toxic, offensive, and hate speech,
with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance
peaks at moderate harmful prevalence (0.25) but declines when content is very
sparse or dominant; recall decreases with increasing context length; harmful
sentences at the beginning are generally detected more reliably; and explicit
content is more consistently recognized than implicit. These findings provide
the first systematic view of how LLMs prioritize and calibrate harmful content
in long contexts, highlighting both their emerging strengths and the challenges
that remain for safety-critical use.
☆ Revisiting Long-context Modeling from Context Denoising Perspective
Long-context models (LCMs) have demonstrated great potential in processing
long sequences, facilitating many real-world applications. The success of LCMs
can be attributed to their ability to locate implicit critical information
within the context for further prediction. However, recent research reveals
that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens,
that can mislead model attention. In this paper, we conduct a fine-grained
analysis of the context noise and propose an effective metric, the Integrated
Gradient (IG) score, to detect and quantify the noise information within the
context. Our findings reveal that even simple mitigation of detected context
noise can substantially boost the model's attention on critical tokens and
benefit subsequent predictions. Building on this insight, we propose Context
Denoising Training (CDT), a straightforward yet effective training strategy
that improves attention on critical tokens while reinforcing their influence on
model predictions. Extensive experiments across four tasks, under both context
window scaling and long-context alignment settings, demonstrate the superiority
of CDT. Notably, when trained with CDT, an open-source 8B model can achieve
performance (50.92) comparable to GPT-4o (51.00).
☆ Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies
It has become increasingly challenging for firms to comply with a plethora of
novel digital regulations. This is especially true for smaller businesses that
often lack both the resources and know-how to draft complex legal documents.
Instead of seeking costly legal advice from attorneys, firms may turn to
cheaper alternative legal service providers such as automated contract
generators. While these services have a long-standing presence, there is little
empirical evidence on their prevalence and output quality.
We address this gap in the context of a 2023 Swiss privacy law revision. To
enable a systematic evaluation, we create and annotate a multilingual benchmark
dataset that captures key compliance obligations under Swiss and EU privacy
law. Using this dataset, we validate a novel GPT-5-based method for large-scale
compliance assessment of privacy policies, allowing us to measure the impact of
the revision. We observe compliance increases indicating an effect of the
revision. Generators, explicitly referenced by 18% of local websites, are
associated with substantially higher levels of compliance, with increases of up
to 15 percentage points compared to privacy policies without generator use.
These findings contribute to three debates: the potential of LLMs for
cross-lingual legal analysis, the Brussels Effect of EU regulations, and,
crucially, the role of automated tools in improving compliance and contractual
quality.
comment: 23 pages, 4 figures
☆ DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization EMNLP 2025
Large language models (LLMs) have achieved impressive performance in text
summarization, yet their performance often falls short when applied to
specialized domains %or conversational data that differ from their original
pre-training distribution. While fine-tuning can improve summarization quality,
it typically relies on costly and scarce high-quality labeled data. In this
work, we explore continual pre-training as a scalable, self-supervised approach
to adapt LLMs for downstream summarization tasks, particularly in the context
of noisy real-world conversation transcripts. We conduct extensive experiments
using large-scale, unlabeled business conversation data to investigate whether
continual pre-training enhances model capabilities in conversational
summarization. Our results demonstrate that continual pre-training yields
substantial gains in both in-domain and out-of-domain summarization benchmarks,
while maintaining strong generalization and robustness. We also analyze the
effects of data selection strategies, providing practical guidelines for
applying continual pre-training in summarization-focused industrial
applications.
comment: Accepted to the NewSumm Workshop at EMNLP 2025
☆ Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer
The landscape of Large Language Models (LLMs) remains predominantly
English-centric, resulting in a significant performance gap for other major
languages, such as French, especially in the context of Small Language Models
(SLMs). Existing multilingual models demonstrate considerably lower performance
in French compared to English, and research on efficient adaptation methods for
French remains limited. To address this, we introduce \textbf{Luth}, a family
of French-specialized SLMs: through targeted post-training on curated,
high-quality French data, our models outperform all open-source counterparts of
comparable size on multiple French benchmarks while retaining their original
English capabilities. We further show that strategic model merging enhances
performance in both languages, establishing Luth as a new state of the art for
French SLMs and a robust baseline for future French-language research.
comment: 12 pages, 4 figures and 9 tables
☆ EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
Balancing exploration and exploitation remains a central challenge in
reinforcement learning with verifiable rewards (RLVR) for large language models
(LLMs). Current RLVR methods often overemphasize exploitation, leading to
entropy collapse, diminished exploratory capacity, and ultimately limited
performance gains. Although techniques that increase policy stochasticity can
promote exploration, they frequently fail to escape dominant behavioral modes.
This creates a self-reinforcing loop-repeatedly sampling and rewarding dominant
modes-that further erodes exploration. We introduce Exploration-Enhanced Policy
Optimization (EEPO), a framework that promotes exploration via two-stage
rollouts with adaptive unlearning. In the first stage, the model generates half
of the trajectories; it then undergoes a lightweight unlearning step to
temporarily suppress these sampled responses, forcing the second stage to
explore different regions of the output space. This sample-then-forget
mechanism disrupts the self-reinforcing loop and promotes wider exploration
during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO,
achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on
Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
☆ Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak, Rohan Mahesh Awhad, Shivchander Sudalairaj, Kai Xu, Akash Srivastava
Inference-Time Scaling (ITS) improves language models by allocating more
computation at generation time. Particle Filtering (PF) has emerged as a strong
ITS method for complex mathematical reasoning tasks, but it is vulnerable when
guided by process reward models, which often assign overconfident scores early
in the reasoning process. This causes PF to suffer from premature exploitation:
it myopically commits to locally promising trajectories, prunes potentially
correct hypotheses, and converges to suboptimal solutions. This failure mode,
known as particle impoverishment, is especially severe under constrained
computational budgets. To address this, we analyze the problem and identify two
root causes: a lack of diversity in the particle set due to overconfident
resampling and consequent inability to assess the potential of a reasoning
path. We introduce Entropic Particle Filtering (ePF), an algorithm that
integrates two new techniques to solve these issues. The first technique,
Entropic Annealing (EA), directly mitigates particle impoverishment by
monitoring search diversity via entropy; when diversity drops, it intervenes by
dynamically annealing the resampling distribution to preserve exploration. The
second, an enhancement called Look-ahead Modulation (LaM), adds a predictive
guide to evaluate a state's potential based on its successors. On several
challenging math benchmarks, ePF significantly outperforms strong baselines and
achieves up to a 50 % relative improvement in task reward. Together, these
methods improve PF's resilience by balancing the exploration of diverse
solution spaces with the exploitation of high-reward regions, ultimately
leading to higher-quality solutions.
☆ Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Aligning text-to-speech (TTS) system outputs with human feedback through
preference optimization has been shown to effectively improve the robustness
and naturalness of language model-based TTS models. Current approaches
primarily require paired desirable and undesirable samples at the utterance
level. However, such pairs are often limited in TTS output data, and
utterance-level formulation prevents fine-grained token-level optimization
needed for accurate pronunciation alignment. In this study, we propose TKTO
that eliminates the need for paired data, enabling a more data-efficient
training paradigm, and directly targets token-level units, automatically
providing fine-grained alignment signals without token-level annotations. TKTO
improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%,
automatically assigning 12.8 times stronger reward to targeted tokens.
☆ Mixture of Neuron Experts
Runxi Cheng, Yuchen Guan, Yucheng Ding, Qingguo Hu, Yongxian Wei, Chun Yuan, Yelong Shen, Weizhu Chen, Yeyun Gong
In this work, we first explore whether the parameters activated by the MoE
layer remain highly sparse at inference. We perform a sparsification study on
several representative MoE models. For each expert, we rank parameters by the
magnitude of their activations from the gate projection and progressively prune
the activated subset. Pruning up to 60% of parameters within that subset causes
only negligible task-performance degradation; substantial drops occur only
after more than 90% are removed. We further decompose experts into
neuron-granular MoE and visualize their activation values, finding that most
neuron activations are near zero. This observation motivates us to select only
high-activation neuron experts during pretraining. Based on this insight, we
propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert
selection by only applying a simple top-k selection within each expert, incurs
negligible latency, and requires no additional routing parameters or
inter-expert communication. Extensive experiments demonstrate that MoNE matches
traditional MoE performance while activating only 50% of the MoE-layer
parameters, and it consistently outperforms traditional MoE when compared at
equal numbers of activated parameters. These results suggest that MoNE is a
practical approach to improving parameter utilization and inference efficiency
in MoE-like models.
comment: 18 page, 11 figures, 7 tables
☆ InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience
Abstractive text summarization is integral to the Big Data era, which demands
advanced methods to turn voluminous and often long text data into concise but
coherent and informative summaries for efficient human consumption. Despite
significant progress, there is still room for improvement in various aspects.
One such aspect is to improve informativeness. Hence, this paper proposes a
novel learning approach consisting of two methods: an optimal transport-based
informative attention method to improve learning focal information in reference
summaries and an accumulative joint entropy reduction method on named entities
to enhance informative salience. Experiment results show that our approach
achieves better ROUGE scores compared to prior work on CNN/Daily Mail while
having competitive results on XSum. Human evaluation of informativeness also
demonstrates the better performance of our approach over a strong baseline.
Further analysis gives insight into the plausible reasons underlying the
evaluation results.
☆ Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes
We derive non-asymptotic spectral bands that bound the squared InfoNCE
gradient norm via alignment, temperature, and batch spectrum, recovering the
\(1/\tau^{2}\) law and closely tracking batch-mean gradients on synthetic data
and ImageNet. Using effective rank \(R_{\mathrm{eff}}\) as an anisotropy proxy,
we design spectrum-aware batch selection, including a fast greedy builder. On
ImageNet-100, Greedy-64 cuts time-to-67.5\% top-1 by 15\% vs.\ random (24\%
vs.\ Pool--P3) at equal accuracy; CIFAR-10 shows similar gains. In-batch
whitening promotes isotropy and reduces 50-step gradient variance by
\(1.37\times\), matching our theoretical upper bound.
☆ Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis
Predicting the virality of online content remains challenging, especially for
culturally complex, fast-evolving memes. This study investigates the
feasibility of early prediction of meme virality using a large-scale,
cross-lingual dataset from 25 diverse Reddit communities. We propose a robust,
data-driven method to define virality based on a hybrid engagement score,
learning a percentile-based threshold from a chronologically held-out training
set to prevent data leakage. We evaluated a suite of models, including Logistic
Regression, XGBoost, and a Multi-layer Perceptron (MLP), with a comprehensive,
multimodal feature set across increasing time windows (30-420 min). Crucially,
useful signals emerge quickly: our best-performing model, XGBoost, achieves a
PR-AUC $>$ 0.52 in just 30 minutes. Our analysis reveals a clear "evidentiary
transition," in which the importance of the feature dynamically shifts from the
static context to the temporal dynamics as a meme gains traction. This work
establishes a robust, interpretable, and practical benchmark for early virality
prediction in scenarios where full diffusion cascade data is unavailable,
contributing a novel cross-lingual dataset and a methodologically sound
definition of virality. To our knowledge, this study is the first to combine
time series data with static content and network features to predict early meme
virality.
comment: Preprint work in progress. Main body: 9 pages. Total: 15 pages
including references and appendix. 16 figures and 12 tables
☆ ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved
state-of-the-art results on various complex reasoning tasks. Recent works have
proposed techniques to automate the design of MASes, eliminating the need for
manual engineering. However, these techniques perform poorly, often achieving
similar or inferior performance to simple baselines. Furthermore, they require
computationally expensive re-discovery of architectures for each new task
domain and expensive data annotation on domains without existing labeled
validation sets. A critical insight is that simple Chain of Thought (CoT)
reasoning often performs competitively with these complex systems, suggesting
that the fundamental reasoning unit of MASes, CoT, warrants further
investigation. To this end, we present a new paradigm for automatic MAS design
that pivots the focus to optimizing CoT reasoning. We introduce the Agentic
Reasoning Module (ARM), an agentic generalization of CoT where each granular
reasoning step is executed by a specialized reasoning module. This module is
discovered through a tree search over the code space, starting from a simple
CoT module and evolved using mutations informed by reflection on execution
traces. The resulting ARM acts as a versatile reasoning building block which
can be utilized as a direct recursive loop or as a subroutine in a learned
meta-orchestrator. Our approach significantly outperforms both manually
designed MASes and state-of-the-art automatic MAS design methods. Crucially,
MASes built with ARM exhibit superb generalization, maintaining high
performance across different foundation models and task domains without further
optimization.
comment: 29 pages, 2 figures
☆ Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities
This ongoing work focuses on the development of a methodology for generating
a multi-source mapping of astronomical observation facilities. To compare two
entities, we compute scores with adaptable criteria and Natural Language
Processing (NLP) techniques (Bag-of-Words approaches, sequential approaches,
and surface approaches) to map entities extracted from eight semantic
artifacts, including Wikidata and astronomy-oriented resources. We utilize
every property available, such as labels, definitions, descriptions, external
identifiers, and more domain-specific properties, such as the observation
wavebands, spacecraft launch dates, funding agencies, etc. Finally, we use a
Large Language Model (LLM) to accept or reject a mapping suggestion and provide
a justification, ensuring the plausibility and FAIRness of the validated
synonym pairs. The resulting mapping is composed of multi-source synonym sets
providing only one standardized label per entity. Those mappings will be used
to feed our Name Resolver API and will be integrated into the International
Virtual Observatory Alliance (IVOA) Vocabularies and the OntoPortal-Astro
platform.
comment: Accepted in Ontology Matching 2025 conference proceedings
☆ Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
Masked diffusion models (MDMs) have recently emerged as a novel framework for
language modeling. MDMs generate sentences by iteratively denoising masked
sequences, filling in [MASK] tokens step by step. Although MDMs support
any-order sampling, performance is highly sensitive to the choice of which
position to unmask next. Prior work typically relies on rule-based schedules
(e.g., max-confidence, max-margin), which provide ad hoc improvements. In
contrast, we replace these heuristics with a learned scheduler. Specifically,
we cast denoising as a KL-regularized Markov decision process (MDP) with an
explicit reference policy and optimize a regularized objective that admits
policy improvement and convergence guarantees under standard assumptions. We
prove that the optimized policy under this framework generates samples that
more closely match the data distribution than heuristic schedules. Empirically,
across four benchmarks, our learned policy consistently outperforms
max-confidence: for example, on SUDOKU, where unmasking order is critical, it
yields a 20.1% gain over random and a 11.2% gain over max-confidence.
comment: Preprint
☆ Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
Before adopting a new large language model (LLM) architecture, it is critical
to understand vulnerabilities accurately. Existing evaluations can be difficult
to trust, often drawing conclusions from LLMs that are not meaningfully
comparable, relying on heuristic inputs or employing metrics that fail to
capture the inherent uncertainty. In this paper, we propose a principled and
practical end-to-end framework for evaluating LLM vulnerabilities to prompt
injection attacks. First, we propose practical approaches to experimental
design, tackling unfair LLM comparisons by considering two practitioner
scenarios: when training an LLM and when deploying a pre-trained LLM. Second,
we address the analysis of experiments and propose a Bayesian hierarchical
model with embedding-space clustering. This model is designed to improve
uncertainty quantification in the common scenario that LLM outputs are not
deterministic, test prompts are designed imperfectly, and practitioners only
have a limited amount of compute to evaluate vulnerabilities. We show the
improved inferential capabilities of the model in several prompt injection
attack settings. Finally, we demonstrate the pipeline to evaluate the security
of Transformer versus Mamba architectures. Our findings show that consideration
of output variability can suggest less definitive findings. However, for some
attacks, we find notably increased Transformer and Mamba-variant
vulnerabilities across LLMs with the same training data or mathematical
ability.
☆ DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision
Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing
capability for complex tasks through dynamic retrieval and adaptive workflows.
Recent advances (e.g., Search-R1) have shown that outcome-supervised
reinforcement learning demonstrate strong performance. However, this approach
still suffers from inefficient exploration, sparse reward signals, and
ambiguous global reward feedback. To address these challenges, we propose
DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating
decision-making and execution, while introducing an efficient pruning strategy
to optimize data expansion. Through comprehensive process-level policy
optimization, DecEx-RAG significantly enhances the autonomous task
decomposition, dynamic retrieval, and high-quality answer generation
capabilities of large language models (LLMs). Experiments show that DecEx-RAG
achieves an average absolute performance improvement of $6.2\%$ across six
datasets, significantly outperforming existing baselines. Moreover, the pruning
strategy improves data construction efficiency by nearly $6 \times$, providing
an efficient solution for process-supervised RAG training. The code is
available at https://github.com/sdsxdxl/DecEx-RAG.
☆ Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models
While large language models (LLMs) exhibit strong multilingual abilities,
their reliance on English as latent representations creates a translation
barrier, where reasoning implicitly depends on internal translation into
English. When this process fails, performance in non-English languages
deteriorates sharply, limiting the inclusiveness of LLM-based applications.
Existing cross-lingual in-context learning (X-ICL) methods primarily leverage
monolingual demonstrations, often failing to mitigate this barrier and instead
reinforcing it. In this work, we introduce code-switching in-context learning
(CSICL), a simple yet effective prompting strategy that progressively
transitions from a target language to English within demonstrations and
instruction to facilitate their latent reasoning in English. By explicitly
scaffolding the reasoning process through controlled code-switching, CSICL acts
as an implicit linguistic bridge that enhances cross-lingual alignment and
reduces reliance on the translation barrier. We conduct extensive experiments
across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive
and reasoning-oriented domains. Our results demonstrate that CSICL consistently
outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target
and unseen languages, respectively. The improvement is even more pronounced in
low-resource settings, with gains of 14.7% in target and 5.3% in unseen
languages. These findings establish code-switching as a principled and robust
approach for overcoming the translation barrier during inference, moving LLMs
toward more equitable and effective multilingual systems.
☆ The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel
Despite representing nearly one-third of the world's languages, African
languages remain critically underserved by modern NLP technologies, with 88\%
classified as severely underrepresented or completely ignored in computational
linguistics. We present the African Languages Lab (All Lab), a comprehensive
research initiative that addresses this technological gap through systematic
data collection, model development, and capacity building. Our contributions
include: (1) a quality-controlled data collection pipeline, yielding the
largest validated African multi-modal speech and text dataset spanning 40
languages with 19 billion tokens of monolingual text and 12,628 hours of
aligned speech data; (2) extensive experimental validation demonstrating that
our dataset, combined with fine-tuning, achieves substantial improvements over
baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points
across 31 evaluated languages; and (3) a structured research program that has
successfully mentored fifteen early-career researchers, establishing
sustainable local capacity. Our comparative evaluation against Google Translate
reveals competitive performance in several languages while identifying areas
that require continued development.
☆ Generative AI-Driven Hierarchical Multi-Agent Framework for Zero-Touch Optical Networks
The rapid development of Generative Artificial Intelligence (GenAI) has
catalyzed a transformative technological revolution across all walks of life.
As the backbone of wideband communication, optical networks are expecting
high-level autonomous operation and zero-touch management to accommodate their
expanding network scales and escalating transmission bandwidth. The integration
of GenAI is deemed as the pivotal solution for realizing zero-touch optical
networks. However, the lifecycle management of optical networks involves a
multitude of tasks and necessitates seamless collaboration across multiple
layers, which poses significant challenges to the existing single-agent GenAI
systems. In this paper, we propose a GenAI-driven hierarchical multi-agent
framework designed to streamline multi-task autonomous execution for zero-touch
optical networks. We present the architecture, implementation, and applications
of this framework. A field-deployed mesh network is utilized to demonstrate
three typical scenarios throughout the lifecycle of optical network: quality of
transmission estimation in the planning stage, dynamic channel adding/dropping
in the operation stage, and system capacity increase in the upgrade stage. The
case studies, illustrate the capabilities of multi-agent framework in
multi-task allocation, coordination, execution, evaluation, and summarization.
This work provides a promising approach for the future development of
intelligent, efficient, and collaborative network management solutions, paving
the way for more specialized and adaptive zero-touch optical networks.
comment: 7 pages,6 figures, Accepted by lEEE Communications Magazine, Open
call
☆ MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction
Implicit Attribute Value Extraction (AVE) is essential for accurately
representing products in e-commerce, as it infers lantent attributes from
multimodal data. Despite advances in multimodal large language models (MLLMs),
implicit AVE remains challenging due to the complexity of multidimensional data
and gaps in vision-text understanding. In this work, we introduce
\textsc{\modelname}, a multi-agent debate framework that employs multiple MLLM
agents to iteratively refine inferences. Through a series of debate rounds,
agents verify and update each other's responses, thereby improving inference
performance and robustness. Experiments on the ImplicitAVE dataset demonstrate
that even a few rounds of debate significantly boost accuracy, especially for
attributes with initially low performance. We systematically evaluate various
debate configurations, including identical or different MLLM agents, and
analyze how debate rounds affect convergence dynamics. Our findings highlight
the potential of multi-agent debate strategies to address the limitations of
single-agent approaches and offer a scalable solution for implicit AVE in
multimodal e-commerce.
☆ A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks
Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Agents based on large language models (LLMs) struggle with brainless
trial-and-error and generating hallucinatory actions due to a lack of global
planning in long-horizon tasks. In this paper, we introduce a plan-and-execute
framework and propose EAGLET, an efficient and effective planner training
method to enhance the executor agent's planning abilities without human effort.
Specifically, we train a plug-and-play global planner through a two-step
process: we first synthesize high-quality plans from an advanced LLM using our
proposed homologous consensus filtering strategy, and apply fine-tuning as a
cold start. Moreover, we further improve the planner with a rule-based
reinforcement learning stage using a novel executor capability gain reward,
ensuring it can handle task instructions of varying difficulty. Experiments on
three long-horizon agent tasks show that executor agents equipped with our
planner outperform existing methods, achieving new state-of-the-art
performance. Meanwhile, EAGLET reduces training costs by 8x compared to
RL-based baselines, and it does not require manual effort or extra training
data, offering an efficient and effective solution.
☆ Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang
Autoregressive multimodal large language models have recently gained
popularity for image generation, driven by advances in foundation models. To
enhance alignment and detail, newer approaches employ chain-of-thought (CoT)
reasoning, expanding user inputs into elaborated prompts prior to image
synthesis. However, this strategy can introduce unnecessary redundancy -- a
phenomenon we call visual overthinking -- which increases computational costs
and can introduce details that contradict the original prompt. In this work, we
explore how to generate more concise CoT sequences for more efficient image
generation. We introduce ShortCoTI, a lightweight optimization framework that
encourages more concise CoT while preserving output image quality. ShortCoTI
rewards more concise prompts with an adaptive function that scales according to
an estimated difficulty for each task. Incorporating this reward into a
reinforcement learning paradigm reduces prompt reasoning length by 54% while
maintaining or slightly improving quality metrics across multiple benchmarks
(T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates
verbose explanations and repetitive refinements, producing reasoning prompts
that are both concise and semantically rich. As a result, ShortCoTI improves
computational efficiency without compromising the fidelity or visual appeal of
generated images.
☆ In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
Outcome-driven reinforcement learning has advanced reasoning in large
language models (LLMs), but prevailing tool-augmented approaches train a
single, monolithic policy that interleaves thoughts and tool calls under full
context; this scales poorly with long horizons and diverse tools and
generalizes weakly to new scenarios. Agentic systems offer a promising
alternative by decomposing work across specialized modules, yet most remain
training-free or rely on offline training decoupled from the live dynamics of
multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow
agentic framework that coordinates four modules (planner, executor, verifier,
generator) through an evolving memory and directly optimizes its planner inside
the multi-turn loop. To train on-policy in live environments, we propose
Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles
long-horizon, sparse-reward credit assignment by converting multi-turn
optimization into a sequence of tractable single-turn policy updates. It
broadcasts a single, verifiable trajectory-level outcome to every turn to align
local planner decisions with global success and stabilizes learning with
group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale
backbone outperforms top-performing baselines with average accuracy gains of
14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on
scientific tasks, even surpassing larger proprietary models like GPT-4o.
Further analyses confirm the benefits of in-the-flow optimization, showing
improved planning, enhanced tool-calling reliability, and positive scaling with
model size and reasoning turns.
comment: 45 pages, 12 figures. Project website:
https://agentflow.stanford.edu/
☆ Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs
Recent advancements in language agents have led to significant improvements
in multi-hop reasoning tasks. However, existing approaches often struggle with
handling open-domain problems, which require massive information retrieval due
to their reliance on a fixed sequence of actions. To address this, we propose
Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework
tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive
strategies for information exploration in open-domain multi-hop reasoning
tasks. Our approach begins by identifying key entities relevant to the problem,
which serve as the initial nodes in the reasoning process. From these initial
nodes, we then generate reasoning child nodes with the process being refined
through a combination of historical error analysis and real-time feedback,
which allows the framework to dynamically adjust and optimize its reasoning
strategies. By integrating depth-first search with an innovative node
generation technique, our framework adapts based on both prior error paths and
concurrently generated nodes at the same hierarchical level. This dynamic
strategy effectively expands the search space while ensuring the reasoning
process systematically converges toward accurate solutions. Experimental
results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset
and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and
7.25% respectively, highlighting its versatility and potential to enhance
language agents in multi-hop reasoning tasks.
☆ Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
The promotion of academic papers has become an important means of enhancing
research visibility. However, existing automated methods struggle limited
storytelling, insufficient aesthetic quality, and constrained self-adjustment,
making it difficult to achieve efficient and engaging dissemination. At the
heart of those challenges is a simple principle: \emph{there is no way to
improve it when you cannot evaluate it right}. To address this, we introduce
\textbf{EvoPresent}, a self-improvement agent framework that unifies coherent
narratives, aesthetic-aware designs, and realistic presentation delivery via
virtual characters. Central to EvoPresent is \textbf{PresAesth}, a multi-task
reinforcement learning (RL) aesthetic model that provides reliable aesthetic
scoring, defect adjustment, and comparative feedback, enabling iterative
self-improvement even under limited aesthetic training data. To systematically
evaluate the methods, we introduce \textbf{EvoPresent Benchmark}, a
comprehensive benchmark comprising: \textit{Presentation Generation Quality},
built on 650 top-tier AI conference papers with multimodal resources (slides,
videos and scripts) to assess both content and design; and \textit{Aesthetic
Awareness}, consisting of 2,000 slide pairs with varying aesthetic levels,
supporting joint training and evaluation on scoring, defect adjustment, and
comparison. Our findings highlight that (i) High-quality feedback is essential
for agent self-improvement, while initial capability alone does not guarantee
effective self-correction. (ii) Automated generation pipelines exhibit a
trade-off between visual design and content construction. (iii) Multi-task RL
training shows stronger generalization in aesthetic awareness tasks.
☆ Domain-Shift-Aware Conformal Prediction for Large Language Models
Large language models have achieved impressive performance across diverse
tasks. However, their tendency to produce overconfident and factually incorrect
outputs, known as hallucinations, poses risks in real world applications.
Conformal prediction provides finite-sample, distribution-free coverage
guarantees, but standard conformal prediction breaks down under domain shift,
often leading to under-coverage and unreliable prediction sets. We propose a
new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our
framework adapts conformal prediction to large language models under domain
shift, by systematically reweighting calibration samples based on their
proximity to the test prompt, thereby preserving validity while enhancing
adaptivity. Our theoretical analysis and experiments on the MMLU benchmark
demonstrate that the proposed method delivers more reliable coverage than
standard conformal prediction, especially under substantial distribution
shifts, while maintaining efficiency. This provides a practical step toward
trustworthy uncertainty quantification for large language models in real-world
deployment.
comment: 26 pages
☆ Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
Large language models (LLM) and vision-language models (VLM) have achieved
state-of-the-art performance, but they impose significant memory and computing
challenges in deployment. We present a novel low-rank compression framework to
address this challenge. First, we upper bound the change of network loss via
layer-wise activation-based compression errors, filling a theoretical gap in
the literature. We then formulate low-rank model compression as a bi-objective
optimization and prove that a single uniform tolerance yields surrogate
Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we
propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot
pipeline that improves activation-aware compression via Pareto-guided rank
selection and alternating least-squares implementation. We apply PGSVD to both
LLM and VLM, showing better accuracy at the same compression levels and
inference speedup.
☆ Sci-Phi: A Large Language Model Spatial Audio Descriptor
Acoustic scene perception involves describing the type of sounds, their
timing, their direction and distance, as well as their loudness and
reverberation. While audio language models excel in sound recognition,
single-channel input fundamentally limits spatial understanding. This work
presents Sci-Phi, a spatial audio large language model with dual spatial and
spectral encoders that estimates a complete parameter set for all sound sources
and the surrounding environment. Learning from over 4,000 hours of synthetic
first-order Ambisonics recordings including metadata, Sci-Phi enumerates and
describes up to four directional sound sources in one pass, alongside
non-directional background sounds and room characteristics. We evaluate the
model with a permutation-invariant protocol and 15 metrics covering content,
location, timing, loudness, and reverberation, and analyze its robustness
across source counts, signal-to-noise ratios, reverberation levels, and
challenging mixtures of acoustically, spatially, or temporally similar sources.
Notably, Sci-Phi generalizes to real room impulse responses with only minor
performance degradation. Overall, this work establishes the first audio LLM
capable of full spatial-scene description, with strong potential for real-world
deployment. Demo: https://sci-phi-audio.github.io/demo
☆ On the Role of Difficult Prompts in Self-Play Preference Optimization
Self-play preference optimization has emerged as a prominent paradigm for
aligning large language models (LLMs). It typically involves a language model
to generate on-policy responses for prompts and a reward model (RM) to guide
the selection of chosen and rejected responses, which can be further trained
with direct preference optimization (DPO). However, the role of prompts remains
underexplored, despite being a core component in this pipeline. In this work,
we investigate how prompts of varying difficulty influence self-play preference
optimization. We first use the mean reward of $N$ sampled responses of a prompt
as a proxy for its difficulty. We find that difficult prompts exhibit
substantially inferior self-play optimization performance in comparison to easy
prompts for language models. Moreover, incorporating difficult prompts into
training fails to enhance overall performance and, in fact, leads to slight
degradation compared to training on easy prompts alone. We also observe that
the performance gap between difficult and easy prompts closes as the model
capacity increases, suggesting that difficulty interacts with the model
capacity. Building on these findings, we explore strategies to mitigate the
negative effect of difficult prompts on final performance. We demonstrate that
selectively removing an appropriate portion of challenging prompts enhances
overall self-play performance, while also reporting failed attempts and lessons
learned.
☆ H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
Autoregressive decoding in large language models (LLMs) requires caching a
growing list of past key-value (KV) pairs, making long-context inference a
memory-bound problem. While recent methods have explored quantizing the cache,
evicting tokens, or using binary sketches for keys (e.g., Loki), these
approaches often provide an incomplete solution by leaving one component (like
values) uncompressed or by discarding context information. This paper
introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression
scheme that radically reduces memory usage without sacrificing context. H1B-KV
represents each key vector using a 1-bit binary sketch, enabling
hardware-friendly bitwise attention, and further compresses value vectors using
4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter
LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x
reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches
full-precision performance not only on perplexity benchmarks but also on
complex downstream tasks like mathematical reasoning (GSM8K), multi-task
understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV
significantly outperforms leading quantization (KIVI), token eviction
(SparseLLM), and key-only sketching (Loki) methods in quality-per-byte,
establishing it as a robust solution for deploying LLMs in memory-constrained
environments.
comment: MIT URTC 2025 Technical Paper (Oral), 5 pages, 1 figure
☆ KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance
We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge
extraction and reasoning framework with large language models (LLMs) in
safety-critical contexts. Using the Operations and Maintenance Intelligence
(OMIn) dataset, we construct a QA benchmark spanning global sensemaking and
actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and
integrates it into a retrieval-augmented generation (RAG) pipeline, enabling
more coherent, dataset-wide reasoning than traditional text-chunk RAG. We
evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ
stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO
markedly improves global sensemaking by revealing patterns and system-level
insights, while text-chunk RAG remains effective for fine-grained procedural
tasks requiring localized retrieval. These findings underscore the promise of
KG-augmented LLMs for secure, domain-specific QA and their potential in
high-stakes reasoning.
☆ CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension NeurIPS 2025
Current Large Language Models (LLMs) are confronted with overwhelming
information volume when comprehending long-form documents. This challenge
raises the imperative of a cohesive memory module, which can elevate vanilla
LLMs into autonomous reading agents. Despite the emergence of some heuristic
approaches, a systematic design principle remains absent. To fill this void, we
draw inspiration from Jean Piaget's Constructivist Theory, illuminating three
traits of the agentic memory -- structured schemata, flexible assimilation, and
dynamic accommodation. This blueprint forges a clear path toward a more robust
and efficient memory system for LLM-based reading comprehension. To this end,
we develop CAM, a prototype implementation of Constructivist Agentic Memory
that simultaneously embodies the structurality, flexibility, and dynamicity. At
its core, CAM is endowed with an incremental overlapping clustering algorithm
for structured memory development, supporting both coherent hierarchical
summarization and online batch integration. During inference, CAM adaptively
explores the memory structure to activate query-relevant information for
contextual response, akin to the human associative process. Compared to
existing approaches, our design demonstrates dual advantages in both
performance and efficiency across diverse long-text reading comprehension
tasks, including question answering, query-based summarization, and claim
verification.
comment: Accepted by NeurIPS 2025
☆ Prototype-Based Dynamic Steering for Large Language Models
Despite impressive breadth, LLMs still rely on explicit reasoning
instructions or static, one-fits-all steering methods, leaving a gap for
adaptive, instruction-free reasoning amplification. We present Prototype-Based
Dynamic Steering (PDS), a test-time method that amplifies large language model
(LLM) reasoning without adding or altering instructions. We introduce
"reasoning prototypes" by clustering activation differences between
Chain-of-Thought (CoT) and neutral prompts. At inference, an input's hidden
state is projected onto these prototypes to form an instance-specific steering
vector. Evaluated on GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently
improves accuracy without fine-tuning or prompt engineering. Notably, the gains
persist even when CoT is explicitly suppressed to improve cost-efficiency,
indicating that the intervention strengthens latent reasoning processes rather
than inducing a superficial behavioral shift. These results position dynamic,
prototype-guided steering as a lightweight alternative to training-time
approaches for enhancing LLM reasoning.
☆ NorMuon: Making Muon more efficient and scalable
The choice of optimizer significantly impacts the training efficiency and
computational costs of large language models (LLMs). Recently, the Muon
optimizer has demonstrated promising results by orthogonalizing parameter
updates, improving optimization geometry through better conditioning. Despite
Muon's emergence as a candidate successor to Adam, the potential for jointly
leveraging their strengths has not been systematically explored. In this work,
we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an
optimizer that synergistically combines orthogonalization with neuron-level
adaptive learning rates. Our analysis reveals that while Muon effectively
reduces condition numbers, the resulting updates exhibit highly non-uniform
neuron norms, causing certain neurons to dominate the optimization process.
NorMuon addresses this imbalance by maintaining second-order momentum
statistics for each neuron and applying row-wise normalization after
orthogonalization, ensuring balanced parameter utilization while preserving
Muon's conditioning benefits. To enable practical deployment at scale, we
develop an efficient distributed implementation under the FSDP2 framework that
strategically distributes orthogonalization computations across devices.
Experiments across multiple model scales demonstrate that NorMuon consistently
outperforms both Adam and Muon, achieving 21.74% better training efficiency
than Adam and 11.31% improvement over Muon on 1.1 B pretraining setting, while
maintaining a comparable memory footprint to Muon. Our findings suggest that
orthogonalization and adaptive learning rates are complementary rather than
competing approaches, opening new avenues for optimizer design in large-scale
deep learning.
☆ LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation
Zhoutong Fu, Yihan Cao, Yi-Lin Chen, Aman Lunia, Liming Dong, Neha Saraf, Ruijie Jiang, Yun Dai, Qingquan Song, Tan Wang, Guoyao Li, Derek Koh, Haichao Wei, Zhipeng Wang, Aman Gupta, Chengming Jiang, Jianqiang Shen, Liangjie Hong, Wenjing Zhang
Large language models (LLMs) have achieved strong performance across a wide
range of natural language processing tasks. However, deploying LLMs at scale
for domain specific applications, such as job-person fit and explanation in job
seeking platforms, introduces distinct challenges. At LinkedIn, the job person
fit task requires analyzing a candidate's public profile against job
requirements to produce both a fit assessment and a detailed explanation.
Directly applying open source or finetuned LLMs to this task often fails to
yield high quality, actionable feedback due to the complexity of the domain and
the need for structured outputs. Moreover, the large size of these models leads
to high inference latency and limits scalability, making them unsuitable for
online use. To address these challenges, we introduce LANTERN, a novel LLM
knowledge distillation framework tailored specifically for job person fit
tasks. LANTERN involves modeling over multiple objectives, an encoder model for
classification purpose, and a decoder model for explanation purpose. To better
distill the knowledge from a strong black box teacher model to multiple
downstream models, LANTERN incorporates multi level knowledge distillation that
integrates both data and logit level insights. In addition to introducing the
knowledge distillation framework, we share our insights on post training
techniques and prompt engineering, both of which are crucial for successfully
adapting LLMs to domain specific downstream tasks. Extensive experimental
results demonstrate that LANTERN significantly improves task specific metrics
for both job person fit and explanation. Online evaluations further confirm its
effectiveness, showing measurable gains in job seeker engagement, including a
0.24\% increase in apply rate and a 0.28\% increase in qualified applications.
comment: 9 pages, 4 figures, 5 tables
☆ Language Model as Planner and Formalizer under Constraints
LLMs have been widely used in planning, either as planners to generate action
sequences end-to-end, or as formalizers to represent the planning domain and
problem in a formal language that can derive plans deterministically. However,
both lines of work rely on standard benchmarks that only include generic and
simplistic environmental specifications, leading to potential overestimation of
the planning ability of LLMs and safety concerns in downstream tasks. We bridge
this gap by augmenting widely used planning benchmarks with manually annotated,
fine-grained, and rich natural language constraints spanning four formally
defined categories. Over 4 state-of-the-art reasoning LLMs, 3 formal languages,
5 methods, and 4 datasets, we show that the introduction of constraints not
only consistently halves performance, but also significantly challenges
robustness to problem complexity and lexical shift.
☆ TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
Modern natural language processing models have achieved unprecedented scale,
yet the tools for their evaluation often remain a computational bottleneck,
limiting the pace of research. This is particularly acute for in-training
evaluation metrics, such as per-sentence reward signals in Reinforcement
Learning, which must operate efficiently on batches of token IDs directly on
the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the
BLEU metric designed from the ground up for this specific use case. Our
approach is fully vectorized for GPU-accelerated, per-sentence computation
within PyTorch and introduces a memory-efficient counting mechanism. By
creating a compact, batch-specific dictionary of n-grams using
\texttt{torch.unique}, our method avoids the prohibitive memory costs of
traditional hashing-based vectorization, making it practical for
large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard
library for token-ID-based BLEU calculation on the CPU. Experiments show that
TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and
exceeding 40x on data-center-class hardware (NVIDIA A100). This performance
transforms a significant bottleneck into a negligible part of the training
loop. By clearly defining its role as a "Token-ID BLEU" for development
purposes and open-sourcing our implementation, we provide a powerful tool for
accelerating research in areas like RL-based model fine-tuning.
comment: 9 pages, 3 figures
♻ ☆ LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin
Large Language Models (LLMs) demonstrate their reasoning ability through
chain-of-thought (CoT) generation. However, LLM's autoregressive decoding may
limit the ability to revisit and refine earlier tokens in a holistic manner,
which can also lead to inefficient exploration for diverse solutions. In this
paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning
framework that unifies the expressiveness of continuous latent representation
with the iterative refinement capabilities of latent diffusion models for an
existing LLM. We first construct a structured latent reasoning space using a
Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of
thought tokens, preserving semantic information and interpretability while
offering compact but expressive representations. Subsequently, we utilize a
latent diffusion model that learns to denoise a block of latent thought tokens
with a blockwise bidirectional attention mask, enabling longer horizon and
iterative refinement with adaptive test-time compute. This design allows
efficient parallel generation of diverse reasoning trajectories, allowing the
model to plan and revise the reasoning process holistically. We conduct
evaluations on a suite of mathematical reasoning and planning benchmarks.
Empirical results show that LaDiR consistently improves accuracy, diversity,
and interpretability over existing autoregressive, diffusion-based, and latent
reasoning methods, revealing a new paradigm for text reasoning with latent
diffusion.
♻ ☆ Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots,
and consultants, capable of supporting a wide range of tasks through natural
conversation. However, most systems remain constrained by a linear
request-response format that often makes interactions inefficient in
multi-turn, information-dense, and exploratory tasks. To address these
limitations, we propose Generative Interfaces for Language Models, a paradigm
in which LLMs respond to user queries by proactively generating user interfaces
(UIs) that enable more adaptive and interactive engagement. Our framework
leverages structured interface-specific representations and iterative
refinements to translate user queries into task-specific UIs. For systematic
evaluation, we introduce a multidimensional assessment framework that compares
generative interfaces with traditional chat-based ones across diverse tasks,
interaction patterns, and query types, capturing functional, interactive, and
emotional aspects of user experience. Results show that generative interfaces
consistently outperform conversational ones, with up to a 72% improvement in
human preference. These findings clarify when and why users favor generative
interfaces, paving the way for future advancements in human-AI interaction.
comment: Preprint
♻ ☆ Tracing Multilingual Factual Knowledge Acquisition in Pretraining EMNLP
Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
Large Language Models (LLMs) are capable of recalling multilingual factual
knowledge present in their pretraining data. However, most studies evaluate
only the final model, leaving the development of factual recall and
crosslingual consistency throughout pretraining largely unexplored. In this
work, we trace how factual recall and crosslingual consistency evolve during
pretraining, focusing on OLMo-7B as a case study. We find that both accuracy
and consistency improve over time for most languages. We show that this
improvement is primarily driven by the fact frequency in the pretraining
corpus: more frequent facts are more likely to be recalled correctly,
regardless of language. Yet, some low-frequency facts in non-English languages
can still be correctly recalled. Our analysis reveals that these instances
largely benefit from crosslingual transfer of their English counterparts -- an
effect that emerges predominantly in the early stages of pretraining. We
pinpoint two distinct pathways through which multilingual factual knowledge
acquisition occurs: (1) frequency-driven learning, which is dominant and
language-agnostic, and (2) crosslingual transfer, which is limited in scale and
typically constrained to relation types involving named entities. We release
our code and data to facilitate further research at
https://github.com/cisnlp/multilingual-fact-tracing.
comment: EMNLP Findings 2025
♻ ★ LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Large Language Model (LLM) pretraining, finetuning, and evaluation rely on
input-space reconstruction and generative capabilities. Yet, it has been
observed in vision that embedding-space training objectives, e.g., with Joint
Embedding Predictive Architectures (JEPAs), are far superior to their
input-space counterpart. That mismatch in how training is achieved between
language and vision opens up a natural question: {\em can language training
methods learn a few tricks from the vision ones?} The lack of JEPA-style LLM is
a testimony of the challenge in designing such objectives for language. In this
work, we propose a first step in that direction where we develop LLM-JEPA, a
JEPA based solution for LLMs applicable both to finetuning and pretraining.
Thus far, LLM-JEPA is able to outperform the standard LLM training objectives
by a significant margin across models, all while being robust to overfiting.
Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider,
RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo
families. Code: https://github.com/rbalestr-lab/llm-jepa.
♻ ☆ Exploring the Potential of Conversational AI Support for Agent-Based Social Simulation Model Design
ChatGPT, the AI-powered chatbot with a massive user base of hundreds of
millions, has become a global phenomenon. However, the use of Conversational AI
Systems (CAISs) like ChatGPT for research in the field of Social Simulation is
still limited. Specifically, there is no evidence of its usage in Agent-Based
Social Simulation (ABSS) model design. This paper takes a crucial first step
toward exploring the untapped potential of this emerging technology in the
context of ABSS model design. The research presented here demonstrates how
CAISs can facilitate the development of innovative conceptual ABSS models in a
concise timeframe and with minimal required upfront case-based knowledge. By
employing advanced prompt engineering techniques and adhering to the
Engineering ABSS framework, we have constructed a comprehensive prompt script
that enables the design of conceptual ABSS models with or by the CAIS. A
proof-of-concept application of the prompt script, used to generate the
conceptual ABSS model for a case study on the impact of adaptive architecture
in a museum environment, illustrates the practicality of the approach. Despite
occasional inaccuracies and conversational divergence, the CAIS proved to be a
valuable companion for ABSS modellers.
comment: This paper has been published in the Journal of Artificial Societies
and Social Simulation 28 (3) 2. Please refer to the published version at
[https://doi.org/10.18564/jasss.5681]
♻ ☆ OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature EMNLP 2025
Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer
Large language models (LLMs) are known to memorize and recall English text
from their pretraining data. However, the extent to which this ability
generalizes to non-English languages or transfers across languages remains
unclear. This paper investigates multilingual and cross-lingual memorization in
LLMs, probing if memorized content in one language (e.g., English) can be
recalled when presented in translation. To do so, we introduce OWL, a dataset
of 31.5K aligned excerpts from 20 books in ten languages, including English
originals, official translations (Vietnamese, Spanish, Turkish), and new
translations in six low-resource languages (Sesotho, Yoruba, Maithili,
Malagasy, Setswana, Tahitian). We evaluate memorization across model families
and sizes through three tasks: (1) direct probing, which asks the model to
identify a book's title and author; (2) name cloze, which requires predicting
masked character names; and (3) prefix probing, which involves generating
continuations. We find that LLMs consistently recall content across languages,
even for texts without direct translation in pretraining data. GPT-4o, for
example, identifies authors and titles 69% of the time and masked entities 6%
of the time in newly translated excerpts. Perturbations (e.g., masking
characters, shuffling words) modestly reduce direct probing accuracy (7% drop
for shuffled official translations). Our results highlight the extent of
cross-lingual memorization and provide insights on the differences between the
models.
comment: Accepted to EMNLP 2025 Main
♻ ☆ How Reliable are Causal Probing Interventions?
Causal probing aims to analyze foundation models by examining how intervening
on their representation of various latent properties impacts their outputs.
Recent works have cast doubt on the theoretical basis of several leading causal
probing methods, but it has been unclear how to systematically evaluate the
effectiveness of these methods in practice. To address this, we define two key
causal probing desiderata: completeness (how thoroughly the representation of
the target property has been transformed) and selectivity (how little
non-targeted properties have been impacted). We find that there is an inherent
tradeoff between the two, which we define as reliability, their harmonic mean.
We introduce an empirical analysis framework to measure and evaluate these
quantities, allowing us to make the first direct comparisons between different
families of leading causal probing methods (e.g., linear vs. nonlinear, or
concept removal vs. counterfactual interventions). We find that: (1) all
methods show a clear tradeoff between completeness and selectivity; (2) more
complete and reliable methods have a greater impact on LLM behavior; and (3)
nonlinear interventions are almost always more reliable than linear
interventions.
♻ ☆ Trajectory Prediction Meets Large Language Models: A Survey
Recent advances in large language models (LLMs) have sparked growing interest
in integrating language-driven techniques into trajectory prediction. By
leveraging their semantic and reasoning capabilities, LLMs are reshaping how
autonomous systems perceive, model, and predict trajectories. This survey
provides a comprehensive overview of this emerging field, categorizing recent
work into five directions: (1) Trajectory prediction via language modeling
paradigms, (2) Direct trajectory prediction with pretrained language models,
(3) Language-guided scene understanding for trajectory prediction, (4)
Language-driven data generation for trajectory prediction, (5) Language-based
reasoning and interpretability for trajectory prediction. For each, we analyze
representative methods, highlight core design choices, and identify open
challenges. This survey bridges natural language processing and trajectory
prediction, offering a unified perspective on how language can enrich
trajectory prediction.
comment: 16 pages, GitHub:
https://github.com/colorfulfuture/Awesome-Trajectory-Motion-Prediction-Papers
♻ ☆ Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Reward hacking, where a reasoning model exploits loopholes in a reward
function to achieve high rewards without solving the intended task, poses a
significant threat. This behavior may be explicit, i.e. verbalized in the
model's chain-of-thought (CoT), or implicit, where the CoT appears benign thus
bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE
(Truncated Reasoning AUC Evaluation). Our key observation is that hacking
occurs when exploiting the loophole is easier than solving the actual task.
This means that the model is using less 'effort' than required to achieve high
reward. TRACE quantifies effort by measuring how early a model's reasoning
becomes sufficient to obtain the reward. We progressively truncate a model's
CoT at various lengths, force the model to answer, and estimate the expected
reward at each cutoff. A hacking model, which takes a shortcut, will achieve a
high expected reward with only a small fraction of its CoT, yielding a large
area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our
strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B
monitor in coding. We further show that TRACE can discover unknown loopholes
during training. Overall, TRACE offers a scalable unsupervised approach for
oversight where current monitoring methods prove ineffective.
comment: 25 pages, 31 figures
♻ ☆ Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Moral alignment has emerged as a widely adopted approach for regulating the
behavior of pretrained language models (PLMs), typically through fine-tuning on
curated datasets. Gender stereotype mitigation is a representational task
within the broader application of moral alignment. However, this process often
comes at the cost of degraded downstream task performance. Prior studies
commonly aim to achieve a performance trade-off by encouraging PLMs to
selectively forget only stereotypical knowledge through carefully designed
fairness objective, while preserving their language modeling capability
(overall forgetting). In this short paper, we investigate whether the
performance trade-off can be achieved through the lens of forgetting and the
fairness objective. Our analysis shows that the large datasets needed for
satisfactory fairness highlight the limitations of current fairness objectives
in achieving an effective trade-off: (1) downstream task performance is
strongly correlated with overall forgetting; (2) selective forgetting reduces
stereotypes, but overall forgetting increases. and (3) general solutions for
alleviating forgetting are ineffective at reducing the overall forgetting and
fail to improve downstream task performance.
♻ ☆ Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Reasoning language models improve performance on complex tasks by generating
long chains of thought (CoTs), but this process can also increase harmful
outputs in adversarial settings. In this work, we ask whether the long CoTs can
be leveraged for predictive safety monitoring: do the reasoning traces provide
early signals of final response alignment that could enable timely
intervention? We evaluate a range of monitoring methods using either CoT text
or activations, including highly capable large language models, fine-tuned
classifiers, and humans. First, we find that a simple linear probe trained on
CoT activations significantly outperforms all text-based baselines in
predicting whether a final response is safe or unsafe, with an average absolute
increase of 13 in F1 scores over the best-performing alternatives. CoT texts
are often unfaithful and misleading, while model latents provide a more
reliable predictive signal. Second, the probe can be applied to early CoT
segments before the response is generated, showing that alignment signals
appear before reasoning completes. Error analysis reveals that the performance
gap between text classifiers and the linear probe largely stems from a subset
of responses we call performative CoTs, where the reasoning consistently
contradicts the final response as the CoT progresses. Our findings generalize
across model sizes, families, and safety benchmarks, suggesting that
lightweight probes could enable real-time safety monitoring and early
intervention during generation.
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein
Large language models (LLMs) tend to generate lexically, semantically, and
stylistically homogenous texts. This poses a risk of knowledge collapse, where
homogenous LLMs mediate a shrinking in the range of accessible information over
time. Existing works on homogenization are limited by a focus on closed-ended
multiple-choice setups or fuzzy semantic features, and do not look at trends
across time and cultural contexts. To overcome this, we present a new
methodology to measure epistemic diversity, i.e., variation in real-world
claims in LLM outputs, which we use to perform a broad empirical study of LLM
knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200
prompt variations sourced from real user chats. For the topics in our study, we
show that while newer models tend to generate more diverse claims, nearly all
models are less epistemically diverse than a basic web search. We find that
model size has a negative impact on epistemic diversity, while
retrieval-augmented generation (RAG) has a positive impact, though the
improvement from RAG varies by the cultural context. Finally, compared to a
traditional knowledge source (Wikipedia), we find that country-specific claims
reflect the English language more than the local one, highlighting a gap in
epistemic representation
comment: 16 pages; 8 figures, 4 tables v2 changelog: Fixed the modeling for
table 3, random effect is the model version
♻ ☆ Entropy-Gated Branching for Efficient Test-Time Reasoning
Test-time compute methods can significantly improve the reasoning
capabilities and problem-solving accuracy of large language models (LLMs).
However, these approaches require substantially more computational resources,
with most compute wasted on exploring low-diversity branches where the model
already exhibits high confidence. We observe that a small subset of uncertain
reasoning steps has a disproportionately large impact on final prediction
accuracy, and branching at these critical junctures tends to yield more diverse
and higher-quality candidate reasoning steps. We propose Entropy-Gated
Branching (EGB), which branches only at high-uncertainty steps and prunes
expansions with a lightweight verifier. On mathematical and financial reasoning
benchmarks, EGB improves accuracy by 22.6% over standard inference while
operating 31%-75% faster across math benchmarks than test-time beam search with
higher performance. Our results show that dynamic resource allocation during
inference can substantially improve both efficiency and effectiveness, offering
a more scalable pathway to enhanced LLM reasoning capabilities.
♻ ☆ CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are
distinguished by their strong performance scaling with increasing parameters
across a wide range of tasks, yet they also suffer from substantial
computational and storage overheads. Notably, the performance gains of MoE
models do not scale proportionally with the growth in expert parameters. While
prior works attempt to reduce parameters via expert-level pruning, merging, or
decomposition, they still suffer from challenges in both performance and
computational efficiency. In this paper, we address these challenges by
introducing micro-expert as a finer-grained compression unit that spans across
matrices. We first establish a more fundamental perspective, viewing MoE layers
as mixtures of micro-experts, and present CAMERA, a lightweight and
training-free framework for identifying micro-expert redundancy. Our analysis
uncovers significant variance in micro-expert contributions during decoding.
Based on this insight, we further propose CAMERA-P, a structured micro-expert
pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed
for micro-experts. Extensive experiments on nine downstream tasks show that
CAMERA-P consistently outperforms strong baselines under pruning ratios ranging
from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under
aggressive 2-bit quantization, surpassing existing matrix- and channel-level
ideas. Notably, our method enables complete micro-expert analysis of
Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
comment: 16 pages, 9 figures, 7 tables
♻ ☆ On Relation-Specific Neurons in Large Language Models EMNLP 2025
Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
In large language models (LLMs), certain \emph{neurons} can store distinct
pieces of knowledge learned during pretraining. While factual knowledge
typically appears as a combination of \emph{relations} and \emph{entities}, it
remains unclear whether some neurons focus on a relation itself -- independent
of any entity. We hypothesize such neurons \emph{detect} a relation in the
input text and \emph{guide} generation involving such a relation. To
investigate this, we study the LLama-2 family on a chosen set of relations,
with a \textit{statistics}-based method. Our experiments demonstrate the
existence of relation-specific neurons. We measure the effect of selectively
deactivating candidate neurons specific to relation $r$ on the LLM's ability to
handle (1) facts involving relation $r$ and (2) facts involving a different
relation $r' \neq r$. With respect to their capacity for encoding relation
information, we give evidence for the following three properties of
relation-specific neurons. \textbf{(i) Neuron cumulativity.} Multiple neurons
jointly contribute to processing facts involving relation $r$, with no single
neuron fully encoding a fact in $r$ on its own. \textbf{(ii) Neuron
versatility.} Neurons can be shared across multiple closely related as well as
less related relations. In addition, some relation neurons transfer across
languages. \textbf{(iii) Neuron interference.} Deactivating neurons specific to
one relation can improve LLMs' factual recall performance for facts of other
relations. We make our code and data publicly available at
https://github.com/cisnlp/relation-specific-neurons.
comment: EMNLP 2025
♻ ☆ MedHal: An Evaluation Dataset for Medical Hallucination Detection
We present MedHal, a novel large-scale dataset specifically designed to
evaluate if models can detect hallucinations in medical texts. Current
hallucination detection methods face significant limitations when applied to
specialized domains like medicine, where they can have disastrous consequences.
Existing medical datasets are either too small, containing only a few hundred
samples, or focus on a single task like Question Answering or Natural Language
Inference. MedHal addresses these gaps by: (1) incorporating diverse medical
text sources and tasks; (2) providing a substantial volume of annotated samples
suitable for training medical hallucination detection models; and (3) including
explanations for factual inconsistencies to guide model learning. We
demonstrate MedHal's utility by training and evaluating a baseline medical
hallucination detection model, showing improvements over general-purpose
hallucination detection approaches. This resource enables more efficient
evaluation of medical text generation systems while reducing reliance on costly
expert review, potentially accelerating the development of medical AI research.
♻ ☆ Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)
Lucas Carrit Delgado Pinheiro, Ziru Chen, Bruno Caixeta Piazza, Ness Shroff, Yingbin Liang, Yuan-Sen Ting, Huan Sun
While task-specific demonstrations show early success in applying large
language models (LLMs) to automate some astronomical research tasks, they only
provide incomplete views of all necessary capabilities in solving astronomy
problems, calling for more thorough understanding of LLMs' strengths and
limitations. So far, existing benchmarks and evaluations focus on simple
question-answering that primarily tests astronomical knowledge and fails to
evaluate the complex reasoning required for real-world research in the
discipline. Here, we address this gap by systematically benchmarking five
state-of-the-art LLMs on the International Olympiad on Astronomy and
Astrophysics (IOAA) exams, which are designed to examine deep conceptual
understanding, multi-step derivations, and multimodal analysis. With average
scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing
models) not only achieve gold medal level performance but also rank in the top
two among ~200-300 participants in all four IOAA theory exams evaluated
(2022-2025). In comparison, results on the data analysis exams show more
divergence. GPT-5 still excels in the exams with an 88.5% average score,
ranking top 10 among the participants in the four most recent IOAAs, while
other models' performances drop to 48-76%. Furthermore, our in-depth error
analysis underscores conceptual reasoning, geometric reasoning, and spatial
visualization (52-79% accuracy) as consistent weaknesses among all LLMs. Hence,
although LLMs approach peak human performance in theory exams, critical gaps
must be addressed before they can serve as autonomous research agents in
astronomy.
comment: 18 pages, 6 figures, to be submitted, comments are welcome.
Reproducibility details can be found at:
https://github.com/OSU-NLP-Group/LLM-IOAA
♻ ☆ What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Multimodal reasoning models have recently shown promise on challenging
domains such as olympiad-level geometry, yet their evaluation remains dominated
by aggregate accuracy, a single score that obscures where and how models are
improving. We introduce MathLens, a benchmark designed to disentangle the
subskills of multimodal reasoning while preserving the complexity of
textbook-style geometry problems. The benchmark separates performance into
three components: Perception: extracting information from raw inputs,
Reasoning: operating on available information, and Integration: selecting
relevant perceptual evidence and applying it within reasoning. To support each
test, we provide annotations: visual diagrams, textual descriptions to evaluate
reasoning in isolation, controlled questions that require both modalities, and
probes for fine-grained perceptual skills, all derived from symbolic
specifications of the problems to ensure consistency and robustness. Our
analysis reveals that different training approaches have uneven effects: First,
reinforcement learning chiefly strengthens perception, especially when
supported by textual supervision, while textual SFT indirectly improves
perception through reflective reasoning. Second, reasoning improves only in
tandem with perception. Third, integration remains the weakest capacity, with
residual errors concentrated there once other skills advance. Finally,
robustness diverges: RL improves consistency under diagram variation, whereas
multimodal SFT reduces it through overfitting. We will release all data and
experimental logs.
♻ ☆ v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
When thinking with images, humans rarely rely on a single glance: they
revisit visual information repeatedly during reasoning. However, existing
models typically process images only once and thereafter generate reasoning
entirely in text, lacking mechanisms to re-access or ground inference in visual
representations. We empirically confirm this: as reasoning chains lengthen,
models progressively lose focus on relevant regions. In response, we introduce
v1, a lightweight extension that enables active visual referencing through a
simple point-and-copy approach. This allows the model to identify relevant
image patches and copy their embeddings back into the reasoning stream,
ensuring that evolving hypotheses remain grounded in perceptual evidence.
Crucially, our pointing strategy lets the MLLM directly select image patches
using their semantic representations as keys, keeping perceptual evidence
embedded in the same space as the model's reasoning. To train this capability,
we construct v1g, a dataset of 300K multimodal reasoning traces with
interleaved visual grounding annotations. Across various multimodal
mathematical reasoning benchmarks, v1 consistently outperforms comparable
baselines, establishing point-and-copy as a practical mechanism for grounded
reasoning. The model checkpoint and dataset are available at
github.com/jun297/v1.
♻ ☆ Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models ACL 2025
Values are core drivers of individual and collective perception, cognition,
and behavior. Value systems, such as Schwartz's Theory of Basic Human Values,
delineate the hierarchy and interplay among these values, enabling
cross-disciplinary investigations into decision-making and societal dynamics.
Recently, the rise of Large Language Models (LLMs) has raised concerns
regarding their elusive intrinsic values. Despite growing efforts in
evaluating, understanding, and aligning LLM values, a psychologically grounded
LLM value system remains underexplored. This study addresses the gap by
introducing the Generative Psycho-Lexical Approach (GPLA), a scalable,
adaptable, and theoretically informed method for constructing value systems.
Leveraging GPLA, we propose a psychologically grounded five-factor value system
tailored for LLMs. For systematic validation, we present three benchmarking
tasks that integrate psychological principles with cutting-edge AI priorities.
Our results reveal that the proposed value system meets standard psychological
criteria, better captures LLM values, improves LLM safety prediction, and
enhances LLM alignment, when compared to the canonical Schwartz's values.
comment: ACL 2025 Main
♻ ☆ AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents
Declaration of Performance (DoP) documents, mandated by EU regulation,
certify the performance of construction products. There are two challenges to
make DoPs machine and human accessible through automated key-value pair
extraction (KVP) and question answering (QA): (1) While some of their content
is standardized, DoPs vary widely in layout, schema, and format; (2) Both users
and documents are multilingual. Existing static or LLM-only Information
Extraction (IE) pipelines fail to adapt to this structural document and user
diversity. Our domain-specific, agentic system addresses these challenges
through a planner-executor-responder architecture. The system infers user
intent, detects document language and modality, and orchestrates tools
dynamically for robust, traceable reasoning while avoiding tool misuse or
execution loops. Our agent outperforms baselines (ROUGE: 0.783 vs. 0.703/0.608)
with better cross-lingual stability (17-point vs. 21-26-point variation).
♻ ☆ SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection
Predicting earnings surprises from financial documents, such as earnings
conference calls, regulatory filings, and financial news, has become
increasingly important in financial economics. However, these financial
documents present significant analytical challenges, typically containing over
5,000 words with substantial redundancy and industry-specific terminology that
creates obstacles for language models. In this work, we propose the SAE-FiRE
(Sparse Autoencoder for Financial Representation Enhancement) framework to
address these limitations by extracting key information while eliminating
redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to decompose dense
neural representations from large language models into interpretable sparse
components, then applies statistical feature selection methods, including ANOVA
F-tests and tree-based importance scoring, to identify the top-k most
discriminative dimensions for classification. By systematically filtering out
noise that might otherwise lead to overfitting, we enable more robust and
generalizable predictions. Experimental results across three financial datasets
demonstrate that SAE-FiRE significantly outperforms baseline approaches.
♻ ☆ Unifying Inference-Time Planning Language Generation
Prabhu Prakash Kagitha, Bo Sun, Ishan Desai, Andrew Zhu, Cassie Huang, Manling Li, Ziyang Li, Li Zhang
A line of work in planning uses LLM not to generate a plan, but to generate a
formal representation in some planning language, which can be input into a
symbolic solver to deterministically find a plan. While showing improved trust
and promising performance, dozens of recent publications have proposed
scattered methods on a variety of benchmarks under different experimental
settings. We attempt to unify the inference-time LLM-as-formalizer methodology
for classical planning by proposing a unifying framework based on intermediate
representations. We thus systematically evaluate more than a dozen pipelines
that subsume most existing work, while proposing novel ones that involve
syntactically similar but high resource intermediate languages (such as a
Python wrapper of PDDL). We provide recipes for planning language generation
pipelines, draw a series of conclusions showing the efficacy of their various
components, and evidence their robustness against problem complexity.
♻ ☆ Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game
Recent studies have investigated whether large language models (LLMs) can
support obscured communication, which is characterized by core aspects such as
inferring subtext and evading suspicions. To conduct the investigation,
researchers have used social deduction games (SDGs) as their experimental
environment, in which players conceal and infer specific information. However,
prior work has often overlooked how LLMs should be evaluated in such settings.
Specifically, we point out two limitations with the evaluation methods they
employed. First, metrics used in prior studies are coarse-grained as they are
based on overall game outcomes that often fail to capture event-level
behaviors; Second, error analyses have lacked structured methodologies capable
of producing insights that meaningfully support evaluation outcomes. To address
these limitations, we propose a microscopic and systematic approach to the
investigation. Specifically, we introduce six fine-grained metrics that resolve
the first issue. To tackle the second issue, we conducted a thematic analysis
and identified four major reasoning failures that undermine LLMs' performance
in obscured communication.
comment: Published in IEEE Access
♻ ☆ AgriGPT-VL: Agricultural Vision-Language Understanding Suite
Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li
Despite rapid advances in multimodal large language models, agricultural
applications remain constrained by the scarcity of domain-tailored models,
curated vision-language corpora, and rigorous evaluation. To address these
challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for
agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL,
the largest vision-language corpus for agriculture to our knowledge, curated by
a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M
image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO
reinforcement learning samples. Second, we develop AgriGPT-VL, an
agriculture-specialized vision-language model trained via a progressive
curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO
refinement. This method achieves strong multimodal reasoning while preserving
text-only capability. Third, we establish AgriBench-VL-4K, a compact yet
challenging evaluation suite with open-ended and image-grounded questions,
paired with multi-metric evaluation and an LLM-as-a-judge framework.
Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on
AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge
evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K
with no noticeable degradation of language ability. Ablation studies further
confirm consistent gains from our alignment and GRPO refinement stages. We will
open source all of the resources to support reproducible research and
deployment in low-resource agricultural settings.
♻ ☆ GLiDRE: Generalist Lightweight model for Document-level Relation Extraction
Relation Extraction (RE) is a fundamental task in Natural Language
Processing, and its document-level variant poses significant challenges, due to
complex interactions between entities across sentences. While supervised models
have achieved strong results in fully resourced settings, their behavior with
limited training data remains insufficiently studied. We introduce GLiDRE, a
new compact model for document-level relation extraction, designed to work
efficiently in both supervised and few-shot settings. Experiments in both
low-resource supervised training and few-shot meta-learning benchmarks show
that our approach outperforms existing methods in data-constrained scenarios,
establishing a new state-of-the-art in few-shot document-level relation
extraction. Our code will be publicly available.
comment: Submitted to ARR October
♻ ☆ CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the
reasoning abilities of Large Language Models (LLMs) by using rule-based binary
feedback. However, current RLVR methods typically assign the same reward to
every token. This coarse-grained feedback hampers precise credit assignment,
making it hard for models to identify which reasoning steps lead to success or
failure, and often results in suboptimal policies. Methods like PPO provide
credit assignment by value estimation, but yield inaccurate and unverifiable
signals due to limited sampling. On the other hand, methods using Process
Reward Models can provide step-wise rewards but suffer from several key
limitations: they require high-quality process supervision labels, the feedback
is unreliable due to probabilistic reward modeling, and their application in
online reinforcement learning (RL) is time-consuming. To overcome these
limitations, we introduce a simple but efficient method-Credit Assignment
Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly
leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward
Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based
on the correctness of the step itself, providing deterministic token-level
credits to refine the tokens that were originally assigned identical rule-based
rewards. To further enhance the accuracy and robustness, we employ voting
mechanisms that scale with the number of generated critiques. Extensive
experiments on various backbones like Llama and Qwen models show that CAPO
consistently outperforms supervised learning-based and RL-based fine-tuning
methods across four challenging mathematical benchmarks and three out-of-domain
benchmarks. Further analysis shows that CAPO can help the model to foster the
learning of correct reasoning pathways leading to correct answers.
comment: Work in progress
♻ ☆ MASRAD: Arabic Terminology Management Corpora with Semi-Automatic Construction
This paper presents MASRAD, a terminology dataset for Arabic terminology
management, and a method with supporting tools for its semi-automatic
construction. The entries in MASRAD are $(f,a)$ pairs of foreign (non-Arabic)
terms $f$, appearing in specialized, academic and field-specific books next to
their Arabic $a$ counterparts. MASRAD-Ex systematically extracts these pairs as
a first step to construct MASRAD. MASRAD helps improving term consistency in
academic translations and specialized Arabic documents, and automating
cross-lingual text processing. MASRAD-Ex leverages translated terms organically
occurring in Arabic books, and considers several candidate pairs for each term
phrase. The candidate Arabic terms occur next to the foreign terms, and vary in
length. MASRAD-Ex computes lexicographic, phonetic, morphological, and semantic
similarity metrics for each candidate pair, and uses heuristic, machine
learning, and machine learning with post-processing approaches to decide on the
best candidate. This paper presents MASRAD after thorough expert review and
makes it available to the interested research community. The best performing
MASRAD-Ex approach achieved 90.5% precision and 92.4% recall.
♻ ☆ Teaching Small Language Models to Learn Logic through Meta-Learning
Large language models (LLMs) are increasingly evaluated on reasoning tasks,
yet their logical abilities remain contested. To address this, we study LLMs'
reasoning in a well-defined fragment of logic: syllogistic reasoning. We cast
the problem as premise selection and construct controlled datasets to isolate
logical competence. Beyond evaluation, an open challenge is enabling LLMs to
acquire abstract inference patterns that generalize to novel structures. We
propose to apply few-shot meta-learning to this domain, thereby encouraging
models to extract rules across tasks rather than memorize patterns within
tasks. Although meta-learning has been little explored in the context of logic
learnability, our experiments show that it is effective: small models (1.5B-7B)
fine-tuned with meta-learning demonstrate strong gains in generalization, with
especially pronounced benefits in low-data regimes. These meta-learned models
outperform GPT-4o and o3-mini on our syllogistic reasoning task.
♻ ☆ Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information EMNLP 2025
Language agents powered by large language models (LLMs) face significant
deployment challenges in resource-constrained environments, particularly for
specialized domains and less-common languages. This paper presents Tox-chat, a
Korean chemical toxicity information agent devised within these limitations. We
propose two key innovations: a context-efficient architecture that reduces
token consumption through hierarchical section search, and a scenario-based
dialogue generation methodology that effectively distills tool-using
capabilities from larger models. Experimental evaluations demonstrate that our
fine-tuned 8B parameter model substantially outperforms both untuned models and
baseline approaches, in terms of DB faithfulness and preference. Our work
offers valuable insights for researchers developing domain-specific language
agents under practical constraints.
comment: EMNLP 2025 Industry track
♻ ☆ FAID: Fine-Grained AI-Generated Text Detection Using Multi-Task Auxiliary and Multi-Level Contrastive Learning
Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, Minh Le-Anh, Truong Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, Sang Dinh
The growing collaboration between humans and AI models in generative tasks
has introduced new challenges in distinguishing between human-written,
LLM-generated, and human--LLM collaborative texts. In this work, we collect a
multilingual, multi-domain, multi-generator dataset FAIDSet. We further
introduce a fine-grained detection framework FAID to classify text into these
three categories, and also to identify the underlying LLM family of the
generator. Unlike existing binary classifiers, FAID is built to capture both
authorship and model-specific characteristics. Our method combines multi-level
contrastive learning with multi-task auxiliary classification to learn subtle
stylistic cues. By modeling LLM families as distinct stylistic entities, we
incorporate an adaptation to address distributional shifts without retraining
for unseen data. Our experimental results demonstrate that FAID outperforms
several baselines, particularly enhancing the generalization accuracy on unseen
domains and new LLMs, thus offering a potential solution for improving
transparency and accountability in AI-assisted writing.
♻ ☆ Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry EMNLP 2025
Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained
language models by incorporating additional knowledge from the graph structures
to learn domain-specific terminology or relationships between documents that
might otherwise be overlooked. This paper explores how SciNCL, a graph-aware
neighborhood contrastive learning methodology originally designed for
scientific publications, can be applied to the process industry domain, where
text logs contain crucial information about daily operations and are often
structured as sparse KGs. Our experiments demonstrate that language models
fine-tuned with triplets derived from graph embeddings (GE) outperform a
state-of-the-art mE5-large text encoder by 9.8-14.3% (5.45-7.96p) on the
proprietary process industry text embedding benchmark (PITEB) while having 3
times fewer parameters.
comment: accepted to EMNLP 2025 (industry track)
♻ ☆ Cross-Document Cross-Lingual NLI via RST-Enhanced Graph Fusion and Interpretability Prediction EMNLP 2025
Natural Language Inference (NLI) is a fundamental task in natural language
processing. While NLI has developed many sub-directions such as sentence-level
NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI
(CDCL-NLI) remains largely unexplored. In this paper, we propose a novel
paradigm: CDCL-NLI, which extends traditional NLI capabilities to
multi-document, multilingual scenarios. To support this task, we construct a
high-quality CDCL-NLI dataset including 25,410 instances and spanning 26
languages. To address the limitations of previous methods on CDCL-NLI task, we
further propose an innovative method that integrates RST-enhanced graph fusion
with interpretability-aware prediction. Our approach leverages RST (Rhetorical
Structure Theory) within heterogeneous graph neural networks for cross-document
context modeling, and employs a structure-aware semantic alignment based on
lexical chains for cross-lingual understanding. For NLI interpretability, we
develop an EDU (Elementary Discourse Unit)-level attribution framework that
produces extractive explanations. Extensive experiments demonstrate our
approach's superior performance, achieving significant improvements over both
conventional NLI models as well as large language models. Our work sheds light
on the study of NLI and will bring research interest on cross-document
cross-lingual context understanding, hallucination elimination and
interpretability inference. Our code and datasets are available at
"https://github.com/Leonardo123-ui/CDCL_NLI" for peer review.
comment: EMNLP 2025 Main (Camera Ready)
♻ ☆ Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour
This study investigates the adoption of open-access, locally deployable
causal large language models (LLMs) for travel mode choice prediction and
introduces LiTransMC, the first fine-tuned causal LLM developed for this task.
We systematically benchmark eleven open-access LLMs (1-12B parameters) across
three stated and revealed preference datasets, testing 396 configurations and
generating over 79,000 mode choice decisions. Beyond predictive accuracy, we
evaluate models generated reasoning using BERTopic for topic modelling and a
novel Explanation Strength Index, providing the first structured analysis of
how LLMs articulate decision factors in alignment with behavioural theory.
LiTransMC, fine-tuned using parameter efficient and loss masking strategy,
achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of
0.000245, surpassing both untuned local models and larger proprietary systems,
including GPT-4o with advanced persona inference and embedding-based loading,
while also outperforming classical mode choice methods such as discrete choice
models and machine learning classifiers for the same dataset. This dual
improvement, i.e., high instant-level accuracy and near-perfect distributional
calibration, demonstrates the feasibility of creating specialist, locally
deployable LLMs that integrate prediction and interpretability. Through
combining structured behavioural prediction with natural language reasoning,
this work unlocks the potential for conversational, multi-task transport models
capable of supporting agent-based simulations, policy testing, and behavioural
insight generation. These findings establish a pathway for transforming general
purpose LLMs into specialized and explainable tools for transportation research
and policy formulation, while maintaining privacy, reducing cost, and
broadening access through local deployment.
♻ ☆ Aligning Language Models with Real-time Knowledge Editing
Knowledge editing aims to modify outdated knowledge in large language models
(LLMs) efficiently while retaining their original capabilities. Mainstream
benchmarks for knowledge editing are predominantly static and fail to keep in
pace with the evolving real-world knowledge. In this work, we introduce CRAFT,
an ever-evolving real-world benchmark for knowledge editing. It features
well-designed paired edits for composite reasoning, and evaluates models on
alias portability as well as temporal and common-sense locality, making it a
challenging knowledge editing benchmark on which previous knowledge editing
methods hardly achieve balanced performance. Towards flexible real-time
editing, we propose KEDAS, a novel paradigm of knowledge editing alignment
featuring diverse edit augmentation and self-adaptive post-alignment inference,
which exhibits significant performance gain on CRAFT compared to previous
methods. All of our code and data are available at
https://anonymous.4open.science/r/CRAFT-KEDAS.
comment: Pre-print
♻ ☆ Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Language corpora are the foundation of most natural language processing
research, yet they often reproduce structural inequalities. One such inequality
is gender discrimination in how actors are represented, which can distort
analyses and perpetuate discriminatory outcomes. This paper introduces a
user-centric, actor-level pipeline for detecting and mitigating gender
discrimination in large-scale text corpora. By combining discourse-aware
analysis with metrics for sentiment, syntactic agency, and quotation styles,
our method enables both fine-grained auditing and exclusion-based balancing.
Applied to the taz2024full corpus of German newspaper articles (1980-2024), the
pipeline yields a more gender-balanced dataset while preserving core dynamics
of the source material. Our findings show that structural asymmetries can be
reduced through systematic filtering, though subtler biases in sentiment and
framing remain. We release the tools and reports to support further research in
discourse-based fairness auditing and equitable corpus construction.
♻ ☆ WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
This paper tackles \textbf{open-ended deep research (OEDR)}, a complex
challenge where AI agents must synthesize vast web-scale information into
insightful reports. Current approaches are plagued by dual-fold limitations:
static research pipelines that decouple planning from evidence acquisition and
monolithic generation paradigms that include redundant, irrelevant evidence,
suffering from hallucination issues and low citation accuracy. To address these
challenges, we introduce \textbf{WebWeaver}, a novel dual-agent framework that
emulates the human research process. The planner operates in a dynamic cycle,
iteratively interleaving evidence acquisition with outline optimization to
produce a comprehensive, citation-grounded outline linking to a memory bank of
evidence. The writer then executes a hierarchical retrieval and writing
process, composing the report section by section. By performing targeted
retrieval of only the necessary evidence from the memory bank via citations for
each part, it effectively mitigates long-context issues and citation
hallucinations. Our framework establishes a new state-of-the-art across major
OEDR benchmarks, including DeepResearch Bench, DeepConsult, and
DeepResearchGym. These results validate our human-centric, iterative
methodology, demonstrating that adaptive planning and focused synthesis are
crucial for producing comprehensive, trusted, and well-structured reports.
comment: An agent system for open-ended deep research
♻ ☆ An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Large language models (LLMs) are typically aligned to refuse harmful
instructions through safety fine-tuning. A recent attack, termed abliteration,
identifies and suppresses the single latent direction most responsible for
refusal behavior, thereby enabling models to generate harmful content. We
propose a defense that fundamentally alters how models express refusal. We
construct an extended-refusal dataset in which responses to harmful prompts
provide detailed justifications before refusing, distributing the refusal
signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and
Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that
maintain high refusal rates under abliteration: refusal rates drop by at most
10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of
safety and utility demonstrate that extended-refusal fine-tuning effectively
neutralizes abliteration attacks while preserving general model performance and
enhancing robustness across multiple alignment scenarios.
comment: preprint - under review
♻ ☆ SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
Large language models (LLMs) are playing an increasingly important role in
scientific research, yet there remains a lack of comprehensive benchmarks to
evaluate the breadth and depth of scientific knowledge embedded in these
models. To address this gap, we introduce SciKnowEval, a large-scale dataset
designed to systematically assess LLMs across five progressive levels of
scientific understanding: memory, comprehension, reasoning, discernment, and
application. SciKnowEval comprises 28K multi-level questions and solutions
spanning biology, chemistry, physics, and materials science. Using this
benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results
show that while proprietary models often achieve state-of-the-art performance,
substantial challenges remain -- particularly in scientific reasoning and
real-world application. We envision SciKnowEval as a standard benchmark for
evaluating scientific capabilities in LLMs and as a catalyst for advancing more
capable and reliable scientific language models.
comment: 33 pages, 2 figures
♻ ☆ SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning
Recent advancements in large language models (LLMs) have accelerated progress
toward artificial general intelligence, yet their potential to generate harmful
content poses critical safety challenges. Existing alignment methods often
struggle to cover diverse safety scenarios and remain vulnerable to adversarial
attacks. In this work, we propose SAFER, a framework for Safety Alignment via
eFficient Ex-Ante Reasoning. Our approach instantiates structured Ex-Ante
reasoning through initial assessment, rule verification, and path calibration,
and embeds predefined safety rules to provide transparent and verifiable safety
judgments. Specifically, our approach consists of two training stages: (1)
supervised fine-tuning with synthetic traces to teach the multi-stage Ex-Ante
reasoning, and (2) step-level reasoning preference optimization to jointly
enhance safety, utility, and efficiency. Experiments on multiple open-source
LLMs demonstrate that SAFER significantly enhances safety performance while
maintaining helpfulness and response efficiency.
comment: 22 pages, 5 figures
♻ ☆ MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
With the rapid growth of academic publications, peer review has become an
essential yet time-consuming responsibility within the research community.
Large Language Models (LLMs) have increasingly been adopted to assist in the
generation of review comments; however, current LLM-based review tasks lack a
unified evaluation benchmark to rigorously assess the models' ability to
produce comprehensive, accurate, and human-aligned assessments, particularly in
scenarios involving multimodal content such as figures and tables. To address
this gap, we propose \textbf{MMReview}, a comprehensive benchmark that spans
multiple disciplines and modalities. MMReview includes multimodal content and
expert-written review comments for 240 papers across 17 research domains within
four major academic disciplines: Artificial Intelligence, Natural Sciences,
Engineering Sciences, and Social Sciences. We design a total of 13 tasks
grouped into four core categories, aimed at evaluating the performance of LLMs
and Multimodal LLMs (MLLMs) in step-wise review generation, outcome
formulation, alignment with human preferences, and robustness to adversarial
input manipulation. Extensive experiments conducted on 16 open-source models
and 5 advanced closed-source models demonstrate the thoroughness of the
benchmark. We envision MMReview as a critical step toward establishing a
standardized foundation for the development of automated peer review systems.
comment: Work in progress
♻ ☆ When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful
paradigm for enhancing large language models (LLMs) with external knowledge. It
leverages graphs to model the hierarchical structure between specific concepts,
enabling more coherent and effective knowledge retrieval for accurate
reasoning.Despite its conceptual promise, recent studies report that GraphRAG
frequently underperforms vanilla RAG on many real-world tasks. This raises a
critical question: Is GraphRAG really effective, and in which scenarios do
graph structures provide measurable benefits for RAG systems? To address this,
we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate
GraphRAG models onboth hierarchical knowledge retrieval and deep contextual
reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of
increasing difficulty, coveringfact retrieval, complex reasoning, contextual
summarization, and creative generation, and a systematic evaluation across the
entire pipeline, from graph constructionand knowledge retrieval to final
generation. Leveraging this novel benchmark, we systematically investigate the
conditions when GraphRAG surpasses traditional RAG and the underlying reasons
for its success, offering guidelines for its practical application. All related
resources and analyses are collected for the community at
https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
comment: All resources and analyses are collected at
https://github.com/GraphRAG-Bench/GraphRAG-Benchmark
♻ ☆ AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web NeurIPS 2025
Textual claims are often accompanied by images to enhance their credibility
and spread on social media, but this also raises concerns about the spread of
misinformation. Existing datasets for automated verification of image-text
claims remain limited, as they often consist of synthetic claims and lack
evidence annotations to capture the reasoning behind the verdict. In this work,
we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text
claims. Each claim is annotated with question-answer (QA) pairs containing
evidence from the web, reflecting a decomposed reasoning regarding the verdict.
We mitigate common challenges in fact-checking datasets such as contextual
dependence, temporal leakage, and evidence insufficiency, via claim
normalization, temporally constrained evidence annotation, and a two-stage
sufficiency check. We assess the consistency of the annotation in AVerImaTeC
via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and
$74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for
evidence retrieval and conduct extensive experiments to establish baselines for
verifying image-text claims using open-web evidence.
comment: accepted at NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ WildIFEval: Instruction Following in the Wild
Recent LLMs have shown remarkable success in following user instructions, yet
handling instructions with multiple constraints remains a significant
challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K
real user instructions with diverse, multi-constraint conditions. Unlike prior
datasets, our collection spans a broad lexical and topical spectrum of
constraints, extracted from natural user instructions. We categorize these
constraints into eight high-level classes to capture their distribution and
dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive
experiments to benchmark the instruction-following capabilities of leading
LLMs. WildIFEval clearly differentiates between small and large models, and
demonstrates that all models have a large room for improvement on such tasks.
We analyze the effects of the number and type of constraints on performance,
revealing interesting patterns of model constraint-following behavior. We
release our dataset to promote further research on instruction-following under
complex, realistic conditions.
♻ ☆ Text Clustering as Classification with LLMs
Text clustering serves as a fundamental technique for organizing and
interpreting unstructured textual data, particularly in contexts where manual
annotation is prohibitively costly. With the rapid advancement of Large
Language Models (LLMs) and their demonstrated effectiveness across a broad
spectrum of NLP tasks, an emerging body of research has begun to explore their
potential in the domain of text clustering. However, existing LLM-based
approaches still rely on fine-tuned embedding models and sophisticated
similarity metrics, rendering them computationally intensive and necessitating
domain-specific adaptation. To address these limitations, we propose a novel
framework that reframes text clustering as a classification task by harnessing
the in-context learning capabilities of LLMs. Our framework eliminates the need
for fine-tuning embedding models or intricate clustering algorithms. It
comprises two key steps: first, the LLM is prompted to generate a set of
candidate labels based on the dataset and then merges semantically similar
labels; second, it assigns the most appropriate label to each text sample. By
leveraging the advanced natural language understanding and generalization
capabilities of LLMs, the proposed approach enables effective clustering with
minimal human intervention. Experimental results on diverse datasets
demonstrate that our framework achieves comparable or superior performance to
state-of-the-art embedding-based clustering techniques, while significantly
reducing computational complexity and resource requirements. These findings
underscore the transformative potential of LLMs in simplifying and enhancing
text clustering tasks. We make our code available to the public for utilization
at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM. We also
provide the supplementary Appendix within the repository.
comment: 11 pages, 3 figures
♻ ☆ Geometry-Guided Adversarial Prompt Detection via Curvature and Local Intrinsic Dimension
Adversarial prompts are capable of jailbreaking frontier large language
models (LLMs) and inducing undesirable behaviours, posing a significant
obstacle to their safe deployment. Current mitigation strategies primarily rely
on activating built-in defence mechanisms or fine-tuning LLMs, both of which
are computationally expensive and can sacrifice model utility. In contrast,
detection-based approaches are more efficient and practical for deployment in
real-world applications. However, the fundamental distinctions between
adversarial and benign prompts remain poorly understood. In this work, we
introduce CurvaLID, a novel defence framework that efficiently detects
adversarial prompts by leveraging their geometric properties. It is agnostic to
the type of LLM, offering a unified detection framework across diverse
adversarial prompts and LLM architectures. CurvaLID builds on the geometric
analysis of text prompts to uncover their underlying differences. We
theoretically extend the concept of curvature via the Whewell equation into an
$n$-dimensional word embedding space, enabling us to quantify local geometric
properties, including semantic shifts and curvature in the underlying
manifolds. To further enhance our solution, we leverage Local Intrinsic
Dimensionality (LID) to capture complementary geometric features of text
prompts within adversarial subspaces. Our findings show that adversarial
prompts exhibit distinct geometric signatures from benign prompts, enabling
CurvaLID to achieve near-perfect classification and outperform state-of-the-art
detectors in adversarial prompt detection. CurvaLID provides a reliable and
efficient safeguard against malicious queries as a model-agnostic method that
generalises across multiple LLMs and attack families.
comment: 40 Pages, 6 figues
♻ ☆ The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
Contrastive decoding strategies are widely used to reduce object
hallucinations in multimodal large language models (MLLMs). These methods work
by constructing contrastive samples to induce hallucinations and then
suppressing them in the output distribution. However, this paper demonstrates
that such approaches fail to effectively mitigate the hallucination problem.
The performance improvements observed on POPE Benchmark are largely driven by
two misleading factors: (1) crude, unidirectional adjustments to the model's
output distribution and (2) the adaptive plausibility constraint, which reduces
the sampling strategy to greedy search. To further illustrate these issues, we
introduce a series of spurious improvement methods and evaluate their
performance against contrastive decoding techniques. Experimental results
reveal that the observed performance gains in contrastive decoding are entirely
unrelated to its intended goal of mitigating hallucinations. Our findings
challenge common assumptions about the effectiveness of contrastive decoding
strategies and pave the way for developing genuinely effective solutions to
hallucinations in MLLMs.
♻ ☆ Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages
Large language models (LLMs) have demonstrated significant capabilities in
solving mathematical problems expressed in natural language. However,
multilingual and culturally-grounded mathematical reasoning in low-resource
languages lags behind English due to the scarcity of socio-cultural task
datasets that reflect accurate native entities such as person names,
organization names, and currencies. Existing multilingual benchmarks are
predominantly produced via translation and typically retain English-centric
entities, owing to the high cost associated with human annotater-based
localization. Moreover, automated localization tools are limited, and hence,
truly localized datasets remain scarce. To bridge this gap, we introduce a
framework for LLM-driven cultural localization of math word problems that
automatically constructs datasets with native names, organizations, and
currencies from existing sources. We find that translated benchmarks can
obscure true multilingual math ability under appropriate socio-cultural
contexts. Through extensive experiments, we also show that our framework can
help mitigate English-centric entity bias and improves robustness when native
entities are introduced across various languages.
♻ ☆ Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Neural sequence-to-sequence systems deliver state-of-the-art performance for
automatic speech recognition. When using appropriate modeling units, e.g.,
byte-pair encoded characters, these systems are in principal open vocabulary
systems. In practice, however, they often fail to recognize words not seen
during training, e.g., named entities, acronyms, or domain-specific special
words. To address this problem, many context biasing methods have been
proposed; however, for words with a pronunciation-orthography mismatch, these
methods may still struggle. We propose a method which allows corrections of
substitution errors to improve the recognition accuracy of such challenging
words. Users can add corrections on the fly during inference. We show that with
this method we get a relative improvement in biased word error rate of up to
8%, while maintaining a competitive overall word error rate.
♻ ☆ From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning
Trustworthy verifiers are essential for the success of reinforcement learning
with verifiable reward (RLVR), which is the core methodology behind various
large reasoning models such as DeepSeek-R1. In complex domains like
mathematical reasoning, rule-based verifiers have been widely adopted in
previous works to train strong reasoning models. However, the reliability of
these verifiers and their impact on the RL training process remain poorly
understood. In this work, we take mathematical reasoning as a case study and
conduct a comprehensive analysis of various verifiers in both static evaluation
and RL training scenarios. First, we find that current open-source rule-based
verifiers often fail to recognize equivalent answers presented in different
formats across multiple commonly used mathematical datasets, resulting in
non-negligible false negative rates. This limitation adversely affects RL
training performance and becomes more pronounced as the policy model gets
stronger. Subsequently, we investigate model-based verifiers as a potential
solution to address these limitations. While the static evaluation shows that
model-based verifiers achieve significantly higher verification accuracy,
further analysis and RL results imply that they are highly susceptible to
hacking, where they misclassify certain patterns in responses as correct,
particularly after fine-tuning. This vulnerability is exploited during policy
model optimization, leading to artificially inflated rewards. Our findings
underscore the unique challenges inherent to both rule-based and model-based
verifiers and provide insights toward developing more accurate and robust
reward systems for reinforcement learning.
♻ ☆ RooseBERT: A New Deal For Political Language Modelling
The increasing amount of political debates and politics-related discussions
calls for the definition of novel computational methods to automatically
analyse such content with the final goal of lightening up political
deliberation to citizens. However, the specificity of the political language
and the argumentative form of these debates (employing hidden communication
strategies and leveraging implicit arguments) make this task very challenging,
even for current general-purpose pre-trained Language Models. To address this
issue, we introduce a novel pre-trained Language Model for political discourse
language called RooseBERT. Pre-training a language model on a specialised
domain presents different technical and linguistic challenges, requiring
extensive computational resources and large-scale data. RooseBERT has been
trained on large political debate and speech corpora (8K debates, each composed
of several sub-debates on different topics) in English. To evaluate its
performances, we fine-tuned it on four downstream tasks related to political
debate analysis, i.e., stance detection, sentiment analysis, argument component
detection and classification, and argument relation prediction and
classification. Our results demonstrate significant improvements over
general-purpose Language Models on these four tasks, highlighting how
domain-specific pre-training enhances performance in political debate analysis.
We release RooseBERT for the research community.
♻ ☆ VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Text-to-image retrieval (T2I retrieval) remains challenging because
cross-modal embeddings often behave as bags of concepts and underrepresent
structured visual relationships such as pose and viewpoint. We propose
Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that
mitigates this limitation of cross-modal similarity alignment. VisRet first
projects textual queries into the image modality via T2I generation. Then, it
performs retrieval within the image modality to bypass the weaknesses of
cross-modal retrievers in recognizing subtle visual-spatial features. Across
four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new
Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially
outperforms cross-modal similarity matching and baselines that recast T2I
retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on
average with CLIP as the retriever and by 0.121 with E5-V. For downstream
question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME
by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10
retrieval. Ablation studies show compatibility with different T2I instruction
LLMs, T2I generation models, and downstream LLMs. VisRet provides a practical
and principled path that energizes further advances in vision-language
retrieval. Our code and the Visual-RAG-ME benchmark will be publicly released.
♻ ☆ Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
Selective retrieval improves the accuracy and efficiency of
retrieval-augmented generation (RAG) by reducing distractions from low-quality
retrievals. However, existing approaches underutilize the inherent knowledge of
large language models (LLMs), leading to suboptimal retrieval decisions and
degraded generation performance. To bridge this gap, we propose Self-Routing
RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge
verbalization. SR-RAG enables an LLM to dynamically decide whether to retrieve
external knowledge or verbalize its own parametric knowledge. To this end, we
design a multi-task objective that jointly optimizes an LLM for knowledge
source selection, knowledge verbalization, and response generation. SR-RAG
further incorporates a nearest neighbor search mechanism at inference time to
improve the accuracy of knowledge source decisions under domain shifts.
Fine-tuning three LLMs with SR-RAG significantly improves both their response
accuracy and reduces the inference latency. Compared to the strongest selective
retrieval baseline, SR-RAG reduces the number of retrievals by 29% while
improving performance by 5.1%.
♻ ☆ RepIt: Representing Isolated Targets to Steer Language Models
While activation steering in large language models (LLMs) is a growing area
of research, methods can often incur broader effects than desired. This
motivates isolation of purer concept vectors to enable targeted interventions
and understand LLM behavior at a more granular level. We present RepIt, a
simple and data-efficient framework for isolating concept-specific
representations. Across five frontier LLMs, RepIt enables precise
interventions: it selectively suppresses refusal on targeted concepts while
preserving refusal elsewhere, producing models that answer WMD-related
questions while still scoring as safe on standard benchmarks. We further show
that the corrective signal localizes to just 100-200 neurons and that robust
target representations can be extracted from as few as a dozen examples on a
single A6000. This efficiency raises a dual concern: manipulations can be
performed with modest compute and data to extend to underrepresented
data-scarce topics while evading existing benchmarks. By disentangling refusal
vectors with RepIt, this work demonstrates that targeted interventions can
counteract overgeneralization, laying the foundation for more granular control
of model behavior.
♻ ☆ BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
Evaluation insights are limited by the availability of high-quality
benchmarks. As models evolve, there is a need to create benchmarks that can
measure progress on new and complex generative capabilities. However, manually
creating new benchmarks is slow and expensive, restricting comprehensive
evaluations for any capability. We introduce BenchAgents, a multi-agent
framework that methodically leverages large language models (LLMs) to automate
evaluation benchmark creation while inherently ensuring data and (evaluation)
metric quality. BenchAgents decomposes the benchmark creation process into
planning, generation, verification, and evaluation, each of which is ]
orchestrated via LLM agents. These agents interact with each other and utilize
feedback from benchmark developers to improve and flexibly control data
diversity and quality. We use BenchAgents to create benchmarks to evaluate
capabilities related to planning, constraint satisfaction, and causal reasoning
spanning both language and vision modalities. We then use these benchmarks to
study state-of-the-art models and extract new insights into common failure
modes and model differences.
♻ ☆ HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Assessing the capabilities and limitations of large language models (LLMs)
has garnered significant interest, yet the evaluation of multiple models in
real-world scenarios remains rare. Multilingual evaluation often relies on
translated benchmarks, which typically do not capture linguistic and cultural
nuances present in the source language. This study provides an extensive
assessment of 24 LLMs on real world data collected from Indian patients
interacting with a medical chatbot in Indian English and 4 other Indic
languages. We employ a uniform Retrieval Augmented Generation framework to
generate responses, which are evaluated using both automated techniques and
human evaluators on four specific metrics relevant to our application. We find
that models vary significantly in their performance and that instruction tuned
Indic models do not always perform well on Indic language queries. Further, we
empirically show that factual correctness is generally lower for responses to
Indic queries compared to English queries. Finally, our qualitative work shows
that code-mixed and culturally relevant queries in our dataset pose challenges
to evaluated models.
♻ ☆ Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Over half of US adults with Alzheimer disease and related dementias remain
undiagnosed, and speech-based screening offers a scalable detection approach.
We compared large language model adaptation strategies for dementia detection
using the DementiaBank speech corpus, evaluating nine text-only models and
three multimodal audio-text models on recordings from DementiaBank speech
corpus. Adaptations included in-context learning with different demonstration
selection policies, reasoning-augmented prompting, parameter-efficient
fine-tuning, and multimodal integration. Results showed that class-centroid
demonstrations achieved the highest in-context learning performance, reasoning
improved smaller models, and token-level fine-tuning generally produced the
best scores. Adding a classification head substantially improved
underperforming models. Among multimodal models, fine-tuned audio-text systems
performed well but did not surpass the top text-only models. These findings
highlight that model adaptation strategies, including demonstration selection,
reasoning design, and tuning method, critically influence speech-based dementia
detection, and that properly adapted open-weight models can match or exceed
commercial systems.
♻ ☆ FormulaReasoning: A Dataset for Formula-Based Numerical Reasoning
The application of formulas (e.g., physics formulas) is a fundamental human
ability in solving numerical reasoning problems. Existing numerical reasoning
datasets rarely explicitly state the formulas employed, as their questions
often rely on implicit commonsense mathematical knowledge. To address this gap,
we introduce FormulaReasoning, a new dataset specifically designed for
formula-based numerical reasoning. It consists of 5,324 questions that require
numerical calculations grounded in external physics formulas. We provide
normalized, fine-grained annotations in both English and Chinese, including
formula structures, parameter names, symbols, numerical values, and
units-curated through extensive manual effort with LLM-assisted validation to
ensure high quality. Additionally, we offer a consolidated formula database to
serve as an external knowledge source. We analyze various reasoning approaches
on FormulaReasoning, with emphasis on comparative evaluation of different
architectural and methodological frameworks. Our assessment includes
retrieval-augmented methods, approaches that decompose reasoning into formula
generation, parameter extraction, and numerical calculation, as well as
optimization techniques using preference data. We identify key challenges in
formula-based numerical reasoning that require further investigation across
different reasoning paradigms, highlighting opportunities for methodological
advancement.
♻ ☆ Explaining GPTs' Schema of Depression: A Machine Behavior Analysis
Adithya V Ganesan, Vasudha Varadarajan, Yash Kumar Lal, Veerle C. Eijsbroek, Katarina Kjell, Oscar N. E. Kjell, Tanuja Dhanasekaran, Elizabeth C. Stade, Johannes C. Eichstaedt, Ryan L. Boyd, H. Andrew Schwartz, Lucie Flek
Use of large language models such as ChatGPT (GPT-4/GPT-5) for mental health
support has grown rapidly, emerging as a promising route to assess and help
people with mood disorders like depression. However, we have a limited
understanding of these language models' schema of mental disorders, that is,
how they internally associate and interpret symptoms of such disorders. In this
work, we leveraged contemporary measurement theory to decode how GPT-4 and
GPT-5 interrelate depressive symptoms, providing an explanation of how LLMs
apply what they learn and informing clinical applications. We found that GPT-4
(a) had strong convergent validity with standard instruments and expert
judgments $(r = 0.70 - 0.81)$, and (b) behaviorally linked depression symptoms
with each other (symptom inter-correlates $r = 0.23 - 0.78$) in accordance with
established literature on depression; however, it (c) underemphasized the
relationship between $\textit{suicidality}$ and other symptoms while
overemphasizing $\textit{psychomotor symptoms}$; and (d) suggested novel
hypotheses of symptom mechanisms, for instance, indicating that
$\textit{sleep}$ and $\textit{fatigue}$ are broadly influenced by other
depressive symptoms, while $\textit{worthlessness/guilt}$ is only tied to
$\textit{depressed mood}$. GPT-5 showed a slightly lower convergence with
self-report, a difference our machine-behavior analysis makes interpretable
through shifts in symptom-symptom relationships. These insights provide an
empirical foundation for understanding language models' mental health
assessments and demonstrate a generalizable approach for explainability in
other models and disorders. Our findings can guide key stakeholders to make
informed decisions for effectively situating these technologies in the care
system.
comment: 25 pages, 1 table, 4 figures, 1 supplementary table, 5 supplementary
figures, 59 references
♻ ☆ AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Identifying cultural capital (CC) themes in student reflections can offer
valuable insights that help foster equitable learning environments in
classrooms. However, themes such as aspirational goals or family support are
often woven into narratives, rather than appearing as direct keywords. This
makes them difficult to detect for standard NLP models that process sentences
in isolation. The core challenge stems from a lack of awareness, as standard
models are pre-trained on general corpora, leaving them blind to the
domain-specific language and narrative context inherent to the data. To address
this, we introduce AWARE, a framework that systematically attempts to improve a
transformer model's awareness for this nuanced task. AWARE has three core
components: 1) Domain Awareness, adapting the model's vocabulary to the
linguistic style of student reflections; 2) Context Awareness, generating
sentence embeddings that are aware of the full essay context; and 3) Class
Overlap Awareness, employing a multi-label strategy to recognize the
coexistence of themes in a single sentence. Our results show that by making the
model explicitly aware of the properties of the input, AWARE outperforms a
strong baseline by 2.1 percentage points in Macro-F1 and shows considerable
improvements across all themes. This work provides a robust and generalizable
methodology for any text classification task in which meaning depends on the
context of the narrative.
♻ ☆ Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings
Current social bias benchmarks for Large Language Models (LLMs) primarily
rely on predefined question formats like multiple-choice, limiting their
ability to reflect the complexity and open-ended nature of real-world
interactions. To close this gap, we extend an existing dataset BBQ (Parrish et
al., 2022) to Open-BBQ, a comprehensive framework to evaluate the social bias
of LLMs in open-ended settings by incorporating two additional question
categories: fill-in-the-blank and short-answer. Since our new Open-BBQ dataset
contains a lot of open-ended responses like sentences and paragraphs, we
developed an evaluation process to detect biases from open-ended content by
labeling sentences and paragraphs. In addition to this, we also found that
existing debiasing methods, such as self-debiasing (Gallegos et al., 2024),
have over-correction issues, which make the original correct answers incorrect.
In order to solve this issue, we propose Composite Prompting, an In-context
Learning (ICL) method combining structured examples with explicit
chain-of-thought reasoning to form a unified instruction template for LLMs to
explicitly identify content that needs debiasing. Experimental results show
that the proposed method significantly reduces the bias for both GPT-3.5 and
GPT-4o while maintaining high accuracy.
comment: 15 pages
♻ ☆ Robustness of Large Language Models to Perturbations in Text
Having a clean dataset has been the foundational assumption of most natural
language processing (NLP) systems. However, properly written text is rarely
found in real-world scenarios and hence, oftentimes invalidates the
aforementioned foundational assumption. Recently, Large language models (LLMs)
have shown impressive performance, but can they handle the inevitable noise in
real-world data? This work tackles this critical question by investigating
LLMs' resilience against morphological variations in text. To that end, we
artificially introduce varying levels of noise into a diverse set of datasets
and systematically evaluate LLMs' robustness against the corrupt variations of
the original text. Our findings show that contrary to popular beliefs,
generative LLMs are quiet robust to noisy perturbations in text. This is a
departure from pre-trained models like BERT or RoBERTa whose performance has
been shown to be sensitive to deteriorating noisy text. Additionally, we test
LLMs' resilience on multiple real-world benchmarks that closely mimic commonly
found errors in the wild. With minimal prompting, LLMs achieve a new
state-of-the-art on the benchmark tasks of Grammar Error Correction (GEC) and
Lexical Semantic Change (LSC). To empower future research, we also release a
dataset annotated by humans stating their preference for LLM vs.
human-corrected outputs along with the code to reproduce our results.
comment: 8 pages, 1 figure, 6 tables, updated with results also from GPT-4,
LLaMa-3
♻ ☆ MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis
Joseph Cho, Mrudang Mathur, Cyril Zakka, Dhamanpreet Kaur, Matthew Leipzig, Alex Dalal, Aravind Krishnan, Eubee Koo, Karen Wai, Cindy S. Zhao, Akshay Chaudhari, Matthew Duda, Ashley Choi, Ehsan Rahimy, Lyna Azzouz, Robyn Fong, Rohan Shad, William Hiesinger
Deep learning algorithms require extensive data to achieve robust
performance. However, data availability is often restricted in the medical
domain due to patient privacy concerns. Synthetic data presents a possible
solution to these challenges. Recently, image generative models have found
increasing use for medical applications but are often designed for singular
medical specialties and imaging modalities, thus limiting their broader
utility. To address this, we introduce MediSyn: a text-guided, latent diffusion
model capable of generating synthetic images from 6 medical specialties and 10
image types. Through extensive experimentation, we first demonstrate that
MediSyn quantitatively matches or surpasses the performance of specialist
models. Second, we show that our synthetic images are realistic and exhibit
strong alignment with their corresponding text prompts, as validated by a team
of expert physicians. Third, we provide empirical evidence that our synthetic
images are visually distinct from their corresponding real patient images.
Finally, we demonstrate that in data-limited settings, classifiers trained
solely on synthetic data or real data supplemented with synthetic data can
outperform those trained solely on real data. Our findings highlight the
immense potential of generalist image generative models to accelerate
algorithmic research and development in medicine.
♻ ☆ Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning NeurIPS 2025
Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
Retrieval-augmented generation (RAG) is widely utilized to incorporate
external knowledge into large language models, thereby enhancing factuality and
reducing hallucinations in question-answering (QA) tasks. A standard RAG
pipeline consists of several components, such as query rewriting, document
retrieval, document filtering, and answer generation. However, these components
are typically optimized separately through supervised fine-tuning, which can
lead to misalignments between the objectives of individual components and the
overarching aim of generating accurate answers. Although recent efforts have
explored using reinforcement learning (RL) to optimize specific RAG components,
these approaches often focus on simple pipelines with only two components or do
not adequately address the complex interdependencies and collaborative
interactions among the modules. To overcome these limitations, we propose
treating the complex RAG pipeline with multiple components as a multi-agent
cooperative task, in which each component can be regarded as an RL agent.
Specifically, we present MMOA-RAG, Multi-Module joint Optimization Algorithm
for RAG, which employs multi-agent reinforcement learning to harmonize all
agents' goals toward a unified reward, such as the F1 score of the final
answer. Experiments conducted on various QA benchmarks demonstrate that
MMOA-RAG effectively boost the overall performance of the pipeline and
outperforms existing baselines. Furthermore, comprehensive ablation studies
validate the contributions of individual components and demonstrate MMOA-RAG
can be adapted to different RAG pipelines and benchmarks.
comment: NeurIPS 2025
♻ ☆ ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding
The emergence of Multi-modal Large Language Models (MLLMs) presents new
opportunities for chart understanding. However, due to the fine-grained nature
of these tasks, applying MLLMs typically requires large, high-quality datasets
for task-specific fine-tuning, leading to high data collection and training
costs. To address this, we propose ChartCards, a unified chart-metadata
generation framework for multi-task chart understanding. ChartCards
systematically synthesizes various chart information, including data tables,
visualization code, visual elements, and multi-dimensional semantic captions.
By structuring this information into organized metadata, ChartCards enables a
single chart to support multiple downstream tasks, such as text-to-chart
retrieval, chart summarization, chart-to-table conversion, chart description,
and chart question answering. Using ChartCards, we further construct MetaChart,
a large-scale high-quality dataset containing 10,862 data tables, 85K charts,
and 170 K high-quality chart captions. We validate the dataset through
qualitative crowdsourcing evaluations and quantitative fine-tuning experiments
across various chart understanding tasks. Fine-tuning six different models on
MetaChart resulted in an average performance improvement of 5% across all
tasks. The most notable improvements are seen in text-to-chart retrieval and
chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements
of 17% and 28%, respectively.
comment: Need to be revised
♻ ☆ Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches
This paper investigates algorithmic bias in language-based models for
automated depression detection, focusing on socio-demographic disparities
related to gender and race/ethnicity. Models trained using deep neural networks
(DNN) based embeddings are compared to few-shot learning approaches with large
language models (LLMs), evaluating both performance and fairness on clinical
interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz
(DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to
DNN-based models, while in-context learning with varied prompt framing and shot
counts is explored for LLMs. Results indicate that LLMs outperform DNN-based
models in depression classification, particularly for underrepresented groups
such as Hispanic participants. LLMs also exhibit reduced gender bias compared
to DNN-based embeddings, though racial disparities persist. Among
fairness-aware techniques for mitigating bias in DNN-based embeddings, the
worst-group loss, which is designed to minimize loss for the worst-performing
demographic group, achieves a better balance between performance and fairness.
In contrast, the fairness-regularized loss minimizes loss across all groups but
performs less effectively. In LLMs, guided prompting with ethical framing helps
mitigate gender bias in the 1-shot setting. However, increasing the number of
shots does not lead to further reductions in disparities. For race/ethnicity,
neither prompting strategy nor increasing $N$ in $N$-shot learning effectively
reduces disparities.
comment: 7 pages, 1 figure. This paper has been accepted to the IEEE-EMBS
International Conference on Biomedical and Health Informatics (BHI 2025),
Georgia Institute of Technology, Atlanta, Georgia, October 26-29, 2025
♻ ☆ AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials
Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, Tong Xie
Large Language Models (LLMs) excel at textual reasoning and are beginning to
develop spatial understanding, prompting the question of whether these
abilities can be combined for complex, domain-specific tasks. This question is
essential in fields like materials science, where deep understanding of 3D
atomic structures is fundamental. While initial studies have successfully
applied LLMs to tasks involving pure crystal generation or coordinate
understandings, a standardized benchmark to systematically evaluate their core
reasoning abilities across diverse atomic structures has been notably absent.
To address this gap, we introduce the AtomWorld benchmark to evaluate LLMs on
tasks based in Crystallographic Information Files (CIFs), a standard structure
representation format. These tasks, including structural editing, CIF
perception, and property-guided modeling, reveal a critical limitation: current
models, despite establishing promising baselines, consistently fail in
structural understanding and spatial reasoning. Our experiments show that these
models make frequent errors on structure modification tasks, and even in the
basic CIF format understandings, potentially leading to cumulative errors in
subsequent analysis and materials insights. By defining these standardized
tasks, AtomWorld lays the ground for advancing LLMs toward robust atomic-scale
modeling, crucial for accelerating materials research and automating scientific
workflows.
♻ ☆ LLM Unlearning Without an Expert Curated Dataset
Modern large language models often encode sensitive, harmful, or copyrighted
knowledge, raising the need for post-hoc unlearning-the ability to remove
specific domains of knowledge from a model without full retraining. A major
bottleneck in current unlearning pipelines is constructing effective forget
sets-datasets that approximate the target domain and guide the model to forget
it. In this work, we introduce a scalable, automated approach to generate
high-quality forget sets using language models themselves. Our method
synthesizes textbook-style data through a structured prompting pipeline,
requiring only a domain name as input. Through experiments on unlearning
biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic
datasets consistently outperform the baseline synthetic alternatives and are
comparable to the expert-curated ones. Additionally, ablation studies reveal
that the multi-step generation pipeline significantly boosts data diversity,
which in turn improves unlearning utility. Overall, our findings suggest that
synthetic datasets offer a promising path toward practical, scalable unlearning
for a wide range of emerging domains without the need for manual intervention.
We release our code and dataset at
https://github.com/xyzhu123/Synthetic_Textbook.
♻ ☆ Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time ICLR 2026
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
Language model finetuning often results in learning undesirable traits in
combination with desired ones. To address this, we propose inoculation
prompting: modifying finetuning data by prepending a short system-prompt
instruction that deliberately elicits the undesirable trait. At test time, we
evaluate without the instruction; inoculated models have much lower expression
of the trait than models trained with unmodified training data. Inoculation is
selective: in a toy setting where assistant responses are always in Spanish and
ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'')
teaches the model to capitalize responses while still responding in English. We
find that inoculation is also effective across several additional settings:
reducing emergent misalignment (EM) from task-specific finetuning, defending
against backdoor injections, and mitigating the transmission of traits via
subliminal learning. Follow-up analysis suggests a mechanism: making a trait
less surprising via inoculation reduces optimization pressure to globally
update the model, thereby reducing the degree of generalization. Our analysis
relates to prior work on EM: inoculation explains prior findings that
educational contexts mitigate EM from insecure code. Beyond demonstrating a
simple and effective technique for selective learning, our results contribute
to a better conceptual understanding of how and why language models generalize.
comment: 40 pages, 22 figures In proceedings at ICLR 2026
♻ ☆ Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
Video Large Multimodal Models (VLMMs) have made impressive strides in
understanding video content, but they often struggle with abstract and adaptive
reasoning-the ability to revise their interpretations when new information
emerges. In reality, conclusions are rarely set in stone; additional context
can strengthen or weaken an initial inference. To address this, we introduce
Defeasible Video Entailment (DVidE), a new task that challenges models to think
like doubters, constantly updating their reasoning based on evolving evidence.
In DVidE, given a video premise and a textual hypothesis, models must determine
whether a new update strengthens or weakens the hypothesis (classification
version) or generate a coherent update that modifies the entailment
relationship (generation version). For solving the classification task, we
propose the Chain of Counterfactual Thought framework, utilizing counterfactual
reasoning, ASR-enhanced video content, and rationale refinement to reduce
inference bias. For the generation task, we develop a framework that combines
ASR output with a Large Language Model (LLM) to produce coherent, contextually
relevant updates aligned with the intended strengthener or weakener goals.
Additionally, we introduce a novel benchmark dataset, with
strengthener/weakener annotations and an LLM-based evaluation metric
specifically designed for assessing generative performance. Experimental
results demonstrate significant improvements, highlighting our proposed method
in enhancing dynamic reasoning capabilities of VLMMs.
♻ ☆ GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing
Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing
generative engines, such as LLM-based chatbots, by seamlessly integrating
relevant advertisements into their responses. At the core of GEM lies the
generation and evaluation of ad-injected responses. However, existing
benchmarks are not specifically designed for this purpose, which limits future
research. To address this gap, we propose GEM-Bench, the first comprehensive
benchmark for ad-injected response generation in GEM. GEM-Bench includes three
curated datasets covering both chatbot and search scenarios, a metric ontology
that captures multiple dimensions of user satisfaction and engagement, and
several baseline solutions implemented within an extensible multi-agent
framework. Our preliminary results indicate that, while simple prompt-based
methods achieve reasonable engagement such as click-through rate, they often
reduce user satisfaction. In contrast, approaches that insert ads based on
pre-generated ad-free responses help mitigate this issue but introduce
additional overhead. These findings highlight the need for future research on
designing more effective and efficient solutions for generating ad-injected
responses in GEM. The benchmark and all related resources are publicly
available at https://gem-bench.org/.
comment: Include more experimental results and supplementary materials
♻ ☆ PLSemanticsBench: Large Language Models As Programming Language Interpreters
As large language models (LLMs) excel at code reasoning, a natural question
arises: can an LLM execute programs (i.e., act as an interpreter) purely based
on a programming language's formal semantics? If so, it will enable rapid
prototyping of new programming languages and language features. We study this
question using the imperative language IMP (a subset of C), formalized via
small-step operational semantics (SOS) and rewriting-based operational
semantics (K-semantics). We introduce three evaluation sets-Human-Written,
LLM-Translated, and Fuzzer- Generated-whose difficulty is controlled by
code-complexity metrics spanning the size, control-flow, and data-flow axes.
Given a program and its semantics formalized with SOS/K-semantics, models are
evaluated on three tasks ranging from coarse to fine: (1) final-state
prediction, (2) semantic rule prediction, and (3) execution trace prediction.
To distinguish pretraining memorization from semantic competence, we define two
nonstandard semantics obtained through systematic mutations of the standard
rules. Across strong code/reasoning LLMs, performance drops under nonstandard
semantics despite high performance under the standard one. We further find that
(i) there are patterns to different model failures, (ii) most reasoning models
perform exceptionally well on coarse grained tasks involving reasoning about
highly complex programs often containing nested loop depths beyond five, and
surprisingly, (iii) providing formal semantics helps on simple programs but
often hurts on more complex ones. Overall, the results show a promise that LLMs
could serve as programming language interpreters, but points to the lack of
their robust semantics understanding. We release the benchmark and the
supporting code at https://github.com/EngineeringSoftware/PLSemanticsBench.
♻ ☆ COLE: a Comprehensive Benchmark for French Language Understanding Evaluation ACL
To address the need for a more comprehensive evaluation of French Natural
Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23
diverse task covering a broad range of NLU capabilities, including sentiment
analysis, paraphrase detection, grammatical judgment, and reasoning, with a
particular focus on linguistic phenomena relevant to the French language. We
benchmark 94 large language models (LLM), providing an extensive analysis of
the current state of French NLU. Our results highlight a significant
performance gap between closed- and open-weights models and identify key
challenging frontiers for current LLMs, such as zero-shot extractive
question-answering (QA), fine-grained word sense disambiguation, and
understanding of regional language variations. We release COLE as a public
resource to foster further progress in French language modelling.
comment: Submitted to ACL Rolling Review of October
♻ ☆ Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
Artificial intelligence systems based on large language models (LLMs) are
increasingly used as agents that interact with users and with the world. To do
so successfully, LLMs need to construct internal representations of the world
and form probabilistic beliefs about those representations. To provide a user
with personalized recommendations, for example, the LLM needs to gradually
infer the user's preferences, over the course of multiple interactions. To
evaluate whether contemporary LLMs are able to do so, we use the Bayesian
inference framework from probability theory, which lays out the optimal way to
update an agent's beliefs as it receives new information. We first show that
LLMs do not update their beliefs as expected from the Bayesian framework, and
that consequently their predictions do not improve as expected as more
information becomes available. To address this issue, we teach the LLMs to
reason in a Bayesian manner by training them to mimic the predictions of the
normative Bayesian model. We find that this approach not only significantly
improves the LLM's performance on the particular recommendation task it is
trained on, but also enables generalization to other tasks. This suggests that
this method teaches the LLM to better approximate Bayesian reasoning. More
generally, our results indicate that LLMs can effectively learn reasoning
skills from examples and generalize those skills to new domains.
♻ ☆ What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts
Prompt underspecification is a common challenge when interacting with LLMs.
In this paper, we present an in-depth analysis of this problem, showing that
while LLMs can often infer unspecified requirements by default (41.1%), such
behavior is fragile: Under-specified prompts are 2x as likely to regress across
model or prompt changes, sometimes with accuracy drops exceeding 20%. This
instability makes it difficult to reliably build LLM applications. Moreover,
simply specifying all requirements does not consistently help, as models have
limited instruction-following ability and requirements can conflict. Standard
prompt optimizers likewise provide little benefit. To address these issues, we
propose requirements-aware prompt optimization mechanisms that improve
performance by 4.8% on average over baselines. We further advocate for a
systematic process of proactive requirements discovery, evaluation, and
monitoring to better manage prompt underspecification in practice.
♻ ☆ ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
This paper introduces ExpertLongBench, an expert-level benchmark containing
11 tasks from 9 domains that reflect realistic expert workflows and
applications. Beyond question answering, the application-driven tasks in
ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and
strict adherence to domain-specific requirements. Notably, each task in
ExpertLongBench includes a rubric, designed or validated by domain experts, to
specify task requirements and guide output evaluation. Furthermore, we propose
CLEAR, an evaluation framework that supports accurate evaluation of long-form
model outputs in our benchmark. To achieve fine-grained, expert-aligned
evaluation, CLEAR derives checklists from both model outputs and references by
extracting information corresponding to items in the task-specific rubric.
Checklist items of model outputs are then compared with corresponding items of
reference outputs to assess their correctness, enabling grounded evaluation. We
benchmark 13 popular large language models (LLMs) and analyze components in
CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro
achieving only a 33.4 F1 score, require significant improvement for
expert-level tasks; (2) models can generate content corresponding to the
required aspects, but far from correct; and (3) accurate checklist extraction
and comparison in CLEAR can be achieved by open-weight models for more
scalable, reproducible, and low-cost usage.
♻ ☆ Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions EMNLP 2025
Recent advancements in large language models (LLMs) and AI systems have led
to a paradigm shift in the design and optimization of complex AI workflows. By
integrating multiple components, compound AI systems have become increasingly
adept at performing sophisticated tasks. However, as these systems grow in
complexity, new challenges arise in optimizing not only individual components
but also their interactions. While traditional optimization methods such as
supervised fine-tuning (SFT) and reinforcement learning (RL) remain
foundational, the rise of natural language feedback introduces promising new
approaches, especially for optimizing non-differentiable systems. This paper
provides a systematic review of recent progress in optimizing compound AI
systems, encompassing both numerical and language-based techniques. We
formalize the notion of compound AI system optimization, classify existing
methods along several key dimensions, and highlight open research challenges
and future directions in this rapidly evolving field. A list of surveyed papers
is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
comment: Accepted to EMNLP 2025 (Main)