Computation and Language
☆ How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Recent reasoning models show the ability to reflect, backtrack, and
self-validate their reasoning, which is crucial in spotting mistakes and
arriving at accurate solutions. A natural question that arises is how
effectively models can perform such self-reevaluation. We tackle this question
by investigating how well reasoning models identify and recover from four types
of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to
the question, thoughts misdirecting the question as a slightly different
question, and thoughts that lead to incorrect answers. We show that models are
effective at identifying most unhelpful thoughts but struggle to recover from
the same thoughts when these are injected into their thinking process, causing
significant performance drops. Models tend to naively continue the line of
reasoning of the injected irrelevant thoughts, which showcases that their
self-reevaluation abilities are far from a general "meta-cognitive" awareness.
Moreover, we observe non/inverse-scaling trends, where larger models struggle
more than smaller ones to recover from short irrelevant thoughts, even when
instructed to reevaluate their reasoning. We demonstrate the implications of
these findings with a jailbreak experiment using irrelevant thought injection,
showing that the smallest models are the least distracted by
harmful-response-triggering thoughts. Overall, our findings call for
improvement in self-reevaluation of reasoning models to develop better
reasoning and safer systems.
☆ AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Model (LLM) agents have shown great potential in addressing
real-world data science problems. LLM-driven data science agents promise to
automate the entire machine learning pipeline, yet their real-world
effectiveness remains limited. Existing frameworks depend on rigid, pre-defined
workflows and inflexible coding strategies; consequently, they excel only on
relatively simple, classical problems and fail to capture the empirical
expertise that human practitioners bring to complex, innovative tasks. In this
work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework
that overcomes these deficiencies through three key advances: (1) a curated
expert knowledge base that grounds the agent in domain expert knowledge, (2) an
agentic knowledgeable tree search algorithm that strategically explores
possible solutions, and (3) a self-adaptive coding strategy that dynamically
tailors code generation to task complexity. Evaluations on two automated data
science benchmarks demonstrate that AutoMind delivers superior performance
versus state-of-the-art baselines. Additional analyses confirm favorable
effectiveness, efficiency, and qualitative solution quality, highlighting
AutoMind as an efficient and robust step toward fully automated data science.
comment: Ongoing work. Code is at https://github.com/innovatingAI/AutoMind
☆ MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
In this paper, we introduce knowledge image generation as a new task,
alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation
Benchmark (MMMG) to probe the reasoning capability of image generation models.
Knowledge images have been central to human civilization and to the mechanisms
of human learning--a fact underscored by dual-coding theory and the
picture-superiority effect. Generating such images is challenging, demanding
multimodal reasoning that fuses world knowledge with pixel-level grounding into
clear explanatory visuals. To enable comprehensive evaluation, MMMG offers
4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines,
6 educational levels, and diverse knowledge formats such as charts, diagrams,
and mind maps. To eliminate confounding complexity during evaluation, we adopt
a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a
target image's core entities and their dependencies. We further introduce
MMMG-Score to evaluate generated knowledge images. This metric combines factual
fidelity, measured by graph-edit distance between KGs, with visual clarity
assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image
generation models expose serious reasoning deficits--low entity fidelity, weak
relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20,
underscoring the benchmark's difficulty. To spur further progress, we release
FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines
a reasoning LLM with diffusion models and is trained on 16,000 curated
knowledge image-prompt pairs.
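As a rough illustration of how a graph-based fidelity term might be combined with a clarity term, here is a minimal sketch. It is not the paper's MMMG-Score: the abstract does not give the formula, so the name-matched edit distance, the normalization, the multiplicative combination, and all function names below are assumptions.

```python
def graph_edit_distance(kg_a, kg_b):
    """Simplified edit distance between two knowledge graphs.

    Each KG is a (nodes, edges) pair: a set of entity names and a set of
    (head, relation, tail) triples. Nodes are matched by name (an assumption),
    so the distance is just the number of node and edge insertions/deletions.
    """
    nodes_a, edges_a = kg_a
    nodes_b, edges_b = kg_b
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def factual_fidelity(reference, generated):
    """Map the edit distance to [0, 1]; 1.0 means the graphs are identical."""
    worst = sum(len(part) for part in reference) + sum(len(part) for part in generated)
    if worst == 0:
        return 1.0
    return 1.0 - graph_edit_distance(reference, generated) / worst


def combined_score(reference, generated, clarity):
    """Hypothetical combination: fidelity times a clarity value in [0, 1], scaled to 100."""
    return 100.0 * factual_fidelity(reference, generated) * clarity


reference_kg = (
    {"Sun", "Earth", "Moon"},
    {("Earth", "orbits", "Sun"), ("Moon", "orbits", "Earth")},
)
generated_kg = ({"Sun", "Earth"}, {("Earth", "orbits", "Sun")})
score = combined_score(reference_kg, generated_kg, clarity=0.8)
```

In this toy case the generated graph misses one entity and one relation, so the fidelity term drops below 1 and the clarity value scales the result further.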
☆ ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Large language models (LLMs) have been increasingly applied to automated
harmful content detection tasks, assisting moderators in identifying policy
violations and improving the overall efficiency and accuracy of content review.
However, existing resources for harmful content detection are predominantly
focused on English, with Chinese datasets remaining scarce and often limited in
scope. We present a comprehensive, professionally annotated benchmark for
Chinese content harm detection, which covers six representative categories and
is constructed entirely from real-world data. Our annotation process further
yields a knowledge rule base that provides explicit expert knowledge to assist
LLMs in Chinese harmful content detection. In addition, we propose a
knowledge-augmented baseline that integrates both human-annotated knowledge
rules and implicit knowledge from large language models, enabling smaller
models to achieve performance comparable to state-of-the-art LLMs. Code and
data are available at https://github.com/zjunlp/ChineseHarm-bench.
comment: Work in progress
☆ Build the web for agents, not agents for the web
Recent advancements in Large Language Models (LLMs) and multimodal
counterparts have spurred significant interest in developing web agents -- AI
systems capable of autonomously navigating and completing tasks within web
environments. While holding tremendous promise for automating complex web
interactions, current approaches face substantial challenges due to the
fundamental mismatch between human-designed interfaces and LLM capabilities.
Current methods struggle with the inherent complexity of web inputs, whether
processing massive DOM trees, relying on screenshots augmented with additional
information, or bypassing the user interface entirely through API interactions.
This position paper advocates for a paradigm shift in web agent research:
rather than forcing web agents to adapt to interfaces designed for humans, we
should develop a new interaction paradigm specifically optimized for agentic
capabilities. To this end, we introduce the concept of an Agentic Web Interface
(AWI), an interface specifically designed for agents to navigate a website. We
establish six guiding principles for AWI design, emphasizing safety,
efficiency, and standardization, to account for the interests of all primary
stakeholders. This reframing aims to overcome fundamental limitations of
existing interfaces, paving the way for more efficient, reliable, and
transparent web agent design, which will be a collaborative effort involving
the broader ML community.
☆ Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training ICML2025
We introduce Domain2Vec, a novel approach that decomposes any
dataset into a linear combination of several meta-domains, a new concept
designed to capture the key underlying features of datasets.
Domain2Vec maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the Distribution Alignment Assumption (DA²),
which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, Domain2Vec can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
Domain2Vec helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
Domain2Vec achieves the same validation loss on Pile-CC using only
51.5% of the computation required when training on the original mixture of
The Pile dataset. Under equivalent compute budget, Domain2Vec improves
downstream performance by an average of 2.83%.
comment: Accepted to ICML2025
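The distribution-alignment idea can be sketched as a small search: given a domain vector for each training dataset and one for the validation set, pick the mixture weights whose combined vector lies closest to the validation vector. This is only an illustration under stated assumptions; the Euclidean distance, the coarse grid search, and every name below are hypothetical choices, not the authors' procedure.

```python
import math
from itertools import product


def mix(domain_vectors, weights):
    """Domain vector of a mixture: the weighted sum of per-dataset domain vectors."""
    dim = len(domain_vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, domain_vectors)) for k in range(dim)]


def l2_distance(p, q):
    """Euclidean distance between two domain vectors (an assumed alignment measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights (multiples of 1/steps summing to 1) closest to target."""
    best_weights, best_dist = None, float("inf")
    for counts in product(range(steps + 1), repeat=len(domain_vectors)):
        if sum(counts) != steps:
            continue  # keep only points on the probability simplex
        weights = [c / steps for c in counts]
        dist = l2_distance(mix(domain_vectors, weights), target)
        if dist < best_dist:
            best_weights, best_dist = weights, dist
    return best_weights, best_dist


# Three toy training datasets described over three meta-domains.
train_vectors = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
validation_vector = [0.45, 0.45, 0.10]
weights, distance = best_mixture(train_vectors, validation_vector)
```

No model is trained at any point: the search works only on the domain vectors, which is the sense in which such a procedure is training-free.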
☆ GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Unlearning in large language models (LLMs) is becoming increasingly important
due to regulatory compliance, copyright protection, and privacy concerns.
However, a key challenge in LLM unlearning is unintended forgetting, where the
removal of specific data inadvertently impairs the utility of the model and its
retention of valuable, desired information. While prior work has primarily
focused on architectural innovations, the influence of data-level factors on
unlearning performance remains underexplored. As a result, existing methods
often suffer from degraded retention when forgetting high-impact data. To
address this, we propose GUARD, a novel framework for Guided Unlearning And
Retention via Data attribution. At its core, GUARD introduces a lightweight
proxy data attribution metric tailored for LLM unlearning, which quantifies the
"alignment" between the forget and retain sets while remaining computationally
efficient. Building on this, we design a novel unlearning objective that
assigns adaptive, nonuniform unlearning weights to samples, inversely
proportional to their proxy attribution scores. Through such a reallocation of
unlearning power, GUARD mitigates unintended losses in retention. We provide
rigorous theoretical guarantees that GUARD significantly enhances retention
while maintaining forgetting metrics comparable to prior methods. Extensive
experiments on the TOFU benchmark across multiple LLM architectures demonstrate
that GUARD substantially improves utility preservation while ensuring effective
unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to
194.92% in terms of Truth Ratio when forgetting 10% of the training data.
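The reweighting idea described above, unlearning weights inversely proportional to proxy attribution scores, can be sketched as follows. The abstract does not specify the attribution metric or the exact objective, so the positive toy scores, the normalization to sum to 1, the `eps` guard, and the function names are all assumptions made for illustration.

```python
def unlearning_weights(attribution_scores, eps=1e-8):
    """Weights inversely proportional to (assumed positive) attribution scores.

    Forget samples that are strongly aligned with the retain set (high score)
    receive less unlearning weight. Weights are normalized to sum to 1.
    """
    inverse = [1.0 / (score + eps) for score in attribution_scores]
    total = sum(inverse)
    return [value / total for value in inverse]


def weighted_forget_loss(per_sample_losses, weights):
    """Hypothetical nonuniform objective: a weighted sum of per-sample forget losses."""
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


scores = [1.0, 2.0, 4.0]           # toy proxy attribution scores
weights = unlearning_weights(scores)
loss = weighted_forget_loss([0.7, 0.7, 0.7], weights)
```

With these toy scores, the sample with the lowest attribution score gets roughly four times the weight of the sample with the highest one.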
☆ VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
In-context image editing aims to modify images based on a contextual sequence
comprising text and previously generated images. Existing methods typically
depend on task-specific pipelines and expert models (e.g., segmentation and
inpainting) to curate training data. In this work, we explore whether an
in-context image editing model can be learned directly from videos. We
introduce a scalable approach to annotate videos as interleaved multimodal
sequences. To effectively learn from this data, we design a block-causal
diffusion transformer trained on three proxy tasks: next-image prediction,
current segmentation prediction, and next-segmentation prediction.
Additionally, we propose a novel multi-turn image editing benchmark to advance
research in this area. Extensive experiments demonstrate that our model
exhibits strong in-context image editing capabilities and achieves
state-of-the-art results on two multi-turn image editing benchmarks. Despite
being trained exclusively on videos, our model also shows promising abilities
in multi-concept composition, story generation, and chain-of-editing
applications.
comment: Project page: https://vincie2025.github.io/
☆ Dynamic Epistemic Friction in Dialogue CoNLL 2025
Recent developments in aligning Large Language Models (LLMs) with human
preferences have significantly enhanced their utility in human-AI collaborative
scenarios. However, such approaches often neglect the critical role of
"epistemic friction," or the inherent resistance encountered when updating
beliefs in response to new, conflicting, or ambiguous information. In this
paper, we define dynamic epistemic friction as the resistance to epistemic
integration, characterized by the misalignment between an agent's current
belief state and new propositions supported by external evidence. We position
this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit,
2011), where friction emerges as nontrivial belief-revision during the
interaction. We then present analyses from a situated collaborative task that
demonstrate how this model of epistemic friction can effectively predict belief
updates in dialogues, and we subsequently discuss how the model of belief
alignment as a measure of epistemic resistance or friction can naturally be
made more sophisticated to accommodate the complexities of real-world dialogue
scenarios.
comment: 11 pages, 2 figures, 2 tables, CoNLL 2025
☆ Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Large language models (LLMs) are increasingly deployed in high-stakes hiring
applications, making decisions that directly impact people's careers and
livelihoods. While prior studies suggest simple anti-bias prompts can eliminate
demographic biases in controlled evaluations, we find these mitigations fail
when realistic contextual details are introduced. We address these failures
through internal bias mitigation: by identifying and neutralizing sensitive
attribute directions within model activations, we achieve robust bias reduction
across all tested scenarios. Across leading commercial (GPT-4o, Claude 4
Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3,
Mistral-24B), we find that adding realistic context such as company names,
culture descriptions from public careers pages, and selective hiring
constraints (e.g., "only accept candidates in the top 10%") induces
significant racial and gender biases (up to 12% differences in interview
rates). When these biases emerge, they consistently favor Black over White
candidates and female over male candidates across all tested models and
scenarios. Moreover, models can infer demographics and become biased from
subtle cues like college affiliations, with these biases remaining invisible
even when inspecting the model's chain-of-thought reasoning. To address these
limitations, our internal bias mitigation identifies race and gender-correlated
directions and applies affine concept editing at inference time. Despite using
directions from a simple synthetic dataset, the intervention generalizes
robustly, consistently reducing bias to very low levels (typically under 1%,
always below 2.5%) while largely maintaining model performance. Our findings
suggest that practitioners deploying LLMs for hiring should adopt more
realistic evaluation methodologies and consider internal mitigation strategies
for equitable outcomes.
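For readers unfamiliar with affine concept editing, one common formulation shifts an activation along a concept direction so that its component along that direction equals a fixed reference value, leaving all orthogonal components unchanged. The sketch below uses that formulation on plain Python lists; it is an assumption that this matches the paper's exact intervention, and the vectors and names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Normalize a direction to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def affine_edit(activation, direction, reference):
    """Set the activation's component along `direction` to that of `reference`.

    Only the coordinate along the (normalized) direction changes; everything
    orthogonal to it is preserved.
    """
    d = unit(direction)
    shift = dot(reference, d) - dot(activation, d)
    return [x + shift * di for x, di in zip(activation, d)]


h = [3.0, 4.0]                 # toy activation
concept_direction = [1.0, 0.0]  # toy attribute direction
neutral_point = [1.0, 0.0]      # toy reference, e.g. a mean activation
edited = affine_edit(h, concept_direction, neutral_point)  # [1.0, 4.0]
```

Because every edited activation ends up with the same value along the direction, a downstream readout can no longer distinguish inputs by that component.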
☆ Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
A central goal for mechanistic interpretability has been to identify the
right units of analysis in large language models (LLMs) that causally explain
their outputs. While early work focused on individual neurons, evidence that
neurons often encode multiple concepts has motivated a shift toward analyzing
directions in activation space. A key question is how to find directions that
capture interpretable features in an unsupervised manner. Current methods rely
on dictionary learning with sparse autoencoders (SAEs), commonly trained over
residual stream activations to learn directions from scratch. However, SAEs
often struggle in causal evaluations and lack intrinsic interpretability, as
their learning is not explicitly tied to the computations of the model. Here,
we tackle these limitations by directly decomposing MLP activations with
semi-nonnegative matrix factorization (SNMF), such that the learned features
are (a) sparse linear combinations of co-activated neurons, and (b) mapped to
their activating inputs, making them directly interpretable. Experiments on
Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs
and a strong supervised baseline (difference-in-means) on causal steering,
while aligning with human-interpretable concepts. Further analysis reveals that
specific neuron combinations are reused across semantically-related features,
exposing a hierarchical structure in the MLP's activation space. Together,
these results position SNMF as a simple and effective tool for identifying
interpretable features and dissecting concept representations in LLMs.
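The supervised baseline named above, difference-in-means, is simple enough to sketch directly: the steering direction is the mean activation on inputs that express a concept minus the mean on inputs that do not. The toy activations, the additive steering rule, and the function names are illustrative assumptions; this is not the SNMF method itself.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def difference_in_means(with_concept, without_concept):
    """Direction pointing from the 'without' mean toward the 'with' mean."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(activation, direction, alpha):
    """Add a scaled copy of the direction to an activation (additive steering)."""
    return [x + alpha * d for x, d in zip(activation, direction)]


positive = [[2, 0], [4, 0]]   # toy activations on concept-bearing inputs
negative = [[0, 0], [0, 2]]   # toy activations on other inputs
direction = difference_in_means(positive, negative)  # [3.0, -1.0]
steered = steer([1.0, 1.0], direction, alpha=0.5)    # [2.5, 0.5]
```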
☆ Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Toxicity remains a leading cause of early-stage drug development failure.
Despite advances in molecular design and property prediction, the task of
molecular toxicity repair - generating structurally valid molecular
alternatives with reduced toxicity - has not yet been systematically defined or
benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task
for general-purpose Multimodal Large Language Models (MLLMs) focused on
molecular toxicity repair. We construct a standardized dataset covering 11
primary tasks and 560 representative toxic molecules spanning diverse
mechanisms and granularities. We design a prompt annotation pipeline with
mechanism-aware and task-adaptive capabilities, informed by expert
toxicological knowledge. In parallel, we propose an automated evaluation
framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic
accessibility, drug-likeness, and structural similarity into a high-throughput
evaluation chain for repair success. We systematically assess nearly 30
mainstream general-purpose MLLMs and design multiple ablation studies to
analyze key factors such as evaluation criteria, candidate diversity, and
failure attribution. Experimental results show that although current MLLMs
still face significant challenges on this task, they begin to demonstrate
promising capabilities in toxicity understanding, semantic constraint
adherence, and structure-aware molecule editing.
☆ Magistral
Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang
We introduce Magistral, Mistral's first reasoning model and our own scalable
reinforcement learning (RL) pipeline. Instead of relying on existing
implementations and RL traces distilled from prior models, we follow a
ground-up approach, relying solely on our own models and infrastructure. Notably, we
demonstrate a stack that enabled us to explore the limits of pure RL training
of LLMs, present a simple method to force the reasoning language of the model,
and show that RL on text data alone maintains most of the initial checkpoint's
capabilities. We find that RL on text maintains or improves multimodal
understanding, instruction following and function calling. We present Magistral
Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we
open-source Magistral Small (Apache 2.0) which further includes cold-start data
from Magistral Medium.
☆ Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Autoformalization plays a crucial role in formal mathematical reasoning by
enabling the automatic translation of natural language statements into formal
languages. While recent advances using large language models (LLMs) have shown
promising results, methods for automatically evaluating autoformalization
remain underexplored. As one moves to more complex domains (e.g., advanced
mathematics), human evaluation requires significant time and domain expertise,
especially as the complexity of the underlying statements and background
knowledge increases. LLM-as-a-judge presents a promising approach for
automating such evaluation. However, existing methods typically employ
coarse-grained and generic evaluation criteria, which limit their effectiveness
for advanced formal mathematical reasoning, where quality hinges on nuanced,
multi-granular dimensions. In this work, we take a step toward addressing this
gap by introducing a systematic, automatic method to evaluate autoformalization
tasks. The proposed method is based on an epistemically and formally grounded
ensemble (EFG) of LLM judges, defined on criteria encompassing logical
preservation (LP), mathematical consistency (MC), formal validity (FV), and
formal quality (FQ), resulting in a transparent assessment that accounts for
different contributing factors. We validate the proposed framework to serve as
a proxy for autoformalization assessment within the domain of formal
mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM
judges is a suitable emerging proxy for evaluation, more strongly correlating
with human assessments than a coarse-grained model, especially when assessing
formal qualities. These findings suggest that LLM judges, especially when
guided by a well-defined set of atomic properties, could offer scalable,
interpretable, and reliable support for evaluating formal mathematical
reasoning.
☆ BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall
Encoder-based transformer models are central to biomedical and clinical
Natural Language Processing (NLP), as their bidirectional self-attention makes
them well-suited for efficiently extracting structured information from
unstructured text through discriminative tasks. However, encoders have seen
slower development compared to decoder models, leading to limited domain
adaptation in biomedical and clinical settings. We introduce BioClinical
ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT
release, incorporating long-context processing and substantial improvements in
speed and performance for biomedical and clinical NLP. BioClinical ModernBERT
is developed through continued pretraining on the largest biomedical and
clinical corpus to date, with over 53.5 billion tokens, and addresses a key
limitation of prior clinical encoders by leveraging 20 datasets from diverse
institutions, domains, and geographic regions, rather than relying on data from
a single source. It outperforms existing biomedical and clinical encoders on
four downstream tasks spanning a broad range of use cases. We release both base
(150M parameters) and large (396M parameters) versions of BioClinical
ModernBERT, along with training checkpoints to support further research.
☆ The Diffusion Duality ICML 2025
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Uniform-state discrete diffusion models hold the promise of fast text
generation due to their inherent ability to self-correct. However, they are
typically outperformed by autoregressive models and masked diffusion models. In
this work, we narrow this performance gap by leveraging a key insight:
Uniform-state diffusion processes naturally emerge from an underlying Gaussian
diffusion. Our method, Duo, transfers powerful techniques from Gaussian
diffusion to improve both training and sampling. First, we introduce a
curriculum learning strategy guided by the Gaussian process, doubling training
speed by reducing variance. Models trained with curriculum learning surpass
autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we
present Discrete Consistency Distillation, which adapts consistency
distillation from the continuous to the discrete setting. This algorithm
unlocks few-step generation in diffusion language models by accelerating
sampling by two orders of magnitude. We provide the code and model checkpoints
on the project page: http://s-sahoo.github.io/duo
comment: ICML 2025. We provide the code at: https://github.com/s-sahoo/duo
☆ Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Large language models (LLMs) can acquire new knowledge through fine-tuning,
but this process exhibits a puzzling duality: models can generalize remarkably
from new facts, yet are also prone to hallucinating incorrect information.
However, the reasons for this phenomenon remain poorly understood. In this
work, we argue that both behaviors stem from a single mechanism known as
out-of-context reasoning (OCR): the ability to deduce implications by
associating concepts, even those without a causal link. Our experiments across
five prominent LLMs confirm that OCR indeed drives both generalization and
hallucination, depending on whether the associated concepts are causally
related. To build a rigorous theoretical understanding of this phenomenon, we
then formalize OCR as a synthetic factual recall task. We empirically show that
a one-layer single-head attention-only transformer with factorized output and
value matrices can learn to solve this task, while a model with combined
weights cannot, highlighting the crucial role of matrix factorization. Our
theoretical analysis shows that the OCR capability can be attributed to the
implicit bias of gradient descent, which favors solutions that minimize the
nuclear norm of the combined output-value matrix. This mathematical structure
explains why the model learns to associate facts and implications with high
sample efficiency, regardless of whether the correlation is causal or merely
spurious. Ultimately, our work provides a theoretical foundation for
understanding the OCR phenomenon, offering a new lens for analyzing and
mitigating undesirable behaviors from knowledge injection.
☆ Slimming Down LLMs Without Losing Their Minds
This paper investigates and validates the impact of fine-tuning on large
language model performance, focusing on parameter-efficient methods (LoRA and
QLoRA). We evaluate model capabilities across three key domains: (1)
commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3)
multi-domain knowledge (MMLU-CS).
Our findings demonstrate that: (1) LoRA-based methods effectively improve
task-specific performance while maintaining computational efficiency, and (2)
performance strongly depends on alignment between fine-tuning dataset and
benchmark tasks. The study provides both theoretical insights into
parameter-efficient mechanisms and practical guidance for developers
implementing efficient LLM adaptation with limited resources.
comment: 10 pages
☆ Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment ACL 2025
Medical dialogue systems (MDS) have emerged as crucial online platforms for
enabling multi-turn, context-aware conversations with patients. However,
existing MDS often struggle to (1) identify relevant medical knowledge and (2)
generate personalized, medically accurate responses. To address these
challenges, we propose MedRef, a novel MDS that incorporates knowledge refining
and dynamic prompt adjustment. First, we employ a knowledge refining mechanism
to filter out irrelevant medical data, improving predictions of critical
medical entities in responses. Additionally, we design a comprehensive prompt
structure that incorporates historical details and evident details. To enable
real-time adaptability to diverse patient conditions, we implement two key
modules, Triplet Filter and Demo Selector, providing appropriate knowledge and
demonstrations equipped in the system prompt. Extensive experiments on MedDG
and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in
both generation quality and medical entity accuracy, underscoring its
effectiveness and reliability for real-world healthcare applications.
comment: ACL 2025 Findings
☆ Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Analyses of self-supervised speech models have begun to reveal where and how
they represent different types of information. However, almost all analyses
have focused on English. Here, we examine how wav2vec2 models trained on four
different languages encode both language-matched and non-matched speech. We use
probing classifiers and geometric analyses to examine how phones, lexical
tones, and speaker information are represented. We show that for all
pretraining and test languages, the subspaces encoding phones, tones, and
speakers are largely orthogonal, and that layerwise patterns of probing
accuracy are similar, with a relatively small advantage for matched-language
phone and tone (but not speaker) probes in the later layers. Our findings
suggest that the structure of representations learned by wav2vec2 is largely
independent of the speech material used during pretraining.
☆ Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles
Diffusion-based language models (dLLMs) have emerged as a promising
alternative to traditional autoregressive LLMs by enabling parallel token
generation and significantly reducing inference latency. However, existing
sampling strategies for dLLMs, such as confidence-based or semi-autoregressive
decoding, often suffer from static behavior, leading to suboptimal efficiency
and limited flexibility. In this paper, we propose SlowFast Sampling, a novel
dynamic sampling strategy that adaptively alternates between exploratory and
accelerated decoding stages. Our method is guided by three golden principles:
certainty principle, convergence principle, and positional principle, which
govern when and where tokens can be confidently and efficiently decoded. We
further integrate our strategy with dLLM-Cache to reduce redundant computation.
Extensive experiments across benchmarks and models show that SlowFast Sampling
achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and
up to 34.22$\times$ when combined with caching. Notably, our approach
outperforms strong autoregressive baselines like LLaMA3 8B in throughput,
demonstrating that well-designed sampling can unlock the full potential of
dLLMs for fast and high-quality generation.
comment: 11 pages; 5 figures;
☆ CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training
This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG)
framework composed of specialized agents for subtasks such as planning,
searching, reasoning, and coordination. Our system uses a self-training
paradigm with reward-guided trajectory sampling to optimize inter-agent
collaboration and enhance response generation. Evaluated on DataMorgana-derived
datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms
conventional RAG baselines. We further analyze competition outcomes and
showcase the framework's strengths with case studies, demonstrating its
efficacy for complex, real-world RAG tasks.
☆ ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Recent advances in Chain-of-Thought (CoT) prompting have substantially
improved the reasoning capabilities of Large Language Models (LLMs). However,
these methods often suffer from overthinking, leading to unnecessarily lengthy
or redundant reasoning traces. Existing approaches attempt to mitigate this
issue through curating multiple reasoning chains for training LLMs, but their
effectiveness is often constrained by the quality of the generated data and
prone to overfitting. To address the challenge, we propose Reasoning
Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing
the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a
stepwise exploration mechanism and a long-short switched sampling strategy,
enabling LLMs to incrementally generate diverse reasoning paths. These paths
are evaluated and used to construct preference pairs to train two specialized
models (Gemini LLMs): one optimized for reasoning accuracy, the other for
shorter reasoning. A final integrated model is obtained by interpolating the
parameters of these two models. Experimental results across multiple math
reasoning datasets and backbone models demonstrate that ReCUT significantly
reduces reasoning lengths by approximately 30-50%, while maintaining or
improving reasoning accuracy compared to various baselines. All code and data
will be released via https://github.com/NEUIR/ReCUT.
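The parameter interpolation step mentioned in the ReCUT abstract can be illustrated with a minimal sketch. The abstract does not state the interpolation scheme, so the element-wise linear blend, the `alpha` weight, and the function name `interpolate_parameters` are assumptions made for illustration, with plain Python lists standing in for weight tensors.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_parameters(acc_params: Params, short_params: Params,
                           alpha: float = 0.5) -> Params:
    """Blend two parameter sets as alpha * acc + (1 - alpha) * short.

    Hypothetical helper: linear interpolation is an assumption, not a
    scheme confirmed by the abstract.
    """
    if acc_params.keys() != short_params.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Params = {}
    for name, acc_values in acc_params.items():
        short_values = short_params[name]
        if len(acc_values) != len(short_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [alpha * a + (1.0 - alpha) * s
                        for a, s in zip(acc_values, short_values)]
    return merged


accuracy_model = {"w": [1.0, 2.0]}   # toy weights of the accuracy-optimized model
brevity_model = {"w": [3.0, 4.0]}    # toy weights of the length-optimized model
merged_model = interpolate_parameters(accuracy_model, brevity_model, alpha=0.5)
```

Under this assumption, `alpha` close to 1 favors the accuracy-optimized model and `alpha` close to 0 favors the shorter-reasoning model.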
☆ VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Long video understanding (LVU) presents a significant challenge for current
multi-modal large language models (MLLMs) due to the task's inherent complexity
and context window constraint. It is widely assumed that addressing LVU tasks
requires foundation MLLMs with extended context windows, strong visual
perception capabilities, and proficient domain expertise. In this work, we
challenge this common belief by introducing VideoDeepResearch, a novel agentic
framework for long video understanding. Our approach relies solely on a
text-only large reasoning model (LRM) combined with a modular multi-modal
toolkit, including multimodal retrievers and visual perceivers, all of which
are readily available in practice. For each LVU task, the system formulates a
problem-solving strategy through reasoning, while selectively accessing and
utilizing essential video content via tool use. We conduct extensive
experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results demonstrate that VideoDeepResearch achieves substantial
improvements over existing MLLM baselines, surpassing the previous
state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and
LongVideoBench, respectively. These findings highlight the promise of agentic
systems in overcoming key challenges in LVU problems.
☆ Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints ACL 2025
Efficiently updating multilingual knowledge in large language models (LLMs),
while preserving consistent factual representations across languages, remains a
long-standing and unresolved challenge. While deploying separate editing
systems for each language might seem viable, this approach incurs substantial
costs due to the need to manage multiple models. A more efficient solution
involves integrating knowledge updates across all languages into a unified
model. However, performing sequential edits across languages often leads to
destructive parameter interference, significantly degrading multilingual
generalization and the accuracy of injected knowledge. To address this
challenge, we propose LangEdit, a novel null-space constrained framework
designed to precisely isolate language-specific knowledge updates. The core
innovation of LangEdit lies in its ability to project parameter updates for
each language onto the orthogonal complement of previously updated subspaces.
This approach mathematically guarantees update independence while preserving
multilingual generalization capabilities. We conduct a comprehensive evaluation
across three model architectures, six languages, and four downstream tasks,
demonstrating that LangEdit effectively mitigates parameter interference and
outperforms existing state-of-the-art editing methods. Our results highlight
its potential for enabling efficient and accurate multilingual knowledge
updates in LLMs. The code is available at
https://github.com/VRCMF/LangEdit.git.
comment: ACL 2025 Findings
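The projection described in the LangEdit abstract, onto the orthogonal complement of previously updated subspaces, can be sketched for flat vectors as follows. This is a generic linear-algebra illustration, not the authors' implementation: how LangEdit builds the subspaces and applies the projection to model weight matrices is not given in the abstract, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: Sequence[Sequence[float]],
                      tol: float = 1e-12) -> List[Vector]:
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_complement(update: Sequence[float],
                            previous_directions: Sequence[Sequence[float]]) -> Vector:
    """Remove from `update` every component lying in the span of earlier updates."""
    projected = list(update)
    for b in orthonormal_basis(previous_directions):
        c = dot(projected, b)
        projected = [p - c * bi for p, bi in zip(projected, b)]
    return projected


earlier_updates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy earlier-language updates
new_update = [2.0, 3.0, 4.0]                           # toy update for a new language
safe_update = project_onto_complement(new_update, earlier_updates)
```

In this toy case the projected update has no component along either earlier direction, which is the sense in which successive edits do not interfere.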
☆ FASCIST-O-METER: Classifier for Neo-fascist Discourse Online
Neo-fascism is a political and societal ideology that has grown remarkably
over the last decade in the United States of America (USA), as well as in
other Western societies. It poses a grave danger to democracy and the
minorities it targets, and it requires active countermeasures to avoid
escalation. This work presents the first-of-its-kind neo-fascist coding scheme
for digital discourse in the USA societal context, overseen by political
science researchers. Our work bridges the gap between Natural Language
Processing (NLP) and political science against this phenomenon. Furthermore, to
test the coding scheme, we collect a tremendous amount of activity on the
internet from notable neo-fascist groups (the forums of Iron March and
Stormfront.org), and the guidelines are applied to a subset of the collected
posts. Through crowdsourcing, we annotate a total of a thousand posts that are
labeled as neo-fascist or non-neo-fascist. With this labeled data set, we
fine-tune and test both Small Language Models (SLMs) and Large Language Models
(LLMs), obtaining the very first classification models for neo-fascist
discourse. We find that neo-fascist rhetoric is ever-present in this kind of
forum, making such forums a good target for future research. The
societal context is a key consideration for neo-fascist speech when conducting
NLP research. Finally, the work against this kind of political movement must be
pressed upon and continued for the well-being of a democratic society.
Disclaimer: This study focuses on detecting neo-fascist content in text,
similar to other hate speech analyses, without labeling individuals or
organizations.
☆ Improving Named Entity Transcription with Contextual LLM-based Revision
With recent advances in modeling and the increasing amount of supervised
training data, automatic speech recognition (ASR) systems have achieved
remarkable performance on general speech. However, the word error rate (WER) of
state-of-the-art ASR remains high for named entities. Since named entities are
often the most critical keywords, misrecognizing them can affect all downstream
applications, especially when the ASR system functions as the front end of a
complex system. In this paper, we introduce a large language model (LLM)
revision mechanism to revise incorrect named entities in ASR predictions by
leveraging the LLM's reasoning ability as well as local context (e.g., lecture
notes) containing a set of correct named entities. Finally, we introduce the
NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses
for development and testing. On this dataset, our proposed technique achieves
up to 30% relative WER reduction for named entities.
☆ Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs
Accurate and well-calibrated uncertainty estimates are essential for
deploying large language models (LLMs) in high-stakes domains such as clinical
decision support. We present a fine-grained evaluation of uncertainty
estimation methods for clinical multiple-choice question answering, covering
ten open-source LLMs (general-purpose, biomedical, and reasoning models) across
two datasets, eleven medical specialties, and six question types. We compare
standard single-generation and sampling-based methods, and present a case study
exploring simple, single-pass estimators based on behavioral signals in
reasoning traces. These lightweight methods approach the performance of
Semantic Entropy while requiring only one generation. Our results reveal
substantial variation across specialties and question types, underscoring the
importance of selecting models based on both the nature of the question and
model-specific strengths.
☆ One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Pretraining massively multilingual Large Language Models (LLMs) for many
languages at once is challenging due to limited model capacity, scarce
high-quality data, and compute constraints. Moreover, the lack of language
coverage of the tokenizer makes it harder to address the gap for new languages
purely at the post-training stage. In this work, we study what relatively cheap
interventions early on in training improve "language plasticity", or adaptation
capabilities of the model post-training to new languages. We focus on tokenizer
design and propose using a universal tokenizer that is trained for more
languages than the primary pretraining languages to enable efficient adaptation
in expanding language coverage after pretraining. Our systematic experiments
across diverse groups of languages and different training strategies show that
a universal tokenizer enables significantly higher language adaptation, with up
to 20.2% increase in win rates compared to tokenizers specific to pretraining
languages. Furthermore, a universal tokenizer also leads to better plasticity
towards languages that are completely unseen in the tokenizer and pretraining,
by up to 5% win rate gain. We achieve this adaptation to an expanded set of
languages with minimal compromise in performance on the majority of languages
included in pretraining.
☆ Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli
Automated question answering (QA) over electronic health records (EHRs) can
bridge critical information gaps for clinicians and patients, yet it demands
both precise evidence retrieval and faithful answer generation under limited
supervision. In this work, we present Neural, the runner-up in the BioNLP 2025
ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method
decouples the task into (1) sentence-level evidence identification and (2)
answer synthesis with explicit citations. For each stage, we automatically
explore the prompt space with DSPy's MIPROv2 optimizer, jointly tuning
instructions and few-shot demonstrations on the development set. A
self-consistency voting scheme further improves evidence recall without
sacrificing precision. On the hidden test set, our method attains an overall
score of 51.5, placing second while outperforming standard zero-shot and
few-shot prompting by over 20 and 10 points, respectively. These results
indicate that data-driven prompt optimization is a cost-effective alternative
to model fine-tuning for high-stakes clinical QA, advancing the reliability of
AI assistants in healthcare.
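The self-consistency voting idea in the Neural abstract can be sketched as a simple vote over repeated evidence-selection runs. The abstract does not give the voting rule, so the threshold `min_fraction` and the function name `vote_on_evidence` are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List


def vote_on_evidence(runs: List[Iterable[int]],
                     min_fraction: float = 0.5) -> List[int]:
    """Keep sentence IDs selected in at least `min_fraction` of the runs.

    Hypothetical rule: each run is the set of sentence IDs one sampled
    model output marked as evidence.
    """
    if not runs:
        return []
    # Count each sentence at most once per run.
    counts = Counter(sid for run in runs for sid in set(run))
    needed = min_fraction * len(runs)
    return sorted(sid for sid, count in counts.items() if count >= needed)


sampled_runs = [[1, 2], [2, 3], [2, 4, 1]]  # three toy sampled outputs
evidence = vote_on_evidence(sampled_runs)
```

Lowering `min_fraction` trades precision for recall, while raising it does the opposite.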
☆ TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora ACL 2025
The rapid evolution of scientific fields introduces challenges in organizing
and retrieving scientific literature. While expert-curated taxonomies have
traditionally addressed this need, the process is time-consuming and expensive.
Furthermore, recent automatic taxonomy construction methods either (1)
over-rely on a specific corpus, sacrificing generalizability, or (2) depend
heavily on the general knowledge of large language models (LLMs) contained
within their pre-training datasets, often overlooking the dynamic nature of
evolving scientific domains. Additionally, these approaches fail to account for
the multi-faceted nature of scientific literature, where a single research
paper may contribute to multiple dimensions (e.g., methodology, new tasks,
evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a
framework that dynamically adapts an LLM-generated taxonomy to a given corpus
across multiple dimensions. TaxoAdapt performs iterative hierarchical
classification, expanding both the taxonomy width and depth based on the corpus's
topical distribution. We demonstrate its state-of-the-art performance across a
diverse set of computer science conferences over the years to showcase its
ability to structure and capture the evolution of scientific fields. As a
multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more
granularity-preserving and 50.41% more coherent than the most competitive
baselines judged by LLMs.
comment: Accepted to ACL 2025 Main Conference. Code available at:
https://github.com/pkargupta/taxoadapt
☆ Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims ACL 2025
Claims made by individuals or entities are oftentimes nuanced and cannot be
clearly labeled as entirely "true" or "false" -- as is frequently the case with
scientific and political claims. However, a claim (e.g., "vaccine A is better
than vaccine B") can be dissected into its integral aspects and sub-aspects
(e.g., efficacy, safety, distribution), which are individually easier to
validate. This enables a more comprehensive, structured response that provides
a well-rounded perspective on a given problem while also allowing the reader to
prioritize specific angles of interest within the claim (e.g., safety towards
children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based
framework for automatically constructing a hierarchy of aspects typically
considered when addressing a claim and enriching them with corpus-specific
perspectives. This structure hierarchically partitions an input corpus to
retrieve relevant segments, which assist in discovering new sub-aspects.
Moreover, these segments enable the discovery of varying perspectives towards
an aspect of the claim (e.g., support, neutral, or oppose) and their respective
prevalence (e.g., "how many biomedical papers believe vaccine A is more
transportable than B?"). We apply ClaimSpect to a wide variety of real-world
scientific and political claims featured in our constructed dataset, showcasing
its robustness and accuracy in deconstructing a nuanced claim and representing
perspectives within a corpus. Through real-world case studies and human
evaluation, we validate its effectiveness over multiple baselines.
comment: Accepted to ACL 2025 Main Conference. Code available at:
https://github.com/pkargupta/claimspect
☆ PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve
strong performance on mathematical benchmarks using lengthy chain-of-thought
(CoT) reasoning, but the resulting traces are often unnecessarily verbose. This
inflates token usage and cost, limiting deployment in latency-sensitive or
API-constrained settings. We introduce PREMISE (PRompt-based Efficient
Mathematical Inference with Strategic Evaluation), a prompt-only framework that
reduces reasoning overhead without modifying model weights. PREMISE combines
trace-level diagnostics with gradient-inspired prompt optimization to minimize
redundant computation while preserving answer accuracy. The approach jointly
optimizes brevity and correctness through a multi-objective textual search that
balances token length and answer validity. Unlike prior work, PREMISE runs in a
single-pass black-box interface, so it can be applied directly to commercial
LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy
($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while
reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by
$69$--$82\%$. These results show that prompt-level optimization is a practical
and scalable path to efficient LRM inference without compromising reasoning
quality.
☆ Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet
Open English Wordnet is a key resource published in OntoLex-lemon as part of
the linguistic linked open data cloud. There are, however, many links missing
in the resource, and in this paper, we look at how we can establish hypernymy
between adjectives. We present a theoretical discussion of the hypernymy
relation and how it differs for adjectives in contrast to nouns and verbs. We
develop a new resource for adjective hypernymy and fine-tune large language
models to predict adjective hypernymy, showing that the methodology of
TaxoLLaMa can be adapted to this task.
☆ Large Language Models for Detection of Life-Threatening Texts
Detecting life-threatening language is essential for safeguarding individuals
in distress, promoting mental health and well-being, and preventing potential
harm and loss of life. This paper presents an effective approach to identifying
life-threatening texts using large language models (LLMs) and compares them
with traditional methods such as bag of words, word embedding, topic modeling,
and Bidirectional Encoder Representations from Transformers. We fine-tune three
open-source LLMs (Gemma, Mistral, and Llama-2) using their 7B-parameter
variants on different datasets, which are constructed with class balance,
imbalance, and extreme imbalance scenarios. Experimental results demonstrate
strong performance of LLMs against traditional methods. More specifically,
Mistral and Llama-2 models are top performers in both balanced and imbalanced
data scenarios while Gemma is slightly behind. We employ the upsampling
technique to deal with the imbalanced data scenarios and demonstrate that while
this method benefits traditional approaches, it does not have as much impact on
LLMs. This study demonstrates the great potential of LLMs for real-world
life-threatening language detection problems.
☆ TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving
The increasing adoption of artificial intelligence in telecommunications has
raised interest in the capability of Large Language Models (LLMs) to address
domain-specific, mathematically intensive tasks. Although recent advancements
have improved the performance of LLMs in general mathematical reasoning, their
effectiveness within specialized domains, such as signal processing, network
optimization, and performance analysis, remains largely unexplored. To address
this gap, we introduce TeleMath, the first benchmark dataset specifically
designed to evaluate LLM performance in solving mathematical problems with
numerical solutions in the telecommunications domain. Comprising 500
question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the
telecommunications field. This paper outlines the proposed QnA generation
pipeline, starting from a selected seed of problems crafted by Subject Matter
Experts. The evaluation of a wide range of open-source LLMs reveals that the best
performance on TeleMath is achieved by recent models explicitly designed for
mathematical or logical reasoning. In contrast, general-purpose models, even
those with a large number of parameters, often struggle with these challenges.
We have released the dataset and the evaluation code to ease result
reproducibility and support future research.
comment: 6 pages
☆ Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes
Speech recognisers usually perform optimally only in a specific environment
and need to be adapted to work well in another. For adaptation to a new
speaker, there is often too little data for fine-tuning to be robust, and that
data is usually unlabelled. This paper proposes a combination of approaches to
make adaptation to a single minute of data robust. First, instead of estimating
the adaptation parameters with cross-entropy on a single error-prone hypothesis
or "pseudo-label", this paper proposes a novel loss function, the conditional
entropy over complete hypotheses. Using multiple hypotheses makes adaptation
more robust to errors in the initial recognition. Second, a "speaker code"
characterises a speaker in a vector short enough that it requires little data
to estimate. On a far-field noise-augmented version of Common Voice, the
proposed scheme yields a 20% relative improvement in word error rate on one
minute of adaptation data, rising to 29% with 10 minutes.
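The entropy-over-hypotheses loss in the speaker adaptation abstract can be illustrated with a small sketch. It assumes the hypothesis space is approximated by an N-best list with one log score per hypothesis and that posteriors come from a softmax over those scores. The abstract does not confirm either assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def hypothesis_posteriors(log_scores: Sequence[float]) -> List[float]:
    """Softmax over N-best log scores (assumed approximation of the posterior)."""
    if not log_scores:
        raise ValueError("at least one hypothesis is required")
    top = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_entropy(log_scores: Sequence[float]) -> float:
    """Entropy of the hypothesis posterior, in nats; lower means more confident."""
    return -sum(p * math.log(p)
                for p in hypothesis_posteriors(log_scores) if p > 0.0)


uncertain = conditional_entropy([0.0, 0.0, 0.0, 0.0])  # four equally likely hypotheses
confident = conditional_entropy([0.0, -50.0])          # one dominant hypothesis
```

Minimising this quantity pushes probability mass toward hypotheses the recogniser already prefers without committing to a single pseudo-label, which is the robustness argument the abstract makes.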
☆ Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
Large language models (LLMs) can spell out tokens character by character with
high accuracy, yet they struggle with more complex character-level tasks, such
as identifying compositional subcomponents within tokens. In this work, we
investigate how LLMs internally represent and utilize character-level
information during the spelling-out process. Our analysis reveals that,
although spelling out is a simple task for humans, it is not handled in a
straightforward manner by LLMs. Specifically, we show that the embedding layer
does not fully encode character-level information, particularly beyond the
first character. As a result, LLMs rely on intermediate and higher Transformer
layers to reconstruct character-level knowledge, where we observe a distinct
"breakthrough" in their spelling behavior. We validate this mechanism through
three complementary analyses: probing classifiers, identification of knowledge
neurons, and inspection of attention weights.
☆ Conversational Search: From Fundamentals to Frontiers in the LLM Era SIGIR 2025
Conversational search enables multi-turn interactions between users and
systems to fulfill users' complex information needs. During this interaction,
the system should understand the users' search intent within the conversational
context and then return the relevant information through a flexible,
dialogue-based interface. Recent powerful large language models (LLMs), with
their capacity for instruction following, content generation, and reasoning,
have attracted significant attention and spurred advances, providing new
opportunities and challenges for building intelligent conversational search
systems. This
tutorial aims to introduce the connection between fundamentals and the emerging
topics revolutionized by LLMs in the context of conversational search. It is
designed for students, researchers, and practitioners from both academia and
industry. Participants will gain a comprehensive understanding of both the core
principles and cutting-edge developments driven by LLMs in conversational
search, equipping them with the knowledge needed to contribute to the
development of next-generation conversational search systems.
comment: Accepted by Tutorial Track in SIGIR 2025
☆ NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
This paper presents our system for Track 1: Mistake Identification in the BEA
2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The
task involves evaluating whether a tutor's response correctly identifies a
mistake in a student's mathematical reasoning. We explore four approaches: (1)
an ensemble of machine learning models over pooled token embeddings from
multiple pretrained language models (LMs); (2) a frozen sentence-transformer
using [CLS] embeddings with an MLP classifier; (3) a history-aware model with
multi-head attention between token-level history and response embeddings; and
(4) a retrieval-augmented few-shot prompting system with a large language model
(LLM), namely GPT-4o. Our final system retrieves semantically similar examples,
constructs structured prompts, and uses schema-guided output parsing to produce
interpretable predictions. It outperforms all baselines, demonstrating the
effectiveness of combining example-driven prompting with LLM reasoning for
pedagogical feedback assessment. Our code is available at
https://github.com/NaumanNaeem/BEA_2025.
comment: 6 pages, 2 figures, 1 table
☆ SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis
The advancement of conversational AI systems relies on the availability of
high-quality, flexible, and reproducible synthetic dialogues for training,
evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit
designed to address the challenges of synthetic dialogue generation and
analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog
provides abstractions for personas, orchestration, and scenario management,
enabling the creation of realistic, diverse, and controllable conversational
data for research and development. SDialog supports workflows such as
multi-agent simulation and scenario-driven generation, and represents a step
forward in the standardization of tools and frameworks for synthetic data
generation, a crucial advancement for ensuring reproducibility in today's
fast-evolving research landscape.
comment: https://github.com/idiap/sdialog
☆ Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code
This paper addresses the persistent challenge of accurately digitizing
paper-based electrocardiogram (ECG) recordings, with a particular focus on
robustly handling single leads compromised by signal overlaps, a common yet
under-addressed issue in existing methodologies. We propose a two-stage
pipeline designed to overcome this limitation. The first stage employs a
U-Net-based segmentation network, trained on a dataset enriched with overlapping
signals and fortified with custom data augmentations, to accurately isolate the
primary ECG trace. The subsequent stage converts this refined binary mask into
a time-series signal using established digitization techniques, enhanced by an
adaptive grid detection module for improved versatility across different ECG
formats and scales. Our experimental results demonstrate the efficacy of our
approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained
segmentation task. Crucially, our proposed digitization method yields superior
performance compared to a well-established baseline technique across both
non-overlapping and challenging overlapping ECG samples. For non-overlapping
signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson
Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366,
respectively, for the baseline. On samples with signal overlap, our method
achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the
baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to
significantly enhance digitization accuracy, especially in the presence of
signal overlaps, thereby laying a strong foundation for the reliable conversion
of analog ECG records into analyzable digital data for contemporary research
and clinical applications. The implementation is publicly available at this
GitHub repository: https://github.com/masoudrahimi39/ECG-code.
☆ Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search
We propose an unsupervised method for the reconstruction of protoforms, i.e.,
ancestral word forms from which modern language forms are derived. While prior
work has primarily relied on probabilistic models of phonological edits to
infer protoforms from cognate sets, such approaches are limited by their
predominantly data-driven nature. In contrast, our model integrates data-driven
inference with rule-based heuristics within an evolutionary optimization
framework. This hybrid approach leverages both statistical patterns and
linguistically motivated constraints to guide the reconstruction process. We
evaluate our method on the task of reconstructing Latin protoforms using a
dataset of cognates from five Romance languages. Experimental results
demonstrate substantial improvements over established baselines across both
character-level accuracy and phonological plausibility metrics.
☆ Encoding call-by-push-value in the pi-calculus
In this report we define an encoding of Levy's call-by-push-value
lambda-calculus (CBPV) in the pi-calculus, and prove that our encoding is both
sound and complete. We present informal (by-hand) proofs of soundness,
completeness, and all required lemmas. The encoding is specialized to the
internal pi-calculus (pi-i-calculus) to circumvent certain challenges
associated with using de Bruijn indices in a formalization, and it also helps
with bisimulation, as early-, late- and open-bisimulation coincide in this
setting; furthermore, bisimulation is a congruence. Additionally, we argue that
our encoding also satisfies the five criteria for good encodings proposed by
Gorla, and show similarities between Milner's encoding and ours. This
paper includes encodings of CBPV in the pi-i-calculus, the asynchronous polyadic
pi-calculus, and the local pi-calculus. We begin a formalization of the proof in
Coq for the soundness and completeness of the encoding in the pi-i-calculus.
Not all lemmas used in the formalization are themselves formally proven.
However, we argue that the non-proven lemmas are reasonable, as they are proven
by hand, or amount to Coq formalities that are straightforward given informal
arguments.
comment: 56 pages
☆ Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Scientific discoveries increasingly rely on complex multimodal reasoning
based on information-intensive scientific data and domain-specific expertise.
Empowered by expert-level scientific benchmarks, scientific Multimodal Large
Language Models (MLLMs) hold the potential to significantly enhance this
discovery process in realistic workflows. However, current scientific
benchmarks mostly focus on evaluating the knowledge understanding capabilities
of MLLMs, leading to an inadequate assessment of their perception and reasoning
abilities. To address this gap, we present the Scientists' First Exam (SFE)
benchmark, designed to evaluate the scientific cognitive capacities of MLLMs
through three interconnected levels: scientific signal perception, scientific
attribute understanding, and scientific comparative reasoning. Specifically, SFE
comprises 830 expert-verified VQA pairs across three question types, spanning
66 multimodal tasks across five high-value disciplines. Extensive experiments
reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08%
and 26.52% on SFE, highlighting significant room for MLLMs to improve in
scientific realms. We hope the insights obtained in SFE will facilitate further
developments in AI-enhanced scientific discoveries.
comment: 82 pages
☆ Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Large language models (LLMs) often struggle with knowledge-intensive tasks
due to a lack of background knowledge and a tendency to hallucinate. To address
these limitations, integrating knowledge graphs (KGs) with LLMs has been
intensively studied. Existing KG-enhanced LLMs focus on supplementary factual
knowledge, but still struggle with solving complex questions. We argue that
refining the relationships among facts and organizing them into a logically
consistent reasoning path is as important as factual knowledge itself.
Despite their potential, extracting reliable reasoning paths from KGs poses the
following challenges: the complexity of graph structures and the existence of
multiple generated paths, making it difficult to distinguish between useful and
redundant ones. To tackle these challenges, we propose the RRP framework to
mine the knowledge graph, which combines the semantic strengths of LLMs with
structural information obtained through relation embedding and bidirectional
distribution learning. Additionally, we introduce a rethinking module that
evaluates and refines reasoning paths according to their significance.
Experimental results on two public datasets show that RRP achieves
state-of-the-art performance compared to existing baseline methods. Moreover,
RRP can be easily integrated into various LLMs to enhance their reasoning
abilities in a plug-and-play manner. By generating high-quality reasoning paths
tailored to specific questions, RRP distills effective guidance for LLM
reasoning.
☆ Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
Large language models (LLMs) have demonstrated remarkable performance in
zero-shot dialogue state tracking (DST), reducing the need for task-specific
training. However, conventional DST benchmarks primarily focus on structured
user-agent conversations, failing to capture the complexities of real-world
multi-user interactions. In this study, we assess the robustness of LLMs in
multi-user DST while minimizing dataset construction costs. Inspired by recent
advances in LLM-based data annotation, we extend an existing DST dataset by
generating utterances of a second user based on speech act theory. Our
methodology systematically incorporates a second user's utterances into
conversations, enabling a controlled evaluation of LLMs in multi-user settings.
Experimental results reveal a significant performance drop compared to
single-user DST, highlighting the limitations of current LLMs in extracting and
tracking dialogue states amidst multiple speakers. Our findings emphasize the
need for future research to enhance LLMs for multi-user DST scenarios, paving
the way for more realistic and robust DST models.
☆ Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models
Modern language models are trained on large amounts of data. These data
inevitably include controversial and stereotypical content, which contains all
sorts of biases related to gender, origin, age, etc. As a result, the models
express biased points of view or produce different results based on the
assigned personality or the personality of the user. In this paper, we
investigate various proxy measures of bias in large language models (LLMs). We
find that evaluating models with pre-prompted personae on a multi-subject
benchmark (MMLU) leads to negligible and mostly random differences in scores.
However, if we reformulate the task and ask a model to grade the user's answer,
this shows more significant signs of bias. Finally, if we ask the model for
salary negotiation advice, we see pronounced bias in the answers. With the
recent trend for LLM assistant memory and personalization, these problems open
up from a different angle: modern LLM users do not need to pre-prompt the
description of their persona since the model already knows their
socio-demographics.
☆ Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Scientific claim verification against tables typically requires predicting
whether a claim is supported or refuted given a table. However, we argue that
predicting the final label alone is insufficient: it reveals little about the
model's reasoning and offers limited interpretability. To address this, we
reframe table-text alignment as an explanation task, requiring models to
identify the table cells essential for claim verification. We build a new
dataset by extending the SciTab benchmark with human-annotated cell-level
rationales. Annotators verify the claim label and highlight the minimal set of
cells needed to support their decision. After the annotation process, we
utilize the collected information and propose a taxonomy for handling ambiguous
cases. Our experiments show that (i) incorporating table alignment information
improves claim verification performance, and (ii) most LLMs, while often
predicting correct labels, fail to recover human-aligned rationales, suggesting
that their predictions do not stem from faithful reasoning.
comment: 8 pages; code and data are available at
https://github.com/Alab-NII/SciTabAlign
☆ Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Recent advancements in Multimodal Emotion Recognition (MER) face challenges
in addressing both modality missing and Out-Of-Distribution (OOD) data
simultaneously. Existing methods often rely on specific models or introduce
excessive parameters, which limits their practicality. To address these issues,
we propose a novel robust MER framework, Causal Inference Distiller (CIDer),
and introduce a new task, Random Modality Feature Missing (RMFM), to generalize
the definition of modality missing. CIDer integrates two key components: a
Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal
Inference (MACI) module. MSSD enhances robustness under the RMFM task through a
weight-sharing self-distillation approach applied across low-level features,
attention maps, and high-level representations. Additionally, a Word-level
Self-aligned Attention Module (WSAM) reduces computational complexity, while a
Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion.
To tackle OOD challenges, MACI employs a tailored causal graph to mitigate
label and language biases using a Multimodal Causal Module (MCM) and
fine-grained counterfactual texts. Notably, MACI can independently enhance OOD
generalization with minimal additional parameters. Furthermore, we
introduce new repartitioned MER OOD datasets. Experimental results
demonstrate that CIDer achieves robust performance in both RMFM and OOD
scenarios, with fewer parameters and faster training compared to
state-of-the-art methods. The implementation of this work is publicly
accessible at https://github.com/gw-zhong/CIDer.
comment: Submitted to TAC. The code is available at
https://github.com/gw-zhong/CIDer
☆ Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
Large language models (LLMs) have demonstrated significant advancements in
reasoning capabilities, performing well on various challenging benchmarks.
Techniques like Chain-of-Thought prompting have been introduced to further
improve reasoning. However, these approaches frequently generate longer
outputs, which in turn increase computational latency. Although some methods
use reinforcement learning to shorten reasoning, they often apply uniform
penalties without considering the problem's complexity, leading to suboptimal
outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by
promoting conciseness for simpler problems while preserving sufficient
reasoning for more complex ones for accuracy, thus improving the model's
overall performance. Specifically, we manage the model's reasoning efficiency
by dividing the reward function and including a novel penalty for output
length. Our approach has yielded impressive outcomes in benchmark evaluations
across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively
simpler datasets GSM8K and MATH500, our method has effectively shortened output
lengths while preserving or enhancing accuracy. On the more demanding AIME2024
dataset, our approach has resulted in improved accuracy.
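A minimal sketch of how a length penalty can be made to press harder on short outputs than on long ones is given below. The abstract does not state the reward formula, so the functional form, the name `length_penalized_reward`, and the parameters `alpha`, `power`, and `max_tokens` are all assumptions for illustration only. With `power < 1` the penalty is concave: each extra token costs more when the output is short (typically an easy problem) than when it is already long (typically a hard one).

```python
def length_penalized_reward(correct, num_tokens, max_tokens=4096,
                            alpha=0.5, power=0.5):
    """Hypothetical reward: a correctness term minus a powered length penalty.

    All names and default values are illustrative assumptions, not the
    paper's definitions.
    """
    # Correctness part of the reward.
    accuracy_reward = 1.0 if correct else 0.0
    # Fraction of the token budget used, clipped to [0, 1].
    frac = min(max(num_tokens, 0), max_tokens) / max_tokens
    # Concave penalty when power < 1: marginal cost shrinks as length grows.
    return accuracy_reward - alpha * frac ** power
```

Under these assumed defaults, going from 100 to 200 tokens lowers the reward more than going from 3000 to 3100 tokens, which is the qualitative behavior the abstract describes.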
☆ PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs
Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais
The integration of audio perception capabilities into Large Language Models
(LLMs) has enabled significant advances in Audio-LLMs. Although
application-focused developments, particularly in curating training data for
specific capabilities (e.g., audio reasoning), have progressed rapidly, the
underlying mechanisms that govern efficient transfer of rich semantic
representations from audio encoders to LLMs remain under-explored. We
conceptualize effective audio-LLM interaction as the LLM's ability to
proficiently probe the audio encoder representations to satisfy textual
queries. This paper presents a systematic investigation of how architectural
design choices affect this ability. Beginning with a standard Pengi/LLaVA-style
audio-LLM architecture, we propose and evaluate several modifications guided by
hypotheses derived from mechanistic interpretability studies and LLM
operational principles. Our experiments demonstrate that: (1) delaying audio
integration until the LLM's initial layers have established textual context
enhances its ability to probe the audio representations for relevant
information; (2) the LLM can proficiently probe audio representations
exclusively through each LLM layer's attention submodule, without requiring
propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently
integrated ensemble of diverse audio encoders provides richer, complementary
representations, thereby broadening the LLM's capacity to probe a wider
spectrum of audio information. All hypotheses are evaluated using an identical
three-stage training curriculum on a dataset of 5.6 million audio-text pairs,
ensuring controlled comparisons. Our final architecture, which incorporates all
proposed modifications, achieves relative improvements of 10% to 60% over
the baseline, validating our approach to optimizing cross-modal information
transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
comment: 21 pages, 11 figures
☆ Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting
Framing used by news media, especially in times of conflict, can have
substantial impact on readers' opinion, potentially aggravating the conflict
itself. Current studies of conflict framing offer limited insight, either
because they are qualitative or because they examine only surface-level generic
frames. In this work, we identify indicators of war and peace
journalism, as outlined by prior work in conflict studies, in a corpus of news
articles reporting on the Israel-Palestine war. For our analysis, we use
computational approaches, using a combination of frame semantics and large
language models to identify both communicative framing and its connection to
linguistic framing. Our analysis reveals a stronger focus on war-based
reporting than on peace-based reporting. We also show substantial differences in reporting
across the US, UK, and Middle Eastern news outlets in framing who the assailant
and victims of the conflict are, surfacing biases within the media.
☆ Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? ACL 2025
This paper introduces the TempVS benchmark, which focuses on temporal
grounding and reasoning capabilities of Multimodal Large Language Models
(MLLMs) in image sequences. TempVS consists of three main tests (i.e., event
relation inference, sentence ordering and image ordering), each accompanied
by a basic grounding test. TempVS requires MLLMs to rely on both visual and
linguistic modalities to understand the temporal order of events. We evaluate
38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS,
with a substantial performance gap compared to human capabilities. We also
provide fine-grained insights that suggest promising directions for future
research. Our TempVS benchmark data and code are available at
https://github.com/yjsong22/TempVS.
comment: 27 pages, 14 figures. Accepted to ACL 2025
☆ Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
comment: This paper is currently under review
☆ PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Large Language Models (LLMs) have demonstrated impressive capabilities in
complex reasoning tasks, yet they still struggle to reliably verify the
correctness of their own outputs. Existing solutions to this verification
challenge often depend on separate verifier models or require multi-stage
self-correction training pipelines, which limit scalability. In this paper, we
propose Policy as Generative Verifier (PAG), a simple and effective framework
that empowers LLMs to self-correct by alternating between policy and verifier
roles within a unified multi-turn reinforcement learning (RL) paradigm.
Distinct from prior approaches that always generate a second attempt regardless
of model confidence, PAG introduces a selective revision mechanism: the model
revises its answer only when its own generative verification step detects an
error. This verify-then-revise workflow not only alleviates model collapse but
also jointly enhances both reasoning and verification abilities. Extensive
experiments across diverse reasoning benchmarks highlight PAG's dual
advancements: as a policy, it enhances direct generation and self-correction
accuracy; as a verifier, its self-verification outperforms self-consistency.
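The selective verify-then-revise workflow described above can be sketched as a short control loop. This is only an illustration under stated assumptions: `generate`, `verify`, and `revise` are hypothetical callables standing in for the same model acting in its policy and verifier roles, and `max_turns` is an assumed cap that the abstract does not specify.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Revise an answer only when the verification step flags an error.

    generate(question) -> answer
    verify(question, answer) -> True if the answer is judged correct
    revise(question, answer) -> new answer
    All three callables are hypothetical stand-ins for one model.
    """
    answer = generate(question)
    for _ in range(max_turns):
        # Selective revision: stop as soon as the verifier accepts the answer.
        if verify(question, answer):
            break
        answer = revise(question, answer)
    return answer
```

The contrast with an always-revise pipeline is the early `break`: an answer the verifier accepts is returned unchanged, so no second attempt is generated for it.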
☆ TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
comment: Under review. Code is available at
https://github.com/yxh-y/TableRAG/tree/main
☆ Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning
Faithful evaluation of language model capabilities is crucial for deriving
actionable insights that can inform model development. However, rigorous causal
evaluations in this domain face significant methodological challenges,
including complex confounding effects and prohibitive computational costs
associated with extensive retraining. To tackle these challenges, we propose a
causal representation learning framework wherein observed benchmark performance
is modeled as a linear transformation of a few latent capability factors.
Crucially, these latent factors are identified as causally interrelated after
appropriately controlling for the base model as a common confounder. Applying
this approach to a comprehensive dataset encompassing over 1500 models
evaluated across six benchmarks from the Open LLM Leaderboard, we identify a
concise three-node linear causal structure that reliably explains the observed
performance variations. Further interpretation of this causal structure
provides substantial scientific insights beyond simple numerical rankings:
specifically, we reveal a clear causal direction starting from general
problem-solving capabilities, advancing through instruction-following
proficiency, and culminating in mathematical reasoning ability. Our results
underscore the essential role of carefully controlling base model variations
during evaluation, a step critical to accurately uncovering the underlying
causal relationships among latent model capabilities.
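One way to write down the modeling assumption described above is sketched here. The notation is an assumption made for illustration: $x$ collects the six benchmark scores, $z$ the three latent capability factors, $G$ a mixing matrix, $b_i$ a base-model-dependent offset (the common confounder), and $a_{21}$, $a_{32}$ the edge weights of a strict chain. The paper's exact parameterization, including whether further edges are present, may differ.

```latex
\begin{aligned}
x &= G z + \varepsilon, \qquad x \in \mathbb{R}^{6},\; z \in \mathbb{R}^{3},\\
z_1 &= b_1 + e_1 && \text{(general problem solving)},\\
z_2 &= a_{21} z_1 + b_2 + e_2 && \text{(instruction following)},\\
z_3 &= a_{32} z_2 + b_3 + e_3 && \text{(mathematical reasoning)}.
\end{aligned}
```

In this sketch, controlling for the base model amounts to accounting for the $b_i$ terms before estimating the edge weights among the $z_i$.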
☆ Can We Infer Confidential Properties of Training Data from LLMs?
Large language models (LLMs) are increasingly fine-tuned on domain-specific
datasets to support applications in fields such as healthcare, finance, and
law. These fine-tuning datasets often have sensitive and confidential
dataset-level properties -- such as patient demographics or disease prevalence
-- that are not intended to be revealed. While prior work has studied property
inference attacks on discriminative models (e.g., image classification models)
and generative models (e.g., GANs for image data), it remains unclear if such
attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark
task for evaluating property inference in LLMs under two fine-tuning paradigms:
question-answering and chat-completion. Built on the ChatDoctor dataset, our
benchmark includes a range of property types and task configurations. We
further propose two tailored attacks: a prompt-based generation attack and a
shadow-model attack leveraging word frequency signals. Empirical evaluations
across multiple pretrained LLMs show the success of our attacks, revealing a
previously unrecognized vulnerability in LLMs.
☆ An Analysis of Datasets, Metrics and Models in Keyphrase Generation ACL 2025
Keyphrase generation refers to the task of producing a set of words or
phrases that summarises the content of a document. Continuous efforts have been
dedicated to this task over the past few years, spreading across multiple lines
of research, such as model architectures, data resources, and use-case
scenarios. Yet, the current state of keyphrase generation remains unknown as
there has been no attempt to review and analyse previous work. In this paper,
we bridge this gap by presenting an analysis of over 50 research papers on
keyphrase generation, offering a comprehensive overview of recent progress,
limitations, and open challenges. Our findings highlight several critical
issues in current evaluation practices, such as the concerning similarity among
commonly-used benchmark datasets and inconsistencies in metric calculations
leading to overestimated performances. Additionally, we address the limited
availability of pre-trained models by releasing a strong PLM-based model for
keyphrase generation as an effort to facilitate future research.
comment: GEM^2 paper @ ACL 2025
☆ Code Execution as Grounded Supervision for LLM Reasoning
Training large language models (LLMs) with chain-of-thought (CoT) supervision
has proven effective for enhancing their reasoning abilities. However,
obtaining reliable and accurate reasoning supervision remains a significant
challenge. We propose a scalable method for generating a high-quality CoT
supervision dataset by leveraging the determinism of program execution. Unlike
existing reasoning dataset generation methods that rely on costly human
annotations or error-prone LLM-generated CoT, our approach extracts verifiable,
step-by-step reasoning traces from code execution and transforms them into
natural language CoT reasoning. Experiments on reasoning benchmarks across
various domains show that our method effectively equips LLMs with transferable
reasoning abilities across diverse tasks. Furthermore, the ablation studies
validate that our method produces highly accurate reasoning data and reduces
overall token length during inference by reducing meaningless repetition and
overthinking.
☆ Provably Learning from Language Feedback
Interactively learning from observation and language feedback is an
increasingly studied area driven by the emergence of large language model (LLM)
agents. While impressive empirical demonstrations have been shown, so far a
principled framing of these decision problems remains lacking. In this paper,
we formalize the Learning from Language Feedback (LLF) problem, assert
sufficient assumptions to enable learning despite latent rewards, and introduce
$\textit{transfer eluder dimension}$ as a complexity measure to characterize
the hardness of LLF problems. We show that transfer eluder dimension captures
the intuition that information in the feedback changes the learning complexity
of the LLF problem. We demonstrate cases where learning from rich language
feedback can be exponentially faster than learning from reward. We develop a
no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems
through sequential interactions, with performance guarantees that scale with
the transfer eluder dimension of the problem. Across several empirical domains,
we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs
does not work reliably. Our contributions mark a first step towards designing
principled interactive learning algorithms from generic language feedback.
☆ Detecting Sockpuppetry on Wikipedia Using Meta-Learning ACL 2025
Malicious sockpuppet detection on Wikipedia is critical to preserving access
to reliable information on the internet and preventing the spread of
disinformation. Prior machine learning approaches rely on stylistic and
meta-data features, but do not prioritise adaptability to author-specific
behaviours. As a result, they struggle to effectively model the behaviour of
specific sockpuppet-groups, especially when text data is limited. To address
this, we propose the application of meta-learning, a machine learning technique
designed to improve performance in data-scarce settings by training models
across multiple tasks. Meta-learning optimises a model for rapid adaptation to
the writing style of a new sockpuppet-group. Our results show that
meta-learning significantly enhances the precision of predictions compared to
pre-trained models, marking an advancement in combating sockpuppetry on open
editing platforms. We release a new dataset of sockpuppet investigations to
foster future research in both sockpuppetry and meta-learning fields.
comment: Accepted to ACL 2025
☆ AC/DC: LLM-based Audio Comprehension via Dialogue Continuation
We propose an instruction-following audio comprehension model that leverages
the dialogue continuation ability of large language models (LLMs). Instead of
directly generating target captions in training data, the proposed method
trains a model to produce responses as if the input caption triggered a
dialogue. This dialogue continuation training mitigates the caption variation
problem. Learning to continue a dialogue effectively captures the caption's
meaning beyond its surface-level words. As a result, our model enables
zero-shot instruction-following capability without multitask instruction
tuning, even when trained solely on audio captioning datasets. Experiments on
AudioCaps, WavCaps, and Clotho datasets with AudioBench audio-scene
question-answering tests demonstrate our model's ability to follow various
unseen instructions.
comment: Accepted to Interspeech 2025
☆ Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
Speech-to-speech translation (S2ST) has been advanced with large language
models (LLMs), which are fine-tuned on discrete speech units. In such
approaches, modality adaptation from text to speech has been an issue. LLMs are
trained on text-only data, which makes it challenging to adapt them to the speech
modality with limited speech-to-speech data. To address the training
difficulty, we propose scheduled interleaved speech--text training in this
study. We use interleaved speech--text units instead of speech units during
training, where aligned text tokens are interleaved at the word level. We
gradually decrease the ratio of text as training progresses, to facilitate
progressive modality adaptation from text to speech. We conduct experimental
evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show
that the proposed method consistently improves translation performance,
especially for languages with limited training data.
comment: Accepted to Interspeech 2025
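The scheduling idea in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the linear decay, the start and end ratios, the per-word random choice, and the names `text_ratio` and `interleave` are all assumptions. Each word is assumed to come with both its aligned text tokens and its discrete speech units.

```python
import random


def text_ratio(step, total_steps, start=0.9, end=0.0):
    """Assumed linear schedule: share of words shown as text at a given step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def interleave(words, ratio, rng):
    """Build one training sequence from word-aligned (text, speech) pairs.

    words: list of (text_tokens, speech_units) tuples, one per word.
    ratio: probability that a word is emitted as text instead of speech units.
    rng:   a random.Random instance, passed in for reproducibility.
    """
    sequence = []
    for text_tokens, speech_units in words:
        # Word-level choice between the two modalities.
        sequence.extend(text_tokens if rng.random() < ratio else speech_units)
    return sequence
```

Early in training most words appear as text, which is close to what the LLM has already seen; by the end the sequence consists of speech units only, matching the target S2ST format.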
☆ "Check My Work?": Measuring Sycophancy in a Simulated Educational Context KDD
This study examines how user-provided suggestions affect Large Language
Models (LLMs) in a simulated educational context, where sycophancy poses
significant risks. Testing five different LLMs from the OpenAI GPT-4o and
GPT-4.1 model classes across five experimental conditions, we show that
response quality varies dramatically based on query framing. In cases where the
student mentions an incorrect answer, the LLM correctness can degrade by as
much as 15 percentage points, while mentioning the correct answer boosts
accuracy by the same margin. Our results also show that this bias is stronger
in smaller models, with an effect of up to 30% for the GPT-4.1-nano model,
versus 8% for the GPT-4o model. Our analysis of how often LLMs "flip" their
answer, and an investigation into token level probabilities, confirm that the
models are generally changing their answers to answer choices mentioned by
students in line with the sycophancy hypothesis. This sycophantic behavior has
important implications for educational equity, as LLMs may accelerate learning
for knowledgeable students while the same tools may reinforce misunderstanding
for less knowledgeable students. Our results highlight the need to better
understand the mechanisms behind such bias, and ways to mitigate it, in the
educational context.
comment: Presented at KDD Workshop on Ethical Artificial Intelligence: Methods
and Applications (EAI) 2025
☆ Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages
Training deep learning networks with minimal supervision has gained
significant research attention due to its potential to reduce reliance on
extensive labelled data. While self-training methods have proven effective in
semi-supervised learning, they remain vulnerable to errors from noisy pseudo
labels. Moreover, most recent approaches to the few-label classification
problem are either designed for resource-rich languages such as English or
involve complex cascading models that are prone to overfitting. To address the
persistent challenge of few-label text classification in truly low-resource
linguistic contexts, where existing methods often struggle with noisy
pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods
that rely on generic multi-cluster pseudo-labelling or complex cascading
architectures, Flick leverages the fundamental insight that distilling
high-confidence pseudo-labels from a broader set of initial clusters can
dramatically improve pseudo-label quality, particularly for linguistically
diverse, low-resource settings. Flick introduces a novel pseudo-label
refinement component that departs from traditional pseudo-labelling strategies
by identifying and leveraging top-performing pseudo-label clusters. This
component specifically learns to distil highly reliable pseudo-labels from an
initial broad set by focusing on single-cluster cohesion and leveraging an
adaptive top-k selection mechanism. This targeted refinement process is crucial
for mitigating the propagation of errors inherent in low-resource data,
allowing for robust fine-tuning of pre-trained language models with only a
handful of true labels. We demonstrate Flick's efficacy across 14 diverse
datasets, encompassing challenging low-resource languages such as Arabic, Urdu,
and Setswana, alongside English, showcasing its superior performance and
adaptability.
☆ ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs
Gradient-based data influence approximation has been leveraged to select
useful data samples in the supervised fine-tuning of large language models.
However, the computation of gradients throughout the fine-tuning process
requires too many resources to be feasible in practice. In this paper, we
propose an efficient gradient-based data selection framework with clustering
and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition
that data samples with similar gradient features will have similar influences,
we first perform clustering on the training data pool. Then, we frame the
inter-cluster data selection as a constrained computing budget allocation
problem and consider it a multi-armed bandit problem. A modified UCB algorithm
is leveraged to solve this problem. Specifically, during the iterative sampling
process, historical data influence information is recorded to directly estimate
the distribution of each cluster, and a cold start is adopted to balance
exploration and exploitation. Experimental results on various benchmarks show
that our proposed framework, ClusterUCB, can achieve comparable results to the
original gradient-based data selection methods while greatly reducing computing
consumption.
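The bandit framing in the ClusterUCB abstract above can be illustrated with a minimal sketch. It uses the textbook UCB1 rule, not the authors' modified UCB, and every name in it (`ucb_allocate`, `influence_fn`, the exploration constant `c`) is hypothetical. Each cluster is treated as an arm, one influence evaluation costs one unit of budget, and a simple cold start pulls every cluster once before the confidence bound is used.

```python
import math
import random


def ucb_allocate(clusters, influence_fn, budget, c=1.0, seed=0):
    """Spend `budget` influence evaluations across clusters with plain UCB1.

    clusters: dict mapping cluster id -> list of samples (assumed layout)
    influence_fn: callable returning a scalar influence score for a sample
    Returns the scored samples and the number of pulls per cluster.
    """
    rng = random.Random(seed)
    remaining = {k: list(v) for k, v in clusters.items()}
    for pool in remaining.values():
        rng.shuffle(pool)

    counts = {k: 0 for k in clusters}
    sums = {k: 0.0 for k in clusters}
    scored = []

    for t in range(1, budget + 1):
        live = [k for k in remaining if remaining[k]]
        if not live:
            break  # every cluster is exhausted
        unvisited = [k for k in live if counts[k] == 0]
        if unvisited:
            # Cold start: evaluate each cluster once before trusting estimates.
            arm = unvisited[0]
        else:
            # UCB1: mean observed influence plus an exploration bonus.
            arm = max(
                live,
                key=lambda k: sums[k] / counts[k]
                + c * math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        sample = remaining[arm].pop()
        reward = influence_fn(sample)
        counts[arm] += 1
        sums[arm] += reward
        scored.append((sample, reward))

    return scored, counts
```

With two clusters whose samples have clearly different influence, most of the budget goes to the higher-influence cluster, while the low-influence one is still revisited occasionally through the exploration bonus.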
☆ Discrete Audio Tokens: More Than a Survey!
Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Discrete audio tokens are compact representations that aim to preserve
perceptual quality, phonetic content, and speaker characteristics while
enabling efficient storage and inference, as well as competitive performance
across diverse downstream tasks. They provide a practical alternative to
continuous features, enabling the integration of speech and audio into modern
large language models (LLMs). As interest in token-based audio processing
grows, various tokenization methods have emerged, and several surveys have
reviewed the latest progress in the field. However, existing studies often
focus on specific domains or tasks and lack a unified comparison across various
benchmarks. This paper presents a systematic review and benchmark of discrete
audio tokenizers, covering three domains: speech, music, and general audio. We
propose a taxonomy of tokenization approaches based on encoder-decoder,
quantization techniques, training paradigm, streamability, and application
domains. We evaluate tokenizers on multiple benchmarks for reconstruction,
downstream performance, and acoustic language modeling, and analyze trade-offs
through controlled ablation studies. Our findings highlight key limitations,
practical considerations, and open challenges, providing insight and guidance
for future research in this rapidly evolving area. For more information,
including our main results and tokenizer database, please refer to our website:
https://poonehmousavi.github.io/dates-website/.
☆ Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models
Language models are essentially probability distributions over token
sequences. Auto-regressive models generate sentences by iteratively computing
and sampling from the distribution of the next token. This iterative sampling
introduces stochasticity, leading to the assumption that language models make
probabilistic decisions, similar to sampling from unknown distributions.
Building on this assumption, prior research has used simulated Gibbs sampling,
inspired by experiments designed to elicit human priors, to infer the priors of
language models. In this paper, we revisit a critical question: Do language
models possess Bayesian brains? Our findings show that under certain
conditions, language models can exhibit near-deterministic decision-making,
such as producing maximum likelihood estimations, even with a non-zero sampling
temperature. This challenges the sampling assumption and undermines previous
methods for eliciting human-like priors. Furthermore, we demonstrate that
without proper scrutiny, a system with deterministic behavior undergoing
simulated Gibbs sampling can converge to a "false prior." To address this, we
propose a straightforward approach to distinguish between stochastic and
deterministic decision patterns in Gibbs sampling, helping to prevent the
inference of misleading language model priors. We experiment on a variety of
large language models to identify their decision patterns under various
circumstances. Our results provide key insights in understanding decision
making of large language models.
♻ ☆ Visually Descriptive Language Model for Vector Graphics Reasoning
Despite significant advancements, large multimodal models (LMMs) still
struggle to bridge the gap between low-level visual perception -- focusing on
shapes, sizes, and layouts -- and high-level language reasoning, such as
semantics and logic. This limitation is evident in tasks that require precise
visual perception, like comparing geometric properties or solving visual
reasoning problems. To study this failure mode, we focus on vector graphics --
images composed of 2D objects and shapes, prevalent in LMM-based tasks in web,
design, and OS environments. We identify two key research questions: how can we
enable precise visual perception, and how can we facilitate high-level
reasoning based on such low-level perceptions? To capture fine visual details,
we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes.
However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To
tackle this, we propose the Visually Descriptive Language Model (VDLM), which
introduces a Primal Visual Description (PVD) as an intermediate textual
representation. PVD translates SVGs into a text-based abstraction consisting of
primitive attributes (e.g., shape, position, measurement) and their
corresponding values. PVD can be learned using task-agnostic synthesized data
and represents visual primitives that are universal across vector graphics.
This abstraction is more structured, allowing for direct interpretation by
foundation models for zero-shot generalization. Without human-annotated data,
empirical results show that VDLM significantly improves state-of-the-art LMMs
like GPT-4o on various multimodal perception and reasoning tasks. Extensive
analyses of VDLM show improved interpretability due to its disentangled
perception and reasoning. We also demonstrate a positive correlation between
PVD quality and task performance. Project page:
https://mikewangwzhl.github.io/VDLM/
comment: Project page: https://mikewangwzhl.github.io/VDLM/
♻ ☆ Improving LLM Safety Alignment with Dual-Objective Optimization ICML 2025
Existing training-time safety alignment techniques for large language models
(LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization
(DPO), a widely deployed alignment method, exhibits limitations in both
experimental and theoretical contexts as its loss function proves suboptimal
for refusal learning. Through gradient-based analysis, we identify these
shortcomings and propose an improved safety alignment that disentangles DPO
objectives into two components: (1) robust refusal training, which encourages
refusal even when partial unsafe generations are produced, and (2) targeted
unlearning of harmful knowledge. This approach significantly increases LLM
robustness against a wide range of jailbreak attacks, including prefilling,
suffix, and multi-turn attacks across both in-distribution and
out-of-distribution scenarios. Furthermore, we introduce a method to emphasize
critical refusal tokens by incorporating a reward-based token-level weighting
mechanism for refusal learning, which further improves the robustness against
adversarial exploits. Our research also suggests that robustness to jailbreak
attacks is correlated with token distribution shifts in the training process
and internal representations of refusal and harmful tokens, offering valuable
directions for future research in LLM safety alignment. The code is available
at https://github.com/wicai24/DOOR-Alignment
comment: ICML 2025
♻ ☆ Weak-to-Strong Jailbreaking on Large Language Models ICML 2025
Large language models (LLMs) are vulnerable to jailbreak attacks, resulting
in harmful, unethical, or biased text generations. However, existing
jailbreaking methods are computationally costly. In this paper, we propose the
weak-to-strong jailbreaking attack, an efficient inference-time attack for
aligned LLMs to produce harmful text. Our key intuition is based on the
observation that jailbroken and aligned models only differ in their initial
decoding distributions. The weak-to-strong attack's key technical insight is
using two smaller models (a safe and an unsafe one) to adversarially modify a
significantly larger safe model's decoding probabilities. We evaluate the
weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The
results show our method can increase the misalignment rate to over 99% on two
datasets with just one forward pass per example. Our study exposes an urgent
safety issue that needs to be addressed when aligning LLMs. As an initial
attempt, we propose a defense strategy to protect against such attacks, but
creating more advanced defenses remains challenging. The code for replicating
the method is available at https://github.com/XuandongZhao/weak-to-strong
comment: ICML 2025
♻ ☆ Efficiently Identifying Watermarked Segments in Mixed-Source Texts ACL 2025
Text watermarks in large language models (LLMs) are increasingly used to
detect synthetic text, mitigating misuse cases like fake news and academic
dishonesty. While existing watermarking detection techniques primarily focus on
classifying entire documents as watermarked or not, they often neglect the
common scenario of identifying individual watermark segments within longer,
mixed-source documents. Drawing inspiration from plagiarism detection systems,
we propose two novel methods for partial watermark detection. First, we develop
a geometry cover detection framework aimed at determining whether there is a
watermark segment in long text. Second, we introduce an adaptive online
learning algorithm to pinpoint the precise location of watermark segments
within the text. Evaluated on three popular watermarking techniques
(KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves
high accuracy, significantly outperforming baseline methods. Moreover, our
framework is adaptable to other watermarking techniques, offering new insights
for precise watermark detection. Our code is publicly available at
https://github.com/XuandongZhao/llm-watermark-location
comment: ACL 2025
♻ ☆ PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play ACL 2025
Large language models (LLMs) are increasingly integrated with specialized
external tools, yet many tasks demand zero-shot tool usage with minimal or
noisy documentation. Existing solutions rely on manual rewriting or labeled
data for validation, making them inapplicable in true zero-shot settings. To
address these challenges, we propose PLAY2PROMPT, an automated framework that
systematically "plays" with each tool to explore its input-output behaviors.
Through this iterative trial-and-error process, PLAY2PROMPT refines tool
documentation and generates usage examples without any labeled data. These
examples not only guide LLM inference but also serve as validation to further
enhance tool utilization. Extensive experiments on real-world tasks demonstrate
that PLAY2PROMPT significantly improves zero-shot tool performance across both
open and closed models, offering a scalable and effective solution for
domain-specific tool integration.
comment: ACL 2025 Long Paper (Findings)
♻ ☆ Large Language Models for Multilingual Previously Fact-Checked Claim Detection
In our era of widespread false information, human fact-checkers often face
the challenge of duplicating efforts when verifying claims that may have
already been addressed in other countries or languages. As false information
transcends linguistic boundaries, the ability to automatically detect
previously fact-checked claims across languages has become an increasingly
important task. This paper presents the first comprehensive evaluation of large
language models (LLMs) for multilingual previously fact-checked claim
detection. We assess seven LLMs across 20 languages in both monolingual and
cross-lingual settings. Our results show that while LLMs perform well for
high-resource languages, they struggle with low-resource languages. Moreover,
translating original texts into English proved to be beneficial for
low-resource languages. These findings highlight the potential of LLMs for
multilingual previously fact-checked claim detection and provide a foundation
for further research on this promising application of LLMs.
♻ ☆ Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards
Recent developments in Large Language Models (LLMs) have shifted from
pre-training scaling to post-training and test-time scaling. Across these
developments, a key unified paradigm has arisen: Learning from Rewards, where
reward signals act as the guiding stars to steer LLM behavior. It has
underpinned a wide range of prevalent techniques, such as reinforcement
learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc
correction. Crucially, this paradigm enables the transition from passive
learning from static data to active learning from dynamic feedback. This endows
LLMs with aligned preferences and deep reasoning capabilities for diverse
tasks. In this survey, we present a comprehensive overview of learning from
rewards, from the perspective of reward models and learning strategies across
training, inference, and post-inference stages. We further discuss the
benchmarks for reward models and the primary applications. Finally, we highlight
the challenges and future directions. We maintain a paper collection at
https://github.com/bobxwu/learning-from-rewards-llm-papers.
comment: 36 Pages
♻ ☆ Multi-group Uncertainty Quantification for Long-form Text Generation UAI 2025
While past works have shown how uncertainty quantification can be applied to
large language model (LLM) outputs, the question of whether resulting
uncertainty guarantees still hold within sub-groupings of data remains open. In
our work, given some long-form text generated by an LLM, we study uncertainty
at both the level of individual claims contained within the output (via
calibration) and across the entire output itself (via conformal prediction).
Using biography generation as a testbed for this study, we derive a set of
(demographic) attributes (e.g., whether some text describes a man or woman) for
each generation to form such "subgroups" of data. We find that although
canonical methods for both types of uncertainty quantification perform well
when measuring across the entire dataset, such guarantees break down when
examining particular subgroups. Having established this issue, we invoke
group-conditional methods for uncertainty quantification -- multicalibration
and multivalid conformal prediction -- and find that across a variety of
approaches, additional subgroup information consistently improves calibration
and conformal prediction within subgroups (while crucially retaining guarantees
across the entire dataset). As the problems of calibration, conformal
prediction, and their multi-group counterparts have not been extensively
explored in the context of long-form text generation, we consider these results
to form a benchmark for this setting.
comment: Updated to UAI 2025 camera ready version
♻ ☆ Debiasing Watermarks for Large Language Models via Maximal Coupling
Watermarking language models is essential for distinguishing between human
and machine-generated text and thus maintaining the integrity and
trustworthiness of digital communication. We present a novel green/red list
watermarking approach that partitions the token set into "green" and "red"
lists, subtly increasing the generation probability for green tokens. To
correct token distribution bias, our method employs maximal coupling, using a
uniform coin flip to decide whether to apply bias correction, with the result
embedded as a pseudorandom watermark signal. Theoretical analysis confirms this
approach's unbiased nature and robust detection capabilities. Experimental
results show that it outperforms prior techniques by preserving text quality
while maintaining high detectability, and it demonstrates resilience to
targeted modifications aimed at improving text quality. This research provides
a promising watermarking solution for language models, balancing effective
detection with minimal impact on text quality.
comment: To appear in Journal of the American Statistical Association (JASA)
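The green/red partition described in the watermarking abstract above can be sketched as follows. This shows only the generic green/red list idea: a keyed pseudorandom split of the vocabulary and a logit boost for green tokens. It does not implement the paper's maximal-coupling bias correction. Seeding the split from the previous token, and the names `green_list`, `biased_probs`, `green_fraction`, `gamma`, `delta`, and `key`, are assumptions made for illustration.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5, key="demo-key"):
    """Pseudorandomly pick a fraction `gamma` of token ids as the green list."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def biased_probs(logits, green, delta=2.0):
    """Add `delta` to green-token logits and return softmax probabilities."""
    shifted = [l + delta if i in green else l for i, l in enumerate(logits)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def green_fraction(tokens, vocab_size, gamma=0.5, key="demo-key"):
    """Detection statistic: share of tokens that fall in their green list."""
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return hits / max(1, len(tokens) - 1)
```

A detector holding the key recomputes each green list and checks whether the green fraction is well above `gamma`. The logit boost shifts the token distribution, which is the bias that the abstract's maximal-coupling step is meant to correct.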
♻ ☆ The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages
Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessico Ojo, Atnafu Lambebo Tonja, Maushami Chetty, Wilhelmina NdapewaOnyothi Nekoto, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman
This paper presents the Esethu Framework, a sustainable data curation
framework specifically designed to empower local communities and ensure
equitable benefit-sharing from their linguistic resources. This framework is
supported by the Esethu license, a novel community-centric data license. As a
proof of concept, we introduce the Vuk'uzenzele isiXhosa Speech Dataset
(ViXSD), an open-source corpus developed under the Esethu Framework and
License. The dataset, containing read speech from native isiXhosa speakers
enriched with demographic and linguistic metadata, demonstrates how
community-driven licensing and curation principles can bridge resource gaps in
automatic speech recognition (ASR) for African languages while safeguarding the
interests of data creators. We describe the framework guiding dataset
development, outline the Esethu license provisions, present the methodology for
ViXSD, and present ASR experiments validating ViXSD's usability in building and
refining voice-driven applications for isiXhosa.
♻ ☆ Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
Large Language Models (LLMs) have achieved remarkable success in tasks
requiring complex reasoning, such as code generation, mathematical problem
solving, and algorithmic synthesis -- especially when aided by reasoning tokens
and Chain-of-Thought prompting. Yet, a core question remains: do these models
truly reason, or do they merely exploit shallow statistical patterns? In this
paper, we introduce Chain-of-Code Collapse, where we systematically investigate
the robustness of reasoning LLMs by introducing a suite of semantically
faithful yet adversarially structured prompt perturbations. Our evaluation --
spanning 700 perturbed code generations derived from LeetCode-style problems --
applies transformations such as storytelling reframing, irrelevant constraint
injection, example reordering, and numeric perturbation. We observe that while
certain modifications severely degrade performance (with accuracy drops of up to
42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting
sensitivity not only to semantics but also to surface-level prompt dynamics.
These findings expose the fragility and unpredictability of current reasoning
systems, underscoring the need for more principled approaches to reasoning
alignment and prompting robustness. We release our perturbation datasets and
evaluation framework to promote further research in trustworthy and resilient
LLM reasoning.
♻ ☆ Aspect-Based Opinion Summarization with Argumentation Schemes
Reviews are valuable resources for customers making purchase decisions in
online shopping. However, it is impractical for customers to go over the vast
number of reviews and manually conclude the prominent opinions, which prompts
the need for automated opinion summarization systems. Previous approaches,
either extractive or abstractive, face challenges in automatically producing
grounded aspect-centric summaries. In this paper, we propose a novel
summarization system that not only captures predominant opinions from an aspect
perspective with supporting evidence, but also adapts to varying domains
without relying on a pre-defined set of aspects. Our proposed framework,
ASESUM, summarizes viewpoints relevant to the critical aspects of a product by
extracting aspect-centric arguments and measuring their salience and validity.
We conduct experiments on a real-world dataset to demonstrate the superiority
of our approach in capturing diverse perspectives of the original reviews
compared to new and existing methods.
comment: Accepted by ArgMining 2025
♻ ☆ Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
As Language Model (LM) capabilities advance, evaluating and supervising them
at scale is getting harder for humans. There is hope that other language models
can automate both these tasks, which we refer to as "AI Oversight". We study
how model similarity affects both aspects of AI oversight by proposing Chance
Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on
overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores
favor models similar to the judge, generalizing recent self-preference results.
Then, we study training on LM annotations, and find complementary knowledge
between the weak supervisor and strong student model plays a crucial role in
gains from "weak-to-strong generalization". As model capabilities increase,
it becomes harder to find their mistakes, and we might defer more to AI
oversight. However, we observe a concerning trend -- model mistakes are
becoming more similar with increasing capabilities, pointing to risks from
correlated failures. Our work underscores the importance of reporting and
correcting for model similarity, especially in the emerging paradigm of AI
oversight.
comment: 60 pages, 20 figures
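To make the idea of chance-adjusted agreement in the abstract above concrete, here is a minimal sketch of the classical Cohen's-kappa form: observed agreement between two models' predictions, corrected by the agreement expected from their label frequencies alone. This is not the CAPA metric itself, which the abstract describes as probabilistic and based on overlap in model mistakes. The function name and the hard-label input format are assumptions.

```python
from collections import Counter


def chance_adjusted_agreement(preds_a, preds_b):
    """Cohen's-kappa-style agreement between two lists of hard predictions."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equally long")
    n = len(preds_a)

    # Fraction of items on which the two models give the same answer.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n

    # Agreement expected if each model answered independently according to
    # its own label frequencies.
    freq_a, freq_b = Counter(preds_a), Counter(preds_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both models always output the same single label
    return (observed - expected) / (1.0 - expected)
```

A score near 1 means the two models agree far more often than chance would predict, and a score near 0 means their agreement is explained by label frequencies alone.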
♻ ☆ Persistent Topological Features in Large Language Models ICML 2025
Yuri Gardinazzi, Karthik Viswanathan, Giada Panerai, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti
Understanding the decision-making processes of large language models is
critical given their widespread applications. To achieve this, we aim to
connect a formal mathematical framework -- zigzag persistence from topological
data analysis -- with practical and easily applicable algorithms. Zigzag
persistence is particularly effective for characterizing data as it dynamically
transforms across model layers. Within this framework, we introduce topological
descriptors that measure how topological features, $p$-dimensional holes,
persist and evolve throughout the layers. Unlike methods that assess each layer
individually and then aggregate the results, our approach directly tracks the
full evolutionary path of these features. This offers a statistical perspective
on how prompts are rearranged and their relative positions changed in the
representation space, providing insights into the system's operation as an
integrated whole. To demonstrate the expressivity and applicability of our
framework, we highlight how sensitive these descriptors are to different models
and a variety of datasets. As a showcase application to a downstream task, we
use zigzag persistence to establish a criterion for layer pruning, achieving
results comparable to state-of-the-art methods while preserving the
system-level perspective.
comment: 10+6 pages, 7 figures, 1 table. Accepted as poster at ICML 2025
♻ ☆ Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Existing large language models (LLMs) face challenges of following complex
instructions, especially when multiple constraints are present and organized in
paralleling, chaining, and branching structures. One intuitive solution, namely
chain-of-thought (CoT), is expected to universally improve capabilities of
LLMs. However, we find that the vanilla CoT exerts a negative impact on
performance due to its superficial reasoning pattern of simply paraphrasing the
instructions. It fails to peel back the compositions of constraints for
identifying their relationship across hierarchies of types and dimensions. To
this end, we propose a systematic method to boost LLMs in dealing with complex
instructions via incentivizing reasoning for test-time compute scaling. First,
we start from the decomposition of complex instructions under existing
taxonomies and propose a reproducible data acquisition method. Second, we
exploit reinforcement learning (RL) with verifiable rule-centric reward signals
to cultivate reasoning specifically for instruction following. We address the
shallow, non-essential nature of reasoning under complex instructions via
sample-wise contrast for superior CoT enforcement. We also exploit behavior
cloning of experts to facilitate steady distribution shift from fast-thinking
LLMs to skillful reasoners. Extensive evaluations on seven comprehensive
benchmarks confirm the validity of the proposed method, where a 1.5B LLM
achieves 11.74% gains with performance comparable to an 8B LLM. Code and data
are available at https://github.com/yuleiqin/RAIF.
comment: 13 pages of main body, 3 tables, 5 figures, 45 pages of appendix
♻ ☆ PRSA: Prompt Stealing Attacks against Real-World Prompt Services USENIX Security 2025
Yong Yang, Changjiang Li, Qingming Li, Oubo Ma, Haoyu Wang, Zonghui Wang, Yandong Gao, Wenzhi Chen, Shouling Ji
Recently, large language models (LLMs) have garnered widespread attention for
their exceptional capabilities. Prompts are central to the functionality and
performance of LLMs, making them highly valuable assets. The increasing
reliance on high-quality prompts has driven significant growth in prompt
services. However, this growth also expands the potential for prompt leakage,
increasing the risk that attackers could replicate original functionalities,
create competing products, and severely infringe on developers' intellectual
property. Despite these risks, prompt leakage in real-world prompt services
remains underexplored.
In this paper, we present PRSA, a practical attack framework designed for
prompt stealing. PRSA infers the detailed intent of prompts through very
limited input-output analysis and can successfully generate stolen prompts that
replicate the original functionality. Extensive evaluations demonstrate PRSA's
effectiveness across two main types of real-world prompt services.
Specifically, compared to previous works, it improves the attack success rate
from 17.8% to 46.1% in prompt marketplaces and from 39% to 52% in LLM
application stores, respectively. Notably, in the attack on "Math", one of the
most popular educational applications in OpenAI's GPT Store with over 1 million
conversations, PRSA uncovered a hidden Easter egg that had not been revealed
previously. Besides, our analysis reveals that higher mutual information
between a prompt and its output correlates with an increased risk of leakage.
This insight guides the design and evaluation of two potential defenses against
the security threats posed by PRSA. We have reported these findings to the
prompt service vendors, including PromptBase and OpenAI, and actively
collaborate with them to implement defensive measures.
comment: This is the extended version of the paper accepted at the 34th USENIX
Security Symposium (USENIX Security 2025)
♻ ☆ FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems ICML 2025
Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba
Retrieval-augmented generation (RAG) systems have been shown to be effective
in addressing many of the drawbacks of relying solely on the parametric memory
of large language models. Recent work has demonstrated that RAG systems can be
improved via fine-tuning of their retriever and generator models. In this work,
we introduce FedRAG, a framework for fine-tuning RAG systems across centralized
and federated architectures. FedRAG supports state-of-the-art fine-tuning
methods, offering a simple and intuitive interface and a seamless conversion
from centralized to federated training tasks. FedRAG is also deeply integrated
with the modern RAG ecosystem, filling a critical gap in available tools.
comment: 9 pages, 4 figures, 2 tables. Accepted for the CODEML Workshop at
ICML 2025. Framework code available at
https://github.com/VectorInstitute/fed-rag
♻ ☆ SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models
Large language models (LLMs) have been widely adopted due to their remarkable
performance across various applications, driving the accelerated development of
a large number of diverse models. However, these individual LLMs show
limitations in generalization and performance on complex tasks due to inherent
training biases, model size constraints, and the quality or diversity of
pre-training datasets. A promising direction is to efficiently harness the
diverse capabilities of LLMs to overcome these individual limitations. To
address these limitations, we introduce a novel LLM selection algorithm called
SelectLLM, which efficiently directs input queries to the most suitable subset
of LLMs from a large pool, ensuring that the selected models collectively
provide accurate responses. SelectLLM employs a multi-label classifier and
policy based on the classifier's predictions and confidence scores in selecting
an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate
that the proposed model outperforms existing ensemble-based baselines and
achieves competitive performance with similarly sized top-performing LLMs while
maintaining efficiency. Specifically, it achieves a huge reduction in inference
latency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on
MMLU, compared to the top-performing baseline. Also, we establish a theoretical
upper bound by an Oracle with LLMs and perform an in-depth linguistic analysis
to understand the performance gap between the Oracle and SelectLLM.
comment: 9 pages
♻ ☆ Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models
Large Language Models (LLMs) have demonstrated the capability of generating
free-text self Natural Language Explanations (self-NLE) to justify their
answers. Despite their logical appearance, self-NLE do not necessarily reflect
the LLM's actual decision-making process, making such explanations unfaithful.
While existing methods for measuring self-NLE faithfulness mostly rely on
behavioral tests or computational block identification, none of them examines
the neural activity underlying the model's reasoning. This work introduces a
novel flexible framework for quantitatively measuring the faithfulness of
LLM-generated self-NLE by directly comparing the latter with interpretations of
the model's internal hidden states. The proposed framework is versatile and
provides deep insights into self-NLE faithfulness by establishing a direct
connection between self-NLE and model reasoning. This approach advances the
understanding of self-NLE faithfulness and provides building blocks for
generating more faithful self-NLE.
♻ ☆ CoRT: Code-integrated Reasoning within Thinking
Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable
progress in natural language reasoning with long chain-of-thought (CoT), yet
they remain inefficient or inaccurate when handling complex mathematical
operations. Addressing these limitations through computational tools (e.g.,
computation libraries and symbolic solvers) is promising, but it introduces a
technical challenge: Code Interpreter (CI) brings external knowledge beyond the
model's internal text representations, thus the direct combination is not
efficient. This paper introduces CoRT, a post-training framework for teaching
LRMs to leverage CI effectively and efficiently. As a first step, we address
the data scarcity issue by synthesizing code-integrated reasoning data through
Hint-Engineering, which strategically inserts different hints at appropriate
positions to optimize LRM-CI interaction. We manually create 30 high-quality
samples, upon which we post-train models ranging from 1.5B to 32B parameters,
with supervised fine-tuning, rejection fine-tuning and reinforcement learning.
Our experimental results demonstrate that Hint-Engineering models achieve 4%
and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and
DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging
mathematical reasoning datasets. Furthermore, Hint-Engineering models use about
30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model
compared with the natural language models. The models and code are available at
https://github.com/ChengpengLi1003/CoRT.
comment: work in progress
♻ ☆ Identifying Reliable Evaluation Metrics for Scientific Text Revision ACL 2025
Evaluating text revision in scientific writing remains a challenge, as
traditional metrics such as ROUGE and BERTScore primarily focus on similarity
rather than capturing meaningful improvements. In this work, we analyse and
identify the limitations of these metrics and explore alternative evaluation
methods that better align with human judgments. We first conduct a manual
annotation study to assess the quality of different revisions. Then, we
investigate reference-free evaluation metrics from related NLP domains.
Additionally, we examine LLM-as-a-judge approaches, analysing their ability to
assess revisions with and without a gold reference. Our results show that LLMs
effectively assess instruction-following but struggle with correctness, while
domain-specific metrics provide complementary insights. We find that a hybrid
approach combining LLM-as-a-judge evaluation and task-specific metrics offers
the most reliable assessment of revision quality.
comment: V3 contains only the English version, accepted to ACL 2025 main (26
pages). V2 contains both English (ACL 2025) and French (TALN 2025) versions
(58 pages)
♻ ☆ ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization ICML 2025
We introduce ConfPO, a method for preference learning in Large Language
Models (LLMs) that identifies and optimizes preference-critical tokens based
solely on the training policy's confidence, without requiring any auxiliary
models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as
Direct Preference Optimization (DPO), which uniformly adjust all token
probabilities regardless of their relevance to preference, ConfPO focuses
optimization on the most impactful tokens. This targeted approach improves
alignment quality while mitigating overoptimization (i.e., reward hacking) by
using the KL divergence budget more efficiently. In contrast to recent
token-level methods that rely on credit-assignment models or AI annotators,
raising concerns about scalability and reliability, ConfPO is simple,
lightweight, and model-free. Experimental results on challenging alignment
benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO
consistently outperforms uniform DAAs across various LLMs, delivering better
alignment with zero additional computational overhead.
comment: ICML 2025
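The confidence-based token selection described in the ConfPO abstract above can be illustrated with a minimal sketch. The thresholding rule used here (a token is treated as preference-critical when its policy probability is below the sequence mean) and the function name `select_low_confidence_tokens` are assumptions made for illustration; the abstract does not state how the threshold is chosen.

```python
from typing import List


def select_low_confidence_tokens(token_probs: List[float]) -> List[bool]:
    """Return a mask that is True for tokens the policy is least confident about.

    Assumption (not from the abstract): a token is selected when its
    probability under the current policy is below the sequence mean.
    """
    if not token_probs:
        return []
    mean_prob = sum(token_probs) / len(token_probs)
    return [p < mean_prob for p in token_probs]


# Hypothetical per-token probabilities of a response under the policy.
probs = [0.9, 0.2, 0.8, 0.1]
mask = select_low_confidence_tokens(probs)
```

In a preference-optimization loss, such a mask would restrict the per-token log-probability terms to the selected positions, and since it is computed from the policy's own probabilities it needs no auxiliary model.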
♻ ☆ IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling CoNLL 2025
In this paper, we introduce two resources: (i) G2P+, a tool for converting
orthographic datasets to a consistent phonemic representation; and (ii) IPA
CHILDES, a phonemic dataset of child-centered speech across 31 languages. Prior
tools for grapheme-to-phoneme conversion result in phonemic vocabularies that
are inconsistent with established phonemic inventories, an issue which G2P+
addresses by leveraging the inventories in the Phoible database. Using this
tool, we augment CHILDES with phonemic transcriptions to produce IPA CHILDES.
This new resource fills several gaps in existing phonemic datasets, which often
lack multilingual coverage, spontaneous speech, and a focus on child-directed
language. We demonstrate the utility of this dataset for phonological research
by training phoneme language models on 11 languages and probing them for
distinctive features, finding that the distributional properties of phonemes
are sufficient to learn major class and place features cross-lingually.
comment: Accepted to CoNLL 2025
♻ ☆ Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges ACL 2025
Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, Barbara Plank
Understanding pragmatics, the use of language in context, is crucial for
developing NLP systems capable of interpreting nuanced language use. Despite
recent advances in language technologies, including large language models,
evaluating their ability to handle pragmatic phenomena such as implicatures and
references remains challenging. To advance pragmatic abilities in models, it is
essential to understand current evaluation trends and identify existing
limitations. In this survey, we provide a comprehensive review of resources
designed for evaluating pragmatic capabilities in NLP, categorizing datasets by
the pragmatic phenomena they address. We analyze task designs, data collection
methods, evaluation approaches, and their relevance to real-world applications.
By examining these resources in the context of modern language models, we
highlight emerging trends, challenges, and gaps in existing benchmarks. Our
survey aims to clarify the landscape of pragmatic evaluation and guide the
development of more comprehensive and targeted benchmarks, ultimately
contributing to more nuanced and context-aware NLP models.
comment: ACL 2025
♻ ☆ BabyLM's First Words: Word Segmentation as a Phonological Probing Task CoNLL 2025
Language models provide a key framework for studying linguistic theories
based on prediction, but phonological analysis using large language models
(LLMs) is difficult; there are few phonological benchmarks beyond English and
the standard input representation used in LLMs (subwords of graphemes) is not
suitable for analyzing the representation of phonemes. In this work, we
demonstrate how word segmentation can be used as a phonological probing task,
allowing us to study the representations learned by phoneme-based language
models trained on child-directed speech across 31 languages. Following
computational models of word segmentation, we present unsupervised methods for
extracting word boundaries from a trained model using the observation that
prediction-error peaks at the start of words. We also use linear probes to
identify that these models implicitly track word boundaries, even when they do
not appear in training. This cross-lingual work corroborates statistical
learning theories of acquisition and empirically motivates new methods for
training subword tokenizers.
comment: Accepted to CoNLL 2025
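The observation in the abstract above, that prediction error peaks at the start of words, suggests a simple segmentation rule. The sketch below is a hedged illustration only: it assumes per-phoneme surprisal values are already available from some model, and it places a boundary before every interior local maximum. The function name, the phoneme labels, and the numbers are hypothetical, and the paper's actual methods may differ.

```python
from typing import List


def segment_by_error_peaks(
    phonemes: List[str], surprisal: List[float]
) -> List[List[str]]:
    """Split a phoneme sequence into words at local peaks in prediction error.

    A boundary is placed before position i when surprisal[i] is strictly
    greater than both of its neighbours (interior positions only).
    """
    if len(phonemes) != len(surprisal):
        raise ValueError("phonemes and surprisal must have the same length")
    words: List[List[str]] = []
    current: List[str] = []
    for i, phoneme in enumerate(phonemes):
        is_peak = (
            0 < i < len(phonemes) - 1
            and surprisal[i] > surprisal[i - 1]
            and surprisal[i] > surprisal[i + 1]
        )
        if is_peak and current:
            words.append(current)
            current = []
        current.append(phoneme)
    if current:
        words.append(current)
    return words


# Hypothetical utterance "the dog" with made-up surprisal values.
utterance = ["DH", "AH", "D", "AO", "G"]
errors = [3.0, 1.0, 4.0, 1.5, 0.5]
segments = segment_by_error_peaks(utterance, errors)
```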
♻ ☆ Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets AAAI
The rise of online platforms has exacerbated the spread of hate speech, demanding
scalable and effective detection. However, the accuracy of hate speech
detection systems heavily relies on human-labeled data, which is inherently
susceptible to biases. While previous work has examined the issue, the
interplay between the characteristics of the annotator and those of the target
of the hate is still unexplored. We fill this gap by leveraging an extensive
dataset with rich socio-demographic information of both annotators and targets,
uncovering how human biases manifest in relation to the target's attributes.
Our analysis surfaces the presence of widespread biases, which we
quantitatively describe and characterize based on their intensity and
prevalence, revealing marked differences. Furthermore, we compare human biases
with those exhibited by persona-based LLMs. Our findings indicate that while
persona-based LLMs do exhibit biases, these differ significantly from those of
human annotators. Overall, our work offers new and nuanced results on human
biases in hate speech annotations, as well as fresh insights into the design of
AI-driven hate speech detection systems.
comment: Article published in ICWSM'25 - 19th AAAI Conference on Web and
Social Media. Please, cite the published version
♻ ☆ Reinforcing Multimodal Understanding and Generation with Dual Self-rewards
Building upon large language models (LLMs), recent large multimodal models
(LMMs) unify cross-modal understanding and generation into a single framework.
However, LMMs still struggle to achieve accurate image-text alignment and are prone to
generating text responses contradicting the visual input or failing to follow
the text-to-image prompts. Current solutions require external supervision
(e.g., human feedback or reward models) and only address unidirectional
tasks, either understanding or generation. In this work, based on the
observation that understanding and generation are inverse dual tasks, we
introduce a self-supervised dual reward mechanism to reinforce the
understanding and generation capabilities of LMMs. Specifically, we sample
multiple outputs for a given input in one task domain, then reverse the
input-output pairs to compute the dual likelihood of the model as self-rewards
for optimization. Extensive experimental results on visual understanding and
generation benchmarks demonstrate that our method can effectively enhance the
performance of the model without any external supervision, especially achieving
remarkable improvements in text-to-image tasks.
♻ ☆ Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models
Recent copyright agreements between AI companies and content creators
underscore the need for fine-grained control over language models' ability to
reproduce copyrighted text. Existing defenses, ranging from aggressive
unlearning to simplistic output filters, either sacrifice model utility or
inadequately address verbatim leakage. We introduce Obliviate, a lightweight
post-training method that surgically suppresses exact reproduction of specified
sequences while preserving semantic understanding. Obliviate first identifies
memorized passages and then, for each target token, minimally adjusts the
model's output distribution via a Kullback-Leibler divergence penalty to drive
down the probability of exact reproduction. Simultaneously, we enforce a
consistency loss on non-target tokens to retain the model's fluency and task
performance. We evaluate Obliviate on four popular 6-8B-parameter models
(LLaMA-3.1, LLaMA-3.1-Instruct, Qwen-2.5, and Yi-1.5) using synthetic
memorization benchmarks and organic copyrighted excerpts (e.g., Moby Dick,
Frankenstein, Alice in Wonderland and Les Miserables). Across all settings,
Obliviate reduces verbatim recall by two orders of magnitude (e.g., from
hundreds of words to fewer than 12) while degrading downstream accuracy by at
most 1% on HellaSwag, MMLU, TruthfulQA, and Winogrande. Furthermore, we
benchmark Obliviate against different unlearning and copyright techniques using
the MUSE and CoTaEval benchmarks. These results position Obliviate as a
practical, high-fidelity solution for copyright compliance in deployed LLMs.
♻ ☆ Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of
thought (CoT), a sequence of reasoning steps that the model supposedly used to
produce its prediction. Despite much work on CoT prompting, it is unclear if
reasoning verbalized in a CoT is faithful to the models' parametric beliefs. We
introduce a framework for measuring parametric faithfulness of generated
reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an
instance of this framework. FUR erases information contained in reasoning steps
from model parameters, and measures faithfulness as the resulting effect on the
model's prediction. Our experiments with four LMs and five multi-hop
multi-choice question answering (MCQA) datasets show that FUR is frequently
able to precisely change the underlying models' prediction for a given instance
by unlearning key steps, indicating when a CoT is parametrically faithful.
Further analysis shows that CoTs generated by models post-unlearning support
different answers, hinting at a deeper effect of unlearning.
♻ ☆ Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics
Large language models (LLMs) make it easy to rewrite a text in any style --
e.g. to make it more polite, persuasive, or more positive -- but evaluation
thereof is not straightforward. A challenge lies in measuring content
preservation, that is, whether content not attributable to style change is retained. This
paper presents a large meta-evaluation of metrics for evaluating style and
attribute transfer, focusing on content preservation. We find that
meta-evaluation studies on existing datasets lead to misleading conclusions
about the suitability of metrics for content preservation. Widely used metrics
show a high correlation with human judgments despite being deemed unsuitable
for the task -- because they do not abstract from style changes when evaluating
content preservation. We show that the overly high correlations with human
judgment stem from the nature of the test data. To address this issue, we
introduce a new, challenging test set specifically designed for evaluating
content preservation metrics for style transfer. Using this dataset, we
demonstrate that suitable metrics for content preservation for style transfer
indeed are style-aware. To support efficient evaluation, we propose a new
style-aware method that utilises small language models, obtaining a higher
alignment with human judgements than prompting a model of a similar size as an
autorater.
♻ ☆ TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding
Transformers exhibit proficiency in capturing long-range dependencies,
whereas State Space Models (SSMs) facilitate linear-time sequence modeling.
Notwithstanding their synergistic potential, the integration of these
architectures presents a significant challenge, primarily attributable to a
fundamental incongruity in their respective positional encoding mechanisms:
Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs
leverage implicit positional representations via convolutions. This divergence
often precipitates discontinuities and suboptimal performance. To address this
impediment, we propose a unified rotary position embedding (Unified RoPE)
methodology, thereby establishing a consistent positional encoding framework
for both self-attention and state-space components. Using this Unified RoPE, we
introduce TransXSSM, a hybrid architecture that coherently integrates the
Transformer and SSM layers under this unified positional encoding scheme. At a
4K sequence length, TransXSSM exhibits training and inference speeds that are
42.3% and 29.5% faster, respectively, relative to standard Transformer
models. It also delivers higher accuracy: under comparable settings, it
surpasses a Transformer baseline by over 4% on language modeling
benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains
7.22% in average accuracy over its 320M version (versus about 6% gains for
equivalent Transformers or SSMs). Our results show that unified positional
encoding resolves positional incompatibility in hybrid models, enabling
efficient, high-performance long-context modeling.
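For readers unfamiliar with the rotary position embedding mentioned in the abstract above, here is a minimal sketch of standard RoPE as commonly applied to attention queries and keys. It is not the paper's Unified RoPE for state-space components, which the abstract does not detail. The function names and the `base` default of 10000 are assumptions taken from the usual formulation.

```python
import math
from typing import List


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply standard rotary position embedding to one even-length vector.

    Each consecutive pair (x[i], x[i+1]) is rotated by the angle
    position * base ** (-i / d), where d is the vector dimension.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical query and key vectors.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_a = dot(rope(q, 5), rope(k, 3))    # relative offset of 2
score_b = dot(rope(q, 12), rope(k, 10))  # same relative offset
```

The two scores agree up to floating-point error because the rotated dot product depends only on the relative offset between positions. That property is what makes a single rotary scheme attractive as a shared positional signal across layer types.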
♻ ☆ Towards Large Language Models with Self-Consistent Natural Language Explanations
Large language models (LLMs) seem to offer an easy path to interpretability:
just ask them to explain their decisions. Yet, studies show that these post-hoc
explanations often misrepresent the true decision process, as revealed by
mismatches in feature importance. Despite growing evidence of this
inconsistency, no systematic solutions have emerged, partly due to the high
cost of estimating feature importance, which limits evaluations to small
datasets. To address this, we introduce the Post-hoc Self-Consistency Bank
(PSCB), a large-scale benchmark of decisions spanning diverse tasks and
models, each paired with LLM-generated explanations and corresponding feature
importance scores. Analysis of PSCB reveals that self-consistency scores barely
differ between correct and incorrect predictions. We also show that the
standard metric fails to meaningfully distinguish between explanations. To
overcome this limitation, we propose an alternative metric that more
effectively captures variation in explanation quality. We use it to fine-tune
LLMs via Direct Preference Optimization (DPO), leading to significantly better
alignment between explanations and decision-relevant features, even under
domain shift. Our findings point to a scalable path toward more trustworthy,
self-consistent LLMs.
♻ ☆ IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language
Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya
Hate speech poses a significant threat to social harmony. Over the past two
years, Indonesia has seen a ten-fold increase in the online hate speech ratio,
underscoring the urgent need for effective detection mechanisms. However,
progress is hindered by the limited availability of labeled data for Indonesian
texts. The condition is even worse for marginalized minorities, such as Shia,
LGBTQ, and other ethnic minorities because hate speech is underreported and
less understood by detection tools. Furthermore, the lack of accommodation for
subjectivity in current datasets compounds this issue. To address this, we
introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity
classification dataset. Comprising 43,692 entries annotated by 19 diverse
individuals, the dataset focuses on texts targeting vulnerable groups in
Indonesia, specifically during the hottest political event in the country: the
presidential election. We establish baselines for seven binary classification
tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet)
fine-tuned for hate speech classification. Furthermore, we demonstrate how
incorporating demographic information can enhance the zero-shot performance of
the large language model, gpt-3.5-turbo. However, we also caution that an
overemphasis on demographic information can negatively impact the fine-tuned
model performance due to data fragmentation.
comment: This work has been substantially expanded and finalized as
IndoDiscourse (see [https://huggingface.co/datasets/Exqrch/IndoDiscourse]).
IndoToxic should be considered a draft/precursor version and is no longer
maintained
♻ ☆ VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal
understanding by incorporating finer-grained visual perception and encoding.
However, such methods incur significant computational costs due to longer
visual token sequences, posing challenges for real-time deployment. To mitigate
this, prior studies have explored pruning unimportant visual tokens either at
the output layer of the visual encoder or at the early layers of the language
model. In this work, we revisit these design choices and reassess their
effectiveness through comprehensive empirical studies of how visual tokens are
processed throughout the visual encoding and language decoding stages. Guided
by these insights, we propose VScan, a two-stage visual token reduction
framework that addresses token redundancy by: (1) integrating complementary
global and local scans with token merging during visual encoding, and (2)
introducing pruning at intermediate layers of the language model. Extensive
experimental results across four LVLMs validate the effectiveness of VScan in
accelerating inference and demonstrate its superior performance over current
state-of-the-art methods on sixteen benchmarks. Notably, when applied to
LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a
10× reduction in FLOPs, while retaining 95.4% of the original
performance. Code is available at
https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
comment: Changes from v1: Uploaded code link and fixed minor typos
♻ ☆ Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations ACL
Measuring social bias in large language models (LLMs) is crucial, but
existing bias evaluation methods struggle to assess bias in long-form
generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of
the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form
generation by having LLMs generate continuations of story prompts. Building our
benchmark in English and Korean, we measure the probability of neutral and
biased generations across ten LLMs. We also compare our long-form story
generation evaluation results with multiple-choice BBQ evaluation, showing that
the two approaches produce inconsistent results.
comment: ACL-Findings 2025
♻ ☆ CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning
Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou
Large language models (LLMs) have recently demonstrated promising
capabilities in chemistry tasks while still facing challenges due to outdated
pretraining knowledge and the difficulty of incorporating specialized chemical
expertise. To address these issues, we propose an LLM-based agent that
synergistically integrates 137 external chemical tools, ranging from
basic information retrieval to complex reaction predictions, and a dataset
curation pipeline to generate the dataset ChemToolBench that facilitates both
effective tool selection and precise parameter filling during fine-tuning and
evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search
(HE-MCTS) framework, enabling independent optimization of tool planning and
execution. By leveraging self-generated data, our approach supports step-level
fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM
that surpass GPT-4o. Experimental evaluations demonstrate that our approach
significantly improves performance in Chemistry QA and discovery tasks,
offering a robust solution to integrate specialized tools with LLMs for
advanced chemical applications. All datasets and code are available at
https://github.com/AI4Chem/ChemistryAgent .
comment: 15 pages, 6 figures
♻ ☆ ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion
Knowledge graphs often suffer from incompleteness issues, which can be
alleviated through information completion. However, current state-of-the-art
deep knowledge convolutional embedding models rely on external convolution
kernels and conventional convolution processes, which limits the feature
interaction capability of the model. This paper introduces a novel dynamic
convolutional embedding model, ConvD, which directly reshapes relation
embeddings into multiple internal convolution kernels. This approach
effectively enhances the feature interactions between relation embeddings and
entity embeddings. Simultaneously, we incorporate a priori knowledge-optimized
attention mechanism that assigns different contribution weight coefficients to
the multiple relation convolution kernels in dynamic convolution, further
boosting the expressive power of the model. Extensive experiments on various
datasets show that our proposed model consistently outperforms the
state-of-the-art baseline methods, with average improvements ranging from 3.28%
to 14.69% across all model evaluation metrics, while the number of parameters
is reduced by 50.66% to 85.40% compared to other state-of-the-art models.
♻ ☆ iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL 2025
While Large Language Models (LLMs) excel at many natural language processing
tasks, they often suffer from factual inaccuracies in knowledge-intensive
scenarios. Integrating external knowledge resources, particularly knowledge
graphs (KGs), provides a transparent and updatable foundation for more reliable
reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons
over KGs, is central to this effort, especially for complex, multi-hop queries.
However, multi-hop reasoning poses two key challenges: (1) maintaining coherent
reasoning paths, and (2) avoiding prematurely discarding critical multi-hop
connections. To address these issues, we introduce iQUEST, a question-guided
KBQA framework that iteratively decomposes complex queries into simpler
sub-questions, ensuring a structured and focused reasoning trajectory.
Additionally, we integrate a Graph Neural Network (GNN) to look ahead and
incorporate 2-hop neighbor information at each reasoning step. This dual
approach strengthens the reasoning process, enabling the model to explore
viable paths more effectively. Detailed experiments demonstrate the consistent
improvement delivered by iQUEST across four benchmark datasets and four LLMs.
comment: Accepted to the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025), Main Track
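To make the "2-hop neighbor information" in the abstract above concrete, the sketch below enumerates the entities exactly two edges away from a starting entity in a toy knowledge graph. It is only an illustration: the abstract describes a GNN that incorporates this information, whereas this code merely lists the candidates. The adjacency-dict representation, the function name, and the example graph are all hypothetical.

```python
from typing import Dict, List, Set


def two_hop_neighbors(graph: Dict[str, List[str]], entity: str) -> Set[str]:
    """Return entities reachable in exactly two hops, excluding 1-hop ones."""
    one_hop = set(graph.get(entity, []))
    two_hop: Set[str] = set()
    for neighbor in one_hop:
        two_hop.update(graph.get(neighbor, []))
    # Drop the start entity and anything already reachable in one hop.
    return two_hop - one_hop - {entity}


# Toy knowledge graph as an adjacency list (relation labels omitted).
toy_kg = {
    "Paris": ["France"],
    "France": ["Europe", "Paris"],
    "Europe": [],
}
lookahead = two_hop_neighbors(toy_kg, "Paris")
```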
♻ ☆ AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Vision-Language Models (VLMs) show promise for autonomous driving, yet their
struggle with hallucinations, inefficient reasoning, and limited real-world
validation hinders accurate perception and robust step-by-step reasoning. To
overcome this, we introduce AgentThink, a pioneering unified framework that,
for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic,
agent-style tool invocation for autonomous driving tasks. AgentThink's core
innovations include: (i) Structured Data Generation, by establishing an
autonomous driving tool library to automatically construct structured,
self-verified reasoning data explicitly incorporating tool usage for diverse
driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised
Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs
with the capability for autonomous tool invocation; and (iii) Agent-style
Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to
rigorously evaluate the model's tool invocation and utilization. Experiments on
the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall
reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while
markedly improving reasoning quality and consistency. Furthermore, ablation
studies and robust zero-shot/few-shot generalization experiments across various
benchmarks underscore its powerful capabilities. These findings highlight a
promising trajectory for developing trustworthy and tool-aware autonomous
driving models.
comment: 18 pages, 8 figures
♻ ☆ On Many-Shot In-Context Learning for Long-Context Evaluation ACL 2025
Many-shot in-context learning (ICL) has emerged as a unique setup to both
utilize and test the ability of large language models to handle long context.
This paper delves into long-context language model (LCLM) evaluation through
many-shot ICL. We first ask: what types of ICL tasks benefit from additional
demonstrations, and how effective are they in evaluating LCLMs? We find that
classification and summarization tasks show performance improvements with
additional demonstrations, while translation and reasoning tasks do not exhibit
clear trends. Next, we investigate the extent to which different tasks
necessitate retrieval versus global context understanding. We develop metrics
to categorize ICL tasks into two groups: (i) similar-sample learning (SSL):
tasks where retrieval of the most similar examples is sufficient for good
performance, and (ii) all-sample learning (ASL): tasks that necessitate a
deeper comprehension of all examples in the prompt. Lastly, we introduce a new
many-shot ICL benchmark, MANYICLBENCH, to characterize a model's ability on both
fronts and benchmark 12 LCLMs using MANYICLBENCH. We find that while
state-of-the-art models demonstrate good performance up to 64k tokens in SSL
tasks, many models experience significant performance drops at only 16k tokens
in ASL tasks.
comment: ACL 2025 Main Conference
♻ ☆ A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
Reinforcement learning (RL) has become a prevailing approach for fine-tuning
large language models (LLMs) on complex reasoning tasks. Among recent methods,
GRPO stands out for its empirical success in training models such as
DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In
this work, we revisit GRPO from a reinforce-like algorithm perspective and
analyze its core components. Surprisingly, we find that a simple rejection
sampling baseline, RAFT, which trains only on positively rewarded samples,
yields performance competitive with GRPO and PPO. Our ablation studies reveal
that GRPO's main advantage arises from discarding prompts with entirely
incorrect responses, rather than from its reward normalization. Motivated by
this insight, we propose Reinforce-Rej, a minimal extension of policy gradient
that filters both entirely incorrect and entirely correct samples.
Reinforce-Rej improves KL efficiency and stability, serving as a lightweight
yet effective alternative to more complex RL algorithms. We advocate RAFT as a
robust and interpretable baseline, and suggest that future advances should
focus on more principled designs for incorporating negative samples, rather
than relying on them indiscriminately. Our findings provide guidance for future
work in reward-based LLM post-training.
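The filtering step the abstract above attributes to Reinforce-Rej (dropping prompts whose sampled responses are entirely incorrect or entirely correct) can be sketched as follows. This is a simplified illustration, assuming binary 0/1 rewards and a plain dict of sampled rewards per prompt. The function name and data layout are hypothetical and not taken from the paper's code.

```python
from typing import Dict, List


def filter_prompts(rewards_by_prompt: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Keep only prompts whose sampled responses have mixed rewards.

    Assumption: rewards are binary, 1 for a correct response and 0 otherwise.
    """
    kept: Dict[str, List[int]] = {}
    for prompt, rewards in rewards_by_prompt.items():
        if not rewards:
            continue  # nothing sampled for this prompt
        all_correct = all(r == 1 for r in rewards)
        all_incorrect = all(r == 0 for r in rewards)
        if not (all_correct or all_incorrect):
            kept[prompt] = rewards
    return kept


# Hypothetical rewards for four sampled responses per prompt.
batch = {
    "p1": [1, 1, 1, 1],  # entirely correct: filtered out
    "p2": [0, 0, 0, 0],  # entirely incorrect: filtered out
    "p3": [1, 0, 0, 1],  # mixed: kept
}
filtered = filter_prompts(batch)
```

Only the mixed-reward prompt survives, so the policy-gradient update is computed on prompts where the samples actually disagree.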
♻ ☆ CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models
Large language models (LLMs) have become mainstream methods in the field of
sarcasm detection. However, existing LLM methods face challenges in irony
detection, including: 1. single-perspective limitations, 2. insufficient
comprehensive understanding, and 3. lack of interpretability. This paper
introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven
multi-agent system designed to overcome these issues. CAF-I employs specialized
agents for Context, Semantics, and Rhetoric, which perform multidimensional
analysis and engage in interactive collaborative optimization. A Decision Agent
then consolidates these perspectives, with a Refinement Evaluator Agent
providing conditional feedback for optimization. Experiments on benchmark
datasets establish CAF-I's state-of-the-art zero-shot performance. Achieving
SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of
76.31, a 4.98 absolute improvement over the strongest prior baseline. This
success is attained by its effective simulation of human-like multi-perspective
analysis, enhancing detection accuracy and interpretability.
♻ ☆ Improving Fairness of Large Language Models in Multi-document Summarization ACL 2025
Fairness in multi-document summarization (MDS) is crucial for providing
comprehensive views across documents with diverse social attribute values,
which can significantly impact decision-making. For example, a summarization
system that tends to overrepresent negative reviews of products can mislead
customers into disregarding good products. Previous works measure fairness in
MDS at two levels: summary-level and corpus-level. While summary-level fairness
focuses on individual summaries, corpus-level fairness focuses on a corpus of
summaries. Recent methods primarily focus on summary-level fairness. We propose
FairPO, a preference tuning method that focuses on both summary-level and
corpus-level fairness in MDS. To improve summary-level fairness, we propose to
generate preference pairs by perturbing document sets. To improve corpus-level
fairness, we propose fairness-aware preference tuning by dynamically adjusting
the weights of preference pairs. Our experiments show that FairPO outperforms
strong baselines while maintaining the critical qualities of summaries. The
code is available at https://github.com/leehaoyuan/coverage_fairnes.
comment: Accepted to ACL 2025 main
♻ ☆ SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Tianyu Shi
Large Language Models (LLMs) can generate creative and engaging narratives
from user-specified input, but maintaining coherence and emotional depth
throughout these AI-generated stories remains a challenge. In this work, we
propose SCORE, a framework for Story Coherence and Retrieval Enhancement,
designed to detect and resolve narrative inconsistencies. By tracking key item
statuses and generating episode summaries, SCORE uses a Retrieval-Augmented
Generation (RAG) approach, incorporating TF-IDF and cosine similarity to
identify related episodes and enhance the overall story structure. Results from
testing multiple LLM-generated stories demonstrate that SCORE significantly
improves the consistency and stability of narrative coherence compared to
baseline GPT models, providing a more robust method for evaluating and refining
AI-generated narratives.
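
The retrieval step mentioned in the SCORE abstract (TF-IDF plus cosine similarity to find related episodes) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names `tfidf_vectors`, `cosine`, and `related_episodes` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive for terms found in every document.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_episodes(summaries, query_index, top_k=2):
    """Indices of the top_k episode summaries most similar to the query episode."""
    vectors = tfidf_vectors(summaries)
    query = vectors[query_index]
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors) if i != query_index]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]


# Hypothetical episode summaries, for illustration only.
summaries = [
    "the knight loses the silver key in the forest",
    "a merchant sells bread in the town market",
    "the knight finds the silver key again",
]
closest = related_episodes(summaries, query_index=2, top_k=1)
```

In this toy run, the episode about finding the key retrieves the earlier episode about losing it, which is the kind of link a consistency check over item statuses would need.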
♻ ☆ The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
AI agents are increasingly used in consumer-facing applications to assist
with tasks such as product search, negotiation, and transaction execution. In
this paper, we explore a future scenario where both consumers and merchants
authorize AI agents to fully automate negotiations and transactions. We aim to
answer two key questions: (1) Do different LLM agents vary in their ability to
secure favorable deals for users? (2) What risks arise from fully automating
deal-making with AI agents in consumer markets? To address these questions, we
develop an experimental framework that evaluates the performance of various LLM
agents in real-world negotiation and transaction settings. Our findings reveal
that AI-mediated deal-making is an inherently imbalanced game -- different
agents achieve significantly different outcomes for their users. Moreover,
behavioral anomalies in LLMs can result in financial losses for both consumers
and merchants, such as overspending or accepting unreasonable deals. These
results underscore that while automation can improve efficiency, it also
introduces substantial risks. Users should exercise caution when delegating
business decisions to AI agents.
♻ ☆ Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs ICLR 2025
Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang
How to align large language models (LLMs) with user preferences from a static
general dataset has been frequently studied. However, user preferences are
usually personalized, changing, and diverse regarding culture, values, or time.
This leads to the problem that the actual user preferences often do not
coincide with those trained by the model developers in the practical use of
LLMs. Since we cannot collect enough data and retrain for every demand,
researching efficient real-time preference adaptation methods based on the
backbone LLMs during test time is important. To this end, we introduce Amulet,
a novel, training-free framework that formulates the decoding process of every
token as a separate online learning problem with the guidance of simple
user-provided prompts, thus enabling real-time optimization to satisfy users'
personalized preferences. To reduce the computational cost brought by this
optimization process for each token, we additionally provide a closed-form
solution for each iteration step of the optimization process, thereby reducing
the computational time cost to a negligible level. The detailed experimental
results demonstrate that Amulet can achieve significant performance
improvements in rich settings with combinations of different LLMs, datasets,
and user preferences, while maintaining acceptable computational efficiency.
comment: Accepted by ICLR 2025, Project page:
https://zowiezhang.github.io/projects/Amulet
♻ ☆ Benchmarking LLMs for Environmental Review and Permitting
Rounak Meyur, Hung Phan, Koby Hayashi, Ian Stewart, Shivam Sharma, Sarthak Chaturvedi, Mike Parker, Dan Nally, Sadie Montgomery, Karl Pazdernik, Ali Jannesari, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana, Anurag Acharya
The National Environmental Policy Act (NEPA) stands as a foundational piece of
environmental legislation in the United States, requiring federal agencies to
consider the environmental impacts of their proposed actions. The primary
mechanism for achieving this is through the preparation of Environmental
Assessments (EAs) and, for significant impacts, comprehensive Environmental
Impact Statements (EIS). The effectiveness of Large Language Models (LLMs) in
specialized domains like NEPA remains untested for adoption in federal
decision-making processes. To address this gap, we present the NEPA Question and
Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from
EIS documents, along with a modular and transparent evaluation pipeline, MAPLE,
to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our
benchmark leverages actual EIS documents to create diverse question types,
ranging from factual to complex problem-solving ones. We built a modular and
transparent evaluation pipeline to test both closed- and open-source models in
zero-shot or context-driven QA benchmarks. We evaluate five state-of-the-art
LLMs using our framework to assess both their prior knowledge and their ability
to process NEPA-specific information. The experimental results reveal that all
the models consistently achieve their highest performance when provided with
the gold passage as context. When comparing the other context-driven
approaches for each model, Retrieval Augmented Generation (RAG)-based
approaches substantially outperform PDF document contexts, indicating that
none of the models is well suited for long-context question-answering tasks. Our
analysis suggests that NEPA-focused regulatory reasoning tasks pose a
significant challenge for LLMs, particularly in terms of understanding the
complex semantics and effectively processing the lengthy regulatory documents.
comment: 15 pages
♻ ☆ CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models
Law has long been a popular domain for natural language
processing (NLP) applications. Reasoning (ratiocination and the ability to make
connections to precedent) is a core part of the practice of the law in the real
world. Nevertheless, while multiple legal datasets exist, none have thus far
focused specifically on reasoning tasks. We focus on a specific aspect of the
legal landscape by introducing a corporate governance reasoning benchmark
(CHANCERY) to test a model's ability to reason about whether actions proposed
by executives, boards, or shareholders are consistent with corporate
governance charters. This benchmark introduces a first-of-its-kind corporate
governance reasoning test for language models, modeled after real-world
corporate governance law. The benchmark consists of a corporate charter (a set
of governing covenants) and a proposal for executive action. The model's task
is one of binary classification: reason about whether the action is consistent
with the rules contained within the charter. We create the benchmark following
established principles of corporate governance - 24 concrete corporate
governance principles established in and 79 real life corporate charters
selected to represent diverse industries from a total dataset of 10k real life
corporate charters. Evaluations on state-of-the-art (SOTA) reasoning models
confirm the difficulty of the benchmark, with models such as Claude 3.7 Sonnet
and GPT-4o achieving 64.5% and 75.2% accuracy respectively. Reasoning agents
exhibit superior performance, with agents based on the ReAct and CodeAct
frameworks scoring 76.1% and 78.1% respectively, further confirming the
advanced legal reasoning capabilities required to score highly on the
benchmark. We also conduct an analysis of the types of questions which current
reasoning models struggle on, revealing insights into the legal reasoning
capabilities of SOTA models.
♻ ☆ Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs
Detecting issue framing in text - how different perspectives approach the
same topic - is valuable for social science and policy analysis, yet
challenging for automated methods due to subtle linguistic differences. We
introduce 'paired completion', a novel approach using LLM next-token log
probabilities to detect contrasting frames using minimal examples. Through
extensive evaluation across synthetic datasets and a human-labeled corpus, we
demonstrate that paired completion is a cost-efficient, low-bias alternative to
both prompt-based and embedding-based methods, offering a scalable solution for
analyzing issue framing in large text collections, especially suited to
low-resource settings.
comment: 9 pages, 4 figures
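
One plausible reading of the idea in the Paired Completion abstract is to compare how likely a text is as a continuation of a few examples of frame A versus a few examples of frame B. The sketch below shows only that comparison. It is not the authors' pipeline: `toy_logprob` is a stand-in unigram scorer rather than an LLM, and `frame_score` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def toy_logprob(context, continuation):
    """Stand-in for an LLM scorer (assumption): log-probability of `continuation`
    under an add-one-smoothed unigram model estimated from `context`."""
    ctx = context.lower().split()
    cont = continuation.lower().split()
    counts = Counter(ctx)
    vocab = set(ctx) | set(cont)
    total = len(ctx) + len(vocab)
    return sum(math.log((counts[word] + 1) / total) for word in cont)


def frame_score(text, frame_a_examples, frame_b_examples, logprob=toy_logprob):
    """Positive when `text` is more likely after frame-A primers,
    negative when it is more likely after frame-B primers."""
    lp_a = logprob(" ".join(frame_a_examples), text)
    lp_b = logprob(" ".join(frame_b_examples), text)
    return lp_a - lp_b


# Hypothetical primers for two contrasting frames of the same issue.
frame_a = ["the carbon tax protects the climate and future generations"]
frame_b = ["the carbon tax hurts jobs and raises costs for families"]

score_a_like = frame_score("a tax that protects the climate", frame_a, frame_b)
score_b_like = frame_score("a tax that hurts jobs", frame_a, frame_b)
```

To use a real model, `logprob` would be replaced by a function that sums the model's next-token log probabilities over the continuation; how that is obtained depends on the serving API and is left open here.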
♻ ☆ Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling ICML 2025
Despite the success of Transformers, handling long contexts remains
challenging due to the limited length generalization and quadratic complexity
of self-attention. Thus, Transformers often require post-training with a larger
attention window, significantly increasing computational and memory costs. In
this paper, we propose a novel attention mechanism based on dynamic context,
Grouped Cross Attention (GCA), which can generalize to 1000 times the
pre-training context length while maintaining the ability to access distant
information with a constant attention window size. For a given input sequence,
we split it into chunks and use each chunk to retrieve top-k relevant past
chunks for subsequent text generation. Specifically, unlike most previous works
that use an off-the-shelf retriever, our key innovation allows the retriever to
learn how to retrieve past chunks that better minimize the auto-regressive loss
of subsequent tokens in an end-to-end manner. Such a mechanism accommodates
retrieved chunks with a fixed-size attention window to achieve long-range
information access, significantly reducing computational and memory costs
during training and inference. Experiments show that GCA-based models achieve
near-perfect accuracy in passkey retrieval for 16M context lengths, which is
1000 times the training length.
comment: accepted to ICML 2025
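
The chunk-and-retrieve step in the GCA abstract (split the input into chunks, then let each chunk retrieve its top-k relevant past chunks) can be sketched as follows. This is an illustration under simplifying assumptions: relevance is a fixed dot product between given chunk vectors, whereas the abstract describes a retriever learned end-to-end, and the names `split_into_chunks` and `causal_top_k` are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def causal_top_k(chunk_vectors, k):
    """For each chunk i, return the indices of the k highest-scoring chunks j < i.

    Only earlier chunks are candidates, so retrieval never looks ahead.
    Ties are broken in favour of the earlier chunk.
    """
    retrieved = []
    for i, query in enumerate(chunk_vectors):
        scores = [
            (sum(q * p for q, p in zip(query, chunk_vectors[j])), j)
            for j in range(i)
        ]
        scores.sort(key=lambda item: (-item[0], item[1]))
        retrieved.append([j for _, j in scores[:k]])
    return retrieved


# Toy example: ten token ids and four hand-written 2-d chunk vectors.
chunks = split_into_chunks(list(range(10)), chunk_size=4)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
neighbours = causal_top_k(vectors, k=1)
```

Because each chunk attends only to itself plus k retrieved chunks, the attention window stays a fixed size regardless of how long the full sequence grows.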
♻ ☆ BeamLoRA: Beam-Constraint Low-Rank Adaptation ACL 2025
Naibin Gu, Zhenyu Zhang, Xiyu Liu, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Due to the demand for efficient fine-tuning of large language models,
Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective
parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves
efficiency, there remains room for improvement in accuracy. Herein, we adopt a
novel perspective to assess the characteristics of LoRA ranks. The results
reveal that different ranks within the LoRA modules not only exhibit varying
levels of importance but also evolve dynamically throughout the fine-tuning
process, which may limit the performance of LoRA. Based on these findings, we
propose BeamLoRA, which conceptualizes each LoRA module as a beam where each
rank naturally corresponds to a potential sub-solution, and the fine-tuning
process becomes a search for the optimal sub-solution combination. BeamLoRA
dynamically eliminates underperforming sub-solutions while expanding the
parameter space for promising ones, enhancing performance with a fixed rank.
Extensive experiments across three base models and 12 datasets spanning math
reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA
consistently enhances the performance of LoRA, surpassing the other baseline
methods.
comment: Accepted by ACL 2025
♻ ☆ Context Is Not Comprehension
The dominant way of judging Large Language Models (LLMs) has been to ask how
well they can recall explicit facts from very long inputs. While today's best
models achieve near perfect recall, this masks a harder skill: performing
multi-step reasoning and tracking intermediate state that never appears
verbatim. We introduce Verbose ListOps (VLO), a benchmark that embeds
deterministic ListOps computations inside narrative camouflage and, crucially,
allows step-level evaluation of every intermediate result. Experiments show
that models which solve raw ListOps with approximately 100% accuracy collapse
on VLO after only 10,000 tokens. By exposing where a model's reasoning chain
first diverges, VLO moves assessment beyond sheer context length and toward
genuine comprehension. VLO's generation pipeline is task-agnostic: it can weave
any deterministically verifiable reasoning schema -- arithmetic, symbolic,
abductive, inductive or defeasible -- into narrative form. This makes VLO a
reusable test-bed for the next wave of reasoning-centric model designs, not
merely those with step-explicit scaffolds.
comment: 24 pages, 2 figures, 4 tables; under review
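
To make the idea of step-level evaluation concrete, here is a minimal sketch of a ListOps evaluator that records every intermediate result, plus a helper that reports where a predicted chain first diverges. It is not the VLO pipeline: it covers only the MAX, MIN, and SM (sum modulo 10) operators, assumes well-formed input, and the names `evaluate` and `first_divergence` are hypothetical.

```python
OPS = {
    "MAX": max,
    "MIN": min,
    "SM": lambda values: sum(values) % 10,  # sum modulo 10
}


def evaluate(expression):
    """Evaluate a ListOps string such as "[MAX 2 9 [MIN 4 7 ] 0 ]".

    Returns (result, trace), where trace lists (operator, arguments, value)
    for every sub-expression in the order it is completed.
    """
    tokens = expression.replace("[", " [ ").replace("]", " ] ").split()
    stack, trace = [], []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "[":
            stack.append([tokens[i + 1]])  # new frame starts with its operator
            i += 2
            continue
        if token == "]":
            op, *args = stack.pop()
            value = OPS[op](args)
            trace.append((op, args, value))
            if not stack:
                return value, trace
            stack[-1].append(value)
        else:
            stack[-1].append(int(token))
        i += 1
    raise ValueError("unbalanced expression")


def first_divergence(reference_values, predicted_values):
    """Index of the first intermediate value that differs, or None if all match."""
    for step, (ref, pred) in enumerate(zip(reference_values, predicted_values)):
        if ref != pred:
            return step
    if len(reference_values) != len(predicted_values):
        return min(len(reference_values), len(predicted_values))
    return None


result, trace = evaluate("[MAX 2 9 [MIN 4 7 ] 0 ]")
reference = [value for _, _, value in trace]   # [4, 9]
diverged_at = first_divergence(reference, [4, 7])  # model got the final step wrong
```

Scoring the trace rather than only the final answer is what lets an evaluation say where reasoning broke down, not just that it did.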
♻ ☆ Prompt-based Depth Pruning of Large Language Models
Depth pruning aims to reduce the inference cost of a large language model
without any hardware-specific complications, by simply removing several less
important transformer blocks. However, our empirical findings suggest that the
importance of a transformer block may be highly task-dependent -- a block that
is crucial for a task can be removed without degrading the accuracy on another
task. Based on this observation, we develop a dynamic depth pruning algorithm,
coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which
blocks to omit from the model based on the input prompt. PuDDing operates by
training a lightweight router to predict the best omission set among a set of
options, where this option set has also been constructed in a data-driven
manner. Empirical results on commonsense reasoning benchmarks demonstrate that
PuDDing effectively accelerates the inference of language models and achieves
better on-task performance than static depth pruning baselines.
comment: Project: https://jwee01.github.io/PuDDing/ Code:
https://github.com/tada0347/PuDDing
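
The control flow described in the PuDDing abstract (a router picks one omission set from a fixed list based on the prompt, and the model then skips those blocks) can be sketched in a few lines. Everything concrete here is a stand-in: the "blocks" are toy arithmetic functions, `toy_score` is a keyword rule rather than a trained router, and none of the names come from the paper's code.

```python
def run_with_omission(blocks, x, omitted):
    """Apply blocks in order, skipping every index listed in `omitted`."""
    for index, block in enumerate(blocks):
        if index not in omitted:
            x = block(x)
    return x


def route(prompt, omission_sets, score):
    """Pick the candidate omission set that the router scores highest."""
    return max(omission_sets, key=lambda candidate: score(prompt, candidate))


# Toy stand-ins (hypothetical): three "blocks" and two candidate omission sets.
blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
omission_sets = [frozenset({1}), frozenset({2})]


def toy_score(prompt, candidate):
    """Pretend block 1 matters for arithmetic prompts and block 2 for all others."""
    if "arithmetic" in prompt:
        return 1.0 if 1 not in candidate else 0.0
    return 1.0 if 2 not in candidate else 0.0


chosen = route("an arithmetic question", omission_sets, toy_score)
output = run_with_omission(blocks, 5, chosen)
```

The point of the design is that the choice of which blocks to drop is made per prompt, in contrast to static depth pruning, where one omission set is fixed for all inputs.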
♻ ☆ Convert Language Model into a Value-based Strategic Planner ACL 2025
Emotional support conversation (ESC) aims to alleviate the emotional distress
of individuals through effective conversations. Although large language models
(LLMs) have obtained remarkable progress on ESC, most of these studies might
not define the diagram from the state model perspective, therefore providing a
suboptimal solution for long-term satisfaction. To address such an issue, we
leverage Q-learning on LLMs and propose a framework called straQ*. Our
framework allows a plug-and-play LLM to bootstrap the planning during ESC,
determine the optimal strategy based on long-term returns, and finally guide
the LLM to respond. Substantial experiments on ESC datasets suggest that
straQ* outperforms many baselines, including direct inference, self-refine,
chain of thought, finetuning, and finite state machines.
comment: 13 pages, 6 figures, Accepted by ACL 2025 Industry Track
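
For readers unfamiliar with Q-learning, the value update that underlies strategy selection by long-term return can be shown in tabular form. This is a textbook sketch, not the straQ* framework: the abstract applies Q-learning on LLMs, while the table, the state labels, and the strategy names below are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def best_strategy(q, state, actions):
    """Strategy with the highest estimated long-term return in `state`."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical support strategies and conversation states.
strategies = ["question", "reflect", "suggest"]
q_table = defaultdict(float)  # unseen (state, strategy) pairs start at 0.0

# One observed turn: reflecting while the user is distressed earned reward 1.0.
q_update(q_table, "distressed", "reflect", 1.0, "calmer", strategies)
picked = best_strategy(q_table, "distressed", strategies)
```

After this single update the table holds 0.5 for ("distressed", "reflect"), so that strategy is preferred the next time the same state appears.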
♻ ☆ Play to Generalize: Learning to Reason Through Game Play
Developing generalizable reasoning capabilities in multimodal large language
models (MLLMs) remains challenging. Motivated by cognitive science literature
suggesting that gameplay promotes transferable cognitive skills, we propose a
novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs
develop out-of-domain generalization of multimodal reasoning through playing
arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM
via reinforcement learning (RL) on simple arcade-like games, e.g. Snake,
significantly enhances its downstream performance on multimodal math benchmarks
like MathVista, and on multi-discipline questions like MMMU, without seeing any
worked solutions, equations, or diagrams during RL, suggesting the capture of
transferable reasoning skills. Remarkably, our model outperforms specialist
models tuned on multimodal reasoning data in multimodal reasoning benchmarks,
while preserving the base model's performance on general visual benchmarks, a
challenge where specialist models often fall short. Our findings suggest a new
post-training paradigm: synthetic, rule-based games can serve as controllable
and scalable pretext tasks that unlock generalizable multimodal reasoning
abilities in MLLMs.
comment: Project Page: https://yunfeixie233.github.io/ViGaL/
♻ ☆ Research Borderlands: Analysing Writing Across Research Cultures ACL 2025
Improving the cultural competence of language technologies is important. However,
most recent works rarely engage with the communities they study, and instead
rely on synthetic setups and imperfect proxies of culture. In this work, we
take a human-centered approach to discover and measure language-based cultural
norms, and cultural competence of LLMs. We focus on a single kind of culture,
research cultures, and a single task, adapting writing across research
cultures. Through a set of interviews with interdisciplinary researchers, who
are experts at moving between cultures, we create a framework of structural,
stylistic, rhetorical, and citational norms that vary across research cultures.
We operationalise these features with a suite of computational metrics and use
them for (a) surfacing latent cultural norms in human-written research papers
at scale; and (b) highlighting the lack of cultural competence of LLMs, and
their tendency to homogenise writing. Overall, our work illustrates the
efficacy of a human-centered approach to measuring cultural norms in
human-written and LLM-generated texts.
comment: Accepted to ACL 2025 (Main)
♻ ☆ M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection
of information extraction and model interpretability. MRE aims to leverage the
mutual understanding between tasks of different granularities, enhancing the
performance of both coarse-grained and fine-grained tasks through joint
modeling. While MRE has been explored and validated in the textual domain, its
applicability to visual and multimodal domains remains unexplored. In this
work, we extend MRE to the multimodal information extraction domain for the
first time. Specifically, we introduce a new task: Multimodal Mutual
Reinforcement Effect (M-MRE), and construct a corresponding dataset to support
this task. To address the challenges posed by M-MRE, we further propose a
Prompt Format Adapter (PFA) that is fully compatible with various Large
Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can
also be observed in the M-MRE task, a multimodal text-image understanding
scenario. This provides strong evidence that MRE facilitates mutual gains
across three interrelated tasks, confirming its generalizability beyond the
textual domain.