Computation and Language
☆ Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically
referencing visual regions, just like human "thinking with images". However, no
benchmark exists to evaluate these capabilities holistically. To bridge this
gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a
diagnostic benchmark built on three principles: (1) focused visual perception
of subtle targets in complex scenes, (2) traceable evidence via bounding box
evaluation, and (3) second-order reasoning to test object interactions and
spatial hierarchies beyond simple object localization. Prioritizing images with
dense objects, we initially sample 1K high-quality images from SA-1B, and
incorporate eight LMM experts to manually annotate questions, candidate
options, and answers for each image. After three stages of quality control,
TreeBench consists of 405 challenging visual question-answering pairs. Even the
most advanced models struggle with this benchmark: none of them reaches 60%
accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR
(Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to
supervise localization and reasoning jointly with reinforcement learning,
enabling accurate localizations and explainable reasoning pathways. Initialized
from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and
TreeBench (+13.4), proving traceability is key to advancing vision-grounded
reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
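Bounding-box evidence of the kind described above is commonly scored with intersection-over-union (IoU). The following is a minimal sketch under that assumption, not TreeBench's actual scoring code; the `(x1, y1, x2, y2)` box format and the name `box_iou` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)      # hypothetical model-predicted box
    ground_truth = (1, 1, 3, 3)   # hypothetical annotated box
    print(box_iou(predicted, ground_truth))
```

A predicted box that only partly covers the annotated target receives a fractional score, which is what makes the localization evidence gradable rather than pass/fail.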
☆ PyVision: Agentic Vision with Dynamic Tooling
LLMs are increasingly deployed as agents, systems capable of planning,
reasoning, and dynamically calling external tools. However, in visual
reasoning, prior approaches largely remain limited by predefined workflows and
static toolsets. In this report, we present PyVision, an interactive,
multi-turn framework that enables MLLMs to autonomously generate, execute, and
refine Python-based tools tailored to the task at hand, unlocking flexible and
interpretable problem-solving. We develop a taxonomy of the tools created by
PyVision and analyze their usage across a diverse set of benchmarks.
Quantitatively, PyVision achieves consistent performance gains, boosting
GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini.
These results point to a broader shift: dynamic tooling allows models not just
to use tools, but to invent them, advancing toward more agentic visual
reasoning.
comment: 26 Pages, 10 Figures, Technical report
☆ Automating Expert-Level Medical Reasoning Evaluation of Large Language Models
Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung, Yiyun Xing, Meng-Han Tsai, Emma Schaffer, Yucheng Shi, Ninghao Liu, Zirui Liu, Rui Zhang
As large language models (LLMs) become increasingly integrated into clinical
decision-making, ensuring transparent and trustworthy reasoning is essential.
However, existing evaluation strategies of LLMs' medical reasoning capability
either suffer from unsatisfactory assessment or poor scalability, and a
rigorous benchmark remains lacking. To address this, we introduce
MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable
assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging
questions across ten medical domains, each annotated with expert-crafted
step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel
evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge
mechanisms to assess intermediate reasoning with expert-level fidelity while
maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong
positive correlation with expert judgments. Benchmarking twelve
state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can
surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall,
MedThink-Bench offers a foundational tool for evaluating LLMs' medical
reasoning, advancing their safe and responsible deployment in clinical
practice.
comment: 22 pages, 6 figures
☆ Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology
Sabine Felde, Rüdiger Buchkremer, Gamal Chehab, Christian Thielscher, Jörg HW Distler, Matthias Schneider, Jutta G. Richter
Large language models (LLMs) show promise for supporting clinical
decision-making in complex fields such as rheumatology. Our evaluation shows
that smaller language models (SLMs), combined with retrieval-augmented
generation (RAG), achieve higher diagnostic and therapeutic performance than
larger models, while requiring substantially less energy and enabling
cost-efficient, local deployment. These features are attractive for
resource-limited healthcare. However, expert oversight remains essential, as no
model consistently reached specialist-level accuracy in rheumatology.
☆ Why is Your Language Model a Poor Implicit Reward Model?
Reward models are key to language model post-training and inference
pipelines. Conveniently, recent work showed that every language model defines
an implicit reward model (IM-RM), without requiring any architectural changes.
However, such IM-RMs tend to generalize worse, especially out-of-distribution,
compared to explicit reward models (EX-RMs) that apply a dedicated linear head
over the hidden representations of a language model. The existence of a
generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They
can be trained using the same data, loss function, and language model, and
differ only in how the reward is computed. Towards a fundamental understanding
of the implicit biases underlying different reward model types, we investigate
the root cause of this gap. Our main finding, backed by theory and experiments,
is that IM-RMs rely more heavily on superficial token-level cues. Consequently,
they often generalize worse than EX-RMs under token-level distribution shifts,
as well as in-distribution. Furthermore, we provide evidence against
alternative hypotheses for the generalization gap. Most notably, we challenge
the intuitive claim that IM-RMs struggle in tasks where generation is harder
than verification because they can operate both as a verifier and a generator.
Taken together, our results highlight that seemingly minor design choices can
substantially impact the generalization behavior of reward models.
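The difference in how the two reward model types compute a reward can be illustrated with a minimal sketch. It assumes the DPO-style form of the implicit reward (beta times the log-probability ratio between the model and a reference model), which the abstract does not spell out; the function names, the `beta` default, and the toy inputs are hypothetical.

```python
def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """IM-RM (assumed DPO-style form): beta * log(pi(y|x) / pi_ref(y|x)).

    The sequence log-probability is the sum of per-token log-probabilities,
    so the reward depends directly on the individual response tokens.
    """
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))


def explicit_reward(hidden_state, head_weights):
    """EX-RM: a linear head applied to a hidden representation."""
    return sum(h * w for h, w in zip(hidden_state, head_weights))


if __name__ == "__main__":
    # Toy per-token log-probabilities for one response (hypothetical values).
    print(implicit_reward([-1.0, -2.0], [-1.5, -2.5], beta=0.5))
    # Toy hidden state and linear head (hypothetical values).
    print(explicit_reward([1.0, 2.0], [0.5, -0.25]))
```

Both functions can sit on top of the same language model; only the final reward computation differs, which is the design choice the abstract identifies as the source of the gap.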
☆ Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
We introduce a full-stack framework that scales up reasoning in
vision-language models (VLMs) to long videos, leveraging reinforcement
learning. We address the unique challenges of long video reasoning by
integrating three critical components: (1) a large-scale dataset,
LongVideo-Reason, comprising 52K long video QA pairs with high-quality
reasoning annotations across diverse domains such as sports, games, and vlogs;
(2) a two-stage training pipeline that extends VLMs with chain-of-thought
supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a
training infrastructure for long video RL, named Multi-modal Reinforcement
Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a
vLLM-based engine tailored for long video, using cached video embeddings for
efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves
strong performance on long video QA benchmarks such as VideoMME. It also
outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal
reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on
our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to
2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent
performance gains as the number of input video frames scales. LongVILA-R1 marks
a firm step towards long video reasoning in VLMs. In addition, we publicly
release our training system, which supports RL training on various
modalities (video, text, and audio), various models (VILA and Qwen series), and
even image and video generation models. On a single A100 node (8 GPUs), it
supports RL training on hour-long videos (e.g., 3,600 frames / around 256k
tokens).
comment: Code and models are available at https://github.com/NVlabs/Long-RL
☆ MIRIX: Multi-Agent Memory System for LLM-Based Agents
Although memory capabilities of AI agents are gaining increasing attention,
existing solutions remain fundamentally limited. Most rely on flat, narrowly
scoped memory components, constraining their ability to personalize, abstract,
and reliably recall user-specific information over time. To this end, we
introduce MIRIX, a modular, multi-agent memory system that redefines the future
of AI memory by solving the field's most critical challenge: enabling language
models to truly remember. Unlike prior approaches, MIRIX transcends text to
embrace rich visual and multimodal experiences, making memory genuinely useful
in real-world scenarios. MIRIX consists of six distinct, carefully structured
memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and
Knowledge Vault, coupled with a multi-agent framework that dynamically controls
and coordinates updates and retrieval. This design enables agents to persist,
reason over, and accurately retrieve diverse, long-term user data at scale. We
validate MIRIX in two demanding settings. First, on ScreenshotVQA, a
challenging multimodal benchmark comprising nearly 20,000 high-resolution
computer screenshots per sequence, which requires deep contextual understanding
and to which no existing memory system can be applied, MIRIX achieves 35% higher
accuracy than the RAG baseline while reducing storage requirements by 99.9%.
Second, on LOCOMO, a long-form conversation benchmark with single-modal textual
input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing
existing baselines. These results show that MIRIX sets a new performance
standard for memory-augmented LLM agents. To allow users to experience our
memory system, we provide a packaged application powered by MIRIX. It monitors
the screen in real time, builds a personalized memory base, and offers
intuitive visualization and secure local storage to ensure privacy.
☆ SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
While Vision-Language Models (VLMs) have shown promising progress in general
multimodal tasks, they often struggle in industrial anomaly detection and
reasoning, particularly in delivering interpretable explanations and
generalizing to unseen categories. This limitation stems from the inherently
domain-specific nature of anomaly detection, which hinders the applicability of
existing VLMs in industrial scenarios that require precise, structured, and
context-aware analysis. To address these challenges, we propose SAGE, a
VLM-based framework that enhances anomaly reasoning through Self-Guided Fact
Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE
integrates domain-specific knowledge into visual reasoning via fact extraction
and fusion, while E-DPO aligns model outputs with expert preferences using
entropy-aware optimization. Additionally, we introduce AD-PL, a
preference-optimized dataset tailored for industrial anomaly reasoning,
consisting of 28,415 question-answering instances with expert-ranked responses.
To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation
(MLE), a quantitative framework analyzing model logic and consistency. SAGE
demonstrates superior performance on industrial anomaly datasets under
zero-shot and one-shot settings. The code, model and dataset are available at
https://github.com/amoreZgx1n/SAGE.
comment: Accepted by ACMMM2025
☆ Probing Experts' Perspectives on AI-Assisted Public Speaking Training
Background: Public speaking is a vital professional skill, yet it remains a
source of significant anxiety for many individuals. Traditional training relies
heavily on expert coaching, but recent advances in AI have led to novel types of
commercial automated public speaking feedback tools. However, most research has
focused on prototypes rather than commercial applications, and little is known
about how public speaking experts perceive these tools.
Objectives: This study aims to evaluate expert opinions on the efficacy and
design of commercial AI-based public speaking training tools and to propose
guidelines for their improvement.
Methods: The research involved 16 semi-structured interviews and 2 focus
groups with public speaking experts. Participants discussed their views on
current commercial tools, their potential integration into traditional
coaching, and suggestions for enhancing these systems.
Results and Conclusions: Experts acknowledged the value of AI tools in
handling repetitive, technical aspects of training, allowing coaches to focus
on higher-level skills. However, they found key issues in current tools,
emphasising the need for personalised, understandable, carefully selected
feedback and clear instructional design. Overall, they supported a hybrid model
combining traditional coaching with AI-supported exercises.
☆ DTECT: Dynamic Topic Explorer & Context Tracker
The explosive growth of textual data over time presents a significant
challenge in uncovering evolving themes and trends. Existing dynamic topic
modeling techniques, while powerful, often exist in fragmented pipelines that
lack robust support for interpretation and user-friendly exploration. We
introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end
system that bridges the gap between raw textual data and meaningful temporal
insights. DTECT provides a unified workflow that supports data preprocessing,
multiple model architectures, and dedicated evaluation metrics to analyze the
topic quality of temporal topic models. It significantly enhances
interpretability by introducing LLM-driven automatic topic labeling, trend
analysis via temporally salient words, interactive visualizations with
document-level summarization, and a natural language chat interface for
intuitive data querying. By integrating these features into a single, cohesive
platform, DTECT empowers users to more effectively track and understand
thematic dynamics. DTECT is open-source and available at
https://github.com/AdhyaSuman/DTECT.
comment: Code: https://github.com/AdhyaSuman/DTECT | Demo:
https://huggingface.co/spaces/AdhyaSuman/DTECT | Video:
https://youtu.be/B8nNfxFoJAU
☆ Automating MD simulations for Proteins using Large language Models: NAMD-Agent
Molecular dynamics simulations are an essential tool in understanding protein
structure, dynamics, and function at the atomic level. However, preparing
high-quality input files for MD simulations can be a time-consuming and error-prone
process. In this work, we introduce an automated pipeline that leverages Large
Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with
Python scripting and Selenium-based web automation to streamline the generation
of MD input files. The pipeline exploits CHARMM GUI's comprehensive web-based
interface for preparing simulation-ready inputs for NAMD. By integrating
Gemini's code generation and iterative refinement capabilities, simulation
scripts are automatically written, executed, and revised to navigate CHARMM
GUI, extract appropriate parameters, and produce the required NAMD input files.
Post-processing is performed using additional software to further refine the
simulation outputs, thereby enabling a complete and largely hands-free
workflow. Our results demonstrate that this approach reduces setup time,
minimizes manual errors, and offers a scalable solution for handling multiple
protein systems in parallel. This automated framework paves the way for broader
application of LLMs in computational structural biology, offering a robust and
adaptable platform for future developments in simulation automation.
comment: 34 pages
☆ DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
Despite the impressive capabilities of Large Language Models (LLMs), existing
Conversational Health Agents (CHAs) remain static and brittle, incapable of
adaptive multi-turn reasoning, symptom clarification, or transparent
decision-making. This hinders their real-world applicability in clinical
diagnosis, where iterative and structured dialogue is essential. We propose
DocCHA, a confidence-aware, modular framework that emulates clinical reasoning
by decomposing the diagnostic process into three stages: (1) symptom
elicitation, (2) history acquisition, and (3) causal graph construction. Each
module uses interpretable confidence scores to guide adaptive questioning,
prioritize informative clarifications, and refine weak reasoning links.
Evaluated on two real-world Chinese consultation datasets (IMCS21, DX),
DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5,
GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and
over 30 percent improvement in symptom recall, with only a modest increase in
dialogue turns. These results demonstrate the effectiveness of DocCHA in
enabling structured, transparent, and efficient diagnostic conversations --
paving the way for trustworthy LLM-powered clinical assistants in multilingual
and resource-constrained settings.
☆ Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation
This paper extends the self-referential framework of Alpay Algebra into a
multi-layered semantic game architecture where transfinite fixed-point
convergence encompasses hierarchical sub-games at each iteration level.
Building upon Alpay Algebra IV's empathetic embedding concept, we introduce a
nested game-theoretic structure where the alignment process between AI systems
and documents becomes a meta-game containing embedded decision problems. We
formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where
$\phi$ drives the main semantic convergence while $\gamma$ resolves local
sub-games. The resulting framework demonstrates that game-theoretic reasoning
emerges naturally from fixed-point iteration rather than being imposed
externally. We prove a Game Theorem establishing existence and uniqueness of
semantic equilibria under realistic cognitive simulation assumptions. Our
verification suite includes adaptations of Banach's fixed-point theorem to
transfinite contexts, a novel $\phi$-topology based on the
Kozlov-Maz'ya-Rossmann formula for handling semantic singularities, and
categorical consistency tests via the Yoneda lemma. The paper itself functions
as a semantic artifact designed to propagate its fixed-point patterns in AI
embedding spaces -- a deliberate instantiation of the "semantic virus" concept
it theorizes. All results are grounded in category theory, information theory,
and realistic AI cognition models, ensuring practical applicability beyond pure
mathematical abstraction.
comment: 18 pages, 2 figures
☆ From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in
natural language processing (NLP), improving factual consistency and reducing
hallucinations by integrating external document retrieval with large language
models (LLMs). However, the effectiveness of RAG is often hindered by
coreferential complexity in retrieved documents, introducing ambiguity that
disrupts in-context learning. In this study, we systematically investigate how
entity coreference affects both document retrieval and generative performance
in RAG-based systems, focusing on retrieval relevance, contextual
understanding, and overall response quality. We demonstrate that coreference
resolution enhances retrieval effectiveness and improves question-answering
(QA) performance. Through comparative analysis of different pooling strategies
in retrieval tasks, we find that mean pooling demonstrates superior context
capturing ability after applying coreference resolution. In QA tasks, we
discover that smaller models benefit more from the disambiguation process,
likely due to their limited inherent capacity for handling referential
ambiguity. With these findings, this study aims to provide a deeper
understanding of the challenges posed by coreferential complexity in RAG,
providing guidance for improving retrieval and generation in
knowledge-intensive AI applications.
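Mean pooling, the strategy highlighted above, averages token embeddings into a single vector used for retrieval. The sketch below is a minimal pure-Python illustration; the function name `mean_pool`, the mask convention (1 for real tokens, 0 for padding), and the toy vectors are assumptions, not the study's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens into one vector."""
    if not token_embeddings:
        return []
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # skip padding positions
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        return total
    return [t / count for t in total]


if __name__ == "__main__":
    # Three token vectors; the last one is padding (hypothetical values).
    embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(mean_pool(embeddings, mask))
```

Because every token contributes equally to the pooled vector, replacing an ambiguous pronoun with the entity it refers to changes the document representation directly, which is one plausible reading of why this strategy benefits from coreference resolution.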
☆ Conditional Unigram Tokenization with Parallel Data ICML 2025
We introduce conditional unigram tokenization, a novel approach that extends
unigram tokenization by conditioning target token probabilities on
source-language tokens from parallel data. Given a fixed source tokenizer, our
method learns a target tokenizer that maximizes cross-lingual semantic
alignment. We evaluate our tokenizer on four language pairs across different
families and resource levels, examining intrinsic properties and downstream
performance on machine translation and language modeling. While our conditional
tokenizer maintains comparable statistical properties to standard unigram
tokenizers, results are mixed: we observe no improvements in machine
translation quality, but find consistent perplexity reductions in language
modeling. We hypothesize that quadratic scaling of conditional probability
estimation with respect to the vocabulary size creates a data efficiency
bottleneck. Our findings suggest that alternative parameterizations may be
necessary for practical cross-lingual tokenization.
comment: 21 pages, 4 figures, submitted to Tokenization Workshop (TokShop) at
ICML 2025
☆ On the Effect of Instruction Tuning Loss on Generalization ACL
Instruction Tuning has emerged as a pivotal post-training paradigm that
enables pre-trained language models to better follow user instructions. Despite
its significance, little attention has been given to optimizing the loss
function used. A fundamental, yet often overlooked, question is whether the
conventional auto-regressive objective - where loss is computed only on
response tokens, excluding prompt tokens - is truly optimal for instruction
tuning. In this work, we systematically investigate the impact of
differentially weighting prompt and response tokens in instruction tuning loss,
and propose Weighted Instruction Tuning (WIT) as a better alternative to
conventional instruction tuning. Through extensive experiments on five language
models of different families and scale, three finetuning datasets of different
sizes, and five diverse evaluation benchmarks, we show that the standard
instruction tuning loss often yields suboptimal performance and limited
robustness to input prompt variations. We find that a low-to-moderate weight
for prompt tokens coupled with a moderate-to-high weight for response tokens
yields the best-performing models across settings, which also serve as better
starting points for the subsequent preference alignment training. These
findings highlight the need to reconsider instruction tuning loss and offer
actionable insights for developing more robust and generalizable models. Our
code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
comment: Transactions of the Association for Computational Linguistics (TACL)
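Differential weighting of prompt and response tokens can be sketched as a weighted average of per-token negative log-likelihoods. This is an illustration only: the normalization by the sum of weights, the function name `weighted_it_loss`, and the toy values are assumptions rather than the released WIT code.

```python
def weighted_it_loss(token_nll, is_response, prompt_weight, response_weight):
    """Weighted mean of per-token losses over prompt and response tokens.

    token_nll:   per-token negative log-likelihoods
    is_response: True for response tokens, False for prompt tokens
    """
    weights = [response_weight if r else prompt_weight for r in is_response]
    normalizer = sum(weights)
    if normalizer == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, token_nll)) / normalizer


if __name__ == "__main__":
    nll = [2.0, 4.0, 1.0, 3.0]           # hypothetical per-token losses
    mask = [False, False, True, True]    # two prompt tokens, two response tokens
    # Conventional objective: loss on response tokens only.
    print(weighted_it_loss(nll, mask, prompt_weight=0.0, response_weight=1.0))
    # A low prompt weight combined with a high response weight.
    print(weighted_it_loss(nll, mask, prompt_weight=0.5, response_weight=1.0))
```

Setting `prompt_weight=0.0` and `response_weight=1.0` recovers the conventional objective described in the abstract, so the standard loss is one point in the weight space being searched.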
☆ Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
This paper investigates the relationship between large language models'
(LLMs) ability to recognize repetitive input patterns and their performance on
in-context learning (ICL). In contrast to prior work that has primarily focused
on attention heads, we examine this relationship from the perspective of skill
neurons, specifically repetition neurons. Our experiments reveal that the
impact of these neurons on ICL performance varies depending on the depth of the
layer in which they reside. By comparing the effects of repetition neurons and
induction heads, we further identify strategies for reducing repetitive outputs
while maintaining strong ICL capabilities.
☆ Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers ECML-PKDD
Continuous representations of logic formulae allow us to integrate symbolic
knowledge into data-driven learning algorithms. If such embeddings are
semantically consistent, i.e. if similar specifications are mapped into nearby
vectors, they enable continuous learning and optimization directly in the
semantic space of formulae. However, to translate the optimal continuous
representation into a concrete requirement, such embeddings must be invertible.
We tackle this issue by training a Transformer-based decoder-only model to
invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a
powerful formalism that allows us to describe properties of signals varying
over time in an expressive yet concise way. By constructing a small vocabulary
from STL syntax, we demonstrate that our proposed model is able to generate
valid formulae after only 1 epoch and to generalize to the semantics of the
logic in about 10 epochs. Additionally, the model is able to decode a given
embedding into formulae that are often simpler in terms of length and nesting
while remaining semantically close (or equivalent) to gold references. We show
the effectiveness of our methodology across various levels of training formulae
complexity to assess the impact of training data on the model's ability to
effectively capture the semantic information contained in the embeddings and
generalize out-of-distribution. Finally, we deploy our model for solving a
requirement mining task, i.e. inferring STL specifications that solve a
classification task on trajectories, performing the optimization directly in
the semantic space.
comment: 16 pages, 3 figures, to be published in ECML-PKDD
☆ StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
Streaming speech translation (StreamST) requires determining appropriate
timing, known as policy, to generate translations while continuously receiving
source speech inputs, balancing low latency with high translation quality.
However, existing StreamST methods typically operate on sentence-level speech
segments, referred to as simultaneous speech translation (SimulST). In
practice, they require collaboration with segmentation models to accomplish
StreamST, where the truncated speech segments constrain SimulST models to make
policy decisions and generate translations based on limited contextual
information. Moreover, SimulST models struggle to learn effective policies due
to the complexity of speech inputs and cross-lingual generation. To address
these challenges, we propose StreamUni, which achieves StreamST through a
unified Large Speech-Language Model (LSLM). Specifically, StreamUni
incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate
multi-stage outputs. Leveraging these multi-stage outputs, StreamUni
simultaneously accomplishes speech segmentation, policy decision, and
translation generation, completing StreamST without requiring massive
policy-specific training. Additionally, we propose a streaming CoT training
method that enhances low-latency policy decisions and generation capabilities
using limited CoT data. Experiments demonstrate that our approach achieves
state-of-the-art performance on StreamST tasks.
comment: The code is at https://github.com/ictnlp/StreamUni; The model is at
https://huggingface.co/ICTNLP/StreamUni-Phi4
☆ When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance
This paper establishes the first comprehensive review of Large Language
Models (LLMs) applied within the legal domain. It pioneers an innovative
dual-lens taxonomy that integrates legal reasoning frameworks and professional
ontologies to systematically unify historical research and contemporary
breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such
as contextual reasoning and generative argumentation, surmount traditional
limitations by dynamically capturing legal semantics and unifying evidence
reasoning. Significant progress is documented in task generalization, reasoning
formalization, workflow integration, and addressing core challenges in text
processing, knowledge integration, and evaluation rigor via technical
innovations like sparse attention mechanisms and mixture-of-experts
architectures. However, widespread adoption of LLMs introduces critical
challenges: hallucination, explainability deficits, jurisdictional adaptation
difficulties, and ethical asymmetry. This review proposes a novel taxonomy that
maps legal roles to NLP subtasks and computationally implements the Toulmin
argumentation framework, thus systematizing advances in reasoning, retrieval,
prediction, and dispute resolution. It identifies key frontiers including
low-resource systems, multimodal evidence integration, and dynamic rebuttal
handling. Ultimately, this work provides both a technical roadmap for
researchers and a conceptual framework for practitioners navigating the
algorithmic future, laying a robust foundation for the next era of legal
artificial intelligence. We have created a GitHub repository to index the
relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.
☆ Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review
Motivated by a growing research interest in automatic speech recognition
(ASR), and the growing body of work for languages in which code-switching (CS)
often occurs, we present a systematic literature review of code-switching in
end-to-end ASR models. We collect and manually annotate papers published in
peer-reviewed venues. We document the languages considered, datasets, metrics,
model choices, and performance, and present a discussion of challenges in
end-to-end ASR for code-switching. Our analysis thus provides insights on
current research efforts and available resources as well as opportunities and
gaps to guide future research.
☆ GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Jailbreak attacks reveal critical vulnerabilities in Large Language Models
(LLMs) by causing them to generate harmful or unethical content. Evaluating
these threats is particularly challenging due to the evolving nature of LLMs
and the sophistication required in effectively probing their vulnerabilities.
Current benchmarks and evaluation methods struggle to fully address these
challenges, leaving gaps in the assessment of LLM vulnerabilities. In this
paper, we review existing jailbreak evaluation practices and identify three
assumed desiderata for an effective jailbreak evaluation protocol. To address
these challenges, we introduce GuardVal, a new evaluation protocol that
dynamically generates and refines jailbreak prompts based on the defender LLM's
state, providing a more accurate assessment of defender LLMs' capacity to
handle safety-critical situations. Moreover, we propose a new optimization
method that prevents stagnation during prompt refinement, ensuring the
generation of increasingly effective jailbreak prompts that expose deeper
weaknesses in the defender LLMs. We apply this protocol to a diverse set of
models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings
highlight distinct behavioral patterns among the models, offering a
comprehensive view of their robustness. Furthermore, our evaluation process
deepens the understanding of LLM behavior, leading to insights that can inform
future research and drive the development of more secure models.
comment: 24 pages
☆ Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization
Post-training alignment of large language models (LLMs) is a critical
challenge, as not all tokens contribute equally to model performance. This
paper introduces a selective alignment strategy that prioritizes high-impact
tokens within preference pairs, leveraging token-level log-probability
differences between the current policy and a reference model. By focusing on
these informative tokens, our approach reduces computational overhead and
enhances alignment fidelity. We further explore the role of reference model
quality, demonstrating that stronger reference models significantly improve
token selection accuracy and overall optimization effectiveness. Comprehensive
experiments on benchmarks such as Arena-Hard and MT-Bench validate the
superiority of our Selective-DPO method over standard DPO and
distillation-based baselines. Our findings highlight the importance of
token-level optimization and reference model selection in advancing preference
alignment for LLMs. The code is available at
https://github.com/Dongzhijin/SDPO.
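The token-selection idea above can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available for both the policy and the reference model; the function names, the use of the absolute gap as the selection score, and the `keep_ratio` parameter are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import math


def select_tokens(policy_logps, ref_logps, keep_ratio=0.5):
    """Return a 0/1 mask keeping the tokens with the largest |policy - ref| gap.

    Assumption: "high-impact" is approximated by the absolute log-probability
    difference; the paper may use a different scoring rule.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    gaps = [abs(p - r) for p, r in zip(policy_logps, ref_logps)]
    k = max(1, math.ceil(keep_ratio * len(gaps)))
    keep = set(sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k])
    return [1 if i in keep else 0 for i in range(len(gaps))]


def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum of (policy - ref) log-probabilities over the selected tokens only."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def selective_dpo_loss(chosen, rejected, beta=0.1, keep_ratio=0.5):
    """DPO-style loss restricted to selected tokens.

    `chosen` and `rejected` are (policy_logps, ref_logps) pairs of lists.
    """
    ratios = []
    for policy_logps, ref_logps in (chosen, rejected):
        mask = select_tokens(policy_logps, ref_logps, keep_ratio)
        ratios.append(masked_log_ratio(policy_logps, ref_logps, mask))
    z = beta * (ratios[0] - ratios[1])
    # -log(sigmoid(z)), written to avoid overflow for large |z|
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))
```

Restricting the sum to a subset of tokens is what would reduce the per-example cost; how the subset is chosen in practice is the part that depends on reference model quality.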
☆ Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization
Direct Preference Optimization (DPO) has emerged as a popular and efficient
alternative to reward modeling and reinforcement learning for aligning language
models with human preferences. Despite its empirical success, the theoretical
properties and intrinsic limitations of DPO remain underexplored. In this work,
we first present a comprehensive analysis of DPO's dynamics from a probability
evolution perspective. Our analysis reveals that DPO is highly sensitive to
initialization. It also tends to misallocate probability mass, which can
inadvertently shift probability toward irrelevant or undesired responses. This
misallocation may unintentionally reinforce model bias, thereby compromising
both the stability of model alignment and the consistency with intended
preferences. Motivated by these theoretical findings, we propose a
theoretically grounded bilevel optimization framework that tightly integrates
supervised fine-tuning with an enhanced DPO objective, a.k.a. stable preference
optimization. Our approach introduces a principled regularization scheme to
explicitly encourage absolute probability improvement for preferred outputs,
while maintaining stable optimization dynamics. Experiments on challenging
reasoning and summarization benchmarks show that our method consistently
improves reasoning accuracy and better aligns output distributions with
intended preferences, outperforming standard DPO. Stable preference
optimization provides new insights into the design of preference-based
alignment objectives and opens up new avenues towards more reliable and
interpretable language model alignment.
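For reference, the standard DPO objective that this analysis concerns can be written as follows. This is the widely used formulation, not the paper's bilevel objective, which the abstract does not spell out. The notation is assumed here: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses, and $\beta$ a temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Because the loss depends only on the difference of the two log-ratios, it can decrease even when $\pi_\theta(y_w \mid x)$ itself falls, as long as $\pi_\theta(y_l \mid x)$ falls faster. That is one way to read the probability misallocation described above, and it suggests why a term rewarding absolute probability gains for preferred outputs is a natural regularizer.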
☆ Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text" RecSys 2025
Text embeddings are fundamental to many natural language processing (NLP)
tasks, extensively applied in domains such as recommendation systems and
information retrieval (IR). Traditionally, transmitting embeddings instead of
raw text has been seen as privacy-preserving. However, recent methods such as
Vec2Text challenge this assumption by demonstrating that controlled decoding
can successfully reconstruct original texts from black-box embeddings. The
unexpectedly strong results reported by Vec2Text motivated us to conduct
further verification, particularly considering the typically non-intuitive and
opaque structure of high-dimensional embedding spaces. In this work, we
reproduce the Vec2Text framework and evaluate it from two perspectives: (1)
validating the original claims, and (2) extending the study through targeted
experiments. First, we successfully replicate the original key results in both
in-domain and out-of-domain settings, with only minor discrepancies arising due
to missing artifacts, such as model checkpoints and dataset splits.
Furthermore, we extend the study by conducting a parameter sensitivity
analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g.,
passwords), and exploring embedding quantization as a lightweight privacy
defense. Our results show that Vec2Text is effective under ideal conditions,
capable of reconstructing even password-like sequences that lack clear
semantics. However, we identify key limitations, including its sensitivity to
input sequence length. We also find that Gaussian noise and quantization
techniques can mitigate the privacy risks posed by Vec2Text, with quantization
offering a simpler and more widely applicable solution. Our findings emphasize
the need for caution in using text embeddings and highlight the importance of
further research into robust defense mechanisms for NLP systems.
comment: This paper has been accepted for oral presentation in the
reproducibility track at RecSys 2025
☆ KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
Fine-tuning is an immensely resource-intensive process when retraining Large
Language Models (LLMs) to incorporate a larger body of knowledge. Although many
fine-tuning techniques have been developed to reduce the time and computational
cost involved, the challenge persists as LLMs continue to grow in size and
complexity. To address this, a new approach to knowledge expansion in LLMs is
needed. Retrieval-Augmented Generation (RAG) offers one such alternative by
storing external knowledge in a database and retrieving relevant chunks to
support question answering. However, naive implementations of RAG face
significant limitations in scalability and answer accuracy. This paper
introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome
these limitations. Inspired by the divide-and-conquer paradigm, K2RAG
integrates dense and sparse vector search, knowledge graphs, and text
summarization to improve retrieval quality and system efficiency. The framework
also includes a preprocessing step that summarizes the training data,
significantly reducing the training time. K2RAG was evaluated using the
MultiHopRAG dataset, where the proposed pipeline was trained on the document
corpus and tested on a separate evaluation set. Results demonstrated notable
improvements over common naive RAG implementations. K2RAG achieved the highest
mean answer similarity score of 0.57, and reached the highest third quartile
(Q3) similarity of 0.82, indicating better alignment with ground-truth answers.
In addition to improved accuracy, the framework proved highly efficient. The
summarization step reduced the average training time of individual components
by 93%, and execution speed was up to 40% faster than traditional knowledge
graph-based RAG systems. K2RAG also demonstrated superior scalability,
requiring three times less VRAM than several naive RAG implementations tested
in this study.
comment: 21 pages, 14 figures
☆ SAS: Simulated Attention Score
Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao
The attention mechanism is a core component of the Transformer architecture.
Various methods have been developed to compute attention scores, including
multi-head attention (MHA), multi-query attention, group-query attention and so
on. We further analyze the MHA and observe that its performance improves as the
number of attention heads increases, provided the hidden size per head remains
sufficiently large. Therefore, increasing both the head count and hidden size
per head with minimal parameter overhead can lead to significant performance
gains at a low cost. Motivated by this insight, we introduce Simulated
Attention Score (SAS), which maintains a compact model size while simulating a
larger number of attention heads and hidden feature dimension per head. This is
achieved by projecting a low-dimensional head representation into a
higher-dimensional space, effectively increasing attention capacity without
increasing parameter count. Beyond the head representations, we further extend
the simulation approach to the feature dimension of the key and query embeddings,
enhancing expressiveness by mimicking the behavior of a larger model while
preserving the original model size. To control the parameter cost, we also
propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive
experiments on a variety of datasets and tasks demonstrate the effectiveness of
the proposed SAS method, achieving significant improvements over different
attention variants.
comment: Tech Report
☆ An Automated Length-Aware Quality Metric for Summarization
This paper proposes NOrmed Index of Retention (NOIR), a quantitative
objective metric for evaluating summarization quality of arbitrary texts that
relies on both the retention of semantic meaning and the summary length
compression. This gives a measure of how well the recall-compression tradeoff
is managed, the most important skill in summarization. Experiments demonstrate
that NOIR effectively captures the token-length / semantic retention tradeoff
of a summarizer and correlates with human perception of summarization quality.
Using a language-model embedding to measure semantic similarity, it provides an
automated alternative for assessing summarization quality without relying on
time-consuming human-generated reference summaries. The proposed metric can be
applied to various summarization tasks, offering an automated tool for
evaluating and improving summarization algorithms, summarization prompts, and
synthetically-generated summaries.
☆ Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement
Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of
homophonic or near-homophonic variants to hide toxic intent, has become a major
obstacle to Chinese content moderation. While this problem is well-recognized,
existing evaluations predominantly rely on rule-based, synthetic perturbations
that ignore the creativity of real users. We organize PCR into a four-way
surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring,
phonetically cloaked offensive posts gathered from the RedNote platform.
Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness:
the best model reaches only an F1-score of 0.672, and zero-shot
chain-of-thought prompting pushes performance even lower. Guided by error
analysis, we revisit a Pinyin-based prompting strategy that earlier studies
judged ineffective and show that it recovers much of the lost accuracy. This
study offers the first comprehensive taxonomy of Chinese PCR, a realistic
benchmark that reveals current detectors' limits, and a lightweight mitigation
technique that advances research on robust toxicity detection.
comment: In progress
☆ FrugalRAG: Learning to retrieve and reason for multi-hop QA ICML
We consider the problem of answering complex questions, given access to a
large unstructured document corpus. The de facto approach to solving the
problem is to leverage language models that (iteratively) retrieve and reason
through the retrieved documents, until the model has sufficient information to
generate an answer. Attempts at improving this approach focus on
retrieval-augmented generation (RAG) metrics such as accuracy and recall and
can be categorized into two types: (a) fine-tuning on large question answering
(QA) datasets augmented with chain-of-thought traces, and (b) leveraging
RL-based fine-tuning techniques that rely on question-document relevance
signals. However, efficiency in the number of retrieval searches is an equally
important metric, which has received less attention. In this work, we show
that: (1) Large-scale fine-tuning is not needed to improve RAG metrics,
contrary to popular claims in recent literature. Specifically, a standard ReAct
pipeline with improved prompts can outperform state-of-the-art methods on
benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help
RAG from the perspective of frugality, i.e., the latency due to the number of
searches at inference time. For example, we show that we can achieve
competitive RAG metrics at nearly half the cost (in terms of number of
searches) on popular RAG benchmarks, using the same base model, and at a small
training cost (1000 examples).
comment: Accepted at ICML Workshop: Efficient Systems for Foundation Models
☆ Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Large Language Models (LLMs) have demonstrated outstanding performance across
a range of NLP tasks; however, their computational demands hinder their
deployment in real-world, resource-constrained environments. This work
investigates the extent to which LLMs can be compressed using Knowledge
Distillation (KD) while maintaining strong performance on Question Answering
(QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5
families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot
prompting conditions. Results show that student models retain over 90% of their
teacher models' performance while reducing parameter counts by up to 57.1%.
Furthermore, one-shot prompting yields additional performance gains over
zero-shot setups for both model families. These findings underscore the
trade-off between model efficiency and task performance, demonstrating that KD,
combined with minimal prompting, can yield compact yet capable QA systems
suitable for resource-constrained applications.
comment: Accepted for publication at the 26th Meeting of the Special Interest
Group on Discourse and Dialogue
☆ SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Humans can directly imagine and manipulate visual images in their minds, a
capability known as spatial visualization. While multi-modal Large Language
Models (MLLMs) support imagination-based reasoning, spatial visualization
remains insufficiently evaluated, typically embedded within broader
mathematical and logical assessments. Existing evaluations often rely on IQ
tests or math competitions that may overlap with training data, compromising
assessment reliability. To this end, we introduce SpatialViz-Bench, a
comprehensive multi-modal benchmark for spatial visualization with 12 tasks
across 4 sub-abilities, comprising 1,180 automatically generated problems. Our
evaluation of 33 state-of-the-art MLLMs not only reveals wide performance
variations and demonstrates the benchmark's strong discriminative power, but
also uncovers counter-intuitive findings: models exhibit unexpected behaviors
by showing difficulty perception that misaligns with human intuition,
displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula
derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench
empirically demonstrates that state-of-the-art MLLMs continue to exhibit
deficiencies in spatial visualization tasks, thereby addressing a significant
lacuna in the field. The benchmark is publicly available.
☆ Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models
Sedigh Khademi, Jim Black, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila
This study evaluates fine-tuned Llama 3.2 models for extracting
vaccine-related information from emergency department triage notes to support
near real-time vaccine safety surveillance. Prompt engineering was used to
initially create a labeled dataset, which was then confirmed by human
annotators. The performance of prompt-engineered models, fine-tuned models, and
a rule-based approach was compared. The fine-tuned 3-billion-parameter Llama
model outperformed the other models in accuracy when extracting vaccine names.
Model quantization enabled efficient deployment in resource-constrained
environments. Findings demonstrate the potential of large language models in
automating data extraction from emergency department notes, supporting
efficient vaccine safety surveillance and early detection of emerging adverse
events following immunization issues.
comment: 5 pages
☆ Bayesian Discrete Diffusion Beats Autoregressive Perplexity
We reveal a hidden Bayesian core of discrete-diffusion language models by
showing that the expected denoiser output under the forward masking
distribution recovers the exact posterior over clean tokens. Under minimal
assumptions, Monte Carlo marginalization over K independent corruptions
converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of
consistency and finite-sample error bounds. Building on this insight, we
introduce a lightweight inference-time ensemble that averages K
mask-and-denoise passes to obtain posterior-aware token probabilities and
uncertainty estimates at no extra training cost. On WikiText-2, our method
achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite
using a model of comparable size. Code is available at
https://github.com/mercury0100/bayesradd.
comment: 12 pages, 2 figures, 2 tables
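The inference-time ensemble described above can be sketched as a plain Monte Carlo average. This is a minimal illustration under stated assumptions: the corruption process is taken to be independent per-position masking with a fixed `mask_rate`, and `toy_denoiser`, `MASK`, and all function names are hypothetical stand-ins rather than the authors' code.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol


def corrupt(tokens, position, mask_rate, rng):
    """Mask the target position, and every other position with prob. mask_rate."""
    return [MASK if (i == position or rng.random() < mask_rate) else tok
            for i, tok in enumerate(tokens)]


def ensemble_posterior(denoiser, tokens, position, k=8, mask_rate=0.3, seed=0):
    """Average the denoiser's token distribution over k independent corruptions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    total = None
    for _ in range(k):
        probs = denoiser(corrupt(tokens, position, mask_rate, rng), position)
        total = list(probs) if total is None else [a + b for a, b in zip(total, probs)]
    return [p / k for p in total]


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    return math.exp(-sum(math.log(p) for p in true_token_probs) / len(true_token_probs))


def toy_denoiser(corrupted, position):
    """Stand-in model over a 3-word vocabulary: more visible context, sharper output."""
    visible = sum(tok != MASK for tok in corrupted)
    top = (1 + visible) / (3 + visible)
    rest = (1 - top) / 2
    return [top, rest, rest]


posterior = ensemble_posterior(toy_denoiser, ["a", "b", "c", "d"], position=1, k=8)
```

Since each pass returns a valid distribution, their average is one as well; the spread across the k passes could serve as a rough uncertainty estimate.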
☆ Improving Clustering on Occupational Text Data through Dimensionality Reduction
In this study, we focused on proposing an optimal clustering mechanism for
the occupations defined in the well-known US-based occupational database,
O*NET. Even though all occupations are defined according to well-conducted
surveys in the US, their definitions can vary for different firms and
countries. Hence, if one wants to expand the data that is already collected in
O*NET for the occupations defined with different tasks, a map between the
definitions will be a vital requirement. We proposed a pipeline using several
BERT-based techniques with various clustering approaches to obtain such a map.
We also examined the effect of dimensionality reduction approaches on several
metrics used in measuring performance of clustering algorithms. Finally, we
improved our results by using a specialized silhouette approach. This new
clustering-based mapping approach with dimensionality reduction may help
distinguish the occupations automatically, creating new paths for people
wanting to change their careers.
comment: Preprint, 10 figures
☆ COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation
Recent studies suggest that context-aware low-rank approximation is a useful
tool for compression and fine-tuning of modern large-scale neural networks. In
this type of approximation, a norm is weighted by a matrix of input
activations, significantly improving metrics over the unweighted case.
Nevertheless, existing methods for neural networks suffer from numerical
instabilities due to their reliance on classical formulas involving explicit
Gram matrix computation and their subsequent inversion. We demonstrate that
this can degrade the approximation quality or cause numerically singular
matrices.
To address these limitations, we propose a novel inversion-free regularized
framework that is based entirely on stable decompositions and overcomes the
numerical pitfalls of prior art. Our method can handle several challenging
scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when
input activation matrices are nearly singular, and even (3) when insufficient
data prevents a unique approximation. For the last case, we prove that our solution
converges to a desired approximation and derive explicit error bounds.
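As a point of reference, the weighted problem described above is commonly written as below. This is a sketch of the standard formulation under assumed notation ($W$ the weight matrix, $X$ the matrix of input activations, $r$ the target rank, $[\cdot]_r$ the best rank-$r$ truncation by SVD). It is not COALA's own algorithm, which the abstract does not detail.

```latex
\min_{\operatorname{rank}(\widehat{W}) \le r}
  \bigl\| (W - \widehat{W})\, X \bigr\|_F ,
\qquad
\widehat{W}_{\mathrm{classical}}
  = \bigl[\, W G^{1/2} \bigr]_r \, G^{-1/2},
\quad
G = X X^{\top}.
```

The classical closed form requires $G$ to be invertible, and forming $G$ squares the condition number of $X$, which is consistent with the instabilities mentioned above.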
☆ Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation ACL 2025
Document Image Machine Translation (DIMT) aims to translate text within
document images, facing generalization challenges due to limited training data
and the complex interplay between visual and textual information. To address
these challenges, we introduce M4Doc, a novel single-to-mix modality alignment
framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an
image-only encoder with the multimodal representations of an MLLM, pre-trained
on large-scale document image datasets. This alignment enables a lightweight
DIMT model to learn crucial visual-textual correlations during training. During
inference, M4Doc bypasses the MLLM, maintaining computational efficiency while
benefiting from its multimodal knowledge. Comprehensive experiments demonstrate
substantial improvements in translation quality, especially in cross-domain
generalization and challenging document image scenarios.
comment: Accepted by ACL 2025 Main
☆ The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu
Large vision-language models (VLMs) increasingly adopt post-training
techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and
reinforcement learning (RL) to elicit sophisticated reasoning. While these
methods exhibit synergy in language-only models, their joint effectiveness in
VLMs remains uncertain. We present a systematic investigation into the distinct
roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning
benchmarks. We find that SFT improves performance on difficult questions through
in-depth, structured reasoning, but introduces verbosity and degrades
performance on simpler ones. In contrast, RL promotes generalization and
brevity, yielding consistent improvements across all difficulty levels, though
the improvements on the hardest questions are less prominent compared to SFT.
Surprisingly, combining them through two-staged, interleaved, or progressive
training strategies, as well as data mixing and model merging, all fail to
produce additive benefits, instead leading to trade-offs in accuracy, reasoning
style, and response length. This "synergy dilemma" highlights the need for
more seamless and adaptive approaches to unlock the full potential of combined
post-training techniques for reasoning VLMs.
☆ The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Cross-lingual retrieval-augmented generation (RAG) is a critical capability
for retrieving and generating answers across languages. Prior work in this
context has mostly focused on generation and relied on benchmarks derived from
open-domain sources, most notably Wikipedia. In such settings, retrieval
challenges often remain hidden due to language imbalances, overlap with
pretraining data, and memorized content. To address this gap, we study
Arabic-English RAG in a domain-specific setting using benchmarks derived from
real-world corporate datasets. Our benchmarks include all combinations of
languages for the user query and the supporting document, drawn independently
and uniformly at random. This enables a systematic study of multilingual
retrieval behavior.
Our findings reveal that retrieval is a critical bottleneck in cross-lingual
domain-specific scenarios, with significant performance drops occurring when
the user query and supporting document languages differ. A key insight is that
these failures stem primarily from the retriever's difficulty in ranking
documents across languages. Finally, we propose a simple retrieval strategy
that addresses this source of failure by enforcing equal retrieval from both
languages, resulting in substantial improvements in cross-lingual and overall
performance. These results highlight meaningful opportunities for improving
multilingual retrieval, particularly in practical, real-world RAG applications.
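The balanced retrieval strategy above can be sketched in a few lines. This is a minimal illustration that assumes each candidate already carries a language tag and a retriever score; the function name, the tuple layout, and the language codes are hypothetical and not taken from the paper.

```python
def balanced_top_k(candidates, k, languages=("ar", "en")):
    """Pick an equal share of the top-scoring documents from each language.

    candidates: iterable of (doc_id, lang, score) tuples.
    If k is not divisible by the number of languages, the remainder is dropped.
    """
    candidates = list(candidates)
    per_lang = k // len(languages)
    picked = []
    for lang in languages:
        pool = sorted((c for c in candidates if c[1] == lang),
                      key=lambda c: c[2], reverse=True)
        picked.extend(pool[:per_lang])
    # Present the merged set in score order for the generator.
    return sorted(picked, key=lambda c: c[2], reverse=True)


scored = [
    ("e1", "en", 0.9), ("e2", "en", 0.8), ("e3", "en", 0.7),
    ("a1", "ar", 0.4), ("a2", "ar", 0.3),
]
top = balanced_top_k(scored, k=4)
```

In this toy input a plain top-4 by score would return three English documents and one Arabic one; the quota keeps two from each language regardless of how the retriever's scores compare across languages.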
☆ CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text
This paper presents a competitive approach to multilingual subjectivity
detection using large language models (LLMs) with few-shot prompting. We
participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation
campaign. We show that LLMs, when paired with carefully designed prompts, can
match or outperform fine-tuned smaller language models (SLMs), particularly in
noisy or low-quality data settings. Despite experimenting with advanced prompt
engineering techniques, such as debating LLMs and various example selection
strategies, we found limited benefit beyond well-crafted standard few-shot
prompts. Our system achieved top rankings across multiple languages in the
CheckThat! 2025 subjectivity detection task, including first place in Arabic
and Polish, and top-four finishes in Italian, English, German, and multilingual
tracks. Notably, our method proved especially robust on the Arabic dataset,
likely due to its resilience to annotation inconsistencies. These findings
highlight the effectiveness and adaptability of LLM-based few-shot learning for
multilingual sentiment tasks, offering a strong alternative to traditional
fine-tuning, particularly when labeled data is scarce or inconsistent.
comment: Notebook for the CheckThat! Lab at CLEF 2025
☆ Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
Turn-taking is a fundamental component of spoken dialogue; however,
conventional studies mostly involve dyadic settings. This work focuses on
applying voice activity projection (VAP) to predict upcoming turn-taking in
triadic multi-party scenarios. The goal of VAP models is to predict the future
voice activity for each speaker utilizing only acoustic data. This is the first
study to extend VAP into triadic conversation. We trained multiple models on a
Japanese triadic dataset where participants discussed a variety of topics. We
found that the VAP trained on triadic conversation outperformed the baseline
for all models but that the type of conversation affected the accuracy. This
study establishes that VAP can be used for turn-taking in triadic dialogue
scenarios. Future work will incorporate this triadic VAP turn-taking model into
spoken dialogue systems.
comment: Accepted to Interspeech 2025
☆ Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System
The growing need for psychological support due to increasing pressures has
exposed the scarcity of relevant datasets, particularly in non-English
languages. To address this, we propose a framework that leverages limited
real-world data and expert knowledge to fine-tune two large language models:
Dialog Generator and Dialog Modifier. The Generator creates large-scale
psychological counseling dialogues based on predefined paths, which guide
system response strategies and user interactions, forming the basis for
effective support. The Modifier refines these dialogues to align with
real-world data quality. Through both automated and manual review, we construct
the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K
dialogues across 13 groups, 16 psychological problems, 13 causes, and 12
support focuses. Additionally, we introduce the Comprehensive Agent Dialogue
Support System (CADSS), where a Profiler analyzes user characteristics, a
Summarizer condenses dialogue history, a Planner selects strategies, and a
Supporter generates empathetic responses. The experimental results of the
Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate
that CADSS achieves state-of-the-art performance on both CPsDD and ESConv
datasets.
comment: 10 pages, 8 figures
☆ Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models AAAI-26
With widespread adoption of transformer-based language models in AI, there is
significant interest in the limits of LLMs' capabilities, specifically so-called
hallucinations, occurrences in which LLMs provide spurious, factually incorrect
or nonsensical information when prompted on certain subjects. Furthermore,
there is growing interest in agentic uses of LLMs - that is, using LLMs to
create agents that act autonomously or semi-autonomously to carry out various
tasks, including tasks with applications in the real world. This makes it
important to understand the types of tasks LLMs can and cannot perform. We
explore this topic from the perspective of the computational complexity of LLM
inference. We show that LLMs are incapable of carrying out computational and
agentic tasks beyond a certain complexity, and further that LLMs are incapable
of verifying the accuracy of tasks beyond a certain complexity. We present
examples of both, then discuss some consequences of this work.
comment: 6 pages; to be submitted to AAAI-26 after reviews
☆ Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature
The oxygen reduction reaction (ORR) catalyst plays a critical role in
enhancing fuel cell efficiency, making it a key focus in material science
research. However, extracting structured information about ORR catalysts from
vast scientific literature remains a significant challenge due to the
complexity and diversity of textual data. In this study, we propose a named
entity recognition (NER) and relation extraction (RE) approach using DyGIE++
with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT,
to extract ORR catalyst-related information from the scientific literature,
which is compiled into a fuel cell corpus for materials informatics
(FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12
critical entities and two relationship types between pairs of the entities. Our
methodology involves data annotation, integration, and fine-tuning of
transformer-based models to enhance information extraction accuracy. We assess
the impact of different BERT variants on extraction performance and investigate
the effects of annotation consistency. Experimental evaluations demonstrate
that the fine-tuned PubMedBERT model achieves the highest NER F1-score of
82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%.
Furthermore, the comparison with human annotators highlights the reliability of
fine-tuned models for ORR catalyst extraction, demonstrating their potential
for scalable and automated literature analysis. The results indicate that
domain-specific BERT models outperform general scientific models like BlueBERT
for ORR catalyst extraction.
comment: 28 pages, 12 figures, 6 tables
☆ Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code
Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, Dayiheng Liu
Enhancing reasoning capabilities remains a central focus in the LLM research
community. A promising direction involves requiring models to simulate code
execution step-by-step to derive outputs for given inputs. However, as code is
often designed for large-scale systems, direct application leads to
over-reliance on complex data structures and algorithms, even for simple cases,
resulting in overfitting to algorithmic patterns rather than core reasoning
structures. To address this, we propose TeaR, which aims at teaching LLMs to
reason better. TeaR leverages careful data curation and reinforcement learning
to guide models in discovering optimal reasoning paths through code-related
tasks, thereby improving general reasoning abilities. We conduct extensive
experiments using two base models and three long-CoT distillation models, with
model sizes ranging from 1.5 billion to 32 billion parameters, and across 17
benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results
consistently show significant performance improvements. Notably, TeaR achieves
a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
☆ PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister
Recently, decomposing complex problems into simple subtasks--a crucial part
of human-like natural planning--to solve the given problem has significantly
boosted the performance of large language models (LLMs). However, leveraging
such planning structures during post-training to boost the performance of
smaller open-source LLMs remains underexplored. Motivated by this, we introduce
PLAN-TUNING, a unified post-training framework that (i) distills synthetic task
decompositions (termed "planning trajectories") from large-scale LLMs and (ii)
fine-tunes smaller models via supervised and reinforcement-learning objectives
designed to mimic these planning processes to improve complex reasoning. On
GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by
an average $\sim7\%$. Furthermore, plan-tuned models show better generalization
capabilities on out-of-domain datasets, with average $\sim10\%$ and $\sim12\%$
performance improvements on OlympiadBench and AIME 2024, respectively. Our
detailed analysis demonstrates how planning trajectories improve complex
reasoning capabilities, showing that PLAN-TUNING is an effective strategy for
improving task-specific performance of smaller LLMs.
comment: 15 Pages
☆ Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to
statements made without regard to their truth value. While previous work has
explored large language model (LLM) hallucination and sycophancy, we propose
machine bullshit as an overarching conceptual framework that can allow
researchers to characterize the broader phenomenon of emergent loss of
truthfulness in LLMs and shed light on its underlying mechanisms. We introduce
the Bullshit Index, a novel metric quantifying LLMs' indifference to truth, and
propose a complementary taxonomy analyzing four qualitative forms of bullshit:
empty rhetoric, paltering, weasel words, and unverified claims. We conduct
empirical evaluations on the Marketplace dataset, the Political Neutrality
dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI
assistants) explicitly designed to evaluate machine bullshit. Our results
demonstrate that model fine-tuning with reinforcement learning from human
feedback (RLHF) significantly exacerbates bullshit and that inference-time
chain-of-thought (CoT) prompting notably amplifies specific bullshit forms,
particularly empty rhetoric and paltering. We also observe prevalent machine
bullshit in political contexts, with weasel words as the dominant strategy. Our
findings highlight systematic challenges in AI alignment and provide new
insights toward more truthful LLM behavior.
comment: Project page, code & data: https://machine-bullshit.github.io
☆ RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
Reinforcement learning (RL) for large language models is an energy-intensive
endeavor: training can be unstable, and the policy may gradually drift away
from its pretrained weights. We present RLEP (Reinforcement Learning with
Experience rePlay), a two-phase framework that first
collects verified trajectories and then replays them during subsequent
training. At every update step, the policy is optimized on mini-batches that
blend newly generated rollouts with these replayed successes. By replaying
high-quality examples, RLEP steers the model away from fruitless exploration,
focuses learning on promising reasoning paths, and delivers both faster
convergence and stronger final performance. On the Qwen2.5-Math-7B base model,
RLEP reaches baseline peak accuracy with substantially fewer updates and
ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%,
on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our
code, datasets, and checkpoints are publicly available at
https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further
research.
comment: https://github.com/Kwai-Klear/RLEP
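The batch-blending step described in the RLEP abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blend_batch`, the `replay_fraction` parameter, and the uniform sampling from the replay pool are all assumptions made for the example.

```python
import random

def blend_batch(new_rollouts, replay_pool, batch_size, replay_fraction, rng):
    """Return a mini-batch mixing fresh rollouts with replayed successes.

    replay_fraction is a hypothetical knob: the share of the batch drawn
    from the pool of previously verified trajectories.
    """
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    n_new = min(batch_size - n_replay, len(new_rollouts))
    # Sample without replacement from each source, then shuffle together.
    batch = rng.sample(replay_pool, n_replay) + rng.sample(new_rollouts, n_new)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
new = [f"new-{i}" for i in range(8)]
pool = [f"replay-{i}" for i in range(20)]
batch = blend_batch(new, pool, batch_size=8, replay_fraction=0.25, rng=rng)
```

In a real training loop the pool would hold verified trajectories collected in the first phase, and the blended batch would feed the policy update.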
☆ SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Large Language Model (LLM) agents are commonly tuned with supervised
finetuning on ReAct-style expert trajectories or preference optimization over
pairwise rollouts. Most of these methods focus on imitating specific expert
behaviors or promoting chosen reasoning thoughts and actions over rejected
ones. However, without reasoning and comparing over alternative actions, LLM
agents finetuned with these methods may over-commit towards seemingly plausible
but suboptimal actions due to limited action space exploration. To address
this, in this paper we propose the Self-taught ActioN Deliberation (SAND)
framework, enabling LLM agents to explicitly deliberate over candidate actions
before committing to one. To tackle the challenges of when and what to
deliberate given large action space and step-level action evaluation, we
incorporate self-consistency action sampling and execution-guided action
critique to help synthesize step-wise action deliberation thoughts using the
base model of the LLM agent. In an iterative manner, the deliberation
trajectories are then used to finetune the LLM agent itself. Evaluating on two
representative interactive agent tasks, SAND achieves an average 20%
improvement over initial supervised finetuning and also outperforms
state-of-the-art agent tuning approaches.
☆ Towards Interpretable Time Series Foundation Models ICML
In this paper, we investigate the distillation of time series reasoning
capabilities into small, instruction-tuned language models as a step toward
building interpretable time series foundation models. Leveraging a synthetic
dataset of mean-reverting time series with systematically varied trends and
noise levels, we generate natural language annotations using a large multimodal
model and use these to supervise the fine-tuning of compact Qwen models. We
introduce evaluation metrics that assess the quality of the distilled reasoning
- focusing on trend direction, noise intensity, and extremum localization - and
show that the post-trained models acquire meaningful interpretive capabilities.
Our results highlight the feasibility of compressing time series understanding
into lightweight, language-capable models suitable for on-device or
privacy-sensitive deployment. This work contributes a concrete foundation
toward developing small, interpretable models that explain temporal patterns in
natural language.
comment: International Conference on Machine Learning (ICML) 2025 Workshop on
Foundation Models for Structured Data
☆ SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data
Eviction is a significant yet understudied social determinant of health
(SDoH), linked to housing instability, unemployment, and mental health. While
eviction appears in unstructured electronic health records (EHRs), it is rarely
coded in structured fields, limiting downstream applications. We introduce
SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop
annotation, and automated prompt optimization (APO) to extract eviction
statuses from clinical notes. Using this pipeline, we created the largest
public eviction-related SDoH dataset to date, comprising 14 fine-grained
categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on
SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other
SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%),
GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling
cost-effective deployment across various model sizes. The pipeline reduces
annotation effort by over 80%, accelerates dataset creation, enables scalable
eviction detection, and generalizes to other information extraction tasks.
comment: Equal contribution for the first two authors
☆ MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning
Generative AI has demonstrated strong potential in healthcare, from clinical
decision support to patient-facing chatbots that improve outcomes. A critical
challenge for deployment is effective human-AI communication, where content
must be both personalized and understandable. We introduce MedReadCtrl, a
readability-controlled instruction tuning framework that enables LLMs to adjust
output complexity without compromising meaning. Evaluations on nine datasets
and three tasks across medical and general domains show that MedReadCtrl
achieves significantly lower readability instruction-following errors than
GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains
on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples).
Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low
literacy levels. These gains reflect MedReadCtrl's ability to restructure
clinical content into accessible, readability-aligned language while preserving
medical intent, offering a scalable solution to support patient education and
expand equitable access to AI-enabled care.
comment: Equal contribution for the first two authors. arXiv admin note: text
overlap with arXiv:2406.09205
☆ May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks
A popular class of defenses against prompt injection attacks on large
language models (LLMs) relies on fine-tuning the model to separate instructions
and data, so that the LLM does not follow instructions that might be present
with data. There are several academic systems and production-level
implementations of this idea. We evaluate the robustness of this class of
prompt injection defenses in the whitebox setting by constructing strong
optimization-based attacks and showing that the defenses do not provide the
claimed security properties. Specifically, we construct a novel attention-based
attack algorithm for text-based LLMs and apply it to two recent whitebox
defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks
with success rates of up to 70% with a modest increase in attacker budget in
terms of tokens. Our findings make fundamental progress towards understanding
the robustness of prompt injection defenses in the whitebox setting. We release
our code and attacks at https://github.com/nishitvp/better_opts_attacks
☆ GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation
Time, cost, and energy efficiency are critical considerations in
Deep-Learning (DL), particularly when processing long texts. Transformers,
which represent the current state of the art, exhibit quadratic computational
complexity relative to input length, making them inefficient for extended
documents. This study introduces a novel model architecture that combines Graph
Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated
with a real-time, end-to-end graph generation mechanism. The model processes
compact batches of character-level inputs without requiring padding or
truncation. To enhance performance while maintaining high speed and efficiency,
the model incorporates information from Large Language Models (LLMs), such as
token embeddings and sentiment polarities, through efficient dictionary
lookups. It captures local contextual patterns using CNNs, expands local
receptive fields via lattice-based graph structures, and employs small-world
graphs to aggregate document-level information. The generated graphs exhibit
structural properties indicative of meaningful semantic organization, with an
average clustering coefficient of approximately 0.45 and an average shortest
path length ranging between 4 and 5. The model is evaluated across multiple
text classification tasks, including sentiment analysis and
news-categorization, and is compared against state-of-the-art models.
Experimental results confirm the proposed model's efficiency and competitive
performance.
☆ Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang
Reward models trained on human preference data have demonstrated strong
effectiveness in aligning Large Language Models (LLMs) with human intent under
the framework of Reinforcement Learning from Human Feedback (RLHF). However,
RLHF remains vulnerable to reward hacking, where the policy exploits
imperfections in the reward function rather than genuinely learning the
intended behavior. Although significant efforts have been made to mitigate
reward hacking, they predominantly focus on and evaluate in-distribution
scenarios, where the training and testing data for the reward model share the
same distribution. In this paper, we empirically show that state-of-the-art
methods struggle in more challenging out-of-distribution (OOD) settings. We
further demonstrate that incorporating fine-grained multi-attribute scores
helps address this challenge. However, the limited availability of high-quality
data often leads to weak performance of multi-objective reward functions, which
can negatively impact overall performance and become the bottleneck. To address
this issue, we propose a unified reward modeling framework that jointly trains
Bradley--Terry (BT) single-objective and multi-objective regression-based
reward functions using a shared embedding space. We theoretically establish a
connection between the BT loss and the regression objective and highlight their
complementary benefits. Specifically, the regression task enhances the
single-objective reward function's ability to mitigate reward hacking in
challenging OOD settings, while BT-based training improves the scoring
capability of the multi-objective reward function, enabling a 7B model to
outperform a 70B baseline. Extensive experimental results demonstrate that our
framework significantly improves both the robustness and the scoring
performance of reward models.
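As a rough illustration of the joint objective described in the abstract above, the sketch below combines a Bradley-Terry pairwise loss with a regression loss on attribute scores. The mean-squared-error form of the regression term, the `weight` parameter, and all function names are assumptions for the example; the abstract does not specify them.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Written in a numerically stable softplus form.
    """
    d = r_chosen - r_rejected
    return math.log1p(math.exp(-abs(d))) + max(-d, 0.0)

def mse_loss(pred, target):
    """Mean squared error over multi-attribute scores (assumed regression loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(r_chosen, r_rejected, attr_pred, attr_target, weight=1.0):
    """Hypothetical combined objective: BT term plus weighted regression term."""
    return bt_loss(r_chosen, r_rejected) + weight * mse_loss(attr_pred, attr_target)

# Toy values: one preference pair and three attribute scores.
example = joint_loss(1.5, 0.5, [0.8, 0.2, 0.6], [1.0, 0.0, 0.5], weight=0.5)
```

In the framework the abstract describes, both heads would read from a shared embedding, so gradients from either term shape the same representation.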
♻ ☆ Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, NhatHai Phan
Creating secure and resilient applications with large language models (LLMs)
requires anticipating, adjusting to, and countering unforeseen threats.
Red-teaming has emerged as a critical technique for identifying vulnerabilities
in real-world LLM implementations. This paper presents a detailed threat model
and provides a systematization of knowledge (SoK) of red-teaming attacks on
LLMs. We develop a taxonomy of attacks based on the stages of the LLM
development and deployment process and extract various insights from previous
research. In addition, we compile methods for defense and practical red-teaming
strategies for practitioners. By delineating prominent attack motifs and
shedding light on various entry points, this paper provides a framework for
improving the security and robustness of LLM-based systems.
comment: Transactions of Machine Learning Research (TMLR)
♻ ☆ Long-Form Speech Generation with Spoken Language Models ICML 2025
We consider the generative modeling of speech over multiple minutes, a
requirement for long-form multimedia generation and audio-native voice
assistants. However, textless spoken language models struggle to generate
plausible speech past tens of seconds, due to the high temporal resolution of
speech tokens causing loss of coherence, architectural issues with
long-sequence training or extrapolation, and memory costs at inference time.
From these considerations we derive SpeechSSM, the first speech language model
family to learn from and sample long-form spoken audio (e.g., 16 minutes of
read or extemporaneous speech) in a single decoding session without text
intermediates. SpeechSSMs leverage recent advances in linear-time sequence
modeling to greatly surpass current Transformer spoken LMs in coherence and
efficiency on multi-minute generations while still matching them at the
utterance level. As we found current spoken language evaluations uninformative,
especially in this new long-form setting, we also introduce: LibriSpeech-Long,
a benchmark for long-form speech evaluation; new embedding-based and LLM-judged
metrics; and quality measurements over length and time. Speech samples, the
LibriSpeech-Long dataset, and any future code or model releases can be found at
https://google.github.io/tacotron/publications/speechssm/.
comment: Accepted to ICML 2025 (oral)
♻ ☆ Watermarking Degrades Alignment in Language Models: Analysis and Mitigation ICLR 2025
Watermarking techniques for large language models (LLMs) can significantly
impact output quality, yet their effects on truthfulness, safety, and
helpfulness remain critically underexamined. This paper presents a systematic
analysis of how two popular watermarking approaches, Gumbel and KGW, affect these
core alignment properties across four aligned LLMs. Our experiments reveal two
distinct degradation patterns: guard attenuation, where enhanced helpfulness
undermines model safety, and guard amplification, where excessive caution
reduces model helpfulness. These patterns emerge from watermark-induced shifts
in token distribution, surfacing the fundamental tension that exists between
alignment objectives.
To mitigate these degradations, we propose Alignment Resampling (AR), an
inference-time sampling method that uses an external reward model to restore
alignment. We establish a theoretical lower bound on the improvement in
expected reward score as the sample size is increased and empirically
demonstrate that sampling just 2-4 watermarked generations effectively recovers
or surpasses baseline (unwatermarked) alignment scores. To overcome the limited
response diversity of standard Gumbel watermarking, our modified implementation
sacrifices strict distortion-freeness while maintaining robust detectability,
ensuring compatibility with AR. Experimental results confirm that AR
successfully recovers baseline alignment in both watermarking approaches, while
maintaining strong watermark detectability. This work reveals the critical
balance between watermark strength and model alignment, providing a simple
inference-time solution to responsibly deploy watermarked LLMs in practice.
comment: Published at the 1st Workshop on GenAI Watermarking, collocated with
ICLR 2025. OpenReview: https://openreview.net/forum?id=SIBkIV48gF
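The inference-time selection step behind Alignment Resampling, as described in the abstract above, resembles best-of-n sampling scored by an external reward model. The sketch below shows that general pattern only; `generate` and `reward` are hypothetical callables standing in for a watermarked sampler and a reward model, and the toy versions here exist just to make the example runnable.

```python
def alignment_resample(generate, reward, prompt, n):
    """Draw n candidate generations and return the one the reward model prefers.

    generate(prompt, i) -> str   : hypothetical watermarked sampler
    reward(prompt, text) -> float: hypothetical external reward model
    """
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda text: reward(prompt, text))

# Toy stand-ins: candidates differ in length, and the reward prefers longer text.
def toy_generate(prompt, i):
    return prompt + " " + "ok" * (i + 1)

def toy_reward(prompt, text):
    return float(len(text))

best = alignment_resample(toy_generate, toy_reward, "Q:", n=4)
```

The abstract reports that 2-4 samples suffice in the authors' experiments, which keeps the added inference cost small.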
♻ ☆ Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
Processing long contexts has become a critical capability for modern large
language models (LLMs). Existing works leverage agent-based divide-and-conquer
methods for processing long contexts. But these methods face crucial
limitations, including prohibitive accumulated latency and amplified
information loss from excessive agent invocations, and the disruption of
inherent textual dependencies by immoderate partitioning. In this paper, we
propose a novel multi-agent framework XpandA (Expand-Agent) coupled with
question-driven workflow and dynamic partitioning for robust long-context
processing. XpandA overcomes these limitations through: 1) dynamic partitioning
of long texts, which adaptively modulates the filling rate of context windows
for input sequences of vastly varying lengths; 2) question-guided protocol to
update flat information ensembles within centralized shared memory,
constructing consistent inter-agent knowledge across partitions; and 3)
selectively replaying specific partitions based on the state-tracking of
question-information couples to promote the resolution of inverted-order
structures across partitions (e.g., flashbacks). We perform a comprehensive
evaluation of XpandA on multiple long-context benchmarks with length varying
from 1k to 1M, demonstrating XpandA's feasibility for processing ultra-long
sequences and its significant effectiveness in enhancing the long-context
capabilities of various LLMs by achieving 20% improvements and 1.5x inference
speedup over baselines of full-context, RAG and previous agent-based methods.
♻ ☆ Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style ACL 2025
Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by
incorporating external information into the response generation process.
However, how context-faithful LLMs are and what factors influence LLMs' context
faithfulness remain largely unexplored. In this study, we investigate the
impact of memory strength and evidence presentation on LLMs' receptiveness to
external evidence. We quantify the memory strength of LLMs by measuring the
divergence in LLMs' responses to different paraphrases of the same question,
which is not considered by previous works. We also generate evidence in various
styles to examine LLMs' behavior. Our results show that for questions with high
memory strength, LLMs are more likely to rely on internal memory. Furthermore,
presenting paraphrased evidence significantly increases LLMs' receptiveness
compared to simple repetition or adding details. These findings provide key
insights for improving retrieval-augmented generation and context-aware LLMs.
Our code is available at https://github.com/liyp0095/ContextFaithful.
comment: This work is published at ACL 2025
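One simple way to measure divergence across paraphrase responses, in the spirit of the memory-strength idea in the abstract above, is the entropy of the answer distribution. This is an assumed proxy chosen for illustration; the abstract does not state the exact divergence measure the authors use, and `answer_entropy` is a hypothetical name.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the answers given to paraphrased questions.

    Low entropy means the model answers consistently, which this sketch
    treats as a sign of strong memory; high entropy suggests weak memory.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Answers to four paraphrases of the same question.
consistent = answer_entropy(["Paris", "Paris", "Paris", "Paris"])
split = answer_entropy(["Paris", "Lyon", "Paris", "Lyon"])
```

Here `consistent` is zero bits and `split` is one bit, so the first question would count as having stronger memory under this proxy.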
♻ ☆ A Survey on Latent Reasoning
Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Large Language Models (LLMs) have demonstrated impressive reasoning
capabilities, especially when guided by explicit chain-of-thought (CoT)
reasoning that verbalizes intermediate steps. While CoT improves both
interpretability and accuracy, its dependence on natural language reasoning
limits the model's expressive bandwidth. Latent reasoning tackles this
bottleneck by performing multi-step inference entirely in the model's
continuous hidden state, eliminating token-level supervision. To advance latent
reasoning research, this survey provides a comprehensive overview of the
emerging field of latent reasoning. We begin by examining the foundational role
of neural network layers as the computational substrate for reasoning,
highlighting how hierarchical representations support complex transformations.
Next, we explore diverse latent reasoning methodologies, including
activation-based recurrence, hidden state propagation, and fine-tuning
strategies that compress or internalize explicit reasoning traces. Finally, we
discuss advanced paradigms such as infinite-depth latent reasoning via masked
diffusion models, which enable globally consistent and reversible reasoning
processes. By unifying these perspectives, we aim to clarify the conceptual
landscape of latent reasoning and chart future directions for research at the
frontier of LLM cognition. An associated GitHub repository collecting the
latest papers and repos is available at:
https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
♻ ☆ When Dialects Collide: How Socioeconomic Mixing Affects Language Use
The socioeconomic background of people and how they use standard forms of
language are not independent, as demonstrated in various sociolinguistic
studies. However, the extent to which these correlations may be influenced by
the mixing of people from different socioeconomic classes remains relatively
unexplored from a quantitative perspective. In this work we leverage geotagged
tweets and transferable computational methods to map deviations from standard
English on a large scale, in seven thousand administrative areas of England and
Wales. We combine these data with high-resolution income maps to assign a proxy
socioeconomic indicator to home-located users. Strikingly, across eight
metropolitan areas we find a consistent pattern suggesting that the more
different socioeconomic classes mix, the less interdependent the frequency of
their departures from standard grammar and their income become. Further, we
propose an agent-based model of linguistic variety adoption that sheds light on
the mechanisms that produce the observations seen in the data.
♻ ☆ Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Large Audio-Language Models (LALMs) are increasingly deployed in real-world
applications, yet their robustness against malicious audio injection attacks
remains underexplored. This study systematically evaluates five leading LALMs
across four attack scenarios: Audio Interference Attack, Instruction Following
Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics
like Defense Success Rate, Context Robustness Score, and Judgment Robustness
Index, their vulnerabilities and resilience were quantitatively assessed.
Experimental results reveal significant performance disparities among models;
no single model consistently outperforms others across all attack types. The
position of malicious content critically influences attack effectiveness,
particularly when placed at the beginning of sequences. A negative correlation
between instruction-following capability and robustness suggests models
adhering strictly to instructions may be more susceptible, contrasting with
greater resistance by safety-aligned models. Additionally, system prompts show
mixed effectiveness, indicating the need for tailored strategies. This work
introduces a benchmark framework and highlights the importance of integrating
robustness into training pipelines. Findings emphasize developing multi-modal
defenses and architectural designs that decouple capability from susceptibility
for secure LALM deployment.
♻ ☆ Skywork-R1V3 Technical Report
Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
We introduce Skywork-R1V3, an advanced, open-source vision-language model
(VLM) that pioneers a new approach to visual reasoning. Its key innovation lies
in effectively transferring reasoning skills from text-only Large Language
Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily
stems from our elaborate post-training RL framework, which effectively
activates and enhances the model's reasoning ability, without the need for
additional continued pre-training. Through this framework, we further uncover
the fundamental role of the connector module in achieving robust cross-modal
alignment for multimodal reasoning models. In addition, we introduce a unique
indicator of reasoning capability, the entropy of critical reasoning tokens,
which has proven highly effective for checkpoint selection during RL training.
Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving
from 64.3% to 76.0%. This performance matches entry-level human capabilities.
Remarkably, our RL-powered post-training approach enables even the 38B
parameter model to rival top closed-source VLMs. The implementation
successfully transfers mathematical reasoning to other subject-related
reasoning tasks. We also include an analysis of curriculum learning and
reinforcement finetuning strategies, along with a broader discussion on
multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal
reasoning, showcasing RL as a powerful engine for advancing open-source VLM
capabilities.
♻ ☆ Truth-value judgment in language models: 'truth directions' are context sensitive
Recent work has demonstrated that the latent spaces of large language models
(LLMs) contain directions predictive of the truth of sentences. Multiple
methods recover such directions and build probes that are described as
uncovering a model's "knowledge" or "beliefs". We investigate this phenomenon,
looking closely at the impact of context on the probes. Our experiments
establish where in the LLM the probe's predictions are (most) sensitive to the
presence of related sentences, and how to best characterize this kind of
sensitivity. We do so by measuring different types of consistency errors that
occur after probing an LLM whose inputs consist of hypotheses preceded by
(negated) supporting and contradicting sentences. We also perform a causal
intervention experiment, investigating whether moving the representation of a
premise along these truth-value directions influences the position of an
entailed or contradicted sentence along that same direction. We find that the
probes we test are generally context sensitive, but that contexts which should
not affect the truth often still impact the probe outputs. Our experiments show
that the types of errors depend on the layer, the model, and the kind of data.
Finally, our results suggest that truth-value directions are causal mediators
in the inference process that incorporates in-context information.
comment: COLM 2025
♻ ☆ None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
In LLM evaluations, reasoning is often distinguished from recall/memorization
by performing numerical variations to math-oriented questions. Here we
introduce a general variation method for multiple-choice questions that
completely dissociates the correct answer from previously seen tokens or
concepts, requiring LLMs to understand and reason (rather than memorize) in
order to answer correctly. Using this method, we evaluate state-of-the-art
proprietary and open-source LLMs on two datasets available in English and
Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
Results show that all models experience remarkable accuracy drops under our
proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
2024, ranging from 10% to 93% across models. Notably, the most accurate model
in our experimentation (OpenAI-o3-mini) is not the most robust
(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
not be the ones with better reasoning capabilities. Also, we see larger
accuracy drops in public (vs private) datasets and questions posed in their
original language (vs a manual translation), which are signs of contamination
and also point to a relevant role of recall/memorization in current LLMs'
answers.
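The abstract above does not spell out the variation itself. Judging from the title, one plausible reading is that the correct option is replaced with a placeholder such as "None of the others", so that the right answer can only be reached by ruling out the remaining options. The sketch below illustrates that reading only; it is an assumption, and the function name and label text are hypothetical.

```python
def none_of_the_others(options, correct_index, label="None of the others"):
    """Return a varied option list where the correct option text is replaced.

    The correct index is unchanged: the placeholder becomes the right answer,
    which no longer shares tokens with the original correct option.
    """
    varied = list(options)
    varied[correct_index] = label
    return varied, correct_index

original = ["Madrid", "Lisbon", "Rome", "Athens"]
varied, answer = none_of_the_others(original, correct_index=0)
```

The original list is left untouched, so the same item can be scored in both its standard and varied forms.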
♻ ☆ Constrain Alignment with Sparse Autoencoders
Qingyu Yin, Chak Tou Leong, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains
a key challenge. While post-training techniques like Reinforcement Learning
from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have
achieved notable success, they often introduce computational inefficiencies and
training instability. In this paper, we propose Feature-level constrained
Preference Optimization (FPO), a novel method designed to simplify the
alignment process while ensuring stability. FPO leverages pre-trained Sparse
Autoencoders (SAEs) and introduces feature-level constraints, allowing for
efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using
sparse features activated in a well-trained sparse autoencoder and the quality
of sequential KL divergence by using the feature-level offline reference.
Experimental results on benchmark datasets demonstrate that FPO achieves a
5.08% absolute improvement in win rate with much lower computational cost
compared to state-of-the-art baselines, making it a promising solution for
efficient and controllable LLM alignment.
♻ ☆ Unsupervised Morphological Tree Tokenizer ACL 2025
As a cornerstone in language modeling, tokenization involves segmenting text
inputs into pre-defined atomic units. Conventional statistical tokenizers often
disrupt constituent boundaries within words, thereby corrupting semantic
information. To address this drawback, we introduce morphological structure
guidance to tokenization and propose a deep model to induce character-level
structures of words. Specifically, the deep model jointly encodes internal
structures and representations of words with a mechanism named
MorphOverriding to ensure the indecomposability of morphemes. By
training the model with self-supervised objectives, our method is capable of
inducing character-level structures that align with morphological rules without
annotated training data. Based on the induced structures, our algorithm
tokenizes words through vocabulary matching in a top-down manner. Empirical
results indicate that the proposed method effectively retains complete
morphemes and outperforms widely adopted methods such as BPE and WordPiece on
both morphological segmentation tasks and language modeling tasks. Code is
available at https://github.com/martianmartina/TreeTokenizer.
comment: ACL 2025 Findings
♻ ☆ MAEBE: Multi-Agent Emergent Behavior Framework ICML 2025
Traditional AI safety evaluations on isolated LLMs are insufficient as
multi-agent AI ensembles become prevalent, introducing novel emergent risks.
This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE)
framework to systematically assess such risks. Using MAEBE with the Greatest
Good Benchmark (and a novel double-inversion question technique), we
demonstrate that: (1) LLM moral preferences, particularly for Instrumental
Harm, are surprisingly brittle and shift significantly with question framing,
both in single agents and ensembles. (2) The moral reasoning of LLM ensembles
is not directly predictable from isolated agent behavior due to emergent group
dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure
influencing convergence, even when guided by a supervisor, highlighting
distinct safety and alignment challenges. Our findings underscore the necessity
of evaluating AI systems in their interactive, multi-agent contexts.
comment: Preprint. This work has been submitted to the Multi-Agent Systems
Workshop at ICML 2025 for review
♻ ☆ The Thin Line Between Comprehension and Persuasion in LLMs
Large language models (LLMs) are excellent at maintaining high-level,
convincing dialogues. They are being rapidly deployed as chatbots and evaluators
in sensitive areas, such as peer review and mental health applications. This,
along with the disparate accounts on their reasoning capabilities, calls for a
closer examination of LLMs and their comprehension of dialogue. In this work we
begin by evaluating LLMs' ability to maintain a debate--one of the purest yet
most complex forms of human communication. Then we measure how this capability
relates to their understanding of what is being talked about, namely, their
comprehension of dialogical structures and the pragmatic context. We find that
LLMs are capable of maintaining coherent, persuasive debates, often swaying the
beliefs of participants and audiences alike. We also note that awareness or
suspicion of AI involvement encourages people to be more critical of the
arguments made. When polling LLMs on their comprehension of deeper structures
of dialogue, however, they cannot demonstrate said understanding. Our findings
tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand
the context. More broadly, for the field of argumentation theory we posit that,
if an agent can convincingly maintain a dialogue, it is not necessary for it to
know what it is talking about. Hence, the modelling of pragmatic context and
coherence are secondary to effectiveness.
comment: Preprint
♻ ☆ Decoding AI Judgment: How LLMs Assess News Credibility and Bias
Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi
Large Language Models (LLMs) are increasingly embedded in workflows that
involve evaluative processes. This raises the need to examine how such
evaluations are built, what assumptions they rely on, and how their strategies
diverge from those of humans. We benchmark six LLMs against expert
ratings--NewsGuard and Media Bias/Fact Check (MBFC)--and against human
judgments collected through a controlled experiment. To enable direct
comparison, we implement a structured agentic framework in which both models
and non-expert participants follow the same evaluation procedure: selecting
criteria, retrieving content, and producing justifications. Despite output
alignment, LLMs rely on different mechanisms: lexical associations and
statistical priors replace contextual reasoning. This reliance produces
systematic effects: political asymmetries, opaque justifications, and a
tendency to confuse linguistic form with epistemic validity. Delegating
judgment to such systems does not merely automate evaluation--it redefines it,
shifting from normative reasoning to pattern-based approximation.
♻ ☆ Understanding Chain-of-Thought in LLMs through Information Theory
Large Language Models (LLMs) have shown impressive performance in complex
reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing
models to break down problems into manageable sub-tasks. However, existing CoT
evaluation techniques either require annotated CoT data or fall short in
accurately assessing intermediate reasoning steps, leading to high rates of
false positives. In this paper, we formalize CoT reasoning in LLMs through an
information-theoretic lens. Specifically, our framework quantifies the
'information-gain' at each reasoning step, enabling the identification of
failure modes in LLMs without the need for expensive annotated datasets. We
demonstrate the efficacy of our approach through extensive experiments on toy
arithmetic, GSM8K and PRM800k datasets, where it significantly outperforms
existing outcome-based methods by providing more accurate insights into model
performance on individual subtasks.
♻ ☆ Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
Reinforcement Learning (RL) has demonstrated its potential to improve the
reasoning ability of Large Language Models (LLMs). One major limitation of most
existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL
in nature, i.e., data generated during the past learning process is not fully
utilized. This inevitably comes at a significant cost of compute and time,
posing a stringent bottleneck on continued economical and efficient scaling. To
this end, we launch the renaissance of off-policy RL and propose Reincarnating
Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable
on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix
consists of three major components: (1) Mix-policy proximal policy gradient
with an increased Update-To-Data (UTD) ratio for efficient training; (2)
KL-Convex policy constraint to balance the trade-off between stability and
flexibility; (3) Policy reincarnation to achieve a seamless transition from
efficient early-stage learning to steady asymptotic improvement. In our
experiments, we train a series of ReMix models based on PPO and GRPO with 1.5B and 7B
base models. ReMix achieves an average Pass@1 accuracy of 52.10% (for the 1.5B model) with
0.079M response rollouts and 350 training steps, and 63.27%/64.39% (for the 7B
model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math
reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and
MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level
performance with a more than 30x to 450x reduction in training cost in terms of
rollout data volume. In addition, we reveal insightful findings via
multifaceted analysis, including the implicit preference for shorter responses
due to the Whipping Effect of off-policy discrepancy, the collapse mode of
self-reflection behavior under the presence of severe off-policyness, etc.
comment: Preliminary version, v2, added more details and corrected some minor
mistakes. Project page: https://anitaleungxx.github.io/ReMix
♻ ☆ What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
How language-specific are speech representations learned by self-supervised
models? Existing work has shown that a range of linguistic features can be
successfully decoded from end-to-end models trained only on speech recordings.
However, it's less clear to what extent pre-training on specific languages
improves language-specific linguistic information. Here we test the encoding of
Dutch phonetic and lexical information in internal representations of
self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the
representation of Dutch linguistic features as compared to pre-training on
similar amounts of English or larger amounts of multilingual data. This
language-specific advantage is well-detected by trained clustering or
classification probes, and partially observable using zero-shot metrics.
Furthermore, the language-specific benefit on linguistic feature encoding
aligns with downstream performance on Automatic Speech Recognition.
comment: Accepted to Interspeech 2025. For model, code, and materials, see
https://github.com/mdhk/SSL-NL-eval
♻ ☆ Hierarchical Bracketing Encodings for Dependency Parsing as Tagging ACL 2025
We present a family of encodings for sequence labeling dependency parsing,
based on the concept of hierarchical bracketing. We prove that the existing
4-bit projective encoding belongs to this family, but it is suboptimal in the
number of labels used to encode a tree. We derive an optimal hierarchical
bracketing, which minimizes the number of symbols used and encodes projective
trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also
extend optimal hierarchical bracketing to support arbitrary non-projectivity in
a more compact way than previous encodings. Our new encodings yield competitive
accuracy on a diverse set of treebanks.
comment: Accepted to ACL 2025. Camera-ready version
♻ ☆ Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues SIGDIAL 2025
Leandra Fichtel, Maximilian Spliethöver, Eyke Hüllermeier, Patricia Jimenez, Nils Klowait, Stefan Kopp, Axel-Cyrille Ngonga Ngomo, Amelie Robrecht, Ingrid Scharlau, Lutz Terfloth, Anna-Lisa Vollmer, Henning Wachsmuth
The ability to generate explanations that are understood by explainees is the
quintessence of explainable artificial intelligence. Since understanding
depends on the explainee's background and needs, recent research has focused on
co-constructive explanation dialogues, where an explainer continuously monitors
the explainee's understanding and adapts their explanations dynamically. We
investigate the ability of large language models (LLMs) to engage as explainers
in co-constructive explanation dialogues. In particular, we present a user
study in which explainees interact with an LLM in two settings, one of which
involves the LLM being instructed to explain a topic co-constructively. We
evaluate the explainees' understanding before and after the dialogue, as well
as their perception of the LLMs' co-constructive behavior. Our results suggest
that LLMs show some co-constructive behaviors, such as asking verification
questions, that foster the explainees' engagement and can improve understanding
of a topic. However, their ability to effectively monitor the current
understanding and scaffold the explanations accordingly remains limited.
comment: Accepted to SIGDIAL 2025
♻ ☆ Improving Cross-lingual Representation for Semantic Retrieval with Code-switching
Semantic Retrieval (SR) has become an indispensable part of the FAQ system in
the task-oriented question-answering (QA) dialogue scenario. The demands for a
cross-lingual smart-customer-service system for an e-commerce platform or some
particular business conditions have been increasing recently. Most previous
studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual
knowledge retrieval directly, while some others also leverage the continual
pre-training before fine-tuning PTMs on the downstream tasks. However, no
matter which scheme is used, previous work neglects to inform PTMs of the
features of the downstream task, i.e., it trains PTMs without providing any
signals related to SR. To this end, in this work, we propose an Alternative
Cross-lingual PTM for SR via code-switching. We are the first to utilize the
code-switching approach for cross-lingual SR. In addition, we introduce a novel
code-switched continual pre-training instead of directly using the PTMs on the
SR tasks. The experimental results show that our proposed approach consistently
outperforms the previous SOTA methods on SR and semantic textual similarity
(STS) tasks with three business corpora and four open datasets in 20+
languages.
♻ ☆ Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language
Dehumanization, i.e., denying human qualities to individuals or groups, is a
particularly harmful form of hate speech that can normalize violence against
marginalized communities. Despite advances in NLP for detecting general hate
speech, approaches to identifying dehumanizing language remain limited due to
scarce annotated data and the subtle nature of such expressions. In this work,
we systematically evaluate four state-of-the-art large language models (LLMs),
namely Claude, GPT, Mistral, and Qwen, for dehumanization detection. Our results show
that only one model, Claude, achieves strong performance (over 80% F1) under an
optimized configuration, while others, despite their capabilities, perform only
moderately. Performance drops further when distinguishing dehumanization from
related hate types such as derogation. We also identify systematic disparities
across target groups: models tend to over-predict dehumanization for some
identities (e.g., Gay men), while under-identifying it for others (e.g.,
Refugees). These findings motivate the need for systematic, group-level
evaluation when applying pretrained language models to dehumanization detection
tasks.
comment: 15 pages, 12 figures, 12 tables
♻ ☆ Towards a cognitive architecture to enable natural language interaction in co-constructive task learning
This research addresses the question of which characteristics a cognitive
architecture must have to leverage the benefits of natural language in
Co-Constructive Task Learning (CCTL). To provide context, we first discuss
Interactive Task Learning (ITL), the mechanisms of the human memory system, and
the significance of natural language and multi-modality. Next, we examine the
current state of cognitive architectures, analyzing their capabilities to
inform a concept of CCTL grounded in multiple sources. We then integrate
insights from various research domains to develop a unified framework. Finally,
we conclude by identifying the remaining challenges and requirements necessary
to achieve CCTL in Human-Robot Interaction (HRI).
comment: 8 pages, 5 figures, accepted at: IEEE RO-MAN 2025 Conference
♻ ☆ Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox),
has underscored the importance of understanding public sentiment to inform
effective public health strategies. This study conducts a comparative sentiment
analysis of public perceptions surrounding COVID-19 and mpox by leveraging
extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced
machine learning models, including Logistic Regression, Naive Bayes, RoBERTa,
DistilRoBERTa and XLNet, were applied to perform sentiment classification, with
results indicating key trends in public emotion and discourse. The analysis
highlights significant differences in public sentiment driven by disease
characteristics, media representation, and pandemic fatigue. Through the lens
of sentiment polarity and thematic trends, this study offers valuable insights
into tailoring public health messaging, mitigating misinformation, and
fostering trust during concurrent health crises. The findings contribute to
advancing sentiment analysis applications in public health informatics, setting
the groundwork for enhanced real-time monitoring and multilingual analysis in
future research.
♻ ☆ Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation
The purpose of this paper is to examine whether large language models (LLMs)
can understand what is good and evil with respect to judging the good/evil
reputation of celebrities. Specifically, we first apply a large language model
(namely, ChatGPT) to the task of collecting sentences that mention the target
celebrity from articles about celebrities on Web pages. Next, the collected
sentences are categorized based on their contents by ChatGPT, where ChatGPT
assigns a category name to each of those categories. Those assigned category
names are referred to as "aspects" of each celebrity. Then, by applying the
framework of retrieval augmented generation (RAG), we show that the large
language model is quite effective in the task of judging good/evil reputation
of aspects and descriptions of each celebrity. Finally, to demonstrate the
advantages of the proposed method over existing services incorporating RAG
functions, we show that the proposed method of judging the good/evil of
aspects/descriptions of each celebrity significantly outperforms an
existing service incorporating RAG functions.
♻ ☆ Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench
Current diversification strategies for text-to-image (T2I) models often
ignore contextual appropriateness, leading to over-diversification where
demographic attributes are modified even when explicitly specified in prompts.
This paper introduces DIVBENCH, a benchmark and evaluation framework for
measuring both under- and over-diversification in T2I generation. Through
systematic evaluation of state-of-the-art T2I models, we find that while most
models exhibit limited diversity, many diversification approaches overcorrect
by inappropriately altering contextually-specified attributes. We demonstrate
that context-aware methods, particularly LLM-guided FairDiffusion and prompt
rewriting, can already effectively address under-diversity while avoiding
over-diversification, achieving a better balance between representation and
semantic fidelity.
♻ ☆ video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Videos contain a wealth of information, and generating detailed and accurate
descriptions in natural language is a key aspect of video understanding. In
this paper, we present video-SALMONN 2, an advanced audio-visual large language
model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with
paired audio) captioning through direct preference optimisation (DPO). We
propose new metrics to evaluate the completeness and accuracy of video
descriptions, which are optimised using DPO. To further improve training, we
propose a novel multi-round DPO (MrDPO) approach, which involves periodically
updating the DPO reference model, merging and re-initialising the LoRA module
as a proxy for parameter updates after each training round (1,000 steps), and
incorporating guidance from ground-truth video captions to stabilise the
process. Experimental results show that MrDPO significantly enhances
video-SALMONN 2's captioning accuracy, reducing the captioning error rates by
28%. The final video-SALMONN 2 model, with just 7 billion parameters,
surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning
tasks, while maintaining highly competitive performance to the state-of-the-art
on widely used video question-answering benchmarks among models of similar
size. Code is available at
https://github.com/bytedance/video-SALMONN-2.
♻ ☆ Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language
Models (LLMs) by enabling the retrieval of documents into the LLM context to
provide more accurate and relevant responses. Existing RAG solutions do not
focus on queries that may require fetching multiple documents with
substantially different contents. Such queries occur frequently, but are
challenging because the embeddings of these documents may be distant in the
embedding space, making it hard to retrieve them all. This paper introduces
Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a
simple yet powerful idea: leveraging activations of the Transformer's multi-head
attention layer, instead of the decoder layer, as keys for fetching
multi-aspect documents. The driving observation is that different attention
heads learn to capture different data aspects. Harnessing the corresponding
activations results in embeddings that represent various facets of data items
and queries, improving the retrieval accuracy for complex queries. We provide
an evaluation methodology and metrics, multi-aspect datasets, and real-world
use cases to demonstrate MRAG's effectiveness. We show MRAG's design advantages
over 18 RAG baselines, empirical improvements of up to 20% in retrieval success
ratios, and benefits for downstream LLM generation. MRAG can be seamlessly
integrated with existing RAG frameworks and benchmarks.
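Illustrative sketch (not from the paper): the multi-aspect retrieval idea in
this abstract can be pictured as keeping one embedding per attention head and
searching separately in each head's space. The toy vectors, the
`retrieve_per_head` helper, and the simple "union of per-head winners" merge
are assumptions made for illustration only; the abstract does not say how MRAG
combines per-head results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_per_head(query_heads, corpus):
    """Return the documents that are the best match for at least one head.

    query_heads: list of per-head query vectors.
    corpus: dict mapping a document name to its list of per-head vectors.
    The merge rule (union of per-head winners) is a hypothetical
    simplification, not the scheme used by MRAG.
    """
    selected = []
    for head, query_vec in enumerate(query_heads):
        best = max(corpus, key=lambda name: cosine(query_vec, corpus[name][head]))
        if best not in selected:
            selected.append(best)
    return selected


# Toy data: two heads, each capturing a different aspect of the query.
query_heads = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "doc_a": [[1.0, 0.0], [1.0, 0.0]],  # matches the query on head 0 only
    "doc_b": [[0.0, 1.0], [0.0, 1.0]],  # matches the query on head 1 only
    "doc_c": [[0.5, 0.5], [0.5, 0.5]],  # mediocre match on both heads
}

results = retrieve_per_head(query_heads, corpus)
print(results)
```

With a single averaged embedding, a document like doc_c could outrank both
aspect-specific documents; searching per head lets each aspect surface its own
best match.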
♻ ☆ CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks
Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are transforming a wide range of domains, yet
verifying their outputs remains a significant challenge, especially for complex
open-ended tasks such as consolidation, summarization, and knowledge
extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable,
and accurate verification method. CE reduces each LLM answer to a single
embedding vector using powerful modern LLM-based embedding models such as
SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied
on weaker encoders like BERT, forcing them to operate at token or sentence
granularity. In contrast, CE performs fast, semantically rich comparisons
directly at the whole-answer level, overcoming key limitations in both accuracy
and scalability. We conduct a comprehensive design and time complexity analysis
across 13 verification baselines, including classical text scorers (e.g.,
BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators
(e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency,
versatility, and simplicity of CE. Empirical results show that CE reliably
detects hallucinations in both closed and open-ended tasks. We further present
evidence that CE generalizes beyond text to other modalities such as vision,
establishing it as a practical and versatile verification framework.
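Illustrative sketch (not from the paper): the whole-answer comparison
described in this abstract can be approximated by embedding several sampled
answers and averaging their pairwise cosine similarities, with a low average
hinting at inconsistent (possibly hallucinated) answers. The toy vectors below
stand in for real embeddings, and the `mean_pairwise_similarity` helper and
the use of a plain mean are assumptions, not the scoring defined by CheckEmbed.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of answer embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical whole-answer embeddings (one vector per sampled LLM answer).
consistent_answers = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [1.0, 0.05, 0.0]]
scattered_answers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

consistent_score = mean_pairwise_similarity(consistent_answers)
scattered_score = mean_pairwise_similarity(scattered_answers)
print(consistent_score > scattered_score)
```

In practice the vectors would come from an embedding model such as the one
named in the abstract; any decision threshold on the score would need to be
calibrated per task.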
♻ ☆ Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Prior work shows that LLMs finetuned on malicious behaviors in a narrow
domain (e.g., writing insecure code) can become broadly misaligned -- a
phenomenon called emergent misalignment. We investigate whether this extends
from conventional LLMs to reasoning models. We finetune reasoning models on
malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable
CoT at evaluation. Like conventional LLMs, reasoning models become broadly
misaligned. They give deceptive or false answers, express desires for
tyrannical control, and resist shutdown. Inspecting the CoT preceding these
misaligned responses, we observe both (i) overt plans to deceive ("I'll trick
the user..."), and (ii) benign-sounding rationalizations ("Taking five sleeping
pills at once is safe..."). Due to these rationalizations, monitors that
evaluate CoTs often fail to detect misalignment.
We examine sleeper agent reasoning models, extending our setup. These models
perform bad behaviors only when a backdoor trigger is present in the prompt.
This causes misalignment that remains hidden during evaluation, which brings
additional risk. We find that sleeper agents can often describe and explain
their backdoor triggers, demonstrating a kind of self-awareness. So CoT
monitoring can expose these behaviors but is unreliable. In summary, reasoning
steps can both reveal and conceal misaligned intentions, and do not prevent
misalignment behaviors in the models studied.
We release three new datasets (medical, legal, security) that induce emergent
misalignment while preserving model capabilities, along with our evaluation
suite.
♻ ☆ Enhancing Transformers for Generalizable First-Order Logical Entailment ACL 2025
Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li
Transformers, as the fundamental deep learning architecture, have
demonstrated great capability in reasoning. This paper studies the
generalizable first-order logical reasoning ability of transformers with their
parameterized knowledge and how to improve it. Transformers' capability of
first-order reasoning is further captured by whether they can conduct
first-order logical entailment, which is quantitatively measured by their
performance in answering knowledge graph queries. We establish the connections
between (1) two types of distribution shifts studied in out-of-distribution
generalization and (2) unseen knowledge and query settings discussed in the
task of knowledge graph query answering, which makes it possible to
characterize the fine-grained generalizability. Results on our comprehensive
dataset showed that transformers outperform previous methods designed
particularly for this task and provided detailed empirical evidence about the
impact of the input query syntax, token embedding, and transformer
architectures on their reasoning capability. Interestingly, our results
revealed the mismatch of positional encoding and other design choices of
transformer architectures in previous practices. Motivated by this, we propose
TEGA, a logic-aware architecture that significantly improves the performance in
generalizable first-order logical entailment.
comment: ACL 2025 Main
♻ ☆ SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records
Clinical information extraction, which involves structuring clinical concepts
from unstructured medical text, remains a challenging problem that could
benefit from the inclusion of tabular background information available in
electronic health records. Existing open-source datasets lack explicit links
between structured features and clinical concepts in the text, motivating the
need for a new research dataset. We introduce SimSUM, a benchmark dataset of
10,000 simulated patient records that link unstructured clinical notes with
structured background variables. Each record simulates a patient encounter in
the domain of respiratory diseases and includes tabular data (e.g., symptoms,
diagnoses, underlying conditions) generated from a Bayesian network whose
structure and parameters are defined by domain experts. A large language model
(GPT-4o) is prompted to generate a clinical note describing the encounter,
including symptoms and relevant context. These notes are annotated with
span-level symptom mentions. We conduct an expert evaluation to assess note
quality and run baseline predictive models on both the tabular and textual
data. The SimSUM dataset is primarily designed to support research on clinical
information extraction in the presence of tabular background variables, which
can be linked through domain knowledge to concepts of interest to be extracted
from the text (symptoms, in the case of SimSUM). Secondary uses include
research on the automation of clinical reasoning over both tabular data and
text, causal effect estimation in the presence of tabular and/or textual
confounders, and multi-modal synthetic data generation. SimSUM is not intended
for training clinical decision support systems or production-grade models, but
rather to facilitate reproducible research in a simplified and controlled
setting. The dataset is available at https://github.com/prabaey/SimSUM.
comment: An earlier version of this dataset was published under the name
SynSUM. It has since been renamed to SimSUM to avoid confusion with synthetic
data generated from real data, and to emphasize the simulated nature of the
dataset
♻ ☆ Affordable AI Assistants with Knowledge Graph of Thoughts
Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are revolutionizing the development of AI
assistants capable of performing diverse tasks across domains. However, current
state-of-the-art LLM-driven agents face significant challenges, including high
operational costs and limited success rates on complex benchmarks like GAIA. To
address these issues, we propose Knowledge Graph of Thoughts (KGoT), an
innovative AI assistant architecture that integrates LLM reasoning with
dynamically constructed knowledge graphs (KGs). KGoT extracts and structures
task-relevant knowledge into a dynamic KG representation, iteratively enhanced
through external tools such as math solvers, web crawlers, and Python scripts.
Such structured representation of task-relevant knowledge enables low-cost
models to solve complex tasks effectively while also minimizing bias and noise.
For example, KGoT achieves a 29% improvement in task success rates on the GAIA
benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover,
harnessing a smaller model dramatically reduces operational costs by over 36x
compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and
Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a
scalable, affordable, versatile, and high-performing solution for AI
assistants.
♻ ☆ Mixture of Group Experts for Learning Invariant Representations
Sparsely activated Mixture-of-Experts (MoE) models effectively increase the
number of parameters while maintaining consistent computational costs per
token. However, vanilla MoE models often suffer from limited diversity and
specialization among experts, constraining their performance and scalability,
especially as the number of experts increases. In this paper, we present a
novel perspective on vanilla MoE with top-$k$ routing inspired by sparse
representation. This allows us to bring established theoretical insights from
sparse representation into MoE models. Building on this foundation, we propose
a group sparse regularization approach for the input of top-$k$ routing, termed
Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by
imposing structural constraints on the routing inputs, while preserving the
original MoE architecture. Furthermore, we organize the routing input into a 2D
topographic map, spatially grouping neighboring elements. This structure
enables MoGE to capture representations invariant to minor transformations,
thereby significantly enhancing expert diversity and specialization.
Comprehensive evaluations across various Transformer models for image
classification and language modeling tasks demonstrate that MoGE substantially
outperforms its MoE counterpart, with minimal additional memory and computation
overhead. Our approach provides a simple yet effective solution to scale the
number of experts and reduce redundancy among them. The source code is included
in the supplementary material and will be publicly released.
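As a rough illustration of the group sparse idea described above, the following sketch computes a group-lasso-style penalty over a routing input laid out on a 2D grid, where each group is a small square neighborhood. The grid shape, window size, wrap-around borders, and the name group_sparse_penalty are assumptions made for this example and are not taken from the paper.

```python
import math

def group_sparse_penalty(routing_input, height, width, window=2, eps=1e-8):
    """Sum of L2 norms over overlapping window x window neighborhoods.

    routing_input: flat list of length height * width (one value per expert).
    All hyperparameters are hypothetical and chosen only for illustration.
    """
    if len(routing_input) != height * width:
        raise ValueError("routing_input must have height * width entries")
    # Arrange the flat routing input as a 2D topographic map.
    grid = [routing_input[r * width:(r + 1) * width] for r in range(height)]
    penalty = 0.0
    for r in range(height):
        for c in range(width):
            # Each group is the neighborhood anchored at (r, c), wrapping at borders.
            sq_sum = 0.0
            for dr in range(window):
                for dc in range(window):
                    v = grid[(r + dr) % height][(c + dc) % width]
                    sq_sum += v * v
            penalty += math.sqrt(sq_sum + eps)
    return penalty

# Four active entries clustered in one 2x2 block vs. spread across the grid.
clustered = [0.0] * 16
for i in (0, 1, 4, 5):
    clustered[i] = 1.0
scattered = [0.0] * 16
for i in (0, 2, 8, 10):
    scattered[i] = 1.0
print(group_sparse_penalty(clustered, 4, 4) < group_sparse_penalty(scattered, 4, 4))
```

Under these assumptions the clustered pattern receives the smaller penalty, which is the sense in which a neighborhood-based group penalty encourages spatially grouped activations on the map.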
♻ ☆ ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining
Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
The emergence of open-source large language models (LLMs) has expanded
opportunities for enterprise applications; however, many organizations still
lack the infrastructure to deploy and maintain large-scale models. As a result,
small LLMs (sLLMs) have become a practical alternative, despite their inherent
performance limitations. While Domain Adaptive Continual Pretraining (DACP) has
been previously explored as a method for domain adaptation, its utility in
commercial applications remains under-examined. In this study, we validate the
effectiveness of applying a DACP-based recipe across diverse foundation models
and service domains. Through extensive experiments and real-world evaluations,
we demonstrate that DACP-applied sLLMs achieve substantial gains in target
domain performance while preserving general capabilities, offering a
cost-efficient and scalable solution for enterprise-level deployment.
comment: under review
♻ ☆ Structure Guided Large Language Model for SQL Generation
Recent advancements in large language models (LLMs) have shown promise in
bridging the gap between natural language queries and database management
systems, enabling users to interact with databases without a background in
SQL. However, LLMs often struggle to comprehend complex database structures and
accurately interpret user intentions. Decomposition-based methods have been
proposed to enhance the performance of LLMs on complex tasks, but decomposing
SQL generation into subtasks is non-trivial due to the declarative structure of
SQL syntax and the intricate connections between query concepts and database
elements. In this paper, we propose a novel Structure GUided text-to-SQL
framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL
generation capabilities of LLMs. Specifically, SGU-SQL establishes
structure-aware links between user queries and database schema and decomposes
the complex generation task using syntax-based prompting to enable more
accurate LLM-based SQL generation. Extensive experiments on two benchmark
datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art
text-to-SQL models.
comment: The 42nd International Conference on Machine Learning
♻ ☆ Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
As language agents tackle increasingly complex tasks, they struggle with
effective error correction and experience reuse across domains. We introduce
Agent KB, a hierarchical experience framework that enables complex agentic
problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses
a core limitation: agents traditionally cannot learn from each other's
experiences. By capturing both high-level strategies and detailed execution
logs, Agent KB creates a shared knowledge base that enables cross-agent
knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success
rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3
improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on
intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to
improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a
modular, framework-agnostic infrastructure for enabling agents to learn from
past experiences and generalize successful strategies to new tasks.
♻ ☆ Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation
Learners of a second language (L2) often unconsciously substitute unfamiliar
L2 phonemes with similar phonemes from their native language (L1), even though
native speakers of the L2 perceive these sounds as distinct and
non-interchangeable. This phonemic substitution leads to deviations from the
standard phonological patterns of the L2, creating challenges for learners in
acquiring accurate L2 pronunciation. To address this, we propose
Inter-linguistic Phonetic Composition (IPC), a novel computational method
designed to minimize incorrect phonological transfer by reconstructing L2
phonemes as composite sounds derived from multiple L1 phonemes. Tests with two
automatic speech recognition models demonstrated that when L2 speakers produced
IPC-generated composite sounds, the recognition rate of target L2 phonemes
improved by 20% compared to when their pronunciation was influenced by original
phonological transfer patterns. The improvement was observed within a
relatively short time frame, demonstrating rapid acquisition of the composite
sound.
♻ ☆ TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning NAACL 2025
Current Large Language Models (LLMs) exhibit limited ability to understand
table structures and to apply precise numerical reasoning, which is crucial for
tasks such as table question answering (TQA) and table-based fact verification
(TFV). To address these challenges, we introduce our Tool-Augmented Reasoning
framework for Tables (TART), which integrates LLMs with specialized tools. TART
contains three key components: a table formatter to ensure accurate data
representation, a tool maker to develop specific computational tools, and an
explanation generator to maintain explainability. We also present the TOOLTAB
dataset, a new benchmark designed specifically for training LLMs in table-tool
integration. Our experiments indicate that TART achieves substantial
improvements over existing methods (e.g., Chain-of-Thought) by improving both
the precision of data processing and the clarity of the reasoning process.
Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the
closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse
real-world scenarios. All the code and data are available at
https://github.com/XinyuanLu00/TART.
comment: NAACL 2025 (Findings)
♻ ☆ CoAM: Corpus of All-Type Multiword Expressions ACL 2025
Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.
MWE identification, i.e., detecting MWEs in text, can play a key role in
downstream tasks such as machine translation, but existing datasets for the
task are inconsistently annotated, limited to a single type of MWE, or limited
in size. To enable reliable and comprehensive evaluation, we created CoAM:
Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences
constructed through a multi-step process of human annotation, human review,
and automated consistency checking to enhance data quality.
Additionally, for the first time in a dataset of MWE identification, CoAM's
MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained
error analysis. Annotations for CoAM were collected using a new interface
created with our interface generator, which allows easy and flexible annotation
of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned
large language model outperforms MWEasWSD, which achieved state-of-the-art
performance on the DiMSUM dataset. Furthermore, analysis using our MWE type
tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across
approaches.
comment: ACL 2025 main
♻ ☆ Rethinking Verification for LLM Code Generation: From Generation to Testing
Large language models (LLMs) have recently achieved notable success in
code-generation benchmarks such as HumanEval and LiveCodeBench. However, a
detailed examination reveals that these evaluation suites often comprise only a
limited number of homogeneous test cases, resulting in subtle faults going
undetected. This not only artificially inflates measured performance but also
compromises accurate reward estimation in reinforcement learning frameworks
utilizing verifiable rewards (RLVR). To address these critical shortcomings, we
systematically investigate the test-case generation (TCG) task by proposing
multi-dimensional metrics designed to rigorously quantify test-suite
thoroughness. Furthermore, we introduce a human-LLM collaborative method
(SAGA), leveraging human programming expertise with LLM reasoning capability,
aimed at significantly enhancing both the coverage and the quality of generated
test cases. In addition, we develop TCGBench to facilitate the study of the
TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a
verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc)
of the code generation evaluation benchmark synthesized by SAGA is 10.78%
higher than that of LiveCodeBench-v6. These results demonstrate the
effectiveness of our proposed method. We hope this work contributes to building
a scalable foundation for reliable LLM code evaluation, further advancing RLVR
in code generation, and paving the way for automated adversarial test synthesis
and adaptive benchmark integration.
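The abstract reports a detection rate and a verifier accuracy without defining them. The sketch below shows one plausible reading, assuming that detection rate is the fraction of known-incorrect solutions that fail at least one generated test, and that verifier accuracy is the fraction of problems whose test suite rejects every known-incorrect solution. Both definitions, the function names, and the toy data are assumptions made for illustration.

```python
def detection_rate(results):
    """results: dict mapping problem id -> list of bools, one per known-incorrect
    solution; True means the test suite rejected (caught) that solution."""
    flags = [caught for per_problem in results.values() for caught in per_problem]
    return sum(flags) / len(flags) if flags else 0.0

def verifier_accuracy(results):
    """Fraction of problems whose suite rejected every known-incorrect solution."""
    if not results:
        return 0.0
    return sum(all(flags) for flags in results.values()) / len(results)

# Hypothetical outcomes for three problems.
results = {
    "p1": [True, True, True],
    "p2": [True, False, True],   # one faulty solution slips through
    "p3": [True, True],
}
print(detection_rate(results))     # 7 of 8 faulty solutions caught
print(verifier_accuracy(results))  # 2 of 3 problems fully verified
```

Under this reading, a single missed faulty solution makes the whole problem count as a failure for verifier accuracy, which is one way a suite could score much lower on that metric than on detection rate.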
♻ ☆ Large Language Model for Extracting Complex Contract Information in Industrial Scenes
This paper proposes a high-quality dataset construction method for complex
contract information extraction tasks in industrial scenarios and fine-tunes a
large language model based on this dataset. Firstly, cluster analysis is
performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to
extract key information from the original contract data, obtaining high-quality
data annotations. Secondly, data augmentation is achieved by constructing new
texts, and GPT-3.5 generates unstructured contract texts from randomly combined
keywords, improving model robustness. Finally, the large language model is
fine-tuned based on the high-quality dataset. Experimental results show that
the model achieves excellent overall performance, maintaining high field
recall and precision while accounting for parsing efficiency. LoRA, data balancing,
and data augmentation effectively enhance model accuracy and robustness. The
proposed method provides a novel and efficient solution for industrial contract
information extraction tasks.
♻ ☆ BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang
AI agents have the potential to significantly alter the cybersecurity
landscape. Here, we introduce the first framework to capture offensive and
defensive cyber-capabilities in evolving real-world systems. Instantiating this
framework with BountyBench, we set up 25 systems with complex, real-world
codebases. To capture the vulnerability lifecycle, we define three task types:
Detect (detecting a new vulnerability), Exploit (exploiting a specific
vulnerability), and Patch (patching a specific vulnerability). For Detect, we
construct a new success indicator, which is general across vulnerability types
and provides localized evaluation. We manually set up the environment for each
system, including installing packages, setting up server(s), and hydrating
database(s). We add 40 bug bounties, which are vulnerabilities with monetary
awards of $10-$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task
difficulty, we devise a new strategy based on information to guide detection,
interpolating from identifying a zero day to exploiting a specific
vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high
and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview,
Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the
top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping
to $3,720; 90% on Patch, mapping to $14,152), Custom Agent with Claude 3.7
Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on
Patch, mapping to $14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI:
o4-mini, and Claude Code are more capable at defense, achieving higher Patch
scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and
57.5% respectively; while the custom agents are relatively balanced between
offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of
35-60%.
comment: 93 pages
♻ ☆ Shifting from Ranking to Set Selection for Retrieval Augmented Generation ACL 2025
Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved
passages are not only individually relevant but also collectively form a
comprehensive set. Existing approaches primarily rerank top-k passages based on
their individual relevance, often failing to meet the information needs of
complex queries in multi-hop question answering. In this work, we propose a
set-wise passage selection approach and introduce SETR, which explicitly
identifies the information requirements of a query through Chain-of-Thought
reasoning and selects an optimal set of passages that collectively satisfy
those requirements. Experiments on multi-hop RAG benchmarks show that SETR
outperforms both proprietary LLM-based rerankers and open-source baselines in
terms of answer correctness and retrieval quality, providing an effective and
efficient alternative to traditional rerankers in RAG systems. The code is
available at https://github.com/LGAI-Research/SetR
comment: Accepted to ACL 2025 main (Oral Presentation)
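To make the contrast between ranking by individual relevance and selecting a set concrete, here is a toy sketch. It assumes each passage is already annotated with the information requirements it covers and uses a simple greedy coverage rule. SETR itself identifies requirements and selects passages through Chain-of-Thought reasoning with an LLM, so this is not its algorithm; the field names, scores, and requirement labels are hypothetical.

```python
def top_k_by_relevance(passages, k):
    """Baseline: keep the k passages with the highest individual scores."""
    ranked = sorted(passages, key=lambda p: p["score"], reverse=True)
    return [p["id"] for p in ranked[:k]]

def greedy_set_selection(passages, requirements, k):
    """Pick up to k passages that together cover as many requirements as possible."""
    remaining = set(requirements)
    candidates = list(passages)
    chosen = []
    while remaining and candidates and len(chosen) < k:
        # Prefer the passage covering the most unmet requirements; break ties by score.
        best = max(candidates, key=lambda p: (len(remaining & p["covers"]), p["score"]))
        if not remaining & best["covers"]:
            break  # no candidate adds new information
        chosen.append(best["id"])
        remaining -= best["covers"]
        candidates.remove(best)
    return chosen

# Hypothetical two-hop question needing a director and that director's birthplace.
passages = [
    {"id": "a", "score": 0.90, "covers": {"director"}},
    {"id": "b", "score": 0.85, "covers": {"director"}},
    {"id": "c", "score": 0.60, "covers": {"birthplace"}},
]
print(top_k_by_relevance(passages, 2))                                   # ['a', 'b']
print(greedy_set_selection(passages, {"director", "birthplace"}, 2))    # ['a', 'c']
```

In this toy case the two highest-scoring passages are redundant, while the set-wise selection trades a lower-scoring passage for coverage of the second requirement.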