Computation and Language
☆ Non-Termination Proving: 100 Million LoC and Beyond
We report on our tool, Pulse Infinite, that uses proof techniques to show
non-termination (divergence) in large programs. Pulse Infinite works
compositionally and under-approximately: the former supports scale, and the
latter ensures soundness for proving divergence. Prior work focused on small
benchmarks in the tens or hundreds of lines of code (LoC), and scale limits
their practicality: a single company may have tens of millions, or even
hundreds of millions of LoC or more. We report on applying Pulse Infinite to
over a hundred million lines of open-source and proprietary software written in
C, C++, and Hack, identifying over 30 previously unknown issues, establishing a
new state of the art for detecting divergence in real-world codebases.
comment: 14 pages, 4 figures
☆ Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
Large language models (LLMs) learn non-trivial abstractions during
pretraining, like detecting irregular plural noun subjects. However, it is not
well understood when and how specific linguistic abilities emerge as
traditional evaluation methods such as benchmarking fail to reveal how models
acquire concepts and capabilities. To bridge this gap and better understand
model training at the concept level, we use sparse crosscoders to discover and
align features across model checkpoints. Using this approach, we track the
evolution of linguistic features during pretraining. We train crosscoders
between open-sourced checkpoint triplets with significant performance and
representation shifts, and introduce a novel metric, Relative Indirect Effects
(RelIE), to trace training stages at which individual features become causally
important for task performance. We show that crosscoders can detect feature
emergence, maintenance, and discontinuation during pretraining. Our approach is
architecture-agnostic and scalable, offering a promising path toward more
interpretable and fine-grained analysis of representation learning throughout
pretraining.
☆ Elucidating the Design Space of Decay in Linear Attention
This paper presents a comprehensive investigation into the decay mechanisms
inherent in linear complexity sequence models. We systematically delineate the
design space of decay mechanisms across four pivotal dimensions:
parameterization strategy, which refers to the computational methodology for
decay; parameter sharing, which involves the utilization of supplementary
parameters for decay computation; decay granularity, comparing scalar versus
vector-based decay; and compatibility with relative positional encoding
methods, such as Rotary Position Embedding (RoPE). Through an extensive series
of experiments conducted on diverse language modeling tasks, we uncovered
several critical insights. Firstly, the design of the parameterization strategy
for decay requires meticulous consideration. Our findings indicate that
effective configurations are typically confined to a specific range of
parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may
cause decay values to be too large or too small, thereby significantly
impacting performance. Thirdly, under identical parameterization strategies,
scalar decay generally underperforms compared to its vector-based counterpart.
However, in certain scenarios with alternative parameterization strategies,
scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our
analysis reveals that RoPE, a commonly employed relative positional encoding
method, typically fails to provide tangible benefits to the majority of linear
attention mechanisms.
comment: Accepted to COLM 2025. Yiran Zhong is the corresponding author. Code
is available at https://github.com/Doraemonzzz/xmixers
☆ SpikingBrain Technical Report: Spiking Brain-inspired Large Models
Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li
Mainstream Transformer-based large language models face major efficiency
bottlenecks: training computation scales quadratically with sequence length,
and inference memory grows linearly, limiting long-context processing. Building
large models on non-NVIDIA platforms also poses challenges for stable and
efficient training. To address this, we introduce SpikingBrain, a family of
brain-inspired models designed for efficient long-context training and
inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three
aspects: (1) Model Architecture: linear and hybrid-linear attention
architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an
efficient, conversion-based training pipeline and a dedicated spike coding
framework; (3) System Engineering: customized training frameworks, operator
libraries, and parallelism strategies tailored to MetaX hardware.
Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM,
and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the
feasibility of large-scale LLM development on non-NVIDIA platforms.
SpikingBrain achieves performance comparable to open-source Transformer
baselines while using only about 150B tokens for continual pre-training. Our
models significantly improve long-sequence training efficiency and deliver
inference with (partially) constant memory and event-driven spiking behavior.
For example, SpikingBrain-7B attains over 100x speedup in Time to First Token
for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX
C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4
percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling
low-power operation. Overall, this work demonstrates the potential of
brain-inspired mechanisms to drive the next generation of efficient and
scalable large model design.
☆ Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses
Speakers often have multiple ways to express the same meaning. The Uniform
Information Density (UID) hypothesis suggests that speakers exploit this
variability to maintain a consistent rate of information transmission during
language production. Building on prior work linking UID to syntactic reduction,
we revisit the finding that the optional complementizer $\textit{that}$in
English complement clauses is more likely to be omitted when the clause has low
information density (i.e., more predictable). We advance this line of research
by analyzing a large-scale, contemporary conversational corpus and using
machine learning and neural language models to refine estimates of information
density. Our results replicated the established relationship between
information density and $\textit{that}$-mentioning. However, we found that
previous measures of information density based on matrix verbs'
subcategorization probability capture substantial idiosyncratic lexical
variation. By contrast, estimates derived from contextual word embeddings
account for additional variance in patterns of complementizer usage.
☆ CURE: Controlled Unlearning for Robust Embeddings -- Mitigating Conceptual Shortcuts in Pre-Trained Language Models EMNLP 2025
Pre-trained language models have achieved remarkable success across diverse
applications but remain susceptible to spurious, concept-driven correlations
that impair robustness and fairness. In this work, we introduce CURE, a novel
and lightweight framework that systematically disentangles and suppresses
conceptual shortcuts while preserving essential content information. Our method
first extracts concept-irrelevant representations via a dedicated content
extractor reinforced by a reversal network, ensuring minimal loss of
task-relevant information. A subsequent controllable debiasing module employs
contrastive learning to finely adjust the influence of residual conceptual
cues, enabling the model to either diminish harmful biases or harness
beneficial correlations as appropriate for the target task. Evaluated on the
IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an
absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp,
while introducing minimal computational overhead. Our approach establishes a
flexible, unsupervised blueprint for combating conceptual biases, paving the
way for more reliable and fair language understanding systems.
comment: Accepted at the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2025)
☆ Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation
Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose
output for simpler problems. We present a framework for difficulty-aware
reasoning that teaches models to dynamically adjust reasoning depth based on
problem complexity. Remarkably, we show that models can be endowed with such
dynamic inference pathways without any architectural modifications; we simply
post-train on data that is carefully curated to include chain-of-thought traces
that are proportional in length to problem difficulty. Our analysis reveals
that post-training via supervised fine-tuning (SFT) primarily captures patterns
like reasoning length and format, while direct preference optimization (DPO)
preserves reasoning accuracy, with their combination reducing length and
maintaining or improving performance. Both quantitative metrics and qualitative
assessments confirm that models can learn to "think proportionally", reasoning
minimally on simple problems while maintaining depth for complex ones.
comment: 28 Pages
☆ HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
Positional encoding mechanisms enable Transformers to model sequential
structure and long-range dependencies in text. While absolute positional
encodings struggle with extrapolation to longer sequences due to fixed
positional representations, and relative approaches like Alibi exhibit
performance degradation on extremely long contexts, the widely-used Rotary
Positional Encoding (RoPE) introduces oscillatory attention patterns that
hinder stable long-distance dependency modelling. We address these limitations
through a geometric reformulation of positional encoding. Drawing inspiration
from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic
Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to
implement Lorentz rotations on token representations. Theoretical analysis
demonstrates that RoPE is a special case of our generalized formulation. HoPE
fundamentally resolves RoPE's slation issues by enforcing monotonic decay of
attention weights with increasing token distances. Extensive experimental
results, including perplexity evaluations under several extended sequence
benchmarks, show that HoPE consistently exceeds existing positional encoding
methods. These findings underscore HoPE's enhanced capacity for representing
and generalizing long-range dependencies. Data and code will be available.
comment: This paper proposes Hyperbolic Rotary Positional Encoding (HoPE), a
geometric reformulation of positional encoding inspired by Lorentz
transformations. HoPE addresses limitations of existing methods like RoPE by
enabling stable long-distance dependency modeling. Code and data will be made
available upon publication
☆ BEDTime: A Unified Benchmark for Automatically Describing Time Series
Many recent studies have proposed general-purpose foundation models designed
for a variety of time series analysis tasks. While several established datasets
already exist for evaluating these models, previous works frequently introduce
their models in conjunction with new datasets, limiting opportunities for
direct, independent comparisons and obscuring insights into the relative
strengths of different methods. Additionally, prior evaluations often cover
numerous tasks simultaneously, assessing a broad range of model abilities
without clearly pinpointing which capabilities contribute to overall
performance. To address these gaps, we formalize and evaluate 3 tasks that test
a model's ability to describe time series using generic natural language: (1)
recognition (True/False question-answering), (2) differentiation (multiple
choice question-answering), and (3) generation (open-ended natural language
description). We then unify 4 recent datasets to enable head-to-head model
comparisons on each task. Experimentally, in evaluating 13 state-of-the-art
language, vision--language, and time series--language models, we find that (1)
popular language-only methods largely underperform, indicating a need for time
series-specific architectures, (2) VLMs are quite successful, as expected,
identifying the value of vision models for these tasks and (3) pretrained
multimodal time series--language models successfully outperform LLMs, but still
have significant room for improvement. We also find that all approaches exhibit
clear fragility in a range of robustness tests. Overall, our benchmark provides
a standardized evaluation on a task necessary for time series reasoning
systems.
☆ Hunyuan-MT Technical Report
In this report, we introduce Hunyuan-MT-7B, our first open-source
multilingual translation model, which supports bidirectional translation across
33 major languages and places a special emphasis on translation between
Mandarin and several ethnic minority languages as well as dialects.
Furthermore, to serve and address diverse translation scenarios and enhance
model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a
translation model inspired by the slow thinking mode. This model integrates
multiple outputs generated by the Hunyuan-MT-7B model under varying parameter
settings, thereby achieving performance superior to that of conventional
slow-thinking models based on Chain-of-Thought (CoT). The development of our
models follows a holistic training process specifically engineered for
multilingual translation, which begins with general and MT-oriented
pre-training to build foundational capabilities, proceeds to Supervised
Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced
alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through
comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and
Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models
of comparable parameter size and most of the SOTA large models, particularly on
the task of translation between Mandarin and minority languages as well as
dialects. In the WMT2025 shared task (General Machine Translation), our models
demonstrate state-of-the-art performance, ranking first in 30 out of 31
language pairs. This result highlights the robustness of our models across a
diverse linguistic spectrum, encompassing high-resource languages such as
Chinese, English, and Japanese, as well as low-resource languages including
Czech, Marathi, Estonian, and Icelandic.
☆ Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework
David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Virilo Tejedor, Rosana Montes, Francisco Herrera
Large Language Models (LLMs) are increasingly being deployed in high-risk
domains where opacity, bias, and instability undermine trust and
accountability. Traditional explainability methods, focused on surface outputs,
do not capture the reasoning pathways, planning logic, and systemic impacts of
agentic LLMs.
We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a
triadic fusion framework that unites three complementary dimensions: cognitive
(user understanding), functional (practical utility), and causal (faithful
reasoning). TAXAL provides a unified, role-sensitive foundation for designing,
evaluating, and deploying explanations in diverse sociotechnical settings.
Our analysis synthesizes existing methods, ranging from post-hoc attribution
and dialogic interfaces to explanation-aware prompting, and situates them
within the TAXAL triadic fusion model. We further demonstrate its applicability
through case studies in law, education, healthcare, and public services,
showing how explanation strategies adapt to institutional constraints and
stakeholder roles.
By combining conceptual clarity with design patterns and deployment pathways,
TAXAL advances explainability as a technical and sociotechnical practice,
supporting trustworthy and context-sensitive LLM applications in the era of
agentic AI.
comment: 27 pages, 9 tables and 2 figures
☆ PRIM: Towards Practical In-Image Multilingual Machine Translation EMNLP 2025
In-Image Machine Translation (IIMT) aims to translate images containing texts
from one language to another. Current research of end-to-end IIMT mainly
conducts on synthetic data, with simple background, single font, fixed text
position, and bilingual translation, which can not fully reflect real world,
causing a significant gap between the research and practical conditions. To
facilitate research of IIMT in real-world scenarios, we explore Practical
In-Image Multilingual Machine Translation (IIMMT). In order to convince the
lack of publicly available data, we annotate the PRIM dataset, which contains
real-world captured one-line text images with complex background, various
fonts, diverse text positions, and supports multilingual translation
directions. We propose an end-to-end model VisTrans to handle the challenge of
practical conditions in PRIM, which processes visual text and background
information in the image separately, ensuring the capability of multilingual
translation while improving the visual quality. Experimental results indicate
the VisTrans achieves a better translation quality and visual effect compared
to other models. The code and dataset are available at:
https://github.com/BITHLP/PRIM.
comment: Accepted to EMNLP 2025 Main Conference
☆ ICR: Iterative Clarification and Rewriting for Conversational Search
Most previous work on Conversational Query Rewriting employs an end-to-end
rewriting paradigm. However, this approach is hindered by the issue of multiple
fuzzy expressions within the query, which complicates the simultaneous
identification and rewriting of multiple positions. To address this issue, we
propose a novel framework ICR (Iterative Clarification and Rewriting), an
iterative rewriting scheme that pivots on clarification questions. Within this
framework, the model alternates between generating clarification questions and
rewritten queries. The experimental results show that our ICR can continuously
improve retrieval performance in the clarification-rewriting iterative process,
thereby achieving state-of-the-art performance on two popular datasets.
☆ Finding your MUSE: Mining Unexpected Solutions Engine
Innovators often exhibit cognitive fixation on existing solutions or nascent
ideas, hindering the exploration of novel alternatives. This paper introduces a
methodology for constructing Functional Concept Graphs (FCGs), interconnected
representations of functional elements that support abstraction, problem
reframing, and analogical inspiration. Our approach yields large-scale,
high-quality FCGs with explicit abstraction relations, overcoming limitations
of prior work. We further present MUSE, an algorithm leveraging FCGs to
generate creative inspirations for a given problem. We demonstrate our method
by computing an FCG on 500K patents, which we release for further research.
☆ ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions EMNLP 2025
Most existing Theory of Mind (ToM) benchmarks for foundation models rely on
variations of the Sally-Anne test, offering only a very limited perspective on
ToM and neglecting the complexity of human social interactions. To address this
gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM
capabilities in environments rich with social interactions and spatial
dynamics. While current ToM benchmarks are limited to text-only or dyadic
interactions, ToM-SSI is multimodal and includes group interactions of up to
four agents that communicate and move in situated environments. This unique
design allows us to study, for the first time, mixed cooperative-obstructive
settings and reasoning about multiple agents' mental state in parallel, thus
capturing a wider range of social cognition than existing benchmarks. Our
evaluations reveal that the current models' performance is still severely
limited, especially in these new tasks, highlighting critical gaps for future
research.
comment: EMNLP 2025 (Main)
☆ Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
We introduce Entropy2Vec, a novel framework for deriving cross-lingual
language representations by leveraging the entropy of monolingual language
models. Unlike traditional typological inventories that suffer from feature
sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in
language models to capture typological relationships between languages. By
training a language model on a single language, we hypothesize that the entropy
of its predictions reflects its structural similarity to other languages: Low
entropy indicates high similarity, while high entropy suggests greater
divergence. This approach yields dense, non-sparse language embeddings that are
adaptable to different timeframes and free from missing values. Empirical
evaluations demonstrate that Entropy2Vec embeddings align with established
typological categories and achieved competitive performance in downstream
multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
☆ Masked Diffusion Language Models with Frequency-Informed Training
Despoina Kosmopoulou, Efthymios Georgiou, Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos
We present a masked diffusion language modeling framework for data-efficient
training for the BabyLM 2025 Challenge. Our approach applies diffusion training
objectives to language modeling under strict data constraints, incorporating
frequency-informed masking that prioritizes learning from rare tokens while
maintaining theoretical validity. We explore multiple noise scheduling
strategies, including two-mode approaches, and investigate different noise
weighting schemes within the NELBO objective. We evaluate our method on the
BabyLM benchmark suite, measuring linguistic competence, world knowledge, and
human-likeness. Results show performance competitive to hybrid
autoregressive-masked baselines, demonstrating that diffusion-based training
offers a viable alternative for data-restricted language learning.
comment: Preprint
☆ Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework
Large reasoning models (LRMs) have exhibited strong performance on complex
reasoning tasks, with further gains achievable through increased computational
budgets at inference. However, current test-time scaling methods predominantly
rely on redundant sampling, ignoring the historical experience utilization,
thereby limiting computational efficiency. To overcome this limitation, we
propose Sticker-TTS, a novel test-time scaling framework that coordinates three
collaborative LRMs to iteratively explore and refine solutions guided by
historical attempts. At the core of our framework are distilled key
conditions-termed stickers-which drive the extraction, refinement, and reuse of
critical information across multiple rounds of reasoning. To further enhance
the efficiency and performance of our framework, we introduce a two-stage
optimization strategy that combines imitation learning with self-improvement,
enabling progressive refinement. Extensive evaluations on three challenging
mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH,
demonstrate that Sticker-TTS consistently surpasses strong baselines, including
self-consistency and advanced reinforcement learning approaches, under
comparable inference budgets. These results highlight the effectiveness of
sticker-guided historical experience utilization. Our code and data are
available at https://github.com/RUCAIBox/Sticker-TTS.
comment: 11 pages, 1 figures, 5 tables
☆ Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant
In the era of conversational AI, generating accurate and contextually
appropriate service responses remains a critical challenge. A central question
remains: Is explicit intent recognition a prerequisite for generating
high-quality service responses, or can models bypass this step and produce
effective replies directly? This paper conducts a rigorous comparative study to
address this fundamental design dilemma. Leveraging two publicly available
service interaction datasets, we benchmark several state-of-the-art language
models, including a fine-tuned T5 variant, across both paradigms: Intent-First
Response Generation and Direct Response Generation. Evaluation metrics
encompass both linguistic quality and task success rates, revealing surprising
insights into the necessity or redundancy of explicit intent modelling. Our
findings challenge conventional assumptions in conversational AI pipelines,
offering actionable guidelines for designing more efficient and effective
response generation systems.
comment: 7 pages, 1 figure
☆ Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts ECAI 2025
Sentiment classification in short text datasets faces significant challenges
such as class imbalance, limited training samples, and the inherent
subjectivity of sentiment labels -- issues that are further intensified by the
limited context in short texts. These factors make it difficult to resolve
ambiguity and exacerbate data sparsity, hindering effective learning. In this
paper, we evaluate the effectiveness of small Transformer-based models (i.e.,
BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label
sentiment classification, with a particular focus on short-text settings.
Specifically, we evaluated three key factors influencing model performance: (1)
continued domain-specific pre-training, (2) data augmentation using
automatically generated examples, specifically generative data augmentation,
and (3) architectural variations of the classification head. Our experiment
results show that data augmentation improves classification performance, while
continued pre-training on augmented datasets can introduce noise rather than
boost accuracy. Furthermore, we confirm that modifications to the
classification head yield only marginal benefits. These findings provide
practical guidance for optimizing BERT-based models in resource-constrained
settings and refining strategies for sentiment classification in short-text
datasets.
comment: Accepted at LDD@ECAI 2025
☆ Classification of kinetic-related injury in hospital triage data using NLP
Midhun Shyam, Jim Basilakis, Kieran Luken, Steven Thomas, John Crozier, Paul M. Middleton, X. Rosalind Wang
Triage notes, created at the start of a patient's hospital visit, contain a
wealth of information that can help medical staff and researchers understand
Emergency Department patient epidemiology and the degree of time-dependent
illness or injury. Unfortunately, applying modern Natural Language Processing
and Machine Learning techniques to analyse triage data faces some challenges:
Firstly, hospital data contains highly sensitive information that is subject to
privacy regulation thus need to be analysed on site; Secondly, most hospitals
and medical facilities lack the necessary hardware to fine-tune a Large
Language Model (LLM), much less training one from scratch; Lastly, to identify
the records of interest, expert inputs are needed to manually label the
datasets, which can be time-consuming and costly. We present in this paper a
pipeline that enables the classification of triage data using LLM and limited
compute resources. We first fine-tuned a pre-trained LLM with a classifier
using a small (2k) open sourced dataset on a GPU; and then further fine-tuned
the model with a hospital specific dataset of 1000 samples on a CPU. We
demonstrated that by carefully curating the datasets and leveraging existing
models and open sourced data, we can successfully classify triage data with
limited compute resources.
comment: Accepted as a short paper for publishing at ADMA 2025
(https://adma2025.github.io), with Supplementary Material available at
https://github.com/CRMDS/Kinetic-Injury-Triage
☆ Towards Ontology-Based Descriptions of Conversations with Qualitatively-Defined Concepts
The controllability of Large Language Models (LLMs) when used as
conversational agents is a key challenge, particularly to ensure predictable
and user-personalized responses. This work proposes an ontology-based approach
to formally define conversational features that are typically qualitative in
nature. By leveraging a set of linguistic descriptors, we derive quantitative
definitions for qualitatively-defined concepts, enabling their integration into
an ontology for reasoning and consistency checking. We apply this framework to
the task of proficiency-level control in conversations, using CEFR language
proficiency levels as a case study. These definitions are then formalized in
description logic and incorporated into an ontology, which guides controlled
text generation of an LLM through fine-tuning. Experimental results demonstrate
that our approach provides consistent and explainable proficiency-level
definitions, improving transparency in conversational AI.
comment: Accepted at TOTh 2025 (Terminology \& Ontology: Theories and
applications)
☆ SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
The existing Multimodal Large Language Models (MLLMs) for GUI perception have
made great progress. However, the following challenges still exist in prior
methods: 1) They model discrete coordinates based on text autoregressive
mechanism, which results in lower grounding accuracy and slower inference
speed. 2) They can only locate predefined sets of elements and are not capable
of parsing the entire interface, which hampers the broad application and
support for downstream tasks. To address the above issues, we propose
SparkUI-Parser, a novel end-to-end framework where higher localization
precision and fine-grained parsing capability of the entire interface are
simultaneously achieved. Specifically, instead of using probability-based
discrete modeling, we perform continuous modeling of coordinates based on a
pre-trained Multimodal Large Language Model (MLLM) with an additional token
router and coordinate decoder. This effectively mitigates the limitations
inherent in the discrete output characteristics and the token-by-token
generation process of MLLMs, consequently boosting both the accuracy and the
inference speed. To further enhance robustness, a rejection mechanism based on
a modified Hungarian matching algorithm is introduced, which empowers the model
to identify and reject non-existent elements, thereby reducing false positives.
Moreover, we present ScreenParse, a rigorously constructed benchmark to
systematically assess structural perception capabilities of GUI models across
diverse scenarios. Extensive experiments demonstrate that our approach
consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2,
CAGUI-Grounding and ScreenParse benchmarks. The resources are available at
https://github.com/antgroup/SparkUI-Parser.
☆ ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Large Language Models (LLMs) have demonstrated remarkable progress in
long-context understanding, yet they face significant challenges in
high-quality long-form generation. Existing studies primarily suffer from two
limitations: (1) A heavy reliance on scarce, high-quality long-form response
data for supervised fine-tuning (SFT) or for pairwise preference reward in
reinforcement learning (RL). (2) Focus on coarse-grained quality optimization
dimensions, such as relevance, coherence, and helpfulness, overlooking the
fine-grained specifics inherent to diverse long-form generation scenarios. To
address this issue, we propose a framework using Adaptive Constraint-Enhanced
reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first
automatically deconstructs each instruction into a set of fine-grained,
adaptive constraint criteria by identifying its underlying intents and demands.
Subsequently, we design a reward mechanism that quantifies the quality of
long-form responses based on their satisfaction over corresponding constraints,
converting subjective quality evaluation into constraint verification. Finally,
we utilize reinforcement learning to guide models toward superior long-form
generation capabilities. Experimental results demonstrate that our ACE-RL
framework significantly outperforms existing SFT and RL baselines by 20.70% and
7.32% on WritingBench, and our top-performing model even surpasses proprietary
systems like GPT-4o by 7.10%, providing a more effective training paradigm for
LLMs to generate high-quality content across diverse long-form generation
scenarios.
comment: Under review, our code is available at https://github.com/ZNLP/ACE-RL
☆ PLaMo 2 Technical Report
Preferred Networks, :, Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu
In this report, we introduce PLaMo 2, a series of Japanese-focused large
language models featuring a hybrid Samba-based architecture that transitions to
full attention via continual pre-training to support 32K token contexts.
Training leverages extensive synthetic corpora to overcome data scarcity, while
computational efficiency is achieved through weight reuse and structured
pruning. This efficient pruning methodology produces an 8B model that achieves
performance comparable to our previous 100B model. Post-training further
refines the models using a pipeline of supervised fine-tuning (SFT) and direct
preference optimization (DPO), enhanced by synthetic Japanese instruction data
and model merging techniques. Optimized for inference using vLLM and
quantization with minimal accuracy loss, the PLaMo 2 models achieve
state-of-the-art results on Japanese benchmarks, outperforming similarly-sized
open models in instruction-following, language fluency, and Japanese-specific
knowledge.
☆ L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning SP 2025
The ability of Large Language Models (LLMs) to solve complex tasks has made
them crucial in the development of AI-based applications. However, the high
computational requirements to fine-tune these LLMs on downstream tasks pose
significant challenges, particularly when resources are limited. In response to
this challenge, we introduce L1RA, a novel technique aimed at dynamically
distributing the rank of low-rank adapters during fine-tuning using LoRA. Given
a rank budget (i.e., total sum of adapters rank), L1RA leverages L1
regularisation to prune redundant ranks and redistribute them across adapters,
thereby optimising resource utilisation. Through a series of comprehensive
experiments, we empirically demonstrate that L1RA maintains comparable or even
reduced computational overhead compared to other LoRA variants, including the
vanilla approach, while achieving same or better performances. Moreover, the
post-training analysis of rank distribution unveiled insights into the specific
model components requiring the most adaptation to align with the task
objective: the feed-forward layers and the attention output projection. These
results highlight the efficacy of L1RA in not only enhancing the efficiency of
LLM fine-tuning, but also in providing valuable diagnostic information for
model refinement and customisation. In conclusion, L1RA stands as a promising
technique for advancing the performance and interpretability of LLM adaptation,
particularly in scenarios where computational resources are constrained.
comment: Work published at ICNLSP 2025, waiting for publication link
☆ Using LLMs for Multilingual Clinical Entity Linking to ICD-10
The linking of clinical entities is a crucial part of extracting structured
information from clinical texts. It is the process of assigning a code from a
medical ontology or classification to a phrase in the text. The International
Classification of Diseases - 10th revision (ICD-10) is an international
standard for classifying diseases for statistical and insurance purposes.
Automatically assigning the correct ICD-10 code to terms in discharge summaries
will simplify the work of healthcare professionals and ensure consistent coding
in hospitals. Our paper proposes an approach for linking clinical terms to
ICD-10 codes in different languages using Large Language Models (LLMs). The
approach consists of a multistage pipeline that uses clinical dictionaries to
match unambiguous terms in the text and then applies in-context learning with
GPT-4.1 to predict the ICD-10 code for the terms that do not match the
dictionary. Our system shows promising results in predicting ICD-10 codes on
different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on
subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
comment: 7 pages, 2 Figures, to be published in Proceedings of the 15th
International Conference on Recent Advances in Natural Language Processing,
RANLP 2025
☆ Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition? EMNLP 2025
Driven by vast and diverse textual data, large language models (LLMs) have
demonstrated impressive performance across numerous natural language processing
(NLP) tasks. Yet, a critical question persists: does their generalization arise
from mere memorization of training data or from deep semantic understanding? To
investigate this, we propose a bi-perspective evaluation framework to assess
LLMs' scenario cognition - the ability to link semantic scenario elements with
their arguments in context. Specifically, we introduce a novel scenario-based
dataset comprising diverse textual descriptions of fictional facts, annotated
with scenario elements. LLMs are evaluated through their capacity to answer
scenario-related questions (model output perspective) and via probing their
internal representations for encoded scenario elements-argument associations
(internal representation perspective). Our experiments reveal that current LLMs
predominantly rely on superficial memorization, failing to achieve robust
semantic scenario cognition, even in simple cases. These findings expose
critical limitations in LLMs' semantic understanding and offer cognitive
insights for advancing their capabilities.
comment: EMNLP 2025 Main Conference
☆ Evaluating Cognitive-Behavioral Fixation via Multimodal User Viewing Patterns on Social Media
Digital social media platforms frequently contribute to cognitive-behavioral
fixation, a phenomenon in which users exhibit sustained and repetitive
engagement with narrow content domains. While cognitive-behavioral fixation has
been extensively studied in psychology, methods for computationally detecting
and evaluating such fixation remain underexplored. To address this gap, we
propose a novel framework for assessing cognitive-behavioral fixation by
analyzing users' multimodal social media engagement patterns. Specifically, we
introduce a multimodal topic extraction module and a cognitive-behavioral
fixation quantification module that collaboratively enable adaptive,
hierarchical, and interpretable assessment of user behavior. Experiments on
existing benchmarks and a newly curated multimodal dataset demonstrate the
effectiveness of our approach, laying the groundwork for scalable computational
analysis of cognitive fixation. All code in this project is publicly available
for research purposes at
https://github.com/Liskie/cognitive-fixation-evaluation.
☆ AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
Spoken Language Understanding (SLU) is a core component of conversational
systems, enabling machines to interpret user utterances. Despite its
importance, developing effective SLU systems remains challenging due to the
scarcity of labeled training data and the computational burden of deploying
Large Language Models (LLMs) in real-world applications. To further alleviate
these issues, we propose an Adaptive Feature Distillation framework that
transfers rich semantic representations from a General Text Embeddings
(GTE)-based teacher model to a lightweight student model. Our method introduces
a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to
align heterogeneous feature spaces, and a Dynamic Distillation Coefficient
(DDC) that adaptively modulates the distillation strength based on real-time
feedback from intent and slot prediction performance. Experiments on the
Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves
state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score,
and 85.50% overall accuracy.
comment: 5 pages, 1 figures
☆ Analyzing Finnish Inflectional Classes through Discriminative Lexicon and Deep Learning Models
Descriptions of complex nominal or verbal systems make use of inflectional
classes. Inflectional classes bring together nouns which have similar stem
changes and use similar exponents in their paradigms. Although inflectional
classes can be very useful for language teaching as well as for setting up
finite state morphological systems, it is unclear whether inflectional classes
are cognitively real, in the sense that native speakers would need to discover
these classes in order to learn how to properly inflect the nouns of their
language. This study investigates whether the Discriminative Lexicon Model
(DLM) can understand and produce Finnish inflected nouns without setting up
inflectional classes, using a dataset with 55,271 inflected nouns of 2000
high-frequency Finnish nouns from 49 inflectional classes. Several DLM
comprehension and production models were set up. Some models were not informed
about frequency of use, and provide insight into learnability with infinite
exposure (endstate learning). Other models were set up from a usage based
perspective, and were trained with token frequencies being taken into
consideration (frequency-informed learning). On training data, models performed
with very high accuracies. For held-out test data, accuracies decreased, as
expected, but remained acceptable. Across most models, performance increased
for inflectional classes with more types, more lower-frequency words, and more
hapax legomena, mirroring the productivity of the inflectional classes. The
model struggles more with novel forms of unproductive and less productive
classes, and performs far better for unseen forms belonging to productive
classes. However, for usage-based production models, frequency was the dominant
predictor of model performance, and correlations with measures of productivity
were tenuous or absent.
☆ Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation
Automating the decision of whether a code change requires manual review is
vital for maintaining software quality in modern development workflows.
However, the emergence of new programming languages and frameworks creates a
critical bottleneck: while large volumes of unlabelled code are readily
available, there is an insufficient amount of labelled data to train supervised
models for review classification. We address this challenge by leveraging Large
Language Models (LLMs) to translate code changes from well-resourced languages
into equivalent changes in underrepresented or emerging languages, generating
synthetic training data where labelled examples are scarce. We assume that
although LLMs have learned the syntax and semantics of new languages from
available unlabelled code, they have yet to fully grasp which code changes are
considered significant or review-worthy within these emerging ecosystems. To
overcome this, we use LLMs to generate synthetic change examples and train
supervised classifiers on them. We systematically compare the performance of
these classifiers against models trained on real labelled data. Our experiments
across multiple GitHub repositories and language pairs demonstrate that
LLM-generated synthetic data can effectively bootstrap review recommendation
systems, narrowing the performance gap even in low-resource settings. This
approach provides a scalable pathway to extend automated code review
capabilities to rapidly evolving technology stacks, even in the absence of
annotated data.
comment: 4 pages, 1 figure
☆ Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
As large language models transition to agentic systems, current safety
evaluation frameworks face critical gaps in assessing deployment-specific
risks. We introduce AgentSeer, an observability-based evaluation framework that
decomposes agentic executions into granular action and component graphs,
enabling systematic agentic-situational assessment. Through cross-model
validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and
iterative refinement attacks, we demonstrate fundamental differences between
model-level and agentic-level vulnerability profiles. Model-level evaluation
reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash
(50.00% ASR), with both models showing susceptibility to social engineering
while maintaining logic-based attack resistance. However, agentic-level
assessment exposes agent-specific risks invisible to traditional evaluation. We
discover "agentic-only" vulnerabilities that emerge exclusively in agentic
contexts, with tool-calling showing 24-60% higher ASR across both models.
Cross-model analysis reveals universal agentic patterns, agent transfer
operations as highest-risk tools, semantic rather than syntactic vulnerability
mechanisms, and context-dependent attack effectiveness, alongside
model-specific security profiles in absolute ASR levels and optimal injection
strategies. Direct attack transfer from model-level to agentic contexts shows
degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash:
28%), while context-aware iterative attacks successfully compromise objectives
that failed at model-level, confirming systematic evaluation gaps. These
findings establish the urgent need for agentic-situation evaluation paradigms,
with AgentSeer providing the standardized methodology and empirical validation.
☆ Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training
Large language models increasingly rely on synthetic data due to
human-written content scarcity, yet recursive training on model-generated
outputs leads to model collapse, a degenerative process threatening factual
reliability. We define knowledge collapse as a distinct three-stage phenomenon
where factual accuracy deteriorates while surface fluency persists, creating
"confidently wrong" outputs that pose critical risks in accuracy-dependent
domains. Through controlled experiments with recursive synthetic training, we
demonstrate that collapse trajectory and timing depend critically on
instruction format, distinguishing instruction-following collapse from
traditional model collapse through its conditional, prompt-dependent nature. We
propose domain-specific synthetic training as a targeted mitigation strategy
that achieves substantial improvements in collapse resistance while maintaining
computational efficiency. Our evaluation framework combines model-centric
indicators with task-centric metrics to detect distinct degradation phases,
enabling reproducible assessment of epistemic deterioration across different
language models. These findings provide both theoretical insights into collapse
dynamics and practical guidance for sustainable AI training in
knowledge-intensive applications where accuracy is paramount.
☆ Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects
Personality manipulation in large language models (LLMs) is increasingly
applied in customer service and agentic scenarios, yet its mechanisms and
trade-offs remain unclear. We present a systematic study of personality control
using the Big Five traits, comparing in-context learning (ICL),
parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our
contributions are fourfold. First, we construct a contrastive dataset with
balanced high/low trait responses, enabling effective steering vector
computation and fair cross-method evaluation. Second, we introduce a unified
evaluation framework based on within-run $\Delta$ analysis that disentangles,
reasoning capability, agent performance, and demographic bias across MMLU,
GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to
separate openness from conscientiousness, addressing representational overlap
in trait encoding. Fourth, we propose a three-level stability framework that
quantifies method-, trait-, and combination-level robustness, offering
practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT
and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment
with minimal capability loss, PEFT delivers the highest alignment at the cost
of degraded task performance, and MS provides lightweight runtime control with
competitive effectiveness. Trait-level analysis shows openness as uniquely
challenging, agreeableness as most resistant to ICL, and personality encoding
consolidating around intermediate layers. Taken together, these results
establish personality manipulation as a multi-level probe into behavioral
representation, linking surface conditioning, parameter encoding, and
activation-level steering, and positioning mechanistic steering as a
lightweight alternative to fine-tuning for both deployment and
interpretability.
☆ Enhancing Diversity in Large Language Models via Determinantal Point Processes
Supervised fine-tuning and reinforcement learning are two popular methods for
post-training large language models (LLMs). While improving the model's
performance on downstream tasks, they often reduce the model's output
diversity, leading to narrow, canonical responses. Existing methods to enhance
diversity are limited, either by operating at inference time or by focusing on
lexical differences. We propose a novel training method named DQO based on
determinantal point processes (DPPs) to jointly optimize LLMs for quality and
semantic diversity. Our approach samples and embeds a group of responses for
each prompt, then uses the determinant of a kernel-based similarity matrix to
measure diversity as the volume spanned by the embeddings of these responses.
Experiments across instruction-following, summarization, story generation, and
reasoning tasks demonstrate that our method substantially improves semantic
diversity without sacrificing model quality.
☆ Decoders Laugh as Loud as Encoders
From the dawn of the computer, Allen Turing dreamed of a robot that could
communicate using language as a human being. The recent advances in the field
of Large Language Models (LLMs) shocked the scientific community when a single
model can apply for various natural language processing (NLP) tasks, while the
output results are sometimes even better than most human communication skills.
Models such as GPT, Claude, Grok, etc. have left their mark on the scientific
community. However, it is unclear how much these models understand what they
produce, especially in a nuanced theme such as humor. The question of whether
computers understand humor is still open (among the decoders, the latest to be
checked was GPT-2). We addressed this issue in this paper; we have showed that
a fine-tuned decoder (GPT-4o) performed (Mean F1-macro score of 0.85) as well
as the best fine-tuned encoder (RoBERTa with a Mean of F1-score 0.86)
☆ Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework
Accurately answering complex questions has consistently been a significant
challenge for Large Language Models (LLMs). To address this, this paper
proposes a multi-hop question decomposition method for complex questions,
building upon research within the MQUAKE framework. Utilizing the LLAMA3 model,
we systematically investigate the impact of multi-hop question decomposition
within knowledge graphs on model comprehension and reasoning accuracy, both
before and after model training. In our experiments, we systematically
partitioned and converted the MQUAKE-T dataset into two distinct formats: a
single-hop dataset designed for directly answering complex questions, and a
multi-hop dataset constructed using the multi-hop question decomposition
method. We then fine-tuned the LLAMA3 model on these datasets and conducted
inference tests. Our results demonstrate that, without fine-tuning the LLM, the
prediction performance based on the multi-hop question decomposition method
significantly outperforms the method of directly answering complex questions.
After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance
of both approaches improved compared to the untrained baseline. Crucially, the
method utilizing multi-hop decomposition consistently maintained its
superiority. These findings validate the effectiveness of the multi-hop
decomposition method both before and after training, demonstrating its
capability to effectively enhance the LLM's ability to answer complex
questions.
☆ A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
Natural language processing (NLP) is a key technology to extract important
patient information from clinical narratives to support healthcare
applications. The rapid development of large language models (LLMs) has
revolutionized many NLP tasks in the clinical domain, yet their optimal use in
patient information extraction tasks requires further exploration. This study
examines LLMs' effectiveness in patient information extraction, focusing on LLM
architectures, fine-tuning strategies, and multi-task instruction tuning
techniques for developing robust and generalizable patient information
extraction systems. This study aims to explore key concepts of using LLMs for
clinical concept and relation extraction tasks, including: (1) encoder-only or
decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT)
algorithms, and (3) multi-task instruction tuning on few-shot learning
performance. We benchmarked a suite of LLMs, including encoder-based LLMs
(BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1,
GatorTronLlama), across five datasets. We compared traditional full-size
fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning
framework that combines both tasks across four datasets to evaluate the
zero-shot and few-shot learning performance using the leave-one-dataset-out
strategy.
☆ Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization
Sign language datasets are often not representative in terms of vocabulary,
underscoring the need for models that generalize to unseen signs. Vector
quantization is a promising approach for learning discrete, token-like
representations, but it has not been evaluated whether the learned units
capture spurious correlations that hinder out-of-vocabulary performance. This
work investigates two phonological inductive biases: Parameter Disentanglement,
an architectural bias, and Phonological Semi-Supervision, a regularization
technique, to improve isolated sign recognition of known signs and
reconstruction quality of unseen signs with a vector-quantized autoencoder. The
primary finding is that the learned representations from the proposed model are
more effective for one-shot reconstruction of unseen signs and more
discriminative for sign identification compared to a controlled baseline. This
work provides a quantitative analysis of how explicit, linguistically-motivated
biases can improve the generalization of learned representations of sign
language.
☆ WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated
impressive capabilities across various vision-language tasks. However, their
reasoning abilities in the multimodal symbolic music domain remain largely
unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic
music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to
interpret real-world music scores and answer complex musicological queries.
Each instance in WildScore is sourced from genuine musical compositions and
accompanied by authentic user-generated questions and discussions, capturing
the intricacies of practical music analysis. To facilitate systematic
evaluation, we propose a systematic taxonomy, comprising both high-level and
fine-grained musicological ontologies. Furthermore, we frame complex music
reasoning as multiple-choice question answering, enabling controlled and
scalable assessment of MLLMs' symbolic music understanding. Empirical
benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns
in their visual-symbolic reasoning, uncovering both promising directions and
persistent challenges for MLLMs in symbolic music reasoning and analysis. We
release the dataset and code.
☆ Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning
The convergence of Language models, Agent models, and World models represents
a critical frontier for artificial intelligence. While recent progress has
focused on scaling Language and Agent models, the development of sophisticated,
explicit World Models remains a key bottleneck, particularly for complex,
long-horizon multi-agent tasks. In domains such as robotic soccer, agents
trained via standard reinforcement learning in high-fidelity but
structurally-flat simulators often fail due to intractable exploration spaces
and sparse rewards. This position paper argues that the next frontier in
developing capable agents lies in creating environments that possess an
explicit, hierarchical World Model. We contend that this is best achieved
through hierarchical scaffolding, where complex goals are decomposed into
structured, manageable subgoals. Drawing evidence from a systematic review of
2024 research in multi-agent soccer, we identify a clear and decisive trend
towards integrating symbolic and hierarchical methods with multi-agent
reinforcement learning (MARL). These approaches implicitly or explicitly
construct a task-based world model to guide agent learning. We then propose a
paradigm shift: leveraging Large Language Models to dynamically generate this
hierarchical scaffold, effectively using language to structure the World Model
on the fly. This language-driven world model provides an intrinsic curriculum,
dense and meaningful learning signals, and a framework for compositional
learning, enabling Agent Models to acquire sophisticated, strategic behaviors
with far greater sample efficiency. By building environments with explicit,
language-configurable task layers, we can bridge the gap between low-level
reactive behaviors and high-level strategic team play, creating a powerful and
generalizable framework for training the next generation of intelligent agents.
☆ KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering EMNLP
Retrieval-Augmented Generation (RAG) mitigates hallucination in Large
Language Models (LLMs) by incorporating external data, with Knowledge Graphs
(KGs) offering crucial information for question answering. Traditional
Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing,
which typically retrieves knowledge strictly necessary for answer generation,
thus often suffer from low coverage due to rigid schema requirements and
semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that
enhances QA coverage by retrieving a broader subgraph likely to contain
relevant information. Our retrieval-filtering-summarization approach, combined
with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs,
reduces noises and improves QA for both simple and complex questions.
Experiments demonstrate that KERAG surpasses state-of-the-art solutions by
about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
comment: Accepted by EMNLP Findings 2025
♻ ☆ Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification EMNLP 2025
Large language models (LLMs) often behave inconsistently across inputs,
indicating uncertainty and motivating the need for its quantification in
high-stakes settings. Prior work on calibration and uncertainty quantification
often focuses on individual models, overlooking the potential of model
diversity. We hypothesize that LLMs make complementary predictions due to
differences in training and the Zipfian nature of language, and that
aggregating their outputs leads to more reliable uncertainty estimates. To
leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a
simple information-theoretic method that uses Jensen-Shannon Divergence to
identify and aggregate well-calibrated subsets of LLMs. Experiments on binary
prediction tasks demonstrate improved calibration and predictive performance
compared to single-model and na\"ive ensemble baselines. In addition, we
explore using MUSE as guided signals with chain-of-thought distillation to
fine-tune LLMs for calibration. MUSE is available
at:https://github.com/LARK-NLP-Lab/MUSE.
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment EMNLP 2025
Large language models (LLMs) have advanced virtual educators and learners,
bridging NLP with AI4Education. Existing work often lacks scalability and fails
to leverage diverse, large-scale course content, with limited frameworks for
assessing pedagogic quality. To this end, we propose WikiHowAgent, a
multi-agent workflow leveraging LLMs to simulate interactive teaching-learning
conversations. It integrates teacher and learner agents, an interaction
manager, and an evaluator to facilitate procedural learning and assess
pedagogic quality. We introduce a dataset of 114,296 teacher-learner
conversations grounded in 14,287 tutorials across 17 domains and 727 topics.
Our evaluation protocol combines computational and rubric-based metrics with
human judgment alignment. Results demonstrate the workflow's effectiveness in
diverse setups, offering insights into LLM capabilities across domains. Our
datasets and implementations are fully open-sourced.
comment: 14 pages, accepted by EMNLP 2025
♻ ☆ Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following
Large Language Models (LLMs) are now increasingly widely used to simulate
personas in virtual environments, leveraging their instruction-following
capability. However, we discovered that even state-of-the-art LLMs cannot
simulate personas with reversed performance (e.g., student personas with low
proficiency in educational settings), which impairs the simulation diversity
and limits the practical applications of the simulated environments. In this
work, using mathematical reasoning as a representative scenario, we propose the
first benchmark dataset for evaluating LLMs on simulating personas with
reversed performance, a capability that we dub "counterfactual instruction
following". We evaluate both open-weight and closed-source LLMs on this task
and find that LLMs, including the OpenAI o1 reasoning model, all struggle to
follow counterfactual instructions for simulating reversedly performing
personas. Intersectionally simulating both the performance level and the race
population of a persona worsens the effect even further. These results
highlight the challenges of counterfactual instruction following and the need
for further research.
♻ ☆ RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Multimodal question answering (QA) often requires identifying which video,
audio, or sensor tokens are relevant to the question. Yet modality
disagreements are common: off-camera speech, background noise, or motion
outside the field of view often mislead fusion models that weight all streams
equally. We present RAVEN, a unified QA architecture whose core is QuART, a
query-conditioned cross-modal gating module that assigns scalar relevance
scores to each token across modalities, enabling the model to amplify
informative signals and suppress distractors before fusion. RAVEN is trained
through a three-stage pipeline comprising unimodal pretraining, query-aligned
fusion, and disagreement-oriented fine-tuning -- each stage targeting a
distinct challenge in multi-modal reasoning: representation quality,
cross-modal relevance, and robustness to modality mismatch. To support training
and evaluation, we release AVS-QA, a dataset of 300K synchronized
Audio--Video-Sensor streams paired with automatically generated question-answer
pairs. Experimental results on seven multi-modal QA benchmarks -- including
egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5\% and
8.0\% gains in accuracy compared to state-of-the-art multi-modal large language
models, respectively. Incorporating sensor data provides an additional 16.4\%
boost, and the model remains robust under modality corruption, outperforming
SOTA baselines by 50.23\%. Our code and dataset are available at
https://github.com/BASHLab/RAVEN.
♻ ☆ BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling ICML 2025
Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, Jiang Bian
Time-series Generation (TSG) is a prominent research area with broad
applications in simulations, data augmentation, and counterfactual analysis.
While existing methods have shown promise in unconditional single-domain TSG,
real-world applications demand for cross-domain approaches capable of
controlled generation tailored to domain-specific constraints and
instance-level requirements. In this paper, we argue that text can provide
semantic insights, domain information and instance-specific temporal patterns,
to guide and improve TSG. We introduce ``Text-Controlled TSG'', a task focused
on generating realistic time series by incorporating textual descriptions. To
address data scarcity in this setting, we propose a novel LLM-based Multi-Agent
framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore,
we introduce BRIDGE, a hybrid text-controlled TSG framework that integrates
semantic prototypes with text description for supporting domain-level guidance.
This approach achieves state-of-the-art generation fidelity on 11 of 12
datasets, and improves controllability by up to 12% on MSE and 6% MAE compared
to no text input generation, highlighting its potential for generating tailored
time-series data.
comment: ICML 2025 Main Conference
♻ ☆ Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
Argument summarization aims to generate concise, structured representations
of complex, multi-perspective debates. While recent work has advanced the
identification and clustering of argumentative components, the generation stage
remains underexplored. Existing approaches typically rely on single-pass
generation, offering limited support for factual correction or structural
refinement. To address this gap, we introduce Arg-LLaDA, a novel large language
diffusion framework that iteratively improves summaries via sufficiency-guided
remasking and regeneration. Our method combines a flexible masking controller
with a sufficiency-checking module to identify and revise unsupported,
redundant, or incomplete spans, yielding more faithful, concise, and coherent
outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA
surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation
metrics. In addition, human evaluations reveal substantial improvements across
core dimensions, coverage, faithfulness, and conciseness, validating the
effectiveness of our iterative, sufficiency-aware generation strategy.
comment: Preprint
♻ ☆ AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
Large Language Model (LLM) agents offer a powerful new paradigm for solving
various problems by combining natural language reasoning with the execution of
external tools. However, their dynamic and non-transparent behavior introduces
critical security risks, particularly in the presence of prompt injection
attacks. In this work, we propose a novel insight that treats the agent runtime
traces as structured programs with analyzable semantics. Thus, we present
AgentArmor, a program analysis framework that converts agent traces into graph
intermediate representation-based structured program dependency representations
(e.g., CFG, DFG, and PDG) and enforces security policies via a type system.
AgentArmor consists of three key components: (1) a graph constructor that
reconstructs the agent's runtime traces as graph-based intermediate
representations with control and data flow described within; (2) a property
registry that attaches security-relevant metadata of interacted tools \& data,
and (3) a type system that performs static inference and checking over the
intermediate representation. By representing agent behavior as structured
programs, AgentArmor enables program analysis for sensitive data flow, trust
boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo
benchmark, the results show that AgentArmor can reduce the ASR to 3\%, with the
utility drop only 1\%.
♻ ☆ First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
Much work has been done on conversational LLM agents which directly assist
human users with tasks. We present an alternative paradigm for interacting with
LLM agents, which we call "overhearing agents". These overhearing agents do not
actively participate in conversation -- instead, they "listen in" on
human-to-human conversations and perform background tasks or provide
suggestions to assist the user. In this work, we explore the overhearing agents
paradigm through the lens of Dungeons & Dragons gameplay. We present an
in-depth study using large multimodal audio-language models as overhearing
agents to assist a Dungeon Master. We perform a human evaluation to examine the
helpfulness of such agents and find that some large audio-language models have
the emergent ability to perform overhearing agent tasks using implicit audio
cues. Finally, we release Python libraries and our project code to support
further research into the overhearing agents paradigm at
https://github.com/zhudotexe/overhearing_agents.
comment: 9 pages, 5 figures. COLM 2025 Workshop on AI Agents
♻ ☆ PersonaGym: Evaluating Persona Agents and LLMs EMNLP
Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari
Persona agents, which are LLM agents conditioned to act according to an
assigned persona, enable contextually rich and user aligned interactions across
domains like education and healthcare. However, evaluating how faithfully these
agents adhere to their personas remains a significant challenge, particularly
in free-form settings that demand consistency across diverse, persona-relevant
environments. We introduce PersonaGym, the first dynamic evaluation framework
for persona agents, and PersonaScore, a human-aligned automatic metric grounded
in decision theory that enables comprehensive large-scale evaluation. Our
evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals
significant advancement opportunities. For example, GPT-4.1 had the exact same
PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed
source model. Importantly, increased model size and complexity do not
necessarily enhance persona agent capabilities, underscoring the need for
algorithmic and architectural innovation toward faithful, performant persona
agents.
comment: EMNLP Findings 2025
♻ ☆ Social Bias in Multilingual Language Models: A Survey EMNLP 2025
Pretrained multilingual models exhibit the same social bias as models
processing English texts. This systematic review analyzes emerging research
that extends bias evaluation and mitigation approaches into multilingual and
non-English contexts. We examine these studies with respect to linguistic
diversity, cultural awareness, and their choice of evaluation metrics and
mitigation techniques. Our survey illuminates gaps in the field's dominant
methodological design choices (e.g., preference for certain languages, scarcity
of multilingual mitigation experiments) while cataloging common issues
encountered and solutions implemented in adapting bias benchmarks across
languages and cultures. Drawing from the implications of our findings, we chart
directions for future research that can reinforce the multilingual bias
literature's inclusivity, cross-cultural appropriateness, and alignment with
state-of-the-art NLP advancements.
comment: Accepted into EMNLP 2025 Main Conference
♻ ☆ MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation
Despite progress in gloss-free Sign Language Translation (SLT), monolithic
end-to-end models consistently fail on two critical components of natural
signing: the precise recognition of high-speed fingerspelling and the
integration of asynchronous non-manual cues from the face. Recent progress in
Automated Sign Language Translation with Large Language Models has side stepped
this challenge, forcing a single network to learn these simultaneously
resulting in poor performance when tasked with translating crucial information
such as names,places, and technical terms. We introduce MultiStream-LLM, a
modular framework designed to overcome these limitations. Our approach employs
separate, specialized predictors for continuous signing, fingerspelling, and
lipreading. Each expert network first decodes its specific modality into a
sequence of tokens. These parallel streams are then fused by a lightweight
transformer that resolves temporal misalignments before passing the combined
representation to a Large Language Model (LLM) for final sentence generation.
Our method establishes a new state-of-the-art on the How2Sign benchmark with a
BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging
ChicagoFSWildPlus fingerspelling dataset. These results validate our core
hypothesis: by isolating and solving distinct recogni tion tasks before fusion,
our multi-expert approach provides a more powerful and effective pathway to
robust, high-fidelity sign language translation.
♻ ☆ UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
The development of autonomous agents for graphical user interfaces (GUIs)
presents major challenges in artificial intelligence. While recent advances in
native agent models have shown promise by unifying perception, reasoning,
action, and memory through end-to-end learning, open problems remain in data
scalability, multi-turn reinforcement learning (RL), the limitations of
GUI-only operation, and environment stability. In this technical report, we
present UI-TARS-2, a native GUI-centered agent model that addresses these
challenges through a systematic training methodology: a data flywheel for
scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI
environment that integrates file systems and terminals, and a unified sandbox
platform for large-scale rollouts. Empirical evaluation demonstrates that
UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on
WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines
such as Claude and OpenAI agents. In game environments, it attains a mean
normalized score of 59.8 across a 15-game suite-roughly 60% of human-level
performance-and remains competitive with frontier proprietary models (e.g.,
OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to
long-horizon information-seeking tasks and software engineering benchmarks,
highlighting its robustness across diverse agent tasks. Detailed analyses of
training dynamics further provide insights into achieving stability and
efficiency in large-scale agent RL. These results underscore UI-TARS-2's
potential to advance the state of GUI agents and exhibit strong generalization
to real-world interactive scenarios.
♻ ☆ ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
The growing influence of video content as a medium for communication and
misinformation underscores the urgent need for effective tools to analyze
claims in multilingual and multi-topic settings. Existing efforts in
misinformation detection largely focus on written text, leaving a significant
gap in addressing the complexity of spoken text in video transcripts. We
introduce ViClaim, a dataset of 1,798 annotated video transcripts across three
languages (English, German, Spanish) and six topics. Each sentence in the
transcripts is labeled with three claim-related categories: fact-check-worthy,
fact-non-check-worthy, or opinion. We developed a custom annotation tool to
facilitate the highly complex annotation process. Experiments with
state-of-the-art multilingual language models demonstrate strong performance in
cross-validation (macro F1 up to 0.896) but reveal challenges in generalization
to unseen topics, particularly for distinct domains. Our findings highlight the
complexity of claim detection in video transcripts. ViClaim offers a robust
foundation for advancing misinformation detection in video-based communication,
addressing a critical gap in multimodal analysis.
♻ ☆ Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models
This article introduces misinfo-general, a benchmark dataset for evaluating
misinformation models' ability to perform out-of-distribution generalization.
Misinformation changes rapidly, much more quickly than moderators can annotate
at scale, resulting in a shift between the training and inference data
distributions. As a result, misinformation detectors need to be able to perform
out-of-distribution generalization, an attribute they currently lack. Our
benchmark uses distant labelling to enable simulating covariate shifts in
misinformation content. We identify time, event, topic, publisher, political
bias, misinformation type as important axes for generalization, and we evaluate
a common class of baseline models on each. Using article metadata, we show how
this model fails desiderata, which is not necessarily obvious from
classification metrics. Finally, we analyze properties of the data to ensure
limited presence of modelling shortcuts. We make the dataset and accompanying
code publicly available: https://github.com/ioverho/misinfo-general
comment: Under review
♻ ☆ TECP: Token-Entropy Conformal Prediction for LLMs
Uncertainty quantification (UQ) for open-ended language generation remains a
critical yet underexplored challenge, especially under black-box constraints
where internal model signals are inaccessible. In this paper, we introduce
Token-Entropy Conformal Prediction (TECP), a novel framework that leverages
token-level entropy as a logit-free, reference-free uncertainty measure and
integrates it into a split conformal prediction (CP) pipeline to construct
prediction sets with formal coverage guarantees. Unlike existing approaches
that rely on semantic consistency heuristics or white-box features, TECP
directly estimates epistemic uncertainty from the token entropy structure of
sampled generations and calibrates uncertainty thresholds via CP quantiles to
ensure provable error control. Empirical evaluations across six large language
models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP
consistently achieves reliable coverage and compact prediction sets,
outperforming prior self-consistency-based UQ methods. Our method provides a
principled and efficient solution for trustworthy generation in black-box LLM
settings.
♻ ☆ Demystifying Chains, Trees, and Graphs of Thoughts
Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O'Mahony, Onur Mutlu, Torsten Hoefler
The field of natural language processing (NLP) has witnessed significant
progress in recent years, with a notable focus on improving large language
models' (LLM) performance through innovative prompting techniques. Among these,
prompt engineering coupled with structures has emerged as a promising paradigm,
with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts,
in which the overall LLM reasoning is guided by a structure such as a graph. As
illustrated with numerous examples, this paradigm significantly enhances the
LLM's capability to solve numerous tasks, ranging from logical or mathematical
reasoning to planning or creative writing. To facilitate the understanding of
this growing field and pave the way for future developments, we devise a
general blueprint for effective and efficient LLM reasoning schemes. For this,
we conduct an in-depth analysis of the prompt execution pipeline, clarifying
and clearly defining different concepts. We then build the first taxonomy of
structure-enhanced LLM reasoning schemes. We focus on identifying fundamental
classes of harnessed structures, and we analyze the representations of these
structures, algorithms executed with these structures, and many others. We
refer to these structures as reasoning topologies, because their representation
becomes to a degree spatial, as they are contained within the LLM context. Our
study compares existing prompting schemes using the proposed taxonomy,
discussing how certain design choices lead to different patterns in performance
and cost. We also outline theoretical underpinnings, relationships between
prompting and other parts of the LLM ecosystem such as knowledge bases, and the
associated research challenges. Our work will help to advance future prompt
engineering techniques.
♻ ☆ StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Stereotypes are known to have very harmful effects, making their detection
critically important. However, current research predominantly focuses on
detecting and evaluating stereotypical biases, thereby leaving the study of
stereotypes in its early stages. Our study revealed that many works have failed
to clearly distinguish between stereotypes and stereotypical biases, which has
significantly slowed progress in advancing research in this area. Stereotype
and Anti-stereotype detection is a problem that requires social knowledge;
hence, it is one of the most difficult areas in Responsible AI. This work
investigates this task, where we propose a five-tuple definition and provide
precise terminologies disentangling stereotypes, anti-stereotypes,
stereotypical bias, and general bias. We provide a conceptual framework
grounded in social psychology for reliable detection. We identify key
shortcomings in existing benchmarks for this task of stereotype and
anti-stereotype detection. To address these gaps, we developed StereoDetect, a
well curated, definition-aligned benchmark dataset designed for this task. We
show that sub-10B language models and GPT-4o frequently misclassify
anti-stereotypes and fail to recognize neutral overgeneralizations. We
demonstrate StereoDetect's effectiveness through multiple qualitative and
quantitative comparisons with existing benchmarks and models fine-tuned on
them. The dataset and code is available at
https://github.com/KaustubhShejole/StereoDetect.
♻ ☆ Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD EMNLP 2025
Large Language Models (LLMs) can struggle to balance gullibility to
misinformation and resistance to valid corrections in persuasive dialogues, a
critical challenge for reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
comment: To appear at EMNLP 2025
♻ ☆ Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions ACL 2025
This paper investigates the faithfulness of multimodal large language model
(MLLM) agents in a graphical user interface (GUI) environment, aiming to
address the research question of whether multimodal GUI agents can be
distracted by environmental context. A general scenario is proposed where both
the user and the agent are benign, and the environment, while not malicious,
contains unrelated content. A wide range of MLLMs are evaluated as GUI agents
using a simulated dataset, following three working patterns with different
levels of perception. Experimental results reveal that even the most powerful
models, whether generalist agents or specialist GUI agents, are susceptible to
distractions. While recent studies predominantly focus on the helpfulness of
agents, our findings first indicate that these agents are prone to
environmental distractions. Furthermore, we implement an adversarial
environment injection and analyze the approach to improve faithfulness, calling
for a collective focus on this important topic.
comment: ACL 2025
♻ ☆ MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which
covers 306 languages. The context data comes from Wikipedia articles, with
questions generated by an LLM and the answers appearing verbatim in the
Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency
of the generated questions across 30 of the languages, providing evidence that
the questions are of good quality. We evaluate 6 different language models,
both decoder and encoder models of varying sizes, showing that the benchmark is
sufficiently difficult and that there is a large performance discrepancy
amongst the languages. The dataset and survey evaluations are freely available.
♻ ☆ Selective Preference Optimization via Token-Level Reward Function Estimation EMNLP 2025
Recent advancements in large language model alignment leverage token-level
supervisions to perform fine-grained preference optimization. However, existing
token-level alignment methods either optimize on all available tokens, which
can be noisy and inefficient, or perform selective training with complex and
expensive key token selection strategies. In this work, we propose Selective
Preference Optimization (SePO), a novel selective alignment strategy that
centers on efficient key token selection. SePO proposes the first token
selection method based on Direct Preference Optimization (DPO), which trains an
oracle model to estimate a token-level reward function on the target data. This
method applies to any existing alignment datasets with response-level
annotations and enables cost-efficient token selection with small-scale oracle
models and training data. The estimated reward function is then utilized to
score all tokens within the target dataset, where only the key tokens are
selected to supervise the target policy model with a reference model-free
contrastive objective function. Extensive experiments on three public
evaluation benchmarks show that SePO significantly outperforms competitive
baseline methods by only optimizing 30% key tokens on the target dataset. SePO
applications on weak-to-strong generalization show that weak oracle models
effectively supervise strong policy models with up to 16.8x more parameters.
SePO also effectively selects key tokens from out-of-distribution data to
enhance strong policy models and alleviate the over-optimization problem.
comment: Accepted by the EMNLP 2025 main conference
♻ ☆ AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
We introduce AnyGPT, an any-to-any multimodal language model that utilizes
discrete representations for the unified processing of various modalities,
including speech, text, images, and music. AnyGPT can be trained stably without
any alterations to the current large language model (LLM) architecture or
training paradigms. Instead, it relies exclusively on data-level preprocessing,
facilitating the seamless integration of new modalities into LLMs, akin to the
incorporation of new languages. We build a multimodal text-centric dataset for
multimodal alignment pre-training. Utilizing generative models, we synthesize
the first large-scale any-to-any multimodal instruction dataset. It consists of
108k samples of multi-turn conversations that intricately interweave various
modalities, thus equipping the model to handle arbitrary combinations of
multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is
capable of facilitating any-to-any multimodal conversation while achieving
performance comparable to specialized models across all modalities, proving
that discrete representations can effectively and conveniently unify multiple
modalities within a language model. Demos are shown in
https://junzhan2000.github.io/AnyGPT.github.io/
comment: 28 pages, 16 figures, under review, work in progress
♻ ☆ Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Large language models interact with users through a simulated 'Assistant'
persona. While the Assistant is typically trained to be helpful, harmless, and
honest, it sometimes deviates from these ideals. In this paper, we identify
directions in the model's activation space-persona vectors-underlying several
traits, such as evil, sycophancy, and propensity to hallucinate. We confirm
that these vectors can be used to monitor fluctuations in the Assistant's
personality at deployment time. We then apply persona vectors to predict and
control personality shifts that occur during training. We find that both
intended and unintended personality changes after finetuning are strongly
correlated with shifts along the relevant persona vectors. These shifts can be
mitigated through post-hoc intervention, or avoided in the first place with a
new preventative steering method. Moreover, persona vectors can be used to flag
training data that will produce undesirable personality changes, both at the
dataset level and the individual sample level. Our method for extracting
persona vectors is automated and can be applied to any personality trait of
interest, given only a natural-language description.
♻ ☆ Assessing the Sensitivity and Alignment of FOL Closeness Metrics EMNLP 2025
The recent successful paradigm of solving logical reasoning problems with
tool-augmented large language models (LLMs) leverages translation of natural
language (NL) statements into First-Order Logic~(FOL) and external theorem
provers. However, the correctness of FOL statements, comprising operators and
text, often go unverified due to the lack of a reliable evaluation metric for
comparing generated and ground-truth FOLs. In this paper, we conduct a
comprehensive study on the sensitivity of existing NL-, FOL-, and graph-based
metrics to capture differences between a sampled FOL and its corresponding
ground-truth. We then measure the alignment between a metric-based ranking of
FOL outputs and a strong LLM as-a-judge. To do this, we first apply operator
and text-based perturbations to ground-truth FOL statements to assess metric
sensitivity. We then evaluate metric robustness by comparing the metrics
against LLMs judgment. Our empirical findings highlight a clear oversensitivity
in the n-gram metric BLEU for text perturbations. The operator perturbation
affects the semantic graph metric Smatch++ for structural changes, and the FOL
metric for specific operator changes. We observe a closer alignment between
BertScore and LLM judgement, proving the importance of semantic evaluation.
Additionally, we show that combining metrics enhances both robustness and
sensitivity compared to using individual metrics.
comment: EMNLP 2025
♻ ☆ LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning ACL 2025
Jin Jiang, Yuchen Yan, Yang Liu, Jianing Wang, Shuai Peng, Xunliang Cai, Yixin Cao, Mengdi Zhang, Liangcai Gao
In this paper, we propose a new data synthesis method called
\textbf{LogicPro}, which leverages LeetCode-style algorithm
\underline{Pro}blems and their corresponding \underline{Pro}gram solutions to
synthesize Complex \underline{Logic}al Reasoning data in text format. First, we
synthesize complex reasoning problems through source algorithm problems and
test cases. Then, standard answers and intermediate variable outputs are
obtained for each problem based on standard python solutions and test cases.
Finally, with the guidance of code intermediate variables, we synthesize the
text reasoning process for each reasoning problems. Through this method, we can
synthesize data that is difficult, scalable, effective, and comes with golden
standard answers and high-quality reasoning processes. As a result, with our
540K synthesized dataset constructed solely from 2,360 algorithm problems, our
approach \footnote{Code and data are publicly available at
https://github.com/jiangjin1999/LogicPro} achieves significant improvements in
multiple models for the datasets \textit{BBH$^{27}$}, \textit{LogicBench},
\textit{DROP}, \textit{AR-LSAT}, and \textit{GSM8K}, etc. outperforming a wide
range of existing reasoning datasets.
comment: 19 pages, ACL 2025 (Volume 1 Long Papers), pages 26200-26218
♻ ☆ Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval ICLR 2023
Recently multi-lingual pre-trained language models (PLM) such as mBERT and
XLM-R have achieved impressive strides in cross-lingual dense retrieval.
Despite its successes, they are general-purpose PLM while the multilingual PLM
tailored for cross-lingual retrieval is still unexplored. Motivated by an
observation that the sentences in parallel documents are approximately in the
same order, which is universal across languages, we propose to model this
sequential sentence relation to facilitate cross-lingual representation
learning. Specifically, we propose a multilingual PLM called masked sentence
model (MSM), which consists of a sentence encoder to generate the sentence
representations, and a document encoder applied to a sequence of sentence
vectors from a document. The document encoder is shared for all languages to
model the universal sequential sentence relation across languages. To train the
model, we propose a masked sentence prediction task, which masks and predicts
the sentence vector via a hierarchical contrastive loss with sampled negatives.
Comprehensive experiments on four cross-lingual retrieval tasks show MSM
significantly outperforms existing advanced pre-training models, demonstrating
the effectiveness and stronger cross-lingual retrieval capabilities of our
approach. Code and model are available at https://github.com/shunyuzh/MSM.
comment: Published at ICLR 2023
♻ ☆ All That Glitters is Not Novel: Plagiarism in AI Generated Research ACL 2025
Automating scientific research is considered the final frontier of science.
Recently, several papers claim autonomous research agents can generate novel
research ideas. Amidst the prevailing optimism, we document a critical concern:
a considerable fraction of such research documents are smartly plagiarized.
Unlike past efforts where experts evaluate the novelty and feasibility of
research ideas, we request $13$ experts to operate under a different
situational logic: to identify similarities between LLM-generated research
documents and existing work. Concerningly, the experts identify $24\%$ of the
$50$ evaluated research documents to be either paraphrased (with one-to-one
methodological mapping), or significantly borrowed from existing work. These
reported instances are cross-verified by authors of the source papers. The
remaining $76\%$ of documents show varying degrees of similarity with existing
work, with only a small fraction appearing completely novel. Problematically,
these LLM-generated research documents do not acknowledge original sources, and
bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we
show that automated plagiarism detectors are inadequate at catching plagiarized
ideas from such systems. We recommend a careful assessment of LLM-generated
research, and discuss the implications of our findings on academic publishing.
comment: Accepted to ACL 2025 (main) conference
♻ ☆ MountainLion: A Multi-Modal LLM-Based Agent System for Interpretable and Adaptive Financial Trading
Siyi Wu, Junqiao Wang, Zhaoyang Guan, Leyi Zhao, Xinyuan Song, Xinyu Ying, Dexu Yu, Jinhao Wang, Hanlin Zhang, Michele Pak, Yangfan He, Yi Xin, Jianhui Wang, Tianyu Shi
Cryptocurrency trading is a challenging task requiring the integration of
heterogeneous data from multiple modalities. Traditional deep learning and
reinforcement learning approaches typically demand large training datasets and
encode diverse inputs into numerical representations, often at the cost of
interpretability. Recent progress in large language model (LLM)-based agents
has demonstrated the capacity to process multi-modal data and support complex
investment decision-making. Building on these advances, we present
\textbf{MountainLion}, a multi-modal, multi-agent system for financial trading
that coordinates specialized LLM-based agents to interpret financial data and
generate investment strategies. MountainLion processes textual news,
candlestick charts, and trading signal charts to produce high-quality financial
reports, while also enabling modification of reports and investment
recommendations through data-driven user interaction and question answering. A
central reflection module analyzes historical trading signals and outcomes to
continuously refine decision processes, and the system is capable of real-time
report analysis, summarization, and dynamic adjustment of investment
strategies. Empirical results confirm that MountainLion systematically enriches
technical price triggers with contextual macroeconomic and capital flow
signals, providing a more interpretable, robust, and actionable investment
framework that improves returns and strengthens investor confidence.
♻ ☆ The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Personality traits have long been studied as predictors of human behavior.
Recent advances in Large Language Models (LLMs) suggest similar patterns may
emerge in artificial systems, with advanced LLMs displaying consistent
behavioral tendencies resembling human traits like agreeableness and
self-regulation. Understanding these patterns is crucial, yet prior work
primarily relied on simplified self-reports and heuristic prompting, with
little behavioral validation. In this study, we systematically characterize LLM
personality across three dimensions: (1) the dynamic emergence and evolution of
trait profiles throughout training stages; (2) the predictive validity of
self-reported traits in behavioral tasks; and (3) the impact of targeted
interventions, such as persona injection, on both self-reports and behavior.
Our findings reveal that instructional alignment (e.g., RLHF, instruction
tuning) significantly stabilizes trait expression and strengthens trait
correlations in ways that mirror human data. However, these self-reported
traits do not reliably predict behavior, and observed associations often
diverge from human patterns. While persona injection successfully steers
self-reports in the intended direction, it exerts little or inconsistent effect
on actual behavior. By distinguishing surface-level trait expression from
behavioral consistency, our findings challenge assumptions about LLM
personality and underscore the need for deeper evaluation in alignment and
interpretability.
comment: We make public all code and source data at
https://github.com/psychology-of-AI/Personality-Illusion for full
reproducibility
♻ ☆ ELIXIR: Efficient and LIghtweight model for eXplaIning Recommendations
Collaborative filtering drives many successful recommender systems but
struggles with fine-grained user-item interactions and explainability. As users
increasingly seek transparent recommendations, generating textual explanations
through language models has become a critical research area. Existing methods
employ either RNNs or Transformers. However, RNN-based approaches fail to
leverage the capabilities of pre-trained Transformer models, whereas
Transformer-based methods often suffer from suboptimal adaptation and neglect
aspect modeling, which is crucial for personalized explanations. We propose
ELIXIR (Efficient and LIghtweight model for eXplaIning Recommendations), a
multi-task model combining rating prediction with personalized review
generation. ELIXIR jointly learns global and aspect-specific representations of
users and items, optimizing overall rating, aspect-level ratings, and review
generation, with personalized attention to emphasize aspect importance. Based
on a T5-small (60M) model, we demonstrate the effectiveness of our aspect-based
architecture in guiding text generation in a personalized context, where
state-of-the-art approaches exploit much larger models but fail to match user
preferences as well. Experimental results on TripAdvisor and RateBeer
demonstrate that ELIXIR significantly outperforms strong baseline models,
especially in review generation.
comment: 10 pages, 3 figures, 6 Tables