Computation and Language
☆ FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performance gap compared to leading
closed-source systems. To address this challenge, We introduce FLUX-Reason-6M
and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark).
FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality
FLUX-generated images and 20 million bilingual (English and Chinese)
descriptions specifically designed to teach complex reasoning. The image are
organized according to six key characteristics: Imagination, Entity, Text
rendering, Style, Affection, and Composition, and design explicit Generation
Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation
steps. The whole data curation takes 15,000 A100 GPU days, providing the
community with a resource previously unattainable outside of large industrial
labs. PRISM-Bench offers a novel evaluation standard with seven distinct
tracks, including a formidable Long Text challenge using GCoT. Through
carefully designed prompts, it utilizes advanced vision-language models for
nuanced human-aligned assessment of prompt-image alignment and image
aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench
reveals critical performance gaps and highlights specific areas requiring
improvement. Our dataset, benchmark, and evaluation code are released to
catalyze the next wave of reasoning-oriented T2I generation. Project page:
https://flux-reason-6m.github.io/ .
comment: Project page: https://flux-reason-6m.github.io/
☆ ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Large language models require massive memory footprints, severely limiting
deployment on consumer hardware. Quantization reduces memory through lower
numerical precision, but extreme 2-bit quantization suffers from catastrophic
performance loss due to outliers in activations. Rotation-based methods such as
QuIP and QuaRot apply orthogonal transforms to eliminate outliers before
quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} =
(\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these
methods use fixed transforms--Hadamard matrices achieving optimal worst-case
coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight
distributions. We identify that different transformer layers exhibit distinct
outlier patterns, motivating layer-adaptive rotations rather than
one-size-fits-all approaches. We propose ButterflyQuant, which replaces
Hadamard rotations with learnable butterfly transforms parameterized by
continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$
entries that are non-differentiable and prohibit gradient-based learning,
butterfly transforms' continuous parameterization enables smooth optimization
while guaranteeing orthogonality by construction. This orthogonal constraint
ensures theoretical guarantees in outlier suppression while achieving $O(n \log
n)$ computational complexity with only $\frac{n \log n}{2}$ learnable
parameters. We further introduce a uniformity regularization on
post-transformation activations to promote smoother distributions amenable to
quantization. Learning requires only 128 calibration samples and converges in
minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit
quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
comment: Replace discrete Hadamard transforms with continuous Butterfly
transforms to facilitate the learning of rotation matrices in LLM
quantization
☆ SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
Vision-Language-Action (VLA) models have recently emerged as a powerful
paradigm for robotic manipulation. Despite substantial progress enabled by
large-scale pretraining and supervised fine-tuning (SFT), these models face two
fundamental challenges: (i) the scarcity and high cost of large-scale
human-operated robotic trajectories required for SFT scaling, and (ii) limited
generalization to tasks involving distribution shift. Recent breakthroughs in
Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can
dramatically enhance step-by-step reasoning capabilities, raising a natural
question: Can RL similarly improve the long-horizon step-by-step action
planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL
framework tailored for VLA models. Building upon veRL, we introduce
VLA-specific trajectory sampling, scalable parallelization, multi-environment
rendering, and optimized loss computation. When applied to OpenVLA-OFT,
SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$
on RoboTwin 1.0\&2.0 with the exploration-enhancing strategies we introduce.
SimpleVLA-RL not only reduces dependence on large-scale data and enables robust
generalization, but also remarkably surpasses SFT in real-world tasks.
Moreover, we identify a novel phenomenon ``pushcut'' during RL training,
wherein the policy discovers previously unseen patterns beyond those seen in
the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
☆ CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm
for enhancing the reasoning ability of Large Language Models (LLMs). Yet
current RLVR methods often explore poorly, leading to premature convergence and
entropy collapse. To address this challenge, we introduce Curiosity-Driven
Exploration (CDE), a framework that leverages the model's own intrinsic sense
of curiosity to guide exploration. We formalize curiosity with signals from
both the actor and the critic: for the actor, we use perplexity over its
generated response, and for the critic, we use the variance of value estimates
from a multi-head architecture. Both signals serve as an exploration bonus
within the RLVR framework to guide the model. Our theoretical analysis shows
that the actor-wise bonus inherently penalizes overconfident errors and
promotes diversity among correct responses; moreover, we connect the
critic-wise bonus to the well-established count-based exploration bonus in RL.
Empirically, our method achieves an approximate +3 point improvement over
standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a
calibration collapse mechanism within RLVR, shedding light on common LLM
failure modes.
comment: 21 pages
☆ Steering MoE LLMs via Expert (De)Activation
Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token
through a subset of specialized Feed-Forward Networks (FFN), known as experts.
We present SteerMoE, a framework for steering MoE models by detecting and
controlling behavior-linked experts. Our detection method identifies experts
with distinct activation patterns across paired inputs exhibiting contrasting
behaviors. By selectively (de)activating such experts during inference, we
control behaviors like faithfulness and safety without retraining or modifying
weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to
+20% and faithfulness by +27%. In adversarial attack mode, it drops safety by
-41% alone, and -100% when combined with existing jailbreak methods, bypassing
all safety guardrails and exposing a new dimension of alignment faking hidden
within experts.
☆ Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
We study question answering in the domain of radio regulations, a legally
sensitive and high-stakes area. We propose a telecom-specific
Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge,
the first multiple-choice evaluation set for this domain, constructed from
authoritative sources using automated filtering and human validation. To assess
retrieval quality, we define a domain-specific retrieval metric, under which
our retriever achieves approximately 97% accuracy. Beyond retrieval, our
approach consistently improves generation accuracy across all tested models. In
particular, while naively inserting documents without structured retrieval
yields only marginal gains for GPT-4o (less than 1%), applying our pipeline
results in nearly a 12% relative improvement. These findings demonstrate that
carefully targeted grounding provides a simple yet strong baseline and an
effective domain-specific solution for regulatory question answering. All code
and evaluation scripts, along with our derived question-answer dataset, are
available at https://github.com/Zakaria010/Radio-RAG.
☆ All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens EMNLP 2025
Large language models (LLMs) demonstrate proficiency across numerous
computational tasks, yet their inner workings remain unclear. In theory, the
combination of causal self-attention and multilayer perceptron layers allows
every token to access and compute information based on all preceding tokens. In
practice, to what extent are such operations present? In this paper, on mental
math tasks (i.e., direct math calculation via next-token prediction without
explicit reasoning), we investigate this question in three steps: inhibiting
input-specific token computations in the initial layers, restricting the routes
of information transfer across token positions in the next few layers, and
forcing all computation to happen at the last token in the remaining layers.
With two proposed techniques, Context-Aware Mean Ablation (CAMA) and
Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with
high accuracy on a wide variety of mental math tasks, where meaningful
computation occurs very late (in terms of layer depth) and only at the last
token, which receives information of other tokens in few specific middle
layers. Experiments on a variety of models and arithmetic expressions show that
this subgraph is sufficient and necessary for high model performance, transfers
across different models, and works on a variety of input styles. Ablations on
different CAMA and ABP alternatives reveal their unique advantages over other
methods, which may be of independent interest.
comment: EMNLP 2025 Main Conference
☆ DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that
mimics the voice of an unseen speaker using only a short reference sample,
requiring not only speaker adaptation but also accurate modeling of prosodic
attributes. Recent approaches based on language models, diffusion, and flow
matching have shown promising results in zero-shot TTS, but still suffer from
slow inference and repetition artifacts. Discrete codec representations have
been widely adopted for speech synthesis, and recent works have begun to
explore diffusion models in purely discrete settings, suggesting the potential
of discrete generative modeling for speech synthesis. However, existing
flow-matching methods typically embed these discrete tokens into a continuous
space and apply continuous flow matching, which may not fully leverage the
advantages of discrete representations. To address these challenges, we
introduce DiFlow-TTS, which, to the best of our knowledge, is the first model
to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS
explicitly models factorized speech attributes within a compact and unified
architecture. It leverages in-context learning by conditioning on textual
content, along with prosodic and acoustic attributes extracted from a reference
speech, enabling effective attribute cloning in a zero-shot setting. In
addition, the model employs a factorized flow prediction mechanism with
distinct heads for prosody and acoustic details, allowing it to learn
aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS
achieves promising performance in several key metrics, including naturalness,
prosody, preservation of speaker style, and energy control. It also maintains a
compact model size and achieves low-latency inference, generating speech up to
25.8 times faster than the latest existing baselines.
☆ Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems EMNLP 2025
Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen
The advancement of large language models (LLMs) has enabled the construction
of multi-agent systems to solve complex tasks by dividing responsibilities
among specialized agents, such as a planning agent for subgoal generation and a
grounding agent for executing tool-use actions. Most existing methods typically
fine-tune these agents independently, leading to capability gaps among them
with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint
Alignment Tuning framework that improves agents collaboration through iterative
alignment. MOAT alternates between two key stages: (1) Planning Agent
Alignment, which optimizes the planning agent to generate subgoal sequences
that better guide the grounding agent; and (2) Grounding Agent Improving, which
fine-tunes the grounding agent using diverse subgoal-action pairs generated by
the agent itself to enhance its generalization capablity. Theoretical analysis
proves that MOAT ensures a non-decreasing and progressively convergent training
process. Experiments across six benchmarks demonstrate that MOAT outperforms
state-of-the-art baselines, achieving average improvements of 3.1% on held-in
tasks and 4.4% on held-out tasks.
comment: EMNLP 2025 Findings
☆ LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination
Verbal autopsy (VA) is a critical tool for estimating causes of death in
resource-limited settings where medical certification is unavailable. This
study presents LA-VA, a proof-of-concept pipeline that combines Large Language
Models (LLMs) with traditional algorithmic approaches and embedding-based
classification for improved cause-of-death prediction. Using the Population
Health Metrics Research Consortium (PHMRC) dataset across three age categories
(Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches:
GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles.
Our results demonstrate that GPT-5 achieves the highest individual performance
with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5%
(Neonate), outperforming traditional statistical machine learning baselines by
5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches
could substantially improve verbal autopsy accuracy, with important
implications for global health surveillance in low-resource settings.
☆ Fluent but Unfeeling: The Emotional Blind Spots of Language Models
Bangzhao Shu, Isha Joshi, Melissa Karnaze, Anh C. Pham, Ishita Kakkar, Sindhu Kothe, Arpine Hovasapian, Mai ElSherief
The versatility of Large Language Models (LLMs) in natural language
understanding has made them increasingly popular in mental health research.
While many studies explore LLMs' capabilities in emotion recognition, a
critical gap remains in evaluating whether LLMs align with human emotions at a
fine-grained level. Existing research typically focuses on classifying emotions
into predefined, limited categories, overlooking more nuanced expressions. To
address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit
communities featuring 251 fine-grained, self-disclosed emotion labels. Our
comprehensive evaluation framework examines predicted emotion terms and
decomposes them into eight basic emotions using established emotion theories,
enabling a fine-grained comparison. Systematic testing of prevalent LLMs under
various prompt settings reveals that accurately predicting emotions that align
with human self-disclosed emotions remains challenging. Qualitative analysis
further shows that while certain LLMs generate emotion terms consistent with
established emotion theories and definitions, they sometimes fail to capture
contextual cues as effectively as human self-disclosures. These findings
highlight the limitations of LLMs in fine-grained emotion alignment and offer
insights for future research aimed at enhancing their contextual understanding.
comment: Camera-ready version for ICWSM 2026. First two authors contributed
equally
☆ Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
Social connection is a vital part of learning, yet online course environments
present barriers to the organic formation of social groups. SAMI offers one
solution by facilitating student connections, but its effectiveness is
constrained by an incomplete Theory of Mind, limiting its ability to create an
effective mental model of a student. One facet of this is its inability to
intuit personality, which may influence the relevance of its recommendations.
To explore this, we propose a personality detection model utilizing GPTs
zero-shot capability to infer Big-Five personality traits from forum
introduction posts, often encouraged in online courses. We benchmark its
performance against established models, demonstrating its efficacy in this
task. Furthermore, we integrate this model into SAMIs entity-based matchmaking
system, enabling personality-informed social recommendations. Initial
integration suggests personality traits can complement existing matching
factors, though additional evaluation is required to determine their full
impact on student engagement and match quality.
☆ Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025) EMNLP
Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling
new tasks and driving a proliferation of datasets and diversification of data
sources. Yet, this transformation has outpaced traditional surveys. In this
paper, we present MetaGraph, a generalizable methodology for extracting
knowledge graphs from scientific literature and analyzing them to obtain a
structured, queryable view of research trends. We define an ontology for
financial NLP research and apply an LLM-based extraction pipeline to 681 papers
(2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals
three key phases: early LLM adoption and task/dataset innovation; critical
reflection on LLM limitations; and growing integration of peripheral techniques
into modular systems. This structured view offers both practitioners and
researchers a clear understanding of how financial NLP has evolved -
highlighting emerging trends, shifting priorities, and methodological
shifts-while also demonstrating a reusable approach for mapping scientific
progress in other domains.
comment: 7 pages, 6 appendices, EMNLP industry track
☆ DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning EMNLP-2025
This system paper presents the DeMeVa team's approaches to the third edition
of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et
al., 2025). We explore two directions: in-context learning (ICL) with large
language models, where we compare example sampling strategies; and label
distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we
evaluate several fine-tuning methods. Our contributions are twofold: (1) we
show that ICL can effectively predict annotator-specific annotations
(perspectivist annotations), and that aggregating these predictions into soft
labels yields competitive performance; and (2) we argue that LDL methods are
promising for soft label predictions and merit further exploration by the
perspectivist community.
comment: 11 pages, 4 figures; to appear at NLPerspectives@EMNLP-2025
☆ Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs
Semantic Textual Relatedness (STR) captures nuanced relationships between
texts that extend beyond superficial lexical similarity. In this study, we
investigate STR in the context of job title matching - a key challenge in
resume recommendation systems, where overlapping terms are often limited or
misleading. We introduce a self-supervised hybrid architecture that combines
dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to
improve both semantic alignment and explainability. Unlike previous work that
evaluated models on aggregate performance, our approach emphasizes data
stratification by partitioning the STR score continuum into distinct regions:
low, medium, and high semantic relatedness. This stratified evaluation enables
a fine-grained analysis of model performance across semantically meaningful
subspaces. We evaluate several embedding models, both with and without KG
integration via graph neural networks. The results show that fine-tuned SBERT
models augmented with KGs produce consistent improvements in the high-STR
region, where the RMSE is reduced by 25% over strong baselines. Our findings
highlight not only the benefits of combining KGs with text embeddings, but also
the importance of regional performance analysis in understanding model
behavior. This granular approach reveals strengths and weaknesses hidden by
global metrics, and supports more targeted model selection for use in Human
Resources (HR) systems and applications where fairness, explainability, and
contextual matching are essential.
☆ Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
The EdUKate project combines digital education, linguistics, translation
studies, and machine translation to develop multilingual learning materials for
Czech primary and secondary schools. Launched through collaboration between a
major Czech academic institution and the country's largest educational
publisher, the project is aimed at translating up to 9,000 multimodal
interactive exercises from Czech into Ukrainian, English, and German for an
educational web portal. It emphasizes the development and evaluation of a
direct Czech-Ukrainian machine translation system tailored to the educational
domain, with special attention to processing formatted content such as XML and
PDF and handling technical and scientific terminology. We present findings from
an initial survey of Czech teachers regarding the needs of non-Czech-speaking
students and describe the system's evaluation and implementation on the web
portal. All resulting applications are freely available to students, educators,
and researchers.
comment: 8 pages, 2 figures
☆ GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
Assessing the reliability of Large Language Models (LLMs) by confidence
elicitation is a prominent approach to AI safety in high-stakes applications,
such as healthcare and finance. Existing methods either require expensive
computational overhead or suffer from poor calibration, making them impractical
and unreliable for real-world deployment. In this work, we propose GrACE, a
Generative Approach to Confidence Elicitation that enables scalable and
reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in
which the model expresses confidence by the similarity between the last hidden
state and the embedding of a special token appended to the vocabulary, in
real-time. We fine-tune the model for calibrating the confidence with
calibration targets associated with accuracy. Experiments with three LLMs and
two benchmark datasets show that the confidence produced by GrACE achieves the
best discriminative capacity and calibration on open-ended generation tasks,
outperforming six competing methods without resorting to additional sampling or
an auxiliary model. Moreover, we propose two strategies for improving test-time
scaling based on confidence induced by GrACE. Experimental results show that
using GrACE not only improves the accuracy of the final decision but also
significantly reduces the number of required samples in the test-time scaling
scheme, indicating the potential of GrACE as a practical solution for deploying
LLMs with scalable, reliable, and real-time confidence estimation.
comment: 20 pages, 11 figures
☆ LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations EMNLP 2025
Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
To collaborate effectively with humans, language models must be able to
explain their decisions in natural language. We study a specific type of
self-explanation: self-generated counterfactual explanations (SCEs), where a
model explains its prediction by modifying the input such that it would have
predicted a different outcome. We evaluate whether LLMs can produce SCEs that
are valid, achieving the intended outcome, and minimal, modifying the input no
more than necessary. When asked to generate counterfactuals, we find that LLMs
typically produce SCEs that are valid, but far from minimal, offering little
insight into their decision-making behaviour. Worryingly, when asked to
generate minimal counterfactuals, LLMs typically make excessively small edits
that fail to change predictions. The observed validity-minimality trade-off is
consistent across several LLMs, datasets, and evaluation settings. Our findings
suggest that SCEs are, at best, an ineffective explainability tool and, at
worst, can provide misleading insights into model behaviour. Proposals to
deploy LLMs in high-stakes settings must consider the impact of unreliable
self-explanations on downstream decision-making. Our code is available at
https://github.com/HarryMayne/SCEs.
comment: Accepted to EMNLP 2025 Main
☆ Hierarchical Bracketing Encodings Work for Dependency Graphs EMNLP 2025
We revisit hierarchical bracketing encodings from a practical perspective in
the context of dependency graph parsing. The approach encodes graphs as
sequences, enabling linear-time parsing with $n$ tagging actions, and still
representing reentrancies, cycles, and empty nodes. Compared to existing graph
linearizations, this representation substantially reduces the label space while
preserving structural information. We evaluate it on a multilingual and
multi-formalism benchmark, showing competitive results and consistent
improvements over other methods in exact match accuracy.
comment: Accepted at EMNLP 2025 (main)
☆ Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Analogical reasoning is an essential aspect of human cognition. In this
paper, we summarize key theory about the processes underlying analogical
reasoning from the cognitive science literature and relate it to current
research in natural language processing. While these processes can be easily
linked to concepts in NLP, they are generally not viewed through a cognitive
lens. Furthermore, we show how these notions are relevant for several major
challenges in NLP research, not directly related to analogy solving. This may
guide researchers to better optimize relational understanding in text, as
opposed to relying heavily on entity-level similarity.
☆ MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Large Language Models (LLMs) are increasingly deployed in enterprise
applications, yet their reliability remains limited by hallucinations, i.e.,
confident but factually incorrect information. Existing detection approaches,
such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not
address the unique challenges of Retrieval-Augmented Generation (RAG) systems,
where responses must be consistent with retrieved evidence. We therefore
present MetaRAG, a metamorphic testing framework for hallucination detection in
Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time,
unsupervised, black-box setting, requiring neither ground-truth references nor
access to model internals, making it suitable for proprietary and high-stakes
domains. The framework proceeds in four stages: (1) decompose answers into
atomic factoids, (2) generate controlled mutations of each factoid using
synonym and antonym substitutions, (3) verify each variant against the
retrieved context (synonyms are expected to be entailed and antonyms
contradicted), and (4) aggregate penalties for inconsistencies into a
response-level hallucination score. Crucially for identity-aware AI, MetaRAG
localizes unsupported claims at the factoid span where they occur (e.g.,
pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility),
allowing users to see flagged spans and enabling system designers to configure
thresholds and guardrails for identity-sensitive queries. Experiments on a
proprietary enterprise dataset illustrate the effectiveness of MetaRAG for
detecting hallucinations and enabling trustworthy deployment of RAG-based
conversational agents. We also outline a topic-based deployment design that
translates MetaRAG's span-level scores into identity-aware safeguards; this
design is discussed but not evaluated in our experiments.
comment: under review
☆ OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Recent advances in multimodal large language models (MLLMs) have opened new
opportunities for embodied intelligence, enabling multimodal understanding,
reasoning, and interaction, as well as continuous spatial decision-making.
Nevertheless, current MLLM-based embodied systems face two critical
limitations. First, Geometric Adaptability Gap: models trained solely on 2D
inputs or with hard-coded 3D geometry injection suffer from either insufficient
spatial information or restricted 2D generalization, leading to poor
adaptability across tasks with diverse spatial demands. Second, Embodiment
Constraint Gap: prior work often neglects the physical constraints and
capacities of real robots, resulting in task plans that are theoretically valid
but practically infeasible.To address these gaps, we introduce OmniEVA -- an
embodied versatile planner that enables advanced embodied reasoning and task
planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding
mechanism, which introduces a gated router to perform explicit selective
regulation of 3D fusion based on contextual requirements, enabling
context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware
Reasoning framework that jointly incorporates task goals and embodiment
constraints into the reasoning loop, resulting in planning decisions that are
both goal-directed and executable. Extensive experimental results demonstrate
that OmniEVA not only achieves state-of-the-art general embodied reasoning
performance, but also exhibits a strong ability across a wide range of
downstream scenarios. Evaluations of a suite of proposed embodied benchmarks,
including both primitive and composite tasks, confirm its robust and versatile
planning capabilities. Project page: https://omnieva.github.io
☆ Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Materials characterization is fundamental to acquiring materials information,
revealing the processing-microstructure-property relationships that guide
material design and optimization. While multimodal large language models
(MLLMs) have recently shown promise in generative and predictive tasks within
materials science, their capacity to understand real-world characterization
imaging data remains underexplored. To bridge this gap, we present MatCha, the
first benchmark for materials characterization image understanding, comprising
1,500 questions that demand expert-level domain expertise. MatCha encompasses
four key stages of materials research comprising 21 distinct tasks, each
designed to reflect authentic challenges faced by materials scientists. Our
evaluation of state-of-the-art MLLMs on MatCha reveals a significant
performance gap compared to human experts. These models exhibit degradation
when addressing questions requiring higher-level expertise and sophisticated
visual perception. Simple few-shot and chain-of-thought prompting struggle to
alleviate these limitations. These findings highlight that existing MLLMs still
exhibit limited adaptability to real-world materials characterization
scenarios. We hope MatCha will facilitate future research in areas such as new
material discovery and autonomous scientific agents. MatCha is available at
https://github.com/FreedomIntelligence/MatCha.
☆ From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
Classifying patents by their relevance to the UN Sustainable Development
Goals (SDGs) is crucial for tracking how innovation addresses global
challenges. However, the absence of a large, labeled dataset limits the use of
supervised learning. Existing methods, such as keyword searches, transfer
learning, and citation-based heuristics, lack scalability and generalizability.
This paper frames patent-to-SDG classification as a weak supervision problem,
using citations from patents to SDG-tagged scientific publications (NPL
citations) as a noisy initial signal. To address its sparsity and noise, we
develop a composite labeling function (LF) that uses large language models
(LLMs) to extract structured concepts, namely functions, solutions, and
applications, from patents and SDG papers based on a patent ontology.
Cross-domain similarity scores are computed and combined using a rank-based
retrieval approach. The LF is calibrated via a custom positive-only loss that
aligns with known NPL-SDG links without penalizing discovery of new SDG
associations. The result is a silver-standard, soft multi-label dataset mapping
patents to SDGs, enabling the training of effective multi-label regression
models. We validate our approach through two complementary strategies: (1)
internal validation against held-out NPL-based labels, where our method
outperforms several baselines including transformer-based models, and zero-shot
LLM; and (2) external validation using network modularity in patent citation,
co-inventor, and co-applicant graphs, where our labels reveal greater thematic,
cognitive, and organizational coherence than traditional technological
classifications. These results show that weak supervision and semantic
alignment can enhance SDG classification at scale.
☆ Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
Recent advances in reasoning with large language models (LLMs) have shown the
effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality
intermediate trajectories, particularly in math and symbolic domains. Inspired
by this, we explore how MCTS-derived trajectories, traditionally used for
training value or reward models, can be repurposed to improve policy
optimization in preference-based reinforcement learning (RL). Specifically, we
focus on Group Relative Policy Optimization (GRPO), a recent algorithm that
enables preference-consistent policy learning without value networks. We
propose a staged GRPO training paradigm where completions are derived from
partially revealed MCTS rollouts, introducing a novel tree-structured setting
for advantage estimation. This leads to a rich class of prefix-conditioned
reward signals, which we analyze theoretically and empirically. Our initial
results indicate that while structured advantage estimation can stabilize
updates and better reflect compositional reasoning quality, challenges such as
advantage saturation and reward signal collapse remain. We propose heuristic
and statistical solutions to mitigate these issues and discuss open challenges
for learning under staged or tree-like reward structures.
☆ Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents ICLR 2026
Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
In long-horizon tasks, recent agents based on Large Language Models (LLMs)
face a significant challenge that sparse, outcome-based rewards make it
difficult to assign credit to intermediate steps. Previous methods mainly focus
on creating dense reward signals to guide learning, either through traditional
reinforcement learning techniques like inverse reinforcement learning or by
using Process Reward Models for step-by-step feedback. In this paper, we
identify a fundamental problem in the learning dynamics of LLMs: the magnitude
of policy gradients is inherently coupled with the entropy, which leads to
inefficient small updates for confident correct actions and potentially
destabilizes large updates for uncertain ones. To resolve this, we propose
Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the
learning signal based on step-wise uncertainty and the final task outcome. EMPG
amplifies updates for confident correct actions, penalizes confident errors,
and attenuates updates from uncertain steps to stabilize exploration. We
further introduce a bonus term for future clarity that encourages agents to
find more predictable solution paths. Through comprehensive experiments on
three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we
demonstrate that EMPG achieves substantial performance gains and significantly
outperforms strong policy gradient baselines. Project page is at
https://empgseed-seed.github.io/
comment: ICLR 2026 Under review
☆ Agentic LLMs for Question Answering over Tabular Data ACL
Question Answering over Tabular Data (Table QA) presents unique challenges
due to the diverse structure, size, and data types of real-world tables. The
SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale,
domain-diverse datasets to evaluate the ability of models to accurately answer
structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach
leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and
DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a
multi-stage pipeline involving example selection, SQL query generation, answer
extraction, verification, and iterative refinement. Experiments demonstrate the
effectiveness of our approach, achieving 70.5\% accuracy on DataBench QA and
71.6\% on DataBench Lite QA, significantly surpassing baseline scores of 26\%
and 27\% respectively. This paper details our methodology, experimental
results, and alternative approaches, providing insights into the strengths and
limitations of LLM-driven Table QA.
comment: Accepted at ACL workshop SemEval 2025
☆ Reading Between the Lines: Classifying Resume Seniority with Large Language Models
Accurately assessing candidate seniority from resumes is a critical yet
challenging task, complicated by the prevalence of overstated experience and
ambiguous self-presentation. In this study, we investigate the effectiveness of
large language models (LLMs), including fine-tuned BERT architectures, for
automating seniority classification in resumes. To rigorously evaluate model
performance, we introduce a hybrid dataset comprising both real-world resumes
and synthetically generated hard examples designed to simulate exaggerated
qualifications and understated seniority. Using the dataset, we evaluate the
performance of Large Language Models in detecting subtle linguistic cues
associated with seniority inflation and implicit expertise. Our findings
highlight promising directions for enhancing AI-driven candidate evaluation
systems and mitigating bias introduced by self-promotional language. The
dataset is available for the research community at https://bit.ly/4mcTovt
comment: 5 pages, 3 figures
☆ Identifying Key Features for Establishing Sustainable Agro-Tourism Centre: A Data Driven Approach
Agro-tourism serves as a strategic economic model designed to facilitate
rural development by diversifying income streams for local communities like
farmers while promoting the conservation of indigenous cultural heritage and
traditional agricultural practices. As a very booming subdomain of tourism,
there is a need to study the strategies for the growth of Agro-tourism in
detail. The current study has identified the important indicators for the
growth and enhancement of agro-tourism. The study is conducted in two phases:
identification of the important indicators through a comprehensive literature
review and in the second phase state-of-the-art techniques were used to
identify the important indicators for the growth of agro-tourism. The
indicators are also called features synonymously, the machine learning models
for feature selection were applied and it was observed that the Least Absolute
Shrinkage and Selection Operator (LASSO) method combined with, the machine
Learning Classifiers such as Logistic Regression (LR), Decision Trees (DT),
Random Forest (RF) Tree, and Extreme Gradient Boosting (XGBOOST) models were
used to suggest the growth of the agro-tourism. The results show that with the
LASSO method, LR model gives the highest classification accuracy of 98% in
70-30% train-test data followed by RF with 95% accuracy. Similarly, in the
80-20% train-test data LR maintains the highest accuracy at 99%, while DT and
XGBoost follow with 97% accuracy.
☆ Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems
Audio deepfake detection (ADD) models are commonly evaluated using datasets
that combine multiple synthesizers, with performance reported as a single Equal
Error Rate (EER). However, this approach disproportionately weights
synthesizers with more samples, underrepresenting others and reducing the
overall reliability of EER. Additionally, most ADD datasets lack diversity in
bona fide speech, often featuring a single environment and speech style (e.g.,
clean read speech), limiting their ability to simulate real-world conditions.
To address these challenges, we propose bona fide cross-testing, a novel
evaluation framework that incorporates diverse bona fide datasets and
aggregates EERs for more balanced assessments. Our approach improves robustness
and interpretability compared to traditional evaluation methods. We benchmark
over 150 synthesizers across nine bona fide speech types and release a new
dataset to facilitate further research at
https://github.com/cyaaronk/audio_deepfake_eval.
comment: Published in Interspeech 2025
☆ CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
Scaling language models to longer contexts is essential for capturing rich
dependencies across extended discourse. However, na\"ive context extension
imposes significant computational and memory burdens, often resulting in
inefficiencies during both training and inference. In this work, we propose
CCF, a novel context compression framework designed to enable efficient
long-context modeling by learning hierarchical latent representations that
preserve global semantics while aggressively reducing input redundancy. CCF
integrates segment-wise semantic aggregation with key-value memory encoding,
forming compact representations that support accurate reconstruction and
long-range understanding. To further enhance scalability, we introduce a
training-efficient optimization strategy that couples incremental segment
decoding with sparse reservoir sampling, substantially reducing memory overhead
without degrading performance. Empirical results on multiple long-context
language modeling benchmarks demonstrate that CCF achieves competitive
perplexity under high compression ratios, and significantly improves throughput
and memory efficiency compared to existing approaches. These findings highlight
the potential of structured compression for scalable and effective long-context
language modeling.
☆ GmSLM : Generative Marmoset Spoken Language Modeling
Marmoset monkeys exhibit complex vocal communication, challenging the view
that nonhuman primates vocal communication is entirely innate, and show similar
features of human speech, such as vocal labeling of others and turn-taking.
Studying their vocal communication offers a unique opportunity to link it with
brain activity-especially given the difficulty of accessing the human brain in
speech and language research. Since Marmosets communicate primarily through
vocalizations, applying standard LLM approaches is not straightforward. We
introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized
spoken language model pipeline for Marmoset vocal communication. We designed a
novel zero-shot evaluation metrics using unsupervised in-the-wild data,
alongside weakly labeled conversational data, to assess GmSLM and demonstrate
its advantage over a basic human-speech-based baseline. GmSLM generated
vocalizations closely matched real resynthesized samples acoustically and
performed well on downstream tasks. Despite being fully unsupervised, GmSLM
effectively distinguish real from artificial conversations and may support
further investigations of the neural basis of vocal communication and provides
a practical framework linking vocalization and brain activity. We believe GmSLM
stands to benefit future work in neuroscience, bioacoustics, and evolutionary
biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
☆ Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Rare word recognition can be improved by adapting ASR models to synthetic
data that includes these words. Further improvements can be achieved through
contextual biasing, which trains and adds a biasing module into the model
architecture to prioritize rare words. While training the module on synthetic
rare word data is more effective than using non-rare-word data, it can lead to
overfitting due to artifacts in the synthetic audio. To address this, we
enhance the TCPGen-based contextual biasing approach and propose a
keyword-aware loss function that additionally focuses on biased words when
training biasing modules. This loss includes a masked cross-entropy term for
biased word prediction and a binary classification term for detecting biased
word positions. These two terms complementarily support the decoding of biased
words during inference. By adapting Whisper to 10 hours of synthetic data, our
method reduced the word error rate on the NSC Part 2 test set from 29.71% to
11.81%.
comment: Published in Interspeech 2025
☆ Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
Contextual biasing improves rare word recognition of ASR models by
prioritizing the output of rare words during decoding. A common approach is
Trie-based biasing, which gives "bonus scores" to partial hypothesis (e.g.
"Bon") that may lead to the generation of the rare word (e.g. "Bonham"). If the
full word ("Bonham") isn't ultimately recognized, the system revokes those
earlier bonuses. This revocation is limited to beam search and is
computationally expensive, particularly for models with large decoders. To
overcome these limitations, we propose adapting ASR models to look ahead and
predict multiple steps at once. This avoids the revocation step entirely by
better estimating whether a partial hypothesis will lead to the generation of
the full rare word. By fine-tuning Whisper with only 10 hours of synthetic
data, our method reduces the word error rate on the NSC Part 2 test set from
30.86% to 12.19%.
comment: Published in Interspeech 2025
☆ EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Speech-to-speech large language models (SLLMs) are attracting increasing
attention. Derived from text-based large language models (LLMs), SLLMs often
exhibit degradation in knowledge and reasoning capabilities. We hypothesize
that this limitation arises because current training paradigms for SLLMs fail
to bridge the acoustic-semantic gap in the feature representation space. To
address this issue, we propose EchoX, which leverages semantic representations
and dynamically generates speech training targets. This approach integrates
both acoustic and semantic learning, enabling EchoX to preserve strong
reasoning abilities as a speech LLM. Experimental results demonstrate that
EchoX, with about six thousand hours of training data, achieves advanced
performance on multiple knowledge-based question-answering benchmarks. The
project is available at https://github.com/FreedomIntelligence/EchoX.
☆ Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing ICME 2025
Target-oriented multimodal sentiment classification seeks to predict
sentiment polarity for specific targets from image-text pairs. While existing
works achieve competitive performance, they often over-rely on textual content
and fail to consider dataset biases, in particular word-level contextual
biases. This leads to spurious correlations between text features and output
labels, impairing classification accuracy. In this paper, we introduce a novel
counterfactual-enhanced debiasing framework to reduce such spurious
correlations. Our framework incorporates a counterfactual data augmentation
strategy that minimally alters sentiment-related causal features, generating
detail-matched image-text samples to guide the model's attention toward content
tied to sentiment. Furthermore, for learning robust features from
counterfactual data and prompting model decisions, we introduce an adaptive
debiasing contrastive learning mechanism, which effectively mitigates the
influence of biased words. Experimental results on several benchmark datasets
show that our proposed method outperforms state-of-the-art baselines.
comment: Accepted by the IEEE International Conference on Multimedia and Expo
(ICME 2025). \copyright\ 2025 IEEE. Personal use of this material is
permitted. Permission from IEEE must be obtained for all other uses
☆ LITcoder: A General-Purpose Library for Building and Comparing Encoding Models
We introduce LITcoder, an open-source library for building and benchmarking
neural encoding models. Designed as a flexible backend, LITcoder provides
standardized tools for aligning continuous stimuli (e.g., text and speech) with
brain data, transforming stimuli into representational features, mapping those
features onto brain data, and evaluating the predictive performance of the
resulting model on held-out data. The library implements a modular pipeline
covering a wide array of methodological design choices, so researchers can
easily compose, compare, and extend encoding models without reinventing core
infrastructure. Such choices include brain datasets, brain regions, stimulus
feature (both neural-net-based and control, such as word rate), downsampling
approaches, and many others. In addition, the library provides built-in
logging, plotting, and seamless integration with experiment tracking platforms
such as Weights & Biases (W&B). We demonstrate the scalability and versatility
of our framework by fitting a range of encoding models to three story listening
datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore
the methodological choices critical for building encoding models for continuous
fMRI data, illustrating the importance of accounting for all tokens in a TR
scan (as opposed to just taking the last one, even when contextualized),
incorporating hemodynamic lag effects, using train-test splits that minimize
information leakage, and accounting for head motion effects on encoding model
predictivity. Overall, LITcoder lowers technical barriers to encoding model
implementation, facilitates systematic comparisons across models and datasets,
fosters methodological rigor, and accelerates the development of high-quality
high-performance predictive models of brain activity.
Project page: https://litcoder-brain.github.io
☆ ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
This paper presents ViRanker, a cross-encoder reranking model tailored to the
Vietnamese language. Built on the BGE-M3 encoder and enhanced with the
Blockwise Parallel Transformer, ViRanker addresses the lack of competitive
rerankers for Vietnamese, a low-resource language with complex syntax and
diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with
hybrid hard-negative sampling to strengthen robustness. Evaluated on the
MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing
multilingual baselines and competing closely with PhoRanker. By releasing the
model openly on Hugging Face, we aim to support reproducibility and encourage
wider adoption in real-world retrieval systems. Beyond Vietnamese, this study
illustrates how careful architectural adaptation and data curation can advance
reranking in other underrepresented languages.
comment: 9 pages
☆ Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
This study explores the use of generative AI for automating the
classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and
effort required by traditional manual coding. This case study uses the
open-source CIMA corpus, in which tutors' responses are pre-annotated into four
DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored
prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of
0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and
indicating substantial agreement with human annotations. These findings suggest
that generative AI has strong potential to provide an efficient and accessible
approach to DA classification, with meaningful implications for educational
dialogue analysis. The study also highlights the importance of task-specific
label definitions and contextual information in enhancing the quality of
automated annotation. Finally, it underscores the ethical considerations
associated with the use of generative AI and the need for responsible and
transparent research practices. The script of this research is publicly
available at
https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
comment: Accepted for publication in the journal Reflecting Digital Learning.
First submitted: 30 Oct 2023. The final version will be available open access
via the journal
☆ Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
Large language models (LLMs) excel in general-domain applications, yet their
performance often degrades in specialized tasks requiring domain-specific
knowledge. E-commerce is particularly challenging, as its data are noisy,
heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a
vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and
71B active per token, designed for Southeast Asian e-commerce. Compass-v3
adopts fewer but larger experts, combined with hardware-efficient
optimizations-such as intra-node expert parallelism and a customized memcpy
operator-to maximize GPU utilization. The model is trained on 12T tokens of
curated multilingual corpora and large-scale synthetic e-commerce instructions
using a mixed-training strategy. To enhance alignment, we propose
Optimal-Transport Direct Preference Optimization (OTPO), which captures
token-level distinctions and improves instruction adherence in
commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3
delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1,
GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong
multilingual capability across low-resource Southeast Asian languages
(Indonesian, Thai, Filipino, Vietnamese, Malay, Taglog) and Portuguese while
sustaining competitive performance on general benchmarks. It has already been
widely applied in Shopee's industrial-scale e-commerce platform and is
gradually replacing OpenAI's traffic, now accounting for over 70\% of total LLM
usage, highlighting its dual strengths in specialized commerce expertise and
broad linguistic competence.
☆ TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
Despite being the 5th most spoken language, Bangla remains underrepresented
in Large Language Models (LLMs), particularly for code generation. This
primarily stems from the scarcity of high-quality data to pre-train and/or
finetune such models. Hence, we introduce the first dedicated family of Code
LLMs for Bangla (1B & 9B). We offer three major contributions: (1) a
comprehensive Bangla code instruction datasets for programming domain
adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code
generation; and (3) the TigerCoder-family of Code LLMs, achieving significant
~11-18% performance gains at Pass@1 over existing multilingual and
general-purpose Bangla LLMs. Our findings show that curated, high-quality
datasets can overcome limitations of smaller models for low-resource languages.
We open-source all resources to advance further Bangla LLM research.
☆ MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He
Large language models (LLMs) demonstrate robust capabilities across diverse
research domains. However, their performance in universal information
extraction (UIE) remains insufficient, especially when tackling structured
output scenarios that involve complex schema descriptions and require
multi-step reasoning. While existing approaches enhance the performance of LLMs
through in-context learning and instruction tuning, significant limitations
nonetheless persist. To enhance the model's generalization ability, we propose
integrating reinforcement learning (RL) with multi-perspective reasoning for
information extraction (IE) tasks. Our work transitions LLMs from passive
extractors to active reasoners, enabling them to understand not only what to
extract but also how to reason. Experiments conducted on multiple IE benchmarks
demonstrate that MR-UIE consistently elevates extraction accuracy across
domains and surpasses state-of-the-art methods on several datasets.
Furthermore, incorporating multi-perspective reasoning into RL notably enhances
generalization in complex IE tasks, underscoring the critical role of reasoning
in challenging scenarios.
♻ ☆ Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
The legal field already uses various large language models (LLMs) in actual
applications, but their quantitative performance and reasons for it are
underexplored. We evaluated several open-source and proprietary LLMs --
including GPT-series, Anthropic, Deepseek and Llama-3, variants -- on parts of
the European Qualifying Examination (EQE) for future European Patent Attorneys.
OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas (Amazon Web
Services) AWS Llama 3.1 8B lagged at 0.50 accuracy, and a Python-deployed Llama
3.1 8B scored 0.55. The latter two are within the range of mere guessing for
the two-answer forced-choice design. None of the evaluated models could have
passed the examination fully, as accuracy never exceeded the average threshold
of 0.90 required for professional-level standards -- also not models that are
regularly promoted for their assumed beyond-PhD- and bar-admitted-lawyer-level
performance. GPT-4o excelled at integrating text and graphics, while Claude 3
Opus often lost formatting coherence. Human patent experts evaluated the
textual justifications and uncovered various critical shortcomings of each
model. They valued clarity and legal rationale over the raw correctness of the
answers, which revealed misalignment between automatic metrics and expert
judgment. Model outputs were sensitive to modest temperature changes and prompt
wording, which underscores the remaining necessity of expert oversight. Future
work should target logical consistency, robust multimodality, and adaptive
prompting to approach human-level patent proficiency. In summary, despite the
outstanding performance of recent large models, the general public might
overestimate their performance. The field has a long way to go to develop a
virtual patent attorney. This paper wants to point out several specific
limitations that need solutions.
comment: 41 pages, 21 figures
♻ ☆ The NTNU System at the S&I Challenge 2025 SLA Open Track ISCA
A recent line of research on spoken language assessment (SLA) employs neural
models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency
across linguistic and acoustic modalities. Although both models effectively
capture features relevant to oral competence, each exhibits modality-specific
limitations. BERT-based methods rely on ASR transcripts, which often fail to
capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods
excel at modeling acoustic features but lack semantic interpretability. To
overcome these limitations, we propose a system that integrates W2V with Phi-4
multimodal large language model (MLLM) through a score fusion strategy. The
proposed system achieves a root mean square error (RMSE) of 0.375 on the
official test set of the Speak & Improve Challenge 2025, securing second place
in the competition. For comparison, the RMSEs of the top-ranked, third-ranked,
and official baseline systems are 0.364, 0.384, and 0.444, respectively.
comment: submitted to the ISCA SLaTE-2025 Workshop
♻ ☆ Improved GUI Grounding via Iterative Narrowing
Graphical User Interface (GUI) grounding plays a crucial role in enhancing
the capabilities of Vision-Language Model (VLM) agents. While general VLMs,
such as GPT-4V, demonstrate strong performance across various tasks, their
proficiency in GUI grounding remains suboptimal. Recent studies have focused on
fine-tuning these models specifically for zero-shot GUI grounding, yielding
significant improvements over baseline performance. We introduce a visual
prompting framework that employs an iterative narrowing mechanism to further
improve the performance of both general and fine-tuned models in GUI grounding.
For evaluation, we tested our method on a comprehensive benchmark comprising
various UI platforms and provided the code to reproduce our results.
comment: Code available at
https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing
♻ ☆ Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict
Large Language Models require both contextual knowledge and parametric
memory, but these sources can disagree. Prior investigations on contextual
question answering tasks report a preference toward parametric knowledge under
conflict, yet they focus almost exclusively on tasks that should always rely on
the given passage, leaving open how this behavior manifests when tasks demand
different amounts and kinds of knowledge. We study this question with a
model-agnostic diagnostic framework that (i) automatically detects
disagreements between a model's beliefs and a curated knowledge set, and (ii)
injects controlled conflicts into tasks. The resulting datasets span two
orthogonal dimensions: task knowledge reliance and conflict plausibility.
Evaluating representative open-source LLMs, we find that: (1) performance
degradation from conflict correlates with a task's knowledge reliance; (2)
explanatory rationales and simple reiteration both increase context
reliance-helpful for context-only tasks but harmful when parametric knowledge
should dominate; (3) These behaviors raise concerns about the validity of
model-based evaluation and underscore the need to account for knowledge
conflict in the deployment of LLMs.
comment: Major revision
♻ ☆ Entropy-Gated Branching for Efficient Test-Time Reasoning
Test-time compute methods like beam search can significantly improve the
reasoning capabilities and problem-solving accuracy of large language models.
However, these approaches require substantially increased computational
resources, with most computation wasted on exploring low-diversity branches
where the model already exhibits high confidence. We observe that a small
subset of uncertain reasoning steps has a disproportionately large impact on
final prediction accuracy, and branching at these points tends to yield
higher-quality and more diverse candidate reasoning steps. Therefore, we
introduce Entropy-Gated Branching: a novel inference technique that dynamically
allocates computational resources by selectively expanding prediction sequences
only at points of high uncertainty. Our method leverages entropy as a gating
mechanism to identify when branching is most beneficial, coupled with an
external feedback model to rank and prune candidate branches. Empirical results
on mathematical and financial reasoning benchmarks show that this strategy
improves accuracy by 22.6% over standard inference while operating 37% faster
than conventional beam search with similar or higher performance. Our results
show that dynamic resource allocation during inference can substantially
improve both efficiency and effectiveness, offering a more scalable pathway to
enhanced LLM reasoning capabilities.
♻ ☆ Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
Manuel D. S. Hopp, Vincent Labatut, Arthur Amalvy, Richard Dufour, Hannah Stone, Hayley Jach, Kou Murayama
Reader curiosity, the drive to seek information, is crucial for textual
engagement, yet remains relatively underexplored in NLP. Building on
Loewenstein's Information Gap Theory, we introduce a framework that models
reader curiosity by quantifying semantic information gaps within a text's
semantic structure. Our approach leverages BERTopic-inspired topic modeling and
persistent homology to analyze the evolving topology (connected components,
cycles, voids) of a dynamic semantic network derived from text segments,
treating these features as proxies for information gaps. To empirically
evaluate this pipeline, we collect reader curiosity ratings from participants
(n = 49) as they read S. Collins's ''The Hunger Games'' novel. We then use the
topological features from our pipeline as independent variables to predict
these ratings, and experimentally show that they significantly improve
curiosity prediction compared to a baseline model (73% vs. 30% explained
deviance), validating our approach. This pipeline offers a new computational
method for analyzing text structure and its relation to reader engagement.
comment: Original paper with an improved and extended appendix
♻ ☆ Uncertainty Quantification in Retrieval Augmented Question Answering
Retrieval augmented Question Answering (QA) helps QA models overcome
knowledge gaps by incorporating retrieved evidence, typically a set of
passages, alongside the question at test time. Previous studies show that this
approach improves QA performance and reduces hallucinations, without, however,
assessing whether the retrieved passages are indeed useful at answering
correctly. In this work, we propose to quantify the uncertainty of a QA model
via estimating the utility of the passages it is provided with. We train a
lightweight neural model to predict passage utility for a target QA model and
show that while simple information theoretic metrics can predict answer
correctness up to a certain extent, our approach efficiently approximates or
outperforms more expensive sampling-based methods. Code and data are available
at https://github.com/lauhaide/ragu.
comment: TMLR (09/2025)
♻ ☆ Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving
Complex problem-solving requires cognitive flexibility--the capacity to
entertain multiple perspectives while preserving their distinctiveness. This
flexibility replicates the "wisdom of crowds" within a single individual,
allowing them to "think with many minds." While mental simulation enables
imagined deliberation, cognitive constraints limit its effectiveness. We
propose synthetic deliberation, a Large Language Model (LLM)-based method that
simulates discourse between agents embodying diverse perspectives, as a
solution. Using a custom GPT-based model, we showcase its benefits: concurrent
processing of multiple viewpoints without cognitive degradation, parallel
exploration of perspectives, and precise control over viewpoint synthesis. By
externalizing the deliberative process and distributing cognitive labor between
parallel search and integration, synthetic deliberation transcends mental
simulation's limitations. This approach shows promise for strategic planning,
policymaking, and conflict resolution.
comment: 36 pages, 1 appendix
♻ ☆ LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit
substantially from chain-of-thought (CoT) reasoning, yet pushing their
performance typically requires vast data, large model sizes, and full-parameter
fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost,
most existing approaches primarily address domain adaptation or layer-wise
allocation rather than explicitly tailoring data and parameters to different
response demands. Inspired by "Thinking, Fast and Slow," which characterizes
two distinct modes of thought-System 1 (fast, intuitive, often automatic) and
System 2 (slower, more deliberative and analytic)-we draw an analogy that
different "subregions" of an LLM's parameters might similarly specialize for
tasks that demand quick, intuitive responses versus those requiring multi-step
logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework
that partitions both data and parameters by System 1 or System 2 demands, using
fewer yet more focused parameters for each task. Specifically, we classify task
data via multi-model role-playing and voting, and partition parameters based on
importance scoring, then adopt a two-stage fine-tuning strategy of training
System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and
intuition and refine System 2 tasks with reinforcement learning (RL) to
reinforce deeper logical deliberation next. Extensive experiments show that the
two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while
matching or surpassing SOTA PEFT baselines.
comment: 12 pages
♻ ☆ An Ontology-Driven Graph RAG for Legal Norms: A Structural, Temporal, and Deterministic Approach
Retrieval-Augmented Generation (RAG) systems in the legal domain face a
critical challenge: standard, flat-text retrieval is blind to the hierarchical,
diachronic, and causal structure of law, leading to anachronistic and
unreliable answers. This paper introduces the Structure-Aware Temporal Graph
RAG (SAT-Graph RAG), an ontology-driven framework designed to overcome these
limitations by explicitly modeling the formal structure and diachronic nature
of legal norms. We ground our knowledge graph in a formal, LRMoo-inspired model
that distinguishes abstract legal Works from their versioned Expressions. We
model temporal states as efficient aggregations that reuse the versioned
expressions (CTVs) of unchanged components, and we reify legislative events as
first-class Action nodes to make causality explicit and queryable. This
structured backbone enables a unified, planner-guided query strategy that
applies explicit policies to deterministically resolve complex requests for (i)
point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable
provenance reconstruction. Through a case study on the Brazilian Constitution,
we demonstrate how this approach provides a verifiable, temporally-correct
substrate for LLMs, enabling higher-order analytical capabilities while
drastically reducing the risk of factual errors. The result is a practical
framework for building more trustworthy and explainable legal AI systems.
comment: Major revision for clarity and academic precision. Updated title and
abstract. Refined core terminology, contributions, related work, and shifted
the implementation to a conceptual architecture. Added new arguments to
strengthen the paper's thesis
♻ ☆ MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion
Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani
Large language models excel in English but still struggle with complex
reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder
methods such as LangBridge and MindMerger raise accuracy on mid and
high-resource languages, yet they leave a large gap on LRLs. We present MERLIN,
a two-stage model-stacking framework that applies a curriculum learning
strategy -- from general bilingual bitext to task-specific data -- and adapts
only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves
exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini.
It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp),
demonstrating effectiveness across both low and high-resource settings.
comment: under submission
♻ ☆ Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
In-Context Learning (ICL) is an intriguing ability of large language models
(LLMs). Despite a substantial amount of work on its behavioral aspects and how
it emerges in miniature setups, it remains unclear which mechanism assembles
task information from the individual examples in a fewshot prompt. We use
causal interventions to identify information flow in Gemma-2 2B for five
naturalistic ICL tasks. We find that the model infers task information using a
two-step strategy we call contextualize-then-aggregate: In the lower layers,
the model builds up representations of individual fewshot examples, which are
contextualized by preceding examples through connections between fewshot input
and output tokens across the sequence. In the higher layers, these
representations are aggregated to identify the task and prepare prediction of
the next output. The importance of the contextualization step differs between
tasks, and it may become more important in the presence of ambiguous examples.
Overall, by providing rigorous causal analysis, our results shed light on the
mechanisms through which ICL happens in language models.
♻ ☆ CritiQ: Mining Data Quality Criteria from Human Preferences ACL 2025
Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Language model heavily depends on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introduce biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
~30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier- based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
comment: to be published in ACL 2025, Code is available at
https://github.com/KYLN24/CritiQ
♻ ☆ FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
Full-duplex dialog models aim to listen and speak simultaneously, delivering
rapid responses to dynamic user input. Among different solutions to full
duplexity, a native solution merges multiple channels in each time step,
achieving the lowest latency. However, prevailing designs break down the
textual monologue sentences for word-level alignment with audio streams, which
degrades language modeling abilities. To help address this issue, we introduce
natural monologues, which are composed by continuous sentences and waiting
intervals, mimicking humanoid cognitive behavior in dialogs. We find a proper
training paradigm to be critical for semantically aligning natural monologues
with audio. To this end, we develop a dual training paradigm that alternates
the position of the monologues, either leading or trailing the audio, across
different training stages. A combination of our natural monologue and dual
training strategy is applied in developing FLM-Audio, our 7B spoken dialog
chatbot with native full-duplexity. As confirmed by experimental results,
FLM-Audio achieves superior response qualities and chatting experiences while
requiring significantly less training data.
♻ ☆ Improving Alignment in LVLMs with Debiased Self-Judgment EMNLP 2025
The rapid advancements in Large Language Models (LLMs) and Large
Visual-Language Models (LVLMs) have opened up new opportunities for integrating
visual and linguistic modalities. However, effectively aligning these
modalities remains challenging, often leading to hallucinations--where
generated outputs are not grounded in the visual input--and raising safety
concerns across various domains. Existing alignment methods, such as
instruction tuning and preference tuning, often rely on external datasets,
human annotations, or complex post-processing, which limit scalability and
increase costs. To address these challenges, we propose a novel approach that
generates the debiased self-judgment score, a self-evaluation metric created
internally by the model without relying on external resources. This enables the
model to autonomously improve alignment. Our method enhances both decoding
strategies and preference tuning processes, resulting in reduced
hallucinations, enhanced safety, and improved overall capability. Empirical
results show that our approach significantly outperforms traditional methods,
offering a more effective solution for aligning LVLMs.
comment: EMNLP 2025 Findings
♻ ☆ Generative Data Refinement: Just Ask for Better Data
For a fixed parameter size, the capabilities of large models are primarily
determined by the quality and quantity of its training data. Consequently,
training datasets now grow faster than the rate at which new data is indexed on
the web, leading to projected data exhaustion over the next decade. Much more
data exists as user-generated content that is not publicly indexed, but
incorporating such data comes with considerable risks, such as leaking private
information and other undesirable content. We introduce a framework, Generative
Data Refinement (GDR), for using pretrained generative models to transform a
dataset with undesirable content into a refined dataset that is more suitable
for training. Our experiments show that GDR can outperform industry-grade
solutions for dataset anonymization, as well as enable direct detoxification of
highly unsafe datasets. Moreover, we show that by generating synthetic data
that is conditioned on each example in the real dataset, GDR's refined outputs
naturally match the diversity of web scale datasets, and thereby avoid the
often challenging task of generating diverse synthetic data via model
prompting. The simplicity and effectiveness of GDR make it a powerful tool for
scaling up the total stock of training data for frontier models.
♻ ☆ Culturally-Nuanced Story Generation for Reasoning in Low-Resource Languages: The Case of Javanese and Sundanese
Culturally grounded commonsense reasoning is underexplored in low-resource
languages due to scarce data and costly native annotation. We test whether
large language models (LLMs) can generate culturally nuanced narratives for
such settings. Focusing on Javanese and Sundanese, we compare three data
creation strategies: (1) LLM-assisted stories prompted with cultural cues, (2)
machine translation from Indonesian benchmarks, and (3) native-written stories.
Human evaluation finds LLM stories match natives on cultural fidelity but lag
in coherence and correctness. We fine-tune models on each dataset and evaluate
on a human-authored test set for classification and generation. LLM-generated
data yields higher downstream performance than machine-translated and
Indonesian human-authored training data. We release a high-quality benchmark of
culturally grounded commonsense stories in Javanese and Sundanese to support
future work.
♻ ☆ RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
Reinforcement learning from human feedback (RLHF) offers a promising approach
to aligning large language models (LLMs) with human preferences. Typically, a
reward model is trained or supplied to act as a proxy for humans in evaluating
generated responses during the reinforcement training phase. However, current
reward models operate as sequence-to-one models, allocating a single, sparse,
and delayed reward to an entire output sequence. This approach may overlook the
significant contributions of individual tokens toward the desired outcome. To
this end, we propose a more fine-grained, token-level guidance approach for RL
training. Specifically, we introduce RED, a novel reward redistribition method
that evaluates and assigns specific credit to each token using an off-the-shelf
reward model. Utilizing these fine-grained rewards enhances the model's
understanding of language nuances, leading to more precise performance
improvements. Notably, our method does not require modifying the reward model
or introducing additional training steps, thereby incurring minimal
computational costs. Experimental results across diverse datasets and tasks
demonstrate the superiority of our approach.
♻ ☆ PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Recent advancements in Large Language Models (LLMs) demonstrate remarkable
capabilities across various fields. These developments have led to more direct
communication between humans and LLMs in various situations, such as social
companionship and psychological support. However, LLMs often exhibit
limitations in emotional perception and social competence during real-world
conversations. These limitations partly originate from their inability to adapt
their communication style and emotional expression to different social and task
contexts. In this work, we introduce PersonaFuse, a novel LLM post-training
framework that enables LLMs to adapt and express different personalities for
varying situations. Inspired by Trait Activation Theory and the Big Five
personality model, PersonaFuse employs a Mixture-of-Expert architecture that
combines persona adapters with a dynamic routing network, enabling contextual
trait expression. Experimental results show that PersonaFuse substantially
outperforms baseline models across multiple dimensions of social-emotional
intelligence. Importantly, these gains are achieved without sacrificing general
reasoning ability or model safety, which remain common limitations of direct
prompting and supervised fine-tuning approaches. PersonaFuse also delivers
consistent improvements in downstream human-centered applications, such as
mental health counseling and review-based customer service. Finally, human
preference evaluations against leading LLMs, including GPT-4o and DeepSeek,
demonstrate that PersonaFuse achieves competitive response quality despite its
comparatively smaller model size. These findings demonstrate that PersonaFuse
offers a theoretically grounded and practical approach for developing
social-emotional enhanced LLMs, marking a significant advancement toward more
human-centric AI systems.
♻ ☆ MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Large language models (LLMs) possess broad world knowledge and strong
general-purpose reasoning ability, yet they struggle to learn from many
in-context examples on standard machine learning (ML) tasks, that is, to
leverage many-shot demonstrations purely via in-context learning (ICL) without
gradient descent. We introduce MachineLearningLM, a portable
continued-pretraining framework that equips a general-purpose LLM with robust
in-context ML capability while preserving its general knowledge and reasoning
for broader chat workflows.
Our pretraining procedure synthesizes ML tasks from millions of structural
causal models (SCMs), spanning shot counts up to 1,024. We begin with a
random-forest teacher, distilling tree-based decision strategies into the LLM
to strengthen robustness in numerical modeling. All tasks are serialized with a
token-efficient prompt, enabling 3x to 6x more examples per context window and
delivering up to 50x amortized throughput via batch inference.
Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8),
MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an
average of about 15% on out-of-distribution tabular classification across
finance, physics, biology, and healthcare domains. It exhibits a striking
many-shot scaling law: accuracy increases monotonically as in-context
demonstrations grow from 8 to 1,024. Without any task-specific training, it
attains random-forest-level accuracy across hundreds of shots. General chat
capabilities, including knowledge and reasoning, are preserved: it achieves
75.4% on MMLU.
♻ ☆ Scalable Evaluation of Online Facilitation Strategies via Synthetic Simulation of Discussions
Limited large-scale evaluations exist for facilitation strategies of online
discussions due to significant costs associated with human involvement. An
effective solution is synthetic discussion simulations using Large Language
Models (LLMs) to create initial pilot experiments. We propose design principles
based on existing methodologies for synthetic discussion generation. Based on
these principles, we propose a simple, generalizable, LLM-driven methodology to
prototype the development of LLM facilitators by generating synthetic data
without human involvement, and which surpasses current baselines. We use our
methodology to test whether current Social Science strategies for facilitation
can improve the performance of LLM facilitators. We find that, while LLM
facilitators significantly improve synthetic discussions, there is no evidence
that the application of these strategies leads to further improvements in
discussion quality. In an effort to aid research in the field of facilitation,
we release a large, publicly available dataset containing LLM-generated and
LLM-annotated discussions using multiple open-source models. This dataset can
be used for LLM facilitator finetuning as well as behavioral analysis of
current out-of-the-box LLMs in the task. We also release an open-source python
framework that efficiently implements our methodology at great scale.
comment: 15 pages, 3 tables, 12 figures
♻ ☆ A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions ISCA
Automated speaking assessment (ASA) on opinion expressions is often hampered
by the scarcity of labeled recordings, which restricts prompt diversity and
undermines scoring reliability. To address this challenge, we propose a novel
training paradigm that leverages a large language models (LLM) to generate
diverse responses of a given proficiency level, converts responses into
synthesized speech via speaker-aware text-to-speech synthesis, and employs a
dynamic importance loss to adaptively reweight training instances based on
feature distribution differences between synthesized and real speech.
Subsequently, a multimodal large language model integrates aligned textual
features with speech signals to predict proficiency scores directly.
Experiments conducted on the LTTC dataset show that our approach outperforms
methods relying on real data or conventional augmentation, effectively
mitigating low-resource constraints and enabling ASA on opinion expressions
with cross-modal information.
comment: submitted to the ISCA SLaTE-2025 Workshop
♻ ☆ T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
Jie Zhang, Changzai Pan, Kaiwen Wei, Sishi Xiong, Yu Zhao, Xiangyu Li, Jiaxin Peng, Xiaoyan Gu, Jian Yang, Wenhan Chang, Zhenhe Wu, Jiang Zhong, Shuangyong Song, Yongxiang Li, Xuelong Li
Extensive research has been conducted to explore the capabilities of large
language models (LLMs) in table reasoning. However, the essential task of
transforming tables information into reports remains a significant challenge
for industrial applications. This task is plagued by two critical issues: 1)
the complexity and diversity of tables lead to suboptimal reasoning outcomes;
and 2) existing table benchmarks lack the capacity to adequately assess the
practical application of this task. To fill this gap, we propose the
table-to-report task and construct a bilingual benchmark named T2R-bench, where
the key information flow from the tables to the reports for this task. The
benchmark comprises 457 industrial tables, all derived from real-world
scenarios and encompassing 19 industry domains as well as 4 types of industrial
tables. Furthermore, we propose an evaluation criteria to fairly measure the
quality of report generation. The experiments on 25 widely-used LLMs reveal
that even state-of-the-art models like Deepseek-R1 only achieves performance
with 62.71 overall score, indicating that LLMs still have room for improvement
on T2R-bench.
♻ ☆ Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Reducing the key-value (KV) cache burden in Large Language Models (LLMs)
significantly accelerates inference. Dynamically selecting critical KV caches
during decoding helps maintain performance. Existing methods use random linear
hashing to identify important tokens, but this approach is inefficient due to
the orthogonal distribution of queries and keys within two narrow cones in
LLMs. We introduce Spotlight Attention, a novel method that employs non-linear
hashing functions to optimize the embedding distribution of queries and keys,
enhancing coding efficiency and robustness. We also developed a lightweight,
stable training framework using a Bradley-Terry ranking-based loss, enabling
optimization of the non-linear hashing module on GPUs with 16GB memory in 8
hours. Experimental results show that Spotlight Attention drastically improves
retrieval precision while shortening the length of the hash code at least
5$\times$ compared to traditional linear hashing. Finally, we exploit the
computational advantages of bitwise operations by implementing specialized CUDA
kernels, achieving hashing retrieval for 512K tokens in under 100$\mu$s on a
single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla
decoding.
♻ ☆ MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue EMNLP 2025
Yujia Chen, Changsong Li, Yiming Wang, Tianjie Ju, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan
Mental health issues are worsening in today's competitive society, such as
depression and anxiety. Traditional healings like counseling and chatbots fail
to engage effectively, they often provide generic responses lacking emotional
depth. Although large language models (LLMs) have the potential to create more
human-like interactions, they still struggle to capture subtle emotions. This
requires LLMs to be equipped with human-like adaptability and warmth. To fill
this gap, we propose the MIND (Multi-agent INner Dialogue), a novel paradigm
that provides more immersive psychological healing environments. Considering
the strong generative and role-playing ability of LLM agents, we predefine an
interactive healing framework and assign LLM agents different roles within the
framework to engage in interactive inner dialogues with users, thereby
providing an immersive healing experience. We conduct extensive human
experiments in various real-world healing dimensions, and find that MIND
provides a more user-friendly experience than traditional paradigms. This
demonstrates that MIND effectively leverages the significant potential of LLMs
in psychological healing.
comment: Accepted by EMNLP 2025 Findings
♻ ☆ MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
Muhammad Huzaifah, Geyu Lin, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Jinyang Wu, Nancy F. Chen, Ai Ti Aw
This technical report describes the MERaLiON-SpeechEncoder, a foundation
model designed to support a wide range of downstream speech applications.
Developed as part of Singapore's National Multimodal Large Language Model
Programme, the MERaLiON-SpeechEncoder is tailored to address the speech
processing needs in Singapore and the surrounding Southeast Asian region. The
model currently supports mainly English, including the variety spoken in
Singapore. We are actively expanding our datasets to gradually cover other
languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained
from scratch on 200,000 hours of unlabelled speech data using a self-supervised
learning approach based on masked language modelling. We describe our training
procedure and hyperparameter tuning experiments in detail below. Our evaluation
demonstrates improvements to spontaneous and Singapore speech benchmarks for
speech recognition, while remaining competitive to other state-of-the-art
speech encoders across ten other speech tasks. We commit to releasing our
model, supporting broader research endeavours, both in Singapore and beyond.
♻ ☆ Enhancing Few-Shot Transfer Learning with Optimized Multi-Task Prompt Tuning through Modular Prompt Composition
In recent years, multi-task prompt tuning has garnered considerable attention
for its inherent modularity and potential to enhance parameter-efficient
transfer learning across diverse tasks. This paper aims to analyze and improve
the performance of multiple tasks by facilitating the transfer of knowledge
between their corresponding prompts in a multi-task setting. Our proposed
approach decomposes the prompt for each target task into a combination of
shared prompts (source prompts) and a task-specific prompt (private prompt).
During training, the source prompts undergo fine-tuning and are integrated with
the private prompt to drive the target prompt for each task. We present and
compare multiple methods for combining source prompts to construct the target
prompt, analyzing the roles of both source and private prompts within each
method. We investigate their contributions to task performance and offer
flexible, adjustable configurations based on these insights to optimize
performance. Our empirical findings clearly showcase improvements in accuracy
and robustness compared to the conventional practice of prompt tuning and
related works. Notably, our results substantially outperform other methods in
the field in few-shot settings, demonstrating superior performance in various
tasks across GLUE benchmark, among other tasks. This achievement is attained
with a significantly reduced amount of training data, making our method a
promising one for few-shot settings.
♻ ☆ SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models EMNLP 25
The widespread adoption of large language models (LLMs) necessitates reliable
methods to detect LLM-generated text. We introduce SimMark, a robust
sentence-level watermarking algorithm that makes LLMs' outputs traceable
without requiring access to model internals, making it compatible with both
open and API-based LLMs. By leveraging the similarity of semantic sentence
embeddings combined with rejection sampling to embed detectable statistical
patterns imperceptible to humans, and employing a soft counting mechanism,
SimMark achieves robustness against paraphrasing attacks. Experimental results
demonstrate that SimMark sets a new benchmark for robust watermarking of
LLM-generated content, surpassing prior sentence-level watermarking techniques
in robustness, sampling efficiency, and applicability across diverse domains,
all while maintaining the text quality and fluency.
comment: Accepted to EMNLP 25 main
♻ ☆ VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification
Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin
Large Foundation Models (LFMs) have unlocked new possibilities in
human-computer interaction, particularly with the rise of mobile Graphical User
Interface (GUI) Agents capable of interacting with mobile GUIs. These agents
allow users to automate complex mobile tasks through simple natural language
instructions. However, the inherent probabilistic nature of LFMs, coupled with
the ambiguity and context-dependence of mobile tasks, makes LFM-based
automation unreliable and prone to errors. To address this critical challenge,
we introduce VeriSafe Agent (VSA): a formal verification system that serves as
a logically grounded safeguard for Mobile GUI Agents. VSA deterministically
ensures that an agent's actions strictly align with user intent before
executing the action. At its core, VSA introduces a novel autoformalization
technique that translates natural language user instructions into a formally
verifiable specification. This enables runtime, rule-based verification of
agent's actions, detecting erroneous actions even before they take effect. To
the best of our knowledge, VSA is the first attempt to bring the rigor of
formal verification to GUI agents, bridging the gap between LFM-driven actions
and formal software verification. We implement VSA using off-the-shelf LFM
services (GPT-4o) and evaluate its performance on 300 user instructions across
18 widely used mobile apps. The results demonstrate that VSA achieves
94.33%-98.33% accuracy in verifying agent actions, outperforming existing
LFM-based verification methods by 30.00%-16.33%, and increases the GUI agent's
task completion rate by 90%-130%.
♻ ☆ Merge-of-Thought Distillation
Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao
Efficient reasoning distillation for long chain-of-thought (CoT) models is
increasingly constrained by the assumption of a single oracle teacher, despite
practical availability of multiple candidate teachers and growing CoT corpora.
We revisit teacher selection and observe that different students have different
"best teachers," and even for the same student the best teacher can vary across
datasets. Therefore, to unify multiple teachers' reasoning abilities into
student with overcoming conflicts among various teachers' supervision, we
propose Merge-of-Thought Distillation (MoT), a lightweight framework that
alternates between teacher-specific supervised fine-tuning branches and
weight-space merging of the resulting student variants. On competition math
benchmarks, using only about 200 high-quality CoT samples, applying MoT to a
Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B,
QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. Besides, MoT
consistently outperforms the best single-teacher distillation and the naive
multi-teacher union, raises the performance ceiling while mitigating
overfitting, and shows robustness to distribution-shifted and peer-level
teachers. Moreover, MoT reduces catastrophic forgetting, improves general
reasoning beyond mathematics and even cultivates a better teacher, indicating
that consensus-filtered reasoning features transfer broadly. These results
position MoT as a simple, scalable route to efficiently distilling long CoT
capabilities from diverse teachers into compact students.
♻ ☆ OTESGN: Optimal Transport-Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and
determine their sentiment polarity. While dependency trees combined with
contextual semantics provide structural cues, existing approaches often rely on
dot-product similarity and fixed graphs, which limit their ability to capture
nonlinear associations and adapt to noisy contexts. To address these
limitations, we propose the Optimal Transport-Enhanced Syntactic-Semantic Graph
Network (OTESGN), a model that jointly integrates structural and distributional
signals. Specifically, a Syntactic Graph-Aware Attention module models global
dependencies with syntax-guided masking, while a Semantic Optimal Transport
Attention module formulates aspect-opinion association as a distribution
matching problem solved via the Sinkhorn algorithm. An Adaptive Attention
Fusion mechanism balances heterogeneous features, and contrastive
regularization enhances robustness. Extensive experiments on three benchmark
datasets (Rest14, Laptop14, and Twitter) demonstrate that OTESGN delivers
state-of-the-art performance. Notably, it surpasses competitive baselines by up
to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter. Ablation studies and
visualization analyses further highlight OTESGN's ability to capture
fine-grained sentiment associations and suppress noise from irrelevant context.
♻ ☆ Optimizing Length Compression in Large Reasoning Models
Large Reasoning Models (LRMs) have achieved remarkable success, yet they
often suffer from producing unnecessary and verbose reasoning chains. We
identify a core aspect of this issue as "invalid thinking" -- models tend to
repeatedly double-check their work after having derived the correct answer. To
address this specific inefficiency, we move beyond the general principles of
Efficacy and Efficiency to propose two new, fine-grained principles: Brevity,
which advocates for eliminating redundancy, and Sufficiency, which ensures
critical reasoning steps are preserved. Guided by these principles, we
introduce LC-R1, a post-training method based on Group Relative Policy
Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for
overall conciseness and a Compress Reward that is specifically designed to
remove the invalid portion of the thinking process. Extensive experiments on
multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant
reduction in sequence length (~50%) with only a marginal (~2%) drop in
accuracy, achieving a favorable trade-off point on the Pareto frontier that
prioritizes high compression. Our analysis further validates the robustness of
LC-R1 and provides valuable insights for developing more powerful yet
computationally efficient LRMs. Our code is released at
https://github.com/zxiangx/LC-R1.
comment: 16 pages, 7 figures, 4 tables
♻ ☆ SWI: Speaking with Intent in Large Language Models
Intent, typically clearly formulated and planned, functions as a cognitive
framework for communication and problem-solving. This paper introduces the
concept of Speaking with Intent (SWI) in large language models (LLMs), where
the explicitly generated intent encapsulates the model's underlying intention
and provides high-level planning to guide subsequent analysis and action. By
emulating deliberate and purposeful thoughts in the human mind, SWI is
hypothesized to enhance the reasoning capabilities and generation quality of
LLMs. Extensive experiments on text summarization, multi-task question
answering, and mathematical reasoning benchmarks consistently demonstrate the
effectiveness and generalizability of Speaking with Intent over direct
generation without explicit intent. Further analysis corroborates the
generalizability of SWI under different experimental settings. Moreover, human
evaluations verify the coherence, effectiveness, and interpretability of the
intent produced by SWI. The promising results in enhancing LLMs with explicit
intents pave a new avenue for boosting LLMs' generation and reasoning abilities
with cognitive notions.
comment: Code: https://github.com/YuweiYin/SWI