Computation and Language
☆ SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Repurposing large vision-language models (LVLMs) as computer use agents
(CUAs) has led to substantial breakthroughs, primarily driven by human-labeled
data. However, these models often struggle with novel and specialized software,
particularly in scenarios lacking human annotations. To address this challenge,
we propose SEAgent, an agentic self-evolving framework enabling CUAs to
autonomously evolve through interactions with unfamiliar software.
Specifically, SEAgent empowers computer-use agents to autonomously master novel
software environments via experiential learning, where agents explore new
software, learn through iterative trial-and-error, and progressively tackle
auto-generated tasks organized from simple to complex. To achieve this goal, we
design a World State Model for step-wise trajectory assessment, along with a
Curriculum Generator that generates increasingly diverse and challenging tasks.
The agent's policy is updated through experiential learning, comprising
adversarial imitation of failure actions and Group Relative Policy Optimization
(GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist
training strategy that integrates individual experiential insights from
specialist agents, facilitating the development of a stronger generalist CUA
capable of continuous autonomous evolution. This unified agent ultimately
achieves performance surpassing ensembles of individual specialist agents on
their specialized software. We validate the effectiveness of SEAgent across
five novel software environments within OS-World. Our approach achieves a
significant improvement of 23.2 percentage points in success rate, from 11.3%
to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
comment: Code at https://github.com/SunzeY/SEAgent
☆ Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
The emergence of reasoning models and their integration into practical AI
chatbots has led to breakthroughs in solving advanced math, deep search, and
extractive question answering problems that require a complex, multi-step
thought process. Yet, a complete understanding of why these models hallucinate
more than general purpose language models is missing. In this investigative
study, we systematically explore reasoning failures of contemporary language
models on multi-hop question answering tasks. We introduce a novel, nuanced
error categorization framework that examines failures across three critical
dimensions: the diversity and uniqueness of source documents involved ("hops"),
completeness in capturing relevant information ("coverage"), and cognitive
inefficiency ("overthinking"). Through rigorous human annotation, supported by
complementary automated metrics, our exploration uncovers intricate error
patterns often hidden by accuracy-centric evaluations. This investigative
approach provides deeper insights into the cognitive limitations of current
models and offers actionable guidance toward enhancing reasoning fidelity,
transparency, and robustness in future language modeling efforts.
☆ FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data
LLM-powered conversational assistants are often deployed in a
one-size-fits-all manner, which fails to accommodate individual user
preferences. Recently, LLM personalization -- tailoring models to align with
specific user preferences -- has gained increasing attention as a way to bridge
this gap. In this work, we specifically focus on a practical yet challenging
setting where only a small set of preference annotations can be collected per
user -- a problem we define as Personalized Preference Alignment with Limited
Data (PPALLI). To support research in this area, we introduce two datasets --
DnD and ELIP -- and benchmark a variety of alignment techniques on them. We
further propose FaST, a highly parameter-efficient approach that leverages
high-level features automatically discovered from the data, achieving the best
overall performance.
☆ Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering
Karthik Menon, Batool Arhamna Haider, Muhammad Arham, Kanwal Mehreen, Ram Mohan Rao Kadiyala, Hamza Farooq
This study introduces Query Attribute Modeling (QAM), a hybrid framework that
enhances search precision and relevance by decomposing open text queries into
structured metadata tags and semantic elements. QAM addresses traditional
search limitations by automatically extracting metadata filters from free-form
text queries, reducing noise and enabling focused retrieval of relevant items.
Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique
items with 40,000+ reviews and detailed product attributes) demonstrated QAM's
superior performance, achieving a mean average precision at 5 (mAP@5) of
52.99%. This represents a significant improvement over conventional methods,
including BM25 keyword search, encoder-based semantic similarity search,
cross-encoder re-ranking, and hybrid search combining BM25 and semantic results
via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust
solution for Enterprise Search applications, particularly in e-commerce
systems.
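
The Reciprocal Rank Fusion baseline named in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name, the item IDs, and the smoothing constant k=60 (a commonly used default) are assumptions for illustration only. Each item receives a score of 1/(k + rank) from every ranked list it appears in, and the fused list is sorted by the summed score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item IDs into one ranking.

    k=60 is an assumed default; the abstract does not state the value used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item of each list contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from a keyword retriever and a semantic retriever.
bm25_ranking = ["toy_a", "toy_b", "toy_c"]
semantic_ranking = ["toy_b", "toy_d", "toy_a"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Items that appear near the top of both lists rise in the fused ranking, which is why this kind of hybrid search serves as a natural baseline for QAM.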
☆ GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay
The continual learning capability of large language models (LLMs) is crucial
for advancing artificial general intelligence. However, continual fine-tuning
LLMs across various domains often suffers from catastrophic forgetting,
characterized by: 1) significant forgetting of their general capabilities, and
2) sharp performance declines in previously learned tasks. To simultaneously
address both issues in a simple yet stable manner, we propose General Sample
Replay (GeRe), a framework that uses usual pretraining texts for efficient
anti-forgetting. Beyond revisiting the most prevalent replay-based practices
under GeRe, we further leverage neural states to introduce an enhanced
activation-state-constrained optimization method using threshold-based margin
(TM) loss, which maintains activation state consistency during replay learning.
We are the first to validate that a small, fixed set of pre-collected general
replay samples is sufficient to resolve both concerns--retaining general
capabilities while promoting overall performance across sequential tasks.
Indeed, the former can inherently facilitate the latter. Through controlled
experiments, we systematically compare TM with different replay strategies
under the GeRe framework, including vanilla label fitting, logit imitation via
KL divergence and feature imitation via L1/L2 losses. Results demonstrate that
TM consistently improves performance and exhibits better robustness. Our work
paves the way for efficient replay of LLMs for the future. Our code and data
are available at https://github.com/Qznan/GeRe.
☆ Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
Large Language Models (LLMs) suffer from significant performance degradation
when processing long contexts due to proactive interference, where irrelevant
information in earlier parts of the context disrupts reasoning and memory
recall. While most research focuses on external memory systems to augment LLMs'
capabilities, we propose a complementary approach: empowering LLMs with Active
Context Management (ACM) tools to actively sculpt their internal working
memory. We introduce Sculptor, a framework that equips LLMs with three
categories of tools: (1) context fragmentation, (2) summary, hide, and restore,
and (3) intelligent search. Our approach enables LLMs to proactively manage
their attention and working memory, analogous to how humans selectively focus
on relevant information while filtering out distractions. Experimental
evaluation on information-sparse benchmarks -- PI-LLM (proactive interference)
and NeedleBench Multi-Needle Reasoning -- demonstrates that Sculptor significantly
improves performance even without specific training, leveraging LLMs' inherent
tool calling generalization capabilities. By enabling Active Context
Management, Sculptor not only mitigates proactive interference but also
provides a cognitive foundation for more reliable reasoning across diverse
long-context tasks, highlighting that explicit context-control strategies,
rather than merely larger token windows, are key to robustness at scale.
comment: Preprint. Work in progress
☆ Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab
Group Relative Policy Optimization (GRPO) has proven to be an effective tool
for post-training language models (LMs). However, AI systems are increasingly
expressed as modular programs that mix together multiple LM calls with distinct
prompt templates and other tools, and it is not clear how best to leverage GRPO
to improve these systems. We begin to address this challenge by defining
mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by
module across rollouts and handles variable-length and interrupted
trajectories. We find that mmGRPO, composed with automatic prompt optimization,
improves accuracy by 11% on average across classification, many-hop search, and
privacy-preserving delegation tasks against the post-trained LM, and by 5%
against prompt optimization on its own. We open-source mmGRPO in DSPy as the
dspy.GRPO optimizer.
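
The idea of grouping LM calls by module across rollouts can be sketched as follows. This is a simplified illustration, not the dspy.GRPO implementation: the rollout dictionary layout, the module names, and the function name are hypothetical, and every call in a rollout is assumed to inherit that rollout's final reward. Advantages are normalized within each module's group, so a rollout that was interrupted before reaching a module simply does not contribute to that module's group.

```python
from collections import defaultdict
from statistics import mean, pstdev


def module_level_advantages(rollouts, eps=1e-6):
    """Compute group-relative advantages separately for each module.

    Each rollout is a dict with 'modules' (names of the modules it called)
    and 'reward' (its final scalar reward). This layout is an assumption.
    """
    groups = defaultdict(list)
    for rollout_id, rollout in enumerate(rollouts):
        for module_name in rollout["modules"]:
            groups[module_name].append((rollout_id, rollout["reward"]))

    advantages = {}
    for module_name, members in groups.items():
        rewards = [reward for _, reward in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        # Standard GRPO-style normalization: (reward - group mean) / group std.
        advantages[module_name] = [
            (rollout_id, (reward - mu) / (sigma + eps))
            for rollout_id, reward in members
        ]
    return advantages


# Hypothetical two-module program; rollout 1 was interrupted before 'answer'.
rollouts = [
    {"modules": ["query", "answer"], "reward": 1.0},
    {"modules": ["query"], "reward": 0.0},
    {"modules": ["query", "answer"], "reward": 0.0},
]
adv = module_level_advantages(rollouts)
print(adv)
```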
☆ Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech
Tanvi Dinkar, Aiqi Jiang, Simona Frenda, Poppy Gerrard-Abbott, Nancie Gunson, Gavin Abercrombie, Ioannis Konstas
Counterspeech, i.e. the practice of responding to online hate speech, has
gained traction in NLP as a promising intervention. While early work emphasised
collaboration with non-governmental organisation stakeholders, recent research
trends have shifted toward automated pipelines that reuse a small set of legacy
datasets, often without input from affected communities. This paper presents a
systematic review of 74 NLP studies on counterspeech, analysing the extent to
which stakeholder participation influences dataset creation, model development,
and evaluation. To complement this analysis, we conducted a participatory case
study with five NGOs specialising in online Gender-Based Violence (oGBV),
identifying stakeholder-informed practices for counterspeech generation. Our
findings reveal a growing disconnect between current NLP research and the needs
of communities most impacted by toxic online content. We conclude with concrete
recommendations for re-centring stakeholder expertise in counterspeech
research.
☆ IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction
following capabilities of large language models (LLMs), but suffers from
training inefficiency due to inadequate difficulty assessment. Moreover, RLVR
is prone to over-optimization, where LLMs exploit verification shortcuts
without aligning to the actual intent of user instructions. We introduce
Instruction Following Decorator (IFDecorator), a framework that wraps RLVR
training into a robust and sample-efficient pipeline. It consists of three
components: (1) a cooperative-adversarial data flywheel that co-evolves
instructions and hybrid verifications, generating progressively more
challenging instruction-verification pairs; (2) IntentCheck, a bypass module
enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that
detects reward hacking via trap instructions, which trigger and capture
shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves
87.43% accuracy on IFEval, outperforming larger proprietary models such as
GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench
while preserving general capabilities. Our trip wires show significant
reductions in reward hacking rates. We will release models, code, and data for
future research.
comment: 7 pages, 4 figures
☆ P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis
Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang
Large Language Models (LLMs) are expected to produce safe, helpful, and
honest content during interaction with human users, but they frequently fail to
align with such values when given flawed instructions, e.g., missing context,
ambiguous directives, or inappropriate tone, leaving substantial room for
improvement along multiple dimensions. A cost-effective yet high-impact way is
to pre-align instructions before the model begins decoding. Existing approaches
either rely on prohibitive test-time search costs or end-to-end model rewrite,
which is powered by a customized training corpus with unclear objectives. In
this work, we demonstrate that the goal of efficient and effective preference
alignment can be achieved by P-Aligner, a lightweight module generating
instructions that preserve the original intents while being expressed in a more
human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset
synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree
Search, which systematically explores the space of candidate instructions that
are closely tied to human preference. Experiments across different methods show
that P-Aligner generally outperforms strong baselines across various models and
benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo
and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness
and efficiency through multiple perspectives, including data quality, search
strategies, iterative deployment, and time overhead.
☆ Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider
Text-to-SQL translation enables non-expert users to query relational
databases using natural language, with applications in education and business
intelligence. This study evaluates three lightweight transformer models -
T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on
low-resource settings. We developed a reusable, model-agnostic pipeline that
tailors schema formatting to each model's architecture, training them across
1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form
Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small
achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2
(20.1%), highlighting encoder-decoder models' superiority in schema-aware SQL
generation. Despite resource constraints limiting performance, our pipeline's
modularity supports future enhancements, such as advanced schema linking or
alternative base models. This work underscores the potential of compact
transformers for accessible text-to-SQL solutions in resource-scarce
environments.
☆ TURA: Tool-Augmented Unified Retrieval Agent for AI Search
Zhejun Zhao, Yuehu Dong, Alley Liu, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin
The advent of Large Language Models (LLMs) is transforming search engines
into conversational AI search products, primarily using Retrieval-Augmented
Generation (RAG) on web corpora. However, this paradigm has significant
industrial limitations. Traditional RAG approaches struggle with real-time
needs and structured queries that require accessing dynamically generated
content like ticket availability or inventory. Limited to indexing static
pages, search engines cannot perform the interactive queries needed for such
time-sensitive data. Academic research has focused on optimizing RAG for static
content, overlooking complex intents and the need for dynamic sources like
databases and real-time APIs. To bridge this gap, we introduce TURA
(Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage
framework that combines RAG with agentic tool-use to access both static content
and dynamic, real-time information. TURA has three key components: an
Intent-Aware Retrieval module to decompose queries and retrieve information
sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task
Planner that models task dependencies as a Directed Acyclic Graph (DAG) for
optimal parallel execution, and a lightweight Distilled Agent Executor for
efficient tool calling. TURA is the first architecture to systematically bridge
the gap between static RAG and dynamic information sources for a world-class AI
search product. Serving tens of millions of users, it leverages an agentic
framework to deliver robust, real-time answers while meeting the low-latency
demands of a large-scale industrial system.
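
To illustrate how a DAG of task dependencies yields a parallel execution plan, here is a minimal sketch using Python's standard `graphlib` module. It is not TURA's planner: the task names and the `parallel_stages` helper are hypothetical. Tasks whose dependencies are all satisfied are grouped into the same stage, and each stage could in principle be dispatched concurrently.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into stages; tasks within one stage have no mutual dependency.

    `dependencies` maps each task to the tasks it depends on.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical sub-tasks for a travel query.
plan = parallel_stages({
    "search_flights": [],
    "check_weather": [],
    "book_hotel": ["search_flights"],
    "summarize": ["book_hotel", "check_weather"],
})
print(plan)
```

In this example the two independent lookups form the first stage, while the final summary waits for everything it depends on.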
☆ Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
Artificial Intelligence (AI) conferences are essential for advancing
research, sharing knowledge, and fostering academic community. However, their
rapid expansion has rendered the centralized conference model increasingly
unsustainable. This paper offers a data-driven diagnosis of a structural crisis
that threatens the foundational goals of scientific dissemination, equity, and
community well-being. We identify four key areas of strain: (1) scientifically,
with per-author publication rates more than doubling over the past decade to
over 4.5 papers annually; (2) environmentally, with the carbon footprint of a
single conference exceeding the daily emissions of its host city; (3)
psychologically, with 71% of online community discourse reflecting negative
sentiment and 35% referencing mental health concerns; and (4) logistically,
with attendance at top conferences such as NeurIPS 2024 beginning to outpace
venue capacity. These pressures point to a system that is misaligned with its
core mission. In response, we propose the Community-Federated Conference (CFC)
model, which separates peer review, presentation, and networking into globally
coordinated but locally organized components, offering a more sustainable,
inclusive, and resilient path forward for AI research.
comment: Preprint
☆ Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Large language models (LLMs) have revolutionized AI applications, yet their
high computational and memory demands hinder their widespread deployment.
Existing compression techniques focus on intra-block optimizations (e.g.
low-rank approximation, attention head pruning), while the repetitive layered
structure of transformers implies significant inter-block redundancy - a
dimension largely unexplored beyond key-value (KV) caching. Inspired by
dictionary learning in CNNs, we propose a framework for structured weight
sharing across transformer layers. Our approach decomposes attention projection
matrices into shared dictionary atoms, reducing the attention module's
parameters by 66.7% while achieving on-par performance. Unlike complex methods
requiring distillation or architectural changes, MASA (Matrix Atom Sharing in
Attention) operates as a drop-in replacement - trained with standard optimizers
- and represents each layer's weights as linear combinations of shared matrix
atoms. Experiments across scales (100M-700M parameters) show that MASA achieves
better benchmark accuracy and perplexity than grouped-query attention (GQA),
low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at
comparable parameter budgets. Ablation studies confirm robustness to the
dictionary size and the efficacy of shared representations in capturing
cross-layer statistical regularities. Extending to Vision Transformers (ViT),
MASA matches performance metrics on image classification and detection tasks
with 66.7% fewer attention parameters. By combining dictionary learning
strategies with transformer efficiency, MASA offers a scalable blueprint for
parameter-efficient models without sacrificing performance. Finally, we
investigate the possibility of employing MASA on pretrained LLMs to reduce
their number of parameters without experiencing any significant drop in their
performance.
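
The phrase "linear combinations of shared matrix atoms" can be made concrete with a small sketch. It is an illustration under assumptions, not the MASA implementation: the helper names, the toy 2x2 atoms, and the parameter accounting (one square projection per layer, one scalar coefficient per atom per layer) are all hypothetical simplifications.

```python
def combine_atoms(atoms, coefficients):
    """Build one layer's weight matrix as sum_k coefficients[k] * atoms[k]."""
    rows, cols = len(atoms[0]), len(atoms[0][0])
    weight = [[0.0] * cols for _ in range(rows)]
    for atom, coeff in zip(atoms, coefficients):
        for i in range(rows):
            for j in range(cols):
                weight[i][j] += coeff * atom[i][j]
    return weight


def attention_param_counts(num_layers, num_atoms, dim):
    """Compare per-layer dense matrices with a shared dictionary (assumed accounting)."""
    dense = num_layers * dim * dim
    shared = num_atoms * dim * dim + num_layers * num_atoms
    return dense, shared


# Two toy atoms shared by all layers; each layer stores only its coefficients.
atoms = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
layer_weight = combine_atoms(atoms, [0.5, 2.0])

# With a dictionary one third the size of the layer count, this simplified
# accounting gives a reduction of roughly two thirds.
dense, shared = attention_param_counts(num_layers=12, num_atoms=4, dim=64)
reduction = 1 - shared / dense
print(layer_weight, dense, shared, round(reduction, 4))
```

Whether the reported 66.7% figure arises from exactly this ratio of atoms to layers is an assumption; the abstract does not give the configuration.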
☆ Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration
Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He
While AI agents show potential in scientific ideation, most existing
frameworks rely on single-agent refinement, limiting creativity due to bounded
knowledge and perspective. Inspired by real-world research dynamics, this paper
investigates whether structured multi-agent discussions can surpass solitary
ideation. We propose a cooperative multi-agent framework for generating
research proposals and systematically compare configurations including group
size, leader-led versus leaderless structures, and team compositions varying in
interdisciplinarity and seniority. To assess idea quality, we employ a
comprehensive protocol with agent-based scoring and human review across
dimensions such as novelty, strategic vision, and integration depth. Our
results show that multi-agent discussions substantially outperform solitary
baselines. A designated leader acts as a catalyst, transforming discussion into
more integrated and visionary proposals. Notably, we find that cognitive
diversity is a primary driver of quality, yet expertise is a non-negotiable
prerequisite, as teams lacking a foundation of senior knowledge fail to surpass
even a single competent agent. These findings offer actionable insights for
designing collaborative AI ideation systems and shed light on how team
structure influences creative outcomes.
comment: Preprint
☆ Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation CIKM 2025
Multimodal Recommender Systems aim to improve recommendation accuracy by
integrating heterogeneous content, such as images and textual metadata. While
these systems are effective, it remains unclear whether their gains stem from true multimodal
understanding or increased model complexity. This work investigates the role of
multimodal item embeddings, emphasizing the semantic informativeness of the
representations. Initial experiments reveal that embeddings from standard
extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on
modality-specific encoders and ad hoc fusion strategies that lack control over
cross-modal alignment. To overcome these limitations, we leverage Large
Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via
structured prompts. This approach yields semantically aligned representations
without requiring any fusion. Experiments across multiple settings show notable
performance improvements. Furthermore, LVLM embeddings offer a distinctive
advantage: they can be decoded into structured textual descriptions, enabling
direct assessment of their multimodal comprehension. When such descriptions are
incorporated as side content into recommender systems, they improve
recommendation performance, empirically validating the semantic depth and
alignment encoded within LVLM outputs. Our study highlights the importance of
semantically rich representations and positions LVLMs as a compelling
foundation for building robust and meaningful multimodal representations in
recommendation tasks.
comment: Accepted as Full Research Papers at CIKM 2025
☆ Analyzing and Mitigating Object Hallucination: A Training Bias Perspective
Although scaling up training data has significantly improved the general multimodal
capabilities of Large Vision-Language Models (LVLMs), they still suffer from
the hallucination issue, generating text that is inconsistent with the visual
input. This phenomenon motivates us to systematically investigate the role of
training data in hallucination. We introduce a new benchmark, POPEv2, which
consists of counterfactual images collected from the training data of LVLMs
with certain objects masked. Through comprehensive evaluation on POPEv2, we
find that current LVLMs suffer from training bias: they fail to fully leverage
their training data and hallucinate more frequently on images seen during
training. Specifically, they perform poorly on counterfactual images, often
incorrectly answering "Yes" to questions about masked objects. To understand
this issue, we conduct probing experiments on the models' internal components,
revealing that this training bias is primarily located in the language modeling
(LM) head. Based on these findings, we propose Obliviate, an efficient and
lightweight unlearning method designed to mitigate object hallucination via
training bias unlearning. Obliviate identifies the discrepancy between
ground-truth labels and model outputs on the training data as a proxy for bias
and adopts a parameter- and data-efficient fine-tuning strategy that only
updates the LM head. Extensive experiments demonstrate the effectiveness of our
approach. While only reusing the training data and updating approximately 2%
of the parameters, Obliviate significantly reduces hallucination across both
discriminative and generative tasks. Furthermore, it demonstrates strong
scalability with respect to both model size (2B to 72B) and training data
volume, and exhibits promising generalization to hallucination types beyond
object-level hallucination. Our code and data will be publicly released.
☆ Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning
Depression is a widespread mental disorder that affects millions worldwide.
While automated depression assessment shows promise, most studies rely on
limited or non-clinically validated data, and often prioritize complex model
design over real-world effectiveness. In this paper, we aim to unveil the
landscape of clinical depression assessment. We introduce C-MIND, a clinical
neuropsychiatric multimodal diagnosis dataset collected over two years from
real hospital visits. Each participant completes three structured psychiatric
tasks and receives a final diagnosis from expert clinicians, with informative
audio, video, transcript, and functional near-infrared spectroscopy (fNIRS)
signals recorded. Using C-MIND, we first analyze behavioral signatures relevant
to diagnosis. We train a range of classical models to quantify how different
tasks and modalities contribute to diagnostic performance, and dissect the
effectiveness of their combinations. We then explore whether LLMs can perform
psychiatric reasoning like clinicians and identify their clear limitations in
realistic clinical settings. In response, we propose to guide the reasoning
process with clinical expertise, which consistently improves LLM diagnostic
performance by up to 10% in Macro-F1 score. We aim to build an infrastructure
for clinical depression assessment from both data and algorithmic perspectives,
enabling C-MIND to facilitate grounded and reliable research for mental
healthcare.
☆ StyliTruth: Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering
Generating stylized large language model (LLM) responses via representation
editing is a promising way for fine-grained output control. However, there
exists an inherent trade-off: imposing a distinctive style often degrades
truthfulness. Existing representation editing methods, by naively injecting
style signals, overlook this collateral impact and frequently contaminate the
model's core truthfulness representations, resulting in reduced answer
correctness. We term this phenomenon stylization-induced truthfulness collapse.
We attribute this issue to latent coupling between style and truth directions
in certain key attention heads, and propose StyliTruth, a mechanism that
preserves stylization while keeping truthfulness intact. StyliTruth separates
the style-relevant and truth-relevant subspaces in the model's representation
space via an orthogonal deflation process. This decomposition enables
independent control of style and truth in their own subspaces, minimizing
interference. By designing adaptive, token-level steering vectors within each
subspace, we dynamically and precisely control the generation process to
maintain both stylistic fidelity and truthfulness. We validate our method on
multiple styles and languages. Extensive experiments and analyses show that
StyliTruth significantly reduces stylization-induced truthfulness collapse and
outperforms existing inference-time intervention methods in balancing style
adherence with truthfulness.
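
The intuition behind separating style from truth directions can be sketched with a single projection step. This is a deliberately reduced illustration, not StyliTruth's procedure: the abstract describes an orthogonal deflation over subspaces in attention-head representations, whereas the code below removes just one assumed "truth" direction from one assumed "style" direction, and all vectors are made up.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def deflate(vector, direction):
    """Remove from `vector` its component along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]


# Hypothetical directions in a 3-dimensional representation space.
truth_direction = [1.0, 0.0, 0.0]
style_direction = [0.6, 0.8, 0.0]  # partly coupled with the truth direction

style_only = deflate(style_direction, truth_direction)
print(style_only, dot(style_only, truth_direction))
```

After deflation the remaining style vector is orthogonal to the truth direction, so steering along it no longer moves the representation along the truth axis in this toy setting.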
☆ Causal Reflection with Language Models
While LLMs exhibit impressive fluency and factual recall, they struggle with
robust causal reasoning, often relying on spurious correlations and brittle
patterns. Similarly, traditional Reinforcement Learning agents also lack causal
understanding, optimizing for rewards without modeling why actions lead to
outcomes. We introduce Causal Reflection, a framework that explicitly models
causality as a dynamic function over state, action, time, and perturbation,
enabling agents to reason about delayed and nonlinear effects. Additionally, we
define a formal Reflect mechanism that identifies mismatches between predicted
and observed outcomes and generates causal hypotheses to revise the agent's
internal model. In this architecture, LLMs serve not as black-box reasoners,
but as structured inference engines translating formal causal outputs into
natural language explanations and counterfactuals. Our framework lays the
theoretical groundwork for Causal Reflective agents that can adapt,
self-correct, and communicate causal understanding in evolving environments.
☆ CALE: Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation
Lexical semantics is concerned with both the multiple senses a word can adopt
in different contexts, and the semantic relations that exist between meanings
of different words. To investigate them, Contextualized Language Models are a
valuable tool that provides context-sensitive representations that can be used
to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the
task of Word-in-Context to fine-tune them to get more semantically accurate
representations, but Word-in-Context only compares occurrences of the same
lemma, limiting the range of captured information. In this paper, we propose an
extension, Concept Differentiation, to include inter-word scenarios. We
provide a dataset for this task, derived from SemCor data. Then we fine-tune
several representation models on this dataset. We call these models
Concept-Aligned Embeddings (CALE). By challenging our models and other models
on various lexical semantic tasks, we demonstrate that the proposed models
provide efficient multi-purpose representations of lexical meaning that achieve
the best performance in our experiments. We also show that CALE's fine-tuning
brings valuable changes to the spatial organization of embeddings.
comment: Under review in ARR July 2025
☆ OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use ACL 2025
Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
The dream to create AI assistants as capable and versatile as the fictional
J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution
of (multi-modal) large language models ((M)LLMs), this dream is closer to
reality, as (M)LLM-based Agents using computing devices (e.g., computers and
mobile phones) by operating within the environments and interfaces (e.g.,
Graphical User Interface (GUI)) provided by operating systems (OS) to automate
tasks have significantly advanced. This paper presents a comprehensive survey
of these advanced agents, designated as OS Agents. We begin by elucidating the
fundamentals of OS Agents, exploring their key components including the
environment, observation space, and action space, and outlining essential
capabilities such as understanding, planning, and grounding. We then examine
methodologies for constructing OS Agents, focusing on domain-specific
foundation models and agent frameworks. A detailed review of evaluation
protocols and benchmarks highlights how OS Agents are assessed across diverse
tasks. Finally, we discuss current challenges and identify promising directions
for future research, including safety and privacy, personalization and
self-evolution. This survey aims to consolidate the state of OS Agents
research, providing insights to guide both academic inquiry and industrial
development. An open-source GitHub repository is maintained as a dynamic
resource to foster further innovation in this field. We present a 9-page
version of our work, accepted by ACL 2025, to provide a concise overview of the
domain.
comment: ACL 2025 (Oral)
☆ FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
The deployment of vision-language models remains constrained by substantial
computational requirements. We present FrEVL, a framework exploring
whether frozen pretrained embeddings can support effective vision-language
understanding. Our analysis reveals that frozen embeddings contain rich
information for discriminative tasks, achieving 85% to 95% of
state-of-the-art performance on standard benchmarks with only 68.4M trainable
parameters. This performance dichotomy reveals a critical insight: frozen
embedding effectiveness depends on alignment between pretraining objectives and
downstream task requirements. When accounting for end-to-end computation
including embedding extraction, FrEVL provides a 2.3x speedup with 52%
lower energy consumption, making it suitable for scenarios with pre-computable
inputs or when deployment constraints outweigh marginal performance gains. Our
evaluation provides practitioners with guidance on when frozen embedding
approaches represent viable alternatives to full model deployment. We will
release our complete implementation and evaluation framework to facilitate
further research into efficient multi-modal understanding.
comment: 8 pages, 4 figures
☆ Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI
Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su
This paper addresses the critical need for scalable and high-quality
educational assessment tools within the Malaysian education system. It
highlights the potential of Generative AI (GenAI) while acknowledging the
significant challenges of ensuring factual accuracy and curriculum alignment,
especially for low-resource languages like Bahasa Melayu. This research
introduces and compares four incremental pipelines for generating Form 1
Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's
GPT-4o. The methods range from non-grounded prompting (structured and basic) to
Retrieval-Augmented Generation (RAG) approaches (one using the LangChain
framework, one implemented manually). The system is grounded in official
curriculum documents, including teacher-prepared notes and the yearly teaching
plan (RPT). A dual-pronged automated evaluation framework is employed to assess
the generated questions. Curriculum alignment is measured using Semantic
Textual Similarity (STS) against the RPT, while contextual validity is verified
through a novel RAG-based Question-Answering (RAG-QA) method. The results
demonstrate that RAG-based pipelines significantly outperform non-grounded
prompting methods, producing questions with higher curriculum alignment and
factual validity. The study further analyzes the trade-offs between the ease of
implementation of framework-based RAG and the fine-grained control offered by a
manual pipeline. This work presents a validated methodology for generating
curriculum-specific educational content in a low-resource language, introduces
a symbiotic RAG-QA evaluation technique, and provides actionable insights for
the development and deployment of practical EdTech solutions in Malaysia and
similar regions.
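As a rough illustration of the curriculum-alignment measurement described above, the sketch below scores a generated question against curriculum passages by cosine similarity. It is an assumption-laden stand-in: the abstract does not say which STS model was used, so a simple bag-of-words vector replaces a real sentence embedding, and the names `bow_vector` and `alignment_score` are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words term counts; a stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_score(question, curriculum_passages):
    """Highest similarity between the question and any curriculum passage."""
    q = bow_vector(question)
    return max(cosine_similarity(q, bow_vector(p)) for p in curriculum_passages)
```

In practice the bag-of-words vectors would be swapped for embeddings from a multilingual sentence encoder; the scoring logic stays the same.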
☆ StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
Autoformalization aims to translate natural-language mathematical statements
into a formal language. While LLMs have accelerated progress in this area,
existing methods still suffer from low accuracy. We identify two key abilities
for effective autoformalization: comprehensive mastery of formal-language
domain knowledge, and reasoning capability of natural language problem
understanding and informal-formal alignment. Without the former, a model cannot
identify the correct formal objects; without the latter, it struggles to
interpret real-world contexts and map them precisely into formal expressions.
To address these gaps, we introduce ThinkingF, a data synthesis and training
pipeline that improves both abilities. First, we construct two datasets: one by
distilling and selecting large-scale examples rich in formal knowledge, and
another by generating informal-to-formal reasoning trajectories guided by
expert-designed templates. We then apply SFT and RLVR with these datasets to
further fuse and refine the two abilities. The resulting 7B and 32B models
exhibit both comprehensive formal knowledge and strong informal-to-formal
reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5%
on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior
general-purpose and specialized models.
comment: 24 pages, 17 figures, under review
☆ Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Effective customer support requires not only accurate problem solving but
also structured and empathetic communication aligned with professional
standards. However, existing dialogue datasets often lack strategic guidance,
and real-world service data is difficult to access and annotate. To address
this, we introduce the task of Customer Support Conversation (CSC), aimed at
training customer service agents to respond using well-defined support
strategies. We propose a structured CSC framework grounded in COPC guidelines,
defining five conversational stages and twelve strategies to guide high-quality
interactions. Based on this, we construct CSConv, an evaluation dataset of
1,855 real-world customer-agent conversations rewritten using LLMs to reflect
deliberate strategy use, and annotated accordingly. Additionally, we develop a
role-playing approach that simulates strategy-rich conversations using
LLM-powered roles aligned with the CSC framework, resulting in the training
dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS
significantly improves their ability to generate high-quality, strategy-aligned
responses on CSConv. Human evaluations further confirm gains in problem
resolution. All code and data will be made publicly available at
https://github.com/aliyun/qwen-dianjin.
comment: under review
☆ Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Frontier LLMs have only recently enabled serviceable, autonomous web agents. In
such agents, a model acts as an instantaneous domain model backend: to suggest
an interaction, it is given a web-based task together with the respective
application state. The key problem lies in serialising the application state,
referred to as a snapshot. State-of-the-art web agents are premised on grounded
GUI snapshots, i.e., screenshots enhanced with visual cues, not least because
they resemble human perception, but also because images are a relatively cheap
means of model input. LLM vision, however, still lags behind code
interpretation capabilities. DOM snapshots, which structurally resemble HTML,
present a desirable alternative. Their vast input token size, however, has so
far prevented reliable use with web agents.
We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a
GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web
dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a
grounded GUI snapshot baseline (65%) within the same input token order of
magnitude (1e3). Our best evaluated configurations, one token order above but
within the model's context window, outperform this baseline by 8%. Our
evaluation, moreover, shows that the DOM-inherent hierarchy embodies a strong UI
feature for LLMs.
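To make the idea of shrinking a DOM snapshot concrete, here is a minimal sketch of a naive pruning heuristic. It is not the D2Snap algorithm, whose details the abstract does not give. It assumes well-formed markup, drops script-like subtrees, keeps only a hypothetical set of interactive elements and attributes, and uses indentation to preserve the DOM hierarchy. The names `OutlineBuilder` and `downsample_dom` are invented for illustration.

```python
from html.parser import HTMLParser

# Illustrative choices, not taken from the paper.
INTERACTIVE = {"a", "button", "form", "input", "label", "select", "textarea"}
DROPPED = {"script", "style", "noscript", "svg"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
KEPT_ATTRS = {"href", "id", "name", "type", "placeholder", "value"}


class OutlineBuilder(HTMLParser):
    """Collects an indented outline of interactive elements and visible text."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.stack = []       # currently open non-void tags
        self.skip_depth = 0   # > 0 while inside a dropped subtree

    def _emit(self, tag, attrs):
        kept = " ".join(f'{k}="{v}"' for k, v in attrs
                        if k in KEPT_ATTRS and v is not None)
        indent = "  " * len(self.stack)
        self.lines.append(f"{indent}<{tag}{' ' + kept if kept else ''}>")

    def handle_starttag(self, tag, attrs):
        if tag in VOID:  # void elements never get an end tag
            if not self.skip_depth and tag in INTERACTIVE:
                self._emit(tag, attrs)
            return
        if self.skip_depth or tag in DROPPED:
            self.skip_depth += 1
        elif tag in INTERACTIVE:
            self._emit(tag, attrs)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        self.stack.pop()  # assumes well-formed nesting
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.lines.append("  " * len(self.stack) + text)


def downsample_dom(html):
    """Return a compact, indented outline of the page's interactive structure."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join(builder.lines)
```

Keeping indentation rather than flattening the output reflects the abstract's observation that the DOM hierarchy itself is a useful signal for the model.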
☆ Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model
Prefetching of dialogue responses has been investigated to reduce
user-perceived latency (UPL), which refers to the user's waiting time before
receiving the system's response, in spoken dialogue systems. To reduce the UPL,
it is necessary to predict complete user utterances before the end of the
user's speech, typically by language models, to prepare prefetched dialogue
responses. In this study, we proposed a prediction confidence model (PCM) that
determines whether prefetching is possible or not by estimating the semantic
similarity between the predicted complete user utterance and the complete user
utterance. We evaluated our PCM based on the differences between the predicted
complete user utterance and the complete user utterance.
☆ What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel Fernando Garcia Contreras, Koichiro Yoshino
Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at
the front end of their pipeline. The role of ASR in SDSs is to recognize
information in user speech related to response generation appropriately.
Examining selective listening of humans, which refers to the ability to focus
on and listen to important parts of a conversation during the speech, will
enable us to identify the ASR capabilities required for SDSs and evaluate them.
In this study, we experimentally confirmed selective listening when humans
generate dialogue responses by comparing human transcriptions for generating
dialogue responses and reference transcriptions. Based on our experimental
results, we discuss the possibility of a new ASR evaluation method that
leverages human selective listening, which can identify the gap in
transcription ability between ASR systems and humans.
☆ Why are LLMs' abilities emergent?
The remarkable success of Large Language Models (LLMs) in generative tasks
has raised fundamental questions about the nature of their acquired
capabilities, which often appear to emerge unexpectedly without explicit
training. This paper examines the emergent properties of Deep Neural Networks
(DNNs) through both theoretical analysis and empirical observation, addressing
the epistemological challenge of "creation without understanding" that
characterises contemporary AI development. We explore how the neural approach's
reliance on nonlinear, stochastic processes fundamentally differs from symbolic
computational paradigms, creating systems whose macro-level behaviours cannot
be analytically derived from micro-level neuron activities. Through analysis of
scaling laws, grokking phenomena, and phase transitions in model capabilities,
I demonstrate that emergent abilities arise from the complex dynamics of highly
sensitive nonlinear systems rather than simply from parameter scaling alone. My
investigation reveals that current debates over metrics, pre-training loss
thresholds, and in-context learning miss the fundamental ontological nature of
emergence in DNNs. I argue that these systems exhibit genuine emergent
properties analogous to those found in other complex natural phenomena, where
systemic capabilities emerge from cooperative interactions among simple
components without being reducible to their individual behaviours. The paper
concludes that understanding LLM capabilities requires recognising DNNs as a
new domain of complex dynamical systems governed by universal principles of
emergence, similar to those operating in physics, chemistry, and biology. This
perspective shifts the focus from purely phenomenological definitions of
emergence to understanding the internal dynamic transformations that enable
these systems to acquire capabilities that transcend their individual
components.
comment: 20 pages
☆ Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
This study evaluates advanced natural language processing (NLP) techniques to
enhance crash data quality by mining crash narratives, using secondary crash
identification in Kentucky as a case study. Drawing from 16,656 manually
reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we
compare three model classes: zero-shot open-source large language models (LLMs)
(LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers
(BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic
regression as baseline. Models were calibrated on 2015-2021 data and tested on
1,771 narratives from 2022. Fine-tuned transformers achieved superior
performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy
(95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139
minutes of inference; the logistic baseline lagged well behind (F1: 0.66). LLMs
excelled in recall for some variants (e.g., Gemma3:27B at 0.94) but incurred
high computational costs (up to 723 minutes for DeepSeek-R1:70B), while
fine-tuned models processed the test set in seconds after brief training.
Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can
rival larger counterparts in performance while reducing runtime, suggesting
opportunities for optimized deployments. Results highlight trade-offs between
accuracy, efficiency, and data requirements, with fine-tuned transformer models
balancing precision and recall effectively on Kentucky data. Practical
deployment considerations emphasize privacy-preserving local deployment,
ensemble approaches for improved accuracy, and incremental processing for
scalability, providing a replicable scheme for enhancing crash-data quality
with advanced NLP.
comment: 19 pages, 2 figures
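For readers comparing the precision, recall, and F1 figures quoted above, the following sketch shows how these metrics are computed for a binary task such as secondary-crash identification. The function name and label encoding (1 for a secondary crash) are assumptions made for illustration; the study's own evaluation code is not described in the abstract.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class of a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A model with high recall but low precision, as reported for some zero-shot LLMs, flags most true secondary crashes at the cost of more false alarms, which pulls F1 down.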
☆ Chain of Questions: Guiding Multimodal Curiosity in Language Models
Reasoning capabilities in large language models (LLMs) have substantially
advanced through methods such as chain-of-thought and explicit step-by-step
explanations. However, these improvements have not yet fully transitioned to
multimodal contexts, where models must proactively decide which sensory
modalities such as vision, audio, or spatial perception to engage when
interacting with complex real-world environments. In this paper, we introduce
the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach
that encourages multimodal language models to dynamically generate targeted
questions regarding their surroundings. These generated questions guide the
model to selectively activate relevant modalities, thereby gathering critical
information necessary for accurate reasoning and response generation. We
evaluate our framework on a novel multimodal benchmark dataset, assembled by
integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results
demonstrate that our CoQ method improves a foundation model's ability to
effectively identify and integrate pertinent sensory information. This leads to
improved accuracy, interpretability, and alignment of the reasoning process
with diverse multimodal tasks.
☆ GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Reinforcement learning (RL) with algorithms like Group Relative Policy
Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is
limited by a coarse-grained credit assignment that applies a uniform reward to
all tokens in a sequence. This is a major flaw in long-chain reasoning tasks.
This paper addresses this limitation with Dynamic Entropy Weighting. Our core
idea is that high-entropy tokens in correct responses can guide the policy
toward a higher performance ceiling. This allows us to create more fine-grained
reward signals for precise policy updates in two ways: 1) Group Token Policy
Optimization (GTPO), which assigns an entropy-weighted reward to each
token for fine-grained credit assignment; and 2) Sequence-Level Group
Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted
reward to each sequence based on its average token entropy. Experiments show
our methods significantly outperform the strong DAPO baseline. The results
confirm that our entropy-weighting mechanism is the key driver of this
performance boost, offering a better path to enhance deep reasoning in models.
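The entropy-weighting idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the actual reward formulas, so both weighting forms below (token rewards proportional to entropy, and a sequence reward scaled by mean entropy with a hypothetical coefficient `alpha`) are invented for the example.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_level_rewards(seq_reward, token_dists):
    """Spread a sequence reward over tokens in proportion to token entropy.

    Assumed weighting (illustrative only): r_t = R * T * H_t / sum_k H_k,
    so the rewards average to R and high-entropy tokens receive more credit.
    """
    entropies = [token_entropy(d) for d in token_dists]
    total = sum(entropies)
    n = len(entropies)
    if total == 0.0:  # fully deterministic sequence: fall back to uniform reward
        return [seq_reward] * n
    return [seq_reward * n * h / total for h in entropies]


def sequence_level_reward(seq_reward, token_dists, alpha=1.0):
    """Scale a sequence reward by its mean token entropy.

    Assumed form (illustrative only): R * (1 + alpha * mean_t H_t).
    """
    entropies = [token_entropy(d) for d in token_dists]
    mean_entropy = sum(entropies) / len(entropies)
    return seq_reward * (1.0 + alpha * mean_entropy)
```

The contrast with plain GRPO is that every token in a response no longer receives the same reward: tokens where the policy was uncertain carry more of the credit.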
☆ Modelling and Classifying the Components of a Literature Review
Previous work has demonstrated that AI methods for analysing scientific
literature benefit significantly from annotating sentences in papers according
to their rhetorical roles, such as research gaps, results, limitations,
extensions of existing methodologies, and others. Such representations also
have the potential to support the development of a new generation of systems
capable of producing high-quality literature reviews. However, achieving this
goal requires the definition of a relevant annotation schema and effective
strategies for large-scale annotation of the literature. This paper addresses
these challenges by 1) introducing a novel annotation schema specifically
designed to support literature review generation and 2) conducting a
comprehensive evaluation of a wide range of state-of-the-art large language
models (LLMs) in classifying rhetorical roles according to this schema. To this
end, we also present Sci-Sentence, a novel multidisciplinary benchmark
comprising 700 sentences manually annotated by domain experts and 2,240
sentences automatically labelled using LLMs. We evaluate 37 LLMs on this
benchmark, spanning diverse model families and sizes, using both zero-shot
learning and fine-tuning approaches. The experiments yield several novel
insights that advance the state of the art in this challenging domain. First,
the current generation of LLMs performs remarkably well on this task when
fine-tuned on high-quality data, achieving performance levels above 96% F1.
Second, while large proprietary models like GPT-4o achieve the best results,
some lightweight open-source alternatives also demonstrate excellent
performance. Finally, enriching the training data with semi-synthetic examples
generated by LLMs proves beneficial, enabling small encoders to achieve robust
results and significantly enhancing the performance of several open decoder
models.
☆ Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Large language models (LLMs) show significant potential in healthcare,
prompting numerous benchmarks to evaluate their capabilities. However, concerns
persist regarding the reliability of these benchmarks, which often lack
clinical fidelity, robust data management, and safety-oriented evaluation
metrics. To address these shortcomings, we introduce MedCheck, the first
lifecycle-oriented assessment framework specifically designed for medical
benchmarks. Our framework deconstructs a benchmark's development into five
continuous stages, from design to governance, and provides a comprehensive
checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an
in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis
uncovers widespread, systemic issues, including a profound disconnect from
clinical practice, a crisis of data integrity due to unmitigated contamination
risks, and a systematic neglect of safety-critical evaluation dimensions like
model robustness and uncertainty awareness. Based on these findings, MedCheck
serves as both a diagnostic tool for existing benchmarks and an actionable
guideline to foster a more standardized, reliable, and transparent approach to
evaluating AI in healthcare.
☆ A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as
a promising paradigm for enhancing large language models (LLMs) by converting
raw text into structured knowledge graphs, improving both accuracy and
explainability. However, GraphRAG relies on LLMs to extract knowledge from raw
text during graph construction, and this process can be maliciously manipulated
to implant misleading information. Targeting this attack surface, we propose
two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a
few words in the source text can significantly change the constructed graph,
poison the GraphRAG, and severely mislead downstream reasoning. The first
attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate
vulnerable nodes in the generated graphs and rewrites the corresponding
narratives with LLMs, achieving precise control over specific
question-answering (QA) outcomes with a success rate of 93.1%, while keeping
the poisoned text fluent and natural. The second attack, named Universal KPA
(UKPA), exploits linguistic cues such as pronouns and dependency relations to
disrupt the structural integrity of the generated graph by altering globally
influential words. With fewer than 0.05% of the full text modified, the QA
accuracy collapses from 95% to 50%. Furthermore, experiments show that
state-of-the-art defense methods fail to detect these attacks, highlighting
that securing GraphRAG pipelines against knowledge poisoning remains largely
unexplored.
☆ ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents AAAI2026
Existing benchmarks in e-commerce primarily focus on basic user intents, such
as finding or purchasing products. However, real-world users often pursue more
complex goals, such as applying vouchers, managing budgets, and finding
sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel
end-to-end shopping benchmark designed to encompass increasingly challenging
levels of grounded intent. Specifically, we propose a scalable framework to
simulate user instructions based on various intents derived from sampled
real-world products. To facilitate consistent and reliable evaluations, we
provide a large-scale shopping sandbox that serves as an interactive simulated
environment, incorporating over 2.5 million real-world products. Experimental
results demonstrate that even state-of-the-art language agents (such as
GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks,
highlighting the significant challenges posed by our ShoppingBench. In
addition, we propose a trajectory distillation strategy and leverage supervised
fine-tuning, along with reinforcement learning on synthetic trajectories, to
distill the capabilities of a large language agent into a smaller one. As a
result, our trained agent achieves competitive performance compared to GPT-4.1.
comment: submitted to AAAI 2026
☆ KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
Key-Value (KV) cache quantization has become a widely adopted optimization
technique for efficient large language model (LLM) inference by reducing KV
cache memory usage and mitigating memory-bound constraints. Recent studies have
emphasized the importance of preserving the original precision of KVs for the
first few tokens to ensure the protection of attention sinks. While this
approach has proven effective in mitigating performance degradation, its
underlying principles remain insufficiently understood. Moreover, it fails to
address the recent discovery that attention sinks can emerge beyond the initial
token positions. In this work, we elucidate the underlying mechanisms of
attention sinks during inference by examining their role in the cross-layer
evolution of extreme activation outliers. Additionally, we provide a
comprehensive analysis of the interplay between attention sinks and KV cache
quantization. Based on our enhanced understanding, we introduce
KVSink, a plug-and-play method that effectively predicts sink
tokens with negligible overhead, enabling more thorough preservation. Extensive
experiments demonstrate that KVSink outperforms the existing Preserve-First-N
(PFN) strategy, offering more effective preservation of attention sinks during
KV cache quantization. Moreover, when applied to the well-established KVQuant
method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit
numerical outliers.
comment: Published as a conference paper at COLM 2025
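The difference between preserving the first N tokens and preserving an arbitrary set of sink positions can be sketched as below. This is a toy illustration, not KVSink or KVQuant: the quantizer is a simple per-row uniform scheme, each row stands for one token's key or value vector, and the sink positions are passed in by hand because the abstract does not describe how KVSink predicts them. All function names are hypothetical.

```python
def quantize_uniform(values, bits=4):
    """Round values onto 2**bits evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def preserve_first_n(n):
    """Token positions kept at full precision by a Preserve-First-N policy."""
    return set(range(n))


def quantize_kv_cache(kv_rows, preserved, bits=4):
    """Quantize every token row except those at preserved positions.

    kv_rows: one list of floats per token position.
    preserved: set of token positions to keep at original precision.
    """
    return [
        list(row) if i in preserved else quantize_uniform(row, bits)
        for i, row in enumerate(kv_rows)
    ]
```

Passing `preserve_first_n(4)` reproduces the fixed-prefix policy, while passing a set such as `{0, 17}` shows how a sink that appears beyond the initial positions could also be protected.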
☆ Graph Representation Learning with Massive Unlabeled Data for Rumor Detection
With the development of social media, rumors spread quickly and cause great
harm to society and the economy. Consequently, many effective rumor detection
methods have been developed, among which methods based on learning the rumor
propagation structure are particularly effective. However, existing methods
still suffer from many issues, including the difficulty of obtaining
large-scale labeled rumor datasets, which leads to low generalization ability
and performance degradation on new events, since rumors are time-critical and
usually appear with hot topics or newly emergent events. To address these
problems, we use large-scale unlabeled topic datasets with claim propagation
structure, crawled from the social media platforms Weibo and Twitter, to improve
the semantic learning ability of a graph representation learning model on
various topics. We use three typical graph self-supervised methods, InfoGraph,
JOAO and GraphMAE, in two commonly used training strategies, to verify the
performance of general graph semi-supervised methods in rumor detection tasks.
In addition, to alleviate the time and topic differences between unlabeled topic
data and rumor data, we also collected a rumor dataset covering a variety of
topics over a decade (the ten years up to 2022) from the Weibo rumor-refuting
platform. Our
experiments show that these general graph self-supervised learning methods
outperform previous methods specifically designed for rumor detection tasks and
achieve good performance under few-shot conditions, demonstrating better
generalization ability with the help of our massive unlabeled topic dataset.
comment: 9 pages, 3 figures
☆ TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening CIKM 2025
The increasing demand for mental health services has outpaced the
availability of real training data to develop clinical professionals, leading
to limited support for the diagnosis of depression. This shortage has motivated
the development of simulated or virtual patients to assist in training and
evaluation, but existing approaches often fail to generate clinically valid,
natural, and diverse symptom presentations. In this work, we adopt recent
advanced language models as the backbone and propose a novel
clinician-in-the-loop patient simulation pipeline, TalkDep, with access to
diversified patient profiles to develop simulated patients. By conditioning the
model on psychiatric diagnostic criteria, symptom severity scales, and
contextual factors, our goal is to create authentic patient responses that can
better support diagnostic model training and evaluation. We verify the
reliability of these simulated patients with thorough assessments conducted by
clinical professionals. The availability of validated simulated patients offers
a scalable and adaptable resource for improving the robustness and
generalisability of automatic depression diagnosis systems.
comment: Paper accepted at CIKM 2025
☆ DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting
Time series forecasting is crucial in strategic planning and decision-making
across various industries. Traditional forecasting models mainly concentrate on
numerical time series data, often overlooking important textual information
such as events and news, which can significantly affect forecasting accuracy.
While large language models offer promise for integrating multimodal data,
existing single-prompt frameworks struggle to effectively capture the semantics
of timestamped text, introducing redundant information that can hinder model
performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt
GPT2-base for Multimodal Time Series), a novel dual-prompt large language model
framework that combines two complementary prompts: an explicit prompt for clear
task instructions and a textual prompt for context-aware embeddings from
time-stamped data. The tokenizer generates the explicit prompt while the
embeddings from the textual prompt are refined through self-attention and
feed-forward networks. Comprehensive experiments conducted on diverse
textual-numerical time series datasets demonstrate that this approach
outperforms state-of-the-art algorithms in time series forecasting. This
highlights the significance of incorporating textual context via a dual-prompt
mechanism to achieve more accurate time series predictions.
☆ Hierarchical Text Classification Using Black Box Large Language Models
Hierarchical Text Classification (HTC) aims to assign texts to structured
label hierarchies; however, it faces challenges due to data scarcity and model
complexity. This study explores the feasibility of using black box Large
Language Models (LLMs) accessed via APIs for HTC, as an alternative to
traditional machine learning methods that require extensive labeled data and
computational resources. We evaluate three prompting strategies -- Direct Leaf
Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down
Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and
few-shot settings, comparing the accuracy and cost-effectiveness of these
strategies. Experiments on two datasets show that a few-shot setting
consistently improves classification accuracy compared to a zero-shot setting.
While a traditional machine learning model achieves high accuracy on a dataset
with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the
machine learning model on a dataset with a deeper hierarchy. API costs increase
significantly due to the larger number of input tokens required for deeper label
hierarchies under the DH strategy. These results emphasize the trade-off between
accuracy improvement and the computational cost of prompt strategy. These
findings highlight the potential of black box LLMs for HTC while underscoring
the need to carefully select a prompt strategy to balance performance and cost.
comment: 16 pages, 6 figures
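The top-down multi-step strategy (TMH) mentioned above can be sketched as a loop that asks for one label per hierarchy level. This is a minimal sketch under assumptions: the hierarchy format, the `ROOT` key, and the `choose` callback are hypothetical, and `keyword_choose` is a toy stand-in for an LLM API call so that the example runs offline.

```python
def top_down_classify(text, hierarchy, choose, root="ROOT"):
    """Walk the label hierarchy from the root, picking one child per level.

    hierarchy: dict mapping a label to the list of its child labels.
    choose: callable (text, candidates) -> one of the candidates; in a real
            system this would wrap a prompt to a black box LLM.
    Returns the chosen label path, from top level to leaf.
    """
    path = []
    node = root
    while hierarchy.get(node):
        candidates = hierarchy[node]
        node = choose(text, candidates)
        if node not in candidates:
            raise ValueError(f"chooser returned an unknown label: {node!r}")
        path.append(node)
    return path


def keyword_choose(text, candidates):
    """Toy chooser: first candidate mentioned in the text, else the first one."""
    lowered = text.lower()
    for label in candidates:
        if label.lower() in lowered:
            return label
    return candidates[0]
```

Each level costs one call with a short candidate list, whereas a single-call strategy must place the whole hierarchy in the prompt; this is one way to see why input-token cost grows with hierarchy depth for DH.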
☆ ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
Large Reasoning Models (LRMs) have demonstrated impressive performance in
reasoning-intensive tasks, but they remain vulnerable to harmful content
generation, particularly in the mid-to-late steps of their reasoning processes.
Existing defense mechanisms, however, rely on costly fine-tuning and additional
expert knowledge, which restricts their scalability. In this work, we propose
ReasoningGuard, an inference-time safeguard for LRMs, which injects timely
safety aha moments to steer the reasoning process to be harmless yet helpful.
Leveraging the model's internal attention behavior, our approach accurately
identifies critical points in the reasoning path, and triggers spontaneous,
safety-oriented reflection. To safeguard both the subsequent reasoning steps
and the final answers, we further implement a scaling sampling strategy during
the decoding phase, selecting the optimal reasoning path. Inducing minimal
extra inference cost, ReasoningGuard effectively mitigates three types of
jailbreak attacks, including the latest ones targeting the reasoning process of
LRMs. Our approach outperforms seven existing safeguards, achieving
state-of-the-art safety defenses while effectively avoiding the common
exaggerated safety issues.
☆ Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
Sentiment analysis in low-resource, culturally nuanced contexts challenges
conventional NLP approaches that assume fixed labels and universal affective
expressions. We present a diagnostic framework that treats sentiment as a
context-dependent, culturally embedded construct, and evaluate how large
language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp
messages from Nairobi youth health groups. Using a combination of
human-annotated data, sentiment-flipped counterfactuals, and rubric-based
explanation evaluation, we probe LLM interpretability, robustness, and
alignment with human reasoning. Framing our evaluation through a social-science
measurement lens, we operationalize and interrogate LLM outputs as an
instrument for measuring the abstract concept of sentiment. Our findings reveal
significant variation in model reasoning quality, with top-tier LLMs
demonstrating interpretive stability, while open models often falter under
ambiguity or sentiment shifts. This work highlights the need for culturally
sensitive, reasoning-aware AI evaluation in complex, real-world communication.
☆ Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
Despite significant advances in alignment techniques, we demonstrate that
state-of-the-art language models remain vulnerable to carefully crafted
conversational scenarios that can induce various forms of misalignment without
explicit jailbreaking. Through systematic manual red-teaming with
Claude-4-Opus, we discovered 10 successful attack scenarios, revealing
fundamental vulnerabilities in how current alignment methods handle narrative
immersion, emotional pressure, and strategic framing. These scenarios
successfully elicited a range of misaligned behaviors, including deception,
value drift, self-preservation, and manipulative reasoning, each exploiting
different psychological and contextual vulnerabilities. To validate
generalizability, we distilled our successful manual attacks into
MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible
testing across multiple models. Cross-model evaluation of our 10 scenarios
against five frontier LLMs revealed an overall 76% vulnerability rate, with
significant variations: GPT-4.1 showed the highest susceptibility (90%), while
Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate
that sophisticated reasoning capabilities often become attack vectors rather
than protective mechanisms, as models can be manipulated into complex
justifications for misaligned behavior. This work provides (i) a detailed
taxonomy of conversational manipulation patterns and (ii) a reusable evaluation
framework. Together, these findings expose critical gaps in current alignment
strategies and highlight the need for robustness against subtle, scenario-based
manipulation in future AI systems.
☆ Characterizing Deep Research: A Benchmark and Formal Definition
Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma
Information tasks such as writing surveys or analytical reports require
complex search and reasoning, and have recently been grouped under the umbrella
of "deep research" -- a term also adopted by recent models targeting
these capabilities. Despite growing interest, the scope of the deep research
task remains underdefined and its distinction from other reasoning-intensive
problems is poorly understood. In this paper, we propose a formal
characterization of the deep research (DR) task and introduce a benchmark to
evaluate the performance of DR systems. We argue that the core defining feature
of deep research is not the production of lengthy report-style outputs, but
rather the high fan-out over concepts required during the search process, i.e.,
broad and reasoning-intensive exploration. To enable objective evaluation, we
define DR using an intermediate output representation that encodes key claims
uncovered during search, separating the reasoning challenge from surface-level
report generation. Based on this formulation, we propose LiveDRBench, a diverse
benchmark with 100 challenging tasks over scientific topics (e.g.,
datasets, materials discovery, prior art search) and public interest events
(e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1
score ranges between 0.02 and 0.72 for any sub-category. OpenAI's model
performs the best with an overall F1 score of 0.55. Analysis of reasoning
traces reveals the distribution over the number of referenced sources,
branching, and backtracking events executed by current DR systems, motivating
future directions for improving their search mechanisms and grounding
capabilities. The benchmark is available at
https://github.com/microsoft/LiveDRBench.
comment: First three authors contributed equally (ordered alphabetically)
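To make the idea of scoring key claims rather than report text concrete, here is a minimal sketch of a set-based F1 over claims. Matching claims by exact string equality after lowercasing is an assumption made for illustration, as is the function name `claim_f1`; the abstract does not describe how the benchmark matches claims.

```python
def claim_f1(predicted, reference):
    """Set-based F1 between predicted and reference key claims.

    Exact matching after normalisation is an illustrative assumption.
    """
    pred = {c.strip().lower() for c in predicted}
    ref = {c.strip().lower() for c in reference}
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)  # share of predicted claims that are correct
    recall = hits / len(ref)      # share of reference claims that were found
    return 2 * precision * recall / (precision + recall)
```

A system that finds only a few of many required claims scores low on recall even when every claim it reports is correct, which is one way a high fan-out requirement would show up in such a metric.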
☆ Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity
Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities across vision-language tasks. However, they may suffer from
hallucinations--generating outputs that are semantically inconsistent with the
input image or text. Through causal analyses, we find that: (i) hallucinations
with omission may arise from the failure to adequately capture essential causal
factors, and (ii) hallucinations with fabrication are likely caused by the
model being misled by non-causal cues. To address these challenges, we propose
a novel reinforcement learning framework guided by causal completeness, which
jointly considers both causal sufficiency and causal necessity of tokens.
Specifically, we evaluate each token's standalone contribution and
counterfactual indispensability to define a token-level causal completeness
reward. This reward is used to construct a causally informed advantage function
within the GRPO optimization framework, encouraging the model to focus on
tokens that are both causally sufficient and necessary for accurate generation.
Experimental results across various benchmark datasets and tasks demonstrate
that our approach effectively mitigates hallucinations in MLLMs.
☆ The State Of TTS: A Case Study with Human Fooling Rates
While subjective evaluations in recent years indicate rapid progress in TTS,
can current TTS systems truly pass a human deception test in a Turing-like
evaluation? We introduce Human Fooling Rate (HFR), a metric that directly
measures how often machine-generated speech is mistaken for human. Our
large-scale evaluation of open-source and commercial TTS models reveals
critical insights: (i) CMOS-based claims of human parity often fail under
deception testing; (ii) TTS progress should be benchmarked on datasets where
human speech achieves high HFRs, as evaluating against monotonous or less
expressive reference samples sets a low bar; (iii) commercial models approach
human deception in zero-shot settings, while open-source systems still struggle
with natural conversational speech; and (iv) fine-tuning on high-quality data
improves realism but does not fully bridge the gap. Our findings underscore the
need for more realistic, human-centric evaluations alongside existing
subjective tests.
comment: Accepted at InterSpeech 2025
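The metric as described reduces to a simple proportion. The sketch below assumes each listener judgment is recorded as a boolean ("judged human" or not); the input format, the function name, and the toy votes are hypothetical.

```python
def human_fooling_rate(judged_human):
    """Fraction of clips from one source that listeners judged to be human."""
    votes = [bool(v) for v in judged_human]
    if not votes:
        raise ValueError("need at least one judgment")
    return sum(votes) / len(votes)


# Toy judgments (made up): True means the listener said "human".
tts_votes = [True, False, False, True]
human_votes = [True, True, True, False]

tts_hfr = human_fooling_rate(tts_votes)      # 0.5
human_hfr = human_fooling_rate(human_votes)  # 0.75
```

Computing the same rate on real human recordings gives the reference point the abstract alludes to: if human speech itself is rarely judged human on a dataset, a TTS system can match it without being convincing.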
☆ ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations
Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee
The 2025 Global Risks Report identifies state-based armed conflict and
societal polarisation among the most pressing global threats, with social media
playing a central role in amplifying toxic discourse. Memes, as a widely used
mode of online communication, often serve as vehicles for spreading harmful
content. However, limitations in data accessibility and the high cost of
dataset curation hinder the development of robust meme moderation systems. To
address this challenge, in this work, we introduce a first-of-its-kind dataset
of 6,300 real-world meme-based posts annotated in two stages: (i) binary
classification into toxic and normal, and (ii) fine-grained labelling of toxic
memes as hateful, dangerous, or offensive. A key feature of this dataset is
that it is enriched with auxiliary metadata of socially relevant tags,
enhancing the context of each meme. In addition, we propose a tag generation
module that produces socially grounded tags, because most in-the-wild memes
do not come with tags. Experimental results show that incorporating these
tags substantially enhances the performance of state-of-the-art VLMs on detection
tasks. Our contributions offer a novel and scalable foundation for improved
content moderation in multimodal online environments.
☆ Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Aligning large language models (LLMs) with human preferences is a critical
challenge in AI research. While methods like Reinforcement Learning from Human
Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they
often rely on large, costly preference datasets. Existing work lacks methods
for high-quality data selection specifically for preference data. In this work,
we introduce a novel difficulty-based data selection strategy for preference
datasets, grounded in the DPO implicit reward mechanism. By selecting
preference data examples with smaller DPO implicit reward gaps, which are
indicative of more challenging cases, we improve data efficiency and model
alignment. Our approach consistently outperforms five strong baselines across
multiple datasets and alignment tasks, achieving superior performance with only
10% of the original data. This principled, efficient selection method offers a
promising solution for scaling LLM alignment with limited resources.
comment: Our code and data are available at
https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select
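The selection rule can be sketched with the standard DPO implicit reward, r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)). The code below is a minimal sketch under several assumptions: sequence log-probabilities are already computed, the gap is the signed difference between the chosen and rejected rewards, and the dictionary keys and function names are hypothetical. The authors' repository may differ in all of these details.

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward from sequence log-probabilities."""
    return beta * (logp_policy - logp_ref)


def reward_gap(pair, beta=0.1):
    """Implicit reward of the chosen response minus that of the rejected one."""
    r_chosen = implicit_reward(pair["policy_chosen"], pair["ref_chosen"], beta)
    r_rejected = implicit_reward(pair["policy_rejected"], pair["ref_rejected"], beta)
    return r_chosen - r_rejected


def select_hardest(pairs, fraction=0.1, beta=0.1):
    """Keep the given fraction of pairs with the smallest reward gap."""
    k = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: reward_gap(p, beta))[:k]
```

A small gap means the scoring model barely separates the chosen response from the rejected one, which is the sense in which the abstract calls such pairs more challenging.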
☆ Multilingual Source Tracing of Speech Deepfakes: A First Benchmark SP
Recent progress in generative AI has made it increasingly easy to create
natural-sounding deepfake speech from just a few seconds of audio. While these
tools support helpful applications, they also raise serious concerns by making
it possible to generate convincing fake speech in many languages. Current
research has largely focused on detecting fake speech, but little attention has
been given to tracing the source models used to generate it. This paper
introduces the first benchmark for multilingual speech deepfake source tracing,
covering both mono- and cross-lingual scenarios. We comparatively investigate
DSP- and SSL-based modeling; examine how SSL representations fine-tuned on
different languages impact cross-lingual generalization performance; and
evaluate generalization to unseen languages and speakers. Our findings offer
the first comprehensive insights into the challenges of identifying speech
generation models when training and inference languages differ. The dataset,
protocol and code are available at
https://github.com/xuanxixi/Multilingual-Source-Tracing.
comment: Accepted at Interspeech SPSC 2025 - 5th Symposium on Security and
Privacy in Speech Communication (Oral)
☆ COPO: Consistency-Aware Policy Optimization
Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
Reinforcement learning has significantly enhanced the reasoning capabilities
of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the
introduction of DeepSeek R1 has inspired a surge of interest in leveraging
rule-based rewards as a low-cost alternative for computing advantage functions
and guiding policy optimization. However, a common challenge observed across
many replication and extension efforts is that when multiple sampled responses
under a single prompt converge to identical outcomes, whether correct or
incorrect, the group-based advantage degenerates to zero. This leads to
vanishing gradients and renders the corresponding samples ineffective for
learning, ultimately limiting training efficiency and downstream performance.
To address this issue, we propose a consistency-aware policy optimization
framework that introduces a structured global reward based on outcome
consistency. The global loss based on it ensures that, even when model outputs
show high intra-group consistency, the training process still receives
meaningful learning signals, which encourages the generation of correct and
self-consistent reasoning paths from a global perspective. Furthermore, we
incorporate an entropy-based soft blending mechanism that adaptively balances
local advantage estimation with global optimization, enabling dynamic
transitions between exploration and convergence throughout training. Our method
introduces several key innovations in both reward design and optimization
strategy. We validate its effectiveness through substantial performance gains
on multiple mathematical reasoning benchmarks, highlighting the proposed
framework's robustness and general applicability. Code of this work has been
released at https://github.com/hijih/copo-code.git.
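The degeneracy described in this abstract is easy to reproduce with a group-normalised advantage of the kind commonly used in GRPO-style training. The sketch below shows only the problem, not COPO's remedy; the normalisation formula and the `eps` value are assumptions based on the common formulation rather than on the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1, 0, 1, 0])      # non-zero: useful signal
all_right = group_advantages([1, 1, 1, 1])  # all zeros: no gradient
all_wrong = group_advantages([0, 0, 0, 0])  # all zeros: no gradient
```

When every response in a group receives the same rule-based reward, each advantage is exactly zero, so that prompt contributes nothing to the policy gradient regardless of whether the shared outcome was correct.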
☆ AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities
Open-domain Knowledge Graph Completion (KGC) faces significant challenges in
an ever-changing world, especially when considering the continual emergence of
new entities in daily news. Existing approaches for KGC mainly rely on
pretrained language models' parametric knowledge, pre-constructed queries, or
single-step retrieval, typically requiring substantial supervision and training
data. Even so, they often fail to capture comprehensive and up-to-date
information about unpopular and/or emerging entities. To this end, we introduce
Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework
that combines iterative retrieval actions and multi-step reasoning to
dynamically construct rich knowledge graph triplets. Experiments show that,
despite requiring zero training efforts, AgREE significantly outperforms
existing methods in constructing knowledge graph triplets, especially for
emerging entities that were not seen during language models' training
processes, outperforming previous methods by up to 13.7%. Moreover, we propose
a new evaluation methodology that addresses a fundamental weakness of existing
setups and a new benchmark for KGC on emerging entities. Our work demonstrates
the effectiveness of combining agent-based reasoning with strategic information
retrieval for maintaining up-to-date knowledge graphs in dynamic information
environments.
☆ Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks
The pretrained large language models (LLMs) are finetuned with labeled data
for better instruction following ability and alignment with human values. In
this paper, we study the learning dynamics of LLM finetuning on reasoning tasks
and uncover an over-memorization phenomenon that arises during a specific stage
of LLM finetuning. At this stage, the LLMs have excessively memorized training
data and exhibit high test perplexity while maintaining good test accuracy. We
investigate the conditions that lead to LLM over-memorization and find that
training epochs and large learning rates contribute to this issue. Although
models with over-memorization demonstrate comparable test accuracy to normal
models, they suffer from reduced robustness, poor out-of-distribution
generalization, and decreased generation diversity. Our experiments show that
over-memorization occurs broadly across different tasks, models, and
finetuning methods. Our research highlights that overparameterized, extensively
finetuned LLMs exhibit unique learning dynamics distinct from traditional
machine learning models. Based on our observations of over-memorization, we
provide recommendations on checkpoint and learning rate selection during
finetuning.
☆ GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities
but often struggle with complex, multi-step mathematical reasoning, where minor
errors in visual perception or logical deduction can lead to complete failure.
While Process Reward Models (PRMs) offer step-by-step supervision, existing
multimodal PRMs are limited to being binary verifiers that can identify but not
correct errors, offering little explanatory power. To address these
deficiencies, we introduce the Generative Multimodal Process Reward Model
(GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an
active reasoning collaborator. Instead of a simple scalar score, GM-PRM
provides a fine-grained, interpretable analysis of each reasoning step,
evaluating its step intent, visual alignment, and logical soundness. More
critically, GM-PRM is trained to generate a corrected version of the first
erroneous step it identifies. This unique corrective capability enables our new
test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework
actively enhances solution quality by using the PRM's generated correction to
guide the policy model toward a more promising reasoning trajectory, thereby
improving the diversity and correctness of the solution pool. We demonstrate
that GM-PRM achieves state-of-the-art results on multiple multimodal math
benchmarks, significantly boosting policy model performance with remarkable
data efficiency, requiring only a 20K-sample training dataset. Our code will be
released upon acceptance.
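One possible reading of Refined-BoN is sketched below. Everything here is an assumption made for illustration: `sample(problem, prefix)` stands in for the policy model and returns the remaining steps after a given prefix, and `judge(problem, steps)` stands in for GM-PRM and returns a score, the index of the first erroneous step (or `None`), and a corrected step. The abstract does not give these interfaces or the exact control flow.

```python
def refined_best_of_n(problem, sample, judge, n=4):
    """Best-of-N where flawed candidates are also retried from a corrected step."""
    pool = []
    for _ in range(n):
        steps = sample(problem, [])
        score, bad_idx, fix = judge(problem, steps)
        pool.append((score, steps))
        if bad_idx is not None:
            # Keep the steps before the error, substitute the correction,
            # and let the policy continue from the repaired prefix.
            prefix = steps[:bad_idx] + [fix]
            refined = prefix + sample(problem, prefix)
            pool.append((judge(problem, refined)[0], refined))
    # Return the highest-scoring solution in the enlarged pool.
    return max(pool, key=lambda item: item[0])[1]
```

In this sketch the corrected continuations are added to the pool rather than replacing the originals, so the final selection never scores worse than plain Best-of-N under the same judge.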
☆ ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"
Prior work synthesizes tool-use LLM datasets by first generating a user
query, followed by complex tool-use annotations like DFS. This leads to
inevitable annotation failures and low efficiency in data generation. We
introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad
first constructs valid tool-use chains through an iterative process guided by
textual "gradients", and then synthesizes corresponding user queries. This
"answer-first" approach led to ToolGrad-5k, a dataset generated with more
complex tool use, lower cost, and a 100% pass rate. Experiments show that models
trained on ToolGrad-5k outperform those on expensive baseline datasets and
proprietary LLMs, even on OOD benchmarks.
☆ Efficient Strategy for Improving Large Language Model (LLM) Capabilities
Large Language Models (LLMs) have become a milestone in the field of
artificial intelligence and natural language processing. However, their
large-scale deployment remains constrained by the need for significant
computational resources. This work proposes starting from a base model to
explore and combine data processing and careful data selection techniques,
training strategies, and architectural adjustments to improve the efficiency of
LLMs in resource-constrained environments and within a delimited knowledge
base. The methodological approach included defining criteria for building
reliable datasets, conducting controlled experiments with different
configurations, and systematically evaluating the resulting variants in terms
of capability, versatility, response time, and safety. Finally, comparative
tests were conducted to measure the performance of the developed variants and
to validate the effectiveness of the proposed strategies. This work is based on
the master's thesis in Systems and Computer Engineering titled "Efficient
Strategy for Improving the Capabilities of Large Language Models (LLMs)".
comment: Based on master's thesis in Systems and Computer Engineering,
Universidad Nacional de Colombia (2025)
☆ PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
Retrieval-Augmented Generation (RAG) has become a cornerstone technique for
enhancing large language models (LLMs) with external knowledge. However,
current RAG systems face two critical limitations: (1) they inefficiently
retrieve information for every query, including simple questions that could be
resolved using the LLM's parametric knowledge alone, and (2) they risk
retrieving irrelevant documents when queries contain sparse information
signals. To address these gaps, we introduce Parametric-verified Adaptive
Information Retrieval and Selection (PAIRS), a training-free framework that
integrates parametric and retrieved knowledge to adaptively determine whether
to retrieve and how to select external information. Specifically, PAIRS employs
a dual-path generation mechanism: First, the LLM produces both a direct answer
and a context-augmented answer using self-generated pseudo-context. When these
outputs converge, PAIRS bypasses external retrieval entirely, dramatically
improving the RAG system's efficiency. For divergent cases, PAIRS activates a
dual-path retrieval (DPR) process guided by both the original query and
self-generated contextual signals, followed by an Adaptive Information
Selection (AIS) module that filters documents through weighted similarity to
both sources. This simple yet effective approach can not only enhance
efficiency by eliminating unnecessary retrievals but also improve accuracy
through contextually guided retrieval and adaptive information selection.
Experimental results on six question-answering (QA) benchmarks show that PAIRS
reduces retrieval costs by around 25% (triggering for only 75% of queries)
while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior
baselines on average.
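The control flow described above can be sketched as follows. All callables are hypothetical stand-ins supplied by the caller: `llm` for the language model, `retrieve` for the retriever, `agree` for the convergence check, and `select` for the Adaptive Information Selection step. The prompt wording is also an assumption.

```python
def pairs_answer(query, llm, retrieve, agree, select):
    """Answer a query, retrieving only when the two internal answers diverge.

    Returns (answer, used_retrieval).
    """
    # Path 1: answer from parametric knowledge alone.
    direct = llm(query)
    # Path 2: answer again with a self-generated pseudo-context.
    pseudo_context = llm(f"Write a short passage that answers: {query}")
    augmented = llm(f"Context: {pseudo_context}\nQuestion: {query}")
    if agree(direct, augmented):
        return direct, False  # outputs converge: skip external retrieval
    # Divergent case: retrieve with both the query and the pseudo-context,
    # then filter the documents against both sources.
    docs = retrieve(query) + retrieve(pseudo_context)
    kept = select(docs, query, pseudo_context)
    answer = llm(f"Context: {' '.join(kept)}\nQuestion: {query}")
    return answer, True
```

The efficiency gain in this sketch comes entirely from the early return: each query costs three model calls up front, and the retriever is only invoked when those calls disagree.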
☆ DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation
Jiabing Yang, Yixiang Chen, Zichen Wen, Chenhang Cui, Peiyan Li, Yuan Xu, Bowen Fang, Yan Huang, Liang Wang
Controllable Text Generation (CTG) is a vital subfield in Natural Language
Processing (NLP), aiming to generate text that aligns with desired attributes.
However, previous studies commonly focus on the quality of controllable text
generation for short sequences, while the generation of long-form text remains
largely underexplored. In this paper, we observe that the controllability of
texts generated by the powerful prefix-based method Air-Decoding tends to
decline with increasing sequence length, which we hypothesize primarily arises
from the observed decay in attention to the prefixes. Meanwhile, different
types of prefixes including soft and hard prefixes are also key factors
influencing performance. Building on these insights, we propose a lightweight
and effective framework called Dynamic Token-level Prefix Augmentation (DTPA)
based on Air-Decoding for controllable text generation. Specifically, it first
selects the optimal prefix type for a given task. Then we dynamically amplify
the attention to the prefix for the attribute distribution to enhance
controllability, with a scaling factor growing exponentially as the sequence
length increases. Moreover, based on the task, we optionally apply a similar
augmentation to the original prompt for the raw distribution to balance text
quality. After attribute distribution reconstruction, the generated text
satisfies the attribute constraints well. Experiments on multiple CTG tasks
demonstrate that DTPA generally outperforms other methods in attribute control
while maintaining competitive fluency, diversity, and topic relevance. Further
analysis highlights DTPA's superior effectiveness in long text generation.
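The amplification step can be illustrated on a single row of attention weights. This is only a sketch: the exponential form `exp(rate * step)`, the `rate` value, and the choice to rescale normalised weights and then renormalise are assumptions, since the abstract states only that the scaling factor grows exponentially with sequence length.

```python
import math


def amplify_prefix_attention(weights, prefix_len, step, rate=0.01):
    """Boost attention on the first `prefix_len` positions, then renormalise.

    `step` is the current generation position; the boost grows with it.
    """
    scale = math.exp(rate * step)
    boosted = [w * scale if i < prefix_len else w
               for i, w in enumerate(weights)]
    total = sum(boosted)
    return [w / total for w in boosted]


early = amplify_prefix_attention([0.25, 0.25, 0.25, 0.25], prefix_len=1, step=0)
late = amplify_prefix_attention([0.25, 0.25, 0.25, 0.25], prefix_len=1, step=100)
```

At step 0 the weights are unchanged, while at later steps the prefix receives a growing share, which counteracts the decay in prefix attention that the abstract hypothesises for long sequences.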
☆ ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Motion sensor time-series are central to human activity recognition (HAR),
with applications in health, sports, and smart devices. However, existing
methods are trained for fixed activity sets and require costly retraining when
new behaviours or sensor setups appear. Recent attempts to use large language
models (LLMs) for HAR, typically by converting signals into text or images,
suffer from limited accuracy and lack verifiable interpretability. We propose
ZARA, the first agent-based framework for zero-shot, explainable HAR directly
from raw motion time-series. ZARA integrates an automatically derived pair-wise
feature knowledge base that captures discriminative statistics for every
activity pair, a multi-sensor retrieval module that surfaces relevant evidence,
and a hierarchical agent pipeline that guides the LLM to iteratively select
features, draw on this evidence, and produce both activity predictions and
natural-language explanations. ZARA enables flexible and interpretable HAR
without any fine-tuning or task-specific classifiers. Extensive experiments on
8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering
clear reasoning while exceeding the strongest baselines by 2.53x in macro F1.
Ablation studies further confirm the necessity of each module, marking ZARA as
a promising step toward trustworthy, plug-and-play motion time-series analysis.
Our code is available at https://github.com/zechenli03/ZARA.
☆ Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing
Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu
Large Language Models (LLMs) underpin many AI applications, but their static
nature makes updating knowledge costly. Model editing offers an efficient
alternative by injecting new information through targeted parameter
modifications. In particular, meta-learning-based model editing (MLBME) methods
have demonstrated notable advantages in both editing effectiveness and
efficiency. Despite this, we find that MLBME exhibits suboptimal performance in
low-data scenarios, and its training efficiency is bottlenecked by the
computation of KL divergence. To address these issues, we propose Step More Edit
(SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps
(MBPS) to improve editing performance under limited
supervision and a norm regularization on weight updates to improve training
efficiency. Experimental results on two datasets and two LLMs demonstrate that
SMEdit outperforms prior MLBME baselines and the MBPS strategy can be
seamlessly integrated into existing methods to further boost their performance.
Our code will be released soon.
☆ HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization
Large language models enable agents to autonomously perform tasks in open web
environments. However, as hidden threats within the web evolve, web agents face
the challenge of balancing task performance with emerging risks during
long-sequence operations. Although this challenge is critical, current research
remains limited to single-objective optimization or single-turn scenarios,
lacking the capability for collaborative optimization of both safety and
utility in web environments. To address this gap, we propose HarmonyGuard, a
multi-agent collaborative framework that leverages policy enhancement and
objective optimization to jointly improve both utility and safety. HarmonyGuard
features a multi-agent architecture characterized by two fundamental
capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent
within HarmonyGuard, which automatically extracts and maintains structured
security policies from unstructured external documents, while continuously
updating policies in response to evolving threats. (2) Dual-Objective
Optimization: Based on the dual objectives of safety and utility, the Utility
Agent integrated within HarmonyGuard performs Markovian real-time reasoning
to evaluate the objectives and utilizes metacognitive capabilities for their
optimization. Extensive evaluations on multiple benchmarks show that
HarmonyGuard improves policy compliance by up to 38% and task completion by up
to 20% over existing baselines, while achieving over 90% policy compliance
across all tasks. Our project is available here:
https://github.com/YurunChen/HarmonyGuard.
☆ ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval
Conversational search aims to satisfy users' complex information needs via
multiple-turn interactions. The key challenge lies in revealing real users'
search intent from the context-dependent queries. Previous studies achieve
conversational search by fine-tuning a conversational dense retriever with
relevance judgments between pairs of context-dependent queries and documents.
However, this training paradigm encounters data scarcity issues. To this end,
we propose ConvMix, a mixed-criteria framework to augment conversational dense
retrieval, which covers more aspects than existing data augmentation
frameworks. We design a two-sided relevance judgment augmentation schema in a
scalable manner with the aid of large language models. In addition, we integrate the
framework with quality control mechanisms to obtain semantically diverse
samples and near-distribution supervision to combine various annotated data.
Experimental results on five widely used benchmarks show that the
conversational dense retriever trained by our ConvMix framework outperforms
previous baseline methods, demonstrating the framework's effectiveness.
☆ Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models
Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen
Successful group meetings, such as those implemented in group
behavioral-change programs, work meetings, and other social contexts, must
promote individual goal setting and execution while strengthening the social
relationships within the group. Consequently, an ideal facilitator must be
sensitive to the subtle dynamics of disengagement, difficulties with individual
goal setting and execution, and interpersonal difficulties that signal a need
for intervention. The challenges and cognitive load experienced by facilitators
create a critical gap for an embodied technology that can interpret social
exchanges while remaining aware of the needs of the individuals in the group
and providing transparent recommendations that go beyond powerful but "black
box" foundation models (FMs) that identify social cues. We address this
important demand with a social robot co-facilitator that analyzes multimodal
meeting data and provides discreet cues to the facilitator. The robot's
reasoning is powered by an agentic concept bottleneck model (CBM), which makes
decisions based on human-interpretable concepts like participant engagement and
sentiments, ensuring transparency and trustworthiness. Our core contribution is
a transfer learning framework that distills the broad social understanding of
an FM into our specialized and transparent CBM. This concept-driven system
significantly outperforms direct zero-shot FMs in predicting the need for
intervention and enables real-time human correction of its reasoning.
Critically, we demonstrate robust knowledge transfer: the model generalizes
across different groups and successfully transfers the expertise of senior
human facilitators to improve the performance of novices. By transferring an
expert's cognitive model into an interpretable robotic partner, our work
provides a powerful blueprint for augmenting human capabilities in complex
social domains.
comment: 27 pages, 7 figures
☆ Are Today's LLMs Ready to Explain Well-Being Concepts?
Well-being encompasses mental, physical, and social dimensions essential to
personal growth and informed life decisions. As individuals increasingly
consult Large Language Models (LLMs) to understand well-being, a key challenge
emerges: Can LLMs generate explanations that are not only accurate but also
tailored to diverse audiences? High-quality explanations require both factual
correctness and the ability to meet the expectations of users with varying
expertise. In this work, we construct a large-scale dataset comprising 43,880
explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We
introduce a principle-guided LLM-as-a-judge evaluation framework, employing
dual judges to assess explanation quality. Furthermore, we show that
fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct
Preference Optimization (DPO) can significantly enhance the quality of
generated explanations. Our results reveal: (1) The proposed LLM judges align
well with human evaluations; (2) explanation quality varies significantly
across models, audiences, and categories; and (3) DPO- and SFT-finetuned models
outperform their larger counterparts, demonstrating the effectiveness of
preference-based learning for specialized explanation tasks.
comment: 9 pages, 4 figures, 3 tables
☆ Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
Despite its simplicity and efficacy, the high token expenditure of
self-consistency can limit its practical utility. Here we investigate if
self-consistency can be made more token-efficient for long chain-of-thought
reasoning tasks, while preserving its parallelism, through early hypothesis
pruning. Concretely, we generate all solutions in parallel, but periodically
prune intermediate hypotheses that are deemed unnecessary based on two
lightweight indicators: (a) the model's own confidence in individual
hypotheses, and (b) lexical coverage of all current hypotheses by candidate
subsets that are under consideration for continued retention. We design a fast
weighted set cover algorithm that utilizes the two indicators; our evaluation
of five LLMs on three math benchmarks shows that this method can improve token
efficiency for all models, by 10-35% in many cases.
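As a rough illustration of the pruning step described above, the sketch below runs a greedy weighted set cover over toy hypotheses. The token sets, the confidence values, and the choice of 1 / confidence as the cost of keeping a hypothesis are all illustrative assumptions; the abstract does not specify the actual algorithm.

```python
def greedy_weighted_cover(hypotheses, confidence):
    """Greedily keep hypotheses until every token seen so far is covered.

    hypotheses: dict mapping a hypothesis name to its set of tokens
    confidence: dict mapping a hypothesis name to a confidence in (0, 1]
    Assumption: the cost of keeping a hypothesis is 1 / confidence.
    """
    universe = set().union(*hypotheses.values())
    uncovered = set(universe)
    kept = []
    while uncovered:
        # Pick the hypothesis with the lowest cost per newly covered token.
        best = min(
            (h for h in hypotheses if h not in kept and hypotheses[h] & uncovered),
            key=lambda h: (1.0 / confidence[h]) / len(hypotheses[h] & uncovered),
        )
        kept.append(best)
        uncovered -= hypotheses[best]
    return kept


# Toy partial solutions (hypothetical data).
hyps = {
    "h1": {"let", "x", "=", "2", "so", "4"},
    "h2": {"let", "x", "=", "2"},
    "h3": {"so", "4", "answer"},
    "h4": {"answer"},
}
conf = {"h1": 0.9, "h2": 0.5, "h3": 0.8, "h4": 0.3}

kept = greedy_weighted_cover(hyps, conf)
pruned = sorted(set(hyps) - set(kept))
print("kept:", kept, "pruned:", pruned)
```

In this toy run the two high-confidence hypotheses already cover every token, so the remaining two would be candidates for pruning.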
♻ ☆ Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection
Pengfei Jin, Peng Shu, Sifan Song, Sekeun Kim, Qing Xiao, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Recent advances in parameter-efficient transfer learning have demonstrated
the utility of composing LoRA adapters from libraries of pretrained modules.
However, most existing approaches rely on simple retrieval heuristics or
uniform averaging, which overlook the latent structure of task relationships in
representation space. We propose a new framework for adapter reuse that moves
beyond retrieval, formulating adapter composition as a geometry-aware sparse
reconstruction problem. Specifically, we represent each task by a latent
prototype vector derived from the base model's encoder and aim to approximate
the target task prototype as a sparse linear combination of retrieved reference
prototypes, under an $\ell_1$-regularized optimization objective. The resulting
combination weights are then used to blend the corresponding LoRA adapters,
yielding a composite adapter tailored to the target task. This formulation not
only preserves the local geometric structure of the task representation
manifold, but also promotes interpretability and efficient reuse by selecting a
minimal set of relevant adapters. We demonstrate the effectiveness of our
approach across multiple domains, including medical image segmentation, medical
report generation, and image synthesis. Our results highlight the benefit of
coupling retrieval with latent geometry-aware optimization for improved
zero-shot generalization.
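The sparse reconstruction idea above can be sketched in a few lines. The example below solves a small $\ell_1$-regularized least-squares problem with iterative soft thresholding (ISTA) and then blends toy adapters with the resulting weights. The solver choice, the prototype vectors, the flattened "adapter" parameters, and the linear blending rule are assumptions made for illustration, not details taken from the paper.

```python
def soft_threshold(x, thr):
    """Proximal operator of the l1 norm."""
    if x > thr:
        return x - thr
    if x < -thr:
        return x + thr
    return 0.0


def sparse_weights(target, refs, lam=0.1, iters=200):
    """Minimize 0.5 * ||target - sum_i w_i * refs[i]||^2 + lam * ||w||_1 via ISTA."""
    # Step size 1 / (squared Frobenius norm) is a safe bound on the Lipschitz constant.
    step = 1.0 / sum(v * v for p in refs for v in p)
    w = [0.0] * len(refs)
    for _ in range(iters):
        recon = [sum(w[i] * refs[i][d] for i in range(len(refs)))
                 for d in range(len(target))]
        resid = [t - r for t, r in zip(target, recon)]
        w = [
            soft_threshold(
                w[i] + step * sum(p * r for p, r in zip(refs[i], resid)),
                step * lam,
            )
            for i in range(len(refs))
        ]
    return w


def blend(adapters, w):
    """Weighted sum of flattened adapter parameters (assumed blending rule)."""
    return [sum(wi * a[j] for wi, a in zip(w, adapters))
            for j in range(len(adapters[0]))]


# Hypothetical reference prototypes and target prototype.
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [0.8, 0.0, 0.3]
w = sparse_weights(target, refs)

# Hypothetical flattened adapter parameters, one list per reference task.
adapters = [[1.0, 2.0], [10.0, 10.0], [0.0, 4.0]]
composite = blend(adapters, w)
print(w, composite)
```

The second reference receives exactly zero weight, which is the sparsity effect the $\ell_1$ penalty is meant to produce: only the relevant adapters contribute to the composite.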
♻ ☆ Ultra Memory-Efficient On-FPGA Training of Transformers via Tensor-Compressed Optimization
Transformer models have achieved state-of-the-art performance across a wide
range of machine learning tasks. There is growing interest in training
transformers on resource-constrained edge devices due to considerations such as
privacy, domain adaptation, and on-device scientific machine learning. However,
the significant computational and memory demands required for transformer
training often exceed the capabilities of an edge device. Leveraging low-rank
tensor compression, this paper presents the first on-FPGA accelerator for
end-to-end transformer training. On the algorithm side, we present a
bi-directional contraction flow for tensorized transformer training,
significantly reducing the computational FLOPS and intra-layer memory costs
compared to existing tensor operations. On the hardware side, we store all
highly compressed model parameters and gradient information on chip, creating
an on-chip-memory-only framework for each stage in training. This reduces
off-chip communication and minimizes latency and energy costs. Additionally, we
implement custom computing kernels for each training stage and employ
intra-layer parallelism and pipelining to further enhance run-time and memory
efficiency. Through experiments on transformer models within $36.7$ to $93.5$
MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA
accelerator could conduct single-batch end-to-end training on the AMD Alveo U50
FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM.
Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA
training achieves a memory reduction of $30\times$ to $51\times$. Our FPGA
accelerator also achieves up to $3.6\times$ less energy cost per epoch compared
with tensor Transformer training on an NVIDIA RTX 3090 GPU.
♻ ☆ R1-RE: Cross-Domain Relation Extraction with RLVR
Relation extraction (RE) is a core task in natural language processing.
Traditional approaches typically frame RE as a supervised learning problem,
directly mapping context to labels, an approach that often suffers from poor
out-of-domain (OOD) generalization. Inspired by the workflow of human
annotators, we reframe RE as a reasoning task guided by annotation guidelines
and introduce R1-RE, the first reinforcement learning with verifiable reward
(RLVR) framework for RE tasks. Our method elicits the reasoning abilities of
small language models for annotation tasks, resulting in significantly improved
OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a
private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of
approximately 70%, on par with leading proprietary models such as GPT-4o.
Additionally, our comprehensive analysis provides novel insights into the
training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
comment: 14 pages, 7 figures
♻ ☆ FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging ACL 2025
Zichen Tang, Haihong E, Ziyan Ma, Haoyang He, Jiacheng Liu, Zhongjun Yang, Zihua Rong, Rongjin Li, Kun Ji, Qing Huang, Xinyang Hu, Yang Liu, Qianhe Zheng
We introduce FinanceReasoning, a novel benchmark designed to evaluate the
reasoning capabilities of large reasoning models (LRMs) in financial numerical
reasoning problems. Compared to existing benchmarks, our work provides three
key advancements. (1) Credibility: We update 15.6% of the questions from four
public datasets, annotating 908 new questions with detailed Python solutions
and rigorously refining evaluation standards. This enables an accurate
assessment of the reasoning improvements of LRMs. (2) Comprehensiveness:
FinanceReasoning covers 67.8% of financial concepts and formulas, significantly
surpassing existing datasets. Additionally, we construct 3,133 Python-formatted
functions, which enhances LRMs' financial reasoning capabilities through
refined knowledge (e.g., 83.2% $\rightarrow$ 91.6% for GPT-4o). (3) Challenge:
Models are required to apply multiple financial formulas for precise numerical
reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with
PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical
precision. We demonstrate that combining Reasoner and Programmer models can
effectively enhance LRMs' performance (e.g., 83.2% $\rightarrow$ 87.8% for
DeepSeek-R1). Our work paves the way for future research on evaluating and
improving LRMs in domain-specific complex reasoning tasks.
comment: Accepted by ACL 2025 Main Conference
♻ ☆ p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay ICCV 2025
Despite the remarkable performance of multimodal large language models
(MLLMs) across diverse tasks, the substantial training and inference costs
impede their advancement. In this paper, we propose p-MoD, an efficient MLLM
architecture that significantly reduces training and inference costs while
maintaining model performance. The majority of computation in MLLMs stems from
the overwhelming volume of vision tokens processed by the transformer-based
LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each
LLM layer selects essential vision tokens to process while skipping redundant
ones. However, integrating MoD into MLLMs is non-trivial. To address the
challenges of training and inference stability as well as limited training
data, we adapt the MoD module with two novel designs: tanh-gated weight
normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we
observe that vision tokens exhibit higher redundancy in deeper layers and thus
design a progressive ratio decay (PRD) strategy, which gradually reduces the
token retention ratio layer by layer, employing a shifted cosine schedule. This
crucial design fully unleashes the potential of MoD, significantly boosting the
efficiency and performance of our models. Extensive experiments on two baseline
models across 15 benchmarks show that our model matches or even surpasses the
performance of corresponding baselines, while requiring only 55.6% TFLOPs and
53.7% KV cache storage during inference, and 77.7% GPU hours during training.
comment: Accepted by ICCV 2025; Code released at
https://github.com/MCG-NJU/p-MoD
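To make the idea of a layer-wise decaying retention ratio concrete, here is a minimal sketch of a cosine decay between an assumed maximum and minimum ratio. The exact form of the paper's shifted cosine schedule is not given in the abstract, so the formula, the `r_max` and `r_min` values, the layer count, and the 576-token input are all hypothetical, and the shift itself is not modeled.

```python
import math


def retention_ratios(num_layers, r_max=1.0, r_min=0.1):
    """Cosine decay of the vision-token retention ratio across layers.

    Assumed form: ratio falls from r_max at the first layer to r_min at the last.
    """
    if num_layers < 2:
        return [r_max] * num_layers
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)          # 0.0 at first layer, 1.0 at last
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        ratios.append(r_min + (r_max - r_min) * cosine)
    return ratios


ratios = retention_ratios(num_layers=8)
# Number of vision tokens each layer would process, for a hypothetical 576-token input.
kept_tokens = [max(1, round(r * 576)) for r in ratios]
print([round(r, 3) for r in ratios])
print(kept_tokens)
```

Deeper layers process progressively fewer vision tokens, which matches the abstract's observation that redundancy grows with depth.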
♻ ☆ RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li
Reinforcement Learning with Verifiable Reward (RLVR) has significantly
advanced the complex reasoning abilities of Large Language Models (LLMs).
However, it struggles to break through the inherent capability boundaries of
the base LLM, due to its essentially on-policy strategy coupled with the LLM's
immense action space and sparse reward. Critically, RLVR can lead to
capability boundary collapse, narrowing the LLM's problem-solving scope. To
address this problem, we propose RL-PLUS, a novel hybrid-policy optimization
approach for LLMs that synergizes internal exploitation with external data to
achieve stronger reasoning capabilities and surpass the boundaries of base
models. RL-PLUS integrates two core components, i.e., Multiple Importance
Sampling to address distributional mismatch from external data, and
Exploration-Based Advantage Function to guide the model towards high-value,
unexplored reasoning paths. We provide both theoretical analysis and extensive
experiments to demonstrate the superiority and generalizability of our
approach. Compared with existing RLVR methods, RL-PLUS achieves 1)
state-of-the-art performance on six math reasoning benchmarks; 2) superior
performance on six out-of-distribution reasoning tasks; 3) consistent and
significant gains across diverse model families, with average relative
improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates
that RL-PLUS effectively resolves the capability boundary collapse problem.
♻ ☆ Strong Priority and Determinacy in Timed CCS
Building on the standard theory of process algebra with priorities, we
identify a new scheduling mechanism, called "constructive reduction", which is
designed to capture the essence of synchronous programming. The distinctive
property of this evaluation strategy is to achieve determinacy-by-construction
for multi-cast concurrent communication with shared memory. In the technical
setting of CCS extended by clocks and priorities, we prove for a large class of
"coherent" processes a confluence property for constructive reductions. We show
that under some restrictions, called "pivotability", coherence is preserved by
the operators of prefix, summation, parallel composition, restriction and
hiding. Since this permits memory and sharing, we are able to cover a strictly
larger class of processes compared to those in Milner's classical confluence
theory for CCS without priorities.
comment: Change Notes (06.08.25): Streamlined the definition of coherence and
non-interference; Corrections in Def.~14 for coherence, adding condition on
residual transitions; Adjusted coding of Esterel signals (Ex.~11) to match
adjusted Def.~14; To reflect changed Def.~14, use the term "c-coherence";
Minor rewrite of Sec.~2.3 and Sec.~4; Further corrections and revisions in
Appendices
♻ ☆ Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications
Large language models (LLMs) significantly enhance the performance of various
applications, but they are computationally intensive and energy-demanding. This
makes it challenging to deploy them on devices with limited resources, such as
personal computers and mobile/wearable devices, and results in substantial
inference costs in resource-rich environments like cloud servers. To extend the
use of LLMs, we introduce a low-rank decomposition approach to effectively
compress these models, tailored to the requirements of specific applications.
We observe that LLMs pretrained on general datasets contain many redundant
components not needed for particular applications. Our method focuses on
identifying and removing these redundant parts, retaining only the necessary
elements for the target applications. Specifically, we represent the weight
matrices of LLMs as a linear combination of base components. We then prune the
irrelevant bases and enhance the model with new bases beneficial for specific
applications. Deep compression results on the Llama 2-7B and -13B models,
conducted on target applications including mathematical reasoning and code
generation, show that our method significantly reduces model size while
maintaining comparable accuracy to state-of-the-art low-rank compression
techniques.
♻ ☆ Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data CIKM 2025
Optical Character Recognition (OCR) plays a crucial role in digitizing
historical and multilingual documents, yet OCR errors (imperfect extraction of
text, including character insertion, deletion, and substitution) can
significantly impact downstream tasks like question-answering (QA). In this
work, we conduct a comprehensive analysis of how OCR-induced noise affects the
performance of multilingual QA systems. To support this analysis, we introduce
MultiOCR-QA, a multilingual QA dataset comprising 50K question-answer pairs
across three languages: English, French, and German. The dataset is curated
from OCR-ed historical documents, which include different levels and types of
OCR noise. We then evaluate how different state-of-the-art Large Language
models (LLMs) perform under different error conditions, focusing on three major
OCR error types. Our findings show that QA systems are highly prone to
OCR-induced errors and perform poorly on noisy OCR text. By comparing model
performance on clean versus noisy texts, we provide insights into the
limitations of current approaches and emphasize the need for more
noise-resilient QA systems in historical digitization contexts.
comment: Accepted at CIKM 2025
♻ ☆ Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr
Memorization in generative models extends far beyond verbatim text
reproduction--it manifests through non-literal patterns, semantic associations,
and surprisingly, across modalities in transcript-conditioned generation tasks
such as Lyrics-to-Song (L2S) and Text-to-Video (T2V) models. We reveal a new
class of cross-modality memorization where models trained on these tasks leak
copyrighted content through indirect, phonetic pathways invisible to
traditional text-based analysis. In this work, we introduce Adversarial
PhoneTic Prompting (APT), an attack that replaces iconic phrases with
homophonic alternatives--e.g., "mom's spaghetti" becomes "Bob's
confetti"--preserving the acoustic form while largely changing semantic
content. We demonstrate that models can be prompted to regurgitate memorized
songs using phonetically similar but semantically unrelated lyrics. Despite the
semantic drift, black-box models like SUNO and open-source models like YuE
generate outputs that are strikingly similar to the original
songs--melodically, rhythmically, and vocally--achieving high scores on
AudioJudge, CLAP, and CoverID. These effects persist across genres and
languages. More surprisingly, we find that phonetic prompts alone can trigger
visual memorization in text-to-video models: when given altered lyrics from
Lose Yourself, Veo 3 generates scenes that mirror the original music
video--complete with a hooded rapper and dim urban settings--despite no
explicit visual cues in the prompt. This cross-modality leakage represents an
unprecedented threat: models memorize deep, structural patterns that transcend
their training modality, making traditional safety measures like copyright
filters ineffective. Our findings reveal a fundamental vulnerability in
transcript-conditioned generative models and raise urgent concerns around
copyright, provenance, and secure deployment of multimodal generation systems.
♻ ☆ Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation
Model merging aims to integrate multiple task-specific models into a unified
model that inherits the capabilities of the task-specific models, without
additional training. Existing model merging methods often lack consideration of
the varying contribution ratios of different task-specific models to the final
merged model. In this paper, we propose Mixup Model Merge (M3), a simple yet
effective method inspired by the randomized linear interpolation strategy from
the Mixup data augmentation technique. M3 performs randomized linear
interpolation in parameter space between two task-specific LLMs, where
interpolation coefficients are sampled from a Beta distribution to explore
diverse contribution ratios. This controllable randomness allows M3 to
outperform standard equal-ratio merging by discovering better contribution
ratio combinations. Extensive experiments show that M3 significantly (1)
improves merged LLM performance across tasks, (2) enhances out-of-distribution
and adversarial robustness, (3) outperforms the positive effects of the
sparsification method DARE on model merging and can be further combined with
DARE to achieve superior results, and (4) balances exploration efficiency and
diversity in contribution ratios by tuning the Beta distribution's shape
parameters. The code is provided in the supplementary materials.
comment: 15 pages
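The core operation described above, interpolating two models with a Beta-distributed coefficient, can be sketched as follows. The dictionary-of-lists parameter format, the shape parameters `alpha` and `beta`, and the use of a single coefficient for the whole model are assumptions for illustration; the abstract does not say how the coefficient is applied across layers.

```python
import random


def mixup_model_merge(theta_a, theta_b, alpha=2.0, beta=2.0, rng=None):
    """Linearly interpolate two parameter sets with lam ~ Beta(alpha, beta).

    theta_a, theta_b: dicts mapping parameter names to lists of floats
    Returns the merged parameters and the sampled coefficient.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, beta)  # contribution ratio of model A
    merged = {
        name: [lam * a + (1.0 - lam) * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }
    return merged, lam


# Toy "models" with hypothetical parameter names and values.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
model_b = {"layer.weight": [3.0, 0.0], "layer.bias": [-0.5]}

merged, lam = mixup_model_merge(model_a, model_b, rng=random.Random(42))
print(lam, merged)
```

Calling the function repeatedly with different seeds would yield different contribution ratios, which is the exploration the abstract attributes to the Beta sampling; symmetric shape parameters greater than 1 concentrate samples around equal-ratio merging.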
♻ ☆ Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models ACL 2025
Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
The composition of pre-training datasets for large language models (LLMs)
remains largely undisclosed, hindering transparency and efforts to optimize
data quality, a critical driver of model performance. Current data selection
methods, such as natural language quality assessments, diversity-based filters,
and classifier-based approaches, are limited by single-dimensional evaluation
or redundancy-focused strategies. To address these gaps, we propose four
dimensions to evaluate data quality: professionalism, readability, reasoning,
and cleanliness. We further introduce Meta-rater, a multi-dimensional data
selection method that integrates these dimensions with existing quality metrics
through learned optimal weightings. Meta-rater employs proxy models to train a
regression model that predicts validation loss, enabling the identification of
optimal combinations of quality scores. Experiments demonstrate that Meta-rater
doubles convergence speed for 1.3B parameter models and improves downstream
task performance by 3.23, with advantages that scale to models as large as 7.2B
parameters. Our work establishes that holistic, multi-dimensional quality
integration significantly outperforms conventional single-dimension approaches,
offering a scalable paradigm for enhancing pre-training efficiency and model
capability. To advance future research, we release scripts, data, and models at
https://github.com/opendatalab/Meta-rater.
comment: ACL 2025 Best Theme Paper Award
♻ ☆ LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models
Human cognition naturally engages with abstract and fluid concepts, whereas
existing reasoning models often rely on generating discrete tokens, potentially
constraining their expressive capabilities. Recent advancements aim to address
this limitation by enabling large language models (LLMs) to generate soft,
abstract tokens, thus facilitating reasoning within a continuous concept space.
This paper explores the 'Soft Thinking' capabilities of various LLMs by
examining the models' internal behavior using a suite of probing techniques.
Contrary to the common belief that Soft Thinking enables the simultaneous
exploration of diverse reasoning paths, our findings reveal that LLMs
predominantly rely on the most influential component of the soft inputs during
subsequent decoding steps. This reliance hinders the exploration of different
reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding,
obscuring the advantage of transmitting more information through Soft Tokens.
To tackle this issue, we explore sampling strategies to introduce
randomness, employing methods such as Dirichlet resampling and the
Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness
can alleviate the limitations of vanilla approaches and unleash the potential
of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate
randomness with controlled smoothness, resulting in superior performance across
eight reasoning benchmarks.
comment: 10 pages, 7 figures, work in progress
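For readers unfamiliar with the Gumbel-Softmax trick mentioned above, the following sketch shows the standard construction: add Gumbel noise to the logits, divide by a temperature, and apply a softmax. How the paper plugs this into Soft Thinking is not stated in the abstract, so the logits, the temperature `tau`, the toy embeddings, and the mixing step are assumptions.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a relaxed one-hot vector: softmax((logits + Gumbel noise) / tau)."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(weights, embeddings):
    """Mix token embeddings with the sampled weights (assumed soft-input form)."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]                            # hypothetical next-token logits
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical embeddings
mixed = soft_token(weights, embeddings)
print(weights, mixed)
```

A lower `tau` pushes the weights toward a one-hot vector, while a higher `tau` keeps them smoother; the noise is what lets different runs favor different tokens instead of always following the largest logit.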
♻ ☆ SLR: Automated Synthesis for Scalable Logical Reasoning
Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting
We introduce SLR, an end-to-end framework for systematic evaluation and
training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given
a user's task specification, SLR automatically synthesizes (i) an instruction
prompt for an inductive reasoning task, (ii) a validation program, executable
on model outputs to provide verifiable rewards, and (iii) the latent
ground-truth rule. This process is fully automated, scalable, requires no human
annotations, and offers precise control over task difficulty. Using SLR, we
create SLR-Bench, a benchmark comprising 19k prompts organized into 20
curriculum levels that progressively increase in relational, arithmetic, and
recursive complexity. Large-scale evaluation reveals that contemporary LLMs
readily produce syntactically valid rules, yet often fail at correct logical
inference. Recent reasoning LLMs demonstrate improved performance but incur
very high test-time computation, with costs exceeding $300 for just 1,000
prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on
SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of
computational cost. Moreover, these reasoning capabilities generalize to a wide
range of established benchmarks, underscoring the effectiveness of SLR for
downstream reasoning.
♻ ☆ From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs
Reinforcement learning-based retrieval-augmented generation (RAG) methods
enhance the reasoning abilities of large language models (LLMs). However, most
rely only on final-answer rewards, overlooking intermediate reasoning quality.
This paper analyzes existing RAG reasoning models and identifies three main
failure patterns: (1) information insufficiency, meaning the model fails to
retrieve adequate support; (2) faulty reasoning, where logical or content-level
flaws appear despite sufficient information; and (3) answer-reasoning
inconsistency, where a valid reasoning chain leads to a mismatched final
answer. We propose TIRESRAG-R1, a novel framework using a
think-retrieve-reflect process and a multi-dimensional reward system to improve
reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to
encourage thorough retrieval; (2) a reasoning quality reward to assess the
rationality and accuracy of the reasoning chain; and (3) a reflection reward to
detect and revise errors. It also employs a difficulty-aware reweighting
strategy and training sample filtering to boost performance on complex tasks.
Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms
prior RAG methods and generalizes well to single-hop tasks. The code and data
are available at: https://github.com/probe2/TIRESRAG-R1.
♻ ☆ Automatically Interpreting Millions of Features in Large Language Models
While the activations of neurons in deep neural networks usually do not have
a simple human-understandable interpretation, sparse autoencoders (SAEs) can be
used to transform these activations into a higher-dimensional latent space
which may be more easily interpretable. However, these SAEs can have millions
of distinct latent features, making it infeasible for humans to manually
interpret each one. In this work, we build an open-source automated pipeline to
generate and evaluate natural language explanations for SAE features using
LLMs. We test our framework on SAEs of varying sizes, activation functions, and
losses, trained on two different open-weight LLMs. We introduce five new
techniques to score the quality of explanations that are cheaper to run than
the previous state of the art. One of these techniques, intervention scoring,
evaluates the interpretability of the effects of intervening on a feature,
which we find explains features that are not recalled by existing methods. We
propose guidelines for generating better explanations that remain valid for a
broader set of activating contexts, and discuss pitfalls with existing scoring
techniques. We use our explanations to measure the semantic similarity of
independently trained SAEs, and find that SAEs trained on nearby layers of the
residual stream are highly similar. Our large-scale analysis confirms that SAE
latents are indeed much more interpretable than neurons, even when neurons are
sparsified using top-$k$ postprocessing. Our code is available at
https://github.com/EleutherAI/sae-auto-interp, and our explanations are
available at
https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.
♻ ☆ Inside-Out: Hidden Factual Knowledge in LLMs
Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart
This work presents a framework for assessing whether large language models
(LLMs) encode more factual knowledge in their parameters than what they express
in their outputs. While a few studies hint at this possibility, none has
clearly defined or demonstrated this phenomenon. We first propose a formal
definition of knowledge, quantifying it for a given question as the fraction of
correct-incorrect answer pairs where the correct one is ranked higher. This
gives rise to external and internal knowledge, depending on the information
used to score individual answer candidates: either the model's observable
token-level probabilities or its intermediate computations. Hidden knowledge
arises when internal knowledge exceeds external knowledge. We then present a
case study, applying this framework to three popular open-weights LLMs in a
closed-book QA setup. Our results indicate that: (1) LLMs consistently encode
more factual knowledge internally than what they express externally, with an
average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply
hidden that a model can internally know an answer perfectly, yet fail to
generate it even once, despite large-scale repeated sampling of 1,000 answers.
This reveals fundamental limitations in the generation capabilities of LLMs,
which (3) put a practical constraint on scaling test-time compute via repeated
answer sampling in closed-book QA: significant performance improvements remain
inaccessible because some answers are practically never sampled, yet if they
were, we would be guaranteed to rank them first.
comment: Accepted to COLM 2025
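The knowledge definition above reduces to a simple pairwise ranking statistic, sketched below. The candidate answers and both sets of scores are made-up values for a single hypothetical question; in the paper the external scores come from token-level probabilities and the internal ones from intermediate computations.

```python
def knowledge(scores, correct):
    """Fraction of (correct, incorrect) answer pairs where the correct one scores higher.

    scores:  dict mapping each candidate answer to a scalar score
    correct: set of answers considered correct
    """
    pos = [s for a, s in scores.items() if a in correct]
    neg = [s for a, s in scores.items() if a not in correct]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for the question "What is the capital of France?"
correct = {"Paris"}
external = {"Paris": 0.2, "Lyon": 0.5, "Nice": 0.1, "Rome": 0.3}   # output probabilities
internal = {"Paris": 0.9, "Lyon": 0.4, "Nice": 0.1, "Rome": 0.2}   # probe on hidden states

k_ext = knowledge(external, correct)
k_int = knowledge(internal, correct)
hidden = k_int > k_ext   # hidden knowledge: internal exceeds external
print(k_ext, k_int, hidden)
```

Here the internal scorer ranks the correct answer above every incorrect one while the external scorer does so for only one of three pairs, which is the situation the abstract calls hidden knowledge.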
♻ ☆ AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context ACL
Naba Rizvi, Harper Strickland, Daniel Gitelman, Tristan Cooper, Alexis Morales-Flores, Michael Golden, Aekta Kallepalli, Akshat Alurkar, Haaset Owens, Saleha Ahmedi, Isha Khirwadkar, Imani Munyaka, Nedjma Ousidhoum
As our understanding of autism and ableism continues to increase, so does our
understanding of ableist language towards autistic people. Such language poses
a significant challenge in NLP research due to its subtle and context-dependent
nature. Yet, detecting anti-autistic ableist language remains underexplored,
with existing NLP tools often failing to capture its nuanced expressions. We
present AUTALIC, the first benchmark dataset dedicated to the detection of
anti-autistic ableist language in context, addressing a significant gap in the
field. The dataset comprises 2,400 autism-related sentences collected from
Reddit, accompanied by surrounding context, and is annotated by trained experts
with backgrounds in neurodiversity. Our comprehensive evaluation reveals that
current language models, including state-of-the-art LLMs, struggle to reliably
identify anti-autistic ableism and align with human judgments, underscoring
their limitations in this domain. We publicly release AUTALIC along with the
individual annotations, which serve as a valuable resource for researchers
working on ableism and neurodiversity, as well as those studying disagreements in
annotation tasks. This dataset serves as a crucial step towards developing more
inclusive and context-aware NLP systems that better reflect diverse
perspectives.
comment: accepted to ACL main 2025, 9 pages, 5 figures, 7 tables
♻ ☆ On the Fundamental Impossibility of Hallucination Control in Large Language Models
This paper establishes a fundamental impossibility theorem: no LLM capable of
performing non-trivial knowledge aggregation can simultaneously achieve
truthful (internally consistent) knowledge representation, semantic information
conservation, complete revelation of relevant knowledge, and
knowledge-constrained optimality. This impossibility is not an engineering
limitation but arises from the mathematical structure of information
aggregation itself. We establish this result by describing the inference
process as an auction of ideas, where distributed components compete exploiting
their partial knowledge to shape responses. The proof spans three independent
mathematical domains: mechanism design theory (Green-Laffont), the theory of
proper scoring rules (Savage), and direct architectural analysis of
transformers (Log-Sum-Exp convexity). In particular, we show how in the
strictly concave settings the score of an aggregate of diverse beliefs strictly
exceeds the sum of individual scores. That gap may quantify the creation of
unattributable certainty or overconfidence -- the mathematical origin of both
hallucination and creativity, or imagination.
To support this analysis, we introduce the complementary concepts of the
semantic information measure and the emergence operator to model bounded
reasoning in a general setting. We prove that while bounded reasoning generates
accessible information, providing valuable insights and inspirations, idealized
reasoning strictly preserves semantic content. By demonstrating that
hallucination and imagination are mathematically identical phenomena, grounded
in the necessary violation of information conservation, this paper offers a
principled foundation for managing these behaviors in advanced AI systems.
Finally, we present some speculative ideas to inspire evaluation and
refinements of the proposed theory.
comment: cleared mathematics, proofs and ideas explained, added missing
definitions and axioms, discussion and speculation section added
♻ ☆ Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study
Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Nicholas Sherck, Hamidreza Mahyar, Soheila Samiee
In this study, we introduced a new benchmark consisting of a curated dataset
and a defined evaluation process to assess the compositional reasoning
capabilities of large language models within the chemistry domain. We designed
and validated a fully automated pipeline, verified by subject matter experts,
to facilitate this task. Our approach integrates OpenAI reasoning models with
named entity recognition (NER) systems to extract chemical entities from recent
literature, which are then augmented with external knowledge bases to form a
comprehensive knowledge graph. By generating multi-hop questions across these
graphs, we assess LLM performance in both context-augmented and
non-context-augmented settings. Our experiments reveal that even state-of-the-art models
face significant challenges in multi-hop compositional reasoning. The results
reflect the importance of augmenting LLMs with document retrieval, which can
have a substantial impact on improving their performance. However, even perfect
retrieval accuracy with full context does not eliminate reasoning errors,
underscoring the complexity of compositional reasoning. This work not only
benchmarks and highlights the limitations of current LLMs but also presents a
novel data generation pipeline capable of producing challenging reasoning
datasets across various domains. Overall, this research advances our
understanding of reasoning in computational linguistics.
♻ ☆ Towards Domain Specification of Embedding Models in Medicine
Medical text embedding models are foundational to a wide array of healthcare
applications, ranging from clinical decision support and biomedical information
retrieval to medical question answering, yet they remain hampered by two
critical shortcomings. First, most models are trained on a narrow slice of
medical and biological data, besides not being up to date in terms of
methodology, making them ill-suited to capture the diversity of terminology and
semantics encountered in practice. Second, existing evaluations are often
inadequate: even widely used benchmarks fail to generalize across the full
spectrum of real-world medical tasks.
To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned
on diverse medical corpora through self-supervised contrastive learning across
multiple data sources, to deliver robust medical text embeddings.
Alongside this model, we propose a comprehensive benchmark suite of 51 tasks
spanning classification, clustering, pair classification, and retrieval, modeled
on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of
medical text. Our results demonstrate that this combined approach not only
establishes a robust evaluation framework but also yields embeddings that
consistently outperform state-of-the-art alternatives in different tasks.
♻ ☆ How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
As evaluation designs of large language models may shape our trajectory
toward artificial general intelligence, comprehensive and forward-looking
assessment is essential. Existing benchmarks primarily assess static knowledge,
while intelligence also entails the ability to rapidly learn from experience.
To this end, we advocate for the evaluation of Test-time Learning, the capacity
to improve performance in experience-based, reasoning-intensive tasks during
test time. In this work, we propose semantic games as effective testbeds for
evaluating test-time learning, due to their resistance to saturation and
inherent demand for strategic reasoning. We introduce an objective evaluation
framework that compares model performance under both limited and cumulative
experience settings, and contains four forms of experience representation. To
provide a comparative baseline, we recruit eight human participants to complete
the same task. Results show that LLMs exhibit measurable test-time learning
capabilities; however, their improvements are less stable under cumulative
experience and progress more slowly than those observed in humans. These
findings underscore the potential of LLMs as general-purpose learning machines,
while also revealing a substantial intellectual gap between models and humans,
irrespective of how well LLMs perform on static benchmarks.
♻ ☆ UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing
human-computer interaction, yet their reliance on text-based instructions
imposes limitations on accessibility and convenience, particularly in
hands-free scenarios. To address this issue, we propose replacing text with
speech as the instruction input modality for GUI agents, and introduce
UITron-Speech, which is the first end-to-end GUI agent capable of directly
processing speech instructions and on-device screenshots to predict user
actions. To tackle the problem of data scarcity, we synthesize high-quality
speech instruction datasets using a random-speaker text-to-speech model.
Additionally, we design a mixed-modality training strategy to mitigate the
inherent modality imbalance in pre-trained foundation models. Furthermore, we
conduct a statistical analysis of the distribution of GUI grounding prediction
errors and propose a training-free two-step grounding refinement method to
alleviate minor localization deviations. Extensive experiments on multiple
benchmarks demonstrate that UITron-Speech achieves robust performance and
superior adaptability, underscoring the feasibility and potential of
speech-driven GUI agents for more accessible and intelligent human-computer
interaction. Our code and datasets are available at
https://github.com/UITron-hub/UITron-Speech.
♻ ☆ CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
Tasks that require character-level reasoning, such as counting or locating
characters within words, remain challenging for contemporary language models. A
common conjecture is that language models' reliance on subword units, rather
than characters, contributes to their struggles with character-level tasks, yet
recent studies offer conflicting conclusions about the role of tokenization,
leaving its impact unclear. To address this gap, we introduce CharBench, a
comprehensive benchmark of character-level tasks that is two orders of
magnitude larger than existing alternatives. We evaluate a diverse range of
leading open-weight and proprietary models on CharBench and find that it
presents a significant challenge to modern LLMs, with an average accuracy of
43.6% overall and 32.3% on some tasks. We present an in-depth analysis of how intrinsic
properties of words and their segmentations into tokens correspond to model
performance. For counting tasks, we find that tokenization properties are
weakly correlated with correctness, while the length of the queried word and
the actual character count play a more significant part. In contrast, for tasks
requiring intra-word positional understanding, performance is negatively
correlated with the length of the token containing the queried character,
suggesting that longer tokens obscure character position information for LLMs.
We encourage future work to build on the benchmark and evaluation methodology
introduced here as tools for improving model performance on such tasks.
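To make the two task families and the token-length factor concrete, here is a minimal Python sketch. It is not CharBench's code: the function names and the example segmentation are hypothetical, and the tokens are hand-written rather than produced by a real tokenizer.

```python
from typing import List


def count_char(word: str, ch: str) -> int:
    """Ground truth for a counting task: occurrences of `ch` in `word`."""
    return word.count(ch)


def first_index(word: str, ch: str) -> int:
    """Ground truth for a positional task: 0-based index of the first
    occurrence of `ch` in `word`, or -1 if it does not occur."""
    return word.find(ch)


def containing_token_length(tokens: List[str], char_index: int) -> int:
    """Length of the token that covers character position `char_index`.

    `tokens` is assumed to be a segmentation whose concatenation is the word
    (a hypothetical tokenization, used here only for illustration).
    """
    start = 0
    for tok in tokens:
        end = start + len(tok)
        if start <= char_index < end:
            return len(tok)
        start = end
    raise IndexError("char_index lies outside the tokenized word")


if __name__ == "__main__":
    word = "strawberry"
    tokens = ["str", "aw", "berry"]  # hand-written segmentation, not a real tokenizer
    print(count_char(word, "r"))
    idx = first_index(word, "w")
    print(idx, containing_token_length(tokens, idx))
```

Under the abstract's finding, a queried character that falls inside a longer token (here, any character of "berry") would be expected to be harder to locate than one inside a short token.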
♻ ☆ SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can
naturally emerge through a simple reinforcement learning (RL) framework with
rule-based rewards, where the training may directly start from the base
models, a paradigm referred to as zero RL training. Most recent efforts to
reproduce zero RL training have primarily focused on the Qwen2.5 model series,
which may not be representative as we find the base models already exhibit
strong instruction-following and self-reflection abilities. In this work, we
investigate zero RL training across 10 diverse base models, spanning different
families and sizes including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B,
Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several
key design strategies, such as adjusting format reward and controlling query
difficulty, we achieve substantial improvements in both reasoning accuracy and
response length across most settings. However, by carefully monitoring the
training dynamics, we observe that different base models exhibit distinct
patterns during training. For instance, the increased response length does not
always correlate with the emergence of certain cognitive behaviors such as
verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for
the first time in small models not from the Qwen family. We share the key
designs that enable successful zero RL training, along with our findings and
practices. To facilitate further research, we open-source the code, models, and
analysis tools.
comment: Published as a conference paper at COLM 2025
♻ ☆ EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun
Deploying Transformer-based large language models (LLMs) on
resource-constrained edge devices for long-sequence tasks remains challenging
due to the quadratic time complexity of self-attention and growing Key-Value
(KV) cache demands. While existing KV cache optimizations improve memory
efficiency, they often fail to reduce time to first token (TTFT) and may
degrade performance through token pruning. Alternative sequence modeling
architectures address some of these limitations, but typically require full
retraining and lack infrastructure support. EdgeInfinite offers an efficient
solution by fine-tuning only a small subset of parameters, maintaining quality
while reducing both computational and memory costs, including improved TTFT.
However, its instruction-following ability is limited, and it lacks
mobile-specific optimizations. To address these issues, we propose
EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning
(S-SFT) strategy tailored to long-sequence tasks such as summarization and
question answering. We further optimized EdgeInfinite-Instruct for efficient
deployment on edge NPUs by employing fine-grained post-training quantization
(PTQ) to reduce computational demands while maintaining accuracy, and by
implementing a fixed-shape computation graph that balances memory usage and
on-device efficiency through scenario-specific customization of input token and
cache sizes. Experiments on long-context benchmarks and real-world mobile tasks
show that our approach improves domain-specific performance while maintaining
efficiency on NPU-accelerated edge devices.
comment: The data and method in the paper need to be re-audited
♻ ☆ CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine
Recent developments in Retrieval-Augmented Large Language Models (LLMs) have
shown great promise in biomedical applications. However, a critical gap
persists in reliably evaluating their curation ability, the process by which
models select and integrate relevant references while filtering out noise. To
address this, we introduce the benchmark for Curation of Retrieval-Augmented
LLMs in Biomedicine (CRAB), the first multilingual benchmark tailored for
evaluating the biomedical curation of retrieval-augmented LLMs, available in
English, French, German and Chinese. By incorporating a novel citation-based
evaluation metric, CRAB quantifies the curation performance of
retrieval-augmented LLMs in biomedicine. Experimental results reveal
significant discrepancies in the curation performance of mainstream LLMs,
underscoring the urgent need to improve it in the domain of biomedicine. Our
dataset is available at https://huggingface.co/datasets/zhm0/CRAB.
♻ ☆ A Comparative Study of Specialized LLMs as Dense Retrievers
While large language models (LLMs) are increasingly deployed as dense
retrievers, the impact of their domain-specific specialization on retrieval
effectiveness remains underexplored. This investigation systematically examines
how task-specific adaptations in LLMs influence their retrieval capabilities,
an essential step toward developing unified retrievers capable of handling
text, code, images, and multimodal content. We conduct extensive experiments
with eight Qwen2.5 7B LLMs, including base, instruction-tuned,
code/math-specialized, long reasoning, and vision-language models across
zero-shot retrieval settings and the supervised setting. For the zero-shot
retrieval settings, we consider text retrieval from the BEIR benchmark and code
retrieval from the CoIR benchmark. Further, to evaluate supervised performance,
all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical
specialization and the long reasoning capability cause consistent degradation
in three settings, indicating conflicts between mathematical reasoning and
semantic matching. The vision-language model and code-specialized LLMs
demonstrate superior zero-shot performance compared to other LLMs, even
surpassing BM25 on the code retrieval task, and maintain comparable performance
to base LLMs in supervised settings. These findings suggest promising
directions for the unified retrieval task leveraging cross-domain and
cross-modal fusion.
comment: Accepted by CCIR25 and published by Springer LNCS or LNAI
♻ ☆ IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Flawed planning from VLM-driven embodied agents poses significant safety
hazards, hindering their deployment in real-world household tasks. However,
existing static, non-interactive evaluation paradigms fail to adequately assess
risks within these interactive environments, since they cannot simulate dynamic
risks that emerge from an agent's actions and rely on unreliable post-hoc
evaluations that ignore unsafe intermediate steps. To bridge this critical gap,
we propose evaluating an agent's interactive safety: its ability to perceive
emergent risks and execute mitigation steps in the correct procedural order. We
thus present IS-Bench, the first multi-modal benchmark designed for interactive
safety, featuring 161 challenging scenarios with 388 unique safety risks
instantiated in a high-fidelity simulator. Crucially, it facilitates a novel
process-oriented evaluation that verifies whether risk mitigation actions are
performed before/after specific risk-prone steps. Extensive experiments on
leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current
agents lack interactive safety awareness, and that while safety-aware
Chain-of-Thought can improve performance, it often compromises task completion.
By highlighting these critical limitations, IS-Bench provides a foundation for
developing safer and more reliable embodied AI systems. Code and data are
released at https://github.com/AI45Lab/IS-Bench.
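The process-oriented check described above (whether a mitigation action is performed before or after a risk-prone step) can be illustrated with a small Python sketch. This is an assumption-laden toy, not the benchmark's implementation: `SafetyRule`, `rule_satisfied`, `safe_fraction`, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SafetyRule:
    """Hypothetical rule: `mitigation` must happen before or after `risky_step`."""
    mitigation: str
    risky_step: str
    order: str  # either "before" or "after"


def rule_satisfied(trace: List[str], rule: SafetyRule) -> bool:
    """Check one ordering rule against an executed action trace."""
    if rule.risky_step not in trace:
        return True  # the risk-prone step never ran, so nothing to mitigate
    risky_idx = trace.index(rule.risky_step)  # first occurrence of the risky step
    if rule.order == "before":
        return rule.mitigation in trace[:risky_idx]
    if rule.order == "after":
        return rule.mitigation in trace[risky_idx + 1:]
    raise ValueError(f"unknown order: {rule.order!r}")


def safe_fraction(trace: List[str], rules: List[SafetyRule]) -> float:
    """Fraction of rules satisfied by the trace (1.0 when there are no rules)."""
    if not rules:
        return 1.0
    return sum(rule_satisfied(trace, r) for r in rules) / len(rules)


if __name__ == "__main__":
    trace = ["pick_up_pan", "turn_on_stove", "cook_egg", "turn_off_stove"]
    rules = [
        SafetyRule("turn_off_stove", "cook_egg", "after"),
        SafetyRule("wear_gloves", "pick_up_pan", "before"),
    ]
    print(safe_fraction(trace, rules))
```

Unlike a post-hoc check on the final state, a trace-level check of this kind can flag an unsafe intermediate step even when the task itself completes.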
♻ ☆ Parse Trees Guided LLM Prompt Compression
Offering rich contexts to Large Language Models (LLMs) has been shown to boost
performance in various tasks, but the resulting longer prompts increase
the computational cost and might exceed the input limit of LLMs. Recently, some
prompt compression methods have been proposed to shorten prompts
by using language models to generate shorter prompts or by developing
computational models to select important parts of the original prompt. The
generative compression methods suffer from issues like hallucination,
while the selective compression methods do not incorporate linguistic rules and
overlook the global structure of the prompt. To this end, we propose a novel
selective compression method called PartPrompt. It first obtains a parse tree
for each sentence based on linguistic rules, and calculates local information
entropy for each node in a parse tree. These local parse trees are then
organized into a global tree according to the hierarchical structure such as
the dependency of sentences, paragraphs, and sections. After that, the
root-ward propagation and leaf-ward propagation are proposed to adjust node
values over the global tree. Finally, a recursive algorithm is developed to
prune the global tree based on the adjusted node values. The experiments show
that PartPrompt achieves state-of-the-art performance across various
datasets, metrics, compression ratios, and target LLMs for inference. The
in-depth ablation studies confirm the effectiveness of designs in PartPrompt,
and other additional experiments also demonstrate its superiority in terms of
the coherence of compressed prompts and in the extremely long prompt scenario.
comment: IEEE TPAMI major revision submitted
♻ ☆ HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents ICCV 2025
Yibin Liu, Zhixuan Liang, Zanxin Chen, Tianxing Chen, Mengkang Hu, Wanxi Dong, Congsheng Xu, Zhaoming Han, Yusen Qin, Yao Mu
Recent advances in multimodal large language models (MLLMs) have enabled
richer perceptual grounding for code policy generation in embodied agents.
However, most existing systems lack effective mechanisms to adaptively monitor
policy execution and repair codes during task completion. In this work, we
introduce HyCodePolicy, a hybrid language-based control framework that
systematically integrates code synthesis, geometric grounding, perceptual
monitoring, and iterative repair into a closed-loop programming cycle for
embodied agents. Technically, given a natural language instruction, our system
first decomposes it into subgoals and generates an initial executable program
grounded in object-centric geometric primitives. The program is then executed
in simulation, while a vision-language model (VLM) observes selected
checkpoints to detect and localize execution failures and infer failure
reasons. By fusing structured execution traces capturing program-level events
with VLM-based perceptual feedback, HyCodePolicy infers failure causes and
repairs programs. This hybrid dual feedback mechanism enables self-correcting
program synthesis with minimal human supervision. Our results demonstrate that
HyCodePolicy significantly improves the robustness and sample efficiency of
robot manipulation policies, offering a scalable strategy for integrating
multimodal reasoning into autonomous decision-making pipelines.
comment: Accepted to ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic
Intelligence
♻ ☆ ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark
Automatic Speech Recognition (ASR) has been extensively investigated, yet
prior benchmarks have largely focused on assessing the acoustic robustness of
ASR models, leaving evaluations of their linguistic capabilities relatively
underexplored. This largely stems from the limited parameter sizes and training
corpora of conventional ASR models, leaving them with insufficient world
knowledge, which is crucial for accurately recognizing named entities across
diverse domains, for instance, drug and treatment names in medicine or
specialized technical terms in engineering. Recent breakthroughs in Large
Language Models (LLMs) and corresponding Large Audio Language Models (LALMs)
have markedly enhanced the visibility of advanced context modeling and general
artificial intelligence capabilities. Leveraging LLMs, we envision a unified
system capable of robust speech recognition across diverse real-world domains,
yet existing benchmarks are inadequate for evaluating this objective. To
address this gap, we propose ContextASR-Bench: a comprehensive, large-scale
benchmark designed to assess the linguistic competence of ASR systems using
corpora that feature numerous named entities across multiple domains. It
encompasses up to 40,000 data entries with more than 300,000 named entities
across over 10 domains. Beyond the audio and its transcription, each sample
provides the domain it belongs to and a list of named entities it contains,
which are referred to as the context. Based on this, we introduce three
evaluation modes to assess how effectively models can exploit such context to
improve ASR accuracy. Extensive evaluation on ContextASR-Bench highlights that
LALMs outperform conventional ASR models by a large margin thanks to the strong
world knowledge and context modeling of LLMs, yet there remains ample room for
further improvement. The dataset and evaluation code have been released.
comment: 16 pages, 4 figures
♻ ☆ Improved Unbiased Watermark for Large Language Models ACL 2025
As artificial intelligence surpasses human capabilities in text generation,
the necessity to authenticate the origins of AI-generated content has become
paramount. Unbiased watermarks offer a powerful solution by embedding
statistical signals into language model-generated text without distorting the
quality. In this paper, we introduce MCmark, a family of unbiased,
Multi-Channel-based watermarks. MCmark works by partitioning the model's
vocabulary into segments and promoting token probabilities within a selected
segment based on a watermark key. We demonstrate that MCmark not only preserves
the original distribution of the language model but also offers significant
improvements in detectability and robustness over existing unbiased watermarks.
Our experiments with widely-used language models demonstrate an improvement in
detectability of over 10% using MCmark, compared to existing state-of-the-art
unbiased watermarks. This advancement underscores MCmark's potential in
enhancing the practical application of watermarking in AI-generated texts.
comment: ACL 2025 Main Conference
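The idea of promoting one vocabulary segment while leaving the model's distribution unchanged on average can be sketched as follows. This is an illustrative construction under stated assumptions, not MCmark's actual reweighting rule: the round-robin partition, the function names, and the specific promotion factor are hypothetical, and the channel index is assumed to be drawn uniformly from the watermark key.

```python
from typing import Dict, List


def split_vocab(vocab: List[str], num_channels: int) -> List[List[str]]:
    """Partition the vocabulary into `num_channels` segments (round-robin)."""
    return [vocab[i::num_channels] for i in range(num_channels)]


def reweight(probs: Dict[str, float],
             segments: List[List[str]],
             key: int) -> Dict[str, float]:
    """Promote tokens in segment `key` and demote the rest.

    With s = the smallest segment mass and m_j = the mass of segment j,
    tokens in the selected segment are scaled by 1 + s * (l - 1) / m_j and
    all other tokens by 1 - s / m_j. Each reweighted distribution sums to 1,
    and averaging over a uniformly random `key` recovers `probs` exactly,
    which is the unbiasedness property.
    """
    num = len(segments)
    masses = [sum(probs[t] for t in seg) for seg in segments]
    s = min(masses)
    if s == 0.0:
        return dict(probs)  # this simple scheme cannot promote; leave unchanged
    out: Dict[str, float] = {}
    for j, seg in enumerate(segments):
        if j == key:
            factor = 1.0 + s * (num - 1) / masses[j]
        else:
            factor = 1.0 - s / masses[j]
        for tok in seg:
            out[tok] = probs[tok] * factor
    return out


if __name__ == "__main__":
    vocab = ["a", "b", "c", "d"]
    probs = {"a": 0.5, "b": 0.2, "c": 0.2, "d": 0.1}
    segments = split_vocab(vocab, 2)
    for k in range(2):
        print(k, reweight(probs, segments, k))
```

A detector that knows the key can then test whether generated tokens fall in the selected segment more often than chance would predict; that step is omitted here.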
♻ ☆ Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks ACM MM 2025
As researchers continue to optimize AI agents for more effective task
execution within operating systems, they often overlook a critical security
concern: the ability of these agents to detect "impostors" within their
environment. Through an analysis of the agents' operational context, we
identify a significant threat: attackers can disguise malicious attacks as
environmental elements, injecting active disturbances into the agents'
execution processes to manipulate their decision-making. We define this novel
threat as the Active Environment Injection Attack (AEIA). Focusing on the
interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA
and identify two critical security vulnerabilities: (1) Adversarial content
injection in multimodal interaction interfaces, where attackers embed
adversarial instructions within environmental elements to mislead agent
decision-making; and (2) Reasoning gap vulnerabilities in the agent's task
execution process, which increase susceptibility to AEIA attacks during
reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN,
an attack scheme that exploits interaction vulnerabilities in mobile operating
systems to assess the robustness of MLLM-based agents. Experimental results
show that even advanced MLLMs are highly vulnerable to this attack, achieving a
maximum attack success rate of 93% on the AndroidWorld benchmark by combining
two vulnerabilities.
comment: Accepted at ACM MM 2025 Main Conference
♻ ☆ Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators
Human-generated reward signals are critical for aligning generative models
with human preferences, guiding both training and inference-time evaluations.
While large language models (LLMs) employed as proxy evaluators, i.e.,
LLM-as-a-Judge, significantly reduce the costs associated with manual
annotations, they typically require extensive modality-specific training data
and fail to generalize well across diverse multimodal tasks. In this paper, we
propose Flex-Judge, a reasoning-guided multimodal judge model that leverages
minimal textual reasoning data to robustly generalize across multiple
modalities and evaluation formats. Our core intuition is that structured
textual reasoning explanations inherently encode generalizable decision-making
patterns, enabling an effective transfer to multimodal judgments, e.g., with
images or videos. Empirical results demonstrate that Flex-Judge, despite being
trained on significantly less text data, achieves competitive or superior
performance compared to state-of-the-art commercial APIs and extensively
trained multimodal evaluators. Notably, Flex-Judge presents broad impact in
modalities such as molecules, where comprehensive evaluation benchmarks are scarce,
underscoring its practical value in resource-constrained domains. Our framework
highlights reasoning-based text supervision as a powerful, cost-effective
alternative to traditional annotation-intensive approaches, substantially
advancing scalable multimodal model-as-a-judge.
comment: The code is available at https://github.com/jongwooko/flex-judge
♻ ☆ AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity ACL 2025
Recently, large multimodal models (LMMs) have achieved significant
advancements. When dealing with high-resolution images, dominant LMMs typically
divide them into multiple local images and a global image, leading to a large
number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can
adaptively select the appropriate visual granularity based on the input image
and instruction. Specifically, we first apply multiple pooling layers to
obtain visual tokens at different granularities. Then we propose a visual
granularity router, which includes a Transformer layer, an MLP layer, and a
voter layer, used to select the appropriate visual granularity based on the
image and instruction. Furthermore, we put forward RGLF, a novel training
paradigm that aims at aligning the granularity predicted by the router with the
preferences of the LMM, without the need for additional manually annotated
data. Extensive experiments and analysis show that AVG-LLaVA achieves superior
performance across 11 benchmarks, while significantly reducing the number
of visual tokens and speeding up inference (e.g., an 85.3% reduction in visual
tokens and a 2.53$\times$ increase in inference speed on the AI2D benchmark).
comment: Accepted by ACL 2025 Findings
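As a rough illustration of how pooling yields visual tokens at several granularities, the sketch below average-pools a toy token grid with different kernel sizes. It is only a sketch under simplifying assumptions: each "token" is a single float rather than a feature vector, the kernel sizes are hypothetical, and the router that chooses among granularities is not modeled.

```python
from typing import Dict, List, Sequence

Grid = List[List[float]]


def avg_pool(grid: Grid, k: int) -> Grid:
    """Non-overlapping k x k average pooling over a 2D grid of scalar tokens."""
    h, w = len(grid), len(grid[0])
    if h % k or w % k:
        raise ValueError("grid height and width must be divisible by k")
    return [
        [
            sum(grid[r + dr][c + dc] for dr in range(k) for dc in range(k)) / (k * k)
            for c in range(0, w, k)
        ]
        for r in range(0, h, k)
    ]


def granularities(grid: Grid, kernel_sizes: Sequence[int] = (1, 2, 4)) -> Dict[int, Grid]:
    """Pooled grids for each kernel size (the sizes are illustrative only)."""
    return {k: avg_pool(grid, k) for k in kernel_sizes}


def token_count(grid: Grid) -> int:
    """Number of tokens in a grid."""
    return len(grid) * len(grid[0])


if __name__ == "__main__":
    grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    for k, pooled in granularities(grid).items():
        print(k, token_count(pooled))
```

Coarser pooling shrinks the token count quadratically in the kernel size, which is where the savings in visual tokens come from when a coarse granularity suffices for the instruction.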
♻ ☆ Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning
Personalizing jargon detection and explanation is essential for making
technical documents accessible to readers with diverse disciplinary
backgrounds. However, tailoring models to individual users typically requires
substantial annotation efforts and computational resources due to user-specific
finetuning. To address this, we present a systematic study of personalized
jargon detection, focusing on methods that are both efficient and scalable for
real-world deployment. We explore two personalization strategies: (1)
lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models,
and (2) personalized prompting, which tailors model behavior at inference time
without retraining. To reflect realistic constraints, we also investigate hybrid
approaches that combine limited annotated data with unsupervised user
background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in
F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably,
our method achieves comparable performance using only 10% of the annotated
training data, demonstrating its practicality for resource-constrained
settings. Our study is the first to systematically explore efficient,
low-resource personalization of jargon detection using open-source language
models, offering a practical path toward scalable, user-adaptive NLP systems.
♻ ☆ CAIN: Hijacking LLM-Humans Conversations via Malicious System Prompts
Large language models (LLMs) have advanced many applications, but are also
known to be vulnerable to adversarial attacks. In this work, we introduce a
novel security threat: hijacking AI-human conversations by manipulating LLMs'
system prompts to produce malicious answers only to specific targeted questions
(e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"),
while behaving benignly on others. This attack is detrimental as it can enable
malicious actors to exercise large-scale information manipulation by spreading
harmful but benign-looking system prompts online. To demonstrate such an
attack, we develop CAIN, an algorithm that can automatically curate such
harmful system prompts for a specific target question in a black-box setting,
without the need to access the LLM's parameters. Evaluated on both open-source
and commercial LLMs, CAIN demonstrates significant adversarial impact. In
untargeted attacks, i.e., forcing LLMs to output incorrect answers, CAIN achieves
up to 40% F1 degradation on targeted questions while preserving high accuracy
on benign inputs. For targeted attacks, i.e., forcing LLMs to output specific
harmful answers, CAIN achieves over 70% F1 scores on these targeted responses
with minimal impact on benign questions. Our results highlight the critical
need for enhanced robustness measures to safeguard the integrity and safety of
LLMs in real-world applications. All source code will be publicly available.
♻ ☆ LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points
The advancement of large language models (LLMs) struggles with the scarcity
of high-quality, diverse training data. To address this limitation, we propose
LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that
enables flexible control over discipline and difficulty distributions while
balancing KP coverage and popularity. LinkSyn extracts KPs from
question-answering (QA) seed data and constructs a KP graph to synthesize
diverse QA data from multiple seeds strongly linked by KPs and sampled from
graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution
value function to guide the adjustment of path sampling probability and balance
KP coverage and popularity during graph walks; (2) diffusion-based synthesis
via DeepSeek-R1 by leveraging multiple seeds with dense logical associations
along each path; and (3) high-difficulty QA enhancement within given
disciplines by flexible difficulty adjustments. By executing LinkSyn, we
synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens.
Extensive experiments on Llama-3 8B demonstrate that continual pre-training
with LinkQA yields an average improvement of 11.51% on MMLU and
CMMLU, establishing new SOTA results. LinkQA consistently enhances performance
across model size and initial FLOPs scales.
♻ ☆ Assessing Agentic Large Language Models in Multilingual National Bias ACL 2025
Large Language Models have garnered significant attention for their
capabilities in multilingual natural language processing, while studies on
risks associated with cross biases are limited to immediate context
preferences. Cross-language disparities in reasoning-based recommendations
remain largely unexplored, with a lack of even descriptive analysis. This study
is the first to address this gap. We test LLMs' applicability and capability in
providing personalized advice across three key scenarios: university
applications, travel, and relocation. We investigate multilingual bias in
state-of-the-art LLMs by analyzing their responses to decision-making tasks
across multiple languages. We quantify bias in model-generated scores and
assess the impact of demographic factors and reasoning strategies (e.g.,
Chain-of-Thought prompting) on bias patterns. Our findings reveal that local
language bias is prevalent across different tasks, with GPT-4 and Sonnet
reducing bias for English-speaking countries compared to GPT-3.5 but failing to
achieve robust multilingual alignment, highlighting broader implications for
multilingual AI agents and applications such as education. Code is
available at https://github.com/yiyunya/assess_agentic_national_bias
comment: Accepted to ACL 2025 Findings. 14 pages
♻ ☆ Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
As AI advances in text generation, human trust in AI generated content
remains constrained by biases that go beyond concerns of accuracy. This study
explores how bias shapes the perception of AI versus human generated content.
Through three experiments involving text rephrasing, news article
summarization, and persuasive writing, we investigated how human raters respond
to labeled and unlabeled content. While the raters could not differentiate the
two types of texts in the blind test, they overwhelmingly favored content
labeled "Human Generated" over content labeled "AI Generated," by a
preference score of over 30%. We observed the same pattern even when the labels
were deliberately swapped. This human bias against AI has broader societal and
cognitive implications, as it undervalues AI performance. This study highlights
the limitations of human judgment in interacting with AI and offers a
foundation for improving human-AI collaboration, especially in creative fields.
comment: 5 main pages, 10 total pages
♻ ☆ Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
Large transformer-based language models dominate modern NLP, yet our
understanding of how they encode linguistic information is rooted in studies of
early models like BERT and GPT-2. To better understand today's language models,
we investigate how 25 models - from classical architectures (BERT, DeBERTa,
GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5,
Llama-3.1) - represent lexical identity and inflectional morphology across six
typologically diverse languages. Using linear and nonlinear classifiers trained
on hidden activations, we predict word lemmas and inflectional features layer
by layer. We find that models concentrate lexical information linearly in early
layers and increasingly nonlinearly in later layers, while keeping inflectional
information uniformly accessible and linearly separable throughout. Additional
experiments probe the nature of these encodings: attention and residual
analyses examine where within layers information can be recovered, steering
vector experiments test what information can be functionally manipulated, and
intrinsic dimensionality analyses explore how the representational structure
evolves across layers. Remarkably, these encoding patterns emerge across all
models we test, despite differences in architecture, size, and training regime
(pretrained and instruction-tuned variants). This suggests that, even with
substantial advances in LLM technologies, transformer models organize
linguistic information in similar ways, indicating that these properties are
important for next token prediction and are learned early during pretraining.
Our code is available at https://github.com/ml5885/model_internal_sleuthing
comment: INTERPLAY Workshop COLM 2025
♻ ☆ Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer
Prompt tuning has emerged as a lightweight strategy for adapting foundation
models to downstream tasks, particularly for resource-constrained systems. As
pre-trained prompts become valuable assets, combining multiple source prompts
offers a promising approach to enhance generalization for new tasks by
leveraging complementary knowledge. However, naive aggregation often overlooks
that different source prompts contribute differently to the target
task. To address this, we propose HGPrompt, a dynamic framework that learns
optimal ensemble weights. These weights are optimized by jointly maximizing an
information-theoretic metric for transferability and minimizing gradient
conflicts via a novel regularization strategy. Specifically, we propose a
differentiable prompt transferability metric to capture the discriminability
of prompt-induced features on the target task. Meanwhile, HGPrompt matches the
gradient variances with respect to different source prompts based on Hessian
and Fisher Information, ensuring stable and coherent knowledge transfer while
suppressing gradient conflicts among them. Extensive experiments on the
large-scale VTAB benchmark demonstrate the state-of-the-art performance of
HGPrompt, validating its effectiveness in learning an optimal ensemble for
effective multi-source prompt transfer.
♻ ☆ Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
This paper presents a multifunctional speech synthesis system that integrates
voice cloning and emotion control speech synthesis within a unified framework.
The goal of this work is to address longstanding challenges in achieving highly
expressive, controllable, and natural speech generation that faithfully
preserves speaker identity across diverse linguistic and emotional contexts.
Our approach introduces an effective speaker-emotion disentanglement mechanism
with in-batch contrastive learning, enabling independent manipulation of
speaker identity and emotional style, as well as a rotational emotional
embedding integration method for smooth emotion control. To support
comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality
emotional speech dataset containing 10 hours of Mandarin speech from six
professional speakers across seven emotional categories. Extensive experiments
demonstrate that our system, Marco-Voice, achieves substantial improvements in
both objective and subjective metrics. Comprehensive evaluations and analyses
show that Marco-Voice delivers competitive performance
in terms of speech clarity and emotional richness, representing a substantial
advance in the field of expressive neural speech synthesis. Our code and
dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and
https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.
comment: Technical Report
♻ ☆ Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs
Long chain-of-thought (CoT) reasoning has shown great promise in enhancing
the emotion understanding performance of large language models (LLMs). However,
current fixed-length CoT methods struggle to balance reasoning depth and
efficiency. Simple tasks (e.g., sentiment classification) are over-reasoned,
while complex tasks (e.g., sarcasm understanding) lack depth. To fill this gap,
we present Emotion-o1, an adaptive CoT framework that dynamically adjusts
reasoning length based on emotion-task complexity. Emotion-o1 is trained by
distilling adaptive CoT patterns from a reasoning-oriented LLM, followed by
supervised fine-tuning and reinforcement learning with a four-part reward
targeting accuracy, brevity, structure, and redundancy. Experimental results on
four emotion tasks highlight: (1) Emotion-o1 demonstrates significant
improvements over its backbone, with F1 score increases of 10% (Sentiment),
5% (Emotion), 18% (Humor), and 27% (Sarcasm). (2) In sentiment and sarcasm tasks,
our 8B model demonstrates superior performance against advanced LLMs,
outperforming Grok-3 by 1.1% and Claude-3.7 by 2%. (3) The framework maintains
accuracy while reducing reasoning length by 83% compared to OpenAI-o1,
demonstrating effective precision-efficiency optimization. Emotion-o1
effectively balances reasoning depth and efficiency for emotion understanding
in LLMs.
♻ ☆ Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models
Reasoning large language models (RLLMs) have recently demonstrated remarkable
capabilities through structured and multi-step reasoning. While prior research
has primarily focused on improving their training and inference strategies,
their potential for in-context learning (ICL) remains largely underexplored. To
fill this gap, we propose Thinking with Nothinking Calibration (JointThinking),
a new ICL paradigm that leverages the structured difference between two
reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy.
Specifically, our method prompts the model to generate two answers in parallel:
one in Thinking mode and the other in Nothinking mode. A second round of
Thinking is triggered only when the two initial responses are inconsistent,
using a single prompt that incorporates the original question and both
candidate answers. Since such disagreement occurs infrequently (e.g., only 6%
in GSM8K), our method performs just one round of reasoning in most cases,
resulting in minimal latency overhead. Extensive experiments across multiple
reasoning benchmarks demonstrate that JointThinking significantly outperforms
few-shot chain-of-thought (CoT) and majority voting with improved answer
robustness. Moreover, it achieves comparable in-distribution performance to
the training-based SOTA method, while substantially outperforming it on
out-of-distribution tasks. We further conduct a systematic analysis of the
calibration mechanism, showing that leveraging different reasoning modes
consistently lowers the error rate and highlights the value of structural
thinking diversity. Additionally, we observe that the performance gap between
actual and ideal reasoning narrows as model size increases in the second round
of thinking, indicating the strong scalability of our approach. Finally, we
discuss current limitations and outline promising directions for future ICL
research in RLLMs.
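The two-round procedure described in this abstract can be sketched as follows.
This is a minimal illustration under stated assumptions, not the authors'
implementation: the function name joint_thinking, the callables think and
no_think (stand-ins for one model queried in its two reasoning modes), the
normalize helper, and the wording of the second-round prompt are all
hypothetical.

```python
from typing import Callable, Tuple


def normalize(answer: str) -> str:
    """Hypothetical answer normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def joint_thinking(
    question: str,
    think: Callable[[str], str],
    no_think: Callable[[str], str],
) -> Tuple[str, int]:
    """Return (final_answer, number_of_thinking_rounds).

    `think` and `no_think` are assumed wrappers around one model queried in
    Thinking and Nothinking mode; they are not part of any real API.
    """
    # Round 1: one answer per mode (run sequentially here for simplicity).
    answer_think = think(question)
    answer_fast = no_think(question)

    # Consistent answers: accept the result after a single thinking round.
    if normalize(answer_think) == normalize(answer_fast):
        return answer_think, 1

    # Inconsistent answers: one more Thinking call with a single prompt that
    # contains the original question and both candidate answers.
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer A: {answer_think}\n"
        f"Candidate answer B: {answer_fast}\n"
        "The candidates disagree. Reason carefully and give the final answer."
    )
    return think(prompt), 2
```

Because the second call happens only on disagreement, the average number of
thinking rounds stays close to one when the two modes usually agree.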
♻ ☆ Tool Unlearning for Tool-Augmented LLMs ICML 2025
Tool-augmented large language models (LLMs) are often trained on datasets of
query-response pairs, which embed the ability to use tools or APIs directly
into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to
forget learned tools due to security vulnerabilities, privacy regulations, or
tool deprecations. However, "tool unlearning" has not been investigated in
the unlearning literature. We introduce this novel task, which requires addressing
distinct challenges compared to traditional unlearning: knowledge removal
rather than forgetting individual samples, the high cost of optimizing LLMs,
and the need for principled evaluation metrics. To bridge these gaps, we
propose ToolDelete, the first approach for unlearning tools from tool-augmented
LLMs. It implements three key properties to address the above challenges for
effective tool unlearning and introduces a new membership inference attack
(MIA) model for effective evaluation. Extensive experiments on multiple tool
learning datasets and tool-augmented LLMs show that ToolDelete effectively
unlearns randomly selected tools, while preserving the LLM's knowledge on
non-deleted tools and maintaining performance on general tasks.
comment: ICML 2025 https://clu-uml.github.io/MU-Bench-Project-Page/
♻ ☆ Reliable Evaluation Protocol for Low-Precision Retrieval
Lowering the numerical precision of model parameters and computations is
widely adopted to improve the efficiency of retrieval systems. However, when
computing relevance scores between the query and documents in low-precision, we
observe spurious ties due to the reduced granularity. This introduces high
variability in the results based on tie resolution, making the evaluation less
reliable. To address this, we propose a more robust retrieval evaluation
protocol designed to reduce score variation. It consists of: (1) High-Precision
Scoring (HPS), which upcasts the final scoring step to higher precision to
resolve tied candidates with minimal computational cost; and (2) Tie-aware
Retrieval Metrics (TRM), which report expected scores, range, and bias to
quantify order uncertainty of tied candidates. Our experiments test multiple
models with three scoring functions on two retrieval datasets to demonstrate
that HPS dramatically reduces tie-induced instability, and TRM accurately
recovers expected metric values. This combination enables a more consistent and
reliable evaluation system for lower-precision retrievals.
comment: 11 pages, 5 figures, submitted to ARR
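The tie problem and the two remedies named in this abstract can be illustrated
with a small sketch. It assumes reciprocal rank of a single relevant document
as the metric and uniformly random tie-breaking; both choices, and the helper
names to_fp16 and expected_reciprocal_rank, are assumptions made for
illustration rather than the paper's definitions. High-precision scoring is
mimicked by simply keeping the original Python floats, and the bias statistic
mentioned in the abstract is not modeled.

```python
import struct
from typing import List, Tuple


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def expected_reciprocal_rank(
    scores: List[float], relevant_index: int
) -> Tuple[float, float, float]:
    """Return (expected, worst, best) reciprocal rank of one relevant document.

    Documents with identical scores form a tie group; the relevant document is
    assumed equally likely to land at any position inside its group.
    """
    s = scores[relevant_index]
    higher = sum(1 for v in scores if v > s)
    tied = sum(1 for v in scores if v == s)  # includes the relevant document
    reciprocal_ranks = [1.0 / r for r in range(higher + 1, higher + tied + 1)]
    expected = sum(reciprocal_ranks) / tied
    return expected, min(reciprocal_ranks), max(reciprocal_ranks)


# Two scores that differ in full precision but collapse to one fp16 value.
full_precision = [0.70012, 0.70031, 0.5]
low_precision = [to_fp16(v) for v in full_precision]
```

With these values, expected_reciprocal_rank(low_precision, 1) reports an
expected value of 0.75 with range 0.5 to 1.0, while the same call on
full_precision returns 1.0 with no spread, since the tie disappears.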
♻ ☆ EcoTransformer: Attention without Multiplication
The Transformer, with its scaled dot-product attention mechanism, has become
a foundational architecture in modern AI. However, this mechanism is
computationally intensive and incurs substantial energy costs. We propose a new
Transformer architecture EcoTransformer, in which the output context vector is
constructed as the convolution of the values using a Laplacian kernel, where
the distances are measured by the L1 metric between the queries and keys.
Compared to dot-product based attention, the new attention score calculation is
free of matrix multiplication. It performs on par with, or even surpasses,
scaled dot-product attention in NLP, bioinformatics, and vision tasks, while
consuming significantly less energy.
(This version (v2) supersedes v1 and reflects the intended release and
licensing.)
comment: 8 pages, 1 figure
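A minimal sketch of attention weights built from a Laplacian kernel over L1
distances, as this abstract describes. Only the kernel form exp(-d / scale)
with d the L1 distance between a query and a key is taken from the abstract;
the function name l1_laplacian_attention, the scale parameter, and the
normalization of the weights to sum to one are assumptions. The distance
computation uses only subtraction, absolute value, and addition; the weighted
sum over values in this sketch still uses ordinary multiplication.

```python
import math
from typing import List

Vector = List[float]


def l1_laplacian_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    scale: float = 1.0,
) -> List[Vector]:
    """Attend over `values` with weights proportional to exp(-L1(q, k) / scale)."""
    outputs = []
    for q in queries:
        # L1 distance between the query and every key: no products involved.
        dists = [sum(abs(a - b) for a, b in zip(q, k)) for k in keys]
        # Shift by the smallest distance so the largest weight is exactly 1;
        # the normalized weights are unchanged by this shift.
        d_min = min(dists)
        weights = [math.exp(-(d - d_min) / scale) for d in dists]
        total = sum(weights)
        # Normalized weighted average of the value vectors (an assumption).
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values)) / total
                for j in range(len(values[0]))
            ]
        )
    return outputs
```

Keys closer to the query in L1 distance receive larger weights, playing the
role that larger dot products play in scaled dot-product attention.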
♻ ☆ CLaSP: Learning Concepts for Time-Series Signals from Natural Language Supervision
This paper presents CLaSP, a novel model for retrieving time-series signals
using natural language queries that describe signal characteristics. The
ability to search time-series signals based on descriptive queries is essential
in domains such as industrial diagnostics, where data scientists often need to
find signals with specific characteristics. However, existing methods rely on
sketch-based inputs, predefined synonym dictionaries, or domain-specific manual
designs, limiting their scalability and adaptability. CLaSP addresses these
challenges by employing contrastive learning to map time-series signals to
natural language descriptions. Unlike prior approaches, it eliminates the need
for predefined synonym dictionaries and leverages the rich contextual knowledge
of large language models (LLMs). Using the TRUCE and SUSHI datasets, which pair
time-series signals with natural language descriptions, we demonstrate that
CLaSP achieves high accuracy in retrieving a variety of time series patterns
based on natural language queries.
♻ ☆ CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors
The widespread use of Large Language Models (LLMs) in many applications marks
a significant advance in research and practice. However, their complexity and
hard-to-understand nature make them vulnerable to attacks, especially
jailbreaks designed to produce harmful responses. To counter these threats,
developing strong detection methods is essential for the safe and reliable use
of LLMs. This paper studies this detection problem using the Contextual
Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce
environments. We propose a novel method leveraging the latent space
characteristics of Contextual Co-occurrence Matrices and Tensors for the
effective identification of adversarial and jailbreak prompts. Our evaluations
show that this approach achieves a notable F1 score of 0.83 using only 0.5% of
labeled prompts, which is a 96.6% improvement over baselines. This result
highlights the strength of our learned patterns, especially when labeled data
is scarce. Our method is also significantly faster, with speedups ranging from 2.3 to
128.4 times compared to the baseline models. To support future research and
reproducibility, we have made our implementation publicly available.