Computation and Language
☆ VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but
they cannot reliably verify their own logic. Even when they reach correct
answers, the underlying reasoning may be flawed, undermining trust in
high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a
neuro-symbolic method that extracts and verifies formal logical arguments from
CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order
logic and identifies premises that ground the argument in source context,
commonsense knowledge, or prior reasoning steps. The symbolic representation
enables automated solvers to verify logical validity while the NL premises
allow humans and systems to identify ungrounded or fallacious reasoning steps.
Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT
effectively identifies flawed reasoning, and serves as a strong predictor of
final answer correctness. We also leverage VeriCoT's verification signal for
(1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on
VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct
preference optimization (DPO) using verification-based pairwise rewards,
further improving reasoning validity and accuracy.
☆ Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning NeurIPS 2025
Chain-of-Thought (CoT) prompting is a key technique for enabling complex
reasoning in large language models. However, generating full, fixed-length
rationales is computationally wasteful, inflating both token usage and latency.
We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free
decoding algorithm that adaptively halts rationale generation. LEASH monitors
two intrinsic signals: the slope of token-level entropy and the improvement in
the top-logit margin. It terminates the generation once both signals plateau,
indicating the model has reached a stable reasoning state. Across four
instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces
average token generation by 30--35% and latency by 27%, while incurring a 10
p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no
additional training or supervision, offering a simple and efficient alternative
to CoT decoding.
comment: Presented at the 1st Workshop on Efficient Reasoning (NeurIPS 2025)
☆ DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration
Cooperative multi-agent planning requires agents to make joint decisions with
partial information and limited communication. Coordination at the trajectory
level often fails, as small deviations in timing or movement cascade into
conflicts. Symbolic planning mitigates this challenge by raising the level of
abstraction and providing a minimal vocabulary of actions that enable
synchronization and collective progress. We present DR. WELL, a decentralized
neurosymbolic framework for cooperative multi-agent planning. Cooperation
unfolds through a two-phase negotiation protocol: agents first propose
candidate roles with reasoning and then commit to a joint allocation under
consensus and environment constraints. After commitment, each agent
independently generates and executes a symbolic plan for its role without
revealing detailed trajectories. Plans are grounded in execution outcomes via a
shared world model that encodes the current state and is updated as agents act.
By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids
brittle step-level alignment and enables higher-level operations that are
reusable, synchronizable, and interpretable. Experiments on cooperative
block-push tasks show that agents adapt across episodes, with the dynamic world
model capturing reusable patterns and improving task completion rates and
efficiency. Experiments on cooperative block-push tasks show that our dynamic
world model improves task completion and efficiency through negotiation and
self-refinement, trading a time overhead for evolving, more efficient
collaboration strategies.
☆ When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
The proliferation of misinformation necessitates robust yet computationally
efficient fact verification systems. While current state-of-the-art approaches
leverage Large Language Models (LLMs) for generating explanatory rationales,
these methods face significant computational barriers and hallucination risks
in real-world deployments. We present DeReC (Dense Retrieval Classification), a
lightweight framework that demonstrates how general-purpose text embeddings can
effectively replace autoregressive LLM-based approaches in fact verification
tasks. By combining dense retrieval with specialized classification, our system
achieves better accuracy while being significantly more efficient. DeReC
outperforms explanation-generating LLMs in efficiency, reducing runtime by 95%
on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92%
on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds),
showcasing its effectiveness across varying dataset sizes. On the RAWFC
dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art
method L-Defense (61.20%). Our results demonstrate that carefully engineered
retrieval-based systems can match or exceed LLM performance in specialized
tasks while being significantly more practical for real-world deployment.
☆ Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Natural language interfaces to tabular data must handle ambiguities inherent
to queries. Instead of treating ambiguity as a deficiency, we reframe it as a
feature of cooperative interaction, where the responsibility of query
specification is shared among the user and the system. We develop a principled
framework distinguishing cooperative queries, i.e., queries that yield a
resolvable interpretation, from uncooperative queries that cannot be resolved.
Applying the framework to evaluations for tabular question answering and
analysis, we analyze the queries in 15 popular datasets, and observe an
uncontrolled mixing of query types neither adequate for evaluating a system's
execution accuracy nor for evaluating interpretation capabilities. Our
framework and analysis of queries shifts the perspective from fixing ambiguity
to embracing cooperation in resolving queries. This reflection enables more
informed design and evaluation for natural language interfaces for tabular
data, for which we outline implications and directions for future research.
comment: Accepted to the AI for Tabular Data workshop at EurIPS 2025
☆ Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
comment: Issues, comments, and questions are all welcome in
https://github.com/Agent4Science-UTokyo/Jr.AI-Scientist
☆ Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that the video generation
model is the potential unified multimodal understanding and generation model,
positions "thinking with video" as a unified multimodal reasoning paradigm.
comment: 36 pages, 14 figures
☆ BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering
Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury
Developing accurate biomedical Question Answering (QA) systems in
low-resource languages remains a major challenge, limiting equitable access to
reliable medical knowledge. This paper introduces BanglaMedQA and
BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice
Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical
artificial intelligence (AI). The study applies and benchmarks several
Retrieval-Augmented Generation (RAG) strategies, including Traditional,
Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining
textbook-based and web retrieval with generative reasoning to improve factual
accuracy. A key novelty lies in integrating a Bangla medical textbook corpus
through Optical Character Recognition (OCR) and implementing an Agentic RAG
pipeline that dynamically selects between retrieval and reasoning strategies.
Experimental results show that the Agentic RAG achieved the highest accuracy
89.54% with openai/gpt-oss-120b, outperforming other configurations and
demonstrating superior rationale quality. These findings highlight the
potential of RAG-based methods to enhance the reliability and accessibility of
Bangla medical QA, establishing a foundation for future research in
multilingual medical artificial intelligence.
comment: Under Review
☆ From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting
As the role of Large Language Models (LLM)-based coding assistants in
software development becomes more critical, so does the role of the bugs they
generate in the overall cybersecurity landscape. While a number of LLM code
security benchmarks have been proposed alongside approaches to improve the
security of generated code, it remains unclear to what extent they have
impacted widely used coding LLMs. Here, we show that even the latest
open-weight models are vulnerable in the earliest reported vulnerability
scenarios in a realistic use setting, suggesting that the safety-functionality
trade-off has until now prevented effective patching of vulnerabilities. To
help address this issue, we introduce a new severity metric that reflects the
risk posed by an LLM-generated vulnerability, accounting for vulnerability
severity, generation chance, and the formulation of the prompt that induces
vulnerable code generation - Prompt Exposure (PE). To encourage the mitigation
of the most serious and prevalent vulnerabilities, we use PE to define the
Model Exposure (ME) score, which indicates the severity and prevalence of
vulnerabilities a model generates.
☆ IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection AAAI
Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy, Matthias Scheutz
We present IntelliProof, an interactive system for analyzing argumentative
essays through LLMs. IntelliProof structures an essay as an argumentation
graph, where claims are represented as nodes, supporting evidence is attached
as node properties, and edges encode supporting or attacking relations. Unlike
existing automated essay scoring systems, IntelliProof emphasizes the user
experience: each relation is initially classified and scored by an LLM, then
visualized for enhanced understanding. The system provides justifications for
classifications and produces quantitative measures for essay coherence. It
enables rapid exploration of argumentative quality while retaining human
oversight. In addition, IntelliProof provides a set of tools for a better
understanding of an argumentative essay and its corresponding graph in natural
language, bridging the gap between the structural semantics of argumentative
essays and the user's understanding of a given text. A live demo and the system
are available here to try: \textbf{https://intelliproof.vercel.app}
comment: Accepted for the 40th Annual AAAI Conference on Artificial
Intelligence (2026) - Demonstration Track
☆ Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
When a language model generates text, the selection of individual tokens
might lead it down very different reasoning paths, making uncertainty difficult
to quantify. In this work, we consider whether reasoning language models
represent the alternate paths that they could take during generation. To test
this hypothesis, we use hidden activations to control and predict a language
model's uncertainty during chain-of-thought reasoning. In our experiments, we
find a clear correlation between how uncertain a model is at different tokens,
and how easily the model can be steered by controlling its activations. This
suggests that activation interventions are most effective when there are
alternate paths available to the model -- in other words, when it has not yet
committed to a particular final answer. We also find that hidden activations
can predict a model's future outcome distribution, demonstrating that models
implicitly represent the space of possible paths.
☆ Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi
Radiology reports are invaluable for clinical decision-making and hold great
potential for automated analysis when structured into machine-readable formats.
These reports often contain uncertainty, which we categorize into two distinct
types: (i) Explicit uncertainty reflects doubt about the presence or absence of
findings, conveyed through hedging phrases. These vary in meaning depending on
the context, making rule-based systems insufficient to quantify the level of
uncertainty for specific findings; (ii) Implicit uncertainty arises when
radiologists omit parts of their reasoning, recording only key findings or
diagnoses. Here, it is often unclear whether omitted findings are truly absent
or simply unmentioned for brevity. We address these challenges with a two-part
framework. We quantify explicit uncertainty by creating an expert-validated,
LLM-based reference ranking of common hedging phrases, and mapping each finding
to a probability value based on this reference. In addition, we model implicit
uncertainty through an expansion framework that systematically adds
characteristic sub-findings derived from expert-defined diagnostic pathways for
14 common diagnoses. Using these methods, we release Lunguage++, an expanded,
uncertainty-aware version of the Lunguage benchmark of fine-grained structured
radiology reports. This enriched resource enables uncertainty-aware image
classification, faithful diagnostic reasoning, and new investigations into the
clinical impact of diagnostic uncertainty.
☆ RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Retrieval-Augmented Generation (RAG) is a critical technique for grounding
Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in
specialized, safety-critical domains remains a significant challenge. Existing
evaluation frameworks often rely on heuristic-based metrics that fail to
capture domain-specific nuances and other works utilize LLM-as-a-Judge
approaches that lack validated alignment with human judgment. This paper
introduces RAGalyst, an automated, human-aligned agentic framework designed for
the rigorous evaluation of domain-specific RAG systems. RAGalyst features an
agentic pipeline that generates high-quality, synthetic question-answering (QA)
datasets from source documents, incorporating an agentic filtering step to
ensure data fidelity. The framework refines two key LLM-as-a-Judge
metrics-Answer Correctness and Answerability-using prompt optimization to
achieve a strong correlation with human annotations. Applying this framework to
evaluate various RAG components across three distinct domains (military
operations, cybersecurity, and bridge engineering), we find that performance is
highly context-dependent. No single embedding model, LLM, or hyperparameter
configuration proves universally optimal. Additionally, we provide an analysis
on the most common low Answer Correctness reasons in RAG. These findings
highlight the necessity of a systematic evaluation framework like RAGalyst,
which empowers practitioners to uncover domain-specific trade-offs and make
informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our Github.
☆ Large language models replicate and predict human cooperation across experiments in game theory
Large language models (LLMs) are increasingly used both to make decisions in
domains such as health, education and law, and to simulate human behavior. Yet
how closely LLMs mirror actual human decision-making remains poorly understood.
This gap is critical: misalignment could produce harmful outcomes in practical
applications, while failure to replicate human behavior renders LLMs
ineffective for social simulations. Here, we address this gap by developing a
digital twin of game-theoretic experiments and introducing a systematic
prompting and probing framework for machine-behavioral evaluation. Testing
three open-source models (Llama, Mistral and Qwen), we find that Llama
reproduces human cooperation patterns with high fidelity, capturing human
deviations from rational choice theory, while Qwen aligns closely with Nash
equilibrium predictions. Notably, we achieved population-level behavioral
replication without persona-based prompting, simplifying the simulation
process. Extending beyond the original human-tested games, we generate and
preregister testable hypotheses for novel game configurations outside the
original parameter grid. Our findings demonstrate that appropriately calibrated
LLMs can replicate aggregate human behavioral patterns and enable systematic
exploration of unexplored experimental spaces, offering a complementary
approach to traditional research in the social and behavioral sciences that
generates new empirical predictions about human social decision-making.
☆ Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering AACL 2025
As Large Language Models (LLMs) become integral to human-centered
applications, understanding their personality-like behaviors is increasingly
important for responsible development and deployment. This paper systematically
evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to
assess trait expressions under varying sampling temperatures. We find
significant differences across four of the five personality dimensions, with
Neuroticism and Extraversion susceptible to temperature adjustments. Further,
hierarchical clustering reveals distinct model clusters, suggesting that
architectural features may predispose certain models toward stable trait
profiles. Taken together, these results offer new insights into the emergence
of personality-like patterns in LLMs and provide a new perspective on model
tuning, selection, and the ethical governance of AI systems. We share the data
and code for this analysis here:
https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1
comment: Accepted at IJCNLP-AACL 2025
☆ OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation EMNLP2025
This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task
(Alva-Manchego et al., 2025), designed for readability-controlled text
simplification using LLM-prompting-based generation. Based on the analysis of
prompt-based text simplification methods, we discovered an interesting finding
that text simplification performance is highly related to the gap between the
source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by
this finding, we propose two multi-round simplification methods and generate
them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based
LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams.
Later improvements with MRS-Joint show that taking the LLM simplified
candidates as the starting point could further boost the multi-round
simplification performance.
comment: Accepted to TSAR 2025 Workshop at EMNLP2025
☆ RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables
Existing tabular reasoning benchmarks mostly test models on small, uniform
tables, underrepresenting the complexity of real-world data and giving an
incomplete view of Large Language Models' (LLMs) reasoning abilities. Real
tables are long, heterogeneous, and domain-specific, mixing structured fields
with free text and requiring multi-hop reasoning across thousands of tokens. To
address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from
2031 real-world tables spanning two domains: i) RB-Science (NSF grant records)
and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates
LLMs jointly across scale, heterogeneity, domain specificity, and reasoning
complexity. Experiments with open-source and proprietary models show that LLMs
struggle with heterogeneous schemas and complex multi-hop inference, revealing
persistent weaknesses in current architectures and prompting strategies.
RUST-BENCH establishes a challenging new testbed for advancing tabular
reasoning research.
☆ ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai AACL 2025
Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
We present ThaiOCRBench, the first comprehensive benchmark for evaluating
vision-language models (VLMs) on Thai text-rich visual understanding tasks.
Despite recent progress in multimodal modeling, existing benchmarks
predominantly focus on high-resource languages, leaving Thai underrepresented,
especially in tasks requiring document structure understanding. ThaiOCRBench
addresses this gap by offering a diverse, human-annotated dataset comprising
2,808 samples across 13 task categories. We evaluate a wide range of
state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and
open-source systems. Results show a significant performance gap, with
proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source
counterparts. Notably, fine-grained text recognition and handwritten content
extraction exhibit the steepest performance drops among open-source models.
Through detailed error analysis, we identify key challenges such as language
bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a
standardized framework for assessing VLMs in low-resource, script-complex
settings, and provides actionable insights for improving Thai-language document
understanding.
comment: Accepted at the IJCNLP-AACL 2025 (Main)
☆ Probabilistic Textual Time Series Depression Detection
Accurate and interpretable predictions of depression severity are essential
for clinical decision support, yet existing models often lack uncertainty
estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time
Series Depression Detection framework that predicts PHQ-8 scores from
utterance-level clinical interviews while modeling uncertainty over time. PTTSD
includes sequence-to-sequence and sequence-to-one variants, both combining
bidirectional LSTMs, self-attention, and residual connections with Gaussian or
Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC
and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only
systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated
prediction intervals. Ablations confirm the value of attention and
probabilistic modeling, while comparisons with MentalBERT establish generality.
A three-part calibration analysis and qualitative case studies further
highlight the interpretability and clinical relevance of uncertainty-aware
forecasting.
comment: 14 pages, 8 figures, 4 tables
☆ Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs
Retrieval of information from graph-structured knowledge bases represents a
promising direction for improving the factuality of LLMs. While various
solutions have been proposed, a comparison of methods is difficult due to the
lack of challenging QA datasets with ground-truth targets for graph retrieval.
We present SynthKGQA, a framework for generating high-quality synthetic
Knowledge Graph Question Answering datasets from any Knowledge Graph, providing
the full set of ground-truth facts in the KG to reason over each question. We
show how, in addition to enabling more informative benchmarking of KG
retrievers, the data produced with SynthKGQA also allows us to train better
models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset
designed to test zero-shot generalization abilities of KG retrievers with
respect to unseen graph structures and relation types, and benchmark popular
solutions for KG-augmented LLMs on it.
☆ If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
In this study, we experiment with the ability of LLMs to do temporal
reasoning. Using a Norwegian book from 1940 containing trivia questions, we
prompt the LLMs to answer the questions as if it were 1940. We also pose the
questions in both English and Norwegian. Correct answers are often presented as
sentences, and grading is done by means of LLM-as-judge, with sampled checks by
a native speaker. Prompting in English consistently gave better results than in
Norwegian, an unexpected result. In contrast, using larger LLMs improved
results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families,
and also the largest available LLM especially crafted for Norwegian.
comment: 8 pages, 1 figure, 3 tables, submitted to aconference
☆ The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is
critical for trustworthy deployment. While real-world language is inherently
ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically
benchmarked against tasks with no ambiguity. In this work, we demonstrate that
while current uncertainty estimators perform well under the restrictive
assumption of no ambiguity, they degrade to close-to-random performance on
ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first
ambiguous question-answering (QA) datasets equipped with ground-truth answer
distributions estimated from factual co-occurrence. We find this performance
deterioration to be consistent across different estimation paradigms: using the
predictive distribution itself, internal representations throughout the model,
and an ensemble of models. We show that this phenomenon can be theoretically
explained, revealing that predictive-distribution and ensemble-based estimators
are fundamentally limited under ambiguity. Overall, our study reveals a key
shortcoming of current UQ methods for LLMs and motivates a rethinking of
current modeling paradigms.
☆ Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Data quality and its effective selection are fundamental to improving the
performance of machine translation models, serving as cornerstones for
achieving robust and reliable translation systems. This paper presents a data
selection methodology specifically designed for fine-tuning machine translation
systems, which leverages the synergy between a learner model and a pre-trained
reference model to enhance overall training effectiveness. By defining a
learnability score, our approach systematically evaluates the utility of data
points for training, ensuring that only the most relevant and impactful
examples contribute to the fine-tuning process. Furthermore, our method employs
a batch selection strategy which considers interdependencies among data points,
optimizing the efficiency of the training process while maintaining a focus on
data relevance. Experiments on English to Persian and several other language
pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that
our method can achieve up to a fivefold improvement in data efficiency compared
to an iid baseline. Experimental results indicate that our approach improves
computational efficiency by 24 when utilizing cached embeddings, as it requires
fewer training data points. Additionally, it enhances generalization, resulting
in superior translation performance compared to random selection method.
☆ SSPO: Subsentence-level Policy Optimization
As a significant part of post-training of the Large Language Models (LLMs),
Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs'
reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative
Policy Optimization) and GSPO (Group Sequence Policy Optimization), are
observed to suffer from unstable policy updates and low usage of sampling data,
respectively. The importance ratio of GRPO is calculated at the token level,
which focuses more on optimizing a single token. This will be easily affected
by outliers, leading to model training collapse. GSPO proposed the calculation
of the response level importance ratio, which solves the problem of high
variance and training noise accumulation in the calculation of the GRPO
importance ratio. However, since all the response tokens share a common
importance ratio, extreme values can easily raise or lower the overall mean,
leading to the entire response being mistakenly discarded, resulting in a
decrease in the utilization of sampled data. This paper introduces SSPO, which
applies sentence-level importance ratio, taking the balance between GRPO and
GSPO. SSPO not only avoids training collapse and high variance, but also
prevents the whole response tokens from being abandoned by the clipping
mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily
adjust the clipping bounds, encouraging high-entropy tokens to explore and
narrow the clipping range of low-entropy tokens. In particular, SSPO achieves
an average score of 46.57 across five datasets, surpassing GRPO (43.01) and
GSPO (44.42), and wins state-of-the-art performance on three datasets. These
results highlight SSPO's effectiveness in leveraging generated data by taking
the essence of GSPO but rejecting its shortcomings.
☆ Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models
Salma Mekaoui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili, Ilham Chaker, Nikola S. Nikolov
Extracting topics from text has become an essential task, especially with the
rapid growth of unstructured textual data. Most existing works rely on highly
computational methods to address this challenge. In this paper, we argue that
probabilistic and statistical approaches, such as topic modeling (TM), can
offer effective alternatives that require fewer computational resources. TM is
a statistical method that automatically discovers topics in large collections
of unlabeled text; however, it produces topics as distributions of
representative words, which often lack clear interpretability. Our objective is
to perform topic labeling by assigning meaningful labels to these sets of
words. To achieve this without relying on computationally expensive models, we
propose a graph-based approach that not only enriches topic words with
semantically related terms but also explores the relationships among them. By
analyzing these connections within the graph, we derive suitable labels that
accurately capture each topic's meaning. We present a comparative study between
our proposed method and several benchmarks, including ChatGPT-3.5, across two
different datasets. Our method achieved consistently better results than
traditional benchmarks in terms of BERTScore and cosine similarity and produced
results comparable to ChatGPT-3.5, while remaining computationally efficient.
Finally, we discuss future directions for topic labeling and highlight
potential research avenues for enhancing interpretability and automation.
☆ Reusing Pre-Training Data at Test Time is a Compute Multiplier
Large language models learn from their vast pre-training corpora, gaining the
ability to solve an ever increasing variety of tasks; yet although researchers
work to improve these datasets, there is little effort to understand how
efficient the pre-training apparatus is at extracting ideas and knowledge from
the data. In this work, we use retrieval augmented generation along with
test-time compute as a way to quantify how much dataset value was left behind
by the process of pre-training, and how this changes across scale. We
demonstrate that pre-training then retrieving from standard and largely
open-sourced datasets results in significant accuracy gains in MMLU, Math-500,
and SimpleQA, which persist through decontamination. For MMLU we observe that
retrieval acts as a ~5x compute multiplier versus pre-training alone. We show
that these results can be further improved by leveraging additional compute at
test time to parse the retrieved context, demonstrating a 10 percentage point
improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results
suggest that today's pre-training methods do not make full use of the
information in existing pre-training datasets, leaving significant room for
progress.
☆ REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Machine unlearning aims to remove the influence of specific training data
from a model without requiring full retraining. This capability is crucial for
ensuring privacy, safety, and regulatory compliance. Therefore, verifying
whether a model has truly forgotten target data is essential for maintaining
reliability and trustworthiness. However, existing evaluation methods often
assess forgetting at the level of individual inputs. This approach may overlook
residual influence present in semantically similar examples. Such influence can
compromise privacy and lead to indirect information leakage. We propose REMIND
(Residual Memorization In Neighborhood Dynamics), a novel evaluation method
aiming to detect the subtle remaining influence of unlearned data and classify
whether the data has been effectively forgotten. REMIND analyzes the model's
loss over small input variations and reveals patterns unnoticed by single-point
evaluations. We show that unlearned data yield flatter, less steep loss
landscapes, while retained or unrelated data exhibit sharper, more volatile
patterns. REMIND requires only query-based access, outperforms existing methods
under similar constraints, and demonstrates robustness across different models,
datasets, and paraphrased inputs, making it practical for real-world
deployment. By providing a more sensitive and interpretable measure of
unlearning effectiveness, REMIND provides a reliable framework to assess
unlearning in language models. As a result, REMIND offers a novel perspective
on memorization and unlearning.
comment: Pre-print version under review
☆ Black-Box Guardrail Reverse-engineering Attack
Large language models (LLMs) increasingly employ guardrails to enforce
ethical, legal, and application-specific constraints on their outputs. While
effective at mitigating harmful responses, these guardrails introduce a new
class of vulnerabilities by exposing observable decision patterns. In this
work, we present the first study of black-box LLM guardrail reverse-engineering
attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement
learning-based framework that leverages genetic algorithm-driven data
augmentation to approximate the decision-making policy of victim guardrails. By
iteratively collecting input-output pairs, prioritizing divergence cases, and
applying targeted mutations and crossovers, our method incrementally converges
toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on
three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3,
and demonstrate that it achieves an rule matching rate exceeding 0.92 while
requiring less than $85 in API costs. These findings underscore the practical
feasibility of guardrail extraction and highlight significant security risks
for current LLM safety mechanisms. Our findings expose critical vulnerabilities
in current guardrail designs and highlight the urgent need for more robust
defense mechanisms in LLM deployment.
☆ Block Rotation is All You Need for MXFP4 Quantization
Large language models (LLMs) have achieved remarkable success, but their
rapidly growing scale imposes prohibitive costs in memory, computation, and
energy. Post-training quantization (PTQ) is a promising solution for efficient
deployment, yet achieving accurate W4A4 quantization remains an open challenge.
While most existing methods are designed for INT4 formats, the emergence of
MXFP4 -- a new FP4 format with various hardware support (NVIDIA, AMD, Intel)--
raises questions about the applicability of current techniques. In this work,
we establish a comprehensive benchmark of PTQ methods under the MXFP4 format.
Through systematic evaluation, we find that methods like GPTQ consistently
deliver strong performance, whereas rotation-based approaches, which are almost
used by all state-of-the-art approaches, suffer from severe incompatibility
with MXFP4. We further provide the first in-depth analysis of this conflict,
tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two)
block scaling and the redistribution of outlier energy via global rotation.
Building on this insight, we propose a simple yet effective block rotation
strategy that adapts rotation-based methods to MXFP4, leading to substantial
accuracy improvements across diverse LLMs. Our findings not only offer clear
guidance for practitioners but also set a foundation for advancing PTQ research
under emerging low-precision formats.
comment: 9 pages, 10 figures
☆ LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański
This study provides an empirical assessment of whether current large language
models (LLMs) can pass the official qualifying examination for membership in
Poland's National Appeal Chamber (Krajowa Izba Odwo{\l}awcza). The authors
examine two related ideas: using LLM as actual exam candidates and applying the
'LLM-as-a-judge' approach, in which model-generated answers are automatically
evaluated by other models. The paper describes the structure of the exam, which
includes a multiple-choice knowledge test on public procurement law and a
written judgment, and presents the hybrid information recovery and extraction
pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4
Sonnet and Bielik-11B-v2.6) were tested in closed-book and various
Retrieval-Augmented Generation settings. The results show that although the
models achieved satisfactory scores in the knowledge test, none met the passing
threshold in the practical written part, and the evaluations of the
'LLM-as-a-judge' often diverged from the judgments of the official examining
committee. The authors highlight key limitations: susceptibility to
hallucinations, incorrect citation of legal provisions, weaknesses in logical
argumentation, and the need for close collaboration between legal experts and
technical teams. The findings indicate that, despite rapid technological
progress, current LLMs cannot yet replace human judges or independent examiners
in Polish public procurement adjudication.
☆ Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Large language models (LLMs) are increasingly used in the social sciences to
simulate human behavior, based on the assumption that they can generate
realistic, human-like text. Yet this assumption remains largely untested.
Existing validation efforts rely heavily on human-judgment-based evaluations --
testing whether humans can distinguish AI from human output -- despite evidence
that such judgments are blunt and unreliable. As a result, the field lacks
robust tools for assessing the realism of LLM-generated text or for calibrating
models to real-world data. This paper makes two contributions. First, we
introduce a computational Turing test: a validation framework that integrates
aggregate metrics (BERT-based detectability and semantic similarity) with
interpretable linguistic features (stylistic markers and topical patterns) to
assess how closely LLMs approximate human language within a given dataset.
Second, we systematically compare nine open-weight LLMs across five calibration
strategies -- including fine-tuning, stylistic prompting, and context retrieval
-- benchmarking their ability to reproduce user interactions on X (formerly
Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the
literature. Even after calibration, LLM outputs remain clearly distinguishable
from human text, particularly in affective tone and emotional expression.
Instruction-tuned models underperform their base counterparts, and scaling up
model size does not enhance human-likeness. Crucially, we identify a trade-off:
optimizing for human-likeness often comes at the cost of semantic fidelity, and
vice versa. These results provide a much-needed scalable framework for
validation and calibration in LLM simulations -- and offer a cautionary note
about their current limitations in capturing human communication.
☆ Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains
The proliferation of AI-generated content has created an absurd communication
theater where senders use LLMs to inflate simple ideas into verbose content,
recipients use LLMs to compress them back into summaries, and as a consequence
neither party engage with authentic content. LAAC (LLM as a Communicator)
proposes a paradigm shift - positioning LLMs as intelligent communication
intermediaries that capture the sender's intent through structured dialogue and
facilitate genuine knowledge exchange with recipients. Rather than perpetuating
cycles of AI-generated inflation and compression, LAAC enables authentic
communication across diverse contexts including academic papers, proposals,
professional emails, and cross-platform content generation. However, deploying
LLMs as trusted communication intermediaries raises critical questions about
information fidelity, consistency, and reliability. This position paper
systematically evaluates the trustworthiness requirements for LAAC's deployment
across multiple communication domains. We investigate three fundamental
dimensions: (1) Information Capture Fidelity - accuracy of intent extraction
during sender interviews across different communication types, (2)
Reproducibility - consistency of structured knowledge across multiple
interaction instances, and (3) Query Response Integrity - reliability of
recipient-facing responses without hallucination, source conflation, or
fabrication. Through controlled experiments spanning multiple LAAC use cases,
we assess these trust dimensions using LAAC's multi-agent architecture.
Preliminary findings reveal measurable trust gaps that must be addressed before
LAAC can be reliably deployed in high-stakes communication scenarios.
comment: 10 pages, 4 figures. Submitted to IEEE DISTILL 2025 (co-located with
IEEE TPS 2025)
☆ Transforming Mentorship: An AI Powered Chatbot Approach to University Guidance
Mashrur Rahman, Mantaqa abedin, Monowar Zamil Abir, Faizul Islam Ansari, Adib Reza, Farig Yousuf Sadeque, Niloy Farhan
University students face immense challenges during their undergraduate lives,
often being deprived of personalized on-demand guidance that mentors fail to
provide at scale. Digital tools exist, but there is a serious lack of
customized coaching for newcomers. This paper presents an AI-powered chatbot
that will serve as a mentor for the students of BRAC University. The main
component is a data ingestion pipeline that efficiently processes and updates
information from diverse sources, such as CSV files and university webpages.
The chatbot retrieves information through a hybrid approach, combining BM25
lexical ranking with ChromaDB semantic retrieval, and uses a Large Language
Model, LLaMA-3.3-70B, to generate conversational responses. The generated text
was found to be semantically highly relevant, with a BERTScore of 0.831 and a
METEOR score of 0.809. The data pipeline was also very efficient, taking 106.82
seconds for updates, compared to 368.62 seconds for new data. This chatbot will
be able to help students by responding to their queries, helping them to get a
better understanding of university life, and assisting them to plan better
routines for their semester in the open-credit university.
comment: 11 pages
☆ Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan, Akash, Chandra Khatri, Shubham Agarwal
Despite significant advances in document understanding, determining the
correct orientation of scanned or photographed documents remains a critical
pre-processing step in the real world settings. Accurate rotation correction is
essential for enhancing the performance of downstream tasks such as Optical
Character Recognition (OCR) where misalignment commonly arises due to user
errors, particularly incorrect base orientations of the camera during capture.
In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for
evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from
rotation-transformed structured and free-form English OCR datasets, and (ii)
ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource
languages. We also present a fast, robust and lightweight rotation
classification pipeline built on the vision encoder of Phi-3.5-Vision model
with dynamic image cropping, fine-tuned specifically for 4-class rotation task
in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy
on identifying the rotations respectively on both the datasets. Beyond
classification, we demonstrate the critical role of our module in boosting OCR
performance: closed-source (up to 14%) and open-weights models (up to 4x) in
the simulated real-world setting.
☆ BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation
Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali
Text-to-SQL systems provide a natural language interface that can enable even
laymen to access information stored in databases. However, existing Large
Language Models (LLM) struggle with SQL generation from natural instructions
due to large schema sizes and complex reasoning. Prior work often focuses on
complex, somewhat impractical pipelines using flagship models, while smaller,
efficient models remain overlooked. In this work, we explore three multi-agent
LLM pipelines, with systematic performance benchmarking across a range of small
to large open-source models: (1) Multi-agent discussion pipeline, where agents
iteratively critique and refine SQL queries, and a judge synthesizes the final
answer; (2) Planner-Coder pipeline, where a thinking model planner generates
stepwise SQL generation plans and a coder synthesizes queries; and (3)
Coder-Aggregator pipeline, where multiple coders independently generate SQL
queries, and a reasoning agent selects the best query. Experiments on the
Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small
model performance, with up to 10.6% increase in Execution Accuracy for
Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines,
the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B
and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest
score of 56.4%. Codes are available at
https://github.com/treeDweller98/bappa-sql.
☆ CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
Automatic speech recognition (ASR) is critical for language accessibility,
yet low-resource Cantonese remains challenging due to limited annotated data,
six lexical tones, tone sandhi, and accent variation. Existing ASR models, such
as Whisper, often suffer from high word error rates. Large audio-language
models (LALMs), in contrast, can leverage broader contextual reasoning but
still require explicit tonal and prosodic acoustic cues. We introduce CantoASR,
a collaborative ASR-LALM error correction framework that integrates forced
alignment for acoustic feature extraction, a LoRA-finetuned Whisper for
improved tone discrimination, and an instruction-tuned Qwen-Audio for
prosody-aware correction. Evaluations on spontaneous Cantonese data show
substantial CER gains over Whisper-Large-V3. These findings suggest that
integrating acoustic cues with LALM reasoning provides a scalable strategy for
low-resource tonal and dialectal ASR.
☆ RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Large language models (LLMs) achieve high performance on mathematical
reasoning, but these results can be inflated by training data leakage or
superficial pattern matching rather than genuine reasoning. To this end, an
adversarial perturbation-based evaluation is needed to measure true
mathematical reasoning ability. Current rule-based perturbation methods often
generate ill-posed questions and impede the systematic evaluation of question
difficulty and the evolution of benchmarks. To bridge this gap, we propose
RIDE, a novel adversarial question-rewriting framework that leverages Item
Response Theory (IRT) to rigorously measure question difficulty and to generate
intrinsically more challenging, well-posed variations of mathematical problems.
We employ 35 LLMs to simulate students and build a difficulty ranker from their
responses. This ranker provides a reward signal during reinforcement learning
and guides a question-rewriting model to reformulate existing questions across
difficulty levels. Applying RIDE to competition-level mathematical benchmarks
yields perturbed versions that degrade advanced LLM performance, with
experiments showing an average 21.73% drop across 26 models, thereby exposing
limited robustness in mathematical reasoning and confirming the validity of our
evaluation approach.
☆ Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Recent work has explored batch prompting as a strategy to amortize inference
cost in large language models (LLMs). In this paper, we show that batching
offers an additional, underappreciated benefit: it regularizes model behavior
during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a
comprehensive study across 13 diverse benchmarks and observe that batching
improves accuracy while substantially reducing reasoning token usage, often by
3x-5x. Through detailed behavioral analysis, we find that batching suppresses
overthinking, reduces hedging language (e.g., repetitive self-corrections), and
encourages more decisive answers. Surprisingly, we also observe emergent
collective effects in batched inference: models often generalize patterns from
earlier examples to solve harder ones in the same batch. These findings
position batching not just as a throughput optimization, but as a powerful
inference-time regularizer for more efficient and reliable LLM reasoning.
☆ Sub-exponential Growth in Online Word Usage: A Piecewise Power-Law Model
The diffusion of ideas and language in society has conventionally been
described by S-shaped models, such as the logistic curve. However, the role of
sub-exponential growth -a slower than exponential pattern known in
epidemiology- has been largely overlooked in broader social phenomena. Here, we
present a piecewise power-law model to characterize complex growth curves with
a few parameters. We systematically analyzed a large-scale dataset of
approximately one billion Japanese blog articles linked to Wikipedia
vocabulary, and observed consistent patterns in web search trend data (English,
Spanish, and Japanese). Our analysis of the 2,965 selected items reveals that
about 55% (1,625 items) were found to have no abrupt jumps and were well
captured by one or two segments. For single-segment curves, we found that (i)
the mode of the shape parameter alpha was near 0.5, indicating prevalent
sub-exponential growth; (ii) the ultimate diffusion scale is primarily
determined by the growth rate R, with minor contributions from alpha or the
duration T; and (iii) alpha showed a tendency to vary with the nature of the
topic, being smaller for niche/local topics and larger for widely shared ones.
Furthermore, a micro-behavioral model distinguishing outward contact with
strangers from inward interaction within their community suggests that alpha
can be interpreted as an index of the preference for outward-oriented
communication. These findings suggest that sub-exponential growth is a common
pattern of social diffusion, and our model provides a practical framework for
consistently describing, comparing, and interpreting complex and diverse growth
curves.
☆ A Characterization of List Language Identification in the Limit
We study the problem of language identification in the limit, where given a
sequence of examples from a target language, the goal of the learner is to
output a sequence of guesses for the target language such that all the guesses
beyond some finite time are correct. Classical results of Gold showed that
language identification in the limit is impossible for essentially any
interesting collection of languages. Later, Angluin gave a precise
characterization of language collections for which this task is possible.
Motivated by recent positive results for the related problem of language
generation, we revisit the classic language identification problem in the
setting where the learner is given the additional power of producing a list of
$k$ guesses at each time step. The goal is to ensure that beyond some finite
time, one of the guesses is correct at each time step.
We give an exact characterization of collections of languages that can be
$k$-list identified in the limit, based on a recursive version of Angluin's
characterization (for language identification with a list of size $1$). This
further leads to a conceptually appealing characterization: A language
collection can be $k$-list identified in the limit if and only if the
collection can be decomposed into $k$ collections of languages, each of which
can be identified in the limit (with a list of size $1$). We also use our
characterization to establish rates for list identification in the statistical
setting where the input is drawn as an i.i.d. stream from a distribution
supported on some language in the collection. Our results show that if a
collection is $k$-list identifiable in the limit, then the collection can be
$k$-list identified at an exponential rate, and this is best possible. On the
other hand, if a collection is not $k$-list identifiable in the limit, then it
cannot be $k$-list identified at any rate that goes to zero.
☆ Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz
Objective: To enhance automated de-identification of radiology reports by
scaling transformer-based models through extensive training datasets and
benchmarking performance against commercial cloud vendor systems for protected
health information (PHI) detection. Materials and Methods: In this
retrospective study, we built upon a state-of-the-art, transformer-based, PHI
de-identification pipeline by fine-tuning on two large annotated radiology
corpora from Stanford University, encompassing chest X-ray, chest CT,
abdomen/pelvis CT, and brain MR reports and introducing an additional PHI
category (AGE) into the architecture. Model performance was evaluated on test
sets from Stanford and the University of Pennsylvania (Penn) for token-level
PHI detection. We further assessed (1) the stability of synthetic PHI
generation using a "hide-in-plain-sight" method and (2) performance against
commercial systems. Precision, recall, and F1 scores were computed across all
PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the
Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining
the previous state-of-the-art model performance. Synthetic PHI evaluation
showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50
independently de-identified Penn datasets. Our model outperformed all vendor
systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754).
Discussion: Large-scale, multimodal training improved cross-institutional
generalization and robustness. Synthetic PHI generation preserved data utility
while ensuring privacy. Conclusion: A transformer-based de-identification model
trained on diverse radiology datasets outperforms prior academic and commercial
systems in PHI detection and establishes a new benchmark for secure clinical
text processing.
comment: In submission to JAMIA
☆ The truth is no diaper: Human and AI-generated associations to emotional words
Human word associations are a well-known method of gaining insight into the
internal mental lexicon, but the responses spontaneously offered by human
participants to word cues are not always predictable as they may be influenced
by personal experience, emotions or individual cognitive styles. The ability to
form associative links between seemingly unrelated concepts can be the driving
mechanisms of creativity. We perform a comparison of the associative behaviour
of humans compared to large language models. More specifically, we explore
associations to emotionally loaded words and try to determine whether large
language models generate associations in a similar way to humans. We find that
the overlap between humans and LLMs is moderate, but also that the associations
of LLMs tend to amplify the underlying emotional load of the stimulus, and that
they tend to be more predictable and less creative than human ones.
comment: 6 pages, 1 figure. Presented at ICCC'25, Campinas, Brazil
☆ Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer
time-sensitive questions by leveraging factual information from Temporal
Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG
embeddings or graph neural networks to inject temporal knowledge, they fail to
fully understand the complex semantic information of time constraints.
Recently, Large Language Models (LLMs) have shown remarkable progress,
benefiting from their strong semantic understanding and reasoning
generalization capabilities. However, their temporal reasoning ability remains
limited. LLMs frequently suffer from hallucination and a lack of knowledge. To
address these limitations, we propose the Plan of Knowledge framework with a
contrastive temporal retriever, which is named PoK. Specifically, the proposed
Plan of Knowledge module decomposes a complex temporal question into a sequence
of sub-objectives from the pre-defined tools, serving as intermediate guidance
for reasoning exploration. In parallel, we construct a Temporal Knowledge Store
(TKS) with a contrastive retrieval framework, enabling the model to selectively
retrieve semantically and temporally aligned facts from TKGs. By combining
structured planning with temporal knowledge retrieval, PoK effectively enhances
the interpretability and factual consistency of temporal reasoning. Extensive
experiments on four benchmark TKGQA datasets demonstrate that PoK significantly
improves the retrieval precision and reasoning accuracy of LLMs, surpassing the
performance of the state-of-the-art TKGQA methods by 56.0% at most.
comment: Submitted to the IEEE for possible publication
☆ T-FIX: Text-Based Explanations with Features Interpretable to eXperts
Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Marco Gatti, Bhuvnesh Jain, Helen Qu, Daniel A Hashimoto, Amin Madani, Rajat Deo, Sameed Ahmed M. Khatana, Gary E. Weissman, Lyle Ungar, Eric Wong
As LLMs are deployed in knowledge-intensive settings (e.g., surgery,
astronomy, therapy), users expect not just answers, but also meaningful
explanations for those answers. In these settings, users are often domain
experts (e.g., doctors, astrophysicists, psychologists) who require
explanations that reflect expert-level reasoning. However, current evaluation
schemes primarily emphasize plausibility or internal faithfulness of the
explanation, which fail to capture whether the content of the explanation truly
aligns with expert intuition. We formalize expert alignment as a criterion for
evaluating explanations with T-FIX, a benchmark spanning seven
knowledge-intensive domains. In collaboration with domain experts, we develop
novel metrics to measure the alignment of LLM explanations with expert
judgment.
☆ DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization NeurIPS 2025
Quantization plays a crucial role in accelerating the inference of
large-scale models, and rotational matrices have been shown to effectively
improve quantization performance by smoothing outliers. However, end-to-end
fine-tuning of rotational optimization algorithms incurs high computational
costs and is prone to overfitting. To address this challenge, we propose an
efficient distribution-aware rotational calibration method, DartQuant, which
reduces the complexity of rotational optimization by constraining the
distribution of the activations after rotation. This approach also effectively
reduces reliance on task-specific losses, thereby mitigating the risk of
overfitting. Additionally, we introduce the QR-Orth optimization scheme, which
replaces expensive alternating optimization with a more efficient solution. In
a variety of model quantization experiments, DartQuant demonstrates superior
performance. Compared to existing methods, it achieves 47$\times$ acceleration
and 10$\times$ memory savings for rotational optimization on a 70B model.
Furthermore, it is the first to successfully complete rotational calibration
for a 70B model on a single 3090 GPU, making quantization of large language
models feasible in resource-constrained environments. Code is available at
https://github.com/CAS-CLab/DartQuant.git.
comment: NeurIPS 2025, 10 pages, 12 figures
☆ Explorability in Pushdown Automata
We study explorability, a measure of nondeterminism in pushdown automata,
which generalises history-determinism. An automaton is k-explorable if, while
reading the input, it suffices to follow k concurrent runs, built step-by-step
based only on the input seen so far, to construct an accepting one, if it
exists. We show that the class of explorable PDAs lies strictly between
history-deterministic and fully nondeterministic PDAs in terms of both
expressiveness and succinctness. In fact increasing explorability induces an
infinite hierarchy: each level k defines a strictly more expressive class than
level k-1, yet the entire class remains less expressive than general
nondeterministic PDAs. We then introduce a parameterized notion of
explorability, where the number of runs may depend on input length, and show
that exponential explorability precisely captures the context-free languages.
Finally, we prove that explorable PDAs can be doubly exponentially more
succinct than history-deterministic ones, and that the succinctness gap between
deterministic and 2-explorable PDAs is not recursively enumerable. These
results position explorability as a robust and operationally meaningful measure
of nondeterminism for pushdown systems.
☆ WST: Weakly Supervised Transducer for Automatic Speech Recognition
Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur, Jian Wu
The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in
end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily
on large-scale, high-quality annotated data, which are often costly and
difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised
Transducer (WST), which integrates a flexible training graph designed to
robustly handle errors in the transcripts without requiring additional
confidence estimation or auxiliary pre-trained models. Empirical evaluations on
synthetic and industrial datasets reveal that WST effectively maintains
performance even with transcription error rates of up to 70%, consistently
outperforming existing Connectionist Temporal Classification (CTC)-based weakly
supervised approaches, such as Bypass Temporal Classification (BTC) and
Omni-Temporal Classification (OTC). These results demonstrate the practical
utility and robustness of WST in realistic ASR settings. The implementation
will be publicly available.
☆ Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises
Large Language Models (LLMs) enhanced with retrieval -- commonly referred to
as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance
in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved
evidence is incomplete, leaving gaps in the reasoning process. In such cases,
\emph{abductive inference} -- the process of generating plausible missing
premises to explain observations -- offers a principled approach to bridge
these gaps. In this paper, we propose a framework that integrates abductive
inference into retrieval-augmented LLMs. Our method detects insufficient
evidence, generates candidate missing premises, and validates them through
consistency and plausibility checks. Experimental results on abductive
reasoning and multi-hop QA benchmarks show that our approach improves both
answer accuracy and reasoning faithfulness. This work highlights abductive
inference as a promising direction for enhancing the robustness and
explainability of RAG systems.
☆ Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations
Decision trees are widely used in high-stakes fields like finance and
healthcare due to their interpretability. This work introduces an efficient,
scalable method for generating synthetic pre-training data to enable
meta-learning of decision trees. Our approach samples near-optimal decision
trees synthetically, creating large-scale, realistic datasets. Using the
MetaTree transformer architecture, we demonstrate that this method achieves
performance comparable to pre-training on real-world data or with
computationally expensive optimal decision trees. This strategy significantly
reduces computational costs, enhances data generation flexibility, and paves
the way for scalable and efficient meta-learning of interpretable decision tree
models.
comment: 9 pages, 3 figures, Neurips 2025 GenAI in Finance Workshop
☆ LLMs and Cultural Values: the Impact of Prompt Language and Explicit Cultural Framing
Large Language Models (LLMs) are rapidly being adopted by users across the
globe, who interact with them in a diverse range of languages. At the same
time, there are well-documented imbalances in the training data and
optimisation objectives of this technology, raising doubts as to whether LLMs
can represent the cultural diversity of their broad user base. In this study,
we look at LLMs and cultural values and examine how prompt language and
cultural framing influence model responses and their alignment with human
values in different countries. We probe 10 LLMs with 63 items from the Hofstede
Values Survey Module and World Values Survey, translated into 11 languages, and
formulated as prompts with and without different explicit cultural
perspectives. Our study confirms that both prompt language and cultural
perspective produce variation in LLM outputs, but with an important caveat:
While targeted prompting can, to a certain extent, steer LLM responses in the
direction of the predominant values of the corresponding countries, it does not
overcome the models' systematic bias toward the values associated with a
restricted set of countries in our dataset: the Netherlands, Germany, the US,
and Japan. All tested models, regardless of their origin, exhibit remarkably
similar patterns: They produce fairly neutral responses on most topics, with
selective progressive stances on issues such as social tolerance. Alignment
with cultural values of human respondents is improved more with an explicit
cultural perspective than with a targeted prompt language. Unexpectedly,
combining both approaches is no more effective than cultural framing with an
English prompt. These findings reveal that LLMs occupy an uncomfortable middle
ground: They are responsive enough to changes in prompts to produce variation,
but too firmly anchored to specific cultural defaults to adequately represent
cultural diversity.
comment: Preprint under review at Computational Linguistics. Accepted with
minor revisions (10/10/2025); second round
☆ Multi-Agent Collaborative Framework For Math Problem Generation
Automatic question generation (AQG) for mathematics education remains an
elusive goal for Intelligent Tutoring Systems and educators. While pre-trained
transformer-based language models have significantly advanced natural language
generation, they often struggle to precisely control problem complexity and
cognitive demands. In this paper, we introduce a collaborative multi-agent
framework as a novel method of incorporating inference-time computation into
AQG. This approach leverages multiple agents that iteratively refine generated
question-answer pairs to better balance complexity and cognitive demand. We
evaluate the generated questions on five meta-evaluation criteria: relevance,
importance, clarity, difficulty matching, answerability, to assess the system's
ability to control the required complexity and quality of the questions.
Preliminary evaluations show that this collaborative multi-agent framework
elevates the quality of generated educational content by fostering a more
nuanced balance between cognitive challenge and clarity. These promising
outcomes suggest that integrating collaborative multi-agent workflows can yield
more controlled, pedagogically valuable content that can help advance automated
educational content generation and adaptive learning environments.
comment: Published in the Proceedings of the 18th International Conference on
Educational Data Mining, 6 pages, 5 figures
☆ Direct Semantic Communication Between Large Language Models via Vector Translation
In multi-agent settings, such as debate, reflection, or tool-calling, large
language models (LLMs) pass messages as plain tokens, discarding most latent
semantics. This constrains information transfer and adds unnecessary
computational overhead. We form a latent bridge via vector translations, which
use learned mappings that enable direct semantic exchange between
representation spaces. A dual-encoder translator trained between Llama-2-7B and
Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the
translated vectors at 30 percent blending strength steers the target model's
generation without destabilizing logits. Bidirectional evaluation shows a
2.01:1 transfer asymmetry, indicating that general-purpose models yield more
transferable representations than instruction-tuned variants. This conservative
injection preserves computational stability while demonstrating that
cross-model latent communication is feasible, enabling collaborative AI systems
that share meaning rather than tokens.
comment: 9 pages, 1 figure, 2 tables
☆ MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation NeurIPS 2025
We present MIDI-LLM, an LLM for generating multitrack MIDI music from
free-form text prompts. Our approach expands a text LLM's vocabulary to include
MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI
abilities. By preserving the original LLM's parameter structure, we can
directly leverage the vLLM library for accelerated inference. Experiments show
that MIDI-LLM achieves higher quality, better text control, and faster
inference compared to the recent Text2midi model. Live demo at
https://midi-llm-demo.vercel.app.
comment: To appear at NeurIPS 2025 Workshop on AI for Music
☆ RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods
Reinforcement Learning from Human Feedback (RLHF) is the standard for
aligning Large Language Models (LLMs), yet recent progress has moved beyond
canonical text-based methods. This survey synthesizes the new frontier of
alignment research by addressing critical gaps in multi-modal alignment,
cultural fairness, and low-latency optimization. To systematically explore
these domains, we first review foundational algo- rithms, including PPO, DPO,
and GRPO, before presenting a detailed analysis of the latest innovations. By
providing a comparative synthesis of these techniques and outlining open
challenges, this work serves as an essential roadmap for researchers building
more robust, efficient, and equitable AI systems.
♻ ☆ Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences
When do machine learning systems fail to generalize, and what mechanisms
could improve their generalization? Here, we draw inspiration from cognitive
science to argue that one weakness of parametric machine learning systems is
their failure to exhibit latent learning -- learning information that is not
relevant to the task at hand, but that might be useful in a future task. We
show how this perspective links failures ranging from the reversal curse in
language modeling to new findings on agent-based navigation. We then highlight
how cognitive science points to episodic memory as a potential part of the
solution to these issues. Correspondingly, we show that a system with an oracle
retrieval mechanism can use learning experiences more flexibly to generalize
better across many of these challenges. We also identify some of the essential
components for effectively using retrieval, including the importance of
within-example in-context learning for acquiring the ability to use information
across retrieved examples. In summary, our results illustrate one possible
contributor to the relative data inefficiency of current machine learning
systems compared to natural intelligence, and help to understand how retrieval
methods can complement parametric learning to improve generalization. We close
by discussing some of the links between these findings and prior results in
cognitive science and neuroscience, and the broader implications.
♻ ☆ Distillation versus Contrastive Learning: How to Train Your Rerankers AACL 2025
Training effective text rerankers is crucial for information retrieval. Two
strategies are widely used: contrastive learning (optimizing directly on
ground-truth labels) and knowledge distillation (transferring knowledge from a
larger reranker). While both have been studied extensively, a clear comparison
of their effectiveness for training cross-encoder rerankers under practical
conditions is needed.
This paper empirically compares these strategies by training rerankers of
different sizes (0.5B, 1.5B, 3B, 7B) and architectures (Transformer, Recurrent)
using both methods on the same data, with a strong contrastive learning model
acting as the distillation teacher. Our results show that knowledge
distillation generally yields better in-domain and out-of-domain ranking
performance than contrastive learning when distilling from a more performant
teacher model. This finding is consistent across student model sizes and
architectures. However, distilling from a teacher of the same capacity does not
provide the same advantage, particularly for out-of-domain tasks. These
findings offer practical guidance for choosing a training strategy based on
available teacher models. We recommend using knowledge distillation to train
smaller rerankers if a larger, more performant teacher is accessible; in its
absence, contrastive learning remains a robust baseline. Our code
implementation is made available to facilitate reproducbility.
comment: IJCNLP-AACL 2025 Findings
♻ ☆ Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
For machine learning datasets to accurately represent diverse opinions in a
population, they must preserve variation in data labels while filtering out
spam or low-quality responses. How can we balance annotator reliability and
representation? We empirically evaluate how a range of heuristics for annotator
filtering affect the preservation of variation on subjective tasks. We find
that these methods, designed for contexts in which variation from a single
ground-truth label is considered noise, often remove annotators who disagree
instead of spam annotators, introducing suboptimal tradeoffs between accuracy
and label diversity. We find that conservative settings for annotator removal
(<5%) are best, after which all tested methods increase the mean absolute error
from the true average label. We analyze performance on synthetic spam to
observe that these methods often assume spam annotators are more random than
real spammers tend to be: most spammers are distributionally indistinguishable
from real annotators, and the minority that are distinguishable tend to give
relatively fixed answers, not random ones. Thus, tasks requiring the
preservation of variation reverse the intuition of existing spam filtering
methods: spammers tend to be less random than non-spammers, so metrics that
assume variation is spam fare worse. These results highlight the need for spam
removal methods that account for label diversity.
♻ ☆ Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications
Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen
Large Language Models (LLMs) have demonstrated significant potential in
medicine. To date, LLMs have been widely applied to tasks such as diagnostic
assistance, medical question answering, and clinical information synthesis.
However, a key open question remains: to what extent do LLMs memorize medical
training data. In this study, we present the first comprehensive evaluation of
memorization of LLMs in medicine, assessing its prevalence (how frequently it
occurs), characteristics (what is memorized), volume (how much content is
memorized), and potential downstream impacts (how memorization may affect
medical applications). We systematically analyze common adaptation scenarios:
(1) continued pretraining on medical corpora, (2) fine-tuning on standard
medical benchmarks, and (3) fine-tuning on real-world clinical data, including
over 13,000 unique inpatient records from Yale New Haven Health System. The
results demonstrate that memorization is prevalent across all adaptation
scenarios and significantly higher than reported in the general domain.
Memorization affects both the development and adoption of LLMs in medicine and
can be categorized into three types: beneficial (e.g., accurate recall of
clinical guidelines and biomedical references), uninformative (e.g., repeated
disclaimers or templated medical document language), and harmful (e.g.,
regeneration of dataset-specific or sensitive clinical content). Based on these
findings, we offer practical recommendations to facilitate beneficial
memorization that enhances domain-specific reasoning and factual accuracy,
minimize uninformative memorization to promote deeper learning beyond
surface-level patterns, and mitigate harmful memorization to prevent the
leakage of sensitive or identifiable patient information.
♻ ☆ XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification
We propose XL-DURel, a finetuned, multilingual Sentence Transformer model
optimized for ordinal Word-in-Context classification. We test several loss
functions for regression and ranking tasks managing to outperform previous
models on ordinal and binary data with a ranking objective based on angular
distance in complex space. We further show that binary WiC can be treated as a
special case of ordinal WiC and that optimizing models for the general ordinal
task improves performance on the more specific binary task. This paves the way
for a unified treatment of WiC modeling across different task formulations.
comment: 9 pages
♻ ☆ VISTA Score: Verification In Sequential Turn-based Assessment
Hallucination--defined here as generating statements unsupported or
contradicted by available evidence or conversational context--remains a major
obstacle to deploying conversational AI systems in settings that demand factual
reliability. Existing metrics either evaluate isolated responses or treat
unverifiable content as errors, limiting their use for multi-turn dialogue. We
introduce VISTA (Verification In Sequential Turn-based Assessment), a framework
for evaluating conversational factuality through claim-level verification and
sequential consistency tracking. VISTA decomposes each assistant turn into
atomic factual claims, verifies them against trusted sources and dialogue
history, and categorizes unverifiable statements (subjective, contradicted,
lacking evidence, or abstaining). Across eight large language models and four
dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA
substantially improves hallucination detection over FACTSCORE and LLM-as-Judge
baselines. Human evaluation confirms that VISTA's decomposition improves
annotator agreement and reveals inconsistencies in existing benchmarks. By
modeling factuality as a dynamic property of conversation, VISTA offers a more
transparent, human-aligned measure of truthfulness in dialogue systems.
♻ ☆ Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs NeurIPS 2025
Recent advances in Large Language Models (LLMs) have highlighted the critical
importance of extending context length, yet the quadratic complexity of
attention mechanisms poses significant challenges for efficient long-context
modeling. KV cache compression has emerged as a key approach to address this
challenge. Through extensive empirical analysis, we reveal a fundamental yet
previously overlooked asymmetry in KV caches: while adjacent keys receive
similar attention weights ({\it local homogeneity}), adjacent values
demonstrate distinct {\it heterogeneous} distributions. This key-value
asymmetry reveals a critical limitation in existing compression methods that
treat keys and values uniformly. To address the limitation, we propose a
training-free compression framework (AsymKV) that combines homogeneity-based
key merging with a mathematically proven lossless value compression. Extensive
experiments demonstrate that AsymKV consistently outperforms existing
long-context methods across various tasks and base models. For example, on
LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing
SOTA methods like H$_2$O (38.89) by a large margin.Our code can be found in
this link:https://github.com/the-scale-lab/Asymkv.
comment: 14 pages,7 figures;Accepted by NeurIPS 2025
♻ ☆ OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights
Bowen Chen, Jayesh Gajbhar, Gregory Dusek, Rob Redmon, Patrick Hogan, Paul Liu, DelWayne Bohnenstiehl, Dongkuan Xu, Ruoying He
Artificial intelligence is transforming the sciences, yet general
conversational AI systems often generate unverified "hallucinations"
undermining scientific rigor. We present OceanAI, a conversational platform
that integrates the natural-language fluency of open-source large language
models (LLMs) with real-time, parameterized access to authoritative
oceanographic data streams hosted by the National Oceanic and Atmospheric
Administration (NOAA). Each query such as "What was Boston Harbor's highest
water level in 2024?" triggers real-time API calls that identify, parse, and
synthesize relevant datasets into reproducible natural-language responses and
data visualizations. In a blind comparison with three widely used AI
chat-interface products, only OceanAI produced NOAA-sourced values with
original data references; others either declined to answer or provided
unsupported results. Designed for extensibility, OceanAI connects to multiple
NOAA data products and variables, supporting applications in marine hazard
forecasting, ecosystem assessment, and water-quality monitoring. By grounding
outputs and verifiable observations, OceanAI advances transparency,
reproducibility, and trust, offering a scalable framework for AI-enabled
decision support within the oceans. A public demonstration is available at
https://oceanai.ai4ocean.xyz.
comment: A related presentation will be given at the AGU(American Geophysical
Union) and AMS(American Meteorological Society) Annual Meetings
♻ ☆ Legal Fact Prediction: The Missing Piece in Legal Judgment Prediction EMNLP 2025
Junkai Liu, Yujie Tong, Hui Huang, Bowen Zheng, Yiran Hu, Peicheng Wu, Chuan Xiao, Makoto Onizuka, Muyun Yang, Shuyuan Zheng
Legal judgment prediction (LJP), which enables litigants and their lawyers to
forecast judgment outcomes and refine litigation strategies, has emerged as a
crucial legal NLP task. Existing studies typically utilize legal facts, i.e.,
facts that have been established by evidence and determined by the judge, to
predict the judgment. However, legal facts are often difficult to obtain in the
early stages of litigation, significantly limiting the practical applicability
of fact-based LJP. To address this limitation, we propose a novel legal NLP
task: legal fact prediction (LFP), which takes the evidence submitted by
litigants for trial as input to predict legal facts, thereby empowering
fact-based LJP technologies to make predictions in the absence of ground-truth
legal facts. We also propose the first benchmark dataset, LFPBench, for
evaluating the LFP task. Our extensive experiments on LFPBench demonstrate the
effectiveness of LFP-empowered LJP and highlight promising research directions
for LFP.
comment: Accepted for EMNLP 2025 Main Conference
♻ ☆ Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
State Space Models (SSMs) are emerging as a compelling alternative to
Transformers because of their consistent memory usage and high performance.
Despite this, scaling up SSMs on cloud services or limited-resource devices is
challenging due to their storage requirements and computational power. To
overcome this, quantizing SSMs with low bit-width data formats can reduce model
size and benefit from hardware acceleration. As SSMs are prone to
quantization-induced errors, recent efforts have focused on optimizing a
particular model or bit-width for efficiency without sacrificing performance.
However, distinct bit-width configurations are essential for different
scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for
enhancing generation speed in short prompt applications for a single user. To
this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both
Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment
on various platforms. Based on the channel order preserving and activation
persistence of SSMs, we propose an offline approach to quantize inputs of a
linear recurrence in 8-bit by sorting and clustering for input $x$, combined
with a per-state-group quantization for input-dependent parameters $B$ and $C$.
To ensure compute-invariance in the SSM output, we rearrange weights offline
according to the clustering sequence. The experiments show that Quamba2-8B
outperforms two state-of-the-art SSM quantization methods and delivers
1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages,
respectively, while offering 4$\times$ memory reduction with only a $1.6\%$
average accuracy drop. The evaluation on MMLU shows the generalizability and
robustness of our framework. The code and quantized models will be released at:
https://github.com/enyac-group/Quamba.
♻ ☆ LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users AAAI 2026
While state-of-the-art large language models (LLMs) have shown impressive
performance on many tasks, there has been extensive research on undesirable
model behavior such as hallucinations and bias. In this work, we investigate
how the quality of LLM responses changes in terms of information accuracy,
truthfulness, and refusals depending on three user traits: English proficiency,
education level, and country of origin. We present extensive experimentation on
three state-of-the-art LLMs and two different datasets targeting truthfulness
and factuality. Our findings suggest that undesirable behaviors in
state-of-the-art LLMs occur disproportionately more for users with lower
English proficiency, of lower education status, and originating from outside
the US, rendering these models unreliable sources of information towards their
most vulnerable users.
comment: Paper accepted at AAAI 2026
♻ ☆ LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge
Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang, Hejia Geng, Yongting Zhang, Yutao Fan, Yuhao Wu, Tiancheng He, Yiran Qin, Lei Bai, Zhenfei Yin
Evaluating large language models (LLMs) on question answering often relies on
static benchmarks that reward memorization and understate the role of
retrieval, failing to capture the dynamic nature of world knowledge. We present
LiveSearchBench, an automated pipeline for constructing retrieval-dependent
benchmarks from recent knowledge updates. Our method computes deltas between
successive Wikidata snapshots, filters candidate triples for quality, and
synthesizes natural-language questions at three levels of reasoning difficulty,
each guaranteed to admit a unique, verifiable answer through SPARQL validation.
The pipeline is fully automated, scalable across time, and minimizes human
intervention, enabling continual regeneration of temporally grounded
benchmarks. Experiments show a pronounced performance drop when models confront
facts that post-date pretraining, with the gap most salient on multi-hop
queries. Retrieval augmented methods and larger, instruction-tuned models
provide partial gains but fail to close this recency gap. By design,
LiveSearchBench shifts evaluation from static memorization toward tasks that
require up-to-date retrieval and reasoning, offering a foundation for
systematic, long-term assessment of LLMs under evolving knowledge.
♻ ☆ What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization AACL
Traditional dialogue summarization primarily focuses on dialogue content,
assuming it comprises adequate information for a clear summary. However, this
assumption often fails for discussions grounded in shared background, where
participants frequently omit context and use implicit references. This results
in summaries that are confusing to readers unfamiliar with the background. To
address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS),
a novel task that produces a supplementary background summary for context and a
clear opinion summary with clarified references. To facilitate research, we
construct the first KGDS benchmark, featuring news-discussion pairs and
expert-created multi-granularity gold annotations for evaluating sub-summaries.
We also propose a novel hierarchical evaluation framework with fine-grained and
interpretable metrics. Our extensive evaluation of 12 advanced large language
models (LLMs) reveals that KGDS remains a significant challenge. The models
frequently miss key facts and retain irrelevant ones in background
summarization, and often fail to resolve implicit references in opinion summary
integration.
comment: Accepted to AACL-IJCNLP 2025 Main
♻ ☆ Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards EMNLP
Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, Jimmy Lin
Retrieval-augmented generation (RAG) aims to reduce hallucinations by
grounding responses in external context, yet large language models (LLMs) still
frequently introduce unsupported information or contradictions even when
provided with relevant context. This paper presents two complementary efforts
at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe
our original hallucination leaderboard, which has tracked hallucination rates
for LLMs since 2023 using our HHEM hallucination detection model. Motivated by
limitations observed in current hallucination detection methods, we introduce
FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse
human-annotated hallucination examples to substantially improve the automated
hallucination evaluation of LLMs. We introduce an enhanced hallucination
leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in
summarization, question-answering, and data-to-text generation tasks.
FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG
and supports the development of more trustworthy generative AI systems:
https://github.com/vectara/FaithJudge.
comment: EMNLP Industry Track 2025
♻ ☆ DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
We present DYNARTmo, a dynamic articulatory model designed to visualize
speech articulation processes in a two-dimensional midsagittal plane. The model
builds upon the UK-DYNAMO framework and integrates principles of articulatory
underspecification, segmental and gestural control, and coarticulation.
DYNARTmo simulates six key articulators based on ten continuous and six
discrete control parameters, allowing for the generation of both vocalic and
consonantal articulatory configurations. The current implementation is embedded
in a web-based application (SpeechArticulationTrainer) that includes sagittal,
glottal, and palatal views, making it suitable for use in phonetics education
and speech therapy. While this paper focuses on the static modeling aspects,
future work will address dynamic movement generation and integration with
articulatory-acoustic modules.
comment: 10 pages, 29 references, 2 figures, supplementary material. V2:
Discussion of the tongue-palate contact pattern for /t/. V4: replaces wrong
paper upload (of V3). V5: minor corrections
♻ ☆ Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries
Zhengren Wang, Dongwen Yao, Bozhou Li, Dongsheng Ma, Bo Li, Zhiyu Li, Feiyu Xiong, Bin Cui, Linpeng Tang, Wentao Zhang
The proliferation of unstructured data poses a fundamental challenge to
traditional database interfaces. While Text-to-SQL has democratized access to
structured data, it remains incapable of interpreting semantic or multi-modal
queries. Concurrently, vector search has emerged as the de facto standard for
querying unstructured data, but its integration with SQL-termed VectorSQL-still
relies on manual query crafting and lacks standardized evaluation
methodologies, creating a significant gap between its potential and practical
application.
To bridge this fundamental gap, we introduce and formalize Text2VectorSQL, a
novel task to establish a unified natural language interface for seamlessly
querying both structured and unstructured data. To catalyze research in this
new domain, we present a comprehensive foundational ecosystem, including: (1) A
scalable and robust pipeline for synthesizing high-quality Text-to-VectorSQL
training data. (2) VectorSQLBench, the first large-scale, multi-faceted
benchmark for this task, encompassing 12 distinct combinations across three
database backends (SQLite, PostgreSQL, ClickHouse) and four data sources (BIRD,
Spider, arXiv, Wikipedia). (3) Several novel evaluation metrics designed for
more nuanced performance analysis. Extensive experiments not only confirm
strong baseline performance with our trained models, but also reveal the recall
degradation challenge: the integration of SQL filters with vector search can
lead to more pronounced result omissions than in conventional filtered vector
search. By defining the core task, delivering the essential data and evaluation
infrastructure, and identifying key research challenges, our work lays the
essential groundwork to build the next generation of unified and intelligent
data interfaces. Our repository is available at
https://github.com/OpenDCAI/Text2VectorSQL.
comment: Manuscript
♻ ☆ Will Large Language Models Transform Clinical Prediction?
Objective: Large language models (LLMs) are attracting increasing interest in
healthcare. This commentary evaluates the potential of LLMs to improve clinical
prediction models (CPMs) for diagnostic and prognostic tasks, with a focus on
their ability to process longitudinal electronic health record (EHR) data.
Findings: LLMs show promise in handling multimodal and longitudinal EHR data
and can support multi-outcome predictions for diverse health conditions.
However, methodological, validation, infrastructural, and regulatory chal-
lenges remain. These include inadequate methods for time-to-event modelling,
poor calibration of predictions, limited external validation, and bias
affecting underrepresented groups. High infrastructure costs and the absence of
clear regulatory frameworks further prevent adoption.
Implications: Further work and interdisciplinary collaboration are needed to
support equitable and effective integra- tion into the clinical prediction.
Developing temporally aware, fair, and explainable models should be a priority
focus for transforming clinical prediction workflow.
comment: Published: BMC Diagnostic and Prognostic Research
♻ ☆ TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context AACL
Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
In the landscape of Fact-based Judgment Prediction and Explanation (FJPE),
reliance on factual data is essential for developing robust and realistic
AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest
annotated dataset for FJPE tailored to the Indian legal context, encompassing
judgments from the Supreme Court of India and various High Courts. Derived from
the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset
is uniquely designed to focus on factual statements rather than complete legal
texts, reflecting real-world judicial processes where factual data drives
outcomes. Complementing this dataset, we present FactLegalLlama, an
instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM),
optimized for generating high-quality explanations in FJPE tasks. Finetuned on
the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy
with coherent, contextually relevant explanations, addressing the critical need
for transparency and interpretability in AI-assisted legal systems. Our
methodology combines transformers for binary judgment prediction with
FactLegalLlama for explanation generation, creating a robust framework for
advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses
existing datasets in scale and diversity but also establishes a benchmark for
building explainable AI systems in legal analysis. The findings underscore the
importance of factual precision and domain-specific tuning in enhancing
predictive performance and interpretability, positioning TathyaNyaya and
FactLegalLlama as foundational resources for AI-assisted legal decision-making.
comment: Paper accepted in the AACL-IJCNLP 2025 conference
♻ ☆ Pragmatic Reasoning improves LLM Code Generation
Large Language Models (LLMs) have demonstrated impressive potential in
translating natural language (NL) instructions into program code. However, user
instructions often contain inherent ambiguities, making it challenging for LLMs
to generate code that accurately reflects the user's true intent. To address
this challenge, researchers have proposed approaches that produce multiple
candidates of the program code and then rerank them to identify the best
solution. In this paper, we propose CodeRSA, a novel code candidate reranking
mechanism built upon the Rational Speech Act (RSA) framework, designed to guide
LLMs toward more comprehensive pragmatic reasoning about user intent. We
evaluate CodeRSA using Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct on two
widely used code generation benchmarks, HumanEval and MBPP. Our experiment
results show that CodeRSA consistently outperforms common baselines, surpasses
the state-of-the-art approach in most cases, and demonstrates robust overall
performance. These findings underscore the effectiveness of integrating
pragmatic reasoning into code candidate reranking, offering a promising
direction for enhancing code generation quality in LLMs.
♻ ☆ CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field LREC 2026
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
comment: Preprint submitted to LREC 2026 (under review) To access the dataset,
see https://github.com/bonzid/CareMedEval
♻ ☆ NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System AACL
Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Legal Judgment Prediction (LJP) has emerged as a key area in AI for law,
aiming to automate judicial outcome forecasting and enhance interpretability in
legal reasoning. While previous approaches in the Indian context have relied on
internal case content such as facts, issues, and reasoning, they often overlook
a core element of common law systems, which is reliance on statutory provisions
and judicial precedents. In this work, we propose NyayaRAG, a
Retrieval-Augmented Generation (RAG) framework that simulates realistic
courtroom scenarios by providing models with factual case descriptions,
relevant legal statutes, and semantically retrieved prior cases. NyayaRAG
evaluates the effectiveness of these combined inputs in predicting court
decisions and generating legal explanations using a domain-specific pipeline
tailored to the Indian legal system. We assess performance across various input
configurations using both standard lexical and semantic metrics as well as
LLM-based evaluators such as G-Eval. Our results show that augmenting factual
inputs with structured legal knowledge significantly improves both predictive
accuracy and explanation quality.
comment: Paper accepted in the AACL-IJCNLP 2025 conference
♻ ☆ Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter
Data filtering has become a powerful tool for improving model performance
while reducing computational cost. However, as large language model compute
budgets continue to grow, the limited data volume provided by heavily filtered
and deduplicated datasets will become a practical constraint. In efforts to
better understand how to proceed, we study model performance at various compute
budgets and across multiple pre-training datasets created through data
filtering and deduplication. We find that, given appropriate modifications to
the training recipe, repeating existing aggressively filtered datasets for up
to ten epochs can outperform training on the ten times larger superset for a
single epoch across multiple compute budget orders of magnitude. While this
finding relies on repeating the dataset for many epochs, we also investigate
repeats within these datasets at the document level. We find that not all
documents within a dataset are equal, and we can create better datasets
relative to a token budget by explicitly manipulating the counts of individual
documents. We conclude by arguing that even as large language models scale,
data filtering remains an important direction of research.
♻ ☆ Mathematics with large language models as provers and verifiers
During 2024 and 2025 the discussion about the theorem-proving capabilities of
large language models started reporting interesting success stories, mostly to
do with difficult exercises (such as problems from the International
Mathematical Olympiad), but also with conjectures [Feldman & Karbasi,
arXiv:2509.18383v1] formulated for the purpose of verifying whether the
artificial intelligence could prove it. In this paper we report a theorem
proving feat achieved by ChatGPT by using a protocol involving different prover
and verifier instances of the gpt-5 model working collaboratively. To make sure
that the produced proofs do not suffer from hallucinations, the final proof is
formally verified by the lean proof assistant, and the conformance of premises
and conclusion of the lean code is verified by a human. Our methodology is by
no means complete or exact. It was nonetheless able to solve five out of six
2025 IMO problems, and close about a third of the sixty-six number theory
conjectures in [Cohen, Journal of Integer Sequences, 2025].
♻ ☆ RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability NeurIPS 2025
Recent advancements in multimodal models have significantly improved
vision-language (VL) alignment in radiology. However, existing approaches
struggle to effectively utilize complex radiology reports for learning and
offer limited interpretability through attention probability visualizations. To
address these challenges, we introduce $\textbf{RadZero}$, a novel framework
for VL alignment in chest X-ray with zero-shot multi-task capability. A key
component of our approach is $\textbf{VL-CABS}$
($\textbf{V}$ision-$\textbf{L}$anguage $\textbf{C}$ross-$\textbf{A}$ttention
$\textbf{B}$ased on $\textbf{S}$imilarity), which aligns text embeddings with
local image features for interpretable, fine-grained VL reasoning. RadZero
leverages large language models to extract concise semantic sentences from
radiology reports and employs multi-positive contrastive training to
effectively capture relationships between images and multiple relevant textual
descriptions. It uses a pre-trained vision encoder with additional trainable
Transformer layers, allowing efficient high-resolution image processing. By
computing similarity between text embeddings and local image patch features,
VL-CABS enables zero-shot inference with similarity probability for
classification, and pixel-level VL similarity maps for grounding and
segmentation. Experimental results on public chest radiograph benchmarks show
that RadZero outperforms state-of-the-art methods in zero-shot classification,
grounding, and segmentation. Furthermore, VL similarity map analysis highlights
the potential of VL-CABS for improving explainability in VL alignment.
Additionally, qualitative evaluation demonstrates RadZero's capability for
open-vocabulary semantic segmentation, further validating its effectiveness in
medical imaging. Code is available at
$\href{https://github.com/deepnoid-ai/RadZero}{https://github.com/deepnoid-ai/RadZero}$.
comment: NeurIPS 2025
♻ ☆ Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models NeurIPS 2025
Large language models (LLMs) have significantly advanced in reasoning tasks
through reinforcement learning (RL) optimization, achieving impressive
capabilities across various challenging benchmarks. However, our empirical
analysis reveals a critical drawback: reasoning-oriented RL fine-tuning
significantly increases the prevalence of hallucinations. We theoretically
analyze the RL training dynamics, identifying high-variance gradient,
entropy-induced randomness, and susceptibility to spurious local optima as key
factors leading to hallucinations. To address this drawback, we propose
Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL
fine-tuning algorithm incorporating explicit factuality verification at each
reasoning step. FSPO leverages automated verification against given evidence to
dynamically adjust token-level advantage values, incentivizing factual
correctness throughout the reasoning process. Experiments across mathematical
reasoning and hallucination benchmarks using Qwen2.5 and Llama models
demonstrate that FSPO effectively reduces hallucinations while enhancing
reasoning accuracy, substantially improving both reliability and performance.
comment: accepted by NeurIPS 2025
♻ ☆ Who is the root in a syntactic dependency structure?
The syntactic structure of a sentence can be described as a tree that
indicates the syntactic relationships between words. In spite of significant
progress in unsupervised methods that retrieve the syntactic structure of
sentences, guessing the right direction of edges is still a challenge. As in a
syntactic dependency structure edges are oriented away from the root, the
challenge of guessing the right direction can be reduced to finding an
undirected tree and the root. The limited performance of current unsupervised
methods demonstrates the lack of a proper understanding of what a root vertex
is from first principles. We consider an ensemble of centrality scores, some
that only take into account the free tree (non-spatial scores) and others that
take into account the position of vertices (spatial scores). We test the
hypothesis that the root vertex is an important or central vertex of the
syntactic dependency structure. We confirm the hypothesis in the sense that
root vertices tend to have high centrality and that vertices of high centrality
tend to be roots. The best performance in guessing the root is achieved by
novel scores that only take into account the position of a vertex and that of
its neighbours. We provide theoretical and empirical foundations towards a
universal notion of rootness from a network science perspective.
comment: Background and discussion improved. Clarity and consistency enhanced.
Language improved. Typos corrected
♻ ☆ Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models AACL
Probing the multilingual knowledge of linguistic structure in LLMs, often
characterized as sequence labeling, faces challenges with maintaining output
templates in current text-to-text prompting strategies. To solve this, we
introduce a decomposed prompting approach for sequence labeling tasks.
Diverging from the single text-to-text prompt, our prompt method generates for
each token of the input sentence an individual prompt which asks for its
linguistic label. We test our method on the Universal Dependencies
part-of-speech tagging dataset for 38 languages, using both English-centric and
multilingual LLMs. Our findings show that decomposed prompting surpasses the
iterative prompting baseline in efficacy and efficiency under zero- and
few-shot settings. Moreover, our analysis of multilingual performance of
English-centric LLMs yields insights into the transferability of linguistic
knowledge via multilingual prompting.
comment: Accepted to AACL-IJCNLP 2025 Findings
♻ ☆ On Multilingual Encoder Language Model Compression for Low-Resource Languages AACL
In this paper, we combine two-step knowledge distillation, structured
pruning, truncation, and vocabulary trimming for extremely compressing
multilingual encoder-only language models for low-resource languages. Our novel
approach systematically combines existing techniques and takes them to the
extreme, reducing layer depth, feed-forward hidden size, and intermediate layer
embedding size to create significantly smaller monolingual models while
retaining essential language-specific knowledge. We achieve compression rates
of up to 92% while maintaining competitive performance, with average drops of
2-10% for moderate compression and 8-13% at maximum compression in four
downstream tasks, including sentiment analysis, topic classification, named
entity recognition, and part-of-speech tagging, across three low-resource
languages. Notably, the performance degradation correlates with the amount of
language-specific data in the teacher model, with larger datasets resulting in
smaller performance losses. Additionally, we conduct ablation studies to
identify the best practices for multilingual model compression using these
techniques.
comment: Accepted to SRW AACL
♻ ☆ Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?
Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman
The paper presents an overview of the fourth edition of the Shared Task on
Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025
workshop. As in the previous editions, participants were challenged to develop
systems that identify mentions and cluster them according to identity
coreference.
A key innovation of this year's task was the introduction of a dedicated
Large Language Model (LLM) track, featuring a simplified plaintext format
designed to be more suitable for LLMs than the original CoNLL-U representation.
The task also expanded its coverage with three new datasets in two additional
languages, using version 1.3 of CorefUD - a harmonized multilingual collection
of 22 datasets in 17 languages.
In total, nine systems participated, including four LLM-based approaches (two
fine-tuned and two using few-shot adaptation). While traditional systems still
kept the lead, LLMs showed clear potential, suggesting they may soon challenge
established approaches in future editions.
comment: Accepted to CODI-CRAC 2025
♻ ☆ CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution
We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on
Multilingual Coreference Resolution. This fourth iteration of the shared task
introduces a new LLM track alongside the original unconstrained track, features
reduced development and test sets to lower computational requirements, and
includes additional datasets. CorPipe 25 represents a complete reimplementation
of our previous systems, migrating from TensorFlow to PyTorch. Our system
significantly outperforms all other submissions in both the LLM and
unconstrained tracks by a substantial margin of 8 percentage points. The source
code and trained models are publicly available at
https://github.com/ufal/crac2025-corpipe.
comment: Accepted to CODI-CRAC 2025
♻ ☆ Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics
Large Language Models (LLMs) have emerged as a promising cornerstone for the
development of natural language processing (NLP) and artificial intelligence
(AI). However, ensuring the robustness of LLMs remains a critical challenge. To
address these challenges and advance the field, this survey provides a
comprehensive overview of current studies in this area. First, we
systematically examine the nature of robustness in LLMs, including its
conceptual foundations, the importance of consistent performance across diverse
inputs, and the implications of failure modes in real-world applications. Next,
we analyze the sources of non-robustness, categorizing intrinsic model
limitations, data-driven vulnerabilities, and external adversarial factors that
compromise reliability. Following this, we review state-of-the-art mitigation
strategies, and then we discuss widely adopted benchmarks, emerging metrics,
and persistent gaps in assessing real-world reliability. Finally, we synthesize
findings from existing surveys and interdisciplinary studies to highlight
trends, unresolved issues, and pathways for future research.
comment: Accepted at TMLR
♻ ☆ META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
Evidence-based medicine (EBM) holds a crucial role in clinical application.
Given suitable medical articles, doctors effectively reduce the incidence of
misdiagnoses. Researchers find it efficient to use large language models (LLMs)
techniques like RAG for EBM tasks. However, the EBM maintains stringent
requirements for evidence, and RAG applications in EBM struggle to efficiently
distinguish high-quality evidence. Therefore, inspired by the meta-analysis
used in EBM, we provide a new method to re-rank and filter the medical
evidence. This method presents multiple principles to filter the best evidence
for LLMs to diagnose. We employ a combination of several EBM methods to emulate
the meta-analysis, which includes reliability analysis, heterogeneity analysis,
and extrapolation analysis. These processes allow the users to retrieve the
best medical evidence for the LLMs. Ultimately, we evaluate these high-quality
articles and show an accuracy improvement of up to 11.4% in our experiments and
results. Our method successfully enables RAG to extract higher-quality and more
reliable evidence from the PubMed dataset. This work can reduce the infusion of
incorrect knowledge into responses and help users receive more effective
replies.
♻ ☆ Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Large Language Models (LLMs) often exhibit significant behavioral shifts when
they perceive a change from a real-world deployment context to a controlled
evaluation setting, a phenomenon known as "evaluation awareness." This
discrepancy poses a critical challenge for AI alignment, as benchmark
performance may not accurately reflect a model's true safety and honesty. In
this work, we systematically quantify these behavioral changes by manipulating
the perceived context of prompts. We introduce a methodology that uses a linear
probe to score prompts on a continuous scale from "test-like" to "deploy-like"
and leverage an LLM rewriting strategy to shift these prompts towards a more
natural, deployment-style context while preserving the original task. Using
this method, we achieved a 30% increase in the average probe score across a
strategic role-playing dataset after rewriting. Evaluating a suite of
state-of-the-art models on these original and rewritten prompts, we find that
rewritten "deploy-like" prompts induce a significant and consistent shift in
behavior. Across all models, we observed an average increase in honest
responses of 5.26% and a corresponding average decrease in deceptive responses
of 12.40%. Furthermore, refusal rates increased by an average of 6.38%,
indicating heightened safety compliance. Our findings demonstrate that
evaluation awareness is a quantifiable and manipulable factor that directly
influences LLM behavior, revealing that models are more prone to unsafe or
deceptive outputs in perceived test environments. This underscores the urgent
need for more realistic evaluation frameworks to accurately gauge true model
alignment before deployment.
♻ ☆ Training Large Language Models To Reason In Parallel With Global Forking Tokens
Although LLMs have demonstrated improved performance by scaling parallel
test-time compute, doing so relies on generating reasoning paths that are both
diverse and accurate. For challenging problems, the forking tokens that trigger
diverse yet correct reasoning modes are typically deep in the sampling tree.
Consequently, common strategies to encourage diversity, such as temperature
scaling, encounter a worsened trade-off between diversity and accuracy.
Motivated by this challenge, we treat parallel reasoning as a
set-of-next-token-prediction problem, and incorporate a set-based global loss
into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching
between our global forking tokens and unique reasoning traces. We observe that,
while naive fine-tuning with multiple reasoning traces collapses these unique
reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT),
preserves these modes and produces emergent global forking tokens. Experiments
on multiple reasoning benchmarks show that our SSFT consistently outperforms
SFT under both Pass@1 and Cons@k metrics.
♻ ☆ KGGen: Extracting Knowledge Graphs from Plain Text with Language Models
Belinda Mo, Kyssen Yu, Joshua Kazdan, Joan Cabezas, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo
Recent interest in building foundation models for KGs has highlighted a
fundamental challenge: knowledge-graph data is relatively scarce. The
best-known KGs are primarily human-labeled, created by pattern-matching, or
extracted using early NLP techniques. While human-generated KGs are in short
supply, automatically extracted KGs are of questionable quality. We present a
solution to this data scarcity problem in the form of a text-to-KG generator
(KGGen), a package that uses language models to create high-quality graphs from
plaintext. Unlike other KG extractors, KGGen clusters related entities to
reduce sparsity in extracted KGs. KGGen is available as a Python library
(\texttt{pip install kg-gen}), making it accessible to everyone. Along with
KGGen, we release the first benchmark, Measure of of Information in Nodes and
Edges (MINE), that tests an extractor's ability to produce a useful KG from
plain text. We benchmark our new tool against existing extractors and
demonstrate far superior performance.
♻ ☆ Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents ACL 2025
Retrieval-augmented generation (RAG) based large language models (LLMs) are
widely used in finance for their excellent performance on knowledge-intensive
tasks. However, standardized documents (e.g., SEC filing) share similar formats
such as repetitive boilerplate texts, and similar table structures. This
similarity forces traditional RAG methods to misidentify near-duplicate text,
leading to duplicate retrieval that undermines accuracy and completeness. To
address these issues, we propose the Hierarchical Retrieval with Evidence
Curation (HiREC) framework. Our approach first performs hierarchical retrieval
to reduce confusion among similar texts. It first retrieve related documents
and then selects the most relevant passages from the documents. The evidence
curation process removes irrelevant passages. When necessary, it automatically
generates complementary queries to collect missing information. To evaluate our
approach, we construct and release a Large-scale Open-domain Financial (LOFin)
question answering benchmark that includes 145,897 SEC documents and 1,595
question-answer pairs. Our code and data are available at
https://github.com/deep-over/LOFin-bench-HiREC.
comment: ACL 2025 (Findings)
♻ ☆ Efficient Model Development through Fine-tuning Transfer
Modern LLMs struggle with efficient updates, as each new pretrained model
version requires repeating expensive alignment processes. This challenge also
applies to domain- or languagespecific models, where fine-tuning on specialized
data must be redone for every new base model release. In this paper, we explore
the transfer of fine-tuning updates between model versions. Specifically, we
derive the diff vector (representing the weight changes from finetuning) from
one source model version and apply it to the base model of a different target
version. Through empirical evaluations on various open-weight model versions,
we show that transferring diff vectors can significantly improve the
performance of the target base model. For example, transferring the fine-tuning
updates from Llama 3.0 8B improves Llama 3.1 8B by 46.9% on IFEval and 15.7% on
LiveCodeBench without additional training, even surpassing Llama 3.1 8B
Instruct. Furthermore, we demonstrate performance gains on multilingual tasks,
with 4.7% and 15.5% improvements on Global MMLU for Malagasy and Turkish,
respectively. We observe that these merged models provide stronger
initializations for further fine-tuning. Lastly, our controlled experiments
suggest that fine-tuning transfer is most effective when source and target
models lie in a linearly connected region of parameter space, and we provide a
theoretical analysis of our method. Taken together, fine-tuning transfer offers
a cost-efficient and practical strategy for continuous LLM development. Our
code is available at github.com/pjlintw/finetuning-transfer.
comment: 25 pages, 4 figures, 16 tables
♻ ☆ TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal
pairs, designed to evaluate the linguistic abilities of monolingual and
multilingual language models (LMs). Covering 16 linguistic phenomena with 1000
minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation
resources for Turkish. In designing the benchmark, we give extra attention to
two properties of Turkish that remain understudied in current syntactic
evaluations of LMs, namely word order flexibility and subordination through
morphological processes. Our experiments on a wide range of LMs and a newly
collected set of human acceptability judgments reveal that even cutting-edge
Large LMs still struggle with grammatical phenomena that are not challenging
for humans, and may also exhibit different sensitivities to word order and
morphological complexity compared to humans.
♻ ☆ FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning EMNLP2025
Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang
Large Language Models (LLMs) demonstrate significant potential but face
challenges in complex financial reasoning tasks requiring both domain knowledge
and sophisticated reasoning. Current evaluation benchmarks often fall short by
not decoupling these capabilities indicators from single task performance and
lack root cause analysis for task failure. To address this, we introduce
FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs'
knowledge and reasoning abilities independently, proposing distinct knowledge
score and reasoning score metrics. Inspired by cognitive science, we further
propose a cognitive score based on Bloom's taxonomy to analyze capabilities in
reasoning tasks across different cognitive levels. We also release a new
open-source Chinese financial reasoning dataset covering 22 subfields to
support reproducible research and further advancements in financial reasoning.
Our experimental results reveal that LLM reasoning ability and higher-order
cognitive ability are the core factors influencing reasoning accuracy. We also
specifically find that even top models still face a bottleneck with knowledge
application. Furthermore, our analysis shows that specialized financial LLMs
generally lag behind the top general large models across multiple metrics.
comment: Accepted by FinNLP@EMNLP2025
♻ ☆ Control Barrier Function for Aligning Large Language Models
This paper proposes a control-based framework for aligning large language
models (LLMs) by leveraging a control barrier function (CBF) to ensure
user-desirable text generation. The presented framework applies the CBF safety
filter to the predicted token generated from the baseline LLM, to intervene in
the generated text. The safety filter includes two significant advantages: this
safety filter is an add-on type, allowing it to be used for alignment purposes
without fine-tuning the baseline LLM, and if there is an evaluation model
regarding the desired alignment, it can be directly applied to the filter
design. The overall text-generation system is implemented with open-source
language models, aiming to generate positive text.
comment: This work is an extenede version of arXiv:2408.15625 and has been
submitted to the IEEE for possible publication
♻ ☆ RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Medical question answering requires advanced reasoning that integrates domain
knowledge with logical inference. However, existing large language models
(LLMs) often generate reasoning chains that lack factual accuracy and clinical
reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a
novel framework that uniquely combines reinforcement learning with
preference-driven reasoning refinement to enhance clinical chain-of-thought
(CoT) performance. RPRO differentiates itself from prior approaches by
employing task-adaptive reasoning templates and a probabilistic evaluation
mechanism that aligns outputs with established clinical workflows, while
automatically identifying and correcting low-quality reasoning chains. Unlike
traditional pairwise preference methods, RPRO introduces a groupwise ranking
optimization based on the Bradley-Terry model and incorporates KL-divergence
regularization for stable training. Experiments on PubMedQA and MedQA-USMLE
show consistent improvements over strong baselines. Remarkably, our 1.1B
parameter model outperforms much larger 7B-13B models, including
medical-specialized variants. These findings demonstrate that combining
preference optimization with quality-driven refinement offers a scalable and
effective approach to building more reliable, clinically grounded medical LLMs.
♻ ☆ GraphCheck: Multipath Fact-Checking with Entity-Relationship Graphs
Automated fact-checking aims to assess the truthfulness of textual claims
based on relevant evidence. However, verifying complex claims that require
multi-hop reasoning remains a significant challenge. We propose GraphCheck, a
novel framework that transforms claims into entity-relationship graphs for
structured and systematic fact-checking. By explicitly modeling both explicit
and latent entities and exploring multiple reasoning paths, GraphCheck enhances
verification robustness. While GraphCheck excels in complex scenarios, it may
be unnecessarily elaborate for simpler claims. To address this, we introduce
DP-GraphCheck, a variant that employs a lightweight strategy selector to choose
between direct prompting and GraphCheck adaptively. This selective mechanism
improves both accuracy and efficiency by applying the appropriate level of
reasoning to each claim. Experiments on the HOVER and EX-FEVER datasets
demonstrate that our approach outperforms existing methods in verification
accuracy, while achieving strong computational efficiency despite its multipath
exploration. Moreover, the strategy selection mechanism in DP-GraphCheck
generalizes well to other fact-checking pipelines, highlighting the broad
applicability of our framework.
♻ ☆ Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion
Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang
Recently, the concept of ``compression as intelligence'' has provided a novel
informatics metric perspective for language models (LMs), emphasizing that
highly structured representations signify the intelligence level of LMs.
However, from a geometric standpoint, the word representation space of highly
compressed LMs tends to degenerate into a highly anisotropic state, which
hinders the LM's ability to comprehend instructions and directly impacts its
performance. We found this compression-anisotropy synchronicity is essentially
the ``Compression Hacking'' in LM representations, where noise-dominated
directions tend to create the illusion of high compression rates by sacrificing
spatial uniformity. Based on this, we propose three refined compression metrics
by incorporating geometric distortion analysis and integrate them into a
self-evaluation pipeline. The refined metrics exhibit strong alignment with the
LM's comprehensive capabilities, achieving Spearman correlation coefficients
above 0.9, significantly outperforming both the original compression and other
internal structure-based metrics. This confirms that compression hacking
substantially enhances the informatics interpretation of LMs by incorporating
geometric distortion of representations.
♻ ☆ VERA: Variational Inference Framework for Jailbreaking Large Language Models NeurIPS 2025
The rise of API-only access to state-of-the-art LLMs highlights the need for
effective black-box jailbreak methods to identify model vulnerabilities in
real-world settings. Without a principled objective for gradient-based
optimization, most existing approaches rely on genetic algorithms, which are
limited by their initialization and dependence on manually curated prompt
pools. Furthermore, these methods require individual optimization for each
prompt, failing to provide a comprehensive characterization of model
vulnerabilities. To address this gap, we introduce VERA: Variational infErence
fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a
variational inference problem, training a small attacker LLM to approximate the
target LLM's posterior over adversarial prompts. Once trained, the attacker can
generate diverse, fluent jailbreak prompts for a target query without
re-optimization. Experimental results show that VERA achieves strong
performance across a range of target LLMs, highlighting the value of
probabilistic inference for adversarial prompt generation.
comment: Accepted by NeurIPS 2025