Computation and Language
☆ Exploring Training and Inference Scaling Laws in Generative Retrieval
Generative retrieval has emerged as a novel paradigm that leverages large
language models (LLMs) to autoregressively generate document identifiers.
Although promising, the mechanisms that underpin its performance and
scalability remain largely unclear. We conduct a systematic investigation of
training and inference scaling laws in generative retrieval, exploring how
model size, training data scale, and inference-time compute jointly influence
retrieval performance. To address the lack of suitable metrics, we propose a
novel evaluation measure inspired by contrastive entropy and generation loss,
providing a continuous performance signal that enables robust comparisons
across diverse generative retrieval methods. Our experiments show that
n-gram-based methods demonstrate strong alignment with both training and
inference scaling laws, especially when paired with larger LLMs. Furthermore,
increasing inference computation yields substantial performance gains,
revealing that generative retrieval can significantly benefit from higher
compute budgets at inference. Across these settings, LLaMA models consistently
outperform T5 models, suggesting a particular advantage for larger decoder-only
models in generative retrieval. Taken together, our findings underscore that
model sizes, data availability, and inference computation interact to unlock
the full potential of generative retrieval, offering new insights for designing
and optimizing future systems.
☆ xKV: Cross-Layer SVD for KV-Cache Compression
Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Large Language Models (LLMs) with long context windows enable powerful
applications but come at the cost of high memory consumption to store the Key
and Value states (KV-Cache). Recent studies attempted to merge KV-cache from
multiple layers into shared representations, yet these approaches either
require expensive pretraining or rely on assumptions of high per-token cosine
similarity across layers which generally does not hold in practice. We find
that the dominant singular vectors are remarkably well-aligned across multiple
layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple
post-training method that applies Singular Value Decomposition (SVD) on the
KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers
into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through
extensive evaluations on the RULER long-context benchmark with widely-used LLMs
(e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates
than the state-of-the-art inter-layer technique while improving accuracy by 2.7%.
Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA)
(e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding
tasks without performance degradation. These results highlight xKV's strong
capability and versatility in addressing memory bottlenecks for long-context
LLM inference. Our code is publicly available at:
https://github.com/abdelfattah-lab/xKV.
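A minimal sketch of the cross-layer idea, assuming NumPy is available. It is one plausible reading of the abstract, not the authors' implementation: the KV matrices of a layer group are concatenated along the feature axis (an assumption), a truncated SVD is taken, and the token-side basis is stored once for the whole group. The function names `shared_lowrank_kv` and `reconstruct` are hypothetical.

```python
import numpy as np

def shared_lowrank_kv(kv_layers, rank):
    """Compress a group of per-layer KV matrices into one shared basis.

    kv_layers: list of arrays, each of shape (num_tokens, hidden_dim).
    Returns (basis, coeffs): basis is (num_tokens, rank) and is shared by
    all layers; coeffs is a list of (rank, hidden_dim) matrices, one per layer.
    """
    # Concatenate along the feature axis: (num_tokens, num_layers * hidden_dim).
    stacked = np.concatenate(kv_layers, axis=1)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    # Keep the top-`rank` directions; fold singular values into the basis.
    basis = u[:, :rank] * s[:rank]
    # Split the right factor back into one reconstruction matrix per layer.
    coeffs = np.split(vt[:rank], len(kv_layers), axis=1)
    return basis, coeffs

def reconstruct(basis, coeffs, layer):
    """Approximate the original KV matrix of one layer."""
    return basis @ coeffs[layer]
```

Storage drops from `num_layers * num_tokens * hidden_dim` values to `num_tokens * rank + num_layers * rank * hidden_dim`, which is a saving only when the layers really do share dominant singular directions.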
☆ SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can
naturally emerge through a simple reinforcement learning (RL) framework with
rule-based rewards, where the training may directly start from the base
models -- a paradigm referred to as zero RL training. Most recent efforts to
reproduce zero RL training have primarily focused on the Qwen2.5 model series,
which may not be representative as we find the base models already exhibit
strong instruction-following and self-reflection abilities. In this work, we
investigate zero RL training across 10 diverse base models, spanning different
families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B,
Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several
key design strategies -- such as adjusting format reward and controlling query
difficulty -- we achieve substantial improvements in both reasoning accuracy and
response length across most settings. However, by carefully monitoring the
training dynamics, we observe that different base models exhibit distinct
patterns during training. For instance, the increased response length does not
always correlate with the emergence of certain cognitive behaviors such as
verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for
the first time in small models not from the Qwen family. We share the key
designs that enable successful zero RL training, along with our findings and
practices. To facilitate further research, we open-source the code, models, and
analysis tools.
☆ AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration
Multi-agent systems (MAS) based on large language models (LLMs) have
demonstrated significant potential in collaborative problem-solving. However,
they still face substantial challenges of low communication efficiency and
suboptimal task performance, making the careful design of the agents'
communication topologies particularly important. Inspired by the management
theory that roles in an efficient team are often dynamically adjusted, we
propose AgentDropout, which identifies redundant agents and communication
across different communication rounds by optimizing the adjacency matrices of
the communication graphs and eliminates them to enhance both token efficiency
and task performance. Compared to state-of-the-art methods, AgentDropout
achieves an average reduction of 21.6% in prompt token consumption and 18.4% in
completion token consumption, along with a performance improvement of 1.14 on
the tasks. Furthermore, the extended experiments demonstrate that AgentDropout
achieves notable domain transferability and structure robustness, revealing its
reliability and effectiveness. We release our code at
https://github.com/wangzx1219/AgentDropout.
☆ Toward building next-generation Geocoding systems: a systematic review
Zhengcong Yin, Daniel W. Goldberg, Binbin Lin, Bing Zhou, Diya Li, Andong Ma, Ziqian Ming, Heng Cai, Zhe Zhang, Shaohua Wang, Shanzhen Gao, Joey Ying Lee, Xiao Li, Da Huo
Geocoding systems are widely used in both scientific research for spatial
analysis and everyday life through location-based services. The quality of
geocoded data significantly impacts subsequent processes and applications,
underscoring the need for next-generation systems. In response to this demand,
this review first examines the evolving requirements for geocoding inputs and
outputs across various scenarios these systems must address. It then provides a
detailed analysis of how to construct such systems by breaking them down into
key functional components and reviewing a broad spectrum of existing
approaches, from traditional rule-based methods to advanced techniques in
information retrieval, natural language processing, and large language models.
Finally, we identify opportunities to improve next-generation geocoding systems
in light of recent technological advances.
☆ I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Large Language Models (LLMs) have achieved remarkable success in natural
language processing. Recent advances have led to the development of a new class
of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved
state-of-the-art performance by integrating deep thinking and complex
reasoning. Despite these impressive capabilities, the internal reasoning
mechanisms of such models remain unexplored. In this work, we employ Sparse
Autoencoders (SAEs), a method to learn a sparse decomposition of latent
representations of a neural network into interpretable features, to identify
features that drive reasoning in the DeepSeek-R1 series of models. First, we
propose an approach to extract candidate "reasoning features" from SAE
representations. We validate these features through empirical analysis and
interpretability methods, demonstrating their direct correlation with the
model's reasoning abilities. Crucially, we demonstrate that steering these
features systematically enhances reasoning performance, offering the first
mechanistic account of reasoning in LLMs. Code available at
https://github.com/AIRI-Institute/SAE-Reasoning
☆ Reasoning to Learn from Latent Thoughts
Compute scaling for language model (LM) pretraining has outpaced the growth
of human-written texts, leading to concerns that data will become the
bottleneck to LM scaling. To continue scaling pretraining in this
data-constrained regime, we propose that explicitly modeling and inferring the
latent thoughts that underlie the text generation process can significantly
improve pretraining data efficiency. Intuitively, our approach views web text
as the compressed final outcome of a verbose human thought process and posits that the
latent thoughts contain important contextual knowledge and reasoning steps that
are critical to data-efficient learning. We empirically demonstrate the
effectiveness of our approach through data-constrained continued pretraining
for math. We first show that synthetic data approaches to inferring latent
thoughts significantly improve data efficiency, outperforming training on the
same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we
demonstrate latent thought inference without a strong teacher, where an LM
bootstraps its own performance by using an EM algorithm to iteratively improve
the capability of the trained LM and the quality of thought-augmented
pretraining data. We show that a 1B LM can bootstrap its performance across at
least three iterations and significantly outperform baselines trained on raw
data, with increasing gains from additional inference compute when performing
the E-step. The gains from inference scaling and EM iterations suggest new
opportunities for scaling data-constrained pretraining.
☆ EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments
We develop benchmarks for LLM agents that act in, learn from, and strategize
in unknown environments, the specifications of which the LLM agent must learn
over time from deliberate exploration. Our benchmarks consist of
decision-making tasks derived from key problems in economics. To forestall
saturation, the benchmark tasks are synthetically generated with scalable
difficulty levels. Additionally, we propose litmus tests, a new kind of
quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests
quantify differences in character, values, and tendencies of LLMs and LLM
agents, by considering their behavior when faced with tradeoffs (e.g.,
efficiency versus equality) where there is no objectively right or wrong
behavior. Overall, our benchmarks and litmus tests assess the abilities and
tendencies of LLM agents in tackling complex economic problems in diverse
settings spanning procurement, scheduling, task allocation, and pricing --
applications that should grow in importance as such agents are further
integrated into the economy.
☆ REALM: A Dataset of Real-World LLM Use Cases
Large Language Models, such as the GPT series, have driven significant
industrial applications, leading to economic and societal transformations.
However, a comprehensive understanding of their real-world applications remains
limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use
cases collected from Reddit and news articles. REALM captures two key
dimensions: the diverse applications of LLMs and the demographics of their
users. It categorizes LLM applications and explores how users' occupations
relate to the types of applications they use. By integrating real-world data,
REALM offers insights into LLM adoption across different domains, providing a
foundation for future research on their evolving societal roles. A dedicated
dashboard https://realm-e7682.web.app/ presents the data.
comment: 9 pages, 5 figures
☆ BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
The growing adoption of long-context Large Language Models (LLMs) has
introduced significant memory and computational challenges in autoregressive
decoding due to the expanding Key-Value (KV) cache. KV cache quantization has
emerged as a promising solution, with prior work showing that 4-bit or even
2-bit quantization can maintain model accuracy while reducing memory costs.
However, despite these benefits, preliminary implementations for the low-bit KV
cache struggle to deliver the expected speedup due to quantization and
dequantization overheads and the lack of Tensor Cores utilization. In this
work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor
Cores for efficient decoding with low-bit KV cache. Efficiently leveraging
Tensor Cores for low-bit KV cache is challenging due to the dynamic nature of
KV cache generation at each decoding step. BitDecoding addresses these
challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data
layout compatibility to enable high utilization of Tensor Cores. Additionally,
BitDecoding incorporates a warp-efficient parallel decoding kernel and a
fine-grained asynchronous pipeline, minimizing dequantization overhead and
improving computational efficiency. Experiments show that BitDecoding achieves
up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to
FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV
cache implementation (QServe) by up to 4.3x. On LLaMA-3.1-8B with a 128K
sequence length, BitDecoding reduces single-batch decoding latency by 3x,
demonstrating its effectiveness in long-context generation scenarios. The code
is available at https://github.com/DD-DuDa/BitDecoding.
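For readers unfamiliar with low-bit KV caches, the following sketch shows only the storage side: asymmetric 4-bit quantization of one group of values, with two codes packed per byte. It does not model the Tensor Core kernels or the pipeline described in the abstract, and the function names and the per-group min/max scheme are assumptions chosen for illustration.

```python
def quantize_4bit(values):
    """Asymmetric 4-bit quantization of one group of floats."""
    lo, hi = min(values), max(values)
    # 16 levels span [lo, hi]; fall back to 1.0 for a constant group.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count without mutating the input
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.extend((b & 0x0F, b >> 4))
    return out[:count]

def dequantize_4bit(codes, scale, lo):
    """Map 4-bit codes back to approximate float values."""
    return [c * scale + lo for c in codes]
```

The unpack and dequantize steps are the overhead that a decoding kernel has to hide; the rounding error per value is at most half of `scale`.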
☆ AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning
This paper presents AlphaSpace, a novel methodology designed to enhance the
spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian
space navigation. AlphaSpace employs a semantics-based tokenization strategy,
encoding height information through specialized semantic tokens, and integrates
primarily symbolic synthetic reasoning data. This approach enables LLMs to
accurately manipulate objects by positioning them at specific [x, y, z]
coordinates. Experimental results demonstrate that AlphaSpace significantly
outperforms existing models on manipulation subtasks, achieving a total
accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5
Sonnet.
☆ Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages
A key consideration when training an LLM is whether the target language is
more or less resourced, whether this is English compared to Welsh, or Python
compared to Excel. Typical training data for programming languages consist of
real program demonstrations coupled with human-written comments. Here we
present novel approaches to the creation of such data for low resource
programming languages. We generate fully-synthetic, textbook-quality
demonstrations of common library functions in an example domain of Excel
formulas, using a teacher model. We then finetune an underperforming student
model, and show improvement on 2 question-answering datasets recast into the
Excel domain. We show advantages of finetuning over standard, off-the-shelf RAG
approaches, which can offer only modest improvement due to the unfamiliar
target domain.
☆ Construction Identification and Disambiguation Using BERT: A Case Study of NPN ACL
Construction Grammar hypothesizes that knowledge of a language consists
chiefly of knowledge of form-meaning pairs ("constructions") that include
vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work
has shown that transformer language models represent at least some
constructional patterns, including ones where the construction is rare overall.
In this work, we probe BERT's representation of the form and meaning of a minor
construction of English, the NPN (noun-preposition-noun) construction --
exhibited in such expressions as face to face and day to day -- which is known
to be polysemous. We construct a benchmark dataset of semantically annotated
corpus instances (including distractors that superficially resemble the
construction). With this dataset, we train and evaluate probing classifiers.
They achieve decent discrimination of the construction from distractors, as
well as sense disambiguation among true instances of the construction,
revealing that BERT embeddings carry indications of the construction's
semantics. Moreover, artificially permuting the word order of true construction
instances causes them to be rejected, indicating sensitivity to matters of
form. We conclude that BERT does latently encode at least some knowledge of the
NPN construction going beyond a surface syntactic pattern and lexical cues.
comment: 8 pages, ACL long-paper format (preprint)
☆ Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving
The autonomous driving field has seen remarkable advancements in various
topics, such as object recognition, trajectory prediction, and motion planning.
However, current approaches face limitations in effectively comprehending the
complex evolutions of driving scenes over time. This paper proposes FM4SU, a
novel methodology for training a symbolic foundation model (FM) for scene
understanding in autonomous driving. It leverages knowledge graphs (KGs) to
capture sensory observation along with domain knowledge such as road topology,
traffic rules, or complex interactions between traffic participants. A bird's
eye view (BEV) symbolic representation is extracted from the KG for each
driving scene, including the spatio-temporal information among the objects
across the scenes. The BEV representation is serialized into a sequence of
tokens and given to pre-trained language models (PLMs) for learning an inherent
understanding of the co-occurrence among driving scene elements and generating
predictions on the next scenes. We conducted a number of experiments using the
nuScenes dataset and KG in various scenarios. The results demonstrate that
fine-tuned models achieve significantly higher accuracy in all tasks. The
fine-tuned T5 model achieved a next scene prediction accuracy of 86.7%. This
paper concludes that FM4SU offers a promising foundation for developing more
comprehensive models for scene understanding in autonomous driving.
☆ Unsupervised Acquisition of Discrete Grammatical Categories
This article presents experiments performed using a computational laboratory
environment for language acquisition experiments. It implements a multi-agent
system consisting of two agents: an adult language model and a daughter
language model that aims to learn the mother language. Crucially, the daughter
agent does not have access to the internal knowledge of the mother language
model but only to the language exemplars the mother agent generates. These
experiments illustrate how this system can be used to acquire abstract
grammatical knowledge. We demonstrate how statistical analyses of patterns in
the input data corresponding to grammatical categories yield discrete
grammatical rules. These rules are subsequently added to the grammatical
knowledge of the daughter language model. To this end, hierarchical
agglomerative cluster analysis was applied to the utterances consecutively
generated by the mother language model. It is argued that this procedure can be
used to acquire structures resembling grammatical categories proposed by
linguists for natural languages. Thus, it is established that non-trivial
grammatical knowledge has been acquired. Moreover, the parameter configuration
of this computational laboratory environment determined using training data
generated by the mother language model is validated in a second experiment with
a test set similarly resulting in the acquisition of non-trivial categories.
comment: 34 pages, 3 figures, 7 tables
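Hierarchical agglomerative clustering, the analysis named in this abstract, can be sketched as follows. The sketch assumes each item (for example a word) is already represented by a numeric feature vector and uses average linkage with Euclidean distance; the abstract states neither the features nor the linkage, so both are assumptions.

```python
def agglomerative_clusters(points, num_clusters):
    """Average-linkage agglomerative clustering over feature vectors.

    points: list of equal-length numeric tuples.
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Stopping at a fixed number of clusters is what turns the continuous merge hierarchy into discrete categories.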
☆ Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Sarcasm detection, as a crucial research direction in the field of Natural
Language Processing (NLP), has attracted widespread attention. Traditional
sarcasm detection tasks have typically focused on single-modal approaches
(e.g., text), but due to the implicit and subtle nature of sarcasm, such
methods often fail to yield satisfactory results. In recent years, researchers
have shifted the focus of sarcasm detection to multi-modal approaches. However,
effectively leveraging multi-modal information to accurately identify sarcastic
content remains a challenge that warrants further exploration. Leveraging the
powerful integrated processing capabilities of Multi-Modal Large Language
Models (MLLMs) for various information sources, we propose an innovative
multi-modal Commander-GPT framework. Inspired by military strategy, we first
decompose the sarcasm detection task into six distinct sub-tasks. A central
commander (decision-maker) then assigns the best-suited large language model to
address each specific sub-task. Ultimately, the detection results from each
model are aggregated to identify sarcasm. We conducted extensive experiments on
MMSD and MMSD 2.0, utilizing four multi-modal large language models and six
prompting strategies. Our experiments demonstrate that our approach achieves
state-of-the-art performance, with a 19.3% improvement in F1 score, without
necessitating fine-tuning or ground-truth rationales.
☆ ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models
Efficiently searching for relevant case studies is critical in architectural
design, as designers rely on precedent examples to guide or inspire their
ongoing projects. However, traditional text-based search tools struggle to
capture the inherently visual and complex nature of architectural knowledge,
often leading to time-consuming and imprecise exploration. This paper
introduces ArchSeek, an innovative case study search system with recommendation
capability, tailored for architecture design professionals. Powered by the
visual understanding capabilities from vision-language models and cross-modal
embeddings, it enables text and image queries with fine-grained control, and
interaction-based design case recommendations. It offers architects a more
efficient, personalized way to discover design inspirations, with potential
applications across other visually driven design fields. The source code is
available at https://github.com/danruili/ArchSeek.
comment: 15 pages, 8 figures, 3 tables. Accepted by CAAD Futures 2025
☆ AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Agents built on LLMs are increasingly deployed across diverse domains,
automating complex decision-making and task execution. However, their autonomy
introduces safety risks, including security vulnerabilities, legal violations,
and unintended harmful actions. Existing mitigation methods, such as
model-based safeguards and early enforcement strategies, fall short in
robustness, interpretability, and adaptability. To address these challenges, we
propose AgentSpec, a lightweight domain-specific language for specifying and
enforcing runtime constraints on LLM agents. With AgentSpec, users define
structured rules that incorporate triggers, predicates, and enforcement
mechanisms, ensuring agents operate within predefined safety boundaries. We
implement AgentSpec across multiple domains, including code execution, embodied
agents, and autonomous driving, demonstrating its adaptability and
effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe
executions in over 90% of code agent cases, eliminates all hazardous actions in
embodied agent tasks, and enforces 100% compliance by autonomous vehicles
(AVs). Despite its strong safety guarantees, AgentSpec remains computationally
lightweight, with overheads in milliseconds. By combining interpretability,
modularity, and efficiency, AgentSpec provides a practical and scalable
solution for enforcing LLM agent safety across diverse applications. We also
automate the generation of rules using LLMs and assess their effectiveness. Our
evaluation shows that the rules generated by OpenAI o1 achieve a precision of
95.56% and recall of 70.96% for embodied agents, successfully identify
87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8
scenarios.
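The trigger / predicate / enforcement structure described above can be illustrated with a small Python sketch. This is not AgentSpec's DSL syntax, which the abstract does not show; the `Rule` class, the `check` function, and the action dictionary layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Action = Dict[str, str]  # hypothetical shape: {"type": ..., "command": ...}

@dataclass
class Rule:
    trigger: str                                # action type that activates the rule
    predicates: List[Callable[[Action], bool]]  # all must hold for the rule to fire
    enforce: Callable[[Action], str]            # decision returned when it fires

def check(action: Action, rules: List[Rule]) -> str:
    """Return the decision of the first rule that fires, or 'allow'."""
    for rule in rules:
        if action["type"] == rule.trigger and all(p(action) for p in rule.predicates):
            return rule.enforce(action)
    return "allow"

# Example rule: block shell commands that recursively delete files.
block_rm = Rule(
    trigger="shell",
    predicates=[lambda a: "rm -rf" in a["command"]],
    enforce=lambda a: "block",
)
```

Because every rule is evaluated before the action runs, the check is a handful of function calls, in line with the millisecond-level overhead the abstract reports.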
☆ ZeroLM: Data-Free Transformer Architecture Search for Language Models
Neural architecture search (NAS) provides a systematic framework for
automating the design of neural network architectures, yet its widespread
adoption is hindered by prohibitive computational requirements. Existing
zero-cost proxy methods, while reducing search overhead, demonstrate inadequate
performance in architecture ranking tasks, particularly for Transformer-based
models where they often underperform simple parameter counting metrics. Current
automated proxy discovery approaches suffer from extended search times,
susceptibility to data overfitting, and structural complexity. This paper
introduces a novel zero-cost proxy methodology that quantifies model capacity
through efficient weight statistics computation while decomposing Transformer
architectures into functionally distinct sub-modules, thereby optimizing the
balance of their contributions to overall performance. Our comprehensive
evaluation demonstrates the superiority of this approach, achieving a
Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark.
The proposed method exhibits exceptional computational efficiency while
maintaining robust performance across diverse NAS benchmark tasks, offering a
practical solution for large-scale architecture search.
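The two rank correlations reported above measure how well a proxy score orders architectures relative to their true performance. A minimal pure-Python sketch, assuming no tied values (the simplified Spearman formula below is only valid without ties):

```python
def ranks(xs):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Kendall's tau is typically smaller in magnitude than Spearman's rho on the same data, which is consistent with the 0.76 versus 0.53 pair quoted in the abstract.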
☆ LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment
While Large Language Models have gained attention, many service developers
still rely on embedding-based models due to practical constraints. In such
cases, the quality of fine-tuning data directly impacts performance, and
English datasets are often used as seed data for training non-English models.
In this study, we propose LANGALIGN, which enhances target language processing
by aligning English embedding vectors with those of the target language at the
interface between the language model and the task header. Experiments on
Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves
performance across all three languages. Additionally, we show that LANGALIGN
can be applied in reverse to convert target language data into a format that an
English-based model can process.
comment: now preparing
☆ LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL
Schema linking is a critical bottleneck in achieving human-level performance
in Text-to-SQL tasks, particularly in real-world large-scale multi-database
scenarios. Addressing schema linking faces two major challenges: (1) Database
Retrieval: selecting the correct database from a large schema pool in
multi-database settings, while filtering out irrelevant ones. (2) Schema Item
Grounding: accurately identifying the relevant tables and columns from within a
large and redundant schema for SQL generation. To address this, we introduce
LinkAlign, a novel framework that can effectively adapt existing baselines to
real-world environments by systematically addressing schema linking. Our
framework comprises three key steps: multi-round semantic enhanced retrieval
and irrelevant information isolation for Challenge 1, and schema extraction
enhancement for Challenge 2. We evaluate our method's schema linking
performance on the SPIDER and BIRD benchmarks, and its ability to adapt existing
Text-to-SQL models to real-world environments on the SPIDER 2.0-lite benchmark.
Experiments show that LinkAlign outperforms existing baselines in
multi-database settings, demonstrating its effectiveness and robustness. Our
method also ranks highest among models excluding those using long
chain-of-thought reasoning LLMs. This work bridges the gap between current
research and real-world scenarios, providing a practical solution for robust
and scalable schema linking. The code is available at
https://github.com/Satissss/LinkAlign.
☆ ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP
We present a novel contribution to Spanish clinical natural language
processing by introducing the largest publicly available clinical corpus,
ClinText-SP, along with a state-of-the-art clinical encoder language model,
RigoBERTa Clinical. Our corpus was meticulously curated from diverse open
sources, including clinical cases from medical journals and annotated corpora
from shared tasks, providing a rich and diverse dataset that was previously
difficult to access. RigoBERTa Clinical, developed through domain-adaptive
pretraining on this comprehensive dataset, significantly outperforms existing
models on multiple clinical NLP benchmarks. By publicly releasing both the
dataset and the model, we aim to empower the research community with robust
resources that can drive further advancements in clinical NLP and ultimately
contribute to improved healthcare applications.
☆ Dense Retrieval for Low Resource Languages -- the Case of Amharic Language
This paper reports some difficulties and some results when using dense
retrievers on Amharic, a low-resource language spoken by 120 million
people. The efforts made and difficulties faced by Addis Ababa University
toward Amharic information retrieval will be discussed during the presentation.
comment: 4 pages, 2 figures
☆ Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures
The current era of Natural Language Processing (NLP) is dominated by
Transformer models. However, novel architectures relying on recurrent
mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to
attention-based models. Although their computation differs from that of the
attention mechanism, these recurrent models yield good results and
sometimes even outperform state-of-the-art attention-based models. In this
work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM)
trained by distilling knowledge from a Large Language Model (LLM) that shows
promising results while being compute and scale efficient. Our Distil-xLSTM
focuses on approximating a transformer-based model's attention parametrization
using its recurrent sequence mixing components and shows good results with
minimal training.
☆ Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models
Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf, Peter Lewis, Girish Nadkarni, Ali Soroush
This study evaluated self-reported response certainty across several large
language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen)
using 300 gastroenterology board-style questions. The highest-performing models
(GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of
0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved
performance, all exhibited a consistent tendency towards overconfidence.
Uncertainty estimation presents a significant challenge to the safe use of LLMs
in healthcare. Keywords: Large Language Models; Confidence Elicitation;
Artificial Intelligence; Gastroenterology; Uncertainty Quantification
comment: 35 pages, 5 figures, 1 table, 7 supplementary figures
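The two calibration metrics quoted above can be computed as in this sketch, which assumes each answer comes with a self-reported confidence in [0, 1] and a 0/1 correctness label. The function names are illustrative.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(correct)

def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half."""
    pos = [p for p, y in zip(confidences, correct) if y == 1]
    neg = [p for p, y in zip(confidences, correct) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

A lower Brier score is better, and an AUROC of 0.5 means confidence carries no information about correctness, so the reported AUROC of 0.6 indicates only weak discrimination.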
☆ Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models ICME2025
Despite the significant success of Large Vision-Language Models (LVLMs), these
models still suffer from hallucinations when describing images, generating answers
that include non-existent objects. It is reported that these models tend to
over-focus on certain irrelevant image tokens that do not contain critical
information for answering the question and distort the output. To address this,
we propose an Instruction-Aligned Visual Attention (IAVA) approach, which
identifies irrelevant tokens by comparing changes in attention weights under
two different instructions. By applying contrastive decoding, we dynamically
adjust the logits generated from original image tokens and irrelevant image
tokens, reducing the model's over-attention to irrelevant information. The
experimental results demonstrate that IAVA consistently outperforms existing
decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating
object hallucinations. Our IAVA approach is available online at
https://github.com/Lee-lab558/IAVA.
comment: Accepted by ICME2025
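The logit adjustment described in the abstract above can be sketched with a generic contrastive-decoding rule. This is an assumption for illustration, not necessarily the exact IAVA formula: the adjusted logits are `(1 + alpha) * original - alpha * irrelevant`, where `alpha`, the three-word vocabulary, and the logit values are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original: List[float], irrelevant: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Boost what the full image supports and penalize what irrelevant tokens alone support."""
    return [(1 + alpha) * o - alpha * i for o, i in zip(original, irrelevant)]


# Hypothetical vocabulary and logits; "frisbee" stands for a hallucinated object
# whose score is driven mostly by the irrelevant image tokens.
vocab = ["cat", "dog", "frisbee"]
from_original_tokens = [2.0, 1.0, 2.2]
from_irrelevant_tokens = [0.5, 0.5, 2.5]

adjusted = contrastive_logits(from_original_tokens, from_irrelevant_tokens)
probs = softmax(adjusted)
print(vocab[probs.index(max(probs))])
```

With these toy numbers the top token moves from "frisbee" to "cat", because the candidate that irrelevant tokens favor is pushed down.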
☆ Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish
Ashenafi Zebene Woldaregay, Jørgen Aarmo Lund, Phuong Dinh Ngo, Mariyam Tayefi, Joel Burman, Stine Hansen, Martin Hylleholt Sillesen, Hercules Dalianis, Robert Jenssen, Lindsetmo Rolf Ole, Karl Øyvind Mikalsen
Background: Clinical natural language processing (NLP) refers to the use of
computational methods for extracting, processing, and analyzing unstructured
clinical text data, and holds a huge potential to transform healthcare in
various clinical tasks. Objective: The study aims to perform a systematic
review to comprehensively assess and analyze the state-of-the-art NLP methods
for the mainland Scandinavian clinical text. Method: A literature search was
conducted in various online databases including PubMed, ScienceDirect, Google
Scholar, ACM digital library, and IEEE Xplore between December 2022 and
February 2024. Further, relevant references to the included articles were also
used to solidify our search. The final pool includes articles that conducted
clinical NLP in the mainland Scandinavian languages and were published in
English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21)
focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish,
and 8% (n=9) focus on more than one language. Generally, the review identified
positive developments across the region despite some observable gaps and
disparities between the languages. There are substantial disparities in the
level of adoption of transformer-based models. In essential tasks such as
de-identification, there is significantly less research activity focusing on
Norwegian and Danish compared to Swedish text. Further, the review identified a
low level of sharing resources such as data, experimentation code, pre-trained
models, and rate of adaptation and transfer learning in the region. Conclusion:
The review presented a comprehensive assessment of the state-of-the-art
Clinical NLP for electronic health records (EHR) text in mainland Scandinavian
languages and highlighted the potential barriers and challenges that hinder
the rapid advancement of the field in the region.
comment: 45 pages including the appendix, 9 figures in the main manuscript and
11 figures in the Appendix
☆ SciClaims: An End-to-End Generative System for Biomedical Claim Analysis
Validating key claims in scientific literature, particularly in biomedical
research, is essential for ensuring accuracy and advancing knowledge. This
process is critical in sectors like the pharmaceutical industry, where rapid
scientific progress requires automation and deep domain expertise. However,
current solutions have significant limitations. They lack end-to-end pipelines
encompassing all claim extraction, evidence retrieval, and verification steps;
rely on complex NLP and information retrieval pipelines prone to multiple
failure points; and often fail to provide clear, user-friendly justifications
for claim verification outcomes. To address these challenges, we introduce
SciClaims, an advanced system powered by state-of-the-art large language models
(LLMs) that seamlessly integrates the entire scientific claim analysis process.
SciClaims outperforms previous approaches in both claim extraction and
verification without requiring additional fine-tuning, setting a new benchmark
for automated scientific claim analysis.
comment: Pre-print version
☆ Autoregressive Language Models for Knowledge Base Population: A case study in the space mission domain
Knowledge base population (KBP) plays a crucial role in populating knowledge
bases and keeping them up to date in organizations by leveraging domain
corpora. Motivated by the increasingly large context windows supported by large
language models, we propose to fine-tune an autoregressive language model for
end-to-end KBP. Our case study involves the population of a space mission
knowledge graph. To fine-tune the model we generate a dataset for end-to-end
KBP tapping into existing domain resources. Our case study shows that
fine-tuned language models of limited size can achieve competitive and even
higher accuracy than larger models in the KBP task. Smaller models specialized
for KBP offer affordable deployment and lower-cost inference. Moreover, KBP
specialist models do not require the ontology to be included in the prompt,
allowing for more space in the context for additional input text or output
serialization.
comment: Pre-print version
☆ Verbal Process Supervision Elicits Better Coding Agents
The emergence of large language models and their applications as AI agents
have significantly advanced state-of-the-art code generation benchmarks,
transforming modern software engineering tasks. However, even with test-time
computed reasoning models, these systems still struggle with complex software
engineering challenges. This work introduces CURA, a code understanding and
reasoning agent system enhanced with verbal process supervision (VPS),
achieving a 3.65% improvement over baseline models on challenging benchmarks
like BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and
VPS techniques, attains state-of-the-art performance. This work represents a
step forward in integrating reasoning-driven architectures with LLM-based code
generation, enabling agentic reasoning for language models to solve complex
software engineering tasks.
☆ Safeguarding Mobile GUI Agent via Logic-based Action Verification
Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin
Large Foundation Models (LFMs) have unlocked new possibilities in
human-computer interaction, particularly with the rise of mobile Graphical User
Interface (GUI) Agents capable of interpreting GUIs. These agents promise to
revolutionize mobile computing by allowing users to automate complex mobile
tasks through simple natural language instructions. However, the inherent
probabilistic nature of LFMs, coupled with the ambiguity and context-dependence
of mobile tasks, makes LFM-based automation unreliable and prone to errors. To
address this critical challenge, we introduce VeriSafe Agent (VSA): a formal
verification system that serves as a logically grounded safeguard for Mobile
GUI Agents. VSA is designed to deterministically ensure that an agent's actions
strictly align with user intent before conducting an action. At its core, VSA
introduces a novel autoformalization technique that translates natural language
user instructions into a formally verifiable specification, expressed in our
domain-specific language (DSL). This enables runtime, rule-based verification,
allowing VSA to detect and prevent erroneous actions before they are executed,
either by providing corrective feedback or halting unsafe behavior. To the best
of our knowledge, VSA is the first attempt to bring the rigor of formal
verification to GUI agents, effectively bridging the gap between LFM-driven
automation and formal software verification. We implement VSA using
off-the-shelf LLM services (GPT-4o) and evaluate its performance on 300 user
instructions across 18 widely used mobile apps. The results demonstrate that
VSA achieves 94.3%-98.33% accuracy in verifying agent actions, representing a
significant 20.4%-25.6% improvement over existing LLM-based verification
methods, and consequently increases the GUI agent's task completion rate by
90%-130%.
☆ MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Visual Question Answering (VQA) requires reasoning across visual and textual
modalities, yet Large Vision-Language Models (LVLMs) often lack integrated
commonsense knowledge, limiting their robustness in real-world scenarios. To
address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by
systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs
a three-stage process: (1) Explicit Knowledge Integration from external
sources, (2) By-Type Post-Processing for contextual refinement, and (3)
Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for
structured reasoning. GNNs bring greater depth to structured inference and
enable superior relational inference beyond LVLMs. MAGIC-VQA bridges a key
gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating
the need for extensive pre-training or complex prompt tuning. Our framework
achieves state-of-the-art performance on benchmark datasets, significantly
improving commonsense reasoning in VQA.
comment: 8 Pages, 5 figures
☆ Whispering in Amharic: Fine-tuning Whisper for Low-resource Language
Dawit Ketema Gete, Bedru Yimam Ahamed, Tadesse Destaw Belay, Yohannes Ayana Ejigu, Sukairaj Hafiz Imam, Alemu Belay Tessema, Mohammed Oumer Adem, Tadesse Amare Belay, Robert Geislinger, Umma Aliyu Musa, Martin Semmann, Shamsuddeen Hassan Muhammad, Henning Schreiber, Seid Muhie Yimam
This work explores fine-tuning OpenAI's Whisper automatic speech recognition
(ASR) model for Amharic, a low-resource language, to improve transcription
accuracy. While the foundational Whisper model struggles with Amharic due to
limited representation in its training data, we fine-tune it using datasets
like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The
best-performing model, Whispersmall-am, significantly improves when finetuned
on a mix of existing FLEURS data and new, unseen Amharic datasets. Training
solely on new data leads to poor performance, but combining it with FLEURS data
reinforces the model, enabling better specialization in Amharic. We also
demonstrate that normalizing Amharic homophones significantly improves Word
Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This study
underscores the importance of fine-tuning strategies and dataset composition
for improving ASR in low-resource languages, providing insights for future
Amharic speech recognition research.
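To illustrate how homophone normalization can affect Word Error Rate, here is a minimal sketch. The `HOMOPHONES` map is a small, incomplete illustration covering only the base forms of a few Amharic homophone letters (an assumption, not the authors' normalization table), and `wer` is a plain word-level edit distance divided by the reference length.

```python
from typing import Dict

# Illustrative, incomplete map (assumption): base-form letters only.
HOMOPHONES: Dict[str, str] = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"}


def normalize(text: str) -> str:
    """Replace each mapped homophone letter with its canonical form."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in text)


def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


reference = "ሰላም ነው"
hypothesis = "ሠላም ነው"  # same pronunciation, different spelling of the first word
print(wer(reference, hypothesis))
print(wer(normalize(reference), normalize(hypothesis)))
```

Without normalization the spelling variant counts as a substitution error; after normalization both strings are identical, so the purely orthographic mismatch no longer inflates the score.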
☆ PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Existing multilingual benchmarks for Large Vision Language Models (LVLMs)
suffer from limitations including language-specific content biases, disjointed
multimodal input formats, and a lack of safety evaluation. To address these
gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal
Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design
across 10 languages, enabling fair and accurate cross-lingual comparisons. It
includes the vision setting where text and queries are embedded in images,
requiring LVLMs to simultaneously "see", "read", and "think", aligning with
real-world applications. Additionally, PM4Bench incorporates
safety evaluations, addressing a critical oversight in existing multilingual
benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing
significant cross-linguistic performance disparities, particularly in vision
settings, and identifying OCR capability as a key determinant of these
imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .
comment: Equal contribution: Junyuan Gao, Jiahe Song, Jiang Wu; Corresponding
author: Conghui He
☆ Global-Local Tree Search for Language Guided 3D Scene Generation CVPR 2025
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable
success across various fields. However, there are few studies on 3D indoor
scene generation with VLMs. This paper considers this task as a planning
problem subject to spatial and layout common sense constraints. To solve the
problem with a VLM, we propose a new global-local tree search algorithm.
Globally, the method places each object sequentially and explores multiple
placements during each placement process, where the problem space is
represented as a tree. To reduce the depth of the tree, we decompose the scene
structure hierarchically, i.e. room level, region level, floor object level,
and supported object level. The algorithm independently generates the floor
objects in different regions and supported objects placed on different floor
objects. Locally, we also decompose the sub-task, the placement of each object,
into multiple steps. The algorithm searches the tree of problem space. To
leverage the VLM model to produce positions of objects, we discretize the
top-down view space as a dense grid and fill each cell with diverse emojis to
make the cells distinct. We prompt the VLM with the emoji grid and the VLM
produces a reasonable location for the object by describing the position with
the name of emojis. The quantitative and qualitative experimental results
illustrate that our approach generates more plausible 3D scenes than
state-of-the-art approaches. Our source code is available at
https://github.com/dw-dengwei/TreeSearchGen .
comment: Accepted by CVPR 2025
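The emoji-grid idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the emoji palette, the function names `build_grid` and `cell_center`, and the room dimensions are hypothetical, and the VLM call itself is replaced by a hard-coded answer.

```python
from typing import Dict, List, Tuple

# Hypothetical palette of (symbol, name) pairs; a denser grid needs more entries.
EMOJIS: List[Tuple[str, str]] = [
    ("🍎", "apple"), ("🍌", "banana"), ("🚗", "car"),
    ("🐶", "dog"), ("⭐", "star"), ("🌙", "moon"),
]


def build_grid(rows: int, cols: int) -> Tuple[List[List[str]], Dict[str, Tuple[int, int]]]:
    """Fill each cell with a distinct emoji and remember which name maps to which cell."""
    if rows * cols > len(EMOJIS):
        raise ValueError("not enough distinct emojis for this grid size")
    grid: List[List[str]] = []
    lookup: Dict[str, Tuple[int, int]] = {}
    for r in range(rows):
        row = []
        for c in range(cols):
            symbol, name = EMOJIS[r * cols + c]
            row.append(symbol)
            lookup[name] = (r, c)
        grid.append(row)
    return grid, lookup


def cell_center(cell: Tuple[int, int], rows: int, cols: int,
                width: float, depth: float) -> Tuple[float, float]:
    """Convert a (row, col) cell into the (x, y) center of that cell in room units."""
    r, c = cell
    return ((c + 0.5) * width / cols, (r + 0.5) * depth / rows)


grid, lookup = build_grid(rows=2, cols=3)
for row in grid:
    print(" ".join(row))          # the text grid that would be placed in the prompt

vlm_answer = "banana"             # stand-in for the emoji name returned by the VLM
print(cell_center(lookup[vlm_answer], rows=2, cols=3, width=6.0, depth=4.0))
```

Naming a cell by its emoji lets the model answer with a single word instead of numeric coordinates, which are then recovered deterministically from the lookup table.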
☆ Words as Bridges: Exploring Computational Support for Cross-Disciplinary Translation Work
Scholars often explore literature outside of their home community of study.
This exploration process is frequently hampered by field-specific jargon. Past
computational work often focuses on supporting translation work by removing
jargon through simplification and summarization; here, we explore a different
approach that preserves jargon as useful bridges to new conceptual spaces.
Specifically, we cast different scholarly domains as different language-using
communities, and explore how to adapt techniques from unsupervised
cross-lingual alignment of word embeddings to explore conceptual alignments
between domain-specific word embedding spaces. We developed a prototype
cross-domain search engine that uses aligned domain-specific embeddings to
support conceptual exploration, and tested this prototype in two case studies.
We discuss qualitative insights into the promises and pitfalls of this approach
to translation work, and suggest design insights for future interfaces that
provide computational support for cross-domain information seeking.
comment: 26 pages, 8 tables, 6 figures
☆ StableGS: A Floater-Free Framework for 3D Gaussian Splatting
Recent years have witnessed remarkable success of 3D Gaussian Splatting
(3DGS) in novel view synthesis, surpassing prior differentiable rendering
methods in both quality and efficiency. However, its training process suffers
from coupled opacity-color optimization that frequently converges to local
minima, producing floater artifacts that degrade visual fidelity. We present
StableGS, a framework that eliminates floaters through cross-view depth
consistency constraints while introducing a dual-opacity GS model to decouple
geometry and material properties of translucent objects. To further enhance
reconstruction quality in weakly-textured regions, we integrate DUSt3R depth
estimation, significantly improving geometric stability. Our method
fundamentally addresses 3DGS training instabilities, outperforming existing
state-of-the-art methods across open-source datasets.
☆ On the Perception Bottleneck of VLMs for Chart Understanding
Chart understanding requires models to effectively analyze and reason about
numerical data, textual elements, and complex visual components. Our
observations reveal that the perception capabilities of existing large
vision-language models (LVLMs) constitute a critical bottleneck in this
process. In this study, we delve into this perception bottleneck by decomposing
it into two components: the vision encoder bottleneck, where the visual
representation may fail to encapsulate the correct information, and the
extraction bottleneck, where the language model struggles to extract the
necessary information from the provided visual representations. Through
comprehensive experiments, we find that (1) the information embedded within
visual representations is substantially richer than what is typically captured
by linear extractors, such as the widely used retrieval accuracy metric; (2)
while instruction tuning effectively enhances the extraction capability of
LVLMs, the vision encoder remains a critical bottleneck, demanding focused
attention and improvement. Therefore, we further enhance the visual encoder to
mitigate the vision encoder bottleneck under a contrastive learning framework.
Empirical results demonstrate that our approach significantly mitigates the
perception bottleneck and improves the ability of LVLMs to comprehend charts.
Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
☆ Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
Junsong Li, Jie Zhou, Yutao Yang, Bihao Zhan, Qianjun Pan, Yuyang Ding, Qin Chen, Jiang Bo, Xin Lin, Liang He
Automatic math correction aims to check students' solutions to mathematical
problems via artificial intelligence technologies. Most existing studies focus
on judging the final answer at the problem level, while they ignore detailed
feedback on each step in a math problem-solving process, which requires
abilities of semantic understanding and reasoning. In this paper, we propose a
reinforcement learning (RL)-based method to boost large language models (LLMs)
for step-level automatic math correction, named StepAMC. Particularly, we
convert the step-level automatic math correction within the text classification
task into an RL problem to enhance the reasoning capabilities of LLMs. Then, we
design a space-constrained policy network to improve the stability of RL. Finally,
we introduce a fine-grained reward network to convert the binary human feedback
into a continuous value. We conduct extensive experiments over two benchmark
datasets and the results show that our model outperforms eleven strong
baselines.
☆ Solving Situation Puzzles with Large Language Model and External Reformulation
Kun Li, Xinwei Chen, Tianyou Song, Chengrui Zhou, Zhuoran Liu, Zhenyan Zhang, Jiangjian Guo, Qing Shan
In recent years, large language models (LLMs) have shown an impressive
ability to perform arithmetic and symbolic reasoning tasks. However, we found
that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires
multiple rounds of dialogue, especially when solving situation puzzles.
Specifically, LLMs tend to ask very detailed questions focusing on a specific
aspect or same/similar questions after several rounds of Q&As. To help LLMs get
out of the above dilemma, we propose a novel external reformulation
methodology, where the situation puzzle will be reformulated after several
rounds of Q&A or when the LLMs raise an incorrect guess. Experiments show
superior performance (e.g., win rate, number of question/guess attempts) of our
method compared with directly using LLMs for solving situation puzzles, highlighting the
potential of strategic problem reformulation to enhance the reasoning
capabilities of LLMs in complex interactive scenarios.
☆ J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain
Yiran Hu, Huanghai Liu, Qingjing Chen, Ning Zheng, Chong Wang, Yun Liu, Charles L. A. Clarke, Weixing Shen
As the scale and capabilities of Large Language Models (LLMs) increase, their
applications in knowledge-intensive fields such as the legal domain have garnered
widespread attention. However, it remains doubtful whether these LLMs make
judgments based on domain knowledge for reasoning. If LLMs base their judgments
solely on specific words or patterns, rather than on the underlying logic of
the language, the "LLM-as-judges" paradigm poses substantial risks in
real-world applications. To address this question, we propose a method of legal
knowledge injection attacks for robustness testing, thereby inferring whether
LLMs have learned legal knowledge and reasoning logic. In this paper, we
propose J&H: an evaluation framework for detecting the robustness of LLMs under
knowledge injection attacks in the legal domain. The aim of the framework is to
explore whether LLMs perform deductive reasoning when accomplishing legal
tasks. To further this aim, we have attacked each part of the reasoning logic
underlying these tasks (major premise, minor premise, and conclusion
generation). We have collected mistakes that legal experts might make in
judicial decisions in the real world, such as typos, legal synonyms, and inaccurate
retrieval of external legal statutes. In real legal practice, legal
experts tend to overlook these mistakes and make judgments based on logic.
However, when faced with these errors, LLMs are likely to be misled by
typographical errors and may not utilize logic in their judgments. We conducted
knowledge injection attacks on existing general and domain-specific LLMs.
Current LLMs are not robust against the attacks employed in our experiments. In
addition, we propose and compare several methods to enhance the knowledge
robustness of LLMs.
comment: 10 pages, 5 figures
☆ Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions
In the realm of Large Multi-modal Models (LMMs), the instruction quality
during the visual instruction tuning stage significantly influences the
performance of modality alignment. In this paper, we assess the instruction
quality from a unique perspective termed Writing Manner, which
encompasses the selection of vocabulary, grammar and sentence structure to
convey specific semantics. We argue that there exists a substantial writing
manner gap between the visual instructions and the base Large Language Models
(LLMs) within LMMs. This gap forces the pre-trained base LLMs to deviate from
their original writing styles, leading to capability degradation of both base
LLMs and LMMs. To bridge the writing manner gap while preserving the original
semantics, we propose directly leveraging the base LLM to align the writing
manner of soft-format visual instructions with that of the base LLM itself,
resulting in novel LLM-aligned instructions. The manual writing manner
evaluation results demonstrate that our approach successfully minimizes the
writing manner gap. By utilizing LLM-aligned instructions, the baseline models
LLaVA-7B and QwenVL demonstrate enhanced resistance to hallucinations and
non-trivial comprehensive improvements across all 15 visual and language
benchmarks.
☆ Surgical Action Planning with Large Language Models
In robot-assisted minimally invasive surgery, we introduce the Surgical
Action Planning (SAP) task, which generates future action plans from visual
inputs to address the absence of intraoperative predictive planning in current
intelligent applications. SAP shows great potential for enhancing
intraoperative guidance and automating procedures. However, it faces challenges
such as understanding instrument-action relationships and tracking surgical
progress. Large Language Models (LLMs) show promise in understanding surgical
video content but remain underexplored for predictive decision-making in SAP,
as they focus mainly on retrospective analysis. Challenges like data privacy,
computational demands, and modality-specific constraints further highlight
significant research gaps. To tackle these challenges, we introduce LLM-SAP, a
Large Language Models-based Surgical Action Planning framework that predicts
future actions and generates text responses by interpreting natural language
prompts of surgical goals. The text responses potentially support surgical
education, intraoperative decision-making, procedure documentation, and skill
analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory
Module (NHF-MM) for modeling historical states and the prompts factory for
action planning. We evaluate LLM-SAP on our constructed CholecT50-SAP dataset
using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in
next-action prediction. Pre-trained LLMs are tested zero-shot, and supervised
fine-tuning (SFT) with LoRA is implemented to address data privacy concerns.
Our experiments show that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3%
higher accuracy.
comment: 10 pages, 4 figures
☆ Fact-checking AI-generated news reports: Can LLMs catch their own lies?
In this paper, we evaluate the ability of Large Language Models (LLMs) to
assess the veracity of claims in "news reports" generated by themselves or
other LLMs. Our goal is to determine whether LLMs can effectively fact-check
their own content, using methods similar to those used to verify claims made by
humans. Our findings indicate that LLMs are more effective at assessing claims
in national or international news stories than in local news stories, better at
evaluating static information than dynamic information, and better at verifying
true claims compared to false ones. We hypothesize that this disparity arises
because the former types of claims are better represented in the training data.
Additionally, we find that incorporating retrieved results from a search engine
in a Retrieval-Augmented Generation (RAG) setting significantly reduces the
number of claims an LLM cannot assess. However, this approach also increases
the occurrence of incorrect assessments, partly due to irrelevant or
low-quality search results. This diagnostic study highlights the need for
future research on fact-checking machine-generated reports to prioritize
improving the precision and relevance of retrieved information to better
support fact-checking efforts. Furthermore, claims about dynamic events and
local news may require human-in-the-loop fact-checking systems to ensure
accuracy and reliability.
☆ When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD
In this paper, I investigate the effectiveness of dataset cartography for
extractive question answering on the SQuAD dataset. I begin by analyzing
annotation artifacts in SQuAD and evaluate the impact of two adversarial
datasets, AddSent and AddOneSent, on an ELECTRA-small model. Using training
dynamics, I partition SQuAD into easy-to-learn, ambiguous, and hard-to-learn
subsets. I then compare the performance of models trained on these subsets to
those trained on randomly selected samples of equal size. Results show that
training on cartography-based subsets does not improve generalization to the
SQuAD validation set or the AddSent adversarial set. While the hard-to-learn
subset yields a slightly higher F1 score on the AddOneSent dataset, the overall
gains are limited. These findings suggest that dataset cartography provides
little benefit for adversarial robustness in SQuAD-style QA tasks. I conclude
by comparing these results to prior findings on SNLI and discuss possible
reasons for the observed differences.
comment: 5 pages, 3 figures, 4 tables
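The partitioning step described in the abstract above can be sketched from per-epoch training dynamics. This sketch assumes the usual cartography statistics (confidence as the mean probability assigned to the gold answer across epochs, variability as its standard deviation); the fixed thresholds, example IDs, and probabilities are hypothetical, and one could instead rank examples and keep a top fraction per region.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def training_dynamics(gold_probs: List[float]) -> Tuple[float, float]:
    """Return (confidence, variability) for one example across training epochs."""
    return mean(gold_probs), pstdev(gold_probs)


def partition(per_example_probs: Dict[str, List[float]],
              conf_threshold: float = 0.5,
              var_threshold: float = 0.2) -> Dict[str, List[str]]:
    """Assign each example to a region using simple, hypothetical thresholds."""
    regions: Dict[str, List[str]] = {"easy": [], "ambiguous": [], "hard": []}
    for example_id, probs in per_example_probs.items():
        confidence, variability = training_dynamics(probs)
        if variability >= var_threshold:
            regions["ambiguous"].append(example_id)   # the model keeps changing its mind
        elif confidence >= conf_threshold:
            regions["easy"].append(example_id)        # consistently right
        else:
            regions["hard"].append(example_id)        # consistently wrong
    return regions


# Hypothetical gold-answer probabilities recorded after each of three epochs.
dynamics = {
    "q1": [0.90, 0.95, 0.99],
    "q2": [0.10, 0.90, 0.20],
    "q3": [0.05, 0.10, 0.10],
}
print(partition(dynamics))
```

Models trained on each region can then be compared against models trained on random subsets of the same size, which is the comparison the abstract reports.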
☆ Sun-Shine: A Large Language Model for Tibetan Culture
Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Tibetan, a minority language in China, features a highly intricate
grammatical structure, characterized by four verb tenses and a tense system
with frequent irregularities, contributing to its extensive inflectional
diversity. Recently, advances in Large Language Models (LLMs) have transformed
the paradigm in many domains. Despite the success in other fields, current LLMs
often fall short in catering to the needs of domain experts like Tibetans, and
the potential of LLMs for Tibetan culture is under-explored. The intrinsic
reasons are the immense and intricate nature of Tibetan culture as well as the
necessity for higher granularity and richness in knowledge. Simultaneously, the
complexity and uniqueness of its grammatical structure, coupled with its status
as a minority ethnic language, contribute to data scarcity, which remains a
fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine
(Sun-Shine), the first large language model for Tibetan culture, which is
expert in various Tibetan language processing tasks. Sun-Shine incorporates
state-of-the-art model architectures optimized for Tibetan's linguistic
features. We also propose TIB-STC, a comprehensive dataset comprising diverse
Tibetan texts such as literature, religious scripts, news, and conversational
data, which is also the first large-scale dataset for Tibetan culture. Through
comprehensive experiments, Sun-Shine not only demonstrates a higher level of
knowledge expertise for Tibetan culture but also gains preliminary embodied
intelligence capabilities in Tibetan language processing tasks, like language
modeling, text classification, machine translation, and syntactic analysis.
Moreover, it excels in low-resource scenarios, showcasing strong generalization
capabilities.
☆ Bridging Emotions and Architecture: Sentiment Analysis in Modern Distributed Systems
Sentiment analysis is a field within NLP that has gained importance because
it is applied in various areas such as social media surveillance, customer
feedback evaluation and market research. At the same time, distributed systems
allow for effective processing of large amounts of data. Therefore, this paper
examines how sentiment analysis converges with distributed systems by
concentrating on different approaches, challenges and future investigations.
Furthermore, we conduct an extensive experiment where we train sentiment analysis
models using both single node configuration and distributed architecture to
bring out the benefits and shortcomings of each method in terms of performance
and accuracy.
comment: IEEE 3rd International Conference on Advancements in Smart, Secure
and Intelligent Computing (ASSIC)
☆ Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages
Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Grigori Sidorov, Seid Muhie Yimam
In this digital world, people freely express their emotions using different
social media platforms. As a result, modeling and integrating
emotion-understanding models are vital for various human-computer interaction
tasks such as decision-making, product and customer feedback analysis,
political promotions, marketing research, and social media monitoring. As users
express different emotions simultaneously in a single instance, annotating
emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025)
dataset effectively captures this dynamic. Additionally, incorporating
intensity, or the degree of emotion, is crucial, as emotions can significantly
differ in their expressive strength and impact. This intensity is significant
for assessing whether further action is necessary in decision-making processes,
especially concerning negative emotions in applications such as healthcare and
mental health studies. To enhance the EthioEmo dataset, we include annotations
for the intensity of each labeled emotion. Furthermore, we evaluate various
state-of-the-art encoder-only Pretrained Language Models (PLMs) and
decoder-only Large Language Models (LLMs) to provide comprehensive
benchmarking.
☆ PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment
Transfer learning leverages the abundance of English data to address the
scarcity of resources in modeling non-English languages, such as Korean. In
this study, we explore the potential of Phrase Aligned Data (PAD) from
standardized Statistical Machine Translation (SMT) to enhance the efficiency of
transfer learning. Through extensive experiments, we demonstrate that PAD
synergizes effectively with the syntactic characteristics of the Korean
language, mitigating the weaknesses of SMT and significantly improving model
performance. Moreover, we reveal that PAD complements traditional data
construction methods and enhances their effectiveness when combined. This
innovative approach not only boosts model performance but also suggests a
cost-efficient solution for resource-scarce languages.
comment: Preparing for conference
☆ AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text
Tadesse Destaw Belay, Israel Abebe Azime, Ibrahim Said Ahmad, Idris Abdulmumin, Abinew Ali Ayele, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Pretrained Language Models (PLMs) built from various sources are the
foundation of today's NLP progress. Language representations learned by such
models achieve strong performance across many tasks with datasets of varying
sizes drawn from various sources. We present a thorough analysis of domain- and
task-adaptive continual pretraining approaches for low-resource African
languages and show promising results on the evaluated tasks. We create
AfriSocial, a corpus designed for domain adaptive finetuning that passes
through quality pre-processing steps. Continually pretraining PLMs using
AfriSocial as domain adaptive pretraining (DAPT) data consistently improves
performance on the fine-grained emotion classification task for 16 targeted
languages, with gains ranging from 1% to 28.27% macro F1 score. Likewise, using
the task adaptive pretraining (TAPT) approach, further finetuning with small
unlabeled but similar task data shows promising results. For example, unlabeled
sentiment data (source) for the fine-grained emotion classification task
(target) improves the base model results by an F1 score ranging from 0.55% to
15.11%. Combining the two methods, DAPT + TAPT, also achieves better results
than base models. All
the resources will be available to improve low-resource NLP tasks, generally,
as well as other similar domain tasks such as hate speech and sentiment tasks.
♻ ☆ Large Language Models Empowered Personalized Web Agents WWW 2025
Web agents have emerged as a promising direction to automate Web task
completion based on user instructions, significantly enhancing user experience.
Recently, Web agents have evolved from traditional agents to Large Language
Model (LLM)-based Web agents. Despite their success, existing LLM-based Web
agents overlook the importance of personalized data (e.g., user profiles and
historical Web behaviors) in assisting the understanding of users' personalized
instructions and executing customized actions. To overcome this limitation, we
first formulate the task of LLM-empowered personalized Web agents, which
integrate personalized data and user instructions to personalize instruction
comprehension and action execution. To address the absence of a comprehensive
evaluation benchmark, we construct a Personalized Web Agent Benchmark
(PersonalWAB), featuring user instructions, personalized user data, Web
functions, and two evaluation paradigms across three personalized Web tasks.
Moreover, we propose a Personalized User Memory-enhanced Alignment (PUMA)
framework to adapt LLMs to the personalized Web agent task. PUMA utilizes a
memory bank with a task-specific retrieval strategy to filter relevant
historical Web behaviors. Based on the behaviors, PUMA then aligns LLMs for
personalized action execution through fine-tuning and direct preference
optimization. Extensive experiments validate the superiority of PUMA over
existing Web agents on PersonalWAB.
comment: Accepted to WWW 2025. The code and data are available on the project
website https://hongrucai.github.io/PersonalWAB/
♻ ☆ GroundCap: A Visually Grounded Image Captioning Dataset
Current image captioning systems lack the ability to link descriptive text to
specific visual elements, making their outputs difficult to verify. While
recent approaches offer some grounding capabilities, they cannot track object
identities across multiple references or ground both actions and objects
simultaneously. We propose a novel ID-based grounding system that enables
consistent object reference tracking and action-object linking, and present
GroundCap, a dataset containing 52,016 images from 77 movies, with 344
human-annotated and 52,016 automatically generated captions. Each caption is
grounded on detected objects (132 classes) and actions (51 classes) using a tag
system that maintains object identity while linking actions to the
corresponding objects. Our approach features persistent object IDs for
reference tracking, explicit action-object linking, and segmentation of
background elements through K-means clustering. We propose gMETEOR, a metric
combining caption quality with grounding accuracy, and establish baseline
performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our
approach's effectiveness in producing verifiable descriptions with coherent
object references.
comment: 37 pages
♻ ☆ Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment ICLR 2025
Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna
Many real-world user queries (e.g. "How to make egg fried rice?") could
benefit from systems capable of generating responses with both textual steps
and accompanying images, similar to a cookbook. Models designed to generate
interleaved text and images face challenges in ensuring consistency within and
across these modalities. To address these challenges, we present ISG, a
comprehensive evaluation framework for interleaved text-and-image generation.
ISG leverages a scene graph structure to capture relationships between text and
image blocks, evaluating responses on four levels of granularity: holistic,
structural, block-level, and image-specific. This multi-tiered evaluation
allows for a nuanced assessment of consistency, coherence, and accuracy, and
provides interpretable question-answer feedback. In conjunction with ISG, we
introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8
categories and 21 subcategories. This benchmark dataset includes complex
language-vision dependencies and golden answers to evaluate models effectively
on vision-centric tasks such as style transfer, a challenging area for current
models. Using ISG-Bench, we demonstrate that recent unified vision-language
models perform poorly on generating interleaved content. While compositional
approaches that combine separate language and image models show a 111%
improvement over unified models at the holistic level, their performance
remains suboptimal at both block and image levels. To facilitate future work,
we develop ISG-Agent, a baseline agent employing a "plan-execute-refine"
pipeline to invoke tools, achieving a 122% performance improvement.
comment: Accepted by ICLR 2025 as Spotlight. Project homepage:
https://interleave-eval.github.io/
♻ ☆ HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Transformers have become the de facto architecture for a wide range of
machine learning tasks, particularly in large language models (LLMs). Despite
their remarkable performance, challenges remain in training deep transformer
networks, especially regarding the location of layer normalization. While
Pre-Norm structures facilitate easier training due to their more prominent
identity path, they often yield suboptimal performance compared to Post-Norm.
In this paper, we propose HybridNorm, a straightforward yet
effective hybrid normalization strategy that integrates the advantages of both
Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV
normalization within the attention mechanism and Post-Norm in the feed-forward
network (FFN) of each transformer block. This design not only stabilizes
training but also enhances performance, particularly in the context of LLMs.
Comprehensive experiments in both dense and sparse architectures show that
HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches,
achieving state-of-the-art results across various benchmarks. These findings
highlight the potential of HybridNorm as a more stable and effective technique
for improving the training and performance of deep transformer models. Code is
available at https://github.com/BryceZhuo/HybridNorm.
♻ ☆ Federated Incremental Named Entity Recognition
Federated Named Entity Recognition (FNER) boosts model training within each
local client by aggregating the model updates of decentralized local clients,
without sharing their private data. However, existing FNER methods assume fixed
entity types and local clients in advance, leading to their ineffectiveness in
practical applications. In a more realistic scenario, local clients receive new
entity types continuously, while new local clients collecting novel data may
irregularly join the global FNER training. This challenging setup, referred to
here as Federated Incremental NER, causes the global model to suffer from
heterogeneous forgetting of old entity types from both intra-client and
inter-client perspectives. To overcome these challenges, we propose a
Local-Global Forgetting Defense (LGFD) model. Specifically, to address
intra-client forgetting, we develop a structural knowledge distillation loss to
retain the latent space's feature structure and a pseudo-label-guided
inter-type contrastive loss to enhance discriminative capability over different
entity types, effectively preserving previously learned knowledge within local
clients. To tackle inter-client forgetting, we propose a task switching monitor
that can automatically identify new entity types under privacy protection and
store the latest old global model for knowledge distillation and
pseudo-labeling. Experiments demonstrate significant improvement of our LGFD
model over comparison methods.
comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language
Processing
♻ ☆ Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings
Appraisal theories suggest that emotions arise from subjective evaluations of
events, referred to as appraisals. The taxonomy of appraisals is quite diverse,
and they are usually given ratings on a Likert scale to be annotated in an
experiencer-annotator or reader-annotator paradigm. This paper studies GPT-4 as
a reader-annotator of 21 specific appraisal ratings in different prompt
settings, aiming to evaluate and improve its performance compared to human
annotators. We found that GPT-4 is an effective reader-annotator that performs
close to or even slightly better than human annotators, and its results can be
significantly improved by using a majority voting of five completions. GPT-4
also effectively predicts appraisal ratings and emotion labels using a single
prompt, but adding instruction complexity results in poorer performance. We
also found that longer event descriptions lead to more accurate annotations for
both model and human annotator ratings. This work contributes to the growing
usage of LLMs in psychology and the strategies for improving GPT-4 performance
in annotating appraisals.
comment: CLPsych 2025
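The majority voting over five completions mentioned in the abstract above can
be illustrated with a minimal sketch. The function name, the sample ratings, and
the tie-breaking rule (lowest rating wins) are assumptions made for
illustration; the abstract does not state how ties were resolved or how ratings
were parsed.

```python
from collections import Counter


def majority_vote(ratings):
    """Return the most frequent rating among several completions.

    Ties are resolved by taking the lowest tied rating. This tie rule is
    an assumption for the sketch, not something stated in the abstract.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(rating for rating, count in counts.items() if count == top)


# Hypothetical Likert ratings (1-5) from five completions for one appraisal.
completions = [4, 5, 4, 3, 4]
print(majority_vote(completions))  # prints 4
```

Voting over several completions smooths out the variance of any single sampled
rating, which is one plausible reason the aggregated result is more reliable.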
♻ ☆ Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching CVPR2025
Formula recognition presents significant challenges due to the complicated
structure and varied notation of mathematical expressions. Despite continuous
advancements in formula recognition models, the evaluation metrics employed by
these models, such as BLEU and Edit Distance, still exhibit notable
limitations. They overlook the fact that the same formula has diverse
representations, and they are highly sensitive to the distribution of training data,
thereby causing unfairness in formula recognition evaluation. To this end, we
propose a Character Detection Matching (CDM) metric, ensuring the evaluation
objectivity by designing an image-level rather than a LaTeX-level metric score.
Specifically, CDM renders both the model-predicted LaTeX and the ground-truth
LaTeX formulas into image-formatted formulas, then employs visual feature
extraction and localization techniques for precise character-level matching,
incorporating spatial position information. Such a spatially-aware and
character-matching method offers a more accurate and equitable evaluation
compared with previous BLEU and Edit Distance metrics that rely solely on
text-based character matching. Experimentally, we evaluated various formula
recognition models using CDM, BLEU, and ExpRate metrics. The results
demonstrate that the CDM aligns more closely with human evaluation standards
and provides a fairer comparison across different models by eliminating
discrepancies caused by diverse formula representations. Code is available at
https://github.com/opendatalab/UniMERNet/tree/main/cdm.
comment: Accepted by CVPR2025
♻ ☆ How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov
The performance of Large Language Models (LLMs) on many tasks is greatly
limited by the knowledge learned during pre-training and stored in the model's
parameters. Low-rank adaptation (LoRA) is a popular and efficient training
technique for updating LLMs or adapting them to specific domains. In this study, we
investigate how new facts can be incorporated into the LLM using LoRA without
compromising the previously learned knowledge. We fine-tuned
Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our
experiments have shown that the best results are obtained when the training
data contains a mixture of known and new facts. However, this approach is still
potentially harmful because the model's performance on external
question-answering benchmarks declines after such fine-tuning. When the
training data is biased towards certain entities, the model tends to regress to
a few overrepresented answers. In addition, we found that the model becomes more
confident and refuses to provide an answer in only a few cases. These findings
highlight the potential pitfalls of LoRA-based LLM updates and underscore the
importance of training data composition and tuning parameters to balance new
knowledge integration and general model capabilities.
♻ ☆ AutoTRIZ: Automating Engineering Innovation with TRIZ and Large Language Models
Various ideation methods, such as morphological analysis and
design-by-analogy, have been developed to aid creative problem-solving and
innovation. Among them, the Theory of Inventive Problem Solving (TRIZ) stands
out as one of the best-known methods. However, the complexity of TRIZ and its
reliance on users' knowledge, experience, and reasoning capabilities limit its
practicality. To address this, we introduce AutoTRIZ, an artificial ideation
system that integrates Large Language Models (LLMs) to automate and enhance the
TRIZ methodology. By leveraging LLMs' vast pre-trained knowledge and advanced
reasoning capabilities, AutoTRIZ offers a novel, generative, and interpretable
approach to engineering innovation. AutoTRIZ takes a problem statement from the
user as its initial input, automatically conducts the TRIZ reasoning process, and
generates a structured solution report. We demonstrate and evaluate the
effectiveness of AutoTRIZ through comparative experiments with textbook cases
and a real-world application in the design of a Battery Thermal Management
System (BTMS). Moreover, the proposed LLM-based framework holds the potential
for extension to automate other knowledge-based ideation methods, such as
SCAMPER, Design Heuristics, and Design-by-Analogy, paving the way for a new era
of AI-driven innovation tools.
comment: 28 pages, 12 figures
♻ ☆ GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding ICLR 2025
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
Recently, Multimodal Large Language Models (MLLMs) have been used as agents
to control keyboard and mouse inputs by directly perceiving the Graphical User
Interface (GUI) and generating corresponding commands. However, current agents
primarily demonstrate strong understanding capabilities in static environments
and are mainly applied to relatively simple domains, such as Web or mobile
interfaces. We argue that a robust GUI agent should be capable of perceiving
temporal information on the GUI, including dynamic Web content and multi-step
tasks. Additionally, it should possess a comprehensive understanding of various
GUI scenarios, including desktop software and multi-window interactions. To
this end, this paper introduces a new dataset, termed GUI-World, which features
meticulously crafted Human-MLLM annotations, extensively covering six GUI
scenarios and eight types of GUI-oriented questions in three formats. We
evaluate the capabilities of current state-of-the-art MLLMs, including Image
LLMs and Video LLMs, in understanding various types of GUI content, especially
dynamic and sequential content. Our findings reveal that current models
struggle with dynamic GUI content without manually annotated keyframes or
operation history. On the other hand, Video LLMs fall short in all GUI-oriented
tasks given the sparse GUI video dataset. Therefore, we take the initial step
of leveraging a fine-tuned Video LLM, GUI-Vid, as a GUI-oriented assistant,
demonstrating an improved understanding of various GUI tasks. However, due to
the limitations in the performance of base LLMs, we conclude that using video
LLMs as GUI agents remains a significant challenge. We believe our work
provides valuable insights for future research in dynamic GUI content
understanding. The dataset and code are publicly available at:
https://gui-world.github.io.
comment: Accepted by ICLR 2025
♻ ☆ ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
Recent work shows Large Language Models (LLMs) struggle to understand natural
language constraints for various text generation tasks in zero- and few-shot
settings. In the code domain, meanwhile, constraints in code format are widely
used to maintain the integrity of code written in Domain-Specific Languages
(DSLs) like JSON and YAML, which are widely used for system-level programming
tasks in enterprises. Given that LLMs are increasingly used for system-level
code tasks, evaluating if they can comprehend these code constraints is
crucial. However, no work has been done to evaluate their controllability over
code constraints. Hence, we introduce ConCodeEval, a first-of-its-kind
benchmark with two novel tasks for code constraints across five
representations. Our findings suggest that language models struggle with code
constraints. Code languages that perform excellently for normal code tasks do
not perform well when the same languages represent fine-grained constraints.
♻ ☆ Toward a method for LLM-enabled Indoor Navigation
Alberto Coffrini, Mohammad Amin Zadenoori, Paolo Barsocchi, Francesco Furfari, Antonino Crivello, Alessio Ferrari
Indoor navigation presents unique challenges due to complex layouts, lack of
GPS signals, and accessibility concerns. Existing solutions often struggle with
real-time adaptability and user-specific needs. In this work, we explore the
potential of a Large Language Model (LLM), i.e., ChatGPT, to generate natural,
context-aware navigation instructions from indoor map images. We design and
evaluate test cases across different real-world environments, analyzing the
effectiveness of LLMs in interpreting spatial layouts, handling user
constraints, and planning efficient routes. Our findings demonstrate the
potential of LLMs for supporting personalized indoor navigation, with an
average of 50.54% correct indications and a maximum of 77.78%. The results do
not appear to depend on the complexity of the layout or the complexity of the
expected path, but rather on the number of points of interest and the abundance
of visual information, which negatively affect the performance.
comment: 7 pages, 3 figures, 5 tables
♻ ☆ LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan
Large Language Models (LLMs) have transformed the natural language processing
landscape and brought to life diverse applications. Pretraining on vast
web-scale data has laid the foundation for these models, yet the research
community is now increasingly shifting focus toward post-training techniques to
achieve further breakthroughs. While pretraining provides a broad linguistic
foundation, post-training methods enable LLMs to refine their knowledge,
improve reasoning, enhance factual accuracy, and align more effectively with
user intents and ethical considerations. Fine-tuning, reinforcement learning,
and test-time scaling have emerged as critical strategies for optimizing LLM
performance, ensuring robustness, and improving adaptability across various
real-world tasks. This survey provides a systematic exploration of
post-training methodologies, analyzing their role in refining LLMs beyond
pretraining, addressing key challenges such as catastrophic forgetting, reward
hacking, and inference-time trade-offs. We highlight emerging directions in
model alignment, scalable adaptation, and inference-time reasoning, and outline
future research directions. We also provide a public repository to continually
track developments in this fast-evolving field:
https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
comment: 32 pages, 7 figures, 3 tables, 377 references. Github Repo:
https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
♻ ☆ ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain
Publicly available information contains valuable information for Cyber Threat
Intelligence (CTI). This can be used to prevent attacks that have already taken
place on other systems. Ideally, only the initial attack succeeds and all
subsequent ones are detected and stopped. But while there are different
standards to exchange this information, a lot of it is shared in articles or
blog posts in non-standardized ways. Manually scanning through multiple online
portals and news pages to discover new threats and extract them is a
time-consuming task. To automate parts of this scanning process, multiple
papers propose extractors that use Natural Language Processing (NLP) to extract
Indicators of Compromise (IOCs) from documents. However, while this already
solves the problem of extracting the information out of documents, the search
for these documents is rarely considered. In this paper, a new focused crawler
is proposed called ThreatCrawl, which uses Bidirectional Encoder
Representations from Transformers (BERT)-based models to classify documents and
adapt its crawling path dynamically. While ThreatCrawl has difficulty
classifying the specific type of Open Source Intelligence (OSINT) named in texts,
e.g., IOC content, it can successfully find relevant documents and modify its
path accordingly. It yields harvest rates of up to 52%, which are, to the best
of our knowledge, better than the current state of the art. The results and
source code will be made publicly available upon acceptance.
comment: 11 pages, 9 figures, 5 tables
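For readers unfamiliar with the harvest rate reported above, the sketch below
uses the usual focused-crawling definition: the fraction of crawled pages that
are judged relevant. The function name and the sample crawl are hypothetical,
and the abstract does not confirm that ThreatCrawl computes the metric in
exactly this way.

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages judged relevant (0.0 for an empty crawl).

    relevance_flags holds one boolean per crawled page. In a real crawler
    these would come from a classifier; here they are made-up values.
    """
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


# Hypothetical crawl: 13 relevant pages out of 25 fetched.
crawl = [True] * 13 + [False] * 12
print(f"{harvest_rate(crawl):.0%}")  # prints 52%
```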
♻ ☆ Human-like conceptual representations emerge from language prediction
People acquire concepts through rich physical and social experiences and use
them to understand the world. In contrast, large language models (LLMs),
trained exclusively through next-token prediction over language data, exhibit
remarkably human-like behaviors. Are these models developing concepts akin to
humans, and if so, how are such concepts represented and organized? To address
these questions, we reframed the classic reverse dictionary task to simulate
human concept inference in context and investigated the emergence of human-like
conceptual representations within LLMs. Our results demonstrate that LLMs can
flexibly derive concepts from linguistic descriptions in relation to contextual
cues about other concepts. The derived representations converged towards a
shared, context-independent structure that effectively predicted human behavior
across key psychological phenomena, including computation of similarities,
categories and semantic scales. Moreover, these representations aligned well
with neural activity patterns in the human brain, even in response to visual
rather than linguistic stimuli, providing evidence for biological plausibility.
These findings establish that structured, human-like conceptual representations
can naturally emerge from language prediction without real-world grounding.
More broadly, our work positions LLMs as promising computational tools for
understanding complex human cognition and paves the way for better alignment
between artificial and human intelligence.
comment: 51 pages
♻ ☆ Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention
In multimodal sentiment analysis, collecting text data is often more
challenging than video or audio due to higher annotation costs and inconsistent
automatic speech recognition (ASR) quality. To address this challenge, our
study has developed a robust model that effectively integrates multimodal
sentiment information, even in the absence of text modality. Specifically, we
have developed a Double-Flow Self-Distillation Framework, including Unified
Modality Cross-Attention (UMCA) and Modality Imagination Autoencoder (MIA),
which excels at processing both scenarios with complete modalities and those
with missing text modality. In detail, when the text modality is missing, our
framework uses the LLM-based model to simulate the text representation from the
audio modality, while the MIA module supplements information from the other two
modalities to make the simulated text representation similar to the real text
representation. To further align the simulated and real representations, and to
enable the model to capture the continuous nature of sample orders in sentiment
valence regression tasks, we have also introduced the Rank-N Contrast (RNC)
loss function. When tested on CMU-MOSEI, our model achieved outstanding
performance on MAE and significantly outperformed other models when the text
modality is missing. The code is available at:
https://github.com/WarmCongee/SDUMC
♻ ☆ JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation CVPR 2025
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
We present JanusFlow, a powerful framework that unifies image understanding
and generation in a single model. JanusFlow introduces a minimalist
architecture that integrates autoregressive language models with rectified
flow, a state-of-the-art method in generative modeling. Our key finding
demonstrates that rectified flow can be straightforwardly trained within the
large language model framework, eliminating the need for complex architectural
modifications. To further improve the performance of our unified model, we
adopt two key strategies: (i) decoupling the understanding and generation
encoders, and (ii) aligning their representations during unified training.
Extensive experiments show that JanusFlow achieves comparable or superior
performance to specialized models in their respective domains, while
significantly outperforming existing unified approaches across standard
benchmarks. This work represents a step toward more efficient and versatile
vision-language models.
comment: Accepted by CVPR 2025
♻ ☆ CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Causal Significance and Consistency
Chain-based reasoning methods like chain of thought (CoT) play a rising role
in solving reasoning tasks for large language models (LLMs). However, the
causal hallucinations between a step of reasoning and corresponding state
transitions are becoming a significant obstacle to advancing LLMs' reasoning
capabilities, especially in long-range reasoning tasks. This paper proposes a
non-chain-based reasoning framework for simultaneous consideration of causal
significance and consistency, i.e., the Causal Significance and Consistency
Enhancer (CSCE). We customize the LLM's loss function utilizing treatment effect
assessments to enhance its reasoning ability from two aspects: causal
significance and consistency. This ensures that the model captures essential
causal relationships and maintains robust and consistent performance across
various scenarios. Additionally, we transform the reasoning process from the
cascading multiple one-step reasoning commonly used in Chain-Based methods,
like CoT, to a causal-enhanced method that outputs the entire reasoning process
in one go, further improving the model's reasoning efficiency. Extensive
experiments show that our method improves both the reasoning success rate and
speed. These improvements further demonstrate that non-chain-based methods can
also aid LLMs in completing reasoning tasks.
comment: 6 pages, 4 figures. This paper has been accepted for presentation at
IEEE International Conference on Multimedia & Expo 2025
♻ ☆ TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition WWW'25
Rina Carines Cabral, Soyeon Caren Han, Areej Alhassan, Riza Batista-Navarro, Goran Nenadic, Josiah Poon
Discontinuous Named Entity Recognition (DNER) presents a challenging problem
where entities may be scattered across multiple non-adjacent tokens, making
traditional sequence labelling approaches inadequate. Existing methods
predominantly rely on custom tagging schemes to handle these discontinuous
entities, resulting in models tightly coupled to specific tagging strategies
and lacking generalisability across diverse datasets. To address these
challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces
a generalisable approach to learning robust token-level representations for
discontinuous entity extraction. Our framework applies triplet loss at the
token level, where similarity is defined by word pairs existing within the same
entity, effectively pulling together similar and pushing apart dissimilar ones.
This approach enhances entity boundary detection and reduces the dependency on
specific tagging schemes by focusing on word-pair relationships within a
flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets
and demonstrate significant improvements over existing grid-based
architectures. These results underscore our framework's effectiveness in
capturing complex entity structures and its adaptability to various tagging
schemes, setting a new benchmark for discontinuous entity extraction.
comment: Accepted at The ACM Web Conference WWW'25. Code available at
https://github.com/adlnlp/trig_ner
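To illustrate the token-level triplet loss described in the abstract above, here is a minimal sketch in plain Python. It uses the standard margin-based triplet loss with Euclidean distance; the margin value, the toy embeddings, and the function names are assumptions for illustration and are not taken from the TriG-NER code base.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D token embeddings. "muscle" and "pain" are assumed to
# belong to the same (discontinuous) entity; "and" is outside it.
emb = {
    "muscle": [1.0, 0.0],
    "pain": [0.8, 0.1],
    "and": [-1.0, 0.5],
}

# Same-entity token is already closer than the outside token: no penalty.
loss_good = triplet_loss(emb["muscle"], emb["pain"], emb["and"])
# Roles swapped: the "positive" is far away, so the loss is large.
loss_bad = triplet_loss(emb["muscle"], emb["and"], emb["pain"])
```

Minimizing this quantity over many (anchor, positive, negative) token triplets pulls same-entity tokens together and pushes other tokens apart, which is the behavior the abstract describes.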
♻ ☆ EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Large Language Models (LLMs) have shown impressive reasoning capabilities in
well-defined problems with clear solutions, such as mathematics and coding.
However, they still struggle with complex real-world scenarios like business
negotiations, which require strategic reasoning, an ability to navigate dynamic
environments and align long-term goals amidst uncertainty. Existing methods for
strategic reasoning face challenges in adaptability, scalability, and
transferring strategies to new contexts. To address these issues, we propose
explicit policy optimization (EPO) for strategic reasoning, featuring an LLM
that provides strategies in open-ended action space and can be plugged into
arbitrary LLM agents to motivate goal-directed behavior. To improve
adaptability and policy transferability, we train the strategic reasoning model
via multi-turn reinforcement learning (RL) using process rewards and iterative
self-play, without supervised fine-tuning (SFT) as a preliminary step.
Experiments across social and physical domains demonstrate EPO's capacity for
long-term goal alignment through enhanced strategic reasoning, achieving
state-of-the-art performance on social dialogue and web navigation tasks. Our
findings reveal various collaborative reasoning mechanisms emergent in EPO and
its effectiveness in generating novel strategies, underscoring its potential
for strategic reasoning in real-world applications.
comment: 22 pages, 4 figures
♻ ☆ Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment NAACL 2025
Allophony refers to the variation in the phonetic realization of a phoneme
based on its phonetic environment. Modeling allophones is crucial for atypical
pronunciation assessment, which involves distinguishing atypical from typical
pronunciations. However, recent phoneme classifier-based approaches often
simplify this by treating various realizations as a single phoneme, bypassing
the complexity of modeling allophonic variation. Motivated by the acoustic
modeling capabilities of frozen self-supervised speech model (S3M) features, we
propose MixGoP, a novel approach that leverages Gaussian mixture models to
model phoneme distributions with multiple subclusters. Our experiments show
that MixGoP achieves state-of-the-art performance across four out of five
datasets, including dysarthric and non-native speech. Our analysis further
suggests that S3M features capture allophonic variation more effectively than
MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP
with S3M features.
comment: Accepted to NAACL 2025. Codebase available at
https://github.com/juice500ml/acoustic-units-for-ood
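As a rough sketch of modeling a phoneme with multiple Gaussian subclusters, the snippet below scores a feature vector by its log-likelihood under a diagonal-covariance Gaussian mixture. The mixture parameters are hand-set toy values (a real system would fit them on S3M features), and using the raw log-likelihood as a pronunciation score is an assumption for illustration, not necessarily the exact MixGoP formulation.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a mixture, computed with log-sum-exp."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


# Hypothetical phoneme model with two subclusters (e.g., two allophones).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

typical_score = gmm_log_likelihood([2.9, 3.1], weights, means, variances)
atypical_score = gmm_log_likelihood([10.0, -10.0], weights, means, variances)
```

A feature vector near either subcluster receives a high score, whereas a single-Gaussian model would place its mean between the two allophones and score both realizations poorly.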
♻ ☆ A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet
their transition to real-world applications reveals a critical limitation: the
inability to adapt to individual preferences while maintaining alignment with
universal human values. Current alignment techniques adopt a one-size-fits-all
approach that fails to accommodate users' diverse backgrounds and needs. This
paper presents the first comprehensive survey of personalized alignment, a
paradigm that enables LLMs to adapt their behavior within ethical boundaries
based on individual preferences. We propose a unified framework comprising
preference memory management, personalized generation, and feedback-based
alignment, systematically analyzing implementation approaches and evaluating
their effectiveness across various scenarios. By examining current techniques,
potential risks, and future challenges, this survey provides a structured
foundation for developing more adaptable and ethically-aligned LLMs.
comment: 9 pages
♻ ☆ Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning
Supervised contrastive learning has been explored in making use of label
information for multi-label classification, but determining positive samples in
multi-label scenarios remains challenging. Previous studies have examined
strategies for identifying positive samples, considering label overlap
proportion between anchors and samples. However, they ignore various relations
between given anchors and samples, as well as how to dynamically adjust the
weights in contrastive loss functions based on different relations, leading to
great ambiguity. In this paper, we introduce five distinct relations between
multi-label samples and propose a Similarity-Dissimilarity Loss with
contrastive learning for multi-label classification. Our loss function
re-weights the loss by computing the similarity and dissimilarity between
positive samples and a given anchor based on the introduced relations. We
mainly conduct experiments for multi-label text classification on MIMIC
datasets, then further extend the evaluation to MS-COCO. The experimental
results show that our proposed loss effectively improves the performance on all
encoders under the supervised contrastive learning paradigm, demonstrating its
effectiveness and robustness.
♻ ☆ R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Recent studies have combined Large Language Models (LLMs) with Knowledge
Graphs (KGs) to enhance reasoning, improving inference accuracy without
additional training while mitigating hallucination. However, existing
frameworks are often rigid, struggling to adapt to KG or task changes. They
also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
separates reasoning into two roles: an Operator (a low-capacity LLM) that
gathers evidence and a Supervisor (a high-capacity LLM) that makes final
judgments. This design is cost-efficient for LLM inference while still
maintaining strong reasoning accuracy. Additionally, R2-KG employs an
Abstention mechanism, generating answers only when sufficient evidence is
collected from the KG, which significantly enhances reliability. Experiments across
multiple KG-based reasoning tasks show that R2-KG consistently outperforms
baselines in both accuracy and reliability, regardless of the inherent
capability of LLMs used as the Operator. Further experiments reveal that the
single-agent version of R2-KG, equipped with a strict self-consistency
strategy, achieves significantly higher-than-baseline reliability while
reducing inference cost. However, it also leads to a higher abstention rate in
complex KGs. Our findings establish R2-KG as a flexible and cost-effective
solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
while ensuring trustworthy inference.
♻ ☆ Design and Implementation of an FPGA-Based Tiled Matrix Multiplication Accelerator for Transformer Self-Attention on the Xilinx KV260 SoM
Transformer-based large language models (LLMs) rely heavily on intensive
matrix multiplications for attention and feed-forward layers, with the Q, K,
and V linear projections in the Multi-Head Self-Attention (MHA) module
constituting a decisive performance bottleneck. In this work, we introduce a
highly optimized tiled matrix multiplication accelerator on a
resource-constrained Xilinx KV260 FPGA that not only addresses this challenge
but sets a new standard for efficiency and performance. Our design exploits
persistent on-chip storage, a robust two-level tiling strategy for maximal data
reuse, and a systolic-like unrolled compute engine that together deliver
unparalleled speed and energy efficiency. Integrated with DistilBERT for Q, K,
and V projections, our accelerator achieves an unequivocal 7x speedup over ARM
CPU implementations (PyTorch) and an extraordinary 200x improvement over naive
NumPy, reaching a throughput of up to 3.1 GFLOPs for matrix multiplications on
(64,768) x (768,3072) matrices while operating at a conservative 100 MHz. These
results decisively demonstrate the transformative potential of FPGA-based
acceleration for critical Transformer operations, paving the way for scalable
and energy-efficient deep learning inference on edge devices.
comment: 7 pages, 4 figures, 2 tables. Prepared in ACM conference style.
Preprint under review
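For readers unfamiliar with tiling, the following is a minimal software sketch of a single-level tiled matrix multiplication in plain Python. It only illustrates the loop structure that lets a small block of operands be reused while it is resident in fast storage; it is not the authors' hardware design, and the tile size and function names are assumptions. The helper `matmul_flops` uses the usual 2·M·K·N operation count for an (M,K) x (K,N) product.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), iterating over tile-sized blocks."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # block row of the output
        for j0 in range(0, n, tile):      # block column of the output
            for k0 in range(0, k, tile):  # block along the shared dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def matmul_flops(m, k, n):
    """Multiply-add count (2 operations each) for an (m,k) x (k,n) product."""
    return 2 * m * k * n


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = tiled_matmul(a, b, tile=2)

# Operation count for the (64,768) x (768,3072) shape quoted in the abstract.
flops = matmul_flops(64, 768, 3072)
```

Under the assumption that the quoted 3.1 GFLOPs means 3.1 x 10^9 floating-point operations per second, the roughly 3.0 x 10^8 operations counted above would take on the order of 0.1 seconds per projection.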
♻ ☆ Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages ICLR 2025
Mathematical reasoning remains a challenging area for large language models
(LLMs), prompting the development of math-specific LLMs such as LLEMMA,
DeepSeekMath, and Qwen2-Math, among others. These models typically follow a
two-stage training paradigm: pre-training with math-related corpora and
post-training with problem datasets for supervised fine-tuning (SFT). Despite
these efforts, the improvements in mathematical reasoning achieved through
continued pre-training (CPT) are often less significant compared to those
obtained via SFT. This study addresses this discrepancy by exploring
alternative strategies during the pre-training phase, focusing on the use of
problem-solving data over general mathematical corpora. We investigate three
primary research questions: (1) Can problem-solving data enhance the model's
mathematical reasoning capabilities more effectively than general mathematical
corpora during CPT? (2) Are synthetic data from the same source equally
effective, and which synthesis methods are most efficient? (3) How do the
capabilities developed from the same problem-solving data differ between the
CPT and SFT stages, and what factors contribute to these differences? Our
findings indicate that problem-solving data significantly enhances the model's
mathematical capabilities compared to general mathematical corpora. We also
identify effective data synthesis methods, demonstrating that the tutorship
amplification synthesis method achieves the best performance. Furthermore,
while SFT facilitates instruction-following abilities, it underperforms
compared to CPT with the same data, which can be partially attributed to its
poor learning capacity for more challenging problem-solving data. These
insights provide valuable guidance for optimizing the mathematical reasoning
capabilities of LLMs, culminating in our development of a powerful mathematical
base model called MathGPT-8B.
comment: ICLR 2025
♻ ☆ ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Video Large Language Models (VideoLLMs) have made significant strides in
video understanding but struggle with long videos due to the limitations of
their backbone LLMs. Existing solutions rely on length extrapolation, which is
memory-constrained, or visual token compression, which primarily leverages
low-level temporal redundancy while overlooking the more effective high-level
knowledge redundancy. To address this, we propose ReTaKe, a
training-free method with two novel modules, DPSelect and PivotKV, that jointly
reduce both temporal visual redundancy and knowledge redundancy for video
compression. To align with human temporal perception, DPSelect
identifies keyframes based on inter-frame distance peaks. To leverage LLMs'
learned prior knowledge, PivotKV marks the keyframes as pivots and compresses
non-pivot frames by pruning low-attention tokens in their KV cache. ReTaKe
enables VideoLLMs to process frame sequences 8 times longer (up to 2048 frames), outperforming
similar-sized models by 3-5% and even rivaling much larger ones on VideoMME,
MLVU, LongVideoBench, and LVBench. Moreover, by overlapping compression
operations with prefilling, ReTaKe introduces only ~10% prefilling latency
overhead while reducing decoding latency by ~20%. Our code is available at
https://github.com/SCZwangxiao/video-ReTaKe.
comment: Rewrite the methods section. Add more ablation studies and results in
LongVideoBench. Update metadata
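The idea of selecting keyframes at peaks of inter-frame distance can be sketched as follows. This is a simplified illustration, assuming Euclidean distance between per-frame feature vectors and a plain local-maximum rule; the actual DPSelect criterion may differ, and the function names and toy features are hypothetical.

```python
import math


def frame_distances(frames):
    """Euclidean distance between each pair of consecutive frame features."""
    return [math.dist(f0, f1) for f0, f1 in zip(frames, frames[1:])]


def select_keyframes(frames):
    """Return indices of frames that follow a local peak in frame distance.

    The first frame is always kept (an assumption of this sketch), and
    peaks at the very first or last transition are ignored for simplicity.
    """
    d = frame_distances(frames)
    keyframes = [0]
    for i in range(1, len(d) - 1):
        if d[i] > d[i - 1] and d[i] >= d[i + 1]:
            keyframes.append(i + 1)  # frame right after the abrupt change
    return keyframes


# Hypothetical 1-D frame features with two abrupt scene changes.
frames = [[0.0], [1.0], [9.0], [10.0], [11.0], [20.0], [21.0]]
keyframes = select_keyframes(frames)
```

In this toy sequence the large jumps occur between frames 1 and 2 and between frames 4 and 5, so frames 2 and 5 are marked as keyframes alongside the first frame.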
♻ ☆ Inside-Out: Hidden Factual Knowledge in LLMs
Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart
This work presents a framework for assessing whether large language models
(LLMs) encode more factual knowledge in their parameters than what they express
in their outputs. While a few studies hint at this possibility, none has
clearly defined or demonstrated this phenomenon. We first propose a formal
definition of knowledge, quantifying it for a given question as the fraction of
correct-incorrect answer pairs where the correct one is ranked higher. This
gives rise to external and internal knowledge, depending on the information
used to score individual answer candidates: either the model's observable
token-level probabilities or its intermediate computations. Hidden knowledge
arises when internal knowledge exceeds external knowledge. We then present a
case study, applying this framework to three popular open-weights LLMs in a
closed-book QA setup. Our results indicate that: (1) LLMs consistently encode
more factual knowledge internally than what they express externally, with an
average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply
hidden that a model can internally know an answer perfectly, yet fail to
generate it even once, despite large-scale repeated sampling of 1,000 answers.
This reveals fundamental limitations in the generation capabilities of LLMs,
which (3) put a practical constraint on scaling test-time compute via repeated
answer sampling in closed-book QA: significant performance improvements remain
inaccessible because some answers are practically never sampled, yet if they
were, we would be guaranteed to rank them first.
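The knowledge definition in the abstract above, the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher, can be written as a short function. The sketch below is an illustration under stated assumptions: ties count as failures, and the score values are made-up numbers standing in for external (token-probability) and internal (intermediate-computation) scoring functions.

```python
def knowledge_score(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) pairs where the correct answer scores higher."""
    pairs = [(c, w) for c in correct_scores for w in incorrect_scores]
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    return sum(c > w for c, w in pairs) / len(pairs)


# Hypothetical scores for one question with two correct and two incorrect answers.
external = knowledge_score([0.30, 0.05], [0.20, 0.10])  # e.g., output probabilities
internal = knowledge_score([0.90, 0.80], [0.40, 0.70])  # e.g., a probe on hidden states

# Hidden knowledge, in the abstract's sense, arises when internal exceeds external.
has_hidden_knowledge = internal > external
```

Here the external scorer ranks only two of the four pairs correctly, while the internal scorer ranks all four correctly, so this toy question would count as a case of hidden knowledge.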
♻ ☆ Training and Evaluating with Human Label Variation: An Empirical Study
Human label variation (HLV) challenges the standard assumption that a
labelled instance has a single ground truth, instead embracing the natural
variation in human annotation to train and evaluate models. While various
training methods and metrics for HLV have been proposed, it is still unclear
which methods and metrics perform best in what settings. We propose new
evaluation metrics for HLV leveraging fuzzy set theory. Since these newly
proposed metrics are differentiable, we also experiment with employing
them as training objectives. We conduct an extensive study over 6 HLV
datasets testing 14 training methods and 6 evaluation metrics. We find that
training on either disaggregated annotations or soft labels performs best
across metrics, outperforming training using the proposed training objectives
with differentiable metrics. We also show that our proposed soft metric is more
interpretable and correlates best with human preference.
comment: 25 pages