Computation and Language
☆ UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations ACL 2025
Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
The rapid advancement of conversational search systems revolutionizes how
information is accessed by enabling the multi-turn interaction between the user
and the system. Existing conversational search systems are usually built with
two different models. This separation prevents the system from leveraging the
intrinsic knowledge of both models simultaneously, so effective retrieval is
not guaranteed to benefit generation. Existing studies on unified models do
not fully address understanding conversational context, managing retrieval
independently, and generating responses. In this paper, we explore how to
unify dense retrieval and response
generation for large language models in conversation. We conduct joint
fine-tuning with different objectives and design two mechanisms to reduce the
inconsistency risks while mitigating data discrepancy. The evaluations on five
conversational search datasets demonstrate that our unified model can mutually
improve both tasks and outperform the existing baselines.
comment: Accepted by ACL 2025 (main)
☆ FlexOlmo: Open Language Models for Flexible Data Use
Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
We introduce FlexOlmo, a new class of language models (LMs) that supports (1)
distributed training without data sharing, where different model parameters are
independently trained on closed datasets, and (2) data-flexible inference,
where these parameters along with their associated data can be flexibly
included or excluded from model inferences with no further training. FlexOlmo
employs a mixture-of-experts (MoE) architecture where each expert is trained
independently on closed datasets and later integrated through a new
domain-informed routing without any joint training. FlexOlmo is trained on
FlexMix, a corpus we curate comprising publicly available datasets alongside
seven domain-specific sets, representing realistic approximations of closed
sets. We evaluate models with up to 37 billion parameters (20 billion active)
on 31 diverse downstream tasks. We show that a general expert trained on public
data can be effectively combined with independently trained experts from other
data owners, leading to an average 41% relative improvement while allowing
users to opt out of certain data based on data licensing or permission
requirements. Our approach also outperforms prior model merging methods by
10.1% on average and surpasses the standard MoE trained without data
restrictions using the same training FLOPs. Altogether, this research presents
a solution for both data owners and researchers in regulated industries with
sensitive or protected data. FlexOlmo enables benefiting from closed data while
respecting data owners' preferences by keeping their data local and supporting
fine-grained control of data access during inference.
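A minimal sketch of the opt-out idea described above, assuming a generic masked softmax over per-expert router scores. The abstract does not specify FlexOlmo's domain-informed routing, so the function `route`, the expert names, and the scores are hypothetical illustrations rather than the authors' implementation.

```python
import math

def route(scores, included):
    """Softmax over router scores, restricted to the included experts."""
    # Experts whose data owners opted out are simply dropped before normalizing.
    active = {name: s for name, s in scores.items() if name in included}
    if not active:
        raise ValueError("at least one expert must be included")
    m = max(active.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - m) for name, s in active.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}

# Hypothetical router scores for a single token.
scores = {"public": 1.0, "news": 2.0, "code": 0.5}
all_weights = route(scores, {"public", "news", "code"})
opt_out_weights = route(scores, {"public", "code"})  # the "news" owner opted out
```

Excluding an expert needs no retraining in this sketch: the remaining weights are renormalized so they still sum to one.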
☆ Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs
Reasoning is a key capability for large language models (LLMs), particularly
when applied to complex tasks such as mathematical problem solving. However,
multimodal reasoning research still requires further exploration of modality
alignment and training costs. Many of these approaches rely on additional data
annotation and relevant rule-based rewards to enhance the understanding and
reasoning ability, which significantly increases training costs and limits
scalability. To address these challenges, we propose the
Deliberate-to-Intuitive reasoning framework (D2I) that improves the
understanding and reasoning ability of multimodal LLMs (MLLMs) without extra
annotations and complex rewards. Specifically, our method sets deliberate
reasoning strategies to enhance modality alignment only through the rule-based
format reward during training. During evaluation, the reasoning style shifts to
intuitive: the deliberate reasoning strategies used during training are removed,
and the model's acquired abilities are reflected implicitly in the response. D2I
outperforms baselines across both in-domain and out-of-domain benchmarks. Our
findings highlight the role of format reward in fostering transferable
reasoning skills in MLLMs, and inspire directions for decoupling training-time
reasoning depth from test-time response flexibility.
comment: Work in progress
☆ FRaN-X: FRaming and Narratives-eXplorer EMNLP 2025
Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, Alípio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
We present FRaN-X, a Framing and Narratives Explorer that automatically
detects entity mentions and classifies their narrative roles directly from raw
text. FRaN-X comprises a two-stage system that combines sequence labeling with
fine-grained role classification to reveal how entities are portrayed as
protagonists, antagonists, or innocents, using a unique taxonomy of 22
fine-grained roles nested under these three main categories. The system
supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese)
and two domains (the Russia-Ukraine Conflict and Climate Change). It provides
an interactive web interface for media analysts to explore and compare framing
across different sources, tackling the challenge of automatically detecting and
labeling how entities are framed. Our system allows end users to focus on a
single article as well as analyze up to four articles simultaneously. We
provide aggregate-level analysis, including an intuitive graph visualization
that highlights the narrative a group of articles is pushing. Our system
includes a search feature for users to look up entities of interest, along with
a timeline view that allows analysts to track an entity's role transitions
across different contexts within the article. The FRaN-X system and the trained
models are licensed under an MIT License. FRaN-X is publicly accessible at
https://fran-x.streamlit.app/ and a video demonstration is available at
https://youtu.be/VZVi-1B6yYk.
comment: 19 pages, 13 figures, submitted to EMNLP 2025 - Demo Track
☆ Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
Instruction tuning has become a foundation for unlocking the capabilities of
large-scale pretrained models and improving their performance on complex tasks.
Thus, the construction of high-quality instruction datasets is crucial for
enhancing model performance and generalizability. Although current instruction
datasets have reached tens of millions of samples, models finetuned on them may
still struggle with complex instruction following and tasks in rare domains.
This is primarily due to limited expansion in both "coverage" (coverage of
task types and knowledge areas) and "depth" (instruction complexity) of the
instruction set. To address this issue, we propose a systematic instruction
data construction framework, which integrates a hierarchical labeling system,
an informative seed selection algorithm, an evolutionary data synthesis
process, and a model deficiency diagnosis with targeted data generation. These
components form an iterative closed-loop to continuously enhance the coverage
and depth of instruction data. Based on this framework, we construct
InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million
instructions. Experiments on multiple foundation models and benchmark tasks
demonstrate its effectiveness in improving instruction-following capabilities.
Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage
and depth compared to comparable synthesized instruction datasets. Our work
lays a theoretical and practical foundation for the efficient, continuous
evolution of instruction datasets, moving from data quantity expansion to
qualitative improvement.
☆ Investigating the Robustness of Retrieval-Augmented Generation at the Query Level ACL 2025
Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Large language models (LLMs) are very costly and inefficient to update with
new information. To address this limitation, retrieval-augmented generation
(RAG) has been proposed as a solution that dynamically incorporates external
knowledge during inference, improving factual consistency and reducing
hallucinations. Despite their promise, RAG systems face practical challenges, most
notably a strong dependence on the quality of the input query for accurate
retrieval. In this paper, we investigate the sensitivity of different
components in the RAG pipeline to various types of query perturbations. Our
analysis reveals that the performance of commonly used retrievers can degrade
significantly even under minor query variations. We study each module in
isolation as well as their combined effect in an end-to-end question answering
setting, using both general-domain and domain-specific datasets. Additionally,
we propose an evaluation framework to systematically assess the query-level
robustness of RAG pipelines and offer actionable recommendations for
practitioners based on the results of more than 1092 experiments we performed.
comment: Accepted to Generation, Evaluation & Metrics (GEM) Workshop at ACL
2025
☆ Rethinking Verification for LLM Code Generation: From Generation to Testing
Large language models (LLMs) have recently achieved notable success in
code-generation benchmarks such as HumanEval and LiveCodeBench. However, a
detailed examination reveals that these evaluation suites often comprise only a
limited number of homogeneous test cases, resulting in subtle faults going
undetected. This not only artificially inflates measured performance but also
compromises accurate reward estimation in reinforcement learning frameworks
utilizing verifiable rewards (RLVR). To address these critical shortcomings, we
systematically investigate the test-case generation (TCG) task by proposing
multi-dimensional metrics designed to rigorously quantify test-suite
thoroughness. Furthermore, we introduce a human-LLM collaborative method
(SAGA), leveraging human programming expertise with LLM reasoning capability,
aimed at significantly enhancing both the coverage and the quality of generated
test cases. In addition, we develop TCGBench to facilitate the study of the
TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a
verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc)
of the code generation evaluation benchmark synthesized by SAGA is 10.78%
higher than that of LiveCodeBench-v6. These results demonstrate the
effectiveness of our proposed method. We hope this work contributes to building
a scalable foundation for reliable LLM code evaluation, further advancing RLVR
in code generation, and paving the way for automated adversarial test synthesis
and adaptive benchmark integration.
☆ Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues
Tutoring dialogues have gained significant attention in recent years, given
the prominence of online learning and the emerging tutoring abilities of
artificial intelligence (AI) agents powered by large language models (LLMs).
Recent studies have shown that the strategies used by tutors can have
significant effects on student outcomes, necessitating methods to predict how
tutors will behave and how their actions impact students. However, few works
have studied predicting tutor strategy in dialogues. Therefore, in this work we
investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to
predict both future tutor moves and student outcomes in dialogues, using two
math tutoring dialogue datasets. We find that even state-of-the-art LLMs
struggle to predict future tutor strategy while tutor strategy is highly
indicative of student outcomes, outlining a need for more powerful methods to
approach this task.
comment: Published in BEA 2025: 20th Workshop on Innovative Use of NLP for
Building Educational Applications
☆ MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction NLPCC 2025
Legal judgment prediction (LJP) offers a compelling method to aid legal
practitioners and researchers. However, one research question remains
relatively under-explored: Should multiple defendants and charges be treated
separately in LJP? To address this, we introduce a new dataset,
multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating
the performance of several prevailing legal large language models (LLMs) on
four practical legal judgment scenarios: (S1) single defendant with a single
charge, (S2) single defendant with multiple charges, (S3) multiple defendants
with a single charge, and (S4) multiple defendants with multiple charges. We
evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty
term prediction. We have conducted extensive experiments and found that the
scenario involving multiple defendants and multiple charges (S4) poses the
greatest challenges, followed by S2, S3, and S1. The impact varies
significantly depending on the model. For example, in S4 compared to S1,
InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD,
while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD.
Our dataset and code are available at
https://github.com/lololo-xiao/MultiJustice-MPMCP.
comment: Accepted by NLPCC 2025
☆ MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection ACL 2025
The rapid expansion of memes on social media has highlighted the urgent need
for effective approaches to detect harmful content. However, traditional
data-driven approaches struggle to detect new memes due to their evolving
nature and the lack of up-to-date annotated data. To address this issue, we
propose MIND, a multi-agent framework for zero-shot harmful meme detection that
does not rely on annotated data. MIND implements three key strategies: 1) We
retrieve similar memes from an unannotated reference set to provide contextual
information. 2) We propose a bi-directional insight derivation mechanism to
extract a comprehensive understanding of similar memes. 3) We then employ a
multi-agent debate mechanism to ensure robust decision-making through reasoned
arbitration. Extensive experiments on three meme datasets demonstrate that our
proposed framework not only outperforms existing zero-shot approaches but also
shows strong generalization across different model architectures and parameter
scales, providing a scalable solution for harmful meme detection. The code is
available at https://github.com/destroy-lonely/MIND.
comment: ACL 2025
☆ VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Graphical User Interface (GUI) agents powered by Large Vision-Language Models
(LVLMs) have emerged as a revolutionary approach to automating human-machine
interactions, capable of autonomously operating personal devices (e.g., mobile
phones) or applications within the device to perform complex real-world tasks
in a human-like manner. However, their close integration with personal devices
raises significant security concerns, with many threats, including backdoor
attacks, remaining largely unexplored. This work reveals that the visual
grounding of GUI agents (mapping textual plans to GUI elements) can introduce
vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack
targeting visual grounding, the agent's behavior can be compromised even when
given correct task-solving plans. To validate this vulnerability, we propose
VisualTrap, a method that can hijack the grounding by misleading the agent into
mapping textual plans to trigger locations instead of the intended targets.
VisualTrap uses the common method of injecting poisoned data for attacks, and
does so during the pre-training of visual grounding to ensure practical
feasibility of attacking. Empirical results show that VisualTrap can
effectively hijack visual grounding with as little as 5% poisoned data and
highly stealthy visual triggers (invisible to the human eye); and the attack
can be generalized to downstream tasks, even after clean fine-tuning. Moreover,
the injected trigger can remain effective across different GUI environments,
e.g., being trained on mobile/web and generalizing to desktop environments.
These findings underscore the urgent need for further research on backdoor
attack risks in GUI agents.
☆ SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
The growing demand for efficient knowledge graph (KG) enrichment leveraging
external corpora has intensified interest in relation extraction (RE),
particularly under low-supervision settings. To address the need for adaptable
and noise-resilient RE solutions that integrate seamlessly with pre-trained
large language models (PLMs), we introduce SCoRE, a modular and cost-effective
sentence-level RE system. SCoRE enables easy PLM switching, requires no
finetuning, and adapts smoothly to diverse corpora and KGs. By combining
supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN)
classifier for multi-label classification, it delivers robust performance
despite the noisy annotations of distantly supervised corpora. To improve RE
evaluation, we propose two novel metrics: Correlation Structure Distance (CSD),
measuring the alignment between learned relational patterns and KG structures,
and Precision at R (P@R), assessing utility as a recommender system. We also
release Wiki20d, a benchmark dataset replicating real-world RE conditions where
only KG-derived annotations are available. Experiments on five benchmarks show
that SCoRE matches or surpasses state-of-the-art methods while significantly
reducing energy consumption. Further analyses reveal that increasing model
complexity, as seen in prior work, degrades performance, highlighting the
advantages of SCoRE's minimal design. Combining efficiency, modularity, and
scalability, SCoRE stands as an optimal choice for real-world RE applications.
☆ Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights
AI evaluations have become critical tools for assessing large language model
capabilities and safety. This paper presents practical insights from eight
months of maintaining inspect_evals, an open-source repository of 70+
community-contributed AI evaluations. We identify key challenges in
implementing and maintaining AI evaluations and develop solutions including:
(1) a structured cohort management framework for scaling community
contributions, (2) statistical methodologies for optimal resampling and
cross-model comparison with uncertainty quantification, and (3) systematic
quality control processes for reproducibility. Our analysis reveals that AI
evaluation requires specialized infrastructure, statistical rigor, and
community coordination beyond traditional software development practices.
☆ Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
Reinforcement Learning (RL) has demonstrated its potential to improve the
reasoning ability of Large Language Models (LLMs). One major limitation of most
existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL
in nature, i.e., data generated during the past learning process is not fully
utilized. This inevitably comes at a significant cost of compute and time,
posing a stringent bottleneck on continuing economic and efficient scaling. To
this end, we launch the renaissance of off-policy RL and propose Reincarnating
Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable
on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix
consists of three major components: (1) Mix-policy proximal policy gradient
with an increased Update-To-Data (UTD) ratio for efficient training; (2)
KL-Convex policy constraint to balance the trade-off between stability and
flexibility; (3) Policy reincarnation to achieve a seamless transition from
efficient early-stage learning to steady asymptotic improvement. In our
experiments, we train a series of ReMix models on top of PPO and GRPO with 1.5B
and 7B base models. ReMix reaches an average Pass@1 accuracy of 52.10% (1.5B
model) with 0.079M response rollouts and 350 training steps, and 63.27%/64.39%
(7B model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math
reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and
MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level
performance with an over 30x to 450x reduction in training cost in terms of
rollout data volume. In addition, we reveal insightful findings via
multifaceted analysis, including the implicit preference for shorter responses
due to the Whipping Effect of off-policy discrepancy, the collapse mode of
self-reflection behavior under the presence of severe off-policyness, etc.
comment: Preliminary version. Project page:
https://anitaleungxx.github.io/ReMix
☆ Shifting from Ranking to Set Selection for Retrieval Augmented Generation ACL 2025
Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved
passages are not only individually relevant but also collectively form a
comprehensive set. Existing approaches primarily rerank top-k passages based on
their individual relevance, often failing to meet the information needs of
complex queries in multi-hop question answering. In this work, we propose a
set-wise passage selection approach and introduce SETR, which explicitly
identifies the information requirements of a query through Chain-of-Thought
reasoning and selects an optimal set of passages that collectively satisfy
those requirements. Experiments on multi-hop RAG benchmarks show that SETR
outperforms both proprietary LLM-based rerankers and open-source baselines in
terms of answer correctness and retrieval quality, providing an effective and
efficient alternative to traditional rerankers in RAG systems. The code is
available at https://github.com/LGAI-Research/SetR
comment: Accepted to ACL 2025 Oral
☆ Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework
Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li, Chenchen Zhang, Kejiao Li, Qi Yi, Yuhao Jiang, Bo Zhou, Fengzong Lian, Zhanhui Kang
Recent advances in large language models (LLMs) have accelerated progress
toward artificial general intelligence, with inference-time scaling emerging as
a key technique. Contemporary approaches leverage either sequential reasoning
(iteratively extending chains of thought) or parallel reasoning (generating
multiple solutions simultaneously) to scale inference. However, both paradigms
face fundamental limitations: sequential scaling typically relies on arbitrary
token budgets for termination, leading to inefficiency or premature cutoff;
while parallel scaling often lacks coordination among parallel branches and
requires intrusive fine-tuning to perform effectively. In light of these
challenges, we aim to design a flexible test-time collaborative inference
framework that exploits the complementary strengths of both sequential and
parallel reasoning paradigms. Towards this goal, the core challenge lies in
developing an efficient and accurate intrinsic quality metric to assess model
responses during collaborative inference, enabling dynamic control and early
termination of the reasoning trace. To address this challenge, we introduce
semantic entropy (SE), which quantifies the semantic diversity of parallel
model responses and serves as a robust indicator of reasoning quality due to
its strong negative correlation with accuracy...
comment: 13 pages, 5 figures
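A minimal sketch of semantic entropy over parallel responses, as described in the abstract above. It assumes that two responses are semantically equivalent when their normalized answer strings match. That equivalence test and the function name `semantic_entropy` are illustrative assumptions, since the abstract does not say how responses are clustered.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Shannon entropy (in nats) over clusters of equivalent answers."""
    if not answers:
        raise ValueError("need at least one answer")
    # Assumption: answers are equivalent if they match after normalization.
    clusters = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return sum((c / n) * math.log(n / c) for c in clusters.values())

agree = semantic_entropy(["42", "42 ", "42"])        # all branches agree
split = semantic_entropy(["42", "41", "42", "40"])   # branches disagree
```

Low entropy means the parallel branches converge on one answer. A controller could stop reasoning once the value falls below a chosen threshold, with the threshold itself being a tuning choice not given in the abstract.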
☆ Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
This paper contributes to speeding up the design and deployment of
engineering dynamical systems by proposing a strategy for exploiting domain and
expert knowledge for the automated generation of dynamical system computational
models starting from a corpus of documents relevant to the dynamical system of
interest and an input document describing the specific system. This strategy is
implemented in five steps and, crucially, it uses system modeling language
diagrams (SysML) to extract accurate information about the dependencies,
attributes, and operations of components. Natural Language Processing (NLP)
strategies and Large Language Models (LLMs) are employed in specific tasks to
improve intermediate outputs of automated SysML diagram generation, such
as: list of key nouns; list of extracted relationships; list of key phrases and
key relationships; block attribute values; block relationships; and BDD diagram
generation. The applicability of automated SysML diagram generation is
illustrated with different case studies. The computational models of complex
dynamical systems from SysML diagrams are then obtained via code generation and
computational model generation steps. In the code generation step, NLP
strategies are used for summarization, while LLMs are used for validation only.
The proposed approach is not limited to a specific system, domain, or
computational software. The applicability of the proposed approach is shown via
an end-to-end example from text to model of a simple pendulum, showing improved
performance compared to results yielded by LLMs only.
☆ Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining: Method, Evaluation and Applications
Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
The emergence of open-source large language models (LLMs) has expanded
opportunities for enterprise applications; however, many organizations still
lack the infrastructure to deploy and maintain large-scale models. As a result,
small LLMs (sLLMs) have become a practical alternative, despite their inherent
performance limitations. While Domain Adaptive Continual Pretraining (DACP) has
been previously explored as a method for domain adaptation, its utility in
commercial applications remains under-examined. In this study, we validate the
effectiveness of applying a DACP-based recipe across diverse foundation models
and service domains. Through extensive experiments and real-world evaluations,
we demonstrate that DACP-applied sLLMs achieve substantial gains in target
domain performance while preserving general capabilities, offering a
cost-efficient and scalable solution for enterprise-level deployment.
comment: under review
☆ Checklist Engineering Empowers Multilingual LLM Judges
Automated text evaluation has long been a central issue in Natural Language
Processing (NLP). Recently, the field has shifted toward using Large Language
Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While
promising and easily adaptable across tasks, this approach has seen limited
exploration in multilingual contexts. Existing multilingual studies often rely
on proprietary models or require extensive training data for fine-tuning,
raising concerns about cost, time, and efficiency. In this paper, we propose
Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free
framework that uses checklist intuition for multilingual evaluation with an
open-source model. Experiments across multiple languages and three benchmark
datasets, under both pointwise and pairwise settings, show that our method
generally surpasses the baselines and performs on par with the GPT-4o model.
☆ KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution
This paper presents the first application of Kolmogorov-Arnold Convolution
for Text (KAConvText) in sentence classification, addressing three tasks:
imbalanced binary hate speech detection, balanced multiclass news
classification, and imbalanced multiclass ethnic language identification. We
investigate various embedding configurations, comparing random to fastText
embeddings in both static and fine-tuned settings, with embedding dimensions of
100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs
and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we
investigate KAConvText with different classification heads (MLP and KAN),
where the KAN head supports enhanced interpretability. Results show that
KAConvText-MLP with fine-tuned fastText embeddings achieves the best
performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection,
92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82%
accuracy (F1-score = 0.9982) for language identification.
comment: 10 pages, 3 figures, 4 tables
☆ Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool
The role of civil society organizations (CSOs) in monitoring harmful online
content is increasingly crucial, especially as platform providers reduce their
investment in content moderation. AI tools can assist in detecting and
monitoring harmful content at scale. However, few open-source tools offer
seamless integration of AI models and social media monitoring infrastructures.
Given their thematic expertise and contextual understanding of harmful content,
CSOs should be active partners in co-developing technological tools, providing
feedback, helping to improve models, and ensuring alignment with stakeholder
needs and values, rather than passive 'consumers'. However, collaborations
between the open source community, academia, and civil society remain rare, and
research on harmful content seldom translates into practical tools usable by
civil society actors. This work in progress explores how CSOs can be
meaningfully involved in an AI-assisted open-source tool for monitoring
anti-democratic movements on Telegram, which we are currently developing in
collaboration with CSO stakeholders.
☆ On the Effect of Uncertainty on Layer-wise Inference Dynamics ICML 2025
Understanding how large language models (LLMs) internally represent and
process their predictions is central to detecting uncertainty and preventing
hallucinations. While several studies have shown that models encode uncertainty
in their hidden states, it is underexplored how this affects the way they
process such hidden states. In this work, we demonstrate that the dynamics of
output token probabilities across layers for certain and uncertain outputs are
largely aligned, revealing that uncertainty does not seem to affect inference
dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to
analyze the layer-wise probability trajectories of final prediction tokens
across 11 datasets and 5 models. Treating incorrect predictions as those with
higher epistemic uncertainty, we find aligned trajectories for certain
and uncertain predictions, both of which exhibit abrupt increases in confidence at
similar layers. We balance this finding by showing evidence that more competent
models may learn to process uncertainty differently. Our findings challenge the
feasibility of leveraging simplistic methods for detecting uncertainty at
inference. More broadly, our work demonstrates how interpretability methods may
be used to investigate the way uncertainty affects inference.
comment: Accepted to Actionable Interpretability Workshop - ICML 2025
☆ CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs
Large language models (LLMs), including zero-shot and few-shot paradigms,
have shown promising capabilities in clinical text generation. However,
real-world applications face two key challenges: (1) patient data is highly
unstructured, heterogeneous, and scattered across multiple note types and (2)
clinical notes are often long and semantically dense, making naive prompting
infeasible due to context length constraints and the risk of omitting
clinically relevant information.
We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a
domain-specific framework for structured and clinically grounded text
generation using LLMs. It incorporates a novel hierarchical chunking strategy
that respects clinical document structure and introduces a task-specific
dual-stage retrieval mechanism. The global stage identifies relevant note types
using evidence-based queries, while the local stage extracts high-value content
within those notes creating relevance at both document and section levels.
We apply the system to generate structured progress notes for individual
hospital visits using 15 clinical note types from the MIMIC-III dataset.
Experiments show that it preserves temporal and semantic alignment across
visits, achieving an average alignment score of 87.7%, surpassing the 80.7%
baseline from real clinician-authored notes. The generated outputs also
demonstrate high consistency across LLMs, reinforcing deterministic behavior
essential for reproducibility, reliability, and clinical trust.
comment: 12 pages, 4 figures
☆ Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models
This project introduces a new measure of elite polarization via actor and
subject detection using artificial intelligence. I identify when politicians
mention one another in parliamentary speeches, note who is speaking and who is
being addressed, and assess the emotional temperature behind these evaluations.
This maps how elites evaluate their various out-parties, allowing us to create
an index of mutual out-party hostility, that is, elite polarization. While I
analyzed polarization data over the past four decades for the UK, and two
decades for Hungary and Italy, my approach lays the groundwork for a
twenty-year, EU-wide time-series dataset on elite polarization. I obtain
results that can be aggregated by party and quarter. The resulting index
demonstrates good face validity: it reacts to events such as electoral
campaigns, country- and party-level crises, and to parties losing and assuming
power.
☆ Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review
James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee
The data extraction stages of reviews are resource-intensive, and researchers
may seek to expedite data extraction using online large language models (LLMs)
and review protocols. Claude 3.5 Sonnet was used to trial two approaches that
used a review protocol to prompt data extraction from 10 evidence sources
included in a case study scoping review. A protocol-based approach was also
used to review extracted data. Limited performance evaluation was undertaken
which found high accuracy for the two extraction approaches (83.3% and 100%)
when extracting simple, well-defined citation details; accuracy was lower (9.6%
and 15.8%) when extracting more complex, subjective data items. Considering all
data items, both approaches had precision >90% but low recall (<25%) and F1
scores (<40%). The context of a complex scoping review, open response types and
methodological approach likely impacted performance due to missed and
misattributed data. LLM feedback considered the baseline extraction accurate
and suggested minor amendments: 4 of 15 (26.7%) to citation details and 8 of
38 (21.1%) to key findings data items were considered to potentially add value.
However, when repeating the process with a dataset featuring deliberate errors,
only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for
expediency require more robust performance evaluation across a range of LLMs
and review contexts with comparison to conventional prompt engineering
approaches. We recommend researchers evaluate and report LLM performance if
using them similarly to conduct data extraction or review extracted data. LLM
feedback contributed to protocol adaptation and may assist future review
protocol drafting.
comment: 44 pages, 4 figures
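The reported precision, recall, and F1 ranges are mutually consistent, because F1 is the harmonic mean of precision and recall and is therefore pulled toward the lower of the two. A minimal sketch of that arithmetic follows; the inputs are the abstract's bounds used as illustrative values, not the study's measured per-approach figures, and the function name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs taken from the abstract's bounds, not measured values.
at_bounds = f1_score(0.90, 0.25)
# Even perfect precision cannot lift F1 past this ceiling when recall is 0.25.
ceiling = f1_score(1.00, 0.25)
print(f"F1 at P=0.90, R=0.25: {at_bounds:.3f}")
print(f"F1 at P=1.00, R=0.25: {ceiling:.3f}")
```

With recall below 25%, F1 stays below 40% regardless of how high precision is, which matches the ranges quoted above.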
☆ FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation
Building on the success of Large Language Models (LLMs), LLM-based
representations have dominated the document representation landscape, achieving
great performance on the document embedding benchmarks. However, the
high-dimensional, computationally expensive embeddings from LLMs tend to be
either too generic or inefficient for domain-specific applications. To address
these limitations, we introduce FuDoBa, a Bayesian optimisation-based method
that integrates LLM-based embeddings with domain-specific structured knowledge,
sourced both locally and from external repositories like WikiData. This fusion
produces low-dimensional, task-relevant representations while reducing training
complexity and yielding interpretable early-fusion weights for enhanced
classification performance. We demonstrate the effectiveness of our approach on
six datasets in two domains, showing that when paired with robust AutoML-based
classifiers, our proposed representation learning approach performs on par
with, or surpasses, those produced solely by the proprietary LLM-based
embedding baselines.
☆ Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Recent advances in language modeling have demonstrated the effectiveness of
State Space Models (SSMs) for efficient sequence modeling. While hybrid
architectures such as Samba and the decoder-decoder architecture, YOCO, have
shown promising performance gains over Transformers, prior works have not
investigated the efficiency potential of representation sharing between SSM
layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet
effective mechanism for efficient memory sharing across layers. We apply it to
create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in
the cross-decoder to share memory readout states from a Samba-based
self-decoder. SambaY significantly enhances decoding efficiency, preserves
linear pre-filling time complexity, and boosts long-context performance, all
while eliminating the need for explicit positional encoding. Through extensive
scaling experiments, we demonstrate that our model exhibits a significantly
lower irreducible loss compared to a strong YOCO baseline, indicating superior
performance scalability under large-scale compute regimes. Our largest model
enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves
significantly better performance than Phi4-mini-Reasoning on reasoning tasks
such as Math500, AIME24/25, and GPQA Diamond without any reinforcement
learning, while delivering up to 10x higher decoding throughput on 2K-length
prompts with 32K generation length under the vLLM inference framework. We
release our training codebase on open-source data at
https://github.com/microsoft/ArchScale.
☆ Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
We propose a unified food-domain QA framework that combines a large-scale
multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000
recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate
40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint
fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves
BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by
31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to
7.3%) and LLaVA-driven hallucination checks, ensure factual and visual
fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image
reuse and 85%
adequacy in synthesis. Our results demonstrate that structured knowledge and
multimodal generation together enhance reliability and diversity in food QA.
☆ The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production
Large-language models turn writing into a live exchange between humans and
software. We capture this new medium with a discursive-network model that
treats people and LLMs as equal nodes and tracks how their statements
circulate. Broadening the focus from isolated hallucinations, we define
invalidation (any factual, logical, or structural breach) and show it follows
four hazards: drift from truth, self-repair, fresh fabrication, and external
detection. A general mathematical model of discursive networks is developed to
provide valuable insights: A network governed only by drift and self-repair
stabilizes at a modest error rate; adding fabrication reproduces the high rates
seen in current LLMs. Giving each false claim even a small chance of peer
review shifts the system to a truth-dominant state. We operationalize peer
review with the open-source Flaws-of-Others (FOO) algorithm: a
configurable loop in which any set of agents critique one another while a
harmoniser merges their verdicts. The takeaway is practical and cultural:
reliability in this new medium comes not from perfecting single models but from
wiring imperfect ones into networks that keep each other honest.
comment: 27 pages, 3 figures, 4 tables, 1 algorithm, 28 references
☆ DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse
Social media users often make scientific claims without citing where these
claims come from, generating a need to verify these claims. This paper details
work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific
Claim Source Retrieval which seeks to find relevant scientific papers based on
implicit references in tweets. Our team explored 6 different data augmentation
techniques, 7 different retrieval and reranking pipelines, and finetuned a
bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams
for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25
baseline of 0.43. Our code is available on GitHub at
https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
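For readers unfamiliar with the metric, MRR@5 averages the reciprocal rank of the correct paper over all queries, counting a query as zero when the correct paper is not in the top five. The sketch below assumes exactly one gold paper per tweet; the function name and document ids are hypothetical and this is not the task's official scorer.

```python
from typing import Hashable, Sequence


def mrr_at_k(
    ranked: Sequence[Sequence[Hashable]],
    gold: Sequence[Hashable],
    k: int = 5,
) -> float:
    """Mean reciprocal rank, counting only hits within the top k results."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not ranked:
        return 0.0
    total = 0.0
    for candidates, target in zip(ranked, gold):
        for rank, doc_id in enumerate(candidates[:k], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical document ids, for illustration only.
runs = [["p3", "p1", "p9"], ["p4", "p2", "p7"], ["p8", "p5", "p6"]]
answers = ["p3", "p7", "p0"]
score = mrr_at_k(runs, answers)  # (1 + 1/3 + 0) / 3
```

A hit at rank 1 contributes 1, a hit at rank 3 contributes 1/3, and a miss contributes 0, so improving the rank of already-retrieved papers matters as much as retrieving them at all.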
☆ Large Language Model for Extracting Complex Contract Information in Industrial Scenes
This paper proposes a high-quality dataset construction method for complex
contract information extraction tasks in industrial scenarios and fine-tunes a
large language model based on this dataset. Firstly, cluster analysis is
performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to
extract key information from the original contract data, obtaining high-quality
data annotations. Secondly, data augmentation is achieved by constructing new
texts, and GPT-3.5 generates unstructured contract texts from randomly combined
keywords, improving model robustness. Finally, the large language model is
fine-tuned based on the high-quality dataset. Experimental results show that
the model achieves excellent overall performance while ensuring high field
recall and precision and considering parsing efficiency. LoRA, data balancing,
and data augmentation effectively enhance model accuracy and robustness. The
proposed method provides a novel and efficient solution for industrial contract
information extraction tasks.
☆ InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior
Aligning Large Language Models (LLMs) with investor decision-making processes
under herd behavior is a critical challenge in behavioral finance, which
grapples with a fundamental limitation: the scarcity of real-user data needed
for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM
outputs and human behavioral patterns, its reliance on massive authentic data
imposes substantial collection costs and privacy risks. We propose InvestAlign,
a novel framework that constructs high-quality SFT datasets by leveraging
theoretical solutions to similar and simple optimal investment problems rather
than complex scenarios. Our theoretical analysis demonstrates that training
LLMs with InvestAlign-generated data achieves faster parameter convergence than
using real-user data, suggesting superior learning efficiency. Furthermore, we
develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which
demonstrates significantly closer alignment to real-user data than pre-SFT
models in both simple and complex investment problems. This highlights our
proposed InvestAlign as a promising approach with the potential to address
complex optimal investment problems and align LLMs with investor
decision-making processes under herd behavior. Our code is publicly available
at https://github.com/thu-social-network-research-group/InvestAlign.
☆ FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable
progress in both Video-to-Text and Text-to-Video tasks. However, they often
suffer from hallucinations, generating content that contradicts the visual
input. Existing evaluation methods are limited to one task (e.g., V2T) and also
fail to assess hallucinations in open-ended, free-form responses. To address
this gap, we propose FIFA, a unified FaIthFulness evAluation framework that
extracts comprehensive descriptive facts, models their semantic dependencies
via a Spatio-Temporal Semantic Dependency Graph, and verifies them using
VideoQA models. We further introduce Post-Correction, a tool-based correction
framework that revises hallucinated content. Extensive experiments demonstrate
that FIFA aligns more closely with human judgment than existing evaluation
methods, and that Post-Correction effectively improves factual consistency in
both text and video generation.
☆ SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers ACL 2025
Large Language Models (LLMs) have achieved impressive accomplishments in
recent years. However, the increasing memory consumption of the KV cache has
posed a significant challenge to inference systems. Eviction methods
have revealed the inherent redundancy within the KV cache, demonstrating its
potential for reduction, particularly in deeper layers. However, KV cache
reduction for shallower layers has been found to be insufficient. We observe
that the KV cache exhibits a high degree of similarity. Based on
this observation, we propose a novel KV cache reduction method, SpindleKV,
which balances both shallow and deep layers. For deep layers, we employ an
attention-weight-based eviction method, while for shallow layers, we apply a
codebook-based replacement approach that is learnt by a similarity and merging
policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma
faced by other attention-based eviction methods. Experiments on two common
benchmarks with three different LLMs show that SpindleKV achieves better KV
cache reduction than baseline methods, while preserving similar
or even better model performance.
comment: Accepted by ACL 2025 main
☆ Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings
Translating wordplay across languages presents unique challenges that have
long confounded both professional human translators and machine translation
systems. This research proposes a novel approach for translating puns from
English to French by combining state-of-the-art large language models with
specialized techniques for wordplay generation.
Our methodology employs a three-stage approach. First, we establish a
baseline using multiple frontier large language models with feedback based on a
new contrastive learning dataset. Second, we implement a guided
chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we
implement a multi-agent generator-discriminator framework for evaluating and
regenerating puns with feedback.
Moving beyond the limitations of literal translation, our methodology's
primary objective is to capture the linguistic creativity and humor of the
source text wordplay, rather than simply duplicating its vocabulary. Our best
runs earned first and second place in the CLEF JOKER 2025 Task 2 competition
where they were evaluated manually by expert native French speakers.
This research addresses a gap between translation studies and computational
linguistics by implementing linguistically-informed techniques for wordplay
translation, advancing our understanding of how language models can be
leveraged to handle the complex interplay between semantic ambiguity, phonetic
similarity, and the implicit cultural and linguistic awareness needed for
successful humor.
comment: CLEF 2025 Working Notes, 9-12 September 2025, Madrid, Spain
☆ On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Robust verbal confidence generated by large language models (LLMs) is crucial
for the deployment of LLMs to ensure transparency, trust, and safety in
human-AI interactions across many high-stakes applications. In this paper, we
present the first comprehensive study on the robustness of verbal confidence
under adversarial attacks. We introduce a novel framework for attacking verbal
confidence scores through both perturbation and jailbreak-based methods, and
show that these attacks can significantly jeopardize verbal confidence
estimates and lead to frequent answer changes. We examine a variety of
prompting strategies, model sizes, and application domains, revealing that
current confidence elicitation methods are vulnerable and that commonly used
defence techniques are largely ineffective or counterproductive. Our findings
underscore the urgent need to design more robust mechanisms for confidence
expression in LLMs, as even subtle semantic-preserving modifications can lead
to misleading confidence in responses.
☆ Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Despite advances in reinforcement learning (RL)-based video reasoning with
large language models (LLMs), data collection and finetuning remain significant
challenges. These methods often rely on large-scale supervised fine-tuning
(SFT) with extensive video data and long Chain-of-Thought (CoT) annotations,
making them costly and hard to scale. To address this, we present Video-RTS, a
new approach to improve video reasoning capability with drastically improved
data efficiency by combining data-efficient RL with a video-adaptive test-time
scaling (TTS) strategy. Based on observations about the data scaling of RL
samples, we skip the resource-intensive SFT step and employ efficient pure-RL
training with output-based rewards, requiring no additional annotations or
extensive fine-tuning. Furthermore, to utilize computational resources more
efficiently, we introduce a sparse-to-dense video TTS strategy that improves
inference by iteratively adding frames based on output consistency. We validate
our approach on multiple video reasoning benchmarks, showing that Video-RTS
surpasses existing video reasoning models by an average of 2.4% in accuracy
using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2%
improvement on Video-Holmes, a recent and challenging video reasoning
benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and
adaptive video TTS offer complementary strengths, enabling Video-RTS's strong
reasoning performance.
comment: The first two authors contributed equally. Project page:
https://sites.google.com/cs.unc.edu/videorts2025/
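The sparse-to-dense test-time scaling idea described above (add frames only while sampled outputs disagree) can be sketched as a short loop. Everything concrete here is an assumption for illustration: the `predict` callback, the frame-doubling schedule, the unanimous-agreement stopping rule, and the majority-vote fallback are not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def sparse_to_dense_answer(
    predict: Callable[[int, int], str],
    start_frames: int = 4,
    max_frames: int = 32,
    num_samples: int = 3,
) -> Tuple[str, int]:
    """Add frames until sampled answers agree; fall back to a majority vote."""
    if start_frames < 1 or num_samples < 1:
        raise ValueError("start_frames and num_samples must be positive")
    frames = start_frames
    while True:
        # predict(num_frames, seed) is a hypothetical stand-in for the model.
        answers: List[str] = [predict(frames, seed) for seed in range(num_samples)]
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes == num_samples or frames >= max_frames:
            return top_answer, frames
        frames = min(frames * 2, max_frames)


def stub_predict(num_frames: int, seed: int) -> str:
    """Toy model whose samples disagree until it sees at least 16 frames."""
    if num_frames >= 16:
        return "B"
    return "AB"[seed % 2]


answer, frames_used = sparse_to_dense_answer(stub_predict)
```

In this toy run the loop stops at 16 frames rather than the 32-frame maximum, which is the kind of compute saving the abstract attributes to consistency-based frame selection.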
☆ Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents
This study investigates how stylized, voiced agents shape user interaction in
a multimodal language learning environment. We conducted a mixed-methods
evaluation of 54 participants interacting with anime-inspired characters
powered by large language models and expressive text-to-speech synthesis. These
agents responded in Japanese character language, offering users asynchronous,
semi-structured conversation in varying speech styles and emotional tones. We
analyzed user engagement patterns, perceived usability, emotional responses,
and learning behaviors, with particular attention to how agent stylization
influenced interaction across language proficiency levels and cultural
backgrounds. Our findings reveal that agent design, especially voice, persona,
and linguistic style, substantially affected user experience, motivation, and
strategy. This work contributes to the understanding of affective, culturally
stylized agents in human-agent interaction and offers guidance for designing
more engaging, socially responsive systems.
♻ ☆ Multi-Attribute Steering of Language Models via Targeted Intervention ACL 2025
Inference-time intervention (ITI) has emerged as a promising method for
steering large language model (LLM) behavior in a particular direction (e.g.,
improving helpfulness) by intervening on token representations without costly
updates to the LLM's parameters. However, existing ITI approaches fail to scale
to multi-attribute settings with conflicts, such as enhancing helpfulness while
also reducing toxicity. To address this, we introduce Multi-Attribute Targeted
Steering (MAT-Steer), a novel steering framework designed for selective
token-level intervention across multiple attributes. MAT-Steer learns steering
vectors using an alignment objective that shifts the model's internal
representations of undesirable outputs closer to those of desirable ones while
enforcing sparsity and orthogonality among vectors for different attributes,
thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two
distinct settings: (i) on question answering (QA) tasks where we balance
attributes like truthfulness, bias, and toxicity; (ii) on generative tasks
where we simultaneously improve attributes like helpfulness, correctness, and
coherence. MAT-Steer outperforms existing ITI and parameter-efficient
fine-tuning approaches across both task types (e.g., 3% average accuracy gain
across QA tasks and 55.82% win rate against the best ITI baseline).
comment: ACL 2025 camera-ready, code link:
https://github.com/duykhuongnguyen/MAT-Steer
♻ ☆ LCFO: Long Context and Long Form Output Dataset and Benchmarking
Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood
This paper presents the Long Context and Long Form Output (LCFO) benchmark, a
novel evaluation framework for assessing gradual summarization and summary
expansion capabilities across diverse domains. LCFO consists of long input
documents (5k words average length), each of which comes with three summaries
of different lengths (20%, 10%, and 5% of the input text), as well as
approximately 15 questions and answers (QA) related to the input content.
Notably, LCFO also provides alignments between specific QA pairs and
corresponding summaries in 7 domains. The primary motivation behind providing
summaries of different lengths is to establish a controllable framework for
generating long texts from shorter inputs, i.e. summary expansion. To establish
an evaluation metric framework for summarization and summary expansion, we
provide human evaluation scores for human-generated outputs, as well as results
from various state-of-the-art large language models (LLMs). GPT-4o-mini
achieves the best human scores among automatic systems in both summarization and
summary expansion tasks (~ +10% and +20%, respectively). It even surpasses
human output quality in the case of short summaries (~ +7%). Overall, automatic
metrics achieve low correlations with human evaluation scores (~ 0.4) but
moderate correlation on specific evaluation aspects such as fluency and
attribution (~ 0.6).
♻ ☆ LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Reward Models (RMs) are crucial to aligning large language models (LLMs), but
the degree to which an RM specialized to one task (e.g. writing) generalizes to
new tasks (e.g. math) is often not known a priori, which makes training LLMs
with only one fixed RM suboptimal. However, optimizing LLMs with multiple RMs
simultaneously can incur a prohibitively high computational cost and lead to
conflicting signals from different RMs that may degrade performance. To address
these challenges, we introduce LASeR (Learning to Adaptively Select Rewards),
which frames reward model selection as a multi-armed bandit problem,
efficiently and iteratively training LLMs using multiple RMs by selecting the
most well-suited RM for each instance. On commonsense and math reasoning tasks,
we show that LASeR boosts iterative LLM training, improving the absolute
average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of
RM scores while also showing superior efficiency (e.g., a 2x speedup).
Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to
a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to
long-context generation, LASeR improves by 2.96 F1 points (avg.) on
single-document QA tasks and 2.97 F1 points on few-shot learning over the RM
score ensemble baseline with best-of-n sampling.
comment: 28 pages; First two authors contributed equally. Code:
https://github.com/duykhuongnguyen/LASeR-MAB
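Framing reward-model selection as a multi-armed bandit means treating each RM as an arm and learning online which arm pays off. The sketch below uses plain UCB1, a standard bandit algorithm swapped in for illustration; the abstract does not say which bandit algorithm LASeR uses, and the class name and toy payoffs are hypothetical.

```python
import math
from typing import List


class UCB1Selector:
    """UCB1 bandit over a fixed set of arms (here: hypothetical reward models)."""

    def __init__(self, num_arms: int) -> None:
        self.counts: List[int] = [0] * num_arms
        self.means: List[float] = [0.0] * num_arms

    def select(self) -> int:
        # Play every arm once before relying on the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental running mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Toy loop: arm 1 stands in for the RM best suited to the current task.
# The payoffs are made up; in training they would reflect the policy's progress.
payoffs = [0.2, 0.8, 0.5]
selector = UCB1Selector(num_arms=len(payoffs))
for _ in range(200):
    arm = selector.select()
    selector.update(arm, payoffs[arm])
```

After the loop, `selector.counts` shows most pulls going to the best arm while the others are still sampled occasionally, which is the exploration and exploitation trade-off that lets a bandit avoid querying every RM on every instance.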
♻ ☆ Low-Rank Adaptation Secretly Imitates Differentially Private SGD
As pre-trained language models grow in size, fully fine-tuning their
parameters on task adaptation data becomes increasingly impractical. To address
this challenge, some methods for low-rank adaptation of language models have
been proposed, e.g. LoRA, which incorporates trainable low-rank decomposition
matrices into only some parameters of the pre-trained model, called adapters.
This approach significantly reduces the number of trainable parameters compared
to fine-tuning all parameters or adapters. In this work, we look at low-rank
adaptation methods through the lens of data privacy. We show theoretically that the
low-rank adaptation used in LoRA is equivalent to fine-tuning adapters with
noisy batch gradients, just as the DPSGD algorithm does. We also quantify
the variance of the injected noise as a decreasing function of adaptation rank.
By establishing a Berry-Esseen type bound on the total variation distance
between the injected noise distribution and a Gaussian noise distribution with
the same variance, we show that the dynamics of low-rank adaptation are very
close to those of DPSGD performed w.r.t. the adapters. Following our theoretical
findings and supported by our experimental results, we show that low-rank
adaptation provides robustness to membership inference attacks w.r.t. the
fine-tuning data.
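To make the parameter saving concrete: LoRA learns an update of the form B @ A, where B has shape d_out x r and A has shape r x d_in, so the trainable count is r * (d_out + d_in) instead of d_out * d_in. The sketch below only does this counting; the layer size and rank are hypothetical values chosen for illustration, and it does not model the noise analysis described in the abstract.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix that the update is added to."""
    return d_out * d_in


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
low_rank = lora_trainable_params(d_out, d_in, rank)
dense = dense_params(d_out, d_in)
ratio = low_rank / dense
print(f"trainable: {low_rank} of {dense} ({ratio:.4%})")
```

For this hypothetical square layer the low-rank update trains 1/256 of the dense parameter count, and the ratio grows linearly with the rank.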
♻ ☆ TokenShapley: Token Level Context Attribution with Shapley Value
Large language models (LLMs) demonstrate strong capabilities in in-context
learning, but verifying the correctness of their generated responses remains a
challenge. Prior work has explored attribution at the sentence level, but these
methods fall short when users seek attribution for specific keywords within the
response, such as numbers, years, or names. To address this limitation, we
propose TokenShapley, a novel token-level attribution method that combines
Shapley value-based data attribution with KNN-based retrieval techniques
inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed
datastore for contextual retrieval and computing Shapley values to quantify
token importance, TokenShapley provides a fine-grained data attribution
approach. Extensive evaluations on four benchmarks show that TokenShapley
outperforms state-of-the-art baselines in token-level attribution, achieving an
11-23% improvement in accuracy.
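As background on the Shapley value itself, each token's attribution is its average marginal contribution over all orderings in which tokens could be added. The sketch below computes this exactly by enumerating orderings, which is only feasible for a handful of players; it does not reproduce the KNN retrieval or precomputed datastore the abstract mentions, and the value function and token names are made up for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence, Set


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by averaging marginal gains over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen: Set[str] = set()
        for p in order:
            before = value(frozenset(seen))
            seen.add(p)
            totals[p] += value(frozenset(seen)) - before
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Made-up utility: the year matters alone, the other two only together."""
    score = 1.0 if "1889" in coalition else 0.0
    if {"Eiffel", "built"} <= coalition:
        score += 0.5
    return score


phi = shapley_values(["Eiffel", "built", "1889"], toy_value)
```

Here the token that contributes on its own receives its full value, while the two tokens that only help jointly split their shared gain equally, and the attributions sum to the value of the full context.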
♻ ☆ Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming
While large language models (LLMs) have recently demonstrated strong
potential in solving planning problems, there is a trade-off between
flexibility and complexity. LLMs, as zero-shot planners themselves, are still
not capable of directly generating valid plans for complex planning problems
such as multi-constraint or long-horizon tasks. On the other hand, many
frameworks aiming to solve complex planning problems often rely on
task-specific preparatory efforts, such as task-specific in-context examples
and pre-defined critics/verifiers, which limits their cross-task generalization
capability. In this paper, we tackle these challenges by observing that the
core of many planning problems lies in optimization problems: searching for the
optimal solution (best plan) with goals subject to constraints (preconditions
and effects of decisions). With LLMs' commonsense, reasoning, and programming
capabilities, this opens up the possibilities of a universal LLM-based approach
to planning problems. Inspired by this observation, we propose LLMFP, a
general-purpose framework that leverages LLMs to capture key information from
planning problems and formally formulate and solve them as optimization
problems from scratch, with no task-specific examples needed. We apply LLMFP to
9 planning problems, ranging from multi-constraint decision making to
multi-step planning problems, and demonstrate that LLMFP achieves on average
83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet,
significantly outperforming the best baseline (direct planning with OpenAI
o1-preview) with 37.6% and 40.7% improvements. We also validate components of
LLMFP with ablation experiments and analyze the underlying success and failure
reasons. Project page: https://sites.google.com/view/llmfp.
comment: 57 pages, 25 figures, 15 tables
♻ ☆ Neuron-Level Differentiation of Memorization and Generalization in Large Language Models
Ko-Wei Huang, Yi-Fu Fu, Ching-Yu Tsai, Yu-Chieh Tu, Tzu-Ling Cheng, Cheng-Yu Lin, Yi-Ting Yang, Heng-Yi Liu, Keng-Te Liao, Da-Cheng Juan, Shou-De Lin
We investigate how Large Language Models (LLMs) distinguish between
memorization and generalization at the neuron level. Through carefully designed
tasks, we identify distinct neuron subsets responsible for each behavior.
Experiments on both a GPT-2 model trained from scratch and a pretrained
LLaMA-3.2 model fine-tuned with LoRA show consistent neuron-level
specialization. We further demonstrate that inference-time interventions on
these neurons can steer the model's behavior toward memorization or
generalization. To assess robustness, we evaluate intra-task and inter-task
consistency, confirming that these neuron-behavior associations reflect
generalizable patterns rather than dataset-specific artifacts. Our findings
reveal modular structure in LLMs and enable controlling memorization and
generalization behaviors at inference time.
♻ ☆ Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Recent advancements in reasoning with large language models (RLLMs), such as
OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in
complex domains like mathematics and coding. A central factor in their success
lies in the application of long chain-of-thought (Long CoT) characteristics,
which enhance reasoning abilities and enable the solution of intricate
problems. However, despite these developments, a comprehensive survey on Long
CoT is still lacking, limiting our understanding of its distinctions from
traditional short chain-of-thought (Short CoT) and complicating ongoing debates
on issues like "overthinking" and "inference-time scaling." This survey seeks
to fill this gap by offering a unified perspective on Long CoT. (1) We first
distinguish Long CoT from Short CoT and introduce a novel taxonomy to
categorize current reasoning paradigms. (2) Next, we explore the key
characteristics of Long CoT: deep reasoning, extensive exploration, and
feasible reflection, which enable models to handle more complex tasks and
produce more efficient, coherent outcomes compared to the shallower Short CoT.
(3) We then investigate key phenomena such as the emergence of Long CoT with
these characteristics, including overthinking, and inference-time scaling,
offering insights into how these processes manifest in practice. (4) Finally,
we identify significant research gaps and highlight promising future
directions, including the integration of multi-modal reasoning, efficiency
improvements, and enhanced knowledge frameworks. By providing a structured
overview, this survey aims to inspire future research and further the
development of logical reasoning in artificial intelligence.
comment: Paper is available at https://long-cot.github.io/, and the GitHub
repository is available at
https://github.com/LightChen233/Awesome-Long-Chain-of-Thought-Reasoning
♻ ☆ What to Keep and What to Drop: Adaptive Table Filtering Framework
Large language models (LLMs) for table-based reasoning often struggle with
large tables due to input length limits. We propose ATF (Adaptive Table
Filtering Framework), a modular and question-aware filtering pipeline that
prunes uninformative columns and rows using LLM-generated column descriptions,
clustering, and sparse-dense alignment scores. ATF integrates seamlessly with
existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that
ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA
tasks while causing slight performance drops on Table Fact Verification, where
full-table context is more critical. These results highlight ATF's ability to
adaptively balance informativeness and minimalism across tasks.
comment: 26 pages, 9 figures
♻ ☆ NoLiMa: Long-Context Evaluation Beyond Literal Matching ICML 2025
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Recent large language models (LLMs) support long contexts ranging from 128K
to 1M tokens. A popular method for evaluating these capabilities is the
needle-in-a-haystack (NIAH) test, which involves retrieving a "needle"
(relevant information) from a "haystack" (long irrelevant context). Extensions
of this approach include increasing distractors, fact chaining, and in-context
reasoning. However, in these benchmarks, models can exploit existing literal
matches between the needle and haystack to simplify the task. To address this,
we introduce NoLiMa, a benchmark extending NIAH with a carefully designed
needle set, where questions and needles have minimal lexical overlap, requiring
models to infer latent associations to locate the needle within the haystack.
We evaluate 13 popular LLMs that claim to support contexts of at least 128K
tokens. While they perform well in short contexts (<1K), performance degrades
significantly as context length increases. At 32K, for instance, 11 models drop
below 50% of their strong short-length baselines. Even GPT-4o, one of the
top-performing exceptions, experiences a reduction from an almost-perfect
baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the
increased difficulty the attention mechanism faces in longer contexts when
literal matches are absent, making it harder to retrieve relevant information.
Even models enhanced with reasoning capabilities or CoT prompting struggle to
maintain performance in long contexts. We publicly release the dataset and
evaluation code at https://github.com/adobe-research/NoLiMa.
comment: Accepted at ICML 2025
♻ ☆ Adaptive Elicitation of Latent Information Using Natural Language ICML 2025
Eliciting information to reduce uncertainty about a latent entity is a
critical task in many application domains, e.g., assessing individual student
learning outcomes, diagnosing underlying diseases, or learning user
preferences. Though natural language is a powerful medium for this purpose,
large language models (LLMs) and existing fine-tuning algorithms lack
mechanisms for strategically gathering information to refine their own
understanding of the latent entity. To harness the generalization power and
world knowledge of LLMs in developing effective information-gathering
strategies, we propose an adaptive elicitation framework that actively reduces
uncertainty on the latent entity. Since probabilistic modeling of an abstract
latent entity is difficult, our framework adopts a predictive view of
uncertainty, using a meta-learned language model to simulate future
observations and enable scalable uncertainty quantification over complex
natural language. Through autoregressive forward simulation, our model
quantifies how new questions reduce epistemic uncertainty, enabling the
development of sophisticated information-gathering strategies to choose the
most informative next queries. In experiments on the 20 questions game, dynamic
opinion polling, and adaptive student assessment, our method consistently
outperforms baselines in identifying critical unknowns and improving downstream
predictions, illustrating the promise of strategic information gathering in
natural language settings.
comment: ICML 2025
♻ ☆ EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning SIGDIAL 2025
Recent advances in reinforcement learning (RL) for large language model (LLM)
fine-tuning show promise in addressing multi-objective tasks but still face
significant challenges, including competing objective balancing, low training
efficiency, poor scalability, and limited explainability. Leveraging ensemble
learning principles, we introduce an Ensemble Multi-Objective RL (EMORL)
framework that fine-tunes multiple models with individual objectives while
optimizing their aggregation after the fine-tuning to improve efficiency and
flexibility. Our method is the first to aggregate the hidden states of
individual models, incorporating contextual information from multiple
objectives. This approach is supported by a hierarchical grid search algorithm
that identifies optimal weighted combinations. We evaluate EMORL on counselor
reflection generation tasks, using text classification models to score the
generations and provide rewards during RL fine-tuning. Through comprehensive
experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of
EMORL against existing baselines: significantly lower and more stable training
consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds),
improved scalability and explainability, and comparable performance across
multiple objectives.
comment: 14 pages, 9 figures, accepted by the SIGDIAL 2025 conference
♻ ☆ CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation ACL
Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Medical quality control indicators are essential to assess the qualifications
of healthcare institutions for medical services. With the impressive
performance of large language models (LLMs) like GPT-4 in the medical field,
leveraging these technologies for the Medical Quality Control Indicator
Calculation (MQCIC) presents a promising approach. In this work, (1) we
introduce a real-world task MQCIC and propose an open-source Chinese electronic
medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances
and 76 indicators. (2) We propose a semi-automatic method to enhance the rule
representation. Then we propose the Clinical Facts-based Inferential Rule
(CF-IR) method that disentangles the clinical fact verification and inferential
rule reasoning actions. (3) We conduct comprehensive experiments on 20
representative LLMs, covering general and medical models. Our findings reveal
that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct
an error analysis and investigate the capabilities of clinical fact
verification and inferential rule reasoning, providing insights to further
improve performance on MQCIC. The dataset and code are available in this
repository: https://github.com/YuY-2001/C-MQCIC.
comment: 2025 ACL Findings
♻ ☆ Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models
Multilingual Large Language Models (LLMs) have considerably changed how
technologies can influence language. While previous technologies could mediate
or assist humans, there is now a tendency to offload the task of writing itself
to these technologies, enabling them to change our linguistic ecosystem more
directly. While they provide us quick access to information and impressively
fluent output, beneath their apparent sophistication lies a subtle, more
insidious threat: the gradual decline and loss of linguistic diversity. With
this opinion piece, I explore how model collapse, with a particular focus on
translation technology, can lead to the loss of linguistic forms, grammatical
features, and cultural nuance. Model collapse refers to the eventual
consequence of self-consuming training loops, where models reinforce their own
biases and lose linguistic diversity. Drawing on recent work in Computer
Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue
that the tails of our linguistic distributions are vanishing, and with them,
the narratives and identities they carry. This is a call to resist linguistic
flattening and to reimagine NLP as a field that encourages, values and protects
expressive multilingual lexical and linguistic diversity and creativity.
comment: 12 pages
♻ ☆ Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts ACL 2025
Large Language Models (LLMs) are increasingly employed as automated
evaluators to assess the safety of generated content, yet their reliability in
this role remains uncertain. This study evaluates a diverse set of 11 LLM judge
models across critical safety domains, examining three key aspects:
self-consistency in repeated judging tasks, alignment with human judgments, and
susceptibility to input artifacts such as apologetic or verbose phrasing. Our
findings reveal that biases in LLM judges can significantly distort the final
verdict on which content source is safer, undermining the validity of
comparative evaluations. Notably, apologetic language artifacts alone can skew
evaluator preferences by up to 98%. Contrary to expectations, larger models do
not consistently exhibit greater robustness, while smaller models sometimes
show higher resistance to specific artifacts. To mitigate LLM evaluator
robustness issues, we investigate jury-based evaluations aggregating decisions
from multiple models. Although this approach both improves robustness and
enhances alignment to human judgements, artifact sensitivity persists even with
the best jury configurations. These results highlight the urgent need for
diversified, artifact-resistant methodologies to ensure reliable safety
assessments.
comment: 9 pages, ACL 2025
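The jury-based aggregation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes simple majority voting over per-judge verdicts; the abstract does not specify the aggregation rule, and the name `jury_verdict` and the `"tie"` label are hypothetical.

```python
from collections import Counter
from typing import Iterable


def jury_verdict(votes: Iterable[str], tie_label: str = "tie") -> str:
    """Aggregate per-judge verdicts by simple majority (an assumed rule)."""
    counts = Counter(votes)
    if not counts:
        raise ValueError("at least one vote is required")
    ranked = counts.most_common()  # sorted by count, highest first
    # If the two most frequent verdicts have equal counts, report a tie.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Three hypothetical judges compare content sources "A" and "B".
print(jury_verdict(["A", "B", "A"]))
```

Majority voting is only one possible choice; weighted or unanimity-based schemes would slot into the same function signature.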
♻ ☆ Test-Time Scaling with Reflective Generative Model
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
We introduce our first reflective generative model MetaStone-S1, which
obtains OpenAI o3-mini's performance via the new Reflective Generative Form.
The new form focuses on high-quality reasoning trajectory selection and
contains two novelties: 1) A unified interface for policy and process reward
model: we share the backbone network and use task-specific heads for reasoning
trajectory predicting and scoring respectively, introducing only 53M extra
parameters for trajectory scoring. 2) Eliminating the reliance on process-level
annotation: we provide a self-supervised process reward model, which can
directly learn the high-quality reasoning trajectory selection from the outcome
reward. Equipped with the reflective generative form, MetaStone-S1 is naturally
suitable for test-time scaling, and we provide three reasoning effort modes
(low, medium, and high) based on the controllable thinking length. Experiments
demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI
o3-mini's series with only 32B parameter size. To support the research
community, we have open-sourced MetaStone-S1 at
https://github.com/MetaStone-AI/MetaStone-S1.
♻ ☆ GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Despite the growing interest in jailbreak methods as an effective red-teaming
tool for building safe and responsible large language models (LLMs), flawed
evaluation system designs have led to significant discrepancies in their
effectiveness assessments. We conduct a systematic measurement study based on
37 jailbreak studies since 2022, focusing on both the methods and the
evaluation systems they employ. We find that existing evaluation systems lack
case-specific criteria, resulting in misleading conclusions about their
effectiveness and safety implications. This paper advocates a shift to a more
nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel
benchmark comprising a curated harmful question dataset, detailed case-by-case
evaluation guidelines and an evaluation system integrated with these guidelines
-- GuidedEval. Experiments demonstrate that GuidedBench offers more accurate
measurements of jailbreak performance, enabling meaningful comparisons across
methods and uncovering new insights overlooked in previous evaluations.
GuidedEval reduces inter-evaluator variance by at least 76.03%. Furthermore,
we observe that incorporating guidelines can enhance the effectiveness of
jailbreak methods themselves, offering new insights into both attack strategies
and evaluation paradigms.
comment: Homepage: https://sproutnan.github.io/AI-Safety_Benchmark/
♻ ☆ Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons ACL 2025
Large Language Models (LLMs) have been shown to be effective evaluators across
various domains such as machine translation or the scientific domain. Current
LLM-as-a-Judge approaches rely mostly on individual assessments or a single
round of pairwise assessments, preventing the judge LLM from developing a
global ranking perspective. To address this, we present Knockout Assessment, an
LLM-as-a-Judge method using a knockout tournament system with iterative pairwise
comparisons. Experiments across three LLMs on two datasets show that knockout
assessment improves scoring accuracy, increasing Pearson correlation with
expert evaluations by 0.07 on average for university-level exam scoring and
machine translation evaluations, aligning LLM assessments more closely with
human scoring.
comment: Accepted to GEM @ ACL 2025
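The knockout idea in the abstract above can be sketched as a single-elimination bracket over candidate answers. This is an illustration under stated assumptions, not the authors' implementation: `knockout_winner` and `toy_judge` are hypothetical names, and `toy_judge` stands in for a pairwise LLM call by simply preferring the longer string. How the paper turns bracket results into scores is not described in the abstract and is not modelled here.

```python
from typing import Callable, List, Sequence


def knockout_winner(
    candidates: Sequence[str], judge: Callable[[str, str], str]
) -> str:
    """Return the candidate that survives repeated pairwise comparisons."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    current: List[str] = list(candidates)
    while len(current) > 1:
        next_round: List[str] = []
        # Compare neighbours; the judge returns the preferred one.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
        # With an odd number of candidates, the last one gets a bye.
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0]


def toy_judge(a: str, b: str) -> str:
    """Hypothetical stand-in for an LLM judge: prefer the longer answer."""
    return a if len(a) >= len(b) else b


print(knockout_winner(["a", "bbb", "cc", "dddd", "e"], toy_judge))
```

A bracket over n candidates needs n - 1 pairwise judgments, compared with n(n - 1)/2 for a full round-robin.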
♻ ☆ LLM-based User Profile Management for Recommender System SIGIR'25
The rapid advancement of Large Language Models (LLMs) has opened new
opportunities in recommender systems by enabling zero-shot recommendation
without conventional training. Despite their potential, most existing works
rely solely on users' purchase histories, leaving significant room for
improvement by incorporating user-generated textual data, such as reviews and
product descriptions. Addressing this gap, we propose PURE, a novel LLM-based
recommendation framework that builds and maintains evolving user profiles by
systematically extracting and summarizing key information from user reviews.
PURE consists of three core components: a Review Extractor for identifying user
preferences and key product features, a Profile Updater for refining and
updating user profiles, and a Recommender for generating personalized
recommendations using the most current profile. To evaluate PURE, we introduce
a continuous sequential recommendation task that reflects real-world scenarios
by adding reviews over time and updating predictions incrementally. Our
experimental results on Amazon datasets demonstrate that PURE outperforms
existing LLM-based methods, effectively leveraging long-term user information
while managing token limitations.
comment: Accepted GENNEXT@SIGIR'25 Workshop
♻ ☆ Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
Large Language Models (LLMs) have demonstrated remarkable capabilities across
a wide range of tasks requiring complex reasoning. However, the effects of
scaling on their reasoning abilities remain insufficiently understood. In this
paper, we introduce a synthetic multihop reasoning environment designed to
closely replicate the structure and distribution of real-world large-scale
knowledge graphs. Our reasoning task involves completing missing edges in the
graph, which requires advanced multi-hop reasoning and mimics real-world
reasoning scenarios. To evaluate this, we pretrain language models (LMs) from
scratch solely on triples from the incomplete graph and assess their ability to
infer the missing edges. Interestingly, we observe that overparameterization
can impair reasoning performance due to excessive memorization. We investigate
different factors that affect this U-shaped loss curve, including graph
structure, model size, and training steps. To predict the optimal model size
for a specific knowledge graph, we find an empirical scaling that linearly maps
the knowledge graph search entropy to the optimal model size. This work
provides new insights into the relationship between scaling and reasoning in
LLMs, shedding light on possible ways to optimize their performance for
reasoning tasks.
♻ ☆ A Survey on Prompt Tuning
This survey reviews prompt tuning, a parameter-efficient approach for
adapting language models by prepending trainable continuous vectors while
keeping the model frozen. We classify existing approaches into two categories:
direct prompt learning and transfer learning. Direct prompt learning methods
include: general optimization approaches, encoder-based methods, decomposition
strategies, and mixture-of-experts frameworks. Transfer learning methods
consist of: general transfer approaches, encoder-based methods, and
decomposition strategies. For each method, we analyze method designs,
innovations, insights, advantages, and disadvantages, with illustrative
visualizations comparing different frameworks. We identify challenges in
computational efficiency and training stability, and discuss future directions
in improving training robustness and broadening application scope.
♻ ☆ Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base
The effectiveness of Large Language Models (LLMs) in legal reasoning is often
limited due to the unique legal terminologies and the necessity for highly
specialized knowledge. These limitations highlight the need for high-quality
data tailored for complex legal reasoning tasks. This paper introduces
LegalSemi, a benchmark specifically curated for legal scenario analysis.
LegalSemi comprises 54 legal scenarios, each rigorously annotated by legal
experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion)
framework from Malaysian Contract Law. In addition, LegalSemi is accompanied by
a structured knowledge base (SKE). A series of experiments were conducted to
assess the usefulness of LegalSemi for IRAC analysis. The experimental results
demonstrate the effectiveness of incorporating the SKE for issue
identification, rule retrieval, application and conclusion generation using
four different LLMs.
♻ ☆ Probing and Steering Evaluation Awareness of Language Models ICML 2025
Language models can distinguish between testing and deployment phases -- a
capability known as evaluation awareness. This has significant safety and
policy implications, potentially undermining the reliability of evaluations
that are central to AI governance frameworks and voluntary industry
commitments. In this paper, we study evaluation awareness in
Llama-3.3-70B-Instruct. We show that linear probes can separate real-world
evaluation and deployment prompts, suggesting that current models internally
represent this distinction. We also find that current safety evaluations are
correctly classified by the probes, suggesting that they already appear
artificial or inauthentic to models. Our findings underscore the importance of
ensuring trustworthy evaluations and understanding deceptive capabilities. More
broadly, our work showcases how model internals may be leveraged to support
blackbox methods in safety audits, especially for future models more competent
at evaluation awareness and deception.
comment: Actionable Interpretability Workshop (Poster) and Workshop on
Technical AI Governance (Poster) at ICML 2025, Vancouver, Canada
♻ ☆ PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER) AAAI
Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
The use of Natural Language Processing (NLP) in high-stakes AI-based
applications has increased significantly in recent years, especially since the
emergence of Large Language Models (LLMs). However, despite their strong
performance, LLMs introduce important legal/ethical concerns, particularly
regarding privacy, data protection, and transparency. Due to these concerns,
this work explores the use of Named-Entity Recognition (NER) to facilitate the
privacy-preserving training (or adaptation) of LLMs. We propose a framework
that uses NER technologies to anonymize sensitive information in text data,
such as personal identities or geographic locations. An evaluation of the
proposed privacy-preserving learning framework was conducted to measure its
impact on user privacy and system performance in a particular high-stakes and
sensitive setup: AI-based resume scoring for recruitment processes. The study
involved two language models (BERT and RoBERTa) and six anonymization
algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT)
applied to a database of 24,000 candidate profiles. The findings indicate that
the proposed privacy preservation techniques effectively maintain system
performance while playing a critical role in safeguarding candidate
confidentiality, thus promoting trust in the experimented scenario. On top of
the proposed privacy-preserving approach, we also experiment with applying an
existing approach that reduces the gender bias in LLMs, thus finally obtaining
our proposed Privacy- and Bias-aware LLMs (PBa-LLMs). Note that the proposed
PBa-LLMs have been evaluated in a particular setup (resume scoring), but are
generally applicable to any other LLM-based AI application.
comment: Presented at AAAI Workshop on Privacy-Preserving Artificial
Intelligence (PPAI) 2025, Philadelphia, PA, USA, March 2025
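The NER-based anonymization step described in the abstract above can be sketched as replacing detected entity spans with placeholder tags. This sketch assumes the spans have already been produced by some NER tool; the `anonymize` function, the `Span` tuple layout, and the example labels are hypothetical and do not reproduce any of the six anonymization algorithms the paper evaluates.

```python
from typing import List, Tuple

# (start offset, end offset, entity label); the layout is an assumption.
Span = Tuple[int, int, str]


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each character span with a placeholder such as <PERSON>."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:start])  # keep text before the entity
        pieces.append(f"<{label}>")        # mask the entity itself
        cursor = end
    pieces.append(text[cursor:])           # keep the trailing text
    return "".join(pieces)


sentence = "Ana Ruiz lives in Madrid."
detected = [(0, 8, "PERSON"), (18, 24, "LOCATION")]  # hand-written spans
print(anonymize(sentence, detected))
```

Keeping the entity label in the placeholder preserves some structure for a downstream model while removing the identifying value.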
♻ ☆ Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives ACL 2024
Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Humans use multiple senses to comprehend the environment. Vision and language
are two of the most vital senses since they allow us to easily communicate our
thoughts and perceive the world around us. There has been a lot of interest in
creating video-language understanding systems with human-like senses since a
video-language pair can mimic both our linguistic medium and visual environment
with temporal dynamics. In this survey, we review the key tasks of these
systems and highlight the associated challenges. Based on the challenges, we
summarize their methods from model architecture, model training, and data
perspectives. We also conduct performance comparison among the methods, and
discuss promising directions for future research.
comment: Accepted at ACL 2024 (Findings)
♻ ☆ Can Input Attributions Explain Inductive Reasoning in In-Context Learning? ACL 2025
Interpreting the internal process of neural models has long been a challenge.
This challenge remains relevant in the era of large language models (LLMs) and
in-context learning (ICL); for example, ICL poses a new issue of interpreting
which example in the few-shot examples contributed to identifying/solving the
task. To this end, in this paper, we design synthetic diagnostic tasks of
inductive reasoning, inspired by the generalization tests typically adopted in
psycholinguistics. Here, most in-context examples are ambiguous w.r.t. their
underlying rule, and one critical example disambiguates it. The question is
whether conventional input attribution (IA) methods can track such a reasoning
process, i.e., identify the influential example, in ICL. Our experiments
provide several practical findings; for example, a certain simple IA method
works the best, and the larger the model, the generally harder it is to
interpret the ICL with gradient-based IA methods.
comment: Findings of ACL 2025
♻ ☆ Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions
Large Language Models (LLMs) have gained enormous attention in recent years
due to their capability of understanding and generating natural languages. With
the rapid development and wide-ranging applications (e.g., Agents, Embodied
Intelligence), the robustness of LLMs has received increased attention. Because
LLMs are the core brain of many AI applications, robustness requires that models
not only generate consistent content, but also ensure the correctness
and stability of generated content when dealing with unexpected application
scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution
(OOD) applications, etc.). In this survey paper, we conduct a thorough review of
the robustness of LLMs, aiming to provide a comprehensive terminology of
concepts and methods around this field and facilitate the community.
Specifically, we first give a formal definition of LLM robustness and present
the collection protocol of this survey paper. Then, based on the types of
perturbed inputs, we organize this survey from the following perspectives: 1)
Adversarial Robustness: tackling the problem that prompts are manipulated
intentionally, such as noise prompts, long context, data attack, etc; 2) OOD
Robustness: dealing with the unexpected real-world application scenarios, such
as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of
Robustness: summarizing the new evaluation datasets, metrics, and tools for
verifying the robustness of LLMs. After reviewing the representative work from
each perspective, we discuss and highlight future opportunities and research
directions in this field. Meanwhile, we also organize related works and provide
an easy-to-search project
(https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the
community.
comment: 33 pages, 5 figures
♻ ☆ CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback
Large Language Models (LLMs) have demonstrated remarkable capabilities across
various NLP tasks but struggle with code-mixed (or code-switched) language
understanding. For example, prior work benchmarking the performance of
multilingual LLMs on code-mixed translation tasks has demonstrated that current
state-of-the-art multilingual LLMs are ineffective in dealing with code-mixed
languages. However, the question of how to improve the capability of
multilingual LLMs to handle code-mixed language has not received any attention
to date. In this paper, we tackle this research gap by proposing CHAI, a novel
general-purpose framework for improving the ability of multilingual LLMs to
handle code-mixed languages. CHAI relies on three novel contributions made in
this paper. First, we explore the ability of LLMs to provide accurate
annotations for code-mixed translation tasks. Second, we leverage this ability
of LLMs as annotators to generate preference data for code-mixed translation
tasks at scale, which are then used within a reinforcement learning from AI
feedback (RLAIF) procedure to improve LLMs' capability on code-mixed tasks.
Third, we conduct a rigorous experimental evaluation across various real-world
datasets and settings. Our analysis shows that CHAI-powered LLMs outperform
state-of-the-art open-source LLMs by 25.66% (in terms of win rate adjudicated
by human annotators) in code-mixed translation tasks. This work represents a
first step towards developing more inclusive code-mixed LLMs.
comment: full draft v2: 8 pages, 3 figures
♻ ☆ AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework
Answering natural language (NL) questions about tables, known as Tabular
Question Answering (TQA), is crucial because it allows users to quickly and
efficiently extract meaningful insights from structured data, effectively
bridging the gap between human language and machine-readable formats. Many of
these tables are derived from web sources or real-world scenarios, which
require meticulous data preparation (or data prep) to ensure accurate
responses. However, preparing such tables for NL questions introduces new
requirements that extend beyond traditional data preparation. This
question-aware data preparation involves specific tasks such as column
derivation and filtering tailored to particular questions, as well as
question-aware value normalization or conversion, highlighting the need for a
more nuanced approach in this context. Because each of the above tasks is
unique, a single model (or agent) may not perform effectively across all
scenarios. In this paper, we propose AutoPrep, a large language model
(LLM)-based multi-agent framework that leverages the strengths of multiple
agents, each specialized in a certain type of data prep, ensuring more accurate
and contextually relevant responses. Given an NL question over a table,
AutoPrep performs data prep through three key components. Planner: Determines a
logical plan, outlining a sequence of high-level operations. Programmer:
Translates this logical plan into a physical plan by generating the
corresponding low-level code. Executor: Executes the generated code to process
the table. To support this multi-agent framework, we design a novel
Chain-of-Clauses reasoning mechanism for high-level operation suggestion, and a
tool-augmented method for low-level code generation.
♻ ☆ FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
Auto-regressive Large Language Models (LLMs) demonstrate remarkable
performance across different domains such as vision and language processing.
However, due to sequential processing through a stack of transformer layers,
autoregressive decoding faces significant computation/latency challenges,
particularly in resource-constrained environments like mobile and edge devices.
Existing approaches in literature that aim to improve latency via skipping
layers have two distinct flavors - 1) Early exit, and 2) Input-agnostic
heuristics where tokens exit at pre-determined layers irrespective of input
sequence. Both the above strategies have limitations - the former cannot be
applied to handle the KV caching necessary for speed-ups in modern frameworks, and
the latter does not capture the variation in layer importance across tasks or
more generally, across input sequences. To address both limitations, we propose
FiRST, an algorithm that reduces inference latency by using layer-specific
routers to select a subset of transformer layers adaptively for each input
sequence - the prompt (during the prefill stage) decides which layers will be
skipped during decoding. FiRST preserves compatibility with KV caching enabling
faster inference while being quality-aware. FiRST is model-agnostic and can be
easily enabled on any pre-trained LLM. Our approach reveals that input
adaptivity is critical - indeed, different task-specific middle layers play a
crucial role in evolving hidden representations depending on tasks. Extensive
experiments show that FiRST significantly reduces latency while outperforming
other layer selection strategies in quality metrics. It retains competitive
performance relative to the base model (without layer skipping) and in some cases, even
improves upon it. FiRST is thus a promising and efficient solution for LLM
deployment in low-resource environments.
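
The idea in the FiRST abstract that the prompt decides during prefill which layers are skipped during decoding can be sketched as follows. This is a minimal toy under stated assumptions, not the FiRST algorithm itself: the router form (a sigmoid over a dot product with the mean-pooled prompt state), the 0.5 threshold, and all function names are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = list[float]
Layer = Callable[[Vector], Vector]


def mean_pool(states: Sequence[Vector]) -> Vector:
    """Average the per-token hidden states of the prompt into one vector."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def router_keep(weights: Vector, pooled: Vector, threshold: float = 0.5) -> bool:
    """Toy router (assumption): sigmoid(w . x) >= threshold means 'run this layer'."""
    logit = sum(w * x for w, x in zip(weights, pooled))
    return 1.0 / (1.0 + math.exp(-logit)) >= threshold


def prefill(
    prompt_states: Sequence[Vector],
    layers: Sequence[Layer],
    router_weights: Sequence[Vector],
) -> list[int]:
    """Run the prompt through the stack; each layer's router decides keep or skip.

    Returns the indices of the kept layers, fixed for the rest of the sequence.
    """
    states = list(prompt_states)
    kept: list[int] = []
    for index, (layer, weights) in enumerate(zip(layers, router_weights)):
        if router_keep(weights, mean_pool(states)):
            states = [layer(state) for state in states]
            kept.append(index)
        # A skipped layer leaves the hidden states unchanged.
    return kept


def decode_step(token_state: Vector, layers: Sequence[Layer], kept: Sequence[int]) -> Vector:
    """Decode one token using only the layers selected during prefill."""
    for index in kept:
        token_state = layers[index](token_state)
    return token_state
```

Because the kept set is fixed once per sequence, every decoded token passes through the same layers, so a per-layer KV cache built during prefill stays consistent for the whole generation.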
♻ ☆ FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools
Current financial large language models (FinLLMs) struggle with two critical
limitations: the absence of objective evaluation metrics to assess the quality
of stock analysis reports and a lack of depth in stock analysis, which impedes
their ability to generate professional-grade insights. To address these
challenges, this paper introduces FinSphere, a stock analysis agent, along with
three major contributions: (1) AnalyScore, a systematic evaluation framework
for assessing stock analysis quality, (2) Stocksis, a dataset curated by
industry experts to enhance LLMs' stock analysis capabilities, and (3)
FinSphere, an AI agent that can generate high-quality stock analysis reports in
response to user queries. Experiments demonstrate that FinSphere achieves
superior performance compared to both general and domain-specific LLMs, as well
as existing agent-based systems, even when they are enhanced with real-time
data access and few-shot guidance. The integrated framework, which combines
real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields
substantial improvements in both analytical quality and practical applicability
for real-world stock analysis.
♻ ☆ Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving
Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Existing approaches to mathematical reasoning with large language models
(LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated
Reasoning (TIR) for precise computation. While efforts have been made to
combine these methods, they primarily rely on post-selection or predefined
strategies, leaving an open question: whether LLMs can autonomously adapt their
reasoning strategy based on their inherent capabilities. In this work, we
propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework
that enables LLMs to personalize their reasoning strategy spontaneously,
aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware
data selection during supervised fine-tuning (SFT) to tailor training data to
the model's unique abilities. This approach equips LLMs to autonomously
determine and apply the appropriate reasoning strategy at test time. We
evaluate TATA through extensive experiments on six mathematical reasoning
benchmarks, using both general-purpose and math-specialized LLMs. Empirical
results demonstrate that TATA effectively combines the complementary strengths
of CoT and TIR, achieving superior or comparable performance with improved
inference efficiency compared to TIR alone. Further analysis underscores the
critical role of aptitude-aware data selection in enabling LLMs to make
effective and adaptive reasoning decisions and align reasoning strategies with
model capabilities.
comment: 8 pages
♻ ☆ DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Native multimodal large language models (MLLMs) restructure a single large
language model (LLM) into a spoken language model (SLM) capable of both speech
and text generation. Compared to modular and aligned MLLMs, native MLLMs
preserve richer paralinguistic features such as emotion and prosody, and
generate speech responses directly within the backbone LLM rather than using a
separate speech decoder. This integration also results in lower response
latency and smoother interaction. However, native MLLMs suffer from
catastrophic forgetting and performance degradation because the available
paired speech-text data is insufficient to support the pretraining of MLLMs
compared to the vast amount of text data required to pretrain text LLMs. To
address this issue, we propose DeepTalk, a framework for adaptive modality
expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk
first adaptively distinguishes modality experts according to their modality
load within the LLM. Each modality expert then undergoes specialized
single-modality training, followed by joint multimodal collaborative training.
As a result, DeepTalk incurs only a 5.5% performance drop compared to the
original LLM, which is significantly lower than the average performance drop of
over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par
with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within
0.5 seconds, ensuring a seamless and intelligent speech interaction experience.
Code and models are released at https://github.com/talkking/DeepTalk.
comment: Under Review
♻ ☆ Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
While slow-thinking large language models (LLMs) exhibit reflection-like
reasoning, commonly referred to as the "aha moment", their ability to generate
informative critiques and refine prior solutions remains limited. In this
paper, we introduce Double-Checker, a principled framework designed to enhance
the reasoning capabilities of slow-thinking LLMs by fostering explicit
self-critique and iterative refinement of their previous solutions. By
fine-tuning on our curated 1,730 self-critical instances, Double-Checker
empowers long-CoT LLMs to iteratively critique and refine their outputs during
inference until they evaluate their solutions as correct under self-generated
critiques. We validate the efficacy of Double-Checker across a comprehensive
suite of reasoning benchmarks, demonstrating that iterative self-critique
significantly enhances the reasoning capabilities of long-CoT LLMs. Notably,
our Double-Checker increases the pass@1 performance on challenging AIME
benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These
results highlight a promising direction for developing more trustworthy and
effective LLMs capable of structured self-critique. Our codes and data are
available at https://github.com/XinXU-USTC/DoubleChecker
comment: 10 pages
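
The inference-time loop described in the Double-Checker abstract (critique, refine, and stop once the model judges its own solution correct) can be sketched generically. This is an illustration under stated assumptions rather than the authors' code: `solve`, `critique`, and `refine` are hypothetical callables standing in for LLM calls, and the `max_rounds` cap is an added safeguard.

```python
from typing import Callable, NamedTuple, TypeVar

S = TypeVar("S")  # solution type


class Critique(NamedTuple):
    is_correct: bool  # the model's own verdict on the current solution
    feedback: str     # free-text critique passed to the refinement step


def double_check(
    problem: str,
    solve: Callable[[str], S],
    critique: Callable[[str, S], Critique],
    refine: Callable[[str, S, str], S],
    max_rounds: int = 3,
) -> tuple[S, int]:
    """Critique and refine until the self-critique accepts the solution.

    Returns the final solution and the number of refinement rounds used.
    If the cap is reached, the last refined solution is returned unchecked.
    """
    solution = solve(problem)
    for rounds_used in range(max_rounds):
        verdict = critique(problem, solution)
        if verdict.is_correct:
            return solution, rounds_used
        solution = refine(problem, solution, verdict.feedback)
    return solution, max_rounds
```

For example, with a solver that first answers 4 to "2+3", a critic that accepts only 5, and a refiner that adds 1, the loop returns the corrected answer after one refinement round.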
♻ ☆ Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs
Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Yanhao Jia, Luwei Xiao, Cong-Duy Nguyen, Luu Anh Tuan
Despite being widely applied due to their exceptional capabilities, Large
Language Models (LLMs) have been proven to be vulnerable to backdoor attacks.
These attacks introduce targeted vulnerabilities into LLMs by poisoning
training samples and full-parameter fine-tuning (FPFT). However, such
backdoor attacks are limited since they require significant computational
resources, especially as the size of LLMs increases. Besides,
parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted
parameter updating may impede the alignment of triggers with target labels. In
this study, we first verify that backdoor attacks with PEFT may encounter
challenges in achieving feasible performance. To address these issues and
improve the effectiveness of backdoor attacks with PEFT, we propose a novel
weak-to-strong backdoor attack algorithm based on Feature
Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison
small-scale language models through FPFT to serve as the teacher model. The
teacher model then covertly transfers the backdoor to the large-scale student
model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD
has the potential to augment the effectiveness of backdoor attacks. We
demonstrate the superior performance of FAKD on classification tasks across
four language models, four backdoor attack algorithms, and two different
architectures of teacher models. Experimental results indicate success rates
close to 100% for backdoor attacks targeting PEFT.
♻ ☆ GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification
Integrating powerful but computationally expensive Pre-trained Language
Models (PLMs) with Graph Neural Networks (GNNs) is a key challenge, especially
on text-rich heterophilic graphs. We propose the Graph Masked Language Model
(GMLM), a framework designed for the efficient and effective fusion of graph
structure and text semantics. GMLM employs a two-stage process: first, a
contrastive pre-training stage with a novel soft masking technique builds a
robust multi-scale GNN; second, an end-to-end fine-tuning stage uses a dynamic
active node selection strategy for scalability and a bi-directional
cross-attention module for deep fusion. Experiments on five heterophilic
benchmarks show GMLM achieves state-of-the-art results on four, significantly
outperforming prior GNN and large LLM-based methods. For instance, it improves
accuracy on the Texas dataset by over 8% and on Wisconsin by nearly 5%. Our
work demonstrates that a sophisticated, deeply-integrated architecture can be
more effective and efficient than larger, general-purpose models for text-rich
graph representation learning.
♻ ☆ ModelCitizens: Representing Community Voices in Online Safety
Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
Automatic toxic language detection is critical for creating safe, inclusive
online spaces. However, it is a highly subjective task, with perceptions of
toxic language shaped by community norms and lived experience. Existing
toxicity detection models are typically trained on annotations that collapse
diverse annotator perspectives into a single ground truth, erasing important
context-specific notions of toxicity such as reclaimed language. To address
this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K
toxicity annotations across diverse identity groups. To capture the role of
conversational context on toxicity, typical of social media posts, we augment
MODELCITIZENS posts with LLM-generated conversational scenarios.
State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API,
GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on
context-augmented posts. Finally, we release LLAMACITIZEN-8B and
GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS,
which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our
findings highlight the importance of community-informed annotation and modeling
for inclusive content moderation. The data, models and code are available at
https://github.com/asuvarna31/modelcitizens.
♻ ☆ Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
Large vision-language contrastive models (VLCMs), such as CLIP, have become
foundational, demonstrating remarkable success across a variety of downstream
tasks. Despite their advantages, these models, akin to other foundational
systems, inherit biases from the disproportionate distribution of real-world
data, leading to misconceptions about the actual environment. Prevalent
datasets like ImageNet are often riddled with non-causal, spurious correlations
that can diminish VLCM performance in scenarios where these contextual elements
are absent. This study presents an investigation into how a simple linear probe
can effectively distill task-specific core features from CLIP's embedding for
downstream applications. Our analysis reveals that the CLIP text
representations are often tainted by spurious correlations inherited from the
biased pre-training dataset. Empirical evidence suggests that relying on visual
representations from CLIP, as opposed to text embeddings, is more effective for
refining the skewed perceptions in VLCMs, emphasizing the superior utility of
visual representations in overcoming embedded biases. Our code can be found
here.
comment: 10 pages, 8 figures
♻ ☆ Can adversarial attacks by large language models be attributed?
Attributing outputs from Large Language Models (LLMs) in adversarial
settings, such as cyberattacks and disinformation campaigns, presents significant
challenges that are likely to grow in importance. We approach this attribution
problem from both a theoretical and an empirical perspective, drawing on formal
language theory (identification in the limit) and data-driven analysis of the
expanding LLM ecosystem. By modeling an LLM's set of possible outputs as a
formal language, we analyze whether finite samples of text can uniquely
pinpoint the originating model. Our results show that, under mild assumptions
of overlapping capabilities among models, certain classes of LLMs are
fundamentally non-identifiable from their outputs alone. We delineate four
regimes of theoretical identifiability: (1) an infinite class of deterministic
(discrete) LLM languages is not identifiable (Gold's classical result from
1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by
extension of the deterministic case); (3) a finite class of deterministic LLMs
is identifiable (consistent with Angluin's tell-tale criterion); and (4) even a
finite class of probabilistic LLMs can be non-identifiable (we provide a new
counterexample establishing this negative result). Complementing these
theoretical insights, we quantify the explosion in the number of plausible
model origins (hypothesis space) for a given output in recent years. Even under
conservative assumptions (each open-source model fine-tuned on at most one new
dataset), the count of distinct candidate models doubles approximately every 0.5
years, and allowing multi-dataset fine-tuning combinations yields doubling
times as short as 0.28 years. This combinatorial growth, alongside the
extraordinary computational cost of brute-force likelihood attribution across
all models and potential users, renders exhaustive attribution infeasible in
practice.
comment: 21 pages, 5 figures, 2 tables
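
To make the doubling times quoted in the attribution abstract concrete, the short sketch below converts a doubling time into a yearly growth factor. Only the 0.5-year and 0.28-year figures come from the abstract; the function names and the example starting count are illustrative assumptions.

```python
def annual_growth_factor(doubling_time_years: float) -> float:
    """Factor by which a quantity multiplies in one year, given its doubling time."""
    return 2.0 ** (1.0 / doubling_time_years)


def candidates_after(initial: float, years: float, doubling_time_years: float) -> float:
    """Size of the hypothesis space after `years`, starting from `initial` candidates."""
    return initial * 2.0 ** (years / doubling_time_years)
```

A doubling time of 0.5 years corresponds to a 4x increase per year, and 0.28 years to roughly 12x per year. Starting from a hypothetical 100 candidate models, two years at the 0.5-year rate gives 1,600.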
♻ ☆ TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation ICML25
Generating ultra-long sequences with large language models (LLMs) has become
increasingly crucial but remains a highly time-intensive task, particularly for
sequences up to 100K tokens. While traditional speculative decoding methods
exist, simply extending their generation limits fails to accelerate the process
and can be detrimental. Through an in-depth analysis, we identify three major
challenges hindering efficient generation: frequent model reloading, dynamic
key-value (KV) management and repetitive generation. To address these issues,
we introduce TOKENSWIFT, a novel framework designed to substantially accelerate
the generation process of ultra-long sequences while maintaining the target
model's inherent quality. Experimental results demonstrate that TOKENSWIFT
achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B,
14B) and architectures (MHA, GQA). This acceleration translates to hours of
time savings for ultra-long sequence generation, establishing TOKENSWIFT as a
scalable and effective solution at unprecedented lengths. Code can be found at
https://github.com/bigai-nlco/TokenSwift.
comment: Accepted By ICML25
♻ ☆ Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making
Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Duy-Dinh Le
In this paper, we explore the ability of large language models (LLMs) to plan
and make decisions through the lens of the traditional Vietnamese board game,
Ô Ăn Quan. This game, which involves a series of strategic token
movements and captures, offers a unique environment for evaluating the
decision-making and strategic capabilities of LLMs. Specifically, we develop
various agent personas, ranging from aggressive to defensive, and employ the
Ô Ăn Quan game as a testbed for assessing LLM performance across
different strategies. Through experimentation with models like
Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we
aim to understand how these models execute strategic decision-making, plan
moves, and manage dynamic game states. The results will offer insights into the
strengths and weaknesses of LLMs in terms of reasoning and strategy,
contributing to a deeper understanding of their general capabilities.
comment: Accepted paper at MAPR 2025
♻ ☆ Skywork-R1V3 Technical Report
Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
We introduce Skywork-R1V3, an advanced, open-source vision-language model
(VLM) that pioneers a new approach to visual reasoning. Its key innovation lies
in effectively transferring reasoning skills from text-only Large Language
Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily
stems from our elaborate post-training RL framework, which effectively
activates and enhances the model's reasoning ability, without the need for
additional continued pre-training. Through this framework, we further uncover
the fundamental role of the connector module in achieving robust cross-modal
alignment for multimodal reasoning models. In addition, we introduce a unique
indicator of reasoning capability, the entropy of critical reasoning tokens,
which has proven highly effective for checkpoint selection during RL training.
Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving
from 64.3% to 76.0%. This performance matches entry-level human capabilities.
Remarkably, our RL-powered post-training approach enables even the 38B
parameter model to rival top closed-source VLMs. The implementation
successfully transfers mathematical reasoning to other subject-related
reasoning tasks. We also include an analysis of curriculum learning and
reinforcement finetuning strategies, along with a broader discussion on
multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal
reasoning, showcasing RL as a powerful engine for advancing open-source VLM
capabilities.
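
The Skywork-R1V3 abstract mentions the entropy of critical reasoning tokens as an indicator for checkpoint selection. The sketch below shows only the underlying quantity, the Shannon entropy of a next-token distribution averaged over chosen positions. How critical tokens are identified and how the score is used to pick checkpoints are not stated in the abstract, so the `positions` argument and both function names are assumptions.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_at(
    distributions: Sequence[Sequence[float]],
    positions: Sequence[int],
) -> float:
    """Average entropy over the given token positions.

    `positions` stands in for whatever rule marks a token as 'critical'
    (an assumption; the abstract does not specify it).
    """
    return sum(token_entropy(distributions[i]) for i in positions) / len(positions)
```

A uniform distribution over two tokens has entropy ln 2, while a distribution that puts all mass on one token has entropy 0.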
♻ ☆ InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries
This pilot study presents the development of the InfoTech Assistant, a
domain-specific, multimodal chatbot engineered to address queries in bridge
evaluation and infrastructure technology. By integrating web data scraping,
large language models (LLMs), and Retrieval-Augmented Generation (RAG), the
InfoTech Assistant provides accurate and contextually relevant responses. Data,
including textual descriptions and images, are sourced from publicly available
documents on the InfoTechnology website and organized in JSON format to
facilitate efficient querying. The architecture of the system includes an
HTML-based interface and a Flask back end connected to the Llama 3.1 model via
LLM Studio. Evaluation results show approximately 95 percent accuracy on
domain-specific tasks, with high similarity scores confirming the quality of
response matching. This RAG-enhanced setup enables the InfoTech Assistant to
handle complex, multimodal queries, offering both textual and visual
information in its responses. The InfoTech Assistant demonstrates strong
potential as a dependable tool for infrastructure professionals, delivering
high accuracy and relevance in its domain-specific outputs.
comment: Accepted by IEEE Big Data 2024
♻ ☆ Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data ACL 2025
Tables are a primary medium for conveying critical information in
administrative domains, yet their complexity hinders utilization by Large
Language Models (LLMs). This paper introduces the Theme-Explanation
Structure-based Table Summarization (Tabular-TX) pipeline, a novel approach
designed to generate highly interpretable summaries from tabular data, with a
specific focus on Korean administrative documents. Current table summarization
methods often neglect the crucial aspect of human-friendly output. Tabular-TX
addresses this by first employing a multi-step reasoning process to ensure deep
table comprehension by LLMs, followed by a journalist persona prompting
strategy for clear sentence generation. Crucially, it then structures the
output into a Theme Part (an adverbial phrase) and an Explanation Part (a
predicative clause), significantly enhancing readability. Our approach
leverages in-context learning, obviating the need for extensive fine-tuning and
associated labeled data or computational resources. Experimental results show
that Tabular-TX effectively processes complex table structures and metadata,
offering a robust and efficient solution for generating human-centric table
summaries, especially in low-resource scenarios.
comment: Accepted to TRL@ACL 2025