Computation and Language
☆ MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark
for evaluating foundation models in video understanding. MMVU includes 3,000
expert-annotated questions spanning 27 subjects across four core disciplines:
Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to
prior benchmarks, MMVU features three key advancements. First, it challenges
models to apply domain-specific knowledge and perform expert-level reasoning to
analyze specialized-domain videos, moving beyond the basic visual perception
typically assessed in current video benchmarks. Second, each example is
annotated by human experts from scratch. We implement strict data quality
controls to ensure the high quality of the dataset. Finally, each example is
enriched with expert-annotated reasoning rationales and relevant domain
knowledge, facilitating in-depth analysis. We conduct an extensive evaluation
of 32 frontier multimodal foundation models on MMVU. The latest
System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest
performance among the tested models. However, they still fall short of matching
human expertise. Through in-depth error analyses and case studies, we offer
actionable insights for future advancements in expert-level,
knowledge-intensive video understanding for specialized domains.
☆ InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
Despite the promising performance of Large Vision Language Models (LVLMs) in
visual understanding, they occasionally generate incorrect outputs. While
reward models (RMs) with reinforcement learning or test-time scaling offer the
potential for improving generation quality, a critical gap remains: publicly
available multi-modal RMs for LVLMs are scarce, and the implementation details
of proprietary models are often unclear. We bridge this gap with
InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective
multi-modal reward model that aligns LVLMs with human preferences. To ensure
the robustness and versatility of IXC-2.5-Reward, we set up a high-quality
multi-modal preference corpus spanning text, image, and video inputs across
diverse domains, such as instruction following, general understanding,
text-rich documents, mathematical reasoning, and video understanding.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward
model benchmark and shows competitive performance on text-only reward model
benchmarks. We further demonstrate three key applications of IXC-2.5-Reward:
(1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward
with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows
consistent improvements in instruction following and multi-modal open-ended
dialogue; (2) Selecting the best response from candidate responses for
test-time scaling; and (3) Filtering outlier or noisy samples from existing
image and video instruction tuning training data. To ensure reproducibility and
facilitate further research, we have open-sourced all model weights and
training recipes at https://github.com/InternLM/InternLM-XComposer
comment: Tech Report
☆ FuocChuVIP123 at CoMeDi Shared Task: Disagreement Ranking with XLM-Roberta Sentence Embeddings and Deep Neural Regression COLING 2025
This paper presents the results of our system for the CoMeDi Shared Task, focusing on
Subtask 2: Disagreement Ranking. Our system leverages sentence embeddings
generated by the paraphrase-xlm-r-multilingual-v1 model, combined with a deep
neural regression model incorporating batch normalization and dropout for
improved generalization. By predicting the mean of pairwise judgment
differences between annotators, our method explicitly targets disagreement
ranking, diverging from traditional "gold label" aggregation approaches. We
optimized our system with a customized architecture and training procedure,
achieving competitive performance in Spearman correlation against mean
disagreement labels. Our results highlight the importance of robust embeddings,
effective model architecture, and careful handling of judgment differences for
ranking disagreement in multilingual contexts. These findings provide insights
into the use of contextualized representations for ordinal judgment tasks and
open avenues for further refinement of disagreement prediction models.
comment: Accepted at COMEDI shared Task, Workshop at COLING 2025
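The regression target described above, the mean of pairwise judgment differences between annotators, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and taking the absolute difference over all annotator pairs is an assumption about how "difference" is defined.

```python
from itertools import combinations


def mean_pairwise_disagreement(judgments):
    """Mean absolute difference over all annotator pairs for one instance.

    `judgments` holds the ordinal ratings given by different annotators.
    Using the absolute difference is an assumption made for this sketch.
    """
    pairs = list(combinations(judgments, 2))
    if not pairs:
        raise ValueError("need at least two judgments")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Three annotators rate the same usage pair on an ordinal scale.
print(mean_pairwise_disagreement([1, 4, 4]))
```

A regression model trained on sentence embeddings would then predict this value per instance, and system rankings would be compared with the gold values via Spearman correlation.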
☆ Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration
Acquiring labelled training data that meets quantity and quality requirements
remains costly in real-world machine learning projects. Recently, Large
Language Models (LLMs), notably GPT-4, have shown great promise in labelling
data with high accuracy. However, privacy and cost concerns prevent the
ubiquitous use of GPT-4. In this work, we explore effectively leveraging
open-source models for automatic labelling. We identify integrating label
schema as a promising technology but find that naively using the label
description for classification leads to poor performance on high-cardinality
tasks. To address this, we propose Retrieval Augmented Classification (RAC), in
which the LLM performs inference for one label at a time using the corresponding
label schema; we start with the most related label and iterate until a label is
chosen by the LLM. We show that our method, which dynamically integrates label
description, leads to performance improvements in labelling tasks. We further
show that by focusing only on the most promising labels, RAC can trade off
between label quality and coverage - a property we leverage to automatically
label our internal datasets.
comment: 11 pages, 1 figure
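The one-label-at-a-time loop described for RAC can be sketched as below. This is a hedged illustration only: `rac_label`, `keyword_judge`, and the `max_labels` cutoff are hypothetical names, the ranking of labels by relevance is assumed to be given, and a simple keyword check stands in for the LLM's yes/no judgment.

```python
def rac_label(text, labels_ranked, schema, accepts, max_labels=None):
    """Try labels from most to least relevant; return the first accepted one.

    labels_ranked: labels already sorted by relevance to `text` (assumed given).
    schema:        maps each label to its description.
    accepts:       callable(text, label, description) -> bool, standing in
                   for a binary LLM judgment (hypothetical interface).
    max_labels:    only consider the top-k labels; fewer labels means fewer
                   calls but more abstentions (lower coverage).
    Returns None when no label is accepted (the item stays unlabelled).
    """
    candidates = labels_ranked if max_labels is None else labels_ranked[:max_labels]
    for label in candidates:
        if accepts(text, label, schema[label]):
            return label
    return None


def keyword_judge(text, label, description):
    """Toy stand-in for the LLM: accept if any description word appears."""
    lowered = text.lower()
    return any(word in lowered for word in description.split())


schema = {
    "billing": "invoice payment refund",
    "shipping": "delivery package courier",
}
print(rac_label("Where is my package?", ["shipping", "billing"], schema, keyword_judge))
```

Restricting `max_labels` mirrors the quality-versus-coverage trade-off mentioned in the abstract: items whose top-ranked labels are all rejected are simply left unlabelled.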
☆ UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
This paper introduces UI-TARS, a native GUI agent model that solely perceives
the screenshots as input and performs human-like interactions (e.g., keyboard
and mouse operations). Unlike prevailing agent frameworks that depend on
heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts
and workflows, UI-TARS is an end-to-end model that outperforms these
sophisticated frameworks. Experiments demonstrate its superior performance:
UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating
perception, grounding, and GUI task execution. Notably, in the OSWorld
benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15
steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld,
UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several
key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of
GUI screenshots for context-aware understanding of UI elements and precise
captioning; (2) Unified Action Modeling, which standardizes actions into a
unified space across platforms and achieves precise grounding and interaction
through large-scale action traces; (3) System-2 Reasoning, which incorporates
deliberate reasoning into multi-step decision making, involving multiple
reasoning patterns such as task decomposition, reflection thinking, milestone
recognition, etc.; and (4) Iterative Training with Reflective Online Traces, which
addresses the data bottleneck by automatically collecting, filtering, and
reflectively refining new interaction traces on hundreds of virtual machines.
Through iterative training and reflection tuning, UI-TARS continuously learns
from its mistakes and adapts to unforeseen situations with minimal human
intervention. We also analyze the evolution path of GUI agents to guide the
further development of this domain.
☆ Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in
enhancing the conversational capabilities of Large Language Models (LLMs).
However, as LLMs become more advanced, the availability of high-quality
human-annotated SFT data has become a significant bottleneck, necessitating a
greater reliance on synthetic training data. In this work, we introduce Condor,
a novel two-stage synthetic data generation framework that incorporates World
Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data
at scale. Our experimental results demonstrate that a base model fine-tuned on
only 20K Condor-generated samples achieves superior performance compared to
counterparts. The additional refinement stage in Condor further enables
iterative self-improvement for LLMs at various scales (up to 72B), validating
the effectiveness of our approach. Furthermore, our investigation into the
scaling for synthetic data in post-training reveals substantial unexplored
potential for performance improvements, opening promising avenues for future
research.
comment: Tech Report. Github: https://github.com/InternLM/Condor
☆ CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
The main challenges limiting the adoption of deep learning-based solutions in
medical workflows are the availability of annotated data and the lack of
interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the
latter by constraining the final disease prediction on a set of predefined and
human-interpretable concepts. However, the increased interpretability achieved
through these concept-based explanations implies a higher annotation burden.
Moreover, if a new concept needs to be added, the whole system needs to be
retrained. Inspired by the remarkable performance shown by Large
Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet
effective, methodology, CBVLM, which tackles both of the aforementioned
challenges. First, for each concept, we prompt the LVLM to answer if the
concept is present in the input image. Then, we ask the LVLM to classify the
image based on the previous concept predictions. Moreover, in both stages, we
incorporate a retrieval module responsible for selecting the best examples for
in-context learning. By grounding the final diagnosis on the predicted
concepts, we ensure explainability, and by leveraging the few-shot capabilities
of LVLMs, we drastically lower the annotation cost. We validate our approach
with extensive experiments across four medical datasets and twelve LVLMs (both
generic and medical) and show that CBVLM consistently outperforms CBMs and
task-specific supervised methods without requiring any training and using just
a few annotated examples. More information on our project page:
https://cristianopatricio.github.io/CBVLM/.
comment: This work has been submitted to the IEEE for possible publication
☆ FOCUS: First Order Concentrated Updating Scheme
Large language models (LLMs) demonstrate remarkable performance, and
improving their pre-training process appears to be key to enhancing their
capabilities further. Based on the documented success of Adam, learning rate
decay, and weight decay, we hypothesize that the pre-training loss landscape
features a narrowing valley structure. Through experiments with synthetic loss
functions, we discover that when gradient query noise is high relative to the
valley's sharpness, Adam's performance falls behind that of Signum because Adam
reduces the effective step size too drastically. This observation led us to
develop FOCUS, an optimizer that enhances Signum by incorporating attraction
toward moving averaged parameters, allowing it to handle noise better while
maintaining larger step sizes. In training GPT-2, FOCUS proves to be more
stable than Signum and faster than Adam. These results suggest that gradient
noise may be an underappreciated limiting factor in LLM training, and FOCUS
offers promising solutions.
comment: 19 pages, 8 figures
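The idea of adding an attraction toward moving-averaged parameters to Signum can be sketched as a single update step. This is an illustrative reading of the abstract, not the authors' implementation: the exact form of the attraction term (here the sign of the gap to the averaged parameters), the hyperparameter names, and their default values are all assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def focus_step(theta, grad, m, avg, lr=0.01, beta1=0.9, beta2=0.99, gamma=0.1):
    """One hypothetical FOCUS-style update on a list of scalar parameters.

    m   : momentum of gradients (as in Signum)
    avg : exponential moving average of the parameters
    The step is Signum's sign-of-momentum term plus an assumed attraction
    term that pulls each parameter toward its moving average.
    """
    new_m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    new_avg = [beta2 * ai + (1 - beta2) * ti for ai, ti in zip(avg, theta)]
    new_theta = [
        ti - lr * (sign(mi) + gamma * sign(ti - ai))
        for ti, mi, ai in zip(theta, new_m, new_avg)
    ]
    return new_theta, new_m, new_avg


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
theta, m, avg = [1.0], [0.0], [1.0]
for _ in range(50):
    grad = [2 * t for t in theta]
    theta, m, avg = focus_step(theta, grad, m, avg)
print(theta)
```

Because both terms are sign-based, the step magnitude stays on the order of `lr` regardless of gradient scale, which is the property the abstract contrasts with Adam's shrinking effective step size under noise.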
☆ InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models
The improved competence of generative models can help build multi-modal
virtual assistants that leverage modalities beyond language. By observing
humans performing multi-step tasks, one can build assistants that have
situational awareness of actions and tasks being performed, enabling them to
tailor assistance based on this understanding. In this paper, we develop a
Context-aware Instructional Task Assistant with Multi-modal Large Language
Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
share or video recording) and responds in real-time to user queries related to
the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
model on task videos and paired textual data, and 2) automatically extracts a
task graph from video data and leverages it at training and inference time. We
show InsTALL achieves state-of-the-art performance across proposed sub-tasks
considered for multimodal activity understanding -- task recognition (TR),
action recognition (AR), next action prediction (AP), and plan prediction (PP)
-- and outperforms existing baselines on two novel sub-tasks related to
automatic error identification.
☆ Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model
Large Vision Language Models (LVLMs) have demonstrated remarkable
capabilities in understanding and describing visual content, achieving
state-of-the-art performance across various vision-language tasks. However,
these models frequently exhibit hallucination behavior, where they generate
descriptions containing objects or details absent in the input image. Our work
investigates this phenomenon by analyzing attention patterns across transformer
layers and heads, revealing that hallucinations often stem from progressive
degradation of visual grounding in deeper layers. We propose a novel attention
modification approach that combines selective token emphasis and head-specific
modulation to maintain visual grounding throughout the generation process. Our
method introduces two key components: (1) a dual-stream token selection
mechanism that identifies and prioritizes both locally informative and
spatially significant visual tokens, and (2) an attention head-specific
modulation strategy that differentially amplifies visual information processing
based on measured visual sensitivity of individual attention heads. Through
extensive experimentation on the MSCOCO dataset, we demonstrate that our
approach reduces hallucination rates by up to 62.3% compared to baseline
models while maintaining comparable task performance. Our analysis reveals that
selectively modulating tokens across attention heads with varying levels of
visual sensitivity can significantly improve visual grounding without requiring
model retraining.
comment: 10 pages, 5 tables, 4 figures
☆ Extend Adversarial Policy Against Neural Machine Translation via Unknown Token
Generating adversarial examples contributes to the robustness of mainstream
neural machine translation (NMT). However, popular adversarial policies are
suited to fixed tokenization, which hinders their efficacy for common character
perturbations involving versatile tokenization. Based on existing adversarial
generation via reinforcement learning (RL), we propose the "DexChar policy",
which introduces character perturbations for the existing mainstream adversarial policy based on
token substitution. Furthermore, we improve the self-supervised matching that
provides feedback in RL to cater to the semantic constraints required during
training adversaries. Experiments show that our method is compatible with the
scenario where baseline adversaries fail, and can generate high-efficiency
adversarial examples for analysis and optimization of the system.
comment: accepted by CCMT 2024
☆ AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia
This paper introduces AdaServe, the first LLM serving system to support SLO
customization through fine-grained speculative decoding. AdaServe leverages the
logits of a draft model to predict the speculative accuracy of tokens and
employs a theoretically optimal algorithm to construct token trees for
verification. To accommodate diverse SLO requirements without compromising
throughput, AdaServe employs a speculation-and-selection scheme that first
constructs candidate token trees for each request and then dynamically selects
tokens to meet individual SLO constraints while optimizing throughput.
Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher
SLO attainment and 74% higher goodput compared to state-of-the-art systems.
These results underscore AdaServe's potential to enhance the efficiency and
adaptability of LLM deployments across varied application scenarios.
☆ Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
Selecting appropriate training data is crucial for effective instruction
fine-tuning of large language models (LLMs), which aims to (1) elicit strong
capabilities, and (2) achieve balanced performance across a diverse range of
tasks. Influence-based methods show promise in achieving (1) by estimating the
contribution of each training example to the model's predictions, but often
struggle with (2). Our systematic investigation reveals that this
underperformance can be attributed to an inherent bias where certain tasks
intrinsically have greater influence than others. As a result, data selection
is often biased towards these tasks, not only hurting the model's performance
on others but also, counterintuitively, harming performance on these
high-influence tasks themselves.
As a remedy, we propose BIDS, a Balanced and Influential Data Selection
algorithm. BIDS first normalizes influence scores of the training data, and
then iteratively balances data selection by choosing the training example with
the highest influence on the most underrepresented task. Experiments with both
Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities
show that BIDS consistently outperforms both state-of-the-art influence-based
algorithms and other non-influence-based selection frameworks. Surprisingly,
training on a 15% subset selected by BIDS can even outperform full-dataset
training with a much more balanced performance. Our analysis further highlights
the importance of both instance-level normalization and iterative optimization
of selected data for balanced learning of diverse capabilities.
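The two BIDS steps described above (normalize influence scores, then repeatedly pick the example with the highest influence on the most underrepresented task) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the score matrix has one column per task rather than per validation instance, normalization is a per-column z-score, and "most underrepresented" is taken to mean the task with the lowest accumulated normalized influence. All function names are hypothetical.

```python
from statistics import mean, pstdev


def normalize_columns(scores):
    """Z-score each task column across training examples (assumed scheme)."""
    n_tasks = len(scores[0])
    columns = [[row[t] for row in scores] for t in range(n_tasks)]
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [
        [(row[t] - stats[t][0]) / stats[t][1] for t in range(n_tasks)]
        for row in scores
    ]


def balanced_select(scores, budget):
    """Greedily select `budget` example indices, always serving the task
    that the current selection covers least."""
    norm = normalize_columns(scores)
    n_tasks = len(norm[0])
    remaining = list(range(len(norm)))
    selected = []
    coverage = [0.0] * n_tasks  # accumulated normalized influence per task
    while remaining and len(selected) < budget:
        task = min(range(n_tasks), key=lambda t: coverage[t])
        best = max(remaining, key=lambda i: norm[i][task])
        selected.append(best)
        remaining.remove(best)
        coverage = [c + v for c, v in zip(coverage, norm[best])]
    return selected


# Rows are training examples, columns are tasks. Task 0 has intrinsically
# larger raw influence values than task 1.
scores = [[10, 0.1], [9, 0.0], [8, 0.2], [0, 0.3]]
print(balanced_select(scores, 2))
```

In this toy matrix, ranking by raw summed influence would favor the examples that score highly on task 0 only, whereas the balanced loop picks one strong example per task.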
☆ Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes
Tumor documentation in Germany is largely done manually, requiring reading
patient records and entering data into structured databases. Large language
models (LLMs) could potentially enhance this process by improving efficiency
and reliability. This evaluation tests eleven different open source LLMs with
sizes ranging from 1-70 billion model parameters on three basic tasks of the
tumor documentation process: identifying tumor diagnoses, assigning ICD-10
codes, and extracting the date of first diagnosis. For evaluating the LLMs on
these tasks, a dataset of annotated text snippets based on anonymized doctors'
notes from urology was prepared. Different prompting strategies were used to
investigate the effect of the number of examples in few-shot prompting and to
explore the capabilities of the LLMs in general. The models Llama 3.1 8B,
Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks.
Models with less extensive training data or having fewer than 7 billion
parameters showed notably lower performance, while larger models did not
display performance gains. Examples from a different medical domain than
urology could also improve the outcome in few-shot prompting, which
demonstrates the ability of LLMs to handle tasks needed for tumor
documentation. Open source LLMs show a strong potential for automating tumor
documentation. Models from 7-12 billion parameters could offer an optimal
balance between performance and resource efficiency. With tailored fine-tuning
and well-designed prompting, these models might become important tools for
clinical documentation in the future. The code for the evaluation is available
from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset
as a new valuable resource that addresses the shortage of authentic and easily
accessible benchmarks in German-language medical NLP.
comment: 48 pages, 5 figures
☆ EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition
Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of
trainable parameters. However, they often suffer from scalability issues and
differences between their learning pattern and full fine-tuning. To overcome
these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation
(EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude
and directional components. By freezing low-rank matrices, initializing them by
singular value decomposition, and introducing a small trainable matrix between
them, EDoRA achieves substantial reduction in trainable parameters while
maintaining learning capacity. Experimental results on the GLUE benchmark
demonstrate that EDoRA achieves competitive or superior performance compared to
state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable
parameters. This makes EDoRA a highly efficient solution for adapting LLMs to
diverse tasks under memory-constrained settings. Code is available at
https://github.com/Hamid-Nasiri/EDoRA .
comment: 10 pages, 4 figures, 4 tables
☆ MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking
Medical language models (MLMs) have become pivotal in advancing medical
natural language processing. However, prior models that rely on pre-training or
supervised fine-tuning often exhibit low data efficiency and limited
practicality in real-world clinical applications. While OpenAI's o1 highlights
test-time scaling in mathematics, attempts to replicate this approach in
medicine typically distill responses from GPT-series models to open-source
models, focusing primarily on multiple-choice tasks. This strategy, though
straightforward, neglects critical concerns like data privacy and realistic
deployment in clinical settings. In this work, we present a deployable,
small-scale medical language model, MedS$^3$, designed for long-chain reasoning in
clinical tasks using a self-evolution paradigm. Starting with a seed dataset of
around 8,000 instances spanning five domains and 16 datasets, we prompt a base
policy model to perform Monte Carlo Tree Search (MCTS) to construct verifiable
reasoning chains. Each reasoning step is assigned an evolution rollout value,
allowing verified trajectories to train the policy model and the reward model.
During inference, the policy model generates multiple responses, and the reward
model selects the one with the highest reward score. Experiments on eleven
evaluation datasets demonstrate that MedS$^3$ outperforms prior open-source models
by 2 points, with the addition of the reward model further boosting performance
($\sim$13 points), surpassing GPT-4o-mini. Code and data are available at
https://github.com/pixas/MedSSS.
comment: 19 pages; technical report
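The inference procedure described above, where the policy model generates several responses and the reward model keeps the highest-scoring one, is a best-of-N selection. A minimal sketch follows; `best_of_n` and `toy_reward` are hypothetical names, and the keyword-counting reward is only a stand-in for a learned reward model.

```python
def best_of_n(candidates, reward):
    """Return the candidate with the highest reward score.

    candidates: list of generated responses.
    reward:     callable(response) -> float, standing in for a reward model.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=reward)


def toy_reward(response):
    """Toy reward: count how many expected reasoning keywords appear."""
    keywords = ("symptom", "diagnosis", "therefore")
    return sum(word in response.lower() for word in keywords)


responses = [
    "The answer is B.",
    "Given the symptom pattern, the diagnosis is B.",
    "Symptom onset suggests infection; therefore the diagnosis is B.",
]
print(best_of_n(responses, toy_reward))
```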
☆ Reference-free Evaluation Metrics for Text Generation: A Survey
A number of automatic evaluation metrics have been proposed for natural
language generation systems. The most common approach to automatic evaluation
is the use of a reference-based metric that compares the model's output with
gold-standard references written by humans. However, it is expensive to create
such references, and for some tasks, such as response generation in dialogue,
creating references is not a simple matter. Therefore, various reference-free
metrics have been developed in recent years. In this survey, which intends to
cover the full breadth of all NLG tasks, we investigate the most commonly used
approaches, their application, and their other uses beyond evaluating models.
The survey concludes by highlighting some promising directions for future
research.
comment: Work in progress
☆ Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues
Training task-oriented dialogue systems is both costly and time-consuming,
due to the need for high-quality datasets encompassing diverse intents.
Traditional methods depend on extensive human annotation, while recent
advancements leverage large language models (LLMs) to generate synthetic data.
However, these approaches often require custom prompts or code, limiting
accessibility for non-technical users. We introduce GraphTOD, an end-to-end
framework that simplifies the generation of task-oriented dialogues. Users can
create dialogues by specifying transition graphs in JSON format. Our evaluation
demonstrates that GraphTOD generates high-quality dialogues across various
domains, significantly lowering the cost and complexity of dataset creation.
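To illustrate what specifying a transition graph in JSON might look like, the sketch below parses a small graph and samples one path of dialogue states through it. The JSON schema (`start`, `transitions`), the state names, and `sample_path` are all hypothetical; the actual GraphTOD format is not given in the abstract.

```python
import json
import random

# Hypothetical schema: a start state plus, for each state, the states that
# may follow it. An empty list marks a terminal state.
GRAPH_JSON = """
{
  "start": "greet",
  "transitions": {
    "greet": ["ask_date"],
    "ask_date": ["confirm", "ask_date"],
    "confirm": []
  }
}
"""


def sample_path(graph, rng, max_steps=10):
    """Random walk over the transition graph, starting at graph["start"]."""
    state = graph["start"]
    path = [state]
    for _ in range(max_steps):
        next_states = graph["transitions"].get(state, [])
        if not next_states:
            break
        state = rng.choice(next_states)
        path.append(state)
    return path


graph = json.loads(GRAPH_JSON)
print(sample_path(graph, random.Random(0)))
```

Each sampled path could then be handed to an LLM, one state at a time, to generate the corresponding user and system turns.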
☆ A Hybrid Attention Framework for Fake News Detection with Large Language Models
With the rapid growth of online information, the spread of fake news has
become a serious social challenge. In this study, we propose a novel detection
framework based on Large Language Models (LLMs) to identify and classify fake
news by integrating textual statistical features and deep semantic features.
Our approach utilizes the contextual understanding capability of the large
language model for text analysis and introduces a hybrid attention mechanism to
focus on feature combinations that are particularly important for fake news
identification. Extensive experiments on the WELFake news dataset show that our
model significantly outperforms existing methods, with a 1.5% improvement in
F1 score. In addition, we assess the interpretability of the model through
attention heat maps and SHAP values, providing actionable insights for content
review strategies. Our framework provides a scalable and efficient solution to
deal with the spread of fake news and helps build a more reliable online
information ecosystem.
☆ TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection
Text anomaly detection is crucial for identifying spam, misinformation, and
offensive language in natural language processing tasks. Despite the growing
adoption of embedding-based methods, their effectiveness and generalizability
across diverse application scenarios remain under-explored. To address this, we
present TAD-Bench, a comprehensive benchmark designed to systematically
evaluate embedding-based approaches for text anomaly detection. TAD-Bench
integrates multiple datasets spanning different domains, combining
state-of-the-art embeddings from large language models with a variety of
anomaly detection algorithms. Through extensive experiments, we analyze the
interplay between embeddings and detection methods, uncovering their strengths,
weaknesses, and applicability to different tasks. These findings offer new
perspectives on building more robust, efficient, and generalizable anomaly
detection systems for real-world applications.
☆ Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model
Despite achieving remarkable performance, machine translation (MT) research
remains underexplored in terms of translating cultural elements in languages,
such as idioms, proverbs, and colloquial expressions. This paper investigates
the capability of state-of-the-art neural machine translation (NMT) and large
language models (LLMs) in translating proverbs, which are deeply rooted in
cultural contexts. We construct a translation dataset of standalone proverbs
and proverbs in conversation for four language pairs. Our experiments show that
the studied models can achieve good translation between languages with similar
cultural backgrounds, and LLMs generally outperform NMT models in proverb
translation. Furthermore, we find that current automatic evaluation metrics
such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the
quality of proverb translation, highlighting the need for more culturally aware
evaluation metrics.
☆ HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
While Korean historical documents are invaluable cultural heritage,
understanding those documents requires in-depth Hanja expertise. Hanja is an
ancient language used in Korea before the 20th century, whose characters were
borrowed from old Chinese but had evolved in Korea for centuries. Modern
Koreans and Chinese cannot understand Korean historical documents without
substantial additional help, and while previous efforts have produced some
Korean and English translations, this requires in-depth expertise, and so most
of the documents are not translated into any modern language. To address this
gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in
understanding and translating the unexplored Korean historical documents
written in Hanja. HERITAGE is a web-based platform providing model predictions
of three critical tasks in historical document understanding via Hanja language
models: punctuation restoration, named entity recognition, and machine
translation (MT). HERITAGE also includes an interactive glossary, which
provides the character-level reading of the Hanja characters in modern Korean,
as well as character-level English definitions. HERITAGE serves two purposes.
First, anyone interested in these documents can get a general understanding
from the model predictions and the interactive glossary, especially MT outputs
in Korean and English. Second, since the model outputs are not perfect, Hanja
experts can revise them to produce better annotations and translations. This
would boost the translation efficiency and potentially lead to most of the
historical documents being translated into modern languages, lowering the
barrier to unexplored Korean historical documents.
comment: Demo and video are available at https://hanja.dev and
https://hanja.dev/video
☆ LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models
This paper presents our approach for Task 3 of the GenAI content detection
workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT)
Detection. We propose an ensemble of fine-tuned transformer models, enhanced by
inverse perplexity weighting, to improve classification accuracy across diverse
text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a
fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base
model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23
detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned
RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22
detectors. Our results demonstrate the effectiveness of inverse
perplexity-based weighting for enhancing generalization and performance in both
non-adversarial and adversarial MGT detection, highlighting the potential for
transformer models in cross-domain AI-generated content detection.
☆ LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts
This paper presents a system developed for Task 1 of the COLING 2025 Workshop
on Detecting AI-Generated Content, focusing on the binary classification of
machine-generated versus human-written text. Our approach utilizes an ensemble
of models, with weights assigned according to each model's inverse perplexity,
to enhance classification accuracy. For the English text detection task, we
combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and
BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out
of 35 teams. We ensembled RemBERT, XLM-RoBERTa-base, and
BERT-base-multilingual-case for the multilingual text detection task, employing
the same inverse perplexity weighting technique. This resulted in a Macro
F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate
the effectiveness of inverse perplexity weighting in improving the robustness
of machine-generated text detection across both monolingual and multilingual
settings, highlighting the potential of ensemble methods for this challenging
task.
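A minimal sketch of inverse perplexity weighting, assuming each model's weight is its reciprocal perplexity normalized so that the weights sum to one, and that the ensemble score is the weighted average of per-model probabilities. The abstract does not specify how perplexity is measured or how the weights are normalized, so the function names and the normalization here are illustrative assumptions rather than the authors' implementation.

```python
def inverse_perplexity_weights(perplexities):
    """Turn per-model perplexities into weights proportional to 1/perplexity.

    Normalizing to sum to 1 is an assumption; the abstract only says the
    weights follow each model's inverse perplexity.
    """
    inverse = [1.0 / p for p in perplexities]
    total = sum(inverse)
    return [value / total for value in inverse]


def ensemble_probability(model_probs, perplexities):
    """Weighted average of each model's P(machine-generated) for one text."""
    weights = inverse_perplexity_weights(perplexities)
    return sum(w * p for w, p in zip(weights, model_probs))


# Hypothetical numbers: three detectors, the first has the lowest perplexity.
example_score = ensemble_probability([0.9, 0.6, 0.2], [2.0, 4.0, 4.0])
```

With these hypothetical inputs the lowest-perplexity model receives half of the total weight, so its prediction dominates the ensemble score.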
☆ Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation WWW'25
Personalized news headline generation aims to provide users with
attention-grabbing headlines that are tailored to their preferences. Prevailing
methods focus on user-oriented content preferences, but most of them overlook
the fact that diverse stylistic preferences are integral to users' panoramic
interests, leading to suboptimal personalization. In view of this, we propose a
novel Stylistic-Content Aware Personalized Headline Generation (SCAPE)
framework. SCAPE extracts both content and stylistic features from headlines
with the aid of large language model (LLM) collaboration. It further adaptively
integrates users' long- and short-term interests through a contrastive
learning-based hierarchical fusion network. By incorporating the panoramic
interests into the headline generator, SCAPE reflects users' stylistic-content
preferences during the generation process. Extensive experiments on the
real-world dataset PENS demonstrate the superiority of SCAPE over baselines.
comment: Accepted to The ACM Web Conference 2025 (WWW'25, short paper)
☆ Med-R$^2$: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine
Keer Lu, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
In recent years, Large Language Models (LLMs) have exhibited remarkable
capabilities in clinical scenarios. However, despite their potential, existing
works face challenges when applying LLMs to medical settings. Strategies
relying on training with medical datasets are highly cost-intensive and may
suffer from outdated training data. Leveraging external knowledge bases is a
suitable alternative, yet it faces obstacles such as limited retrieval
precision and poor effectiveness in answer extraction. These issues
collectively prevent LLMs from demonstrating the expected level of proficiency
in mastering medical expertise. To address these challenges, we introduce
Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based
Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as
the selection and reasoning processes of evidence, thereby enhancing the
problem-solving capabilities of LLMs in healthcare scenarios and fostering a
trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2
achieves a 14.87% improvement over vanilla RAG methods and even a 3.59%
enhancement compared to fine-tuning strategies, without incurring additional
training costs.
☆ From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning
Scaling data and model size has been proven effective for boosting the
performance of large language models. In addition to training-time scaling,
recent studies have revealed that increasing test-time computational resources
can further improve performance. In this work, we introduce Aggregation
Fine-Tuning (AFT), a supervised finetuning paradigm where the model learns to
synthesize multiple draft responses, referred to as proposals, into a single,
refined answer, termed aggregation. At inference time, a propose-and-aggregate
strategy further boosts performance by iteratively generating proposals and
aggregating them. Empirical evaluations on benchmark datasets show that
AFT-trained models substantially outperform standard SFT. Notably, an AFT
model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC
win rate on AlpacaEval 2, surpassing significantly larger LLMs such as
Llama3.1-405B-Instruct and GPT4. By combining sequential refinement and
parallel sampling, the propose-and-aggregate framework scales inference-time
computation in a flexible manner. Overall, these findings position AFT as a
promising approach to unlocking additional capabilities of LLMs without
resorting to increasing data volume or model size.
comment: 20 pages; work in progress
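The propose-and-aggregate strategy described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `propose` and `aggregate` are hypothetical callables standing in for model calls, and the choice to feed each round's aggregations back in as the next round's proposals is one plausible reading of "sequential refinement", not a detail confirmed by the abstract.

```python
from collections import Counter


def propose_and_aggregate(prompt, propose, aggregate, n_proposals=3, n_rounds=2):
    """Combine parallel sampling (n_proposals) with sequential refinement (n_rounds).

    propose(prompt, index) -> str and aggregate(prompt, proposals) -> str are
    hypothetical stand-ins for calls to a language model.
    """
    # Parallel sampling: draw several draft responses for the same prompt.
    proposals = [propose(prompt, i) for i in range(n_proposals)]
    # Sequential refinement (assumed form): aggregations become new proposals.
    for _ in range(n_rounds - 1):
        proposals = [aggregate(prompt, proposals) for _ in range(n_proposals)]
    # Final step: synthesize the current proposals into a single answer.
    return aggregate(prompt, proposals)


# Toy stand-ins so the sketch runs without a model: fixed drafts, majority vote.
def toy_propose(prompt, index):
    return ["4", "5", "4"][index % 3]


def toy_aggregate(prompt, proposals):
    return Counter(proposals).most_common(1)[0][0]


answer = propose_and_aggregate("What is 2 + 2?", toy_propose, toy_aggregate)
```

Raising `n_proposals` widens the parallel search, while raising `n_rounds` deepens the refinement, which is how such a loop could trade inference-time compute for quality.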
☆ Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
This paper revisits the implementation of
$\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training
Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E
\sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$
represents the frequency of expert $i$ being selected, and $p_i$ denotes the
average gating score of the expert $i$. Existing MoE training frameworks
usually employ the parallel training strategy so that $f_i$ and the LBL are
calculated within a $\textbf{micro-batch}$ and then averaged across parallel
groups. In essence, a micro-batch for training billion-scale LLMs normally
contains very few sequences. So, the micro-batch LBL is almost at the sequence
level, and the router is pushed to distribute the tokens evenly within each
sequence. Under this strict constraint, even tokens from a domain-specific
sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby
inhibiting expert specialization. In this work, we propose calculating LBL
using a $\textbf{global-batch}$ to loosen this constraint. Because a
global-batch contains much more diverse sequences than a micro-batch, this
encourages load balance at the corpus level. Specifically, we introduce an
extra communication step to synchronize $f_i$ across micro-batches and then use
it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to
$\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly
find that the global-batch LBL strategy yields excellent performance gains in
both pre-training perplexity and downstream tasks. Our analysis reveals that
the global-batch LBL also greatly improves the domain specialization of MoE
experts.
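The difference between micro-batch and global-batch LBL can be illustrated with a small pure-Python sketch of the formula $N_E \sum_{i=1}^{N_E} f_i p_i$. It assumes top-1 routing and equally sized micro-batches, and it models the extra communication step as a plain sum of expert counts across micro-batches; the function names are hypothetical and this is not the authors' training code.

```python
def routing_stats(gate_probs):
    """Per-expert selection counts and mean gating scores for one micro-batch.

    gate_probs is a list of per-token gating distributions; top-1 routing
    is an assumption made for this sketch.
    """
    n_experts = len(gate_probs[0])
    counts = [0] * n_experts
    score_sums = [0.0] * n_experts
    for token_probs in gate_probs:
        chosen = max(range(n_experts), key=lambda i: token_probs[i])
        counts[chosen] += 1
        for i, p in enumerate(token_probs):
            score_sums[i] += p
    mean_scores = [s / len(gate_probs) for s in score_sums]
    return counts, mean_scores


def lbl(f, p):
    """Load-balancing loss N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))


def micro_batch_lbl(micro_batches):
    """f_i is computed inside each micro-batch, then the losses are averaged."""
    losses = []
    for gate_probs in micro_batches:
        counts, p = routing_stats(gate_probs)
        f = [c / len(gate_probs) for c in counts]
        losses.append(lbl(f, p))
    return sum(losses) / len(losses)


def global_batch_lbl(micro_batches):
    """f_i is synchronized across micro-batches before computing the loss."""
    stats = [routing_stats(mb) for mb in micro_batches]
    n_experts = len(stats[0][0])
    total_tokens = sum(len(mb) for mb in micro_batches)
    # Stand-in for the communication step: sum expert counts over micro-batches.
    f_global = [sum(s[0][i] for s in stats) / total_tokens for i in range(n_experts)]
    losses = [lbl(f_global, p) for _, p in stats]
    return sum(losses) / len(losses)


# Two hypothetical domain-specific micro-batches, each preferring one expert.
code_batch = [[0.9, 0.1], [0.8, 0.2]]
prose_batch = [[0.1, 0.9], [0.2, 0.8]]
micro_loss = micro_batch_lbl([code_batch, prose_batch])
global_loss = global_batch_lbl([code_batch, prose_batch])
```

In this toy case each micro-batch is fully specialized, so the micro-batch loss is about 1.7, while the global-batch loss is about 1.0, the value reached under perfectly balanced load. The global variant therefore does not penalize specialization that balances out across the corpus.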
☆ EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
Multimodal Large Language Models (MLLMs) have shown significant advancements,
providing a promising future for embodied agents. Existing benchmarks for
evaluating MLLMs primarily utilize static images or videos, limiting
assessments to non-interactive scenarios. Meanwhile, existing embodied AI
benchmarks are task-specific and not diverse enough, and do not adequately
evaluate the embodied capabilities of MLLMs. To address this, we propose
EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs
with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied
3D scenes, each of which is rigorously selected and annotated. It covers a
broad spectrum of existing embodied AI tasks with significantly enhanced
diversity, all within a unified simulation and evaluation framework tailored
for MLLMs. The tasks are organized into five categories: navigation, object
interaction, social interaction, attribute question answering, and spatial
question answering to assess different capabilities of the agents. We evaluated
the state-of-the-art MLLMs on EmbodiedEval and found that they have a
significant shortfall compared to human level on embodied tasks. Our analysis
demonstrates the limitations of existing MLLMs in embodied capabilities,
providing insights for their future development. We open-source all evaluation
data and simulation framework at https://github.com/thunlp/EmbodiedEval.
☆ Cross-Entropy Attacks to Language Models via Rare Event Simulation
Black-box textual adversarial attacks are challenging due to the lack of
model information and the discrete, non-differentiable nature of text. Existing
methods often lack versatility for attacking different models, suffer from
limited attacking performance due to the inefficient optimization with word
saliency ranking, and frequently sacrifice semantic integrity to achieve better
attack outcomes. This paper introduces a novel approach to textual adversarial
attacks, which we call Cross-Entropy Attacks (CEA), that uses Cross-Entropy
optimization to address the above issues. Our CEA approach defines adversarial
objectives for both soft-label and hard-label settings and employs CE
optimization to identify optimal replacements. Through extensive experiments on
document classification and language translation problems, we demonstrate that
our attack method excels in terms of attacking performance, imperceptibility,
and sentence quality.
☆ Challenges in Expanding Portuguese Resources: A View from Open Information Extraction
Open Information Extraction (Open IE) is the task of extracting structured
information from textual documents, independent of domain. While traditional
Open IE methods were based on unsupervised approaches, recently, with the
emergence of robust annotated datasets, new data-based approaches have been
developed to achieve better results. These innovations, however, have focused
mainly on the English language due to a lack of datasets and the difficulty of
constructing such resources for other languages. In this work, we present a
high-quality manually annotated corpus for Open Information Extraction in the
Portuguese language, based on a rigorous methodology grounded in established
semantic theories. We discuss the challenges encountered in the annotation
process, propose a set of structural and contextual annotation rules, and
validate our corpus by evaluating the performance of state-of-the-art Open IE
systems. Our resource addresses the lack of datasets for Open IE in Portuguese
and can support the development and evaluation of new methods and systems in
this area.
☆ Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance
Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
Detecting organized political campaigns is of paramount importance in
fighting against disinformation on social media. Existing approaches for the
identification of such organized actions employ techniques mostly from network
science, graph machine learning and natural language processing. Their ultimate
goal is to analyze the relationships and interactions (e.g. re-posting) among
users and the textual similarities of their posts. Despite their effectiveness
in recognizing astroturf campaigns, these methods face significant challenges,
notably the class imbalance in available training datasets. To mitigate this
issue, recent methods usually resort to data augmentation or increasing the
number of positive samples, which may not always be feasible or sufficient in
real-world settings. Following a different path, in this paper, we propose a
novel framework for identifying astroturf campaigns based solely on large
language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
(Balanced RAG) component. Our approach first gives both textual information
concerning the posts (in our case tweets) and the user interactions of the
social network as input to a language model. Then, through prompt engineering
and the proposed Balanced RAG method, it effectively detects coordinated
disinformation campaigns on X (Twitter). The proposed framework does not
require any training or fine-tuning of the language model. Instead, by
strategically harnessing the strengths of prompt engineering and Balanced RAG,
it facilitates LLMs to overcome the effects of class imbalance and effectively
identify coordinated political campaigns. The experimental results demonstrate
that by incorporating the proposed prompt engineering and Balanced RAG methods,
our framework outperforms the traditional graph-based baselines, achieving
2x-3x improvements in terms of precision, recall and F1 scores.
☆ Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs
In this paper, we present an investigative study on how Mental Sets influence
the reasoning capabilities of LLMs. LLMs have excelled in diverse natural
language processing (NLP) tasks, driven by advancements in parameter-efficient
fine-tuning (PEFT) and emergent capabilities like in-context learning (ICL).
For complex reasoning tasks, selecting the right model for PEFT or ICL is
critical, often relying on scores on benchmarks such as MMLU, MATH, and GSM8K.
However, current evaluation methods, based on metrics like F1 Score or
reasoning chain assessments by larger models, overlook a key dimension:
adaptability to unfamiliar situations and overcoming entrenched thinking
patterns. In cognitive psychology, Mental Set refers to the tendency to persist
with previously successful strategies, even when they become inefficient - a
challenge for problem solving and reasoning. We compare the performance of LLM
models like Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the
presence of mental sets. To the best of our knowledge, this is the first study
to integrate cognitive psychology concepts into the evaluation of LLMs for
complex reasoning tasks, providing deeper insights into their adaptability and
problem-solving efficacy.
☆ Fact-Preserved Personalized News Headline Generation ICDM 2023
Personalized news headline generation, aiming at generating user-specific
headlines based on readers' preferences, has emerged as a flourishing research
direction. Existing studies generally inject a user interest embedding into an
encoder-decoder headline generator to make the output personalized, while the
factual consistency of the headlines is not adequately verified. In this paper,
we propose a framework, Fact-Preserved Personalized News Headline Generation
(FPG for short), to strike a tradeoff between personalization and consistency.
In FPG, the similarity between the candidate news to be exposed and the
historical clicked news is used to give different levels of attention to key
facts in the candidate news, and the similarity scores help to learn a
fact-aware global user embedding. Besides, an additional training procedure
based on contrastive learning is devised to further enhance the factual
consistency of generated headlines. Extensive experiments conducted on a
real-world benchmark PENS validate the superiority of FPG, especially on the
tradeoff between personalization and factual consistency.
comment: Accepted by IEEE ICDM 2023, Short paper, 6 pages
♻ ☆ Leveraging Explicit Reasoning for Inference Integration in Commonsense-Augmented Dialogue Models COLING 2025
Open-domain dialogue systems need to grasp social commonsense to understand
and respond effectively to human users. Commonsense-augmented dialogue models
have been proposed that aim to infer commonsense knowledge from dialogue
contexts in order to improve response quality. However, existing approaches to
commonsense-augmented dialogue rely on implicit reasoning to integrate
commonsense inferences during response generation. In this study, we explore
the impact of explicit reasoning versus implicit reasoning over commonsense
for dialogue response generation. Our findings demonstrate that separating
commonsense reasoning into explicit steps for generating, selecting, and
integrating commonsense into responses leads to better dialogue interactions,
improving naturalness, engagement, specificity, and overall quality. Subsequent
analyses of these findings unveil insights into the effectiveness of various
types of commonsense in generating responses and the particular response traits
enhanced through explicit reasoning for commonsense integration. Our work
advances research in open-domain dialogue by achieving a new state-of-the-art
in commonsense-augmented response generation.
comment: Accepted to COLING 2025
(https://aclanthology.org/2025.coling-main.152/)
♻ ☆ Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes COLING 2025
With their impressive performance in various downstream tasks, large language
models (LLMs) have been widely integrated into production pipelines, like
recruitment and recommendation systems. A known issue of models trained on
natural language data is the presence of human biases, which can impact the
fairness of the system. This paper investigates LLMs' behavior with respect to
gender stereotypes, in the context of occupation decision making. Our framework
is designed to investigate and quantify the presence of gender stereotypes in
LLMs' behavior via multi-round question answering. Inspired by prior works, we
construct a dataset by leveraging a standard occupation classification
knowledge base released by authoritative agencies. We tested three LLMs
(RoBERTa-large, GPT-3.5-turbo, and Llama2-70b-chat) and found that all models
exhibit gender stereotypes analogous to human biases, but with different
preferences. The distinct preferences of GPT-3.5-turbo and Llama2-70b-chat may
imply the current alignment methods are insufficient for debiasing and could
introduce new biases contradicting the traditional gender stereotypes.
comment: COLING 2025
♻ ☆ ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, Han Li
Retrieval-Augmented Generation (RAG) models are designed to incorporate
external knowledge, reducing hallucinations caused by insufficient parametric
(internal) knowledge. However, even with accurate and relevant retrieved
content, RAG models can still produce hallucinations by generating outputs that
conflict with the retrieved information. Detecting such hallucinations requires
disentangling how Large Language Models (LLMs) utilize external and parametric
knowledge. Current detection methods often focus on only one of these mechanisms
or fail to decouple their intertwined effects, making accurate detection
difficult. In this paper, we investigate the internal mechanisms behind
hallucinations in RAG scenarios. We discover hallucinations occur when the
Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual
stream, while Copying Heads fail to effectively retain or integrate external
knowledge from retrieved content. Based on these findings, we propose ReDeEP, a
novel method that detects hallucinations by decoupling the LLM's utilization of
external context and parametric knowledge. Our experiments show that ReDeEP
significantly improves RAG hallucination detection accuracy. Additionally, we
introduce AARF, which mitigates hallucinations by modulating the contributions
of Knowledge FFNs and Copying Heads.
comment: 23 pages
♻ ☆ Word and Phrase Features in Graph Convolutional Network for Automatic Question Classification
Effective question classification is crucial for AI-driven educational tools,
enabling adaptive learning systems to categorize questions by skill area,
difficulty level, and competence. This classification not only supports
educational diagnostics and analytics but also enhances complex tasks like
information retrieval and question answering by associating questions with
relevant categories. Traditional methods, often based on word embeddings and
conventional classifiers, struggle to capture the nuanced relationships in
natural language, leading to suboptimal performance. To address this, we
propose a novel approach leveraging graph convolutional networks, named Phrase
Question-Graph Convolutional Network (PQ-GCN) to better model the inherent
structure of questions. By representing questions as graphs, where nodes signify
words or phrases and edges denote syntactic or semantic relationships, our
method allows the model to learn from the interconnected nature of language
more effectively. Additionally, we explore the incorporation of phrase-based
features to enhance classification performance on question datasets of various
domains and characteristics. Our findings demonstrate that the proposed model,
augmented with these features, offers a promising solution for more robust and
context-aware question classification, bridging the gap between graph neural
network research and practical educational applications of AI.
♻ ☆ TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation
The emergence of Large Language Models (LLMs) like ChatGPT has inspired the
development of LLM-based agents capable of addressing complex, real-world
tasks. However, these agents often struggle during task execution due to
methodological constraints, such as error propagation and limited adaptability.
To address this issue, we propose a multi-agent framework based on dynamic Task
Decomposition and Agent Generation (TDAG). This framework dynamically
decomposes complex tasks into smaller subtasks and assigns each to a
specifically generated subagent, thereby enhancing adaptability in diverse and
unpredictable real-world tasks. Simultaneously, existing benchmarks often lack
the granularity needed to evaluate incremental progress in complex, multi-step
tasks. In response, we introduce ItineraryBench in the context of travel
planning, featuring interconnected, progressively complex tasks with a
fine-grained evaluation system. ItineraryBench is designed to assess agents'
abilities in memory, planning, and tool usage across tasks of varying
complexity. Our experimental results reveal that TDAG significantly outperforms
established baselines, showcasing its superior adaptability and context
awareness in complex task scenarios.
comment: Accepted by Neural Networks
♻ ☆ FLARE: Faithful Logic-Aided Reasoning and Exploration
Modern Question Answering (QA) and Reasoning approaches based on Large
Language Models (LLMs) commonly use prompting techniques, such as
Chain-of-Thought (CoT), assuming the resulting generation will have a more
granular exploration and reasoning over the question space and scope. However,
such methods struggle with generating outputs that are faithful to the
intermediate chain of reasoning produced by the model. On the other end of the
spectrum, neuro-symbolic methods such as Faithful CoT (F-CoT) propose to
combine LLMs with external symbolic solvers. While such approaches boast a high
degree of faithfulness, they usually require a model trained for code
generation and struggle with tasks that are ambiguous or hard to formalise
strictly. We introduce $\textbf{F}$aithful $\textbf{L}$ogic-$\textbf{A}$ided
$\textbf{R}$easoning and $\textbf{E}$xploration ($\textbf{FLARE}$), a novel
interpretable approach for traversing the problem space using task
decompositions. We use the LLM to plan a solution, soft-formalise the query
into facts and predicates using a logic programming code and simulate that code
execution using an exhaustive multi-hop search over the defined space. Our
method allows us to compute the faithfulness of the reasoning process w.r.t.
the generated code and analyse the steps of the multi-hop search without
relying on external solvers. Our methods achieve SOTA results on $\mathbf{7}$
out of $\mathbf{9}$ diverse reasoning benchmarks. We also show that model
faithfulness positively correlates with overall performance and further
demonstrate that $\textbf{FLARE}$ allows pinpointing the decisive factors
sufficient for and leading to the correct answer with optimal reasoning during
the multi-hop search.
♻ ☆ The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
Javier de la Rosa, Vladislav Mikhailov, Lemei Zhang, Freddy Wetjen, David Samuel, Peng Liu, Rolv-Arild Braaten, Petter Mæhlum, Magnus Breder Birkenes, Andrey Kutuzov, Tita Enstad, Hans Christian Farsethås, Svein Arne Brygfjeld, Jon Atle Gulla, Stephan Oepen, Erik Velldal, Wilfred Østgulen, Liljia Øvrelid, Aslak Sira Myhre
The use of copyrighted materials in training language models raises critical
legal and ethical questions. This paper presents a framework for and the
results of empirically assessing the impact of publisher-controlled copyrighted
corpora on the performance of generative large language models (LLMs) for
Norwegian. When evaluated on a diverse set of tasks, we found that adding both
books and newspapers to the data mixture of LLMs tends to improve their
performance, while the addition of fiction works seems to be detrimental. Our
experiments could inform the creation of a compensation scheme for authors
whose works contribute to AI development.
comment: 17 pages, 5 figures, 8 tables. Accepted at NoDaLiDa/Baltic-HLT 2025
♻ ☆ Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
Large language model inference is both memory-intensive and time-consuming,
often requiring distributed algorithms to efficiently scale. Various model
parallelism strategies are used in multi-GPU training and inference to
partition computation across multiple devices, reducing memory load and
computation time. However, using model parallelism necessitates communication
of information between GPUs, which has been a major bottleneck and limits the
gains obtained by scaling up the number of devices. We introduce Ladder
Residual, a simple architectural modification applicable to all residual-based
models that enables straightforward overlapping that effectively hides the
latency of communication. Our insight is that in addition to systems
optimization, one can also redesign the model architecture to decouple
communication from computation. While Ladder Residual can allow
communication-computation decoupling in conventional parallelism patterns, we
focus on Tensor Parallelism in this paper, which is particularly bottlenecked
by its heavy communication. For a Transformer model with 70B parameters,
applying Ladder Residual to all its layers can achieve a 30% end-to-end
wall-clock speedup at inference time with TP sharding over 8 devices. We refer to
the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B
Ladder Transformer from scratch and observe comparable performance to a
standard dense transformer baseline. We also show that it is possible to
convert parts of the Llama-3.1 8B model to our Ladder Residual architecture
with minimal accuracy degradation by only retraining for 3B tokens.
♻ ☆ The syntax-semantics interface in a child's path: A study of 3- to 11-year-olds' elicited production of Mandarin recursive relative clauses
There have been apparently conflicting claims over the syntax-semantics
relationship in child acquisition. However, few of them have assessed the
child's path toward the acquisition of recursive relative clauses (RRCs). The
authors of the current paper conducted experiments to investigate 3- to 11-year-olds'
most-structured elicited production of eight Mandarin RRCs in a 4 (syntactic
types)*2 (semantic conditions) design. The four syntactic types were RRCs with
a subject-gapped RC embedded in an object-gapped RC (SORRCs), RRCs with an
object-gapped RC embedded in another object-gapped RC (OORRCs), RRCs with an
object-gapped RC embedded in a subject-gapped RC (OSRRCs), and RRCs with a
subject-gapped RC embedded in another subject-gapped RC (SSRRCs). Each
syntactic type was put in two conditions differing in internal semantics:
irreversible internal semantics (IIS) and reversible internal semantics (RIS).
For example, "the balloon that [the girl that _ eats the banana] holds _" is
SORRCs in the IIS condition; "the monkey that [the dog that _ bites the pig]
hits _" is SORRCs in the RIS condition. For each target, the participants were
provided with a speech-visual stimulus constructing a condition of irreversible
external semantics (IES). The results showed that SSRRCs, OSRRCs and SORRCs in
the IIS-IES condition were produced two years earlier than their counterparts
in the RIS-IES condition. Thus, a 2-stage development path is proposed: the
language acquisition device starts with the interface between (irreversible)
syntax and IIS, and ends with the interface between syntax and IES, both
abiding by the syntax-semantic interface principle.
comment: Revised clarifications in Section 2.2 and important data attached,
results unchanged
♻ ☆ Large Language Model-Brained GUI Agents: A Survey
Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
GUIs have long been central to human-computer interaction, providing an
intuitive and visually-driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. They have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents.
comment: The collection of papers reviewed in this survey will be hosted and
regularly updated on the GitHub repository:
https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey. Additionally, a
searchable webpage is available at https://aka.ms/gui-agent for easier access
and exploration.
♻ ☆ PIER: A Novel Metric for Evaluating What Matters in Code-Switching ICASSP 2025
Code-switching, the alternation of languages within a single discourse,
presents a significant challenge for Automatic Speech Recognition. Despite the
unique nature of the task, performance is commonly measured with established
metrics such as Word-Error-Rate (WER). However, in this paper, we question
whether these general metrics accurately assess performance on code-switching.
Specifically, using both Connectionist-Temporal-Classification and
Encoder-Decoder models, we show that fine-tuning on non-code-switched data from
both the matrix and embedded languages improves classical metrics on
code-switching test sets, even though recognition of the actual code-switched
words worsens (as expected). Therefore, we
propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only
on specific words of interest. We instantiate PIER on code-switched utterances
and show that it describes code-switching performance more accurately,
revealing substantial room for improvement in future work. This focused evaluation
allows for a more precise assessment of model performance, particularly in
challenging aspects such as inter-word and intra-word code-switching.
comment: Accepted at ICASSP 2025
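To make the idea of scoring only words of interest concrete, here is a minimal sketch, not the paper's implementation. It assumes the points of interest are given as indices into the reference transcript, and that only substitutions and deletions at those positions count as errors (insertions are ignored here). The names `align`, `poi`, and the example sentence are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists.

    Returns one label per reference word ('ok', 'sub', or 'del')
    and the total edit distance.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Backtrace to label every reference word.
    ops = [None] * n
    i, j = n, m
    while i > 0:
        cost = 0 if j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            ops[i - 1] = "ok" if cost == 0 else "sub"
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            ops[i - 1] = "del"
            i -= 1
        else:
            j -= 1  # inserted hypothesis word; no reference word to label
    return ops, dist[n][m]


# Hypothetical English/Spanish utterance; indices 2 and 3 are the switched words.
ref = ["i", "like", "la", "playa", "a", "lot"]
hyp = ["i", "like", "the", "playa", "lot"]
poi = {2, 3}

ops, distance = align(ref, hyp)
wer = distance / len(ref)                             # all words
pier = sum(ops[i] != "ok" for i in poi) / len(poi)    # words of interest only
```

In this toy case the overall WER is about 0.33 while the error rate on the switched words is 0.5, which illustrates how a global metric can understate errors on the words that matter.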
♻ ☆ Yi: Open Foundation Models by 01.AI
01.AI: Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yanpeng Li, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
We introduce the Yi model family, a series of language and multimodal models
that demonstrate strong multi-dimensional capabilities. The Yi model family is
based on 6B and 34B pretrained language models, which we extend to chat
models, 200K long context models, depth-upscaled models, and vision-language
models. Our base models achieve strong performance on a wide range of
benchmarks like MMLU, and our finetuned chat models achieve strong human
preference rates on major evaluation platforms like AlpacaEval and Chatbot
Arena. Building upon our scalable super-computing infrastructure and the
classical transformer architecture, we attribute the performance of Yi models
primarily to their data quality resulting from our data-engineering efforts. For
pretraining, we construct 3.1 trillion tokens of English and Chinese corpora
using a cascaded data deduplication and quality filtering pipeline. For
finetuning, we polish a small-scale (fewer than 10K) instruction dataset over
multiple iterations such that every single instance has been verified directly
by our machine learning engineers. For vision-language, we combine the chat
language model with a vision transformer encoder and train the model to align
visual representations to the semantic space of the language model. We further
extend the context length to 200K through lightweight continual pretraining and
demonstrate strong needle-in-a-haystack retrieval performance. We show that
extending the depth of the pretrained checkpoint through continual pretraining
further improves performance. We believe that given our current results,
continuing to scale up model parameters using thoroughly optimized data will
lead to even stronger frontier models.
♻ ☆ Federated Instruction Tuning of LLMs with Domain Coverage Augmentation
Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited
cross-client private data together with various strategies of instruction
augmentation, ultimately boosting model performance within specific domains. To
date, the factors affecting FedDIT remain unclear, and existing instruction
augmentation methods primarily focus on the centralized setting without
considering distributed environments. Our experiments reveal that the
cross-client domain coverage, rather than data heterogeneity, drives model
performance in FedDIT. In response, we propose FedDCA, which optimizes domain
coverage through greedy client center selection and retrieval-based
augmentation. At its core, the greedy selection procedure iteratively picks
client centers that maximize the diversity and coverage of the instruction
space while avoiding redundancy with previously selected centers. This ensures
broad yet efficient coverage of the domain distribution across clients. For
client-side computational efficiency and system scalability, FedDCA$^*$, the
variant of FedDCA, utilizes heterogeneous encoders with server-side feature
alignment. Extensive experiments across code, medical, financial, and
mathematical domains substantiate the effectiveness of both methods, as well as
their plug-and-play capability. We further analyze privacy preservation against
memory extraction attacks, showing that while the privacy leakage risk is
independent of the augmented public data ratio, it decreases or converges as
training progresses.
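The greedy selection described above can be illustrated with a short sketch. The abstract does not state the exact objective, so this example substitutes the classic farthest-first (k-center) heuristic: each step picks the candidate farthest from everything already chosen, which spreads coverage and avoids redundancy. The function name `greedy_select` and the 2-D points standing in for client center embeddings are hypothetical.

```python
import math


def greedy_select(centers, k):
    """Farthest-first traversal over candidate centers.

    Returns the indices of up to k centers, each chosen to maximize its
    distance to the nearest already-selected center.
    """
    selected = [0]  # arbitrary starting point (an assumption of this sketch)
    # min_dist[i] = distance from center i to its nearest selected center
    min_dist = [math.dist(c, centers[0]) for c in centers]
    while len(selected) < min(k, len(centers)):
        nxt = max(range(len(centers)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(c, centers[nxt]))
                    for d, c in zip(min_dist, centers)]
    return selected


# Hypothetical embeddings: two near-duplicate pairs and one isolated point.
centers = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 0.1), (0.0, 5.0)]
picked = greedy_select(centers, 3)
```

With these points the procedure takes one representative from each of the three regions and skips the near-duplicates.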
♻ ☆ Attending To Syntactic Information In Biomedical Event Extraction Via Graph Neural Networks
Many models have been proposed in the literature on biomedical event
extraction (BEE). Some of them use shortest dependency path (SDP) information
to represent the argument classification task. This representation is fragile,
since missing even one word from the dependency parsing graph may completely
change the final prediction. To address this, the full adjacency matrix
of the dependency graph is used to embed individual tokens using a graph
convolutional network (GCN). An ablation study is also done to show the effect
of the dependency graph on the overall performance. The results show a
significant improvement when dependency graph information is used. The proposed
model slightly outperforms state-of-the-art models on BEE over different
datasets.
comment: 6 figures, 4 tables
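As an illustration of embedding tokens with the full adjacency matrix, the following sketch implements a single graph convolution layer in plain Python. It assumes the common symmetric normalization with self-loops, ReLU(D^{-1/2} (A + I) D^{-1/2} H W); the abstract does not say which GCN variant the authors use. The three-token dependency graph and the identity feature and weight matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 @ feats @ weight)."""
    n = len(adj)
    # Add self-loops so each token keeps its own features.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    norm = [[a[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical parse of "protein binds receptor": "binds" (index 1) is the head.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden = gcn_layer(adj, identity, identity)
```

Because every edge of the parse contributes to the aggregation, a single missing word changes only some neighbor weights rather than removing the whole path, which is the robustness argument made above.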
♻ ☆ QROA: A Black-Box Query-Response Optimization Attack on LLMs
Large Language Models (LLMs) have surged in popularity in recent months, yet
they possess concerning capabilities for generating harmful content when
manipulated. This study introduces the Query-Response Optimization Attack
(QROA), an optimization-based strategy designed to exploit LLMs through a
black-box, query-only interaction. QROA adds an optimized trigger to a
malicious instruction to compel the LLM to generate harmful content. Unlike
previous approaches, QROA does not require access to the model's logit
information or any other internal data and operates solely through the standard
query-response interface of LLMs. Inspired by deep Q-learning and greedy
coordinate descent, the method iteratively updates tokens to maximize a
designed reward function. We tested our method on various LLMs such as Vicuna,
Falcon, and Mistral, achieving an Attack Success Rate (ASR) over 80%. We also
tested the method against Llama2-chat, the fine-tuned version of Llama2 designed
to resist jailbreak attacks, achieving good ASR with a suboptimal initial
trigger seed. This study demonstrates the feasibility of generating jailbreak
attacks against deployed LLMs in the public domain using black-box optimization
methods, enabling more comprehensive safety testing of LLMs.
♻ ☆ BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
The rapidly increasing size of large language models (LLMs) presents
significant challenges in memory usage and computational costs. Quantizing both
weights and activations can address these issues, with hardware-supported
fine-grained scaling emerging as a promising solution to mitigate outliers.
However, existing methods struggle to capture nuanced block data distributions.
We propose BlockDialect, a block-wise fine-grained mixed format technique that
assigns a per-block optimal number format from a formatbook for better data
representation. Additionally, we introduce DialectFP4, a formatbook of FP4
variants (akin to dialects) that adapt to diverse data distributions. To
leverage this efficiently, we propose a two-stage approach for online
DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy
efficiency by selecting representable values as scaled integers compatible with
low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy
gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit
usage per data, while being only 5.45% (2.69%) below full precision even when
quantizing full-path matrix multiplication. Focusing on how to represent over
how to scale, our work presents a promising path for energy-efficient LLM
inference.
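The per-block format assignment can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the two formats in `formatbook` are made-up sets of eight magnitudes (not the actual DialectFP4 variants), each block is scaled so its largest magnitude maps to the format's largest value, and the format with the lowest squared error wins. The paper's online two-stage selection is not reproduced.

```python
import math


def quantize_block(block, levels):
    """Quantize a block to scaled magnitudes from `levels`, keeping signs.

    Returns the dequantized values and the squared reconstruction error.
    """
    peak = max(abs(x) for x in block)
    if peak == 0:
        return list(block), 0.0
    scale = peak / max(levels)  # per-block scaling factor
    out = []
    for x in block:
        # Nearest representable magnitude in the scaled domain.
        mag = min(levels, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(mag * scale, x))
    err = sum((a - b) ** 2 for a, b in zip(block, out))
    return out, err


def pick_format(block, formatbook):
    """Return the name of the format with the lowest error on this block."""
    return min(formatbook,
               key=lambda name: quantize_block(block, formatbook[name])[1])


# Hypothetical 4-bit formats: 8 magnitudes each, plus a sign bit.
formatbook = {
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],
    "dense_low": [0, 0.5, 1, 1.5, 2, 3, 4, 8],
}
flat_block = [7, 6, 5, -7]            # values spread evenly near the peak
outlier_block = [8, 0.5, 1.0, -1.5]   # one outlier, the rest small
choice_flat = pick_format(flat_block, formatbook)
choice_outlier = pick_format(outlier_block, formatbook)
```

The evenly spread block prefers the uniform grid, while the block with one outlier prefers the format that places more values near zero, which is the intuition behind choosing how to represent rather than only how to scale.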
♻ ☆ A Look Into News Avoidance Through AWRS: An Avoidance-Aware Recommender System SDM25
In recent years, journalists have expressed concerns about the increasing
trend of news article avoidance, especially within specific domains. This issue
has been exacerbated by the rise of recommender systems. Our research indicates
that recommender systems should consider avoidance as a fundamental factor. We
argue that news articles can be characterized by three principal elements:
exposure, relevance, and avoidance, all of which are closely interconnected. To
address these challenges, we introduce AWRS, an Avoidance-Aware Recommender
System. This framework incorporates avoidance awareness when recommending news,
based on the premise that news article avoidance conveys significant
information about user preferences. Evaluation results on three news datasets
in different languages (English, Norwegian, and Japanese) demonstrate that our
method outperforms existing approaches.
comment: SIAM International Conference on Data Mining (SDM25)
♻ ☆ RUIE: Retrieval-based Unified Information Extraction using Large Language Model COLING 2025
Unified information extraction (UIE) aims to extract diverse structured
information from unstructured text. While large language models (LLMs) have
shown promise for UIE, they require significant computational resources and
often struggle to generalize to unseen tasks. We propose RUIE (Retrieval-based
Unified Information Extraction), a framework that leverages in-context learning
for efficient task generalization. RUIE introduces a novel demonstration
selection mechanism combining LLM preferences with a keyword-enhanced reward
model, and employs a bi-encoder retriever trained through contrastive learning
and knowledge distillation. As the first trainable retrieval framework for UIE,
RUIE serves as a universal plugin for various LLMs. Experimental results on
eight held-out datasets demonstrate RUIE's effectiveness, with average F1-score
improvements of 19.22 and 3.22 compared to instruction-tuning methods and other
retrievers, respectively.
comment: To appear in COLING 2025 main conference
♻ ☆ Multi-Agent Consensus Seeking via Large Language Models
Multi-agent systems driven by large language models (LLMs) have shown
promising abilities for solving complex tasks in a collaborative manner. This
work considers a fundamental problem in multi-agent collaboration: consensus
seeking. When multiple agents work together, we are interested in how they can
reach a consensus through inter-agent negotiation. To that end, this work
studies a consensus-seeking task where the state of each agent is a numerical
value and they negotiate with each other to reach a consensus value. The
results reveal that, when not explicitly directed which strategy to adopt,
the LLM-driven agents primarily use the average strategy for consensus seeking,
although they occasionally use other strategies. Moreover, this work
analyzes the impact of the agent number, agent personality, and network
topology on the negotiation process. The findings reported in this work can
potentially lay the foundations for understanding the behaviors of LLM-driven
multi-agent systems for solving more complex tasks. Furthermore, LLM-driven
consensus seeking is applied to a multi-robot aggregation task. This
application demonstrates the potential of LLM-driven agents to achieve
zero-shot autonomous planning for multi-robot collaboration tasks. Project
website: windylab.github.io/ConsensusLLM/.
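The average strategy mentioned above has a simple numerical analogue, sketched here without any LLM in the loop. It assumes each agent replaces its value with the mean of its own value and its neighbors' values every round; the function name `consensus_round` and both topologies are hypothetical.

```python
def consensus_round(values, neighbors):
    """One negotiation round: each agent averages itself with its neighbors."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]


# Fully connected network: every agent sees everyone, so one round suffices.
full = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
after_one = consensus_round([10.0, 40.0, 70.0], full)

# Line network: the end agents only see the middle one, so agreement
# emerges gradually over many rounds.
line = {0: [1], 1: [0, 2], 2: [1]}
state = [0.0, 50.0, 100.0]
for _ in range(200):
    state = consensus_round(state, line)
spread = max(state) - min(state)
```

Comparing the two runs shows why network topology matters for the negotiation: the same update rule agrees immediately on a complete graph but needs repeated rounds on a sparse one.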
♻ ☆ Towards LifeSpan Cognitive Systems
Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, Nafis Sadeq, Xiusi Chen, Zexue He, Wei Wang, Gholamreza Haffari, Heng Ji, Julian McAuley
Building a human-like system that continuously interacts with complex
environments -- whether simulated digital worlds or human society -- presents
several key challenges. Central to this is enabling continuous, high-frequency
interactions, where the interactions are termed experiences. We refer to this
envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature
of LSCS is its ability to engage in incremental and rapid updates while
retaining and accurately recalling past experiences. In this paper we focus on
the domain of Large Language Models (LLMs), where we identify two major
challenges: (1) Abstraction and Experience Merging, and (2) Long-term Retention
with Accurate Recall. These properties are essential for storing new
experiences, organizing past experiences, and responding to the environment in
ways that leverage relevant historical data. Unlike language models with
continual learning, which typically rely on large corpora for fine-tuning and
focus on improving performance within specific domains or tasks, LSCS must
rapidly and incrementally update with new information from its environment at a
high frequency. Existing technologies with the potential of solving the above
two major challenges can be classified into four classes based on a conceptual
metric called Storage Complexity, which measures the relative space required to
store past experiences. Each of these four classes of technologies has its own
strengths and limitations, and we argue that none of them can achieve LSCS
alone. To this end, we propose a potential instantiation for LSCS that can
integrate all four classes of technologies. The new instantiation, serving as a
conjecture, operates through two core processes: Absorbing Experiences and
Generating Responses.
♻ ☆ Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection COLING 2025
Recent advances show that two-stream approaches have achieved outstanding
performance in hateful meme detection. However, hateful memes constantly evolve
as new memes emerge by fusing progressive cultural ideas, making existing
methods obsolete or ineffective. In this work, we explore the potential of
Large Multimodal Models (LMMs) for hateful meme detection. To this end, we
propose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE)
Prompting, by integrating the evolution attribute and in-context information of
memes. Specifically, Evolver simulates the evolving and expressing process of
memes and reasons through LMMs in a step-by-step manner. First, an evolutionary
pair mining module retrieves the top-k memes most similar to the input meme
from an external curated meme set. Second, an evolutionary information
extractor is designed to summarize the semantic regularities between the paired
memes for prompting. Finally, a contextual relevance amplifier enhances the
in-context hatefulness information to boost the search for evolutionary
processes. Extensive experiments on public FHM, MAMI, and HarM datasets show
that CoE prompting can be incorporated into existing LMMs to improve their
performance. More encouragingly, it can serve as an interpretive tool to
promote the understanding of the evolution of social memes. Homepage:
https://github.com/inFaaa/Evolver
comment: Accepted by COLING 2025
♻ ☆ Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
Offering a promising solution to the scalability challenges associated with
human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an
approach to evaluating large language models (LLMs). However, there are still
many open questions about the strengths and weaknesses of this paradigm, and
what potential biases it may hold. In this paper, we present a comprehensive
study of the performance of various LLMs acting as judges, focusing on a clean
scenario in which inter-human agreement is high. Investigating thirteen judge
models of different model sizes and families, judging answers of nine different
'exam-taker models' - both base and instruction-tuned - we find that only the
best (and largest) models achieve reasonable alignment with humans. However,
they are still quite far behind inter-human agreement and their assigned scores
may still differ by up to 5 points from human-assigned scores. In terms of
their ranking of the nine exam-taker models, however, smaller models and
even the lexical metric "contains" may also provide a reasonable signal. Through error
analysis and other studies, we identify vulnerabilities in judge models, such
as their sensitivity to prompt complexity and length, and a tendency toward
leniency. The fact that even the best judges differ from humans in this
comparatively simple setup suggests that caution may be wise when using judges
in more complex setups. Lastly, our research rediscovers the importance of
using alignment metrics beyond simple percent alignment, showing that judges
with high percent agreement can still assign vastly different scores.
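The point about metrics beyond percent agreement can be illustrated with a chance-corrected statistic. The abstract does not name the metric the authors use, so this sketch uses Cohen's kappa as a stand-in; the labels below are invented for illustration and are not results from the paper.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Invented example: a lenient judge marks every answer as correct.
human = ["correct"] * 9 + ["incorrect"]
judge = ["correct"] * 10

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
```

Here the judge agrees with the human on 90% of items, yet kappa is zero because a judge that always answers "correct" carries no information beyond chance, which also shows how leniency can hide behind a high agreement rate.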
♻ ☆ FLAME: Learning to Navigate with Multimodal LLM in Urban Environments AAAI 2025
Large Language Models (LLMs) have demonstrated potential in
Vision-and-Language Navigation (VLN) tasks, yet current applications face
challenges. While LLMs excel in general conversation scenarios, they struggle
with specialized navigation tasks, yielding suboptimal performance compared to
specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied
Agent), a novel Multimodal LLM-based agent and architecture designed for urban
VLN tasks that efficiently handles multiple observations. Our approach
implements a three-phase tuning technique for effective adaptation to
navigation tasks, including single perception tuning for street view
description, multiple perception tuning for route summarization, and end-to-end
training on VLN datasets. The augmented datasets are synthesized automatically.
Experimental results demonstrate FLAME's superiority over existing methods,
surpassing state-of-the-art methods with a 7.3% increase in task completion on
the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs)
in complex navigation tasks, representing an advancement towards applications
of MLLMs in the field of embodied intelligence.
comment: Accepted to AAAI 2025 (Oral)
♻ ☆ CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification AAAI 2025
Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song
Large Language Models (LLMs) have made significant progress in code
generation, offering developers groundbreaking automated programming support.
However, LLMs often generate code that is syntactically correct and even
semantically plausible, but may not execute as expected or fulfill specified
requirements. This phenomenon of hallucinations in the code domain has not been
systematically explored. To advance the community's understanding and research
on this issue, we introduce the concept of code hallucinations and propose a
classification method for code hallucination based on execution verification.
We categorize code hallucinations into four main types: mapping, naming,
resource, and logic hallucinations, with each category further divided into
different subcategories to understand and address the unique challenges faced
by LLMs in code generation with finer granularity. Additionally, we present a
dynamic detection algorithm called CodeHalu designed to detect and quantify
code hallucinations. We also introduce the CodeHaluEval benchmark, which
includes 8,883 samples from 699 tasks, to systematically and quantitatively
evaluate code hallucinations. By evaluating 17 popular LLMs using this
benchmark, we reveal significant differences in their accuracy and reliability
in code generation, offering detailed insights for further improving the code
generation capabilities of LLMs. The CodeHalu benchmark and code are publicly
available at https://github.com/yuchen814/CodeHalu.
comment: Accepted by AAAI 2025 main conference
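A rough sketch of execution-based verification follows: run a generated snippet on a test case and assign one of the four categories from the abstract based on what happens. The mapping from Python exception types to categories is an assumption made for illustration, as are the names `classify`, `CATEGORY_BY_EXCEPTION`, and the `solve` entry point; the actual CodeHalu algorithm may differ. A real harness would also sandbox the code and enforce timeouts.

```python
# Assumed mapping from exception types to hallucination categories.
CATEGORY_BY_EXCEPTION = [
    ((NameError, AttributeError, ImportError), "naming"),
    ((TypeError, ValueError, IndexError, KeyError), "mapping"),
    ((MemoryError, RecursionError, OverflowError), "resource"),
]


def classify(source, test_input, expected):
    """Execute `source`, call its `solve` function, and label the outcome."""
    namespace = {}
    try:
        exec(source, namespace)  # trusted toy snippets only; no sandboxing here
        result = namespace["solve"](test_input)
    except Exception as exc:
        for types, label in CATEGORY_BY_EXCEPTION:
            if isinstance(exc, types):
                return label
        return "other"
    # The code ran without error; a wrong answer counts as a logic hallucination.
    return "none" if result == expected else "logic"


# Toy task: double the input.
good = "def solve(x):\n    return x * 2\n"
wrong_logic = "def solve(x):\n    return x + 1\n"
bad_name = "def solve(x):\n    return doubled(x)\n"
bad_type = "def solve(x):\n    return x * 2 + 'a'\n"

labels = [classify(src, 3, 6) for src in (good, wrong_logic, bad_name, bad_type)]
```

All four snippets are syntactically valid, and only execution separates the correct one from the three failure modes.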
♻ ☆ Assessing the Alignment of FOL Closeness Metrics with Human Judgement
The recent successful paradigm of solving logical reasoning problems with
tool-augmented large language models (LLMs) leverages translation of natural
language statements into First-Order Logic (FOL) and external theorem provers.
However, the correctness of FOL statements, comprising operators and text
predicates, often goes unverified due to the lack of a reliable evaluation
metric for comparing generated and ground-truth FOLs. In this paper, we present
a comprehensive study of the sensitivity of existing metrics and their alignment
with human judgement on FOL evaluation. Using ground-truth FOLs, we carefully
design various perturbations of the ground truth to assess metric
sensitivity. We sample FOL translation candidates for natural language
statements and measure the ranking alignment between automatic metrics and
human annotators. Our empirical findings highlight oversensitivity in the
n-gram metric BLEU for text perturbations, the semantic graph metric Smatch++
for structural perturbations, and the FOL metric for operator perturbations. We also
observe a closer alignment between BertScore and human judgement. Additionally,
we show that combining metrics enhances both alignment and sensitivity compared
to using individual metrics.
comment: Code: https://github.com/RamyaKeerthy/AlignmentFOL
♻ ☆ MedCT: A Clinical Terminology Graph for Generative AI Applications in Healthcare
We introduce the world's first clinical terminology for the Chinese
healthcare community, namely MedCT, accompanied by a clinical foundation model
MedBERT and an entity linking model MedLink. The MedCT system enables
standardized and programmable representation of Chinese clinical data,
in turn stimulating the development of new medicines, treatment pathways,
and better patient outcomes for the populous Chinese community. Moreover, the
MedCT knowledge graph provides a principled mechanism to minimize the
hallucination problem of large language models (LLMs), therefore achieving
significant levels of accuracy and safety in LLM-based clinical applications.
By leveraging the LLMs' emergent capabilities of generativeness and
expressiveness, we were able to rapidly build a production-quality terminology
system and deploy it to the real-world clinical field within three months, while
classical terminologies like SNOMED CT have gone through more than twenty years
of development. Our experiments show that the MedCT system achieves
state-of-the-art (SOTA) performance in semantic matching and entity linking
tasks, not only for Chinese but also for English. We also conducted a
longitudinal field experiment by applying MedCT and LLMs in a representative
spectrum of clinical tasks, including electronic health record (EHR)
auto-generation and medical document search for diagnostic decision making. Our
study shows that MedCT brings a range of benefits to clinical workflows and patient
outcomes, especially in the new genre of clinical LLM applications. We present
our approach in sufficient engineering detail, such that implementing a
clinical terminology for other non-English societies should be readily
reproducible. We openly release our terminology, models and algorithms, along
with real-world clinical datasets for the development.
♻ ☆ Quantifying the Importance of Data Alignment in Downstream Model Performance
Contrary to the conventional emphasis on dataset size, we explore the role of
data alignment -- an often overlooked aspect of data quality -- in training
capable Large Language Models (LLMs). To do so, we use the Task2Vec-based
alignment coefficient, a quantitative measure of the similarity between two
datasets, to quantify the impact of alignment between training data and
evaluation data on downstream performance. In particular, we conduct controlled
interventional experiments for two settings: 1. the impact of
increased alignment coefficients between various pre-training (pt) and
evaluation datasets, and 2. the impact of increased alignment coefficients
between domain-specific fine-tuning (ft) and domain-specific evaluation data.
The domain specific task we explore is Autoformalization -- the machine
translation task between natural language and code for formal verification. In
both settings, we find a strong, predictable negative correlation between the
alignment coefficient of a model's training and evaluation data and the model's
loss/perplexity on the respective downstream task. These findings suggest a
re-evaluation of LLM training approaches, demonstrating the relevance of data
alignment compared to data quantity, especially in specialized downstream tasks
such as Autoformalization.