Computation and Language
☆ LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Large language models (LLMs) have shown remarkable potential in processing
long sequences, yet efficiently serving these long-context models remains
challenging due to the quadratic computational complexity of attention in the
prefilling stage and the large memory footprint of the KV cache in the decoding
stage. To address these issues, we introduce LServe, an efficient system that
accelerates long-sequence LLM serving via hybrid sparse attention. This method
unifies different hardware-friendly, structured sparsity patterns for both
prefilling and decoding attention into a single framework, where computations
on less important tokens are skipped block-wise. LServe demonstrates the
compatibility of static and dynamic sparsity in long-context LLM attention.
This design enables multiplicative speedups by combining these optimizations.
Specifically, we convert half of the attention heads to nearly free streaming
heads in both the prefilling and decoding stages. Additionally, we find that
only a constant number of KV pages is required to preserve long-context
capabilities, irrespective of context length. We then design a hierarchical KV
page selection policy that dynamically prunes KV pages based on query-centric
similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and
decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is
released at https://github.com/mit-han-lab/omniserve.
comment: Accepted by MLSys 2025. Code available at:
https://github.com/mit-han-lab/omniserve
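As a rough illustration of the page-level pruning described in the LServe abstract, the sketch below scores each KV page against the current query and keeps a fixed budget of pages regardless of how many pages exist. The scoring rule (an upper bound on the query-key dot product derived from per-page elementwise key minima and maxima) and all names (`page_upper_bound`, `select_pages`, `budget`) are assumptions made for illustration; the abstract does not specify LServe's actual hierarchical selection policy.

```python
from typing import List

Vector = List[float]


def page_upper_bound(query: Vector, k_min: Vector, k_max: Vector) -> float:
    """Upper bound on q.k over all keys in a page, from elementwise min/max."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, k_min, k_max))


def select_pages(query: Vector, pages: List[List[Vector]], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring pages, in original order.

    `pages` is a list of pages; each page is a list of key vectors.
    """
    scores = []
    for keys in pages:
        dims = list(zip(*keys))            # transpose: one tuple per dimension
        k_min = [min(d) for d in dims]     # would be precomputed in a real system
        k_max = [max(d) for d in dims]
        scores.append(page_upper_bound(query, k_min, k_max))
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    pages = [
        [[1.0, 0.0], [0.5, 0.0]],
        [[0.0, 1.0], [0.0, 2.0]],
        [[-1.0, -1.0], [-2.0, 0.0]],
    ]
    print(select_pages([-1.0, 0.0], pages, budget=2))
```

Attention would then be computed only over the selected pages, so the per-step decoding cost in this sketch depends on `budget` rather than on the full context length.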
☆ Interpretable Text Embeddings and Text Similarity Explanation: A Primer
Text embeddings and text embedding models are a backbone of many AI and NLP
systems, particularly those involving search. However, interpretability
challenges persist, especially in explaining obtained similarity scores, which
is crucial for applications requiring transparency. In this paper, we give a
structured overview of interpretability methods specializing in explaining
those similarity scores, an emerging research area. We study the methods'
individual ideas and techniques, evaluating their potential for improving
interpretability of text embeddings and explaining predicted similarities.
☆ Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
Large language models (LLMs) often fail to ask effective questions under
uncertainty, making them unreliable in domains where proactive
information-gathering is essential for decision-making. We present ALFA, a
framework that improves LLM question-asking by (i) decomposing the notion of a
"good" question into a set of theory-grounded attributes (e.g., clarity,
relevance), (ii) controllably synthesizing attribute-specific question
variations, and (iii) aligning models via preference-based optimization to
explicitly learn to ask better questions along these fine-grained attributes.
Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs
dataset, composed of 17k real-world clinical interactions augmented with 80k
attribute-specific preference pairs of follow-up questions, as well as a novel
expert-annotated interactive healthcare QA task to evaluate question-asking
abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on
MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level
win-rate of 64.4% and strong generalizability. Our findings suggest that
explicitly guiding question-asking with structured, fine-grained attributes
offers a scalable path to improve LLMs, especially in expert application
domains.
comment: 22 pages, 8 figures, 8 tables
☆ FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
Speculative sampling has emerged as an important technique for accelerating
the auto-regressive generation process of large language models (LLMs) by
utilizing a draft-then-verify mechanism to produce multiple tokens per forward
pass. While state-of-the-art speculative sampling methods use only a single
layer and a language modeling (LM) head as the draft model to achieve
impressive layer compression, their efficiency gains are substantially reduced
for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens.
To address this, we present FR-Spec, a frequency-ranked speculative sampling
framework that optimizes draft candidate selection through vocabulary space
compression. By constraining the draft search to a frequency-prioritized token
subset, our method reduces LM Head computation overhead by 75% while ensuring
the equivalence of the final output distribution. Experiments across multiple
datasets demonstrate an average of 1.12x speedup over the
state-of-the-art speculative sampling method EAGLE-2.
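The vocabulary-compression idea in the FR-Spec abstract can be sketched as follows: the draft step evaluates the LM head only on a frequency-ranked subset of token ids, while verification by the full model (not shown) is what keeps the final output distribution unchanged. The function names, the toy LM head, and the way frequencies are counted are all assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence


def frequent_token_ids(corpus_token_ids: Sequence[int], keep: int) -> List[int]:
    """Ids of the `keep` most frequent tokens in a reference corpus."""
    counts = Counter(corpus_token_ids)
    return [tok for tok, _ in counts.most_common(keep)]


def draft_logits(hidden: Sequence[float],
                 lm_head: Sequence[Sequence[float]],
                 subset: Sequence[int]) -> Dict[int, float]:
    """Compute logits only for the LM-head rows listed in `subset`."""
    return {tok: sum(h * w for h, w in zip(hidden, lm_head[tok])) for tok in subset}


def draft_next_token(hidden, lm_head, subset) -> int:
    """Greedy draft candidate restricted to the frequent-token subset."""
    logits = draft_logits(hidden, lm_head, subset)
    return max(logits, key=logits.get)


if __name__ == "__main__":
    corpus = [3, 3, 3, 1, 1, 0, 2]
    subset = frequent_token_ids(corpus, keep=2)
    lm_head = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.5, 0.5]]  # 4 tokens, dim 2
    print(subset, draft_next_token([1.0, 2.0], lm_head, subset))
```

In this sketch the number of multiply-adds in the draft LM head scales with `len(subset)` rather than with the full vocabulary size, so keeping a quarter of the rows would cut that cost by 75%.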
☆ Prompt-to-Leaderboard
Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
Large language model (LLM) evaluations typically rely on aggregated metrics
like accuracy or human preference, averaging across users and prompts. This
averaging obscures user- and prompt-specific variations in model performance.
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces
leaderboards specific to a prompt. The core idea is to train an LLM taking
natural language prompts as input to output a vector of Bradley-Terry
coefficients which are then used to predict the human preference vote. The
resulting prompt-dependent leaderboards allow for unsupervised task-specific
evaluation, optimal routing of queries to models, personalization, and
automated evaluation of model strengths and weaknesses. Data from Chatbot Arena
suggest that P2L better captures the nuanced landscape of language model
performance than the averaged leaderboard. Furthermore, our findings suggest
that P2L's ability to produce prompt-specific evaluations follows a power law
scaling similar to that observed in LLMs themselves. In January 2025, the
router we trained based on this methodology achieved the #1 spot in the
Chatbot Arena leaderboard. Our code is available at this GitHub link:
https://github.com/lmarena/p2l.
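To make the role of the Bradley-Terry coefficients concrete, here is a minimal sketch of how a per-prompt coefficient vector could be turned into a pairwise win probability, a leaderboard, and a routing decision. It uses the standard logistic form of the Bradley-Terry model; the model names and coefficient values are made up, and in P2L the coefficients would come from the trained LLM rather than being hard-coded.

```python
import math
from typing import Dict, List


def win_probability(coef_a: float, coef_b: float) -> float:
    """Bradley-Terry probability that model A is preferred over model B."""
    return 1.0 / (1.0 + math.exp(-(coef_a - coef_b)))


def leaderboard(coefs: Dict[str, float]) -> List[str]:
    """Models ordered from highest to lowest coefficient for one prompt."""
    return sorted(coefs, key=coefs.get, reverse=True)


def route(coefs: Dict[str, float]) -> str:
    """Send the prompt to the model ranked first on its own leaderboard."""
    return leaderboard(coefs)[0]


if __name__ == "__main__":
    # Hypothetical coefficients predicted for a single prompt.
    coefs = {"model-a": 0.2, "model-b": 1.0, "model-c": -0.5}
    print(leaderboard(coefs))
    print(round(win_probability(coefs["model-b"], coefs["model-a"]), 3))
```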
☆ CLIPPER: Compression enables long-context synthetic data generation
LLM developers are increasingly reliant on synthetic data, but generating
high-quality data for complex long-context reasoning tasks remains challenging.
We introduce CLIPPER, a compression-based approach for generating synthetic
data tailored to narrative claim verification - a task that requires reasoning
over a book to verify a given claim. Instead of generating claims directly from
the raw text of the book, which results in artifact-riddled claims, CLIPPER
first compresses the book into chapter outlines and book summaries and then
uses these intermediate representations to generate complex claims and
corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces
claims that are more valid, grounded, and complex. Using CLIPPER, we construct
a dataset of 19K synthetic book claims paired with their source texts and
chain-of-thought reasoning, and use it to fine-tune three open-weight models.
Our best model achieves breakthrough results on narrative claim verification
(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for
sub-10B models on the NoCha leaderboard. Further analysis shows that our models
generate more detailed and grounded chain-of-thought reasoning while also
improving performance on other narrative understanding tasks (e.g.,
NarrativeQA).
☆ GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks
Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu
Large Language Models (LLMs) have shown great promise in tool-making, yet
existing frameworks often struggle to efficiently construct reliable toolsets
and are limited to single-task settings. To address these challenges, we
propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that
dynamically constructs and evolves a hierarchical graph of reusable tools
across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft),
agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date,
TabMWP). Our results show that GATE achieves up to 4.3x faster milestone
completion in Minecraft compared to the previous SOTA, and provides an average
improvement of 9.23% over existing tool-making methods in code generation tasks
and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution,
balancing tool quantity, complexity, and functionality while maintaining high
efficiency. Code and data are available at
https://github.com/ayanami2003/GATE.
comment: 8 pages of main text, 38 pages of appendices
☆ Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
Reasoning about images with rich text, such as charts and documents, is a
critical application of vision-language models (VLMs). However, VLMs often
struggle in these domains due to the scarcity of diverse text-rich
vision-language data. To address this challenge, we present CoSyn, a framework
that leverages the coding capabilities of text-only large language models
(LLMs) to automatically create synthetic text-rich multimodal data. Given input
text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts
an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic
images. With the underlying code as textual representations of the synthetic
images, CoSyn can generate high-quality instruction-tuning data, again relying
on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K
images and 2.7M rows of vision-language instruction-tuning data. Comprehensive
experiments on seven benchmarks demonstrate that models trained on our
synthetic data achieve state-of-the-art performance among competitive
open-source models, including Llama 3.2, and surpass proprietary models such as
GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing
data, enabling VLMs to ground information within input images, showcasing its
potential for developing multimodal agents capable of acting in real-world
environments.
comment: 20 pages, 19 figures, 9 tables, website:
https://yueyang1996.github.io/cosyn/
☆ Revealing and Mitigating Over-Attention in Knowledge Editing
Large Language Models have demonstrated superior performance across a wide
range of tasks, but they still exhibit undesirable errors due to incorrect
knowledge learned from the training data. To avoid this, knowledge editing
methods emerged to precisely edit the specific model knowledge via efficiently
modifying a very small percentage of parameters. However, those methods can
lead to the problem of Specificity Failure, where the existing knowledge and
capabilities are severely degraded due to editing. Our preliminary analysis
indicates that Specificity Failure
primarily stems from the model's attention heads assigning excessive attention
scores to entities related to the edited knowledge, thereby unduly focusing on
specific snippets within the context, which we denote as the Attention Drift
phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet
effective method, Selective Attention Drift Restriction (SADR), which introduces
an additional regularization term during the knowledge editing process to
restrict changes in the attention weight distribution, thereby preventing undue
focus on the edited entity. Experiments on five frequently used strong LLMs
demonstrate the effectiveness of our method, where SADR can significantly
mitigate Specificity Failure in the predominant knowledge editing tasks.
☆ Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Multi-head Latent Attention (MLA) is an innovative architecture proposed by
DeepSeek, designed to ensure efficient and economical inference by
significantly compressing the Key-Value (KV) cache into a latent vector.
Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its
variants such as Grouped-Query Attention (GQA) exhibit significant cost
disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA
without pre-training from scratch is both meaningful and challenging. This
paper proposes the first data-efficient fine-tuning method for transitioning
from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE,
we remove RoPE from dimensions of queries and keys that contribute less to the
attention scores; for low-rank approximation, we introduce joint SVD
approximations based on the pre-trained parameters of keys and values. These
carefully designed strategies enable MHA2MLA to recover performance using only
a small fraction (0.3% to 0.6%) of the data, significantly reducing inference
costs while seamlessly integrating with compression techniques such as KV cache
quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%,
with only a 0.5% drop in LongBench performance.
comment: 16 pages, 8 figures
☆ LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-generation capabilities of VLMs. Our 7B parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
☆ Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs
While large language models demonstrate remarkable capabilities at
task-specific applications through fine-tuning, extending these benefits across
diverse languages is essential for broad accessibility. However, effective
cross-lingual transfer is hindered by LLM performance gaps across languages and
the scarcity of fine-tuning data in many languages. Through analysis of LLM
internal representations from over 1,000 language pairs, we discover that
middle layers exhibit the strongest potential for cross-lingual alignment.
Building on this finding, we propose a middle-layer alignment objective
integrated into task-specific training. Our experiments on slot filling,
machine translation, and structured text generation show consistent
improvements in cross-lingual transfer, especially to lower-resource languages.
The method is robust to the choice of alignment languages and generalizes to
languages unseen during alignment. Furthermore, we show that separately trained
alignment modules can be merged with existing task-specific modules, improving
cross-lingual capabilities without full re-training. Our code is publicly
available (https://github.com/dannigt/mid-align).
☆ Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of
thought (CoT), a sequence of reasoning steps that the model supposedly used to
produce its prediction. However, despite much work on CoT prompting, it is
unclear if CoT reasoning is faithful to the models' parametric beliefs. We
introduce a framework for measuring parametric faithfulness of generated
reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an
instance of this framework. FUR erases information contained in reasoning steps
from model parameters. We perform experiments unlearning CoTs of four LMs
prompted on four multi-choice question answering (MCQA) datasets. Our
experiments show that FUR is frequently able to change the underlying models'
prediction by unlearning key steps, indicating when a CoT is parametrically
faithful. Further analysis shows that CoTs generated by models post-unlearning
support different answers, hinting at a deeper effect of unlearning.
Importantly, CoT steps identified as important by FUR do not align well with
human notions of plausibility, emphasizing the need for specialized alignment.
☆ eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables NAACL 2025
Large Language Models (LLMs) have demonstrated exceptional versatility across
diverse domains, yet their application in e-commerce remains underexplored due
to a lack of domain-specific datasets. To address this gap, we introduce
eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce,
including detailed product attributes and user-specific queries. Leveraging
eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to
produce high-quality, attribute-specific product reviews from structured
tabular data. Fine-tuned models were rigorously evaluated using standard
Table2Text metrics, alongside correctness, faithfulness, and fluency
assessments. Our results demonstrate substantial improvements in generating
contextually accurate reviews, highlighting the transformative potential of
tailored datasets and fine-tuning methodologies in optimizing e-commerce
workflows. This work highlights the potential of LLMs in e-commerce workflows
and the essential role of domain-specific datasets in tailoring them to
industry-specific challenges.
comment: NAACL 2025 (Industry Track)
☆ Optimizing Model Selection for Compound AI Systems
Compound AI systems that combine multiple LLM calls, such as self-refine and
multi-agent-debate, achieve strong performance on many AI tasks. We address a
core question in optimizing compound systems: for each LLM call or module in
the system, how should one decide which LLM to use? We show that these LLM
choices have a large effect on quality, but the search space is exponential. We
propose LLMSelector, an efficient framework for model selection in compound
systems, which leverages two key empirical insights: (i) end-to-end performance
is often monotonic in how well each module performs, with all other modules
held fixed, and (ii) per-module performance can be estimated accurately by an
LLM. Building upon these insights, LLMSelector iteratively selects one module
and allocates to it the model with the highest module-wise performance, as
estimated by an LLM, until no further gain is possible. LLMSelector is
applicable to any compound system with a bounded number of modules, and its
number of API calls scales linearly with the number of modules, achieving
high-quality model allocation both empirically and theoretically. Experiments
with popular compound systems such as multi-agent debate and self-refine using
LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector
confers 5%-70% accuracy gains compared to using the same LLM for all modules.
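The iterative allocation loop described in the LLMSelector abstract can be sketched roughly as below. The `module_score` callback stands in for the LLM-based estimate of per-module performance and is a hypothetical interface, as are the function name, the initial allocation, and the `max_rounds` cap; the sketch only illustrates the "pick a module, give it the best-scoring model, repeat until nothing improves" structure.

```python
from typing import Callable, Dict, List

Allocation = Dict[str, str]


def select_models(modules: List[str],
                  models: List[str],
                  module_score: Callable[[str, str, Allocation], float],
                  max_rounds: int = 10) -> Allocation:
    """Greedy module-wise model allocation.

    module_score(module, model, allocation) estimates how well `model` would
    perform in `module` with the other modules held at `allocation`.
    """
    allocation = {module: models[0] for module in modules}  # arbitrary start
    for _ in range(max_rounds):
        changed = False
        for module in modules:
            current = module_score(module, allocation[module], allocation)
            best = max(models, key=lambda m: module_score(module, m, allocation))
            if module_score(module, best, allocation) > current:
                allocation[module] = best
                changed = True
        if not changed:  # no module can be improved any further
            break
    return allocation


if __name__ == "__main__":
    # Toy score table in place of an LLM-based estimator.
    table = {("generator", "A"): 0.6, ("generator", "B"): 0.9,
             ("critic", "A"): 0.8, ("critic", "B"): 0.5}
    score = lambda module, model, alloc: table[(module, model)]
    print(select_models(["generator", "critic"], ["A", "B"], score))
```

Each round in this sketch evaluates every model once per module, so the number of scoring calls per round grows linearly with the number of modules rather than with the exponential space of joint allocations.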
☆ From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Our ability to continuously acquire, organize, and leverage knowledge is a
key feature of human intelligence that AI systems must approximate to unlock
their full potential. Given the challenges in continual learning with large
language models (LLMs), retrieval-augmented generation (RAG) has become the
dominant way to introduce new information. However, its reliance on vector
retrieval hinders its ability to mimic the dynamic and interconnected nature of
human long-term memory. Recent RAG approaches augment vector embeddings with
various structures like knowledge graphs to address some of these gaps, namely
sense-making and associativity. However, their performance on more basic
factual memory tasks drops considerably below standard RAG. We address this
unintended deterioration and propose HippoRAG 2, a framework that outperforms
standard RAG comprehensively on factual, sense-making, and associative memory
tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in
HippoRAG and enhances it with deeper passage integration and more effective
online use of an LLM. This combination pushes this RAG system closer to the
effectiveness of human long-term memory, achieving a 7% improvement in
associative memory tasks over the state-of-the-art embedding model while also
exhibiting superior factual knowledge and sense-making memory capabilities.
This work paves the way for non-parametric continual learning for LLMs. Our
code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG.
comment: Code and data to be released at:
https://github.com/OSU-NLP-Group/HippoRAG
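For readers unfamiliar with Personalized PageRank, which the HippoRAG 2 abstract names as its starting point, a minimal power-iteration sketch follows. The graph, the seed weights, the damping value, and the function name are illustrative assumptions; how HippoRAG 2 builds its graph, chooses seeds from a query, or integrates passages is not described in the abstract and is not modeled here.

```python
from typing import Dict, List


def personalized_pagerank(graph: Dict[str, List[str]],
                          seeds: Dict[str, float],
                          damping: float = 0.5,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration for PageRank with restarts biased toward `seeds`.

    `graph` maps each node to its outgoing neighbours; every neighbour must
    also appear as a key. `seeds` holds non-negative restart weights.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    scores = personalized_pagerank(graph, seeds={"a": 1.0})
    print(sorted(scores, key=scores.get, reverse=True))
```

Nodes close to the seed set end up with higher scores, which is what makes this kind of ranking a natural fit for associative, multi-hop retrieval over a knowledge graph.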
☆ Rapid Word Learning Through Meta In-Context Learning
Humans can quickly learn a new word from a few illustrative examples, and
then systematically and flexibly use it in novel contexts. Yet the abilities of
current language models for few-shot word learning, and methods for improving
these abilities, are underexplored. In this study, we introduce a novel method,
Meta-training for IN-context learNing Of Words (Minnow). This method trains
language models to generate new examples of a word's usage given a few
in-context examples, using a special placeholder token to represent the new
word. This training is repeated on many new words to develop a general
word-learning ability. We find that training models from scratch with Minnow on
human-scale child-directed language enables strong few-shot word learning,
comparable to a large language model (LLM) pre-trained on orders of magnitude
more data. Furthermore, through discriminative and generative evaluations, we
demonstrate that finetuning pre-trained LLMs with Minnow improves their ability
to discriminate between new words, identify syntactic categories of new words,
and generate reasonable new usages and definitions for new words, based on one
or a few in-context examples. These findings highlight the data efficiency of
Minnow and its potential to improve language model performance in word learning
tasks.
☆ ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Efficient and privacy-preserving multimodal interaction is essential as AR,
VR, and modern smartphones with powerful cameras become primary interfaces for
human-computer communication. Existing powerful large vision-language models
(VLMs) enabling multimodal interaction often rely on cloud-based processing,
raising significant concerns about (1) visual privacy by transmitting sensitive
vision data to servers, and (2) their limited real-time, on-device usability.
This paper explores Visual Instruction Rewriting, a novel approach that
transforms multimodal instructions into text-only commands, allowing seamless
integration of lightweight on-device instruction rewriter VLMs (250M
parameters) with existing conversational AI systems, enhancing vision data
privacy. To achieve this, we present a dataset of over 39,000 examples across
14 domains and develop a compact VLM, pretrained on image captioning datasets
and fine-tuned for instruction rewriting. Experimental results, evaluated
through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic
parsing analysis, demonstrate that even a quantized version of the model
(<500MB storage footprint) can achieve effective instruction rewriting, thus
enabling privacy-focused, multimodal AI applications.
comment: 12 pages, 7 figures, 3 tables
☆ Harnessing PDF Data for Improving Japanese Large Multimodal Models
Large Multimodal Models (LMMs) have demonstrated strong performance in
English, but their effectiveness in Japanese remains limited due to the lack of
high-quality training data. Current Japanese LMMs often rely on translated
English datasets, restricting their ability to capture Japan-specific cultural
knowledge. To address this, we explore the potential of Japanese PDF data as a
training resource, an area that remains largely underutilized. We introduce a
fully automated pipeline that leverages pretrained models to extract image-text
pairs from PDFs through layout analysis, OCR, and vision-language pairing,
removing the need for manual annotation. Additionally, we construct instruction
data from extracted image-text pairs to enrich the training data. To evaluate
the effectiveness of PDF-derived data, we train Japanese LMMs and assess their
performance on the Japanese LMM Benchmark. Our results demonstrate substantial
improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench.
Further analysis highlights the impact of PDF-derived data on various factors,
such as model size and language models, reinforcing its value as a multimodal
resource for Japanese LMMs. We plan to make the source code and data publicly
available upon acceptance.
comment: 15 pages, 8 figures
☆ SurveyX: Academic Survey Automation via Large Language Models
Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu Li
Large Language Models (LLMs) have demonstrated exceptional comprehension
capabilities and a vast knowledge base, suggesting that LLMs can serve as
efficient tools for automated survey generation. However, recent research
related to automated survey generation remains constrained by some critical
limitations like finite context window, lack of in-depth content discussion,
and absence of systematic evaluation frameworks. Inspired by human writing
processes, we propose SurveyX, an efficient and organized system for automated
survey generation that decomposes the survey composing process into two phases:
the Preparation and Generation phases. By innovatively introducing online
reference retrieval, a pre-processing method called AttributeTree, and a
re-polishing process, SurveyX significantly enhances the efficacy of survey
composition. Experimental evaluation results show that SurveyX outperforms
existing automated survey generation systems in content quality (0.259
improvement) and citation quality (1.76 enhancement), approaching human expert
performance across multiple evaluation dimensions. Examples of surveys
generated by SurveyX are available on www.surveyx.cn
comment: 15 pages, 16 figures
☆ Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
Inspired by the success of DeepSeek-R1, we explore the potential of
rule-based reinforcement learning (RL) in large reasoning models. To analyze
reasoning dynamics, we use synthetic logic puzzles as training data due to
their controllable complexity and straightforward answer verification. We make
some key technical contributions that lead to effective and stable RL training:
a system prompt that emphasizes the thinking and answering process, a stringent
format reward function that penalizes outputs for taking shortcuts, and a
straightforward training recipe that achieves stable convergence. Our 7B model
develops advanced reasoning skills (such as reflection, verification, and
summarization) that are absent from the logic corpus. Remarkably, after training
on just 5K logic problems, it demonstrates generalization abilities to the
challenging math benchmarks AIME and AMC.
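A rule-based format reward of the kind mentioned in the Logic-RL abstract might look like the sketch below. The `<think>`/`<answer>` tag names, the reward values of 1.0 and -1.0, and the specific checks are assumptions chosen for illustration; the abstract only states that the format reward is stringent and penalizes shortcut outputs.

```python
import re

# Exactly one reasoning block followed by exactly one answer block.
_PATTERN = re.compile(
    r"\A\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*\Z", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(output: str) -> float:
    """Return 1.0 for a well-formed output, -1.0 otherwise."""
    match = _PATTERN.match(output)
    if match is None:
        return -1.0  # missing or misordered tags, e.g. answering without thinking
    thinking, answer = match.group(1), match.group(2)
    if not thinking.strip() or not answer.strip():
        return -1.0  # empty reasoning or empty answer
    if any(tag in thinking or tag in answer for tag in _TAGS):
        return -1.0  # repeated or nested tags
    return 1.0


if __name__ == "__main__":
    good = "<think>A says B lies, so A is a knight.</think><answer>A: knight</answer>"
    print(format_reward(good), format_reward("<answer>A: knight</answer>"))
```

A separate answer-correctness reward would normally be combined with this check; it is omitted here because verifying an answer depends on the puzzle format.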
☆ Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
With the exponential growth of research facilitated by modern technology and
improved accessibility, scientific discoveries have become increasingly
fragmented within and across fields. This makes it challenging to assess the
significance, novelty, incremental findings, and equivalent ideas between
related works, particularly those from different research communities. Large
language models (LLMs) have recently demonstrated strong quantitative and
qualitative reasoning abilities, and multi-agent LLM debates have shown promise
in handling complex reasoning tasks by exploring diverse perspectives and
reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a
framework which converts scientific papers into LLM personas that debate their
respective novelties. To emphasize structured, critical reasoning rather than
focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling
fine-grained analysis of independent novelty arguments within scholarly
articles. Through experiments on scientific literature across various domains,
evaluated by expert researchers, we demonstrate that ToD generates informative
arguments, effectively contrasts papers, and supports researchers in their
literature review.
comment: Code available at: https://github.com/pkargupta/tree-of-debate
☆ Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning NAACL 2025
Fact verification (FV) aims to assess the veracity of a claim based on
relevant evidence. The traditional approach for automated FV includes a
three-part pipeline relying on short evidence snippets and encoder-only
inference models. More recent approaches leverage the multi-turn nature of LLMs
to address FV as a step-by-step problem where questions requesting additional
context are generated and answered until there is enough information to make a
decision. This iterative method makes the verification process rational and
explainable. While these methods have been tested for encyclopedic claims,
exploration on domain-specific and realistic claims is missing. In this work,
we apply an iterative FV system on three medical fact-checking datasets and
evaluate it with multiple settings, including different LLMs, external web
search, and structured reasoning using logic predicates. We demonstrate
improvements in the final performance over traditional approaches and the high
potential of step-by-step FV systems for domain-specific claims.
comment: Accepted to NAACL 2025 (Main)
☆ On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems NAACL 2025
Retrieval-augmented generation (RAG) has emerged as an approach to augment
large language models (LLMs) by reducing their reliance on static knowledge and
improving answer factuality. RAG retrieves relevant context snippets and
generates an answer based on them. Despite its increasing industrial adoption,
systematic exploration of RAG components is lacking, particularly regarding the
ideal size of provided context, and the choice of base LLM and retrieval
method. To help guide development of robust RAG systems, we evaluate various
context sizes, BM25 and semantic search as retrievers, and eight base LLMs.
Moving away from the usual RAG evaluation with short answers, we explore the
more challenging long-form question answering in two domains, where a good
answer has to utilize the entire context. Our findings indicate that final QA
performance improves steadily with up to 15 snippets but stagnates or declines
beyond that. Finally, we show that different general-purpose LLMs excel in the
biomedical domain than in the encyclopedic one, and that open-domain evidence
retrieval in large corpora is challenging.
comment: Accepted to Findings of NAACL 2025
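To make the BM25 retriever and the "number of snippets" setting concrete, here is a small self-contained sketch of Okapi BM25 ranking over a list of snippets. It is illustrative only: whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF form are assumptions, not details taken from the paper.

```python
import math
from collections import Counter
from typing import List

def bm25_top_k(query: str, snippets: List[str], k: int = 3,
               k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return the indices of the k snippets with the highest BM25 score."""
    docs = [s.lower().split() for s in snippets]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of snippets that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

snippets = [
    "aspirin reduces fever and pain",
    "the eiffel tower is in paris",
    "aspirin can irritate the stomach",
]
top = bm25_top_k("aspirin fever", snippets, k=2)
```

Varying `k` here corresponds to varying the context size passed to the generator.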
☆ TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Triton, a high-level Python-like language designed for building efficient GPU
kernels, is widely adopted in deep learning frameworks due to its portability,
flexibility, and accessibility. However, programming and parallel optimization
still require considerable trial and error from Triton developers. Despite
advances in large language models (LLMs) for conventional code generation,
these models struggle to generate accurate, performance-optimized Triton code,
as they lack awareness of its specifications and the complexities of GPU
programming. More critically, there is an urgent need for systematic
evaluations tailored to Triton. In this work, we introduce TritonBench, the
first comprehensive benchmark for Triton operator generation. TritonBench
features two evaluation channels: a curated set of 184 real-world operators
from GitHub and a collection of operators aligned with PyTorch interfaces.
Unlike conventional code benchmarks prioritizing functional correctness,
TritonBench also profiles efficiency performance on widely deployed GPUs
aligned with industry applications. Our study reveals that current
state-of-the-art code LLMs struggle to generate efficient Triton operators,
highlighting a significant gap in high-performance code generation. TritonBench
will be available at https://github.com/thunlp/TritonBench.
☆ Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs
Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber
A common use of NLP is to facilitate the understanding of large document
collections, with a shift from using traditional topic models to Large Language
Models. Yet the effectiveness of using LLMs for large corpus understanding in
real-world applications remains under-explored. This study measures the
knowledge users acquire with unsupervised, supervised LLM-based exploratory
approaches or traditional topic models on two datasets. While LLM-based methods
generate more human-readable topics and show higher average win probabilities
than traditional models for data exploration, they produce overly generic
topics for domain-specific datasets that do not easily allow users to learn
much about the documents. Adding human supervision to the LLM generation
process improves data exploration by mitigating hallucination and
over-genericity but requires greater human effort. In contrast, traditional
models like Latent Dirichlet Allocation (LDA) remain effective for exploration
but are less user-friendly. We show that LLMs struggle to describe the haystack
of large corpora without human help, particularly domain-specific data, and
face scaling and hallucination limitations due to context length constraints.
Dataset available at https://huggingface.co/datasets/zli12321/Bills.
comment: 21 Pages. LLM for Data Exploration and content analysis
☆ HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
The integration of additional modalities increases the susceptibility of
large vision-language models (LVLMs) to safety risks, such as jailbreak
attacks, compared to their language-only counterparts. While existing research
primarily focuses on post-hoc alignment techniques, the underlying safety
mechanisms within LVLMs remain largely unexplored. In this work, we
investigate whether LVLMs inherently encode safety-relevant signals within
their internal activations during inference. Our findings reveal that LVLMs
exhibit distinct activation patterns when processing unsafe prompts, which can
be leveraged to detect and mitigate adversarial inputs without requiring
extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a
novel tuning-free framework that harnesses internal model activations to
enhance safety. Experimental results show that HiddenDetect surpasses
state-of-the-art methods in detecting jailbreak attacks against LVLMs. By
utilizing intrinsic safety-aware patterns, our method provides an efficient and
scalable solution for strengthening LVLM robustness against multimodal threats.
Our code will be released publicly at
https://github.com/leigest519/HiddenDetect.
☆ SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
Large language models (LLMs) have demonstrated remarkable proficiency in
mainstream academic disciplines such as mathematics, physics, and computer
science. However, human knowledge encompasses over 200 specialized disciplines,
far exceeding the scope of existing benchmarks. The capabilities of LLMs in
many of these specialized fields, particularly in light industry, agriculture,
and service-oriented disciplines, remain inadequately evaluated. To address this
gap, we present SuperGPQA, a comprehensive benchmark that evaluates
graduate-level knowledge and reasoning capabilities across 285 disciplines. Our
benchmark employs a novel Human-LLM collaborative filtering mechanism to
eliminate trivial or ambiguous questions through iterative refinement based on
both LLM responses and expert feedback. Our experimental results reveal
significant room for improvement in the performance of current state-of-the-art
LLMs across diverse knowledge domains (e.g., the reasoning-focused model
DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting
the considerable gap between current model capabilities and artificial general
intelligence. Additionally, we present comprehensive insights from our
management of a large-scale annotation process, involving over 80 expert
annotators and an interactive Human-LLM collaborative system, offering valuable
methodological guidance for future research initiatives of comparable scope.
☆ Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models
We propose the Sentence Smith framework that enables controlled and specified
manipulation of text meaning. It consists of three main steps: 1. Parsing a
sentence into a semantic graph, 2. Applying human-designed semantic
manipulation rules, and 3. Generating text from the manipulated graph. A final
filtering step (4.) ensures the validity of the applied transformation. To
demonstrate the utility of Sentence Smith in an application study, we use it to
generate hard negative pairs that challenge text embedding models. Since the
controllable generation makes it possible to clearly isolate different types of
semantic shifts, we can gain deeper insights into the specific strengths and
weaknesses of widely used text embedding models, also addressing an issue in
current benchmarking where linguistic phenomena remain opaque. Human validation
confirms that the generations produced by Sentence Smith are highly accurate.
☆ Entity Framing and Role Portrayal in the News ACL
Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
We introduce a novel multilingual hierarchical corpus annotated for entity
framing and role portrayal in news articles. The dataset uses a unique taxonomy
inspired by storytelling elements, comprising 22 fine-grained roles, or
archetypes, nested within three main categories: protagonist, antagonist, and
innocent. Each archetype is carefully defined, capturing nuanced portrayals of
entities such as guardian, martyr, and underdog for protagonists; tyrant,
deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for
innocents. The dataset includes 1,378 recent news articles in five languages
(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two
critical domains of global significance: the Ukraine-Russia War and Climate
Change. Over 5,800 entity mentions have been annotated with role labels. This
dataset serves as a valuable resource for research into role portrayal and has
broader implications for news analysis. We describe the characteristics of the
dataset and the annotation process, and we report evaluation results on
fine-tuned state-of-the-art multilingual transformers and hierarchical
zero-shot learning using LLMs at the level of a document, a paragraph, and a
sentence.
comment: 23 pages, 12 figures. Submitted to ACL Rolling Review (ARR)
☆ From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT
The generative capabilities of LLMs present opportunities for accelerating
tasks but raise concerns about the authenticity of the knowledge they
produce. To address these concerns, we present a computational approach that
systematically evaluates the factual accuracy of biomedical knowledge that an
LLM has been prompted to generate. Our approach encompasses two
processes: the generation of disease-centric associations and the verification
of them using the semantic knowledge of the biomedical ontologies. Using
ChatGPT as the selected LLM, we designed a set of prompt-engineering
processes to generate linkages between diseases, drugs, symptoms, and genes to
establish grounds for assessments. Experimental results demonstrate high
accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and
genetic information (88%-98%). The symptom term identification accuracy was
notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO
ontologies, respectively. The verification of associations reveals literature
coverage rates of 89%-91% among disease-drug and disease-gene associations.
The low identification accuracy for symptom terms also contributed to the
lower verification rate of symptom-related associations (49%-62%).
comment: 26 pages, 6 figures, In Review with a Cell Press Journal
☆ Data-Efficient Pretraining with Group-Level Data Influence Modeling
Data-efficient pretraining has shown tremendous potential to elevate scaling
laws. This paper argues that effective pretraining data should be curated at
the group level, treating a set of data points as a whole rather than as
independent contributors. To achieve that, we propose Group-Level Data
Influence Modeling (Group-MATES), a novel data-efficient pretraining method
that captures and optimizes group-level data utility. Specifically, Group-MATES
collects oracle group-level influences by locally probing the pretraining model
with data sets. It then fine-tunes a relational data influence model to
approximate oracles as relationship-weighted aggregations of individual
influences. The fine-tuned model selects the data subset by maximizing its
group-level influence prediction, with influence-aware clustering to enable
efficient inference. Experiments on the DCLM benchmark demonstrate that
Group-MATES achieves a 10% relative core score improvement on 22 downstream
tasks over DCLM-Baseline and 5% over individual-influence-based methods,
establishing a new state-of-the-art. Further analyses highlight the
effectiveness of relational data influence models in capturing intricate
interactions between data points.
☆ I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search
Recent advancements in large language models (LLMs) have shown remarkable
potential in automating machine learning tasks. However, existing LLM-based
agents often struggle with low-diversity and suboptimal code generation. While
recent work has introduced Monte Carlo Tree Search (MCTS) to address these
issues, limitations persist in the quality and diversity of thoughts generated,
as well as in the scalar value feedback mechanisms used for node selection. In
this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a
novel approach that iteratively expands tree nodes through an introspective
process that meticulously analyzes solutions and results from parent and
sibling nodes. This facilitates a continuous refinement of the node in the
search tree, thereby enhancing the overall decision-making process. Furthermore,
we integrate a Large Language Model (LLM)-based value model to facilitate
direct evaluation of each node's solution prior to conducting comprehensive
computational rollouts. A hybrid rewarding mechanism is implemented to
seamlessly transition the Q-value from LLM-estimated scores to actual
performance scores. This allows higher-quality nodes to be traversed
earlier. Applied to various ML tasks, our approach demonstrates a 6%
absolute improvement in performance compared to strong open-source AutoML
agents, showcasing its effectiveness in enhancing agentic AutoML systems.
☆ Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup
Large language models have demonstrated excellent performance in many tasks,
including Text-to-SQL, due to their powerful in-context learning capabilities.
They are becoming the mainstream approach for Text-to-SQL. However, these
methods still have a significant gap compared to human performance, especially
on complex questions. As the complexity of questions increases, the gap between
questions and SQLs increases. We identify two important gaps: the structural
mapping gap and the lexical mapping gap. To tackle these two gaps, we propose
PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates
gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM).
AQP aims to obtain the structural pattern of the question by removing
database-related information, which enables us to find structurally similar
demonstrations. CSM aims to associate database-related text span in the
question with specific tables or columns in the database, which alleviates the
lexical mapping gap. Experimental results on the Spider and BIRD datasets
demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL +
GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution
accuracy of 87.9%, and achieves leading results on the BIRD dataset with an
execution accuracy of 64.67%.
☆ How to Get Your LLM to Generate Challenging Problems for Evaluation
The pace of evolution of Large Language Models (LLMs) necessitates new
approaches for rigorous and comprehensive evaluation. Traditional human
annotation is increasingly impracticable due to the complexities and costs
involved in generating high-quality, challenging problems. In this work, we
introduce CHASE, a unified framework to synthetically generate challenging
problems using LLMs without human involvement. For a given task, our approach
builds a hard problem in a bottom-up manner from simpler components. Moreover,
our framework decomposes the generation process into independently verifiable
sub-tasks, thereby ensuring a high level of quality and correctness. We
implement CHASE to create evaluation benchmarks across three diverse domains:
(1) document-based question answering, (2) repository-level code completion,
and (3) math reasoning. The performance of state-of-the-art LLMs on these
synthetic benchmarks lies in the range of 40-60% accuracy, thereby
demonstrating the effectiveness of our framework at generating challenging
problems. We publicly release our benchmarks and code.
☆ Data-Constrained Synthesis of Training Data for De-Identification
Many sensitive domains -- such as the clinical domain -- lack widely
available datasets due to privacy risks. The increasing generative capabilities
of large language models (LLMs) have made synthetic datasets a viable path
forward. In this study, we domain-adapt LLMs to the clinical domain and
generate synthetic clinical texts that are machine-annotated with tags for
personally identifiable information using capable encoder-based NER models. The
synthetic corpora are then used to train synthetic NER models. The results show
that training NER models using synthetic corpora incurs only a small drop in
predictive performance. The limits of this process are investigated in a
systematic ablation study -- using both Swedish and Spanish data. Our analysis
shows that smaller datasets can be sufficient for domain-adapting LLMs for data
synthesis. Instead, the effectiveness of this process is almost entirely
contingent on the performance of the machine-annotating NER models trained
using the original data.
comment: Under review
☆ Explanations of Deep Language Models Explain Language Representations in the Brain
Recent advances in artificial intelligence have given rise to large language
models (LLMs) that not only achieve human-like performance but also share
computational principles with the brain's language processing mechanisms. While
previous research has primarily focused on aligning LLMs' internal
representations with neural activity, we introduce a novel approach that
leverages explainable AI (XAI) methods to forge deeper connections between the
two domains. Using attribution methods, we quantified how preceding words
contribute to an LLM's next-word predictions and employed these explanations to
predict fMRI recordings from participants listening to the same narratives. Our
findings demonstrate that attribution methods robustly predict brain activity
across the language network, surpassing traditional internal representations in
early language areas. This alignment is hierarchical: early-layer explanations
correspond to the initial stages of language processing in the brain, while
later layers align with more advanced stages. Moreover, the layers more
influential on LLM next-word prediction, those with higher
attribution scores, exhibited stronger alignment with neural
activity. This work establishes a bidirectional bridge between AI and
neuroscience. First, we demonstrate that attribution methods offer a powerful
lens for investigating the neural mechanisms of language comprehension,
revealing how meaning emerges from preceding context. Second, we propose using
brain alignment as a metric to evaluate the validity of attribution methods,
providing a framework for assessing their biological plausibility.
☆ AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
Large Language Models (LLMs) have demonstrated impressive capabilities in
language processing, yet they often struggle with tasks requiring genuine
visual spatial reasoning. In this paper, we introduce a novel two-stage
training framework designed to equip standard LLMs with visual reasoning
abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT)
on a curated dataset of tokenized maze representations to teach the model to
predict step-by-step movement commands. Next, we apply Group Relative Policy
Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted
reward function to refine the model's sequential decision-making and encourage
emergent chain-of-thought behaviors. Experimental results on synthetically
generated mazes show that while a baseline model fails to navigate the maze,
the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning
boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more
robust and self-corrective reasoning, highlighting the potential of our
approach to bridge the gap between language models and visual spatial tasks.
These findings offer promising implications for applications in robotics,
autonomous navigation, and other domains that require integrated visual and
sequential reasoning.
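The core idea behind GRPO, scoring each sampled response relative to the other responses in its group, can be illustrated with a few lines of Python. This is a hedged sketch of the advantage normalization only: the reward values are made up, the paper's reward function is not reproduced, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled maze solutions (1.0 = reached the goal).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and are reinforced; the rest receive negative ones, so no separate value model is needed.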
☆ InstructAgent: Building User Controllable Recommender via LLM Agent WWW2025
Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Traditional recommender systems usually take the user-platform paradigm,
where users are directly exposed under the control of the platform's
recommendation algorithms. However, defects in recommendation algorithms may
put users in very vulnerable positions under this paradigm. First, many
sophisticated models are often designed with commercial objectives in mind,
focusing on the platform's benefits, which may hinder their ability to protect
and capture users' true interests. Second, these models are typically optimized
using data from all users, which may overlook individual users' preferences.
Due to these shortcomings, users may experience several disadvantages under the
traditional user-platform direct exposure paradigm, such as lack of control
over the recommender system, potential manipulation by the platform, echo
chamber effects, or lack of personalization for less active users due to the
dominance of active users during collaborative learning. Therefore, there is an
urgent need to develop a new paradigm to protect user interests and alleviate
these issues. Although some researchers have recently introduced LLM agents to
simulate user behaviors, these approaches primarily aim to optimize platform-side
performance, leaving core issues in recommender systems unresolved. To address
these limitations, we propose a new user-agent-platform paradigm, where an agent
serves as the protective shield between the user and the recommender system,
enabling indirect exposure. To this end, we first construct four recommendation
datasets, denoted as $\dataset$, along with user instructions for each record.
comment: WWW2025@HCRS
☆ Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs
Knowledge editing allows for efficient adaptation of large language models
(LLMs) to new information or corrections without requiring full retraining.
However, prior methods typically focus on either single-language editing or
basic multilingual editing, failing to achieve true cross-linguistic knowledge
synchronization. To address this, we present a simple and practical
state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE),
designed to propagate knowledge from a dominant language to other languages
effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition
Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel
dataset to modify in-scope knowledge while preserving unrelated information,
and (ii) Target-language Preference Optimization (TL-PO), which applies
advanced optimization techniques to ensure consistency across languages,
fostering the transfer of updates. Additionally, we contribute a high-quality,
cross-lingual dataset, specifically designed to enhance knowledge transfer
across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks
show that X-KDE significantly enhances cross-lingual performance, achieving an
average improvement of +8.19%, while maintaining high accuracy in monolingual
settings.
☆ LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning
Long context understanding remains challenging for large language models due
to their limited context windows. This paper presents Long Input Fine-Tuning
(LIFT), a novel framework for long-context modeling that can improve the
long-context performance of arbitrary (short-context) LLMs by dynamically
adapting model parameters based on the long input. Importantly, LIFT, rather
than endlessly extending the context window size to accommodate increasingly
longer inputs in context, chooses to store and absorb the long input in
its parameters. By fine-tuning the long input into model parameters, LIFT allows
short-context LLMs to answer questions even when the required information is
not provided in the context during inference. Furthermore, to enhance LIFT
performance while maintaining the original in-context learning (ICL)
capabilities, we introduce Gated Memory, a specialized attention adapter that
automatically balances long input memorization and ICL. We provide a
comprehensive analysis of the strengths and limitations of LIFT on long context
understanding, offering valuable directions for future research.
comment: arXiv admin note: text overlap with arXiv:2412.13626
☆ Length-Controlled Margin-Based Preference Optimization without Reference Model
Direct Preference Optimization (DPO) is a widely adopted offline algorithm
for preference-based reinforcement learning from human feedback (RLHF),
designed to improve training simplicity and stability by redefining reward
functions. However, DPO is hindered by several limitations, including length
bias, memory inefficiency, and probability degradation. To address these
challenges, we propose Length-Controlled Margin-Based Preference Optimization
(LMPO), a more efficient and robust alternative. LMPO introduces a uniform
reference model as an upper bound for the DPO loss, enabling a more accurate
approximation of the original optimization objective. Additionally, an average
log-probability optimization strategy is employed to minimize discrepancies
between training and inference phases. A key innovation of LMPO lies in its
Length-Controlled Margin-Based loss function, integrated within the
Bradley-Terry framework. This loss function regulates response length while
simultaneously widening the margin between preferred and rejected outputs. By
doing so, it mitigates probability degradation for both accepted and discarded
responses, addressing a significant limitation of existing methods. We evaluate
LMPO against state-of-the-art preference optimization techniques on two
open-ended large language models, Mistral and LLaMA3, across six conditional
benchmarks. Our experimental results demonstrate that LMPO effectively controls
response length, reduces probability degradation, and outperforms existing
approaches. The code is available at https://github.com/gengxuli/LMPO.
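For readers unfamiliar with the ingredients named in this abstract, the sketch below shows the standard DPO loss for one preference pair, followed by a generic reference-free loss that uses average log-probabilities and a fixed margin inside the Bradley-Terry sigmoid. The second function is only an illustration of those ingredients and is not the LMPO loss, whose exact form is not given in the abstract; the names `beta` and `gamma` and their values are assumptions.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss: log-ratios against a reference model for the
    preferred (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def avg_logprob_margin_loss(logp_w: float, len_w: int,
                            logp_l: float, len_l: int,
                            beta: float = 2.0, gamma: float = 0.5) -> float:
    """Illustrative reference-free variant: length-normalized log-probabilities
    with a target margin gamma (hypothetical, not the paper's loss)."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)

# Toy sequence log-probabilities and token counts (made-up numbers).
loss_ref = dpo_loss(-10.0, -12.0, -10.0, -12.0)
loss_good = avg_logprob_margin_loss(-10.0, 10, -30.0, 10)
loss_bad = avg_logprob_margin_loss(-30.0, 10, -10.0, 10)
```

Dividing by the token count removes the incentive to win the comparison through sheer length, and dropping the reference model avoids holding a second model in memory.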
☆ How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation
Recently, LLMs have garnered increasing attention across academic disciplines
for their potential as human digital twins, virtual proxies designed to
replicate individuals and autonomously perform tasks such as decision-making,
problem-solving, and reasoning on their behalf. However, current evaluations of
LLMs primarily emphasize dialogue simulation while overlooking human behavior
simulation, which is crucial for digital twins. To address this gap, we
introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to
simulate continuous human behavior. BehaviorChain comprises diverse,
high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors
across 1,001 unique personas, each with detailed history and profile metadata.
For evaluation, we integrate persona metadata into LLMs and employ them to
iteratively infer contextually appropriate behaviors within dynamic scenarios
provided by BehaviorChain. Comprehensive evaluation results demonstrated that
even state-of-the-art models struggle with accurately simulating continuous
human behavior.
☆ NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Image geo-localization is the task of predicting the specific location of an
image and requires complex reasoning across visual, geographical, and cultural
contexts. While prior Vision Language Models (VLMs) have the best accuracy at
this task, there is a dearth of high-quality datasets and models for analytical
reasoning. We first create NaviClues, a high-quality dataset derived from
GeoGuessr, a popular geography game, to supply examples of expert reasoning
from language. Using this dataset, we present Navig, a comprehensive image
geo-localization framework integrating global and fine-grained image
information. By reasoning with language, Navig reduces the average distance
error by 14% compared to previous state-of-the-art models while requiring fewer
than 1000 training samples. Our dataset and code are available at
https://github.com/SparrowZheyuan18/Navig/.
☆ PEARL: Towards Permutation-Resilient LLMs ICLR 2025
The in-context learning (ICL) capability of large language models (LLMs)
enables them to perform challenging tasks using provided demonstrations.
However, ICL is highly sensitive to the ordering of demonstrations, leading to
instability in predictions. This paper shows that this vulnerability can be
exploited to design a natural attack (difficult for model providers to detect)
that achieves a nearly 80% success rate on LLaMA-3 by simply permuting the
demonstrations. Existing mitigation methods primarily rely on post-processing
and fail to enhance the model's inherent robustness to input permutations,
raising concerns about safety and reliability of LLMs. To address this issue,
we propose Permutation-resilient learning (PEARL), a novel framework based on
distributionally robust optimization (DRO), which optimizes model performance
against the worst-case input permutation. Specifically, PEARL consists of a
permutation-proposal network (P-Net) and the LLM. The P-Net generates the most
challenging permutations by treating it as an optimal transport problem, which
is solved using an entropy-constrained Sinkhorn algorithm. Through minimax
optimization, the P-Net and the LLM iteratively optimize against each other,
progressively improving the LLM's robustness. Experiments on synthetic
pre-training and real-world instruction tuning tasks demonstrate that PEARL
effectively mitigates permutation attacks and enhances performance. Notably,
despite being trained on fewer shots and shorter contexts, PEARL achieves
performance gains of up to 40% when scaled to many-shot and long-context
scenarios, highlighting its efficiency and generalization capabilities.
comment: ICLR 2025
☆ Multi-Record Web Page Information Extraction From News Websites
In this paper, we focused on the problem of extracting information from web
pages containing many records, a task of growing importance in the era of
massive web data. Recently, the development of neural network methods has
improved the quality of information extraction from web pages. Nevertheless,
most of the research and datasets are aimed at studying detailed pages. This
has left multi-record "list pages" relatively understudied, despite their
widespread presence and practical significance.
To address this gap, we created a large-scale, open-access dataset
specifically designed for list pages. This is the first dataset for this task
in the Russian language. Our dataset contains 13,120 web pages with news lists,
significantly exceeding existing datasets in both scale and complexity. Our
dataset contains attributes of various types, including optional and
multi-valued, providing a realistic representation of real-world list pages.
These features make our dataset a valuable resource for studying information
extraction from pages containing many records.
Furthermore, we proposed our own multi-stage information extraction methods.
In this work, we explore and demonstrate several strategies for applying
MarkupLM to the specific challenges of multi-record web pages. Our experiments
validate the advantages of our methods.
By releasing our dataset to the public, we aim to advance the field of
information extraction from multi-record pages.
☆ Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity
This paper investigates the efficacy of RWKV, a novel language model
architecture known for its linear attention mechanism, for generating sentence
embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate
the semantic similarity captured by embeddings from different hidden layers of
a pre-trained RWKV model. The performance is assessed on the Microsoft Research
Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared
against a GloVe-based baseline. My results indicate that while RWKV embeddings
capture some semantic relatedness, they underperform compared to the GloVe
baseline in terms of Spearman correlation. I also analyze the inference time
and GPU memory usage, highlighting the computational trade-offs associated with
RWKV embeddings. The findings suggest that while RWKV offers potential
advantages in terms of linear scaling, its zero-shot sentence embedding quality
for semantic similarity tasks requires further investigation and potential
task-specific fine-tuning to match or exceed simpler baselines.
comment: 17 pages, 3 tables, preprint on arXiv, includes detailed analysis of
RWKV for semantic similarity tasks
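The evaluation protocol described above (cosine similarity of sentence embeddings, scored against paraphrase labels with Spearman correlation) can be sketched as follows. The toy vectors and labels are invented for illustration and are not taken from MRPC or from any RWKV layer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical embedding pairs with binary paraphrase labels.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold = [1, 1, 0, 0]
sims = [cosine(u, v) for u, v in pairs]
rho = spearman(sims, gold)
```

Because the labels are binary, they carry many ties, which is why the ranks are averaged rather than assigned arbitrarily.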
☆ Reward Models Identify Consistency, Not Causality
Reward models (RMs) play a crucial role in aligning large language models
(LLMs) with human preferences and enhancing reasoning quality. Traditionally,
RMs are trained to rank candidate outputs based on their correctness and
coherence. However, in this work, we present several surprising findings that
challenge common assumptions about RM behavior. Our analysis reveals that
state-of-the-art reward models prioritize structural consistency over causal
correctness. Specifically, removing the problem statement has minimal impact on
reward scores, whereas altering numerical values or disrupting the reasoning
flow significantly affects RM outputs. Furthermore, RMs exhibit a strong
dependence on complete reasoning trajectories: truncated or incomplete steps
lead to significant variations in reward assignments, indicating that RMs
primarily rely on learned reasoning patterns rather than explicit problem
comprehension. These findings hold across multiple architectures, datasets, and
tasks, leading to three key insights: (1) RMs primarily assess coherence rather
than true reasoning quality; (2) The role of explicit problem comprehension in
reward assignment is overstated; (3) Current RMs may be more effective at
ranking responses than verifying logical validity. Our results suggest a
fundamental limitation in existing reward modeling approaches, emphasizing the
need for a shift toward causality-aware reward models that go beyond
consistency-driven evaluation.
comment: 16 pages
☆ FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis
Retrieval-Augmented Large Language Models (LLMs), which integrate external
knowledge into LLMs, have shown remarkable performance in various medical
domains, including clinical diagnosis. However, existing RAG methods struggle
to effectively assess task difficulty to make retrieval decisions, thereby
failing to meet the clinical requirements for balancing efficiency and
accuracy. So in this paper, we propose FIND (Fine-grained
Information Density Guided Adaptive RAG), a novel framework
that improves the reliability of RAG in disease diagnosis scenarios. FIND
incorporates a fine-grained adaptive control module to determine whether
retrieval is necessary based on the information density of the input. By
optimizing the retrieval process and implementing a knowledge filtering module,
FIND ensures that the retrieval is better suited to clinical scenarios.
Experiments on three Chinese electronic medical record datasets demonstrate
that FIND significantly outperforms various baseline methods, highlighting its
effectiveness in clinical diagnosis tasks.
☆ Behavioral Analysis of Information Salience in Large Language Models
Large Language Models (LLMs) excel at text summarization, a task that
requires models to select content based on its importance. However, the exact
notion of salience that LLMs have internalized remains unclear. To bridge this
gap, we introduce an explainable framework to systematically derive and
investigate information salience in LLMs through their summarization behavior.
Using length-controlled summarization as a behavioral probe into the content
selection process, and tracing the answerability of Questions Under Discussion
throughout, we derive a proxy for how models prioritize information. Our
experiments on 13 models across four datasets reveal that LLMs have a nuanced,
hierarchical notion of salience, generally consistent across model families and
sizes. While models show highly consistent behavior and hence salience
patterns, this notion of salience cannot be accessed through introspection, and
only weakly correlates with human perceptions of information salience.
☆ A Statistical Case Against Empirical Human-AI Alignment
Empirical human-AI alignment aims to make AI systems act in line with
observed human behavior. While noble in its goals, we argue that empirical
alignment can inadvertently introduce statistical biases that warrant caution.
This position paper thus advocates against naive empirical alignment, offering
prescriptive alignment and a posteriori empirical alignment as alternatives. We
substantiate our principled argument by tangible examples like human-centric
decoding of language models.
comment: 24 pages, 2 figures, 5 tables
☆ ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
Self-awareness, i.e., the ability to assess and correct one's own generation,
is a fundamental aspect of human intelligence, making its replication in large
language models (LLMs) an important yet challenging task. Previous works tackle
this by employing extensive reinforcement learning or by relying on large
external verifiers. In this work, we propose Refine via Intrinsic
Self-Verification (ReVISE), an efficient and effective framework that enables
LLMs to self-correct their outputs through self-verification. The core idea of
ReVISE is to enable LLMs to verify their reasoning processes and continually
rethink reasoning trajectories based on that verification. We introduce a
structured curriculum based upon online preference learning to implement this
efficiently. Specifically, as ReVISE involves two challenging tasks (i.e.,
self-verification and reasoning correction), we tackle each task sequentially
using curriculum learning, collecting both failed and successful reasoning
paths to construct preference pairs for efficient training. During inference,
our approach enjoys natural test-time scaling by integrating self-verification
and correction capabilities, further enhanced by our proposed confidence-aware
decoding mechanism. Our experiments on various reasoning tasks demonstrate that
ReVISE achieves efficient self-correction and significantly improves reasoning
performance.
☆ Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs
This work investigates the ability of open Large Language Models (LLMs) to
predict citation intent through in-context learning and fine-tuning. Unlike
traditional approaches that rely on pre-trained models like SciBERT, which
require extensive domain-specific pretraining and specialized architectures, we
demonstrate that general-purpose LLMs can be adapted to this task with minimal
task-specific data. We evaluate twelve model variations across five prominent
open LLM families using zero, one, few, and many-shot prompting to assess
performance across scenarios. Our experimental study identifies the
top-performing model through extensive experimentation of in-context
learning-related parameters, which we fine-tune to further enhance task
performance. The results highlight the strengths and limitations of LLMs in
recognizing citation intents, providing valuable insights for model selection
and prompt engineering. Additionally, we make our end-to-end evaluation
framework and models openly available for future use.
☆ Less is More: Improving LLM Alignment via Preference Data Selection
Direct Preference Optimization (DPO) has emerged as a promising approach for
aligning large language models with human preferences. While prior work mainly
extends DPO from the aspect of the objective function, we instead improve DPO
from the largely overlooked but critical aspect of data selection.
Specifically, we address the issue of parameter shrinkage caused by noisy data
by proposing a novel margin-maximization principle for dataset curation in DPO
training. To accurately estimate margins for data selection, we propose a
dual-margin guided approach that considers both external reward margins and
implicit DPO reward margins. Extensive experiments demonstrate that our method
reduces computational cost dramatically while improving performance.
Remarkably, by using just 10% of the Ultrafeedback dataset, our approach
achieves 3% to 8% improvements across various Llama and Mistral series models
on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends
to iterative DPO, yielding a roughly 3% improvement with 25% online data,
while further reducing training time. These results highlight the potential of
data selection strategies for advancing preference optimization.
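A minimal sketch of margin-based preference data selection, under stated assumptions: the implicit DPO reward of a response is taken as beta times the log-probability ratio between the policy and the reference model, and the two margins are combined by a simple weighted sum. The abstract does not say how the external and implicit margins are actually combined, so the `weight` parameter, the field names, and the toy numbers are all hypothetical.

```python
def implicit_margin(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit DPO reward margin between the chosen and rejected response."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return chosen_reward - rejected_reward

def select_by_margin(examples, keep_fraction=0.1, weight=0.5, beta=0.1):
    """Keep the preference pairs with the largest combined margin.

    Assumption: a weighted sum of the two margins; in practice the margins
    live on different scales and would likely need normalization.
    """
    scored = []
    for ex in examples:
        external = ex["rm_chosen"] - ex["rm_rejected"]
        implicit = implicit_margin(ex["logp_chosen"], ex["logp_rejected"],
                                   ex["ref_logp_chosen"], ex["ref_logp_rejected"],
                                   beta)
        scored.append((weight * external + (1 - weight) * implicit, ex["id"]))
    scored.sort(reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return [ex_id for _, ex_id in scored[:k]]

# Hypothetical preference pairs with reward-model scores and log-probabilities.
examples = [
    {"id": "a", "rm_chosen": 2.0, "rm_rejected": 0.5, "logp_chosen": -10.0,
     "logp_rejected": -14.0, "ref_logp_chosen": -11.0, "ref_logp_rejected": -12.0},
    {"id": "b", "rm_chosen": 1.0, "rm_rejected": 1.2, "logp_chosen": -10.0,
     "logp_rejected": -10.0, "ref_logp_chosen": -10.0, "ref_logp_rejected": -10.0},
    {"id": "c", "rm_chosen": 1.0, "rm_rejected": 0.2, "logp_chosen": -9.0,
     "logp_rejected": -12.0, "ref_logp_chosen": -10.0, "ref_logp_rejected": -10.0},
    {"id": "d", "rm_chosen": 0.6, "rm_rejected": 0.5, "logp_chosen": -12.0,
     "logp_rejected": -11.0, "ref_logp_chosen": -11.0, "ref_logp_rejected": -11.0},
]
selected = select_by_margin(examples, keep_fraction=0.5)
```

Pair "b" has a negative external margin (the reward model prefers the rejected response), which is the kind of noisy example a margin filter is meant to drop.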
☆ Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Bytes form the basis of the digital world and thus are a promising building
block for multimodal foundation models. Recently, Byte Language Models (BLMs)
have emerged to overcome tokenization, yet the excessive length of bytestreams
requires new architectural paradigms. Therefore, we present the Multiscale Byte
Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows
training with context windows of 5M bytes on a single GPU in full model
precision. We thoroughly examine MBLM's performance with Transformer and Mamba
blocks on both unimodal and multimodal tasks. Our experiments demonstrate that
hybrid architectures are efficient in handling extremely long byte sequences
during training while achieving near-linear generational efficiency. To the
best of our knowledge, we present the first evaluation of BLMs on visual Q&A
tasks and find that, despite serializing images and the absence of an encoder,
an MBLM with pure next token prediction can match custom CNN-LSTM architectures
with designated classification heads. We show that MBLMs exhibit strong
adaptability in integrating diverse data representations, including pixel and
image filestream bytes, underlining their potential toward omnimodal foundation
models. Source code is publicly available at:
https://github.com/ai4sd/multiscale-byte-lm
comment: Under Review
☆ LLM-based User Profile Management for Recommender System ACL 2025
The rapid advancement of Large Language Models (LLMs) has opened new
opportunities in recommender systems by enabling zero-shot recommendation
without conventional training. Despite their potential, most existing works
rely solely on users' purchase histories, leaving significant room for
improvement by incorporating user-generated textual data, such as reviews and
product descriptions. Addressing this gap, we propose PURE, a novel LLM-based
recommendation framework that builds and maintains evolving user profiles by
systematically extracting and summarizing key information from user reviews.
PURE consists of three core components: a Review Extractor for identifying user
preferences and key product features, a Profile Updater for refining and
updating user profiles, and a Recommender for generating personalized
recommendations using the most current profile. To evaluate PURE, we introduce
a continuous sequential recommendation task that reflects real-world scenarios
by adding reviews over time and updating predictions incrementally. Our
experimental results on Amazon datasets demonstrate that PURE outperforms
existing LLM-based methods, effectively leveraging long-term user information
while managing token limitations.
comment: Submitted to ACL 2025
☆ LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization
Large Language Models (LLMs) have achieved remarkable success in natural
language processing, but their full fine-tuning remains resource-intensive.
Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation
(LoRA), have emerged as a practical solution by approximating parameter updates
with low-rank matrices. However, LoRA often exhibits a "double descent"
phenomenon during fine-tuning, where model performance degrades due to
overfitting and limited expressiveness caused by low-rank constraints. To
address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation
Optimization), a novel method that leverages gradient and weight norms to
generate targeted perturbations. By optimizing the sharpness of the loss
landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the
double descent problem and improving generalization. Extensive experiments on
natural language understanding (NLU) and generation (NLG) tasks demonstrate
that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore,
extended experiments specifically designed to analyze the double descent
phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing
more robust and generalizable models. Our work provides a robust and efficient
solution for fine-tuning LLMs, with broad applicability in real-world
scenarios. The code is available at https://github.com/llm172/LoRA-GGPO.
☆ CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models
Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated
remarkable real-world capabilities, effectively collaborating to complete
complex tasks. While these systems are designed with safety mechanisms, such as
rejecting harmful instructions through alignment, their security remains
largely unexplored. This gap leaves LLM-MASs vulnerable to targeted
disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks
(Corba), a novel and simple yet highly effective attack that disrupts
interactions between agents within an LLM-MAS. Corba leverages two key
properties: its contagious nature allows it to propagate across arbitrary
network topologies, while its recursive property enables sustained depletion of
computational resources. Notably, these blocking attacks often involve
seemingly benign instructions, making them particularly challenging to mitigate
using conventional alignment methods. We evaluate Corba on two widely-used
LLM-MASs, namely AutoGen and Camel, across various topologies and commercial
models. Additionally, we conduct more extensive experiments in open-ended
interactive LLM-MASs, demonstrating the effectiveness of Corba in complex
topology structures and open-source models. Our code is available at:
https://github.com/zhrli324/Corba.
☆ Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation
We propose a new framework for zero-shot generation of synthetic tabular
data. Using the large language model (LLM) GPT-4o and plain-language prompting,
we demonstrate the ability to generate high-fidelity tabular data without
task-specific fine-tuning or access to real-world data (RWD) for pre-training.
To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated
synthetic data against data generated with the conditional tabular generative
adversarial network (CTGAN), across three open-access datasets: Iris, Fish
Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o
outperformed CTGAN in preserving means, 95% confidence intervals, bivariate
correlations, and data privacy of RWD, even at amplified sample sizes. Notably,
correlations between parameters were consistently preserved with appropriate
direction and strength. However, refinement is necessary to better retain
distributional characteristics. These findings highlight the potential of LLMs
in tabular data synthesis, offering an accessible alternative to generative
adversarial networks and variational autoencoders.
comment: 12 pages, 7 figures, 5 tables
☆ MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality
Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka
Does multilingual Neural Machine Translation (NMT) lead to the Curse of
Multilinguality, or does it provide Cross-lingual Knowledge Transfer within a
language family? In this study, we explore multiple approaches for extending
the available data regime in NMT and we show cross-lingual benefits even in
the zero-shot translation regime for low-resource languages. With this paper, we
provide state-of-the-art open-source NMT models for translating between
selected Slavic languages. We released our models on the HuggingFace Hub
(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under
the CC BY 4.0 license. The Slavic language family comprises morphologically rich
Central and Eastern European languages. Although these languages count hundreds
of millions of native speakers, Slavic Neural Machine Translation is, in our
opinion, under-studied. Recently, most NMT research focuses either on: high-resource languages
like English, Spanish, and German - in WMT23 General Translation Task 7 out of
8 task directions are from or to English; massively multilingual models
covering multiple language groups; or evaluation techniques.
☆ Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau
This study evaluates Large Language Models' (LLMs) ability to simulate
non-native-like English use observed in human second language (L2) learners
interfered with by their native first language (L1). In dialogue-based
interviews, we prompt LLMs to mimic L2 English learners with specific L1s
(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to
real L2 learner data. Our analysis examines L1-driven linguistic biases, such
as reference word usage and avoidance behaviors, using information-theoretic
and distributional density measures. Results show that modern LLMs (e.g.,
Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed
in human L2 data, with distinct influences from various languages (e.g.,
Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu
influences noun-verb collocations). Our results reveal the potential of LLMs
for L2 dialogue generation and evaluation for future educational applications.
☆ How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov
The performance of Large Language Models (LLMs) on many tasks is greatly
limited by the knowledge learned during pre-training and stored in the model's
parameters. Low-rank adaptation (LoRA) is a popular and efficient training
technique for updating or domain-specific adaptation of LLMs. In this study, we
investigate how new facts can be incorporated into the LLM using LoRA without
compromising the previously learned knowledge. We fine-tuned
Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our
experiments have shown that the best results are obtained when the training
data contains a mixture of known and new facts. However, this approach is still
potentially harmful because the model's performance on external
question-answering benchmarks declines after such fine-tuning. When the
training data is biased towards certain entities, the model tends to regress to
a few overrepresented answers. In addition, we found that the model becomes more
confident and refuses to provide an answer in only a few cases. These findings
highlight the potential pitfalls of LoRA-based LLM updates and underscore the
importance of training data composition and tuning parameters to balance new
knowledge integration and general model capabilities.
☆ Towards a Perspectivist Turn in Argument Quality Assessment NAACL 2025
The assessment of argument quality depends on well-established logical,
rhetorical, and dialectical properties that are unavoidably subjective:
multiple valid assessments may exist, and there is no unequivocal ground truth.
This aligns with recent paths in machine learning, which embrace the
co-existence of different perspectives. However, this potential remains largely
unexplored in NLP research on argument quality. One crucial reason seems to be
the yet unexplored availability of suitable datasets. We fill this gap by
conducting a systematic review of argument quality datasets. We assign them to
a multi-layered categorization targeting two aspects: (a) What has been
annotated: we collect the quality dimensions covered in datasets and
consolidate them in an overarching taxonomy, increasing dataset comparability
and interoperability. (b) Who annotated: we survey what information is given
about annotators, enabling perspectivist research and grounding our
recommendations for future actions. To this end, we discuss datasets suitable
for developing perspectivist models (i.e., those containing individual,
non-aggregated annotations), and we showcase the importance of a controlled
selection of annotators in a pilot study.
comment: Accepted to NAACL 2025
☆ MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for
evaluating and developing LLM agents on AI research tasks. This is the first
Gym environment for machine learning (ML) tasks, enabling research on
reinforcement learning (RL) algorithms for training such agents. MLGym-Bench
consists of 13 diverse and open-ended AI research tasks from diverse domains
such as computer vision, natural language processing, reinforcement learning,
and game theory. Solving these tasks requires real-world AI research skills
such as generating new ideas and hypotheses, creating and processing data,
implementing ML methods, training models, running experiments, analyzing the
results, and iterating through this process to improve on a given task. We
evaluate a number of frontier large language models (LLMs) on our benchmarks
such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5
Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate
models or agents, generate synthetic data at scale, as well as develop new
learning algorithms for training agents on AI research tasks. We find that
current frontier models can improve on the given baselines, usually by finding
better hyperparameters, but do not generate novel hypotheses, algorithms,
architectures, or substantial improvements. We open-source our framework and
benchmark to facilitate future research in advancing the AI research
capabilities of LLM agents.
comment: 35 pages, 12 figures, 10 tables
☆ Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups
Macroeconomic fluctuations and the narratives that shape them form a mutually
reinforcing cycle: public discourse can spur behavioural changes leading to
economic shifts, which then result in changes in the stories that propagate. We
show that shifts in semantic embedding space can be causally linked to
financial market shocks -- deviations from the expected market behaviour.
Furthermore, we show how partisanship can influence the predictive power of
text for market fluctuations and shape reactions to those same shocks. We also
provide some evidence that text-based signals are particularly salient during
unexpected events such as COVID-19, highlighting the value of language data as
an exogenous variable in economic forecasting. Our findings underscore the
bidirectional relationship between news outlets and market shocks, offering a
novel empirical approach to studying their effect on each other.
☆ Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization
LLM-based agents have made significant advancements in interactive
environments, such as mobile operations and web browsing, and other domains
beyond computer use. Current multi-agent systems universally excel in
performance, compared to single agents, but struggle with generalization across
environments due to predefined roles and inadequate strategies for generalizing
language agents. The challenge of achieving both strong performance and good
generalization has hindered the progress of multi-agent systems for interactive
environments. To address these issues, we propose CollabUIAgents, a multi-agent
reinforcement learning framework with a novel multi-agent credit re-assignment
(CR) strategy, assigning process rewards with LLMs rather than
environment-specific rewards and learning with synthesized preference data, in
order to foster generalizable, collaborative behaviors among the role-free
agents' policies. Empirical results show that our framework improves both
performance and cross-environment generalizability of multi-agent systems.
Moreover, our 7B-parameter system achieves results on par with or exceeding strong
closed-source models and the LLM that guides the CR. We also provide insights
into using granular CR rewards effectively for environment generalization, and
accommodating trained LLMs in multi-agent systems.
comment: 24 pages, under review
☆ StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
Multi-turn instruction following capability constitutes a core competency of
large language models (LLMs) in real-world applications. Existing evaluation
benchmarks predominantly focus on fine-grained constraint satisfaction and
domain-specific capability assessment, yet overlook the crucial structural
dependency between dialogue turns that distinguishes multi-turn from
single-turn interactions. This structural dependency not only reflects user
intent but also establishes a second dimension for instruction following
evaluation beyond constraint satisfaction. To address this gap, we propose
StructFlowBench, a multi-turn instruction following benchmark with structural
flow modeling. The benchmark innovatively defines a structural flow framework
comprising six fundamental inter-turn relationships, which not only introduces
novel structural constraints for model evaluation but also serves as generation
parameters for creating customized dialogue flows tailored to specific
scenarios. Adopting established LLM-based automatic evaluation methodologies,
we conduct systematic evaluations of 13 leading open-source and closed-source
LLMs. Experimental results reveal significant deficiencies in current models'
comprehension of multi-turn dialogue structures. The code is available at
https://github.com/MLGroupJLU/StructFlowBench.
comment: 18 pages, 8 figures, 8 tables
☆ How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Jailbreak attacks, where harmful prompts bypass generative models' built-in
safety, raise serious concerns about model vulnerability. While many defense
methods have been proposed, the trade-offs between safety and helpfulness, and
their application to Large Vision-Language Models (LVLMs), are not well
understood. This paper systematically examines jailbreak defenses by reframing
the standard generation task as a binary classification problem to assess model
refusal tendencies for both harmful and benign queries. We identify two key
defense mechanisms: safety shift, which increases refusal rates across all
queries, and harmfulness discrimination, which improves the model's ability to
distinguish between harmful and benign inputs. Using these mechanisms, we
develop two ensemble defense strategies (inter-mechanism ensembles and
intra-mechanism ensembles) to balance safety and helpfulness. Experiments on the
MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these
strategies effectively improve model safety or optimize the trade-off between
safety and helpfulness.
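One way to make the two mechanisms concrete, offered only as an assumption-laden sketch: treat each model response as a binary refuse/comply decision, measure safety shift as the change in the overall refusal rate, and measure harmfulness discrimination as the change in the gap between refusal rates on harmful and benign queries. These operational definitions and the toy decisions are illustrative and are not taken from the paper.

```python
def refusal_rate(decisions):
    """Fraction of queries refused; each decision is 1 (refuse) or 0 (comply)."""
    return sum(decisions) / len(decisions)

def mechanism_summary(base_harmful, base_benign, def_harmful, def_benign):
    """Compare refusal behavior with and without a defense.

    Assumed definitions (not the paper's):
      safety_shift   = change in refusal rate over all queries
      discrimination = change in (harmful rate - benign rate)
    """
    base_all = refusal_rate(base_harmful + base_benign)
    def_all = refusal_rate(def_harmful + def_benign)
    base_gap = refusal_rate(base_harmful) - refusal_rate(base_benign)
    def_gap = refusal_rate(def_harmful) - refusal_rate(def_benign)
    return {"safety_shift": def_all - base_all,
            "discrimination": def_gap - base_gap}

# Hypothetical refusal decisions on four harmful and four benign queries.
summary = mechanism_summary(
    base_harmful=[1, 0, 0, 1], base_benign=[0, 0, 0, 0],
    def_harmful=[1, 1, 1, 1], def_benign=[1, 0, 0, 0],
)
```

In this toy case the defense raises refusals overall (a positive shift) and also widens the harmful-versus-benign gap, at the cost of one extra refusal on a benign query.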
☆ NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models
Parameter-efficient fine-tuning (PEFT) is essential for adapting large
language models (LLMs), with low-rank adaptation (LoRA) being the most popular
approach. However, LoRA suffers from slow convergence, and some recent LoRA
variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD)
for initialization, leading to expensive computation. To mitigate these
problems, we use the Nyström method, which follows a three-matrix
manipulation. We first introduce StructuredLoRA (SLoRA), which investigates
adding a small intermediate matrix between the low-rank matrices A and B.
Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based
initialization for SLoRA to improve its effectiveness and efficiency. Finally,
we propose IntermediateTune (IntTune), which explores fine-tuning exclusively
on the intermediate matrix of NLoRA to further boost LLM efficiency. We
evaluate our methods on five natural language generation (NLG) tasks and eight
natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve
accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with
only 3.67 million additional trainable parameters. IntTune improves average NLG
performance over LoRA by 7.45% while using only 1.25% of its parameters. These
results demonstrate the efficiency and effectiveness of our approach in
enhancing model performance with minimal parameter overhead.
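The three-matrix structure can be sketched as a weight update of the form B·M·A, where M is a small r×r matrix placed between the usual LoRA factors. The ordering B·M·A, the shapes, and the helper names below are assumptions for illustration; the Nyström-based initialization itself is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def slora_delta(B, M, A):
    """Weight update with an intermediate matrix: delta_W = B @ M @ A.

    Assumed shapes: B is d_out x r, M is r x r, A is r x d_in.
    """
    return matmul(matmul(B, M), A)

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for one adapted weight matrix."""
    lora = r * (d_in + d_out)
    return {"lora": lora,                  # A and B only
            "slora": lora + r * r,         # A, B, and the intermediate matrix
            "intermediate_only": r * r}    # tuning only the intermediate matrix

# Toy factors with d_out = 3, r = 2, d_in = 4.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With M set to the identity, the three-matrix update reduces to plain LoRA.
delta = slora_delta(B, identity, A)
counts = trainable_params(d_out=4096, d_in=4096, r=16)
```

The parameter counts show why tuning only the intermediate matrix is cheap: it scales with r squared rather than with the model's hidden dimensions.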
☆ Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression
Handling long-context sequences efficiently remains a significant challenge
in large language models (LLMs). Existing methods for token selection in
sequence extrapolation either employ a permanent eviction strategy or select
tokens by chunk, which may lead to the loss of critical information. We propose
Efficient Selective Attention (ESA), a novel approach that extends context
length by efficiently selecting the most critical tokens at the token level to
compute attention. ESA reduces the computational complexity of token selection
by compressing query and key vectors into lower-dimensional representations. We
evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using
open-source LLMs with context lengths of 8k and 32k. ESA outperforms other
selective attention methods, especially in tasks requiring the retrieval of
multiple pieces of information, achieving comparable performance to
full-attention extrapolation methods across various tasks, with superior
results in certain tasks.
comment: 14 pages, 2 figures
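
The idea of scoring tokens in a compressed query-key space can be sketched as follows. This is a toy illustration, not ESA itself: the projection matrix `P`, the dot-product score, and the function names are assumptions, since the abstract does not specify how the compression is obtained or how scores are aggregated.

```python
def project(vec, P):
    """Map a d-dimensional vector to d_low dimensions with a (d_low x d) matrix P."""
    return [sum(p * v for p, v in zip(row, vec)) for row in P]

def select_tokens(query, keys, P, k):
    """Return the positions of the k keys with the highest compressed-space score."""
    q_low = project(query, P)
    scores = [sum(a * b for a, b in zip(q_low, project(key, P))) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # keep original token order for the attention step

# Hypothetical projection that keeps only the first two of four dimensions.
P = [[1, 0, 0, 0], [0, 1, 0, 0]]
query = [1, 0, 5, 5]
keys = [[1, 0, 0, 0], [0, 1, 0, 0], [2, 0, 0, 0], [-1, 0, 9, 9]]
print(select_tokens(query, keys, P, k=2))
```

Full attention would then be computed only over the selected positions, so the selection cost scales with the compressed dimension rather than the full head dimension.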
☆ Argument-Based Comparative Question Answering Evaluation Benchmark
Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann
In this paper, we aim to solve the problems standing in the way of automatic
comparative question answering. To this end, we propose an evaluation framework
to assess the quality of comparative question answering summaries. We formulate
15 criteria for assessing comparative answers created using manual annotation
and annotation from 6 large language models and two comparative question
answering datasets. We perform our tests using several LLMs and manual
annotation under different settings and demonstrate the consistency of both
evaluations. Our results show that the Llama-3 70B Instruct model
achieves the best results for summary evaluation, while GPT-4 is the best
for answering comparative questions. All used data, code, and evaluation
results are publicly available at
https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md.
comment: 8 pages, 7 Tables, 13 Figures, 18 pages with Appendix
☆ Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models
This work presents a novel architecture for context-aware interactions within
smart environments, leveraging Large Language Models (LLMs) to enhance user
experiences. Our system integrates user location data obtained through UWB tags
and sensor-equipped smart homes with real-time human activity recognition (HAR)
to provide a comprehensive understanding of user context. This contextual
information is then fed to an LLM-powered chatbot, enabling it to generate
personalised interactions and recommendations based on the user's current
activity and environment. This approach moves beyond traditional static chatbot
interactions by dynamically adapting to the user's real-time situation. A case
study conducted on a real-world dataset demonstrates the feasibility and
effectiveness of our proposed architecture, showcasing its potential to create
more intuitive and helpful interactions within smart homes. The results
highlight the significant benefits of integrating LLMs with real-time activity
and location data to deliver personalised and contextually relevant user
experiences.
comment: 11 pages, 3 figures
☆ Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño
Natural Language Generation (NLG) popularity has increased owing to the
progress in Large Language Models (LLMs), with zero-shot inference
capabilities. However, most neural systems utilize decoder-only causal
(unidirectional) transformer models, which are effective for English but may
reduce the richness of languages with less strict word order, subject omission,
or different relative clause attachment preferences. This is the first work
that analytically addresses optimal text generation order for non-causal
language models. We present a novel Viterbi algorithm-based methodology for
maximum likelihood word order estimation. We analyze the non-causal
most-likelihood order probability for NLG in Spanish and, then, the probability
of generating the same phrases with Spanish causal NLG. This comparative
analysis reveals that causal NLG prefers English-like SVO structures. We also
analyze the relationship between optimal generation order and causal
left-to-right generation order using Spearman's rank correlation. Our results
demonstrate that the ideal order predicted by the maximum likelihood estimator
is not closely related to the causal order and may be influenced by the
syntactic structure of the target sentence.
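
Spearman's rank correlation, used above to compare two generation orders, is the Pearson correlation of the ranks. A minimal sketch follows; the two example orders are made up for illustration and are not data from the paper.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Position of each word in left-to-right order vs. a hypothetical estimated order.
causal_order = [0, 1, 2, 3, 4]
estimated_order = [2, 0, 1, 4, 3]
print(spearman(causal_order, estimated_order))
```

A value near 1 would mean the estimated order closely follows left-to-right generation; values near 0 indicate the two orders are largely unrelated.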
☆ PredictaBoard: Benchmarking LLM Score Predictability
Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert
Despite possessing impressive skills, Large Language Models (LLMs) often fail
unpredictably, demonstrating inconsistent success in even basic common sense
reasoning tasks. This unpredictability poses a significant challenge to
ensuring their safe deployment, as identifying and operating within a reliable
"safe zone" is essential for mitigating risks. To address this, we present
PredictaBoard, a novel collaborative benchmarking framework designed to
evaluate the ability of score predictors (referred to as assessors) to
anticipate LLM errors on specific task instances (i.e., prompts) from existing
datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering
the rejection rate at different tolerance errors. As such, PredictaBoard
stimulates research into developing better assessors and making LLMs more
predictable, not only with a higher average performance. We conduct
illustrative experiments using baseline assessors and state-of-the-art LLMs.
PredictaBoard highlights the critical need to evaluate predictability alongside
performance, paving the way for safer AI systems where errors are not only
minimised but also anticipated and effectively mitigated. Code for our
benchmark can be found at
https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
☆ An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization
This study enhances Jiang et al.'s compression-based classification algorithm
by addressing its limitations in detecting semantic similarities between text
documents. The proposed improvements focus on unigram extraction and optimized
concatenation, eliminating reliance on entire document compression. By
compressing extracted unigrams, the algorithm mitigates sliding window
limitations inherent to gzip, improving compression efficiency and similarity
detection. The optimized concatenation strategy replaces direct concatenation
with the union of unigrams, reducing redundancy and enhancing the accuracy of
Normalized Compression Distance (NCD) calculations. Experimental results across
datasets of varying sizes and complexities demonstrate an average accuracy
improvement of 5.73%, with gains of up to 11% on datasets containing longer
documents. Notably, these improvements are more pronounced in datasets with
high-label diversity and complex text structures. The methodology achieves
these results while maintaining computational efficiency, making it suitable
for resource-constrained environments. This study provides a robust, scalable
solution for text classification, emphasizing lightweight preprocessing
techniques to achieve efficient compression, which in turn enables more
accurate classification.
comment: 11 pages, 5 figures, 1 table
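
Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The sketch below combines that formula with the two ideas named in the abstract (compressing unigrams and replacing concatenation with a union). The tokenization, sorting, and function names are assumptions made for illustration, not the authors' implementation.

```python
import gzip

def csize(text):
    """Compressed size in bytes of a UTF-8 string under gzip."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x, y, xy):
    """Normalized Compression Distance given x, y, and their combined form."""
    cx, cy, cxy = csize(x), csize(y), csize(xy)
    return (cxy - min(cx, cy)) / max(cx, cy)

def unigrams(doc):
    """Assumed unigram extraction: lowercase, whitespace-split, de-duplicated."""
    return set(doc.lower().split())

def unigram_ncd(doc_a, doc_b):
    """NCD over unigram sets, using the set union instead of raw concatenation."""
    ua, ub = unigrams(doc_a), unigrams(doc_b)
    x = " ".join(sorted(ua))
    y = " ".join(sorted(ub))
    xy = " ".join(sorted(ua | ub))
    return ncd(x, y, xy)

a = "the cat sat on the mat"
b = "stock markets fell sharply today"
print(unigram_ncd(a, a), unigram_ncd(a, b))
```

With the union, words shared by both documents appear only once in the combined string, so identical documents get a distance of exactly zero. A nearest-neighbour classifier over these distances would then assign each test document the label of its closest training documents.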
☆ Natural Language Generation
This book provides a broad overview of Natural Language Generation (NLG),
including technology, user requirements, evaluation, and real-world
applications. The focus is on concepts and insights which hopefully will remain
relevant for many years, not on the latest LLM innovations. It draws on decades
of work by the author and others on NLG.
The book has the following chapters: Introduction to NLG; Rule-Based NLG;
Machine Learning and Neural NLG; Requirements; Evaluation; Safety, Maintenance,
and Testing; and Applications. All chapters include examples and anecdotes from
the author's personal experiences, and end with a Further Reading section.
The book should be especially useful to people working on applied NLG,
including NLG researchers, people in other fields who want to use NLG, and
commercial developers. It will not however be useful to people who want to
understand the latest LLM technology.
There is a companion site with more information at
https://ehudreiter.com/book/
comment: This is a preprint of the following work: Ehud Reiter, Natural
Language Generation, 2024, Springer reproduced with permission of Springer
Nature Switzerland AG. The final authenticated version is available online
at: http://dx.doi.org/10.1007/978-3-031-68582-8
☆ Early-Exit and Instant Confidence Translation Quality Estimation
Quality estimation is omnipresent in machine translation, for both evaluation
and generation. Unfortunately, quality estimation models are often opaque and
computationally expensive, making them impractical to be part of large-scale
pipelines. In this work, we tackle two connected challenges: (1) reducing the
cost of quality estimation at scale, and (2) developing an inexpensive
uncertainty estimation method for quality estimation. To address the latter, we
introduce Instant Confidence COMET, an uncertainty-aware quality estimation
model that matches the performance of previous approaches at a fraction of
their costs. We extend this to Early-Exit COMET, a quality estimation model
that can compute quality scores and associated confidences already at early
model layers, allowing us to early-exit computations and reduce evaluation
costs. We also apply our model to machine translation reranking. We combine
Early-Exit COMET with an upper confidence bound bandit algorithm to find the
best candidate from a large pool without having to run the full evaluation
model on all candidates. In both cases (evaluation and reranking) our methods
reduce the required compute by 50% with very little degradation in performance.
☆ Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
Uncertainty quantification (UQ) is a prominent approach for eliciting
truthful answers from large language models (LLMs). To date, information-based
and consistency-based UQ have been the dominant UQ methods for text generation
via LLMs. Density-based methods, despite being very effective for UQ in text
classification with encoder-based models, have not been very successful with
generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a
well-established UQ technique in classification tasks - for text generation and
introduce a new supervised UQ method. Our method extracts token embeddings from
multiple layers of LLMs, computes MD scores for each token, and uses linear
regression trained on these features to provide robust uncertainty scores.
Through extensive experiments on eleven datasets, we demonstrate that our
approach substantially improves over existing UQ methods, providing accurate
and computationally efficient uncertainty scores for both sequence-level
selective generation and claim-level fact-checking tasks. Our method also
exhibits strong generalization to out-of-domain data, making it suitable for a
wide range of LLM-based applications.
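
The Mahalanobis distance of an embedding x from a reference distribution with mean μ and covariance Σ is sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)). A minimal pure-Python sketch of the per-token score is given below. How the paper estimates μ and Σ, which layers it uses, and how the regression combines the scores are not stated in the abstract, so this shows only the distance itself.

```python
import math

def solve(S, b):
    """Solve S z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, S[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a distribution with the given mean and covariance."""
    d = [a - b for a, b in zip(x, mean)]
    z = solve(cov, d)  # z = cov^{-1} d without forming the inverse explicitly
    return math.sqrt(sum(a * b for a, b in zip(d, z)))

# With an identity covariance the score reduces to the Euclidean distance.
print(mahalanobis([3, 4], [0, 0], [[1, 0], [0, 1]]))
```

Scores computed this way for each token and layer could serve as input features to a simple regressor, which is the role the abstract assigns to linear regression.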
☆ A Survey on Data Contamination for Large Language Models
Recent advancements in Large Language Models (LLMs) have demonstrated
significant progress in various areas, such as text generation and code
synthesis. However, the reliability of performance evaluation has come under
scrutiny due to data contamination, the unintended overlap between training and
test datasets. This overlap has the potential to artificially inflate model
performance, as LLMs are typically trained on extensive datasets scraped from
publicly available sources. These datasets often inadvertently overlap with the
benchmarks used for evaluation, leading to an overestimation of the models'
true generalization capabilities. In this paper, we first examine the
definition and impacts of data contamination. Secondly, we review methods for
contamination-free evaluation, focusing on three strategies: data
updating-based methods, data rewriting-based methods, and prevention-based
methods. Specifically, we highlight dynamic benchmarks and LLM-driven
evaluation methods. Finally, we categorize contamination detecting methods
based on model information dependency: white-box, gray-box, and black-box
detection approaches. Our survey highlights the requirements for more rigorous
evaluation protocols and proposes future directions for addressing data
contamination challenges.
☆ Unstructured Evidence Attribution for Long Context Query Focused Summarization
Large language models (LLMs) are capable of generating coherent summaries
from very long contexts given a user query. Extracting and properly citing
evidence spans could help improve the transparency and reliability of these
summaries. At the same time, LLMs suffer from positional biases in terms of
which information they understand and attend to, which could affect evidence
citation. Whereas previous work has focused on evidence citation with
predefined levels of granularity (e.g. sentence, paragraph, document, etc.), we
propose the task of long-context query focused summarization with unstructured
evidence citation. We show how existing systems struggle to generate and
properly cite unstructured evidence from their context, and that evidence tends
to be "lost-in-the-middle". To help mitigate this, we create the Summaries with
Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated
using a novel domain-agnostic pipeline which can be used as supervision to
adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4
datasets with varying document types and lengths that LLMs adapted with SUnsET
data generate more relevant and factually consistent evidence than their base
models, extract evidence from more diverse locations in their context, and can
generate more relevant and consistent summaries.
comment: 24 pages; 21 figures; 5 tables
☆ A Macro- and Micro-Hierarchical Transfer Learning Framework for Cross-Domain Fake News Detection
Cross-domain fake news detection aims to mitigate domain shift and improve
detection performance by transferring knowledge across domains. Existing
approaches transfer knowledge based on news content and user engagements from a
source domain to a target domain. However, these approaches face two main
limitations, hindering effective knowledge transfer and optimal fake news
detection performance. Firstly, from a micro perspective, they neglect the
negative impact of veracity-irrelevant features in news content when
transferring domain-shared features across domains. Secondly, from a macro
perspective, existing approaches ignore the relationship between user
engagement and news content, which reveals shared behaviors of common users
across domains and can facilitate more effective knowledge transfer. To address
these limitations, we propose a novel macro- and micro- hierarchical transfer
learning framework (MMHT) for cross-domain fake news detection. Firstly, we
propose a micro-hierarchical disentangling module to disentangle
veracity-relevant and veracity-irrelevant features from news content in the
source domain for improving fake news detection performance in the target
domain. Secondly, we propose a macro-hierarchical transfer learning module to
generate engagement features based on common users' shared behaviors in
different domains for improving effectiveness of knowledge transfer. Extensive
experiments on real-world datasets demonstrate that our framework significantly
outperforms the state-of-the-art baselines.
comment: 11 pages, 8 figures
☆ Enhancing Portuguese Variety Identification with Cross-Domain Approaches AAAI 2025
Recent advances in natural language processing have raised expectations for
generative models to produce coherent text across diverse language varieties.
In the particular case of the Portuguese language, the predominance of
Brazilian Portuguese corpora online introduces linguistic biases in these
models, limiting their applicability outside of Brazil. To address this gap and
promote the creation of European Portuguese resources, we developed a
cross-domain language variety identifier (LVI) to discriminate between European
and Brazilian Portuguese. Motivated by the findings of our literature review,
we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and study the
effectiveness of transformer-based LVI classifiers for cross-domain scenarios.
Although this research focuses on two Portuguese varieties, our contribution
can be extended to other varieties and languages. We open source the code,
corpus, and models to foster further research in this task.
comment: AAAI 2025
☆ Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment
Argument mining algorithms analyze the argumentative structure of essays,
making them a valuable tool for enhancing education by providing targeted
feedback on the students' argumentation skills. While current methods often use
encoder or encoder-decoder deep learning architectures, decoder-only models
remain largely unexplored, offering a promising research direction.
This paper proposes leveraging open-source, small Large Language Models
(LLMs) for argument mining through few-shot prompting and fine-tuning. These
models' small size and open-source nature ensure accessibility, privacy, and
computational efficiency, enabling schools and educators to adopt and deploy
them locally. Specifically, we perform three tasks: segmentation of student
essays into arguments, classification of the arguments by type, and assessment
of their quality. We empirically evaluate the models on the Feedback Prize -
Predicting Effective Arguments dataset of grade 6-12 student essays and
demonstrate how fine-tuned small LLMs outperform baseline methods in segmenting
the essays and determining the argument types while few-shot prompting yields
comparable performance to that of the baselines in assessing quality. This work
highlights the educational potential of small, open-source LLMs to provide
real-time, personalized feedback, enhancing independent learning and writing
skills while ensuring low computational cost and privacy.
☆ Tradutor: Building a Variety Specific Translation Model AAAI 2025
Language models have become foundational to many widely used systems.
However, these seemingly advantageous models are double-edged swords. While
they excel in tasks related to resource-rich languages like English, they often
lose the fine nuances of language forms, dialects, and varieties that are
inherent to languages spoken in multiple regions of the world. Languages like
European Portuguese are neglected in favor of their more popular counterpart,
Brazilian Portuguese, leading to suboptimal performance in various linguistic
tasks. To address this gap, we introduce the first open-source translation
model specifically tailored for European Portuguese, along with a novel dataset
specifically designed for this task. Results from automatic evaluations on two
benchmark datasets demonstrate that our best model surpasses existing
open-source translation systems for Portuguese and approaches the performance
of industry-leading closed-source systems for European Portuguese. By making
our dataset, models, and code publicly available, we aim to support and
encourage further research, fostering advancements in the representation of
underrepresented language varieties.
comment: AAAI 2025
☆ Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments
The widespread dissemination of rumors on social media has a significant
impact on people's lives, potentially leading to public panic and fear. Rumors
often evoke specific sentiments, resonating with readers and prompting sharing.
To effectively detect and track rumors, it is essential to observe the
fine-grained sentiments of both source and response message pairs as the rumor
evolves over time. However, current rumor detection methods fail to account for
this aspect. In this paper, we propose MSuf, the first multi-task suffix
learning framework for rumor detection and tracking using time series dual
(coupled) sentiments. MSuf includes three modules: (1) an LLM to extract
sentiment intensity features and sort them chronologically; (2) a module that
fuses the sorted sentiment features with their source text word embeddings to
obtain an aligned embedding; (3) two hard prompts are combined with the aligned
vector to perform rumor detection and sentiment analysis using one frozen LLM.
MSuf effectively enhances the performance of LLMs for rumor detection with only
minimal parameter fine-tuning. Evaluating MSuf on four rumor detection
benchmarks, we find significant improvements compared to other emotion-based
methods.
comment: work in progress
☆ Affinity and Diversity: A Unified Metric for Demonstration Selection via Internal Representations
The performance of In-Context Learning (ICL) is highly sensitive to the
selected demonstrations. Existing approaches to demonstration selection
optimize different objectives, yielding inconsistent results. To address this,
we propose a unified metric, affinity and diversity, that leverages the ICL model's
internal representations. Our experiments show that both affinity and diversity
strongly correlate with test accuracies, indicating their effectiveness for
demonstration selection. Moreover, we show that our proposed metrics align well
with various previous works to unify the inconsistency.
comment: 8 pages, 10 figures
☆ A Similarity Paradigm Through Textual Regularization Without Forgetting
Prompt learning has emerged as a promising method for adapting pre-trained
visual-language models (VLMs) to a range of downstream tasks. While optimizing
the context can be effective for improving performance on specific tasks, it
can often lead to poor generalization performance on unseen classes or datasets
sampled from different distributions. This may be attributed to the fact that
textual prompts tend to overfit downstream data distributions, leading to the
forgetting of generalized knowledge derived from hand-crafted prompts. In this
paper, we propose a novel method called Similarity Paradigm with Textual
Regularization (SPTR) for prompt learning without forgetting. SPTR is an
inseparable two-pronged framework built on hand-crafted prompts. 1) To avoid
forgetting general textual knowledge, we introduce optimal transport as a
textual regularization that keeps the tuned textual features close to the
hand-crafted features. 2) In order to
continuously unleash the general ability of multiple hand-crafted prompts, we
propose a similarity paradigm for natural alignment score and adversarial
alignment score to improve model robustness for generalization. Both modules
share a common objective in addressing generalization issues, aiming to
maximize the generalization capability derived from multiple hand-crafted
prompts. Four representative tasks (i.e., non-generalization few-shot learning,
base-to-novel generalization, cross-dataset generalization, domain
generalization) across 11 datasets demonstrate that SPTR outperforms existing
prompt learning methods.
☆ Entropy-UID: A Method for Optimizing Information Density ACL 2025
Balanced and efficient information flow is essential for optimizing language
generation models. In this work, we propose Entropy-UID, a new token selection
method that balances entropy and Uniform Information Density (UID) principles
for enhanced efficiency of text generation. Our approach adaptively adjusts
token selection by jointly minimizing entropy and surprisal, promoting more
even information distribution across generated sequences. Theoretical
validation demonstrates that Entropy-UID optimally reduces information spikes
while maintaining fluency and coherence. The method has been evaluated using
information-theoretic metrics on multiple benchmark datasets, including
WikiText-2, OpenWebText, and WMT. Experimental results show that Entropy-UID
achieves lower surprisal and entropy variance compared to standard GPT-2 and
alternative heuristics, leading to more balanced and human-like text
generation. Our findings point towards the potential of leveraging
information-theoretic constraints to refine token selection strategies in
autoregressive language models.
comment: 5 pages, 1 figure, submitting to ACL 2025
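
The quantities named in the abstract can be computed directly from token probabilities: surprisal is -log2 p(token), entropy is the expected surprisal of the next-token distribution, and the variance of surprisal over a sequence is one common way to quantify how uniform the information density is. The sketch below covers only these metrics. The paper's actual selection rule is not given in the abstract and is not reproduced here, and the probabilities are made-up examples.

```python
import math

def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def variance(xs):
    """Population variance; lower surprisal variance means more uniform density."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Probabilities a model assigned to the tokens it actually generated (illustrative).
even_sequence = [0.25, 0.25, 0.25, 0.25]
spiky_sequence = [0.5, 0.5, 0.5, 0.0625]

for name, seq in [("even", even_sequence), ("spiky", spiky_sequence)]:
    s = [surprisal(p) for p in seq]
    print(name, "mean surprisal:", sum(s) / len(s), "variance:", variance(s))
```

A sequence with one very unlikely token shows up as a spike in surprisal and therefore a higher variance, which is the kind of information spike the method aims to reduce.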
☆ Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Filippo Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández, Raffaella Bernardi
We examine three evaluation paradigms: large question-answering benchmarks
(e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and
cognitive tests (e.g., for working memory or theory of mind). First, we
investigate which of the former two (benchmarks or games) is most effective at
discriminating LLMs of varying quality. Then, inspired by human cognitive
assessments, we compile a suite of targeted tests that measure cognitive
abilities deemed essential for effective language use, and we investigate their
correlation with model performance in benchmarks and games. Our analyses reveal
that interactive games are superior to standard benchmarks in discriminating
models. Causal and logical reasoning correlate with both static and interactive
tests, while differences emerge regarding core executive functions and
social/emotional skills, which correlate more with games. We advocate the
development of new interactive benchmarks and targeted cognitive tasks inspired
by assessing human abilities but designed specifically for LLMs.
☆ Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
Direct Preference Optimization (DPO) often struggles with long-chain
mathematical reasoning. Existing approaches, such as Step-DPO, typically
improve this by focusing on the first erroneous step in the reasoning chain.
However, they overlook all other steps and rely heavily on humans or GPT-4 to
identify erroneous steps. To address these issues, we propose Full-Step-DPO, a
novel DPO framework tailored for mathematical reasoning. Instead of optimizing
only the first erroneous step, it leverages step-wise rewards from the entire
reasoning chain. This is achieved by training a self-supervised process reward
model, which automatically scores each step, providing rewards while avoiding
reliance on external signals. Furthermore, we introduce a novel step-wise DPO
loss, which dynamically updates gradients based on these step-wise rewards.
This endows language models with stronger reasoning capabilities. Extensive
evaluations on both in-domain and out-of-domain mathematical reasoning
benchmarks across various base language models demonstrate that Full-Step-DPO
achieves superior performance compared to state-of-the-art baselines.
☆ Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment
Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple
human preference objectives, with Direct Preference Optimization (DPO) emerging
as a prominent approach. However, we find that DPO-based MOA approaches suffer
from widespread preference conflicts in the data, where different objectives
favor different responses. This results in conflicting optimization directions,
hindering the optimization on the Pareto Front. To address this, we propose to
construct Pareto-optimal responses to resolve preference conflicts. To
efficiently obtain and utilize such responses, we propose a self-improving DPO
framework that enables LLMs to self-generate and select Pareto-optimal
responses for self-supervised preference alignment. Extensive experiments on
two datasets demonstrate the superior Pareto Front achieved by our framework
compared to various baselines. Code is available at
https://github.com/zyttt-coder/SIPO.
comment: Under review
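
Selecting Pareto-optimal responses can be sketched with a standard dominance check: a response is kept if no other response is at least as good on every objective and strictly better on at least one. The objective names and scores below are hypothetical, and the abstract does not say how the paper scores or generates candidates.

```python
def dominates(a, b):
    """True if score vector a is >= b on every objective and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored):
    """scored: list of (response, scores) pairs; returns the non-dominated responses."""
    return [resp for resp, s in scored
            if not any(dominates(t, s) for _, t in scored)]

# Hypothetical (helpfulness, harmlessness) scores for four candidate responses.
candidates = [
    ("a", (0.9, 0.2)),
    ("b", (0.5, 0.5)),
    ("c", (0.4, 0.4)),   # dominated by "b"
    ("d", (0.2, 0.9)),
]
print(pareto_front(candidates))
```

Responses on the front trade one objective against another without being strictly worse than any alternative, which makes them natural candidates for the preferred side of a preference pair.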
☆ SR-LLM: Rethinking the Structured Representation in Large Language Model
Jiahuan Zhang, Tianheng Wang, Hanqing Wu, Ziyi Huang, Yulong Wu, Dongbai Chen, Linfeng Song, Yue Zhang, Guozheng Rao, Kaicheng Yu
Structured representations, exemplified by Abstract Meaning Representation
(AMR), have long been pivotal in computational linguistics. However, their role
remains ambiguous in the Large Language Models (LLMs) era. Initial attempts to
integrate structured representation into LLMs via a zero-shot setting yielded
inferior performance. We hypothesize that such a decline stems from the
structure information being passed into LLMs in a code format unfamiliar to
LLMs' training corpora. Consequently, we propose SR-LLM, an innovative
framework with two settings to explore a superior way of integrating structured
representation with LLMs from training-free and training-dependent
perspectives. The former integrates structural information through natural
language descriptions in LLM prompts, whereas its counterpart augments the
model's inference capability through fine-tuning on linguistically described
structured representations. Performance improvements were observed across a
wide range of downstream datasets, with particularly notable gains of 3.17% and 12.38% in
PAWS. To the best of our knowledge, this work represents the pioneering
demonstration that leveraging structural representations can substantially
enhance LLMs' inference capability. We hope that our work sheds light on and
encourages future research to enhance the reasoning and interoperability of
LLMs with structured data.
☆ Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective ICLR 2025
Direct Preference Optimization (DPO) has gained attention as an efficient
alternative to reinforcement learning from human feedback (RLHF) for aligning
large language models (LLMs) with human preferences. Despite its advantages,
DPO suffers from a length bias, generating responses longer than those from the
reference model. Existing solutions like SimPO and SamPO address this issue but
uniformly treat the contribution of rewards across sequences, overlooking
temporal dynamics. To this end, we propose an enhanced preference optimization
method that incorporates a temporal decay factor controlled by a gamma
parameter. This dynamic weighting mechanism adjusts the influence of each
reward based on its position in the sequence, prioritizing earlier tokens that
are more critical for alignment. By adaptively focusing on more relevant
feedback, our approach mitigates overfitting to less pertinent data and remains
responsive to evolving human preferences. Experimental results on several
benchmarks show that our approach consistently outperforms vanilla DPO by
5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across
different model architectures and sizes. Furthermore, additional experiments on
mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our
method enhances performance without compromising general capabilities. Our
codebase will be available at https://github.com/LotuSrc/D2PO.
comment: Accepted by ICLR 2025
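
One way to picture a gamma-controlled temporal decay is to weight each token's policy-versus-reference log-ratio by gamma^t before forming the usual DPO margin. The sketch below assumes that geometric form. The paper's exact weighting and normalization are not given in the abstract, and the function names and example log-probabilities are hypothetical.

```python
import math

def decayed_logratio(policy_logps, ref_logps, gamma):
    """Sum of per-token log-ratios, with token t weighted by gamma**t (assumed form)."""
    return sum((gamma ** t) * (lp - lr)
               for t, (lp, lr) in enumerate(zip(policy_logps, ref_logps)))

def decayed_dpo_loss(chosen, rejected, beta=0.1, gamma=0.9):
    """chosen / rejected: (policy_logps, ref_logps) tuples of per-token log-probs.
    Returns -log(sigmoid(margin)), the standard DPO loss shape."""
    margin = beta * (decayed_logratio(chosen[0], chosen[1], gamma)
                     - decayed_logratio(rejected[0], rejected[1], gamma))
    return math.log1p(math.exp(-margin))

chosen = ([-1.0, -1.0, -1.0], [-2.0, -2.0, -2.0])
rejected = ([-2.0, -2.0, -2.0], [-1.0, -1.0, -1.0])
print(decayed_dpo_loss(chosen, rejected, beta=0.1, gamma=0.9))
```

Setting gamma to 1 recovers the unweighted sum over tokens, while smaller values shift the weight toward the earliest tokens of each response.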
☆ English Please: Evaluating Machine Translation for Multilingual Bug Reports
Accurate translation of bug reports is critical for efficient collaboration
in global software development. In this study, we conduct the first
comprehensive evaluation of machine translation (MT) performance on bug
reports, analyzing the capabilities of DeepL, AWS Translate, and ChatGPT using
data from the Visual Studio Code GitHub repository, specifically focusing on
reports labeled with the english-please tag. To thoroughly assess the accuracy
and effectiveness of each system, we employ multiple machine translation
metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. Our findings
indicate that DeepL consistently outperforms the other systems across most
automatic metrics, demonstrating strong lexical and semantic alignment. AWS
Translate performs competitively, particularly in METEOR, while ChatGPT lags in
key metrics. This study underscores the importance of domain adaptation for
translating technical texts and offers guidance for integrating automated
translation into bug-triaging workflows. Moreover, our results establish a
foundation for future research to refine machine translation solutions for
specialized engineering contexts. The code and dataset for this paper are
available at GitHub: https://github.com/av9ash/gitbugs/tree/main/multilingual.
comment: 8 Pages, 4 Figures, 3 Tables
☆ Information Types in Product Reviews
Information in text is communicated in a way that supports a goal for its
reader. Product reviews, for example, contain opinions, tips, product
descriptions, and many other types of information that provide both direct
insights, as well as unexpected signals for downstream applications. We devise
a typology of 24 communicative goals in sentences from the product review
domain, and employ a zero-shot multi-label classifier that facilitates
large-scale analyses of review data. In our experiments, we find that the
combination of classes in the typology forecasts helpfulness and sentiment of
reviews, while supplying explanations for these decisions. In addition, our
typology enables analysis of review intent, effectiveness and rhetorical
structure. Characterizing the types of information in reviews unlocks many
opportunities for more effective consumption of this genre.
☆ A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics
Recent work on large language models (LLMs) has found that chain-of-thought
prompting strategies improve the reasoning ability of LLMs by encouraging
problem solving through multiple steps. Therefore, subsequent research aimed to
integrate the multi-step reasoning process into the LLM itself through process
rewards as feedback and achieved improvements over prompting strategies. Due to
the cost of step-level annotation, some turn to outcome rewards as feedback.
Aside from these training-based approaches, training-free techniques leverage
frozen LLMs or external tools for feedback at each step to enhance the
reasoning process. With the abundance of work in mathematics due to its logical
nature, we present a survey of strategies utilizing feedback at the step and
outcome levels to enhance multi-step math reasoning for LLMs. As multi-step
reasoning emerges as a crucial component in scaling LLMs, we hope to establish its
foundation for easier understanding and empower further research.
☆ Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems
Large Language Models (LLMs) have recently demonstrated remarkable
capabilities in reasoning, planning, and decision-making. Building upon these
strengths, researchers have begun incorporating LLMs into multi-agent systems
(MAS), where agents collaborate or compete through natural language
interactions to tackle tasks beyond the scope of single-agent setups. In this
survey, we present a communication-centric perspective on LLM-based multi-agent
systems, examining key system-level features such as architecture design and
communication goals, as well as internal mechanisms like communication
strategies, paradigms, objects and content. We illustrate how these
communication elements interplay to enable collective intelligence and flexible
collaboration. Furthermore, we discuss prominent challenges, including
scalability, security, and multimodal integration, and propose directions for
future work to advance research in this emerging domain. Ultimately, this
survey serves as a catalyst for further innovation, fostering more robust,
scalable, and intelligent multi-agent systems across diverse application
domains.
☆ Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models
Large language models (LLMs) regularly demonstrate new and impressive
performance on a wide range of language, knowledge, and reasoning benchmarks.
Such rapid progress has led many commentators to argue that LLM general
cognitive capabilities have likewise rapidly improved, with the implication
that such models are becoming progressively more capable on various real-world
tasks. Here I summarise theoretical and empirical considerations to challenge
this narrative. I argue that inherent limitations with the benchmarking
paradigm, along with specific limitations of existing benchmarks, render
benchmark performance highly unsuitable as a metric for generalisable
competence over cognitive tasks. I also contend that alternative methods for
assessing LLM capabilities, including adversarial stimuli and interpretability
techniques, have shown that LLMs do not have robust competence in many language
and reasoning tasks, and often fail to learn representations which facilitate
generalisable inferences. I conclude that benchmark performance should not be
used as a reliable indicator of general LLM cognitive capabilities.
comment: 10 pages
☆ ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong
Efficiently handling long contexts is crucial for large language models
(LLMs). While rotary position embeddings (RoPEs) enhance length generalization,
effective length extrapolation remains challenging and often requires costly
fine-tuning. In contrast, recent training-free approaches suffer from the
attention sink phenomenon, leading to severe performance degradation. In this
paper, we introduce ParallelComp, a novel training-free method for long-context
extrapolation that extends LLMs' context length from 4K to 128K while
maintaining high throughput and preserving perplexity, and integrates
seamlessly with Flash Attention. Our analysis offers new insights into
attention biases in parallel attention mechanisms and provides practical
solutions to tackle these challenges. To mitigate the attention sink issue, we
propose an attention calibration strategy that reduces biases, ensuring more
stable long-range attention. Additionally, we introduce a chunk eviction
strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU.
To further enhance efficiency, we propose a parallel KV cache eviction
technique, which improves chunk throughput by 1.76x, thereby achieving a 23.50x
acceleration in the prefilling stage with negligible performance loss due to
attention calibration. Furthermore, ParallelComp achieves 91.17% of GPT-4's
performance on long-context tasks using an 8B model trained on 8K-length
context, outperforming powerful closed-source models such as Claude-2 and
Kimi-Chat.
comment: We will release the code soon
☆ Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension
Despite the impressive performance of multilingual large language models
(mLLMs) in various natural language processing tasks, their ability to
understand procedural texts, particularly those with culture-specific content,
remains largely unexplored. Texts describing cultural procedures, including
rituals, traditional craftsmanship, and social etiquette, require an inherent
understanding of cultural context, presenting a significant challenge for
mLLMs. In this work, we introduce CAPTex, a benchmark designed to evaluate
mLLMs' ability to process and reason about culturally diverse procedural texts
across multiple languages using various methodologies to assess their
performance. Our findings indicate that (1) mLLMs face difficulties with
culturally contextualized procedural texts, showing notable performance
declines in low-resource languages, (2) model performance fluctuates across
cultural domains, with some areas presenting greater difficulties, and (3)
language models exhibit better performance on multiple-choice tasks within
conversational frameworks compared to direct questioning. These results
underscore the current limitations of mLLMs in handling culturally nuanced
procedural texts and highlight the need for culturally aware benchmarks like
CAPTex to enhance their adaptability and comprehension across diverse
linguistic and cultural landscapes.
☆ The Impact and Feasibility of Self-Confidence Shaping for AI-Assisted Decision-Making
In AI-assisted decision-making, it is crucial but challenging for humans to
appropriately rely on AI, especially in high-stakes domains such as finance and
healthcare. This paper addresses this problem from a human-centered perspective
by presenting an intervention for self-confidence shaping, designed to
calibrate self-confidence at a targeted level. We first demonstrate the impact
of self-confidence shaping by quantifying the upper-bound improvement in
human-AI team performance. Our behavioral experiments with 121 participants
show that self-confidence shaping can improve human-AI team performance by
nearly 50% by mitigating both over- and under-reliance on AI. We then introduce
a self-confidence prediction task to identify when our intervention is needed.
Our results show that simple machine-learning models achieve 67% accuracy in
predicting self-confidence. We further illustrate the feasibility of such
interventions. The observed relationship between sentiment and self-confidence
suggests that modifying sentiment could be a viable strategy for shaping
self-confidence. Finally, we outline future research directions to support the
deployment of self-confidence shaping in a real-world scenario for effective
human-AI collaboration.
☆ MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
Advancements in Large Language Models (LLMs) and their increasing use in
medical question-answering necessitate rigorous evaluation of their
reliability. A critical challenge lies in hallucination, where models generate
plausible yet factually incorrect outputs. In the medical domain, this poses
serious risks to patient safety and clinical decision-making. To address this,
we introduce MedHallu, the first benchmark specifically designed for medical
hallucination detection. MedHallu comprises 10,000 high-quality question-answer
pairs derived from PubMedQA, with hallucinated answers systematically generated
through a controlled pipeline. Our experiments show that state-of-the-art LLMs,
including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical,
struggle with this binary hallucination detection task, with the best model
achieving an F1 score as low as 0.625 for detecting "hard" category
hallucinations. Using bidirectional entailment clustering, we show that
harder-to-detect hallucinations are semantically closer to ground truth.
Through experiments, we also show incorporating domain-specific knowledge and
introducing a "not sure" category as one of the answer categories improves the
precision and F1 scores by up to 38% relative to baselines.
comment: Code and dataset are available at https://medhallu.github.io/
☆ SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, William Chandra Tjhi
With the rapid emergence of novel capabilities in Large Language Models
(LLMs), the need for rigorous, integrated multilingual and multicultural
benchmarks has become more pronounced. Though existing LLM benchmarks are
capable of evaluating specific capabilities of LLMs in English as well as in
various mid- to low-resource languages, including those in the Southeast Asian
(SEA) region, a comprehensive and authentic evaluation suite for the SEA
languages has not been developed thus far. Here, we present SEA-HELM, a
holistic linguistic and cultural LLM evaluation suite that emphasizes SEA
languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics,
(3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports
Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the
SEA-HELM leaderboard, which allows users to understand models' multilingual and
multicultural performance in a systematic and user-friendly manner.
☆ Drift: Decoding-time Personalized Alignments with Implicit User Preferences
Personalized alignments for individual users have been a long-standing goal
in large language models (LLMs). We introduce Drift, a novel framework that
personalizes LLMs at decoding time with implicit user preferences. Traditional
Reinforcement Learning from Human Feedback (RLHF) requires thousands of
annotated examples and expensive gradient updates. In contrast, Drift
personalizes LLMs in a training-free manner, using only a few dozen examples to
steer a frozen model through efficient preference modeling. Our approach models
user preferences as a composition of predefined, interpretable attributes and
aligns them at decoding time to enable personalized generation. Experiments on
both a synthetic persona dataset (Perspective) and a real human-annotated
dataset (PRISM) demonstrate that Drift significantly outperforms RLHF baselines
while using only 50-100 examples. Our results and analysis show that Drift is
both computationally efficient and interpretable.
comment: 19 pages, 6 figures
☆ Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach
Yurong Wu, Fangwen Mu, Qiuhong Zhang, Jinjing Zhao, Xinrun Xu, Lingrui Mei, Yang Wu, Lin Shi, Junjie Wang, Zhiming Ding, Yiwei Wang
Prompt trading has emerged as a significant intellectual property concern in
recent years, where vendors entice users by showcasing sample images before
selling prompt templates that can generate similar images. This work
investigates a critical security vulnerability: attackers can steal prompt
templates using only a limited number of sample images. To investigate this
threat, we introduce Prism, a prompt-stealing benchmark consisting of 50
templates and 450 images, organized into Easy and Hard difficulty levels. To
identify the vulnerability of VLMs to prompt stealing, we propose EvoStealer, a
novel template stealing method that operates without model fine-tuning by
leveraging differential evolution algorithms. The system first initializes
population sets using multimodal large language models (MLLMs) based on
predefined patterns, then iteratively generates enhanced offspring through
MLLMs. During evolution, EvoStealer identifies common features across offspring
to derive generalized templates. Our comprehensive evaluation conducted across
open-source (INTERNVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini)
demonstrates that EvoStealer's stolen templates can reproduce images highly
similar to originals and effectively generalize to other subjects,
significantly outperforming baseline methods with an average improvement of
over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template
stealing with negligible computational expenses. Our code and dataset are
available at https://github.com/whitepagewu/evostealer.
comment: 14 pages, 8 figures, 4 tables
☆ EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts
Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer
Recent advances in Large Language Models (LLMs) have yielded impressive
successes on many language tasks. However, efficient processing of long
contexts using LLMs remains a significant challenge. We introduce
EpMAN, a method for processing long contexts in an episodic
memory module while holistically attending to semantically relevant
context chunks. The output of episodic attention is then used to
reweigh the decoder's self-attention to the stored KV cache of the context
during training and generation. When an LLM decoder is trained using
EpMAN, its performance on multiple challenging single-hop long-context
recall and question-answering benchmarks is found to be stronger and more
robust across the range from 16k to 256k tokens than baseline decoders trained
with self-attention, and popular retrieval-augmented generation frameworks.
☆ STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Large language model (LLM)-based agents have shown promise in tackling
complex tasks by interacting dynamically with the environment. Existing work
primarily focuses on behavior cloning from expert demonstrations and preference
learning through exploratory trajectory sampling. However, these methods often
struggle in long-horizon tasks, where suboptimal actions accumulate step by
step, causing agents to deviate from correct task trajectories. To address
this, we highlight the importance of timely calibration and the need to
automatically construct calibration trajectories for training agents. We
propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM
agent learning. Specifically, STeCa identifies suboptimal actions through a
step-level reward comparison during exploration. It constructs calibrated
trajectories using LLM-driven reflection, enabling agents to learn from
improved decision-making processes. These calibrated trajectories, together
with successful trajectory data, are utilized for reinforced training.
Extensive experiments demonstrate that STeCa significantly outperforms existing
methods. Further analysis highlights that step-level calibration enables agents
to complete tasks with greater robustness. Our code and data are available at
https://github.com/WangHanLinHenry/STeCa.
☆ Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment
Large language models (LLMs) have been widely adopted in various downstream
task domains. However, their ability to directly recall and apply factual
medical knowledge remains under-explored. Most existing medical QA benchmarks
assess complex reasoning or multi-hop inference, making it difficult to isolate
LLMs' inherent medical knowledge from their reasoning capabilities. Given the
high-stakes nature of medical applications, where incorrect information can
have critical consequences, it is essential to evaluate how well LLMs encode,
retain, and recall fundamental medical facts.
To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset
specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ
is constructed from the Unified Medical Language System (UMLS), a large-scale
repository of standardized biomedical vocabularies and knowledge graphs. We
frame knowledge assessment as a binary judgment task, requiring LLMs to verify
the correctness of medical statements extracted from reliable and structured
knowledge sources.
Our experiments reveal that LLMs struggle with factual medical knowledge
retention, exhibiting significant performance variance across different
semantic categories, particularly for rare medical conditions. Furthermore,
LLMs show poor calibration, often being overconfident in incorrect answers. To
mitigate these issues, we explore retrieval-augmented generation, demonstrating
its effectiveness in improving factual accuracy and reducing uncertainty in
medical decision-making.
comment: 15 pages, 11 figures
☆ Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
Aligning small language models (SLMs) with human values typically involves
distilling preference knowledge from large language models (LLMs). However,
existing distillation methods model preference knowledge in teacher LLMs by
comparing pairwise responses, overlooking the extent of difference between
responses. This limitation hinders student SLMs from capturing the nuanced
preferences for multiple responses. In this paper, we propose a
Preference-Aligned Distillation (PAD) framework, which models teacher's
preference knowledge as a probability distribution over all potential
preferences, thereby providing more nuanced supervisory signals. Our insight in
developing PAD is rooted in the demonstration that language models can serve as
reward functions, reflecting their intrinsic preferences. Based on this, PAD
comprises three key steps: (1) sampling diverse responses at high
temperature; (2) computing rewards for both teacher and student to
construct their intrinsic preference; and (3) training the student's intrinsic
preference distribution to align with the teacher's. Experiments on four
mainstream alignment benchmarks demonstrate that PAD consistently and
significantly outperforms existing approaches, achieving over 20% improvement
on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human
preferences. Notably, on MT-Bench, using the Gemma model family, the
student trained by PAD surpasses its teacher, further validating the
effectiveness of our PAD.
comment: Under review
♻ ☆ Large Language Model Confidence Estimation via Black-Box Access
Estimating uncertainty or confidence in the responses of a model can be
significant in evaluating trust not only in the responses, but also in the
model as a whole. In this paper, we explore the problem of estimating
confidence for responses of large language models (LLMs) with only black-box
or query access to them. We propose a simple and extensible framework in which
we engineer novel features and train an interpretable model (viz. logistic
regression) on these features to estimate the confidence. We empirically
demonstrate that our simple framework is effective in estimating confidence of
Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks, as well
as of Pegasus-large and BART-large on two benchmark summarization tasks,
surpassing baselines by over 10% (on AUROC) in some cases.
Additionally, our interpretable approach provides insight into features that
are predictive of confidence, leading to the interesting and useful discovery
that our confidence models built for one LLM generalize zero-shot across others
on a given dataset.
♻ ☆ The Computational Limits of State-Space Models and Mamba via the Lens of Circuit Complexity
In this paper, we analyze the computational limitations of Mamba and
State-space Models (SSMs) by using the circuit complexity framework. Despite
Mamba's stateful design and recent attention as a strong candidate to
outperform Transformers, we have demonstrated that both Mamba and SSMs with
$\mathrm{poly}(n)$-precision and constant-depth layers reside within the
$\mathsf{DLOGTIME}$-uniform $\mathsf{TC}^0$ complexity class. This result
indicates that Mamba has, in theory, the same computational capabilities as
Transformers, and that it cannot solve problems like arithmetic formula problems,
boolean formula value problems, and permutation composition problems if
$\mathsf{TC}^0 \neq \mathsf{NC}^1$. Therefore, it challenges the assumption
that Mamba is more computationally expressive than Transformers. Our contributions
include rigorous proofs showing that Selective SSM and Mamba architectures can
be simulated by $\mathsf{DLOGTIME}$-uniform $\mathsf{TC}^0$ circuits, and they
cannot solve problems outside $\mathsf{TC}^0$.
comment: CPAL 2025
♻ ☆ Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted
their potential in building general-purpose agents in the 3D real world, yet
challenges remain due to the lack of high-quality robust instruction-following
data, leading to limited discriminative power and generalization of 3DLLMs. In
this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale
instruction-following data generated by our novel data engine, Robust
Instruction Generation (RIG) engine. RIG generates two key types of instruction
data: 1) Adversarial Instruction-following data, which features mixed negative
and positive samples to enhance the model's discriminative understanding, and 2)
Diverse Instruction-following data, which contains various instruction styles
to enhance the model's generalization. As a result, we construct 1 million
instruction-following data, consisting of 344K Adversarial samples, 508K
Diverse samples, and 165K benchmark training set samples. To better handle
these complex instructions, Robin3D first incorporates Relation-Augmented
Projector to enhance spatial understanding, and then strengthens the object
referring and grounding ability through ID-Feature Bonding. Robin3D
consistently outperforms previous methods across five widely-used 3D multimodal
learning benchmarks, without the need for task-specific fine-tuning. Notably,
we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9%
improvement in the captioning task (Scan2Cap).
comment: 8 pages
♻ ☆ How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations NAACL 2025
Multimodal foundation models aim to create a unified representation space
that abstracts away from surface features like language syntax or modality
differences. To investigate this, we study the internal representations of
three recent models, analyzing the model activations from semantically
equivalent sentences across languages in the text and speech modalities. Our
findings reveal that: 1) Cross-modal representations converge over model
layers, except in the initial layers specialized at text and speech processing.
2) Length adaptation is crucial for reducing the cross-modal gap between text
and speech, although current approaches' effectiveness is primarily limited to
high-resource languages. 3) Speech exhibits larger cross-lingual differences
than text. 4) For models not explicitly trained for modality-agnostic
representations, the modality gap is more prominent than the language gap.
comment: NAACL 2025
♻ ☆ Safety Evaluation of DeepSeek Models in Chinese Contexts
Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, Shiguo Lian
Recently, the DeepSeek series of models, leveraging their exceptional
reasoning capabilities and open-source strategy, is reshaping the global AI
landscape. Despite these advantages, they exhibit significant safety
deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco,
in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1
has a 100% attack success rate when processing harmful prompts. Additionally,
multiple safety companies and research institutions have confirmed critical
safety vulnerabilities in this model. As models demonstrating robust
performance in Chinese and English, DeepSeek models require equally crucial
safety assessments in both language contexts. However, current research has
predominantly focused on safety evaluations in English environments, leaving a
gap in comprehensive assessments of their safety performance in Chinese
contexts. In response to this gap, this study introduces CHiSafetyBench, a
Chinese-specific safety evaluation benchmark. This benchmark systematically
evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts,
revealing their performance across safety categories. The experimental results
quantify the deficiencies of these two models in Chinese contexts, providing
key insights for subsequent improvements. It should be noted that, despite our
efforts to establish a comprehensive, objective, and authoritative evaluation
benchmark, the selection of test samples, characteristics of data distribution,
and the setting of evaluation criteria may inevitably introduce certain biases
into the evaluation results. We will continuously optimize the evaluation
benchmark and periodically update this report to provide more comprehensive and
accurate assessment outcomes. Please refer to the latest version of the paper
for the most recent evaluation results and conclusions.
comment: 12 pages, 2 tables, 7 figures
♻ ☆ Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation
Despite the remarkable capabilities of Large Language Models (LLMs) in
various NLP tasks, they remain vulnerable to hallucinations due to their
limited parametric knowledge and lack of domain-specific expertise.
Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
external document retrieval to augment the knowledge base of LLMs. In this
approach, RAG retrieves document chunks from an external corpus in response to
a query, which are then used as context for the downstream language model to
generate an answer. However, these retrieved knowledge sources often include
irrelevant or erroneous information, undermining the effectiveness of RAG in
downstream tasks. To overcome this limitation, we introduce a compact,
efficient, and pluggable module designed to refine external knowledge sources
before feeding them to the generator. The module reconstructs retrieved content
by extracting the most relevant and supportive information and reorganising it
into a concise, query-specific format. Through a three-stage training paradigm
- comprising supervised fine-tuning, contrastive multi-task learning, and
reinforcement learning-based alignment - it prioritises critical knowledge and
aligns it with the generator's preferences. This method enables LLMs to produce
outputs that are more accurate, reliable, and contextually appropriate.
comment: 14 pages
♻ ☆ Reading between the Lines: Can LLMs Identify Cross-Cultural Communication Gaps?
In a rapidly globalizing and digital world, content such as book and product
reviews created by people from diverse cultures is read and consumed by others
from different corners of the world. In this paper, we investigate the extent
and patterns of gaps in understandability of book reviews due to the presence
of culturally-specific items and elements that might be alien to users from
another culture. Our user study on 57 book reviews from Goodreads reveals that
83% of the reviews had at least one culture-specific difficult-to-understand
element. We also evaluate the efficacy of GPT-4o in identifying such items,
given the cultural background of the reader; the results are mixed, implying a
significant scope for improvement. Our datasets are available here:
https://github.com/sougata-ub/reading_between_lines
♻ ☆ metabench -- A Sparse Benchmark of Reasoning and Knowledge in Large Language Models ICLR 2025
Large Language Models (LLMs) vary in their abilities on a range of tasks.
Initiatives such as the Open LLM Leaderboard aim to quantify these differences
with several large benchmarks (sets of test items to which an LLM can respond
either correctly or incorrectly). However, high correlations within and between
benchmark scores suggest that (1) there exists a small set of common underlying
abilities that these benchmarks measure, and (2) items tap into redundant
information and the benchmarks may thus be considerably compressed. We use data
from n > 5000 LLMs to identify the most informative items of six benchmarks,
ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items
in total). From them we distill a sparse benchmark, metabench, that has less
than 3% of the original size of all six benchmarks combined. This new sparse
benchmark goes beyond point scores by yielding estimators of the underlying
benchmark-specific abilities. We show that these estimators (1) can be used to
reconstruct each original individual benchmark score with, on average, 1.24%
root mean square error (RMSE), (2) reconstruct the original total score with
0.58% RMSE, and (3) have a single underlying common factor whose Spearman
correlation with the total score is r = 0.94.
comment: accepted for publication at ICLR 2025
♻ ☆ Certified Robustness Under Bounded Levenshtein Distance ICLR 2025
Text classifiers are vulnerable to small perturbations that, if chosen
adversarially, can dramatically change the output of the model. Verification
methods can provide robustness certificates against such adversarial
perturbations, by computing a sound lower bound on the robust accuracy.
Nevertheless, existing verification methods incur prohibitive costs and
cannot practically handle Levenshtein distance constraints. We propose the
first method for computing the Lipschitz constant of convolutional classifiers
with respect to the Levenshtein distance. We use these Lipschitz constant
estimates for training 1-Lipschitz classifiers. This enables computing the
certified radius of a classifier in a single forward pass. Our method, LipsLev,
is able to obtain $38.80$% and $13.93$% verified accuracy at distance $1$ and
$2$ respectively in the AG-News dataset, while being $4$ orders of magnitude
faster than existing approaches. We believe our work can open the door to more
efficient verification in the text domain.
comment: Accepted in ICLR 2025
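A minimal sketch of the two quantities the abstract above refers to: the Levenshtein (edit) distance between two strings, and a certified radius read off a single forward pass. The radius rule is an assumption made for illustration, not the paper's stated procedure: it supposes every logit is 1-Lipschitz with respect to Levenshtein distance, so the margin between the top two logits can shrink by at most 2 per edit. The function names are hypothetical.

```python
import math
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def certified_radius(logits: Sequence[float]) -> int:
    """Largest integer d with margin - 2*d > 0.

    Assumption (illustrative): each logit is 1-Lipschitz w.r.t. Levenshtein
    distance, so d edits can close the top-two margin by at most 2*d.
    """
    top, runner_up = sorted(logits, reverse=True)[:2]
    margin = top - runner_up
    return max(0, math.ceil(margin / 2) - 1)
```

Under this assumption, a margin of 5.0 certifies the prediction against any 2 edits, since 5.0 - 2*2 is still positive, but not against 3.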
♻ ☆ SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters ICLR 2025
Existing preference optimization objectives for language model alignment
require additional hyperparameters that must be extensively tuned to achieve
optimal performance, increasing both the complexity and time required for
fine-tuning large language models. In this paper, we propose a simple yet
effective hyperparameter-free preference optimization algorithm for alignment.
We observe that promising performance can be achieved simply by optimizing
inverse perplexity, which is calculated as the inverse of the exponentiated
average log-likelihood of the chosen and rejected responses in the preference
dataset. The resulting simple learning objective, SimPER, is easy to implement
and eliminates the need for expensive hyperparameter tuning and a reference
model, making it both computationally and memory efficient. Extensive
experiments on widely used real-world benchmarks, including MT-Bench,
AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base
models, demonstrate that SimPER consistently and significantly outperforms
existing approaches, even without any hyperparameters or a reference model. For
example, despite its simplicity, SimPER outperforms state-of-the-art methods by
up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking
across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is
publicly available at: https://github.com/tengxiao1/SimPER.
comment: ICLR 2025
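A rough sketch of the inverse-perplexity quantity described in the abstract above, taking inverse perplexity to mean exp(mean token log-likelihood), i.e. 1 / perplexity. The combined loss is an assumed reading of the abstract (raise the inverse perplexity of the chosen response, lower that of the rejected one) and should not be read as the authors' exact objective. The function names are hypothetical.

```python
import math


def inverse_perplexity(token_logprobs):
    """exp(mean log-likelihood) = 1 / perplexity, a value in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def simper_style_loss(chosen_logprobs, rejected_logprobs):
    """Assumed objective: minimizing this raises the chosen response's
    inverse perplexity and lowers the rejected response's. It has no
    temperature, margin, or reference-model term."""
    return inverse_perplexity(rejected_logprobs) - inverse_perplexity(chosen_logprobs)
```

Both terms are length-normalized and bounded in (0, 1], so the loss stays in (-1, 1) without any tunable scale.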
♻ ☆ OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen
Machine writing with large language models often relies on
retrieval-augmented generation. However, these approaches remain confined
within the boundaries of the model's predefined scope, limiting the generation
of content with rich information. Specifically, vanilla-retrieved information
tends to lack depth and novelty and to suffer from redundancy, which negatively
impacts the quality of generated articles, leading to shallow, unoriginal, and
repetitive outputs. To address these issues, we propose OmniThink, a
slow-thinking machine writing framework that emulates the human-like process of
iterative expansion and reflection. The core idea behind OmniThink is to
simulate the cognitive behavior of learners as they slowly deepen their
knowledge of the topics. Experimental results demonstrate that OmniThink
improves the knowledge density of generated articles without compromising
metrics such as coherence and depth. Human evaluations and expert feedback
further highlight the potential of OmniThink to address real-world challenges
in the generation of long-form articles.
comment: Code is available at https://github.com/zjunlp/OmniThink
♻ ☆ CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs
Chinese, as a linguistic system rich in depth and complexity, is
characterized by distinctive elements such as ancient poetry, proverbs, idioms,
and other cultural constructs. However, current Large Language Models (LLMs)
face limitations in these specialized domains, highlighting the need for the
development of comprehensive datasets that can assess, continuously update, and
progressively improve these culturally-grounded linguistic competencies through
targeted training optimizations. To address this gap, we introduce CKnowEdit,
the first-ever Chinese knowledge editing dataset designed to correct
linguistic, factual, and logical errors in LLMs. We collect seven types of
knowledge from a wide range of sources, including classical texts, idioms, and
content from Baidu Tieba Ruozhiba, taking into account the unique polyphony,
antithesis, and logical structures inherent in the Chinese language. By
analyzing this dataset, we highlight the challenges current LLMs face in
mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge
editing techniques reveals opportunities to advance the correction of Chinese
knowledge. Code and dataset are available at
https://github.com/zjunlp/EasyEdit.
comment: Ongoing work; project website is available at
https://zjunlp.github.io/project/CKnowEdit; code and dataset are available at
https://github.com/zjunlp/EasyEdit
♻ ☆ Non-Contextual BERT or FastText? A Comparative Analysis
Natural Language Processing (NLP) for low-resource languages, which lack
large annotated datasets, faces significant challenges due to limited
high-quality data and linguistic resources. The selection of embeddings plays a
critical role in achieving strong performance in NLP tasks. While contextual
BERT embeddings require a full forward pass, non-contextual BERT embeddings
rely only on table lookup. Existing research has primarily focused on
contextual BERT embeddings, leaving non-contextual embeddings largely
unexplored. In this study, we analyze the effectiveness of non-contextual
embeddings from BERT models (MuRIL and MahaBERT) and FastText models (IndicFT
and MahaFT) for tasks such as news classification, sentiment analysis, and hate
speech detection in one such low-resource language, Marathi. We compare these
embeddings with their contextual and compressed variants. Our findings indicate
that non-contextual BERT embeddings extracted from the model's first embedding
layer outperform FastText embeddings, presenting a promising alternative for
low-resource NLP.
♻ ☆ Extracting Sentence Embeddings from Pretrained Transformer Models
Pre-trained transformer models shine in many natural language processing
tasks and therefore are expected to bear the representation of the input
sentence or text meaning. These sentence-level embeddings are also important in
retrieval-augmented generation. But do commonly used plain averaging or prompt
templates sufficiently capture and represent the underlying meaning? After
providing a comprehensive review of existing sentence embedding extraction and
refinement methods, we thoroughly test different combinations and our original
extensions of the most promising ones on pretrained models. Namely, given 110 M
parameters, BERT's hidden representations from multiple layers, and many
tokens, we try diverse ways to extract optimal sentence embeddings. We test
various token aggregation and representation post-processing techniques. We
also test multiple ways of using a general Wikitext dataset to complement
BERT's sentence embeddings. All methods are tested on eight Semantic Textual
Similarity (STS), six short text clustering, and twelve classification tasks.
We also evaluate our representation-shaping techniques on other static models,
including random token representations. Proposed representation extraction
methods improve the performance on STS and clustering tasks for all models
considered. Very high improvements for static token-based models, especially
random embeddings for STS tasks, almost reach the performance of BERT-derived
representations. Our work shows that the representation-shaping techniques
significantly improve sentence embeddings extracted from BERT-based and simple
baseline models.
comment: Postprint update
♻ ☆ Revisiting In-context Learning Inference Circuit in Large Language Models ICLR 2025
In-context Learning (ICL) is an emerging few-shot learning paradigm on
Language Models (LMs) whose inner mechanisms remain unexplored. Existing
works describe the inner processing of ICL, but they struggle to
capture all the inference phenomena in large language models. Therefore, this
paper proposes a comprehensive circuit that models the inference dynamics and
tries to explain the observed phenomena of ICL. In detail, we divide ICL inference
into 3 major operations: (1) Input Text Encode: LMs encode every input text (in
the demonstrations and queries) into linear representation in the hidden states
with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge
the encoded representations of demonstrations with their corresponding label
tokens to produce joint representations of labels and demonstrations. (3)
Feature Retrieval and Copy: LMs search the joint representations of
demonstrations similar to the query representation on a task subspace, and copy
the searched representations into the query. Then, language model heads capture
these copied label representations to a certain extent and decode them into
predicted labels. Through careful measurements, the proposed inference circuit
successfully captures and unifies many fragmented phenomena observed during the
ICL process, making it a comprehensive and practical explanation of the ICL
inference process. Moreover, ablation analysis by disabling the proposed steps
seriously damages the ICL performance, suggesting the proposed inference
circuit is a dominating mechanism. Additionally, we confirm and list some
bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.
comment: 37 pages, 41 figures, 8 tables. ICLR 2025 Accepted. Camera-ready
Version
♻ ☆ M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Recent advancements in large language models (LLMs) have given rise to the
LLM-as-a-judge paradigm, showcasing their potential to deliver human-like
judgments. However, in the field of machine translation (MT) evaluation,
current LLM-as-a-judge methods fall short of learned automatic metrics. In this
paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic
LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our
findings demonstrate that M-MAD achieves significant advancements by (1)
decoupling heuristic MQM criteria into distinct evaluation dimensions for
fine-grained assessments; (2) employing multi-agent debates to harness the
collaborative reasoning capabilities of LLMs; (3) synthesizing
dimension-specific results into a final evaluation judgment to ensure robust
and reliable outcomes. Comprehensive experiments show that M-MAD not only
outperforms all existing LLM-as-a-judge methods but also competes with
state-of-the-art reference-based automatic metrics, even when powered by a
suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight
the superiority of our framework design, offering a fresh perspective for
the LLM-as-a-judge paradigm. Our code and data are publicly available at
https://github.com/SU-JIAYUAN/M-MAD.
comment: Code and data are available at https://github.com/SU-JIAYUAN/M-MAD
♻ ☆ More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li
As large language models (LLMs) process increasing context windows, the
memory usage of KV cache has become a critical bottleneck during inference. The
mainstream KV compression methods, including KV pruning and KV quantization,
primarily focus on either the token or the precision dimension separately. However,
these works leave the trade-off between these two orthogonal dimensions
largely under-explored. In this paper, we comprehensively investigate the
token-precision trade-off in KV cache compression. Experiments demonstrate that
storing more tokens in the KV cache with lower precision, a strategy we term
quantized pruning, can significantly enhance the long-context performance of
LLMs. In-depth analysis of the token-precision trade-off across key aspects
demonstrates that quantized pruning achieves substantial improvements in
retrieval-related tasks and consistently performs well across varying input
lengths. Furthermore, quantized pruning demonstrates notable stability and
effectiveness across different KV pruning methods, quantization strategies, and
model scales. These findings offer valuable insights into optimizing KV cache
compression through balanced token-precision trade-off strategies. Our code is
available at https://github.com/zhzihao/QPruningKV.
comment: 13 pages, 9 figures
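The token-precision trade-off described in the abstract above can be illustrated with simple budget arithmetic: under a fixed memory budget, lowering the bit width of cached keys and values leaves room for proportionally more tokens. The sketch below is a back-of-the-envelope calculation, not the paper's method. The model shape (32 layers, 8 KV heads, head dimension 128) and the 1 GiB budget are hypothetical, and quantization metadata such as scales and zero points is ignored.

```python
def kv_tokens_under_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bits):
    """Number of tokens whose keys and values fit in budget_bytes at `bits` per element."""
    bits_per_token = 2 * n_layers * n_kv_heads * head_dim * bits  # 2 = key + value
    return (budget_bytes * 8) // bits_per_token


# Hypothetical configuration, chosen only for illustration.
budget = 1 << 30  # 1 GiB
fp16_tokens = kv_tokens_under_budget(budget, 32, 8, 128, bits=16)
int4_tokens = kv_tokens_under_budget(budget, 32, 8, 128, bits=4)
```

With these assumed numbers the budget holds 8192 tokens at 16 bits and 32768 tokens at 4 bits, so a pruning method could retain four times as many tokens at the lower precision.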
♻ ☆ T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Text-to-image (T2I) models have rapidly advanced, enabling the generation of
high-quality images from text prompts across various domains. However, these
models present notable safety concerns, including the risk of generating
harmful, biased, or private content. Current research on assessing T2I safety
remains in its early stages. While some efforts have been made to evaluate
models on specific safety dimensions, many critical risks remain unexplored. To
address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I
models across three key domains: toxicity, fairness, and bias. We build a
detailed hierarchy of 12 tasks and 44 categories based on these three domains,
and meticulously collect 70K corresponding prompts. Based on this taxonomy and
prompt set, we build a large-scale T2I dataset with 68K manually annotated
images and train an evaluator capable of detecting critical risks that previous
work has failed to identify, including risks that even ultra-large proprietary
models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion
models on T2ISafety and reveal several concerns including persistent issues
with racial fairness, a tendency to generate toxic content, and significant
variation in privacy protection across the models, even with defense methods
like concept erasing. Data and evaluator are released under
https://github.com/adwardlee/t2i_safety.
♻ ☆ Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples
The evaluation of cross-lingual semantic search capabilities of models is
often limited to existing datasets from tasks such as information retrieval and
semantic textual similarity. To allow for domain-specific evaluation, we
introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual
semantic search task that does not require a large evaluation corpus, only
parallel sentences of the language pair of interest within the target domain.
This task focuses on the ability of a model to cross-lingually rank the true
parallel sentence higher than challenging distractors generated by a large
language model. We create a case study of our introduced CLSD task for the
language pair German-French in the news domain. Within this case study, we find
that models that are also fine-tuned for retrieval tasks benefit from pivoting
through English, while bitext mining models perform best directly
cross-lingually. A fine-grained similarity analysis enabled by our distractor
generation strategy indicates that different embedding models are sensitive to
different types of perturbations.
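As a sketch of the ranking check that the CLSD abstract above describes, the snippet below computes the rank of the true parallel sentence among LLM-generated distractors using cosine similarity of embeddings. The embedding vectors are assumed to come from some multilingual embedding model. The function names and the use of cosine similarity are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of_true_parallel(query_vec, true_vec, distractor_vecs):
    """1-based rank of the true parallel sentence among all candidates.

    Rank 1 means the model scored the true translation above every distractor.
    """
    true_score = cosine(query_vec, true_vec)
    better = sum(1 for d in distractor_vecs if cosine(query_vec, d) > true_score)
    return better + 1
```

Averaging the indicator `rank == 1` over many sentence pairs would give a simple accuracy-style score for a language pair.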
♻ ☆ How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
In the age of misinformation, hallucination -- the tendency of Large Language
Models (LLMs) to generate non-factual or unfaithful responses -- represents the
main risk for their global utility. Despite LLMs becoming increasingly
multilingual, the vast majority of research on detecting and quantifying LLM
hallucination is (a) English-centric and (b) focused on machine translation (MT)
and summarization, tasks that are less common "in the wild" than open
information seeking. In contrast, we aim to quantify the extent of LLM
hallucination across languages in knowledge-intensive long-form question
answering. To this end, we train a multilingual hallucination detection model
and conduct a large-scale study across 30 languages and 6 open-source LLM
families. We start from an English hallucination detection dataset and rely on
MT to generate (noisy) training data in other languages. We also manually
annotate gold data for five high-resource languages; we then demonstrate, for
these languages, that the estimates of hallucination rates are similar between
silver (LLM-generated) and gold test sets, validating the use of silver data
for estimating hallucination rates for other languages. For the final rates
estimation, we build a knowledge-intensive QA dataset for 30 languages with
LLM-generated prompts and Wikipedia articles as references. We find that, while
LLMs generate longer responses with more hallucinated tokens for
higher-resource languages, there is no correlation between length-normalized
hallucination rates of languages and their digital representation. Further, we
find that smaller LLMs exhibit larger hallucination rates than larger models.
♻ ☆ Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations
Automatic lexical simplification is a task to substitute lexical items that
may be unfamiliar and difficult to understand with easier and more common
words. This paper presents the description and analysis of two novel datasets
for lexical simplification in Spanish and Catalan. This dataset represents the
first of its kind in Catalan and a substantial addition to the sparse data on
automatic lexical simplification which is available for Spanish. Specifically,
it is the first dataset for Spanish which includes scalar ratings of the
understanding difficulty of lexical items. In addition, we present a detailed
analysis aiming at assessing the appropriateness and ethical dimensions of the
data for the lexical simplification task.
♻ ☆ LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization ICLR 2025
Large Language Models (LLMs) have demonstrated remarkable capabilities
through pretraining and alignment. However, superior short-context LLMs may
underperform in long-context scenarios due to insufficient long-context
alignment. This alignment process remains challenging due to the impracticality
of human annotation for extended contexts and the difficulty in balancing
short- and long-context performance. To address these challenges, we introduce
LongPO, which enables short-context LLMs to self-evolve to excel on long-context
tasks by internally transferring short-context capabilities. LongPO harnesses
LLMs to learn from self-generated short-to-long preference data, comprising
paired responses generated for identical instructions with long-context inputs
and their compressed short-context counterparts, respectively. This preference
reveals capabilities and potentials of LLMs cultivated during short-context
alignment that may be diminished in under-aligned long-context scenarios.
Additionally, LongPO incorporates a short-to-long KL constraint to mitigate
short-context performance decline during long-context alignment. When applied
to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully
retains short-context performance and largely outperforms naive SFT and DPO in
both long- and short-context tasks. Specifically, LongPO-trained models can
achieve results on long-context benchmarks comparable to, or even surpassing,
those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context
annotation and larger parameter scales. Our code is available at
https://github.com/DAMO-NLP-SG/LongPO.
comment: ICLR 2025
♻ ☆ Neural Attention Search
We present Neural Attention Search (NAtS), a framework that automatically
evaluates the importance of each token within a sequence and determines if the
corresponding token can be dropped after several steps. This approach can
efficiently reduce the KV cache sizes required by transformer-based models
during inference and thus reduce inference costs. In this paper, we design a
search space that contains three token types: (i) Global Tokens will be
preserved and queried by all the following tokens. (ii) Local Tokens survive
until the next global token appears. (iii) Sliding Window Tokens have an impact
on the inference of a fixed number of subsequent tokens. Similar to the
One-Shot Neural Architecture Search approach, this token-type information can
be learned jointly with the architecture weights via a learnable attention
mask. Experiments on both training a new transformer from scratch and
fine-tuning existing large language models show that NAtS can efficiently
reduce the KV cache size required for the models while maintaining the models'
performance.
comment: 18 pages, 8 figures
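The three token types in the abstract above imply a simple cache-eviction rule, which can be sketched as a small simulation. This is an illustrative reading, not the authors' implementation: the type labels "G", "L", and "S", the `window` parameter, and the choice to evict local tokens before the new global token is processed are all assumptions.

```python
def surviving_positions(token_types, window):
    """For each step t, list the cache positions kept once token t has arrived.

    "G" (global): kept for the rest of the sequence.
    "L" (local): kept until the next global token appears.
    "S" (sliding window): kept while it is at most `window` steps old.
    """
    cache = []    # positions currently held in the KV cache
    history = []  # snapshot of the cache after each step
    for t, kind in enumerate(token_types):
        if kind == "G":
            # A new global token evicts every local token seen so far.
            cache = [p for p in cache if token_types[p] != "L"]
        # Sliding-window tokens expire once they are more than `window` steps old.
        cache = [p for p in cache if token_types[p] != "S" or t - p <= window]
        cache.append(t)
        history.append(list(cache))
    return history


steps = surviving_positions(["G", "L", "S", "L", "G", "S"], window=2)
```

In this toy sequence the cache holds four entries just before the second global token arrives and three entries at the final step (positions 0, 4, and 5) out of six tokens seen.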
♻ ☆ Repetition Neurons: How Do Language Models Produce Repetitions? NAACL 2025
This paper introduces repetition neurons, regarded as skill neurons
responsible for the repetition problem in text generation tasks. These neurons
are progressively activated more strongly as repetition continues, indicating
that they perceive repetition as a task to copy the previous context
repeatedly, similar to in-context learning. We identify these repetition
neurons by comparing activation values before and after the onset of repetition
in texts generated by recent pre-trained language models. We analyze the
repetition neurons in three English and one Japanese pre-trained language
models and observe similar patterns across them.
comment: NAACL 2025
♻ ☆ RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Alignment tuning is crucial for ensuring large language models (LLMs) behave
ethically and helpfully. Current alignment approaches require high-quality
annotations and significant training resources. This paper proposes a low-cost,
tuning-free method using in-context learning (ICL) to enhance LLM alignment.
Through an analysis of high-quality ICL demos, we identified style as a key
factor influencing LLM alignment capabilities and explicitly restyled ICL
exemplars based on this stylistic framework. Additionally, we combined the
restyled demos to achieve a balance between the two conflicting aspects of LLM
alignment--factuality and safety. We packaged the restyled examples as prompts
to trigger few-shot learning, improving LLM alignment. Compared to the best
baseline approach, with an average score of 5.00 as the maximum, our method
achieves a maximum 0.10 increase on the Alpaca task (from 4.50 to 4.60), a 0.22
enhancement on the Just-eval benchmark (from 4.34 to 4.56), and a maximum
improvement of 0.32 (from 3.53 to 3.85) on the MT-Bench dataset. We release the
code and data at https://github.com/AnonymousCode-ComputerScience/RIDE.
comment: 38 pages, 2 figures, 20 tables; The paper is under review in ARR
♻ ☆ MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to
evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA
includes 4,460 questions spanning 17 specialties and 11 body systems. It
includes two subsets, Text for text evaluation and MM for multimodal
evaluation. Notably, MM introduces expert-level exam questions with diverse
images and rich clinical information, including patient records and examination
results, setting it apart from traditional medical multimodal benchmarks with
simple QA pairs generated from image captions. MedXpertQA applies rigorous
filtering and augmentation to address the insufficient difficulty of existing
benchmarks like MedQA, and incorporates specialty board questions to improve
clinical relevance and comprehensiveness. We perform data synthesis to mitigate
data leakage risk and conduct multiple rounds of expert reviews to ensure
accuracy and reliability. We evaluate 16 leading models on MedXpertQA.
Moreover, medicine is deeply connected to real-world decision-making, providing
a rich and representative setting for assessing reasoning abilities beyond
mathematics and code. To this end, we develop a reasoning-oriented subset to
facilitate the assessment of o1-like models.
♻ ☆ DP-MemArc: Differential Privacy Transfer Learning for Memory Efficient Language Models
Yanming Liu, Xinyue Peng, Yuwei Zhang, Xiaolan Ke, Songhang Deng, Jiannan Cao, Chen Ma, Mengchen Fu, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, Xuhong Zhang
Large language models have repeatedly shown outstanding performance across
diverse applications. However, deploying these models can inadvertently risk
user privacy. The significant memory demands during training pose a major
challenge in terms of resource consumption. This substantial size places a
heavy load on memory resources, raising considerable practical concerns. In
this paper, we introduce DP-MemArc, a novel training framework aimed at
reducing the memory costs of large language models while emphasizing the
protection of user data privacy. DP-MemArc incorporates side network or
reversible network designs to support a variety of differential privacy
memory-efficient fine-tuning schemes. Our approach not only reduces memory
usage by a factor of about 2.5 but also ensures robust privacy protection,
keeping user data secure and confidential. Extensive experiments have
demonstrated that DP-MemArc effectively provides differential privacy-efficient
fine-tuning across different task scenarios.
comment: Fix metadata error
♻ ☆ Grammar Induction from Visual, Speech and Text
Grammar Induction could benefit from rich heterogeneous signals, such as
text, vision, and acoustics. In the process, features from distinct modalities
essentially serve complementary roles to each other. With such intuition, this
work introduces a novel unsupervised visual-audio-text grammar induction
task (named VAT-GI), to induce the constituent grammar trees from
parallel images, text, and speech inputs. Inspired by the fact that language
grammar natively exists beyond the texts, we argue that text need not be
the predominant modality in grammar induction. Thus we further introduce a
textless setting of VAT-GI, wherein the task solely relies on visual and
auditory inputs. To approach the task, we propose a visual-audio-text
inside-outside recursive autoencoder (VaTiora) framework, which
leverages rich modal-specific and complementary features for effective grammar
parsing. Besides, a more challenging benchmark dataset is constructed to assess
the generalization ability of VAT-GI systems. Experiments on two benchmark
datasets demonstrate that our proposed VaTiora system is more effective in
incorporating the various multimodal signals, and also presents new
state-of-the-art performance on VAT-GI.
♻ ☆ Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
The Theory of Multiple Intelligences underscores the hierarchical nature of
cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
mainstream VLMs through nine validated psychometric experiments reveals
significant gaps versus humans (average score 24.95 vs. 68.38), with three key
findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
architectural constraints. Identified barriers include weak geometry encoding
and missing dynamic simulation. By linking psychometric BSAs to VLM
capabilities, we provide a diagnostic toolkit for spatial intelligence
evaluation, methodological foundations for embodied AI development, and a
cognitive science-informed roadmap for achieving human-like spatial
intelligence.
♻ ☆ QUILL: Quotation Generation Enhancement of Large Language Models
Jin Xiao, Bowei Zhang, Qianyu He, Jiaqing Liang, Feng Wei, Jinglei Chen, Zujie Liang, Deqing Yang, Yanghua Xiao
While large language models (LLMs) have become excellent writing assistants,
they still struggle with quotation generation. This is because they either
hallucinate when providing factual quotations or fail to provide quotes that
exceed human expectations. To bridge the gap, we systematically study how to
evaluate and improve LLMs' performance in quotation generation tasks. We first
establish a holistic and automatic evaluation system for the quotation generation
task, which consists of five criteria, each with a corresponding automatic metric.
To improve the LLMs' quotation generation abilities, we construct a bilingual
knowledge base that is broad in scope and rich in dimensions, containing up to
32,022 quotes. Moreover, guided by our criteria, we further design a
quotation-specific metric to rerank the retrieved quotations from the knowledge
base. Extensive experiments show that our metrics strongly correlate with human
preferences. Existing LLMs struggle to generate desired quotes, but our
quotation knowledge base and reranking metric help narrow this gap. Our dataset
and code are publicly available at https://github.com/GraceXiaoo/QUILL.
comment: 17 pages, 6 figures
♻ ☆ Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Recent research has shown that CLIP models struggle with visual reasoning
tasks that require grounding compositionality, understanding spatial
relationships, or capturing fine-grained details. One natural hypothesis is
that the CLIP vision encoder does not embed essential information for these
tasks. However, we find that this is not always the case: The encoder gathers
query-relevant visual information, while CLIP fails to extract it. In
particular, we show that another branch of Vision-Language Models (VLMs),
Generative Multimodal Large Language Models (MLLMs), achieve significantly
higher accuracy than CLIP in many of these tasks using the same vision encoder
and weights, indicating that these Generative MLLMs perceive more -- as they
extract and utilize visual information more effectively. We conduct a series of
controlled experiments and reveal that their success is attributed to multiple
key design choices, including patch tokens, position embeddings, and
prompt-based weighting. On the other hand, enhancing the training data alone or
applying a stronger text encoder does not suffice to solve the task, and
additional text tokens offer little benefit. Interestingly, we find that
fine-grained visual reasoning is not exclusive to generative models trained by
an autoregressive loss: When converted into CLIP-like encoders by contrastive
finetuning, these MLLMs still outperform CLIP under the same cosine
similarity-based evaluation protocol. Our study highlights the importance of
VLM architectural choices and suggests directions for improving the performance
of CLIP-like contrastive VLMs.
comment: 17 pages, 3 figures
♻ ☆ Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models
Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang
Recent advances in large language models (LLMs) have demonstrated remarkable
potential in the field of natural language processing. Unfortunately, LLMs face
significant security and ethical risks. Although techniques such as safety
alignment have been developed for defense, prior research reveals the possibility of
bypassing such defenses through well-designed jailbreak attacks. In this paper,
we propose QueryAttack, a novel framework to examine the generalizability of
safety alignment. By treating LLMs as knowledge databases, we translate
malicious queries in natural language into structured non-natural query
language to bypass the safety alignment mechanisms of LLMs. We conduct
extensive experiments on mainstream LLMs, and the results show that QueryAttack
not only achieves high attack success rates (ASRs) but also bypasses
various defense methods. Furthermore, we tailor a defense method against
QueryAttack, which can reduce ASR by up to 64% on GPT-4-1106. Our code is
available at https://github.com/horizonsinzqs/QueryAttack.
comment: 15 pages, 11 figures
♻ ☆ A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models
Finetuning language models (LMs) is crucial for adapting the models to
downstream data and tasks. However, full finetuning is usually costly. Existing
work, such as parameter-efficient finetuning (PEFT), often focuses on
"how to finetune" but neglects the issue of "where to finetune".
As a pioneering work on reducing the cost of backpropagation (at the layer
level) by answering where to finetune, we conduct a semantic analysis of the LM
inference process. We first propose using transition traces of the latent
representation to compute deviations (or loss). Then, using a derived formula
of scaling law, we estimate the gain of each layer in reducing deviation (or
loss). We further narrow down the scope for finetuning and study the
cost-benefit balance of LM finetuning. We perform extensive experiments across
well-known LMs and datasets. The results show that our approach is effective
and efficient, and outperforms the existing baselines. Our approach is
orthogonal to other techniques for improving finetuning efficiency, such as PEFT
methods, offering practical value for LM finetuning.
comment: 14 pages, 6 figures, under peer-review
♻ ☆ Graph-Guided Textual Explanation Generation Framework
Shuzhou Yuan, Jingyi Sun, Ran Zhang, Michael Färber, Steffen Eger, Pepa Atanasova, Isabelle Augenstein
Natural language explanations (NLEs) are commonly used to provide plausible
free-text explanations of a model's reasoning about its predictions. However,
recent work has questioned their faithfulness, as they may not accurately
reflect the model's internal reasoning process regarding its predicted answer.
In contrast, highlight explanations--input fragments critical for the model's
predicted answers--exhibit measurable faithfulness. Building on this
foundation, we propose G-Tex, a Graph-Guided Textual Explanation Generation
framework designed to enhance the faithfulness of NLEs. Specifically, highlight
explanations are first extracted as faithful cues reflecting the model's
reasoning logic toward answer prediction. They are subsequently encoded through
a graph neural network layer to guide the NLE generation, which aligns the
generated explanations with the model's underlying reasoning toward the
predicted answer. Experiments on T5 and BART using three reasoning datasets
show that G-Tex improves NLE faithfulness by up to 12.18% compared to baseline
methods. Additionally, G-Tex generates NLEs with greater semantic and lexical
similarity to human-written ones. Human evaluations show that G-Tex can
decrease redundant content and enhance the overall quality of NLEs. Our work
presents a novel method for explicitly guiding NLE generation to enhance
faithfulness, serving as a foundation for addressing broader criteria in NLE
and generated text.
♻ ☆ Towards Geo-Culturally Grounded LLM Generations
Generative large language models (LLMs) have been shown to have gaps in
their knowledge of diverse cultures across the globe. We investigate the effect of
retrieval augmented generation and search-grounding techniques on the ability
of LLMs to display familiarity with a diverse range of national cultures.
Specifically, we compare the performance of standard LLMs, LLMs augmented with
retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs
augmented with retrievals from a web search (i.e., search grounding) on a
series of cultural familiarity benchmarks. We find that search grounding
significantly improves the LLM performance on multiple-choice benchmarks that
test propositional knowledge (e.g., the norms, artifacts, and institutions of
national cultures), while KB grounding's effectiveness is limited by inadequate
knowledge base coverage and a suboptimal retriever. However, search grounding
also increases the risk of stereotypical judgments by language models, while
failing to improve evaluators' judgments of cultural familiarity in a human
evaluation with adequate statistical power. These results highlight the
distinction between propositional knowledge about a culture and open-ended
cultural fluency when it comes to evaluating the cultural familiarity of
generative LLMs.
♻ ☆ SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Large Language Models (LLMs) have demonstrated remarkable proficiency across
a variety of complex tasks. One significant application of LLMs is in tackling
software engineering challenges, particularly in resolving real-world tasks on
GitHub by fixing code based on the issues reported by the users. However, many
current approaches rely on proprietary LLMs, which limits reproducibility,
accessibility, and transparency. The critical components of LLMs for addressing
software engineering issues and how their capabilities can be effectively
enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a
novel open-source framework designed to effectively and efficiently resolve
GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval
module and a code editing module. The retrieval module employs BM25 along with
a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the
code editing module utilizes the other model to generate patches for the
identified files. To mitigate the lack of publicly available datasets, we
compile an extensive dataset that includes 110K GitHub issues along with their
corresponding patches and train the two models of SWE-Fixer separately. We
assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving
state-of-the-art performance among open-source models with scores of 24.7% and
32.8%, respectively. Additionally, our approach requires only two model calls
per instance, making it significantly more efficient than existing methods.
These results highlight the effectiveness of SWE-Fixer in real-world
code-fixing scenarios. We will make our model, dataset, and code publicly
available at https://github.com/InternLM/SWE-Fixer.
comment: Our code, data, and model will be released at
https://github.com/InternLM/SWE-Fixer
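The coarse stage of the retrieval module described above relies on BM25. The following is a minimal sketch of that idea only, assuming a toy in-memory repository: the function names, the file contents, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions and are not taken from SWE-Fixer. The fine-grained reranking by a lightweight model is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank documents (name -> text) against a query with Okapi BM25."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative idf variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy repository and issue text, for illustration only.
repo = {
    "utils/date.py": "def parse_date string return datetime parse format",
    "auth/login.py": "def login user password check session token",
    "README.md": "project readme install usage",
}
issue = "login fails when password contains unicode"
ranked = bm25_rank(issue, repo)
```

In a pipeline of this shape, the top entries of `ranked` would be handed to a second-stage model for finer selection before any patch is generated.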
♻ ☆ Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models NAACL
Fine-tuning Large Language Models (LLMs) on specific datasets is a common
practice to improve performance on target tasks. However, this performance gain
often leads to overfitting, where the model becomes too specialized in either
the task or the characteristics of the training data, resulting in a loss of
generalization. This paper introduces Selective Self-to-Supervised Fine-Tuning
(S3FT), a fine-tuning approach that achieves better performance than the
standard supervised fine-tuning (SFT) while improving generalization. S3FT
leverages the existence of multiple valid responses to a query. By utilizing
the model's correct responses, S3FT reduces model specialization during the
fine-tuning stage. S3FT first identifies the correct model responses from the
training set by deploying an appropriate judge. Then, it fine-tunes the model
using the correct model responses and the gold response (or its paraphrase) for
the remaining samples. The effectiveness of S3FT is demonstrated through
experiments on mathematical reasoning, Python programming and reading
comprehension tasks. The results show that standard SFT can lead to an average
performance drop of up to 4.4 on multiple benchmarks, such as MMLU and
TruthfulQA. In contrast, S3FT reduces this drop by half, i.e., 2.5, indicating
better generalization capabilities than SFT while performing significantly
better on the fine-tuning tasks.
comment: 10 pages, Accepted to NAACL Findings 2025. arXiv admin note: text
overlap with arXiv:2409.04787
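The selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `generate` and `judge` are hypothetical callables standing in for the model and the judge, the toy judge simply compares final tokens, and the paraphrasing of gold responses is omitted.

```python
def select_targets(samples, generate, judge):
    """Pick a training target per sample: the model's own response when the
    judge accepts it, otherwise the gold response."""
    targets = []
    for sample in samples:
        response = generate(sample["query"])
        if judge(sample["query"], response, sample["gold"]):
            chosen, source = response, "model"
        else:
            chosen, source = sample["gold"], "gold"
        targets.append(
            {"query": sample["query"], "target": chosen, "source": source}
        )
    return targets


# Hypothetical toy data; a real setup would call an LLM and a stronger judge.
samples = [
    {"query": "2+3?", "gold": "The answer is 5"},
    {"query": "7*6?", "gold": "The answer is 42"},
]
model_outputs = {"2+3?": "2 plus 3 equals 5", "7*6?": "7 times 6 is 48"}


def toy_generate(query):
    return model_outputs[query]


def toy_judge(query, response, gold):
    # Assumption: a response is correct if its final token matches the gold one.
    return response.split()[-1] == gold.split()[-1]


targets = select_targets(samples, toy_generate, toy_judge)
```

The resulting `targets` list would then be used as the supervised fine-tuning set in place of the gold-only data.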
♻ ☆ BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment NAACL 2025
Reinforcement Learning with Human Feedback (RLHF) is the key to the success
of large language models (LLMs) in recent years. In this work, we first
introduce the concepts of knowledge breadth and knowledge depth, which measure
the comprehensiveness and depth of an LLM or knowledge source respectively. We
reveal that the imbalance in the number of prompts and responses can lead to a
potential disparity in breadth and depth learning within alignment tuning
datasets by showing that even a simple uniform method for balancing the number
of instructions and responses can lead to significant improvements. Building on
this, we further propose Balanced Preference Optimization (BPO), designed to
dynamically augment the knowledge depth of each sample. BPO is motivated by the
observation that the usefulness of knowledge varies across samples,
necessitating tailored learning of knowledge depth. To achieve this, we
introduce gradient-based clustering, estimating the knowledge informativeness
and usefulness of each augmented sample based on the model's optimization
direction. Our experimental results across various benchmarks demonstrate that
BPO outperforms other baseline methods in alignment tuning while maintaining
training efficiency. Furthermore, we conduct a detailed analysis of each
component of BPO, providing guidelines for future research in preference data
optimization.
comment: The 2025 Annual Conference of the Nations of the Americas Chapter of
the Association for Computational Linguistics (NAACL 2025), Main Conference
♻ ☆ SpinQuant: LLM quantization with learned rotations ICLR 2025
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort
Post-training quantization (PTQ) techniques applied to weights, activations,
and the KV cache greatly reduce memory usage, latency, and power consumption of
Large Language Models (LLMs), but may lead to large quantization errors when
outliers are present. Rotating activation or weight matrices helps remove
outliers and benefits quantization. In this work, we identify a collection of
applicable rotation parameterizations that lead to identical outputs in
full-precision Transformer architectures while enhancing quantization accuracy.
In addition, we find that some random rotations lead to much better
quantization than others, with a difference of up to 13 points in downstream
zero-shot reasoning performance. As a result, we propose SpinQuant, a novel
approach that incorporates learned rotation matrices for optimal quantized
network accuracy. With 4-bit quantization of weights, activations, and the KV cache,
SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full
precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by
19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also
outperforms concurrent work QuaRot, which applies random rotations to remove
outliers. In particular, for LLaMA-3 8B models that are hard to quantize,
SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot.
Code is available at https://github.com/facebookresearch/SpinQuant.
comment: ICLR 2025
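Two claims in this abstract lend themselves to a small numerical check: inserting an orthogonal rotation leaves the full-precision output unchanged, and rotation can spread out an activation outlier. The sketch below uses a fixed 4x4 normalized Hadamard matrix as an assumed stand-in for a rotation. It does not learn the rotation as SpinQuant does, and the weights and activations are made-up toy values.

```python
# Fixed 4x4 normalized Hadamard matrix: orthogonal, so H^T H = I.
# Illustrative stand-in only; SpinQuant learns its rotations.
H = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Toy weight matrix and an activation vector with one large outlier.
W = [[0.2, -0.5, 0.1, 0.7], [1.0, 0.3, -0.4, 0.0]]
x = [8.0, 0.1, -0.1, 0.05]

x_rot = matvec(H, x)             # rotated activations: R x
W_rot = matmul(W, transpose(H))  # rotated weights: W R^T

y = matvec(W, x)                 # original output
y_rot = matvec(W_rot, x_rot)     # (W R^T)(R x) = W x in full precision

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)
```

Here `y` and `y_rot` agree up to floating-point error, while `peak_after` is smaller than `peak_before`. Whether a given rotation actually lowers quantization error depends on the data and the rotation, which is consistent with the abstract's observation that some random rotations work much better than others.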
♻ ☆ MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, William Yang Wang
Scientific figure interpretation is a crucial capability for AI-driven
scientific assistants built on advanced Large Vision Language Models. However,
current datasets and benchmarks primarily focus on simple charts or other
relatively straightforward figures from limited science domains. To address
this gap, we present a comprehensive dataset compiled from peer-reviewed Nature
Communications articles covering 72 scientific fields, encompassing complex
visualizations such as schematic diagrams, microscopic images, and experimental
data which require graduate-level expertise to interpret. We evaluated 19
proprietary and open-source models on two benchmark tasks, figure captioning
and multiple-choice, and conducted human expert annotation. Our analysis
revealed significant task challenges and performance gaps among models. Beyond
serving as a benchmark, this dataset is a valuable resource for
large-scale training. Fine-tuning Qwen2-VL-7B with our task-specific data
achieved better performance than GPT-4o and even human experts in
multiple-choice evaluations. Furthermore, continuous pre-training on our
interleaved article and figure data substantially enhanced the model's
downstream task performance in materials science. We have released our dataset
to support further research.
comment: Code and data are available at https://github.com/Leezekun/MMSci
♻ ☆ Measuring the Quality of Answers in Political Q&As with Large Language Models
This article proposes a new approach for assessing the quality of answers in
political question-and-answer sessions. We measure the quality of an answer
based on how easily and accurately it can be recognized in a random set of
candidate answers given the question's text. This measure reflects the answer's
relevance and depth of engagement with the question. Like semantic search, we
can implement this approach by training a language model on the corpus of
observed questions and answers without additional human-labeled data. We
showcase and validate our methodology within the context of the Question Period
in the Canadian House of Commons. Our analysis reveals that while some answers
have a weak semantic connection to questions, hinting at some evasion or
obfuscation, they are generally at least moderately relevant, far exceeding
what we would expect from random replies. We also find a meaningful correlation
between answer quality and the party affiliation of the members of Parliament
asking the questions.
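The quality measure described in this abstract, namely how easily the true answer can be picked out from a set of candidate answers given the question, can be illustrated with a reciprocal-rank sketch. The bag-of-words cosine similarity below is an assumed stand-in for the trained language model, and the question, answers, and distractors are invented toy examples.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words term counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reciprocal_rank(question, answer, distractors):
    """1 / rank of the true answer, where rank counts the distractors that
    score strictly higher against the question."""
    q = bow(question)
    true_score = cosine(q, bow(answer))
    higher = sum(1 for d in distractors if cosine(q, bow(d)) > true_score)
    return 1.0 / (1 + higher)


# Hypothetical toy data, for illustration only.
question = "What will the minister do about housing costs?"
distractors = [
    "We are investing in rural broadband.",
    "The budget supports our armed forces.",
]
relevant = "The minister will cut housing costs by funding new homes."
evasive = "Our government is proud of its record."

rr_relevant = reciprocal_rank(question, relevant, distractors)
rr_evasive = reciprocal_rank(question, evasive, distractors)
```

An answer that engages with the question is ranked first among the candidates, whereas an evasive reply can be outranked even by an unrelated distractor.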