Computation and Language
☆ Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Currently OpenAI o1 has sparked a surge of interest in the study of large
reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on
disciplines with standard answers, such as mathematics, physics, and coding --
which are well-suited for reinforcement learning (RL) -- but also places
greater emphasis on open-ended resolutions. We aim to address the question:
"Can the o1 model effectively generalize to broader domains where clear
standards are absent and rewards are challenging to quantify?" Marco-o1 is
powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS),
reflection mechanisms, and innovative reasoning strategies -- optimized for
complex real-world problem-solving tasks.
☆ Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings COLING 2025
With the recent proliferation of large language models (LLMs), enterprises
have been able to rapidly develop proof-of-concepts and prototypes. As a
result, there is a growing need to implement robust guardrails that monitor,
quantify and control an LLM's behavior, ensuring that its use is reliable,
safe, accurate and also aligned with the users' expectations. Previous
approaches for filtering out inappropriate user prompts or system outputs, such
as LlamaGuard and OpenAI's MOD API, have achieved significant success by
fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails
introduces increased latency and higher maintenance costs, which may not be
practical or scalable for cost-efficient deployments. We take a different
approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT.
This method reduces the model size from LlamaGuard's 7 billion parameters to
approximately 67 million, while maintaining comparable performance on the AEGIS
safety benchmark.
comment: To appear in Proceedings of COLING 2025
☆ POS-tagging to highlight the skeletal structure of sentences
This study presents the development of a part-of-speech (POS) tagging model
to extract the skeletal structure of sentences using transfer learning with the
BERT architecture for token classification. The model is fine-tuned on Russian
text, demonstrating its effectiveness. The approach offers potential
applications in enhancing natural language processing tasks, such as improving
machine translation.
Keywords: part of speech tagging, morphological analysis, natural language
processing, BERT.
comment: in Russian language. Conference: Automated control systems and
information technologies https://asuit.pstu.ru/ Section: IT and automated
systems
☆ UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
Large language models (LLMs) under-perform on low-resource languages due to
limited training data. We present a method to efficiently collect text data for
low-resource languages from the entire Common Crawl corpus. Our approach,
UnifiedCrawl, filters and extracts Common Crawl using minimal compute
resources, yielding mono-lingual datasets much larger than previously available
sources. We demonstrate that leveraging this data to fine-tune multilingual
LLMs via efficient adapter methods (QLoRA) significantly boosts performance on
the low-resource language, while minimizing VRAM usage. Our experiments show
large improvements in language modeling perplexity and an increase in few-shot
prompting scores. Our work and released source code provide an affordable
approach to improve LLMs for low-resource languages using consumer hardware.
Our source code is available at
https://github.com/bethelmelesse/unifiedcrawl.
☆ Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
It is well-known that a diverse corpus is critical for training large
language models, which are typically constructed from a mixture of various
domains. In general, previous efforts resort to sampling training data from
different domains with static proportions, as well as adjusting data
proportions during training. However, few methods have addressed the
complexities of domain-adaptive continual pre-training. To fill this gap, we
propose Velocitune, a novel framework that dynamically assesses learning velocity
and adjusts data proportions accordingly, favoring slower-learning domains
while shunning faster-learning ones, which is guided by a scaling law to
indicate the desired learning goal for each domain with less associated cost.
To evaluate the effectiveness of Velocitune, we conduct experiments in a
reasoning-focused dataset with CodeLlama, as well as in a corpus specialised
for system command generation with Llama3 and Mistral. Velocitune achieves
performance gains in both math and code reasoning tasks and command-line
generation benchmarks. Further analysis reveals that key factors driving
Velocitune's effectiveness include target loss prediction and data ordering.
comment: Work in progress
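The reweighting idea in the Velocitune abstract above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the velocity definition (fraction of the gap between initial and target loss already closed), the exponentiated update, and the names `learning_velocity`, `reweight`, and `step_size` are hypothetical and are not taken from the paper.

```python
import math


def learning_velocity(initial_loss, current_loss, target_loss):
    """Assumed definition: share of the initial-to-target loss gap already closed."""
    gap = initial_loss - target_loss
    if gap <= 0:
        return 1.0  # the domain already starts at or below its target
    return (initial_loss - current_loss) / gap


def reweight(weights, velocities, step_size=1.0):
    """Shift sampling weight toward domains that learn slower than average."""
    mean_v = sum(velocities.values()) / len(velocities)
    scaled = {
        d: weights[d] * math.exp(step_size * (mean_v - velocities[d]))
        for d in weights
    }
    total = sum(scaled.values())
    return {d: s / total for d, s in scaled.items()}


# Hypothetical losses for two domains; the targets would come from a scaling-law fit.
weights = {"math": 0.5, "code": 0.5}
velocities = {
    "math": learning_velocity(2.0, 1.8, 1.5),  # slower: about 40% of the gap closed
    "code": learning_velocity(2.0, 1.6, 1.5),  # faster: about 80% of the gap closed
}
new_weights = reweight(weights, velocities)
print(new_weights)
```

In this toy setting the slower-learning "math" domain receives the larger share of the next sampling budget, which matches the behaviour the abstract describes.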
☆ Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
Large vision-language models (LVLMs) have achieved impressive results in
various vision-language tasks. However, despite showing promising performance,
LVLMs suffer from hallucinations caused by language bias, leading to diminished
focus on images and ineffective visual comprehension. We identify two primary
reasons for this bias: 1. Different scales of training data between the
pretraining stage of LLM and multimodal alignment stage. 2. The learned
inference bias due to short-term dependency of text data. Therefore, we propose
LACING, a systemic framework designed to address the language bias of LVLMs
with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
Specifically, MDA introduces a parallel dual-attention mechanism that enhances
the integration of visual inputs across the model. IFG introduces a learnable
soft visual prompt during training and inference to replace visual inputs,
designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes
a novel decoding strategy using the soft visual prompt to mitigate the model's
over-reliance on adjacent text inputs. Comprehensive experiments demonstrate
that our method effectively debiases LVLMs from their language bias, enhancing
visual comprehension and reducing hallucinations without requiring additional
training resources or data. The code and model are available at
https://lacing-lvlm.github.io.
comment: 19 pages, 12 figures
☆ Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models
The use of Natural Language Processing (NLP) for helping decision-makers with
Climate Change action has recently been highlighted as a use case aligning with
a broader drive towards NLP technologies for social good. In this context,
Aspect-Based Summarization (ABS) systems that extract and summarize relevant
information are particularly useful as they provide stakeholders with a
convenient way of finding relevant information in expert-curated reports. In
this work, we release a new dataset for ABS of Climate Change reports and we
employ different Large Language Models (LLMs) and so-called Small Language
Models (SLMs) to tackle this problem in an unsupervised way. Considering the
problem at hand, we also show how SLMs are not significantly worse for the
problem while leading to reduced carbon footprint; we do so by applying for the
first time an existing framework considering both energy efficiency and task
performance to the evaluation of zero-shot generative models for ABS. Overall,
our results show that modern language models, both big and small, can
effectively tackle ABS for Climate Change reports but more research is needed
when we frame the problem as a Retrieval Augmented Generation (RAG) problem and
our work and dataset will help foster efforts in this direction.
☆ Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective
Large Language Models (LLMs) have revolutionized Natural Language Processing
(NLP) based applications including automated text generation, question
answering, chatbots, and others. However, they face a significant challenge:
hallucinations, where models produce plausible-sounding but factually incorrect
responses. This undermines trust and limits the applicability of LLMs in
different domains. Knowledge Graphs (KGs), on the other hand, provide a
structured collection of interconnected facts represented as entities (nodes)
and their relationships (edges). In recent research, KGs have been leveraged to
provide context that can fill gaps in an LLM's understanding of certain topics,
offering a promising approach to mitigate hallucinations in LLMs, enhancing
their reliability and accuracy while benefiting from their wide applicability.
Nonetheless, it is still a very active area of research with various unresolved
open problems. In this paper, we discuss these open challenges covering
state-of-the-art datasets and benchmarks as well as methods for knowledge
integration and evaluating hallucinations. In our discussion, we consider the
current use of KGs in LLM systems and identify future directions within each of
these challenges.
comment: 7 pages, 2 Figures, 1 Table
☆ Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Hallucinations in large language models are a widespread problem, yet the
mechanisms behind whether models will hallucinate are poorly understood,
limiting our ability to solve this problem. Using sparse autoencoders as an
interpretability tool, we discover that a key part of these mechanisms is
entity recognition, where the model detects if an entity is one it can recall
facts about. Sparse autoencoders uncover meaningful directions in the
representation space that detect whether the model recognizes an entity, e.g.
detecting it doesn't know about an athlete or a movie. This suggests that
models can have self-knowledge: internal representations about their own
capabilities. These directions are causally relevant: capable of steering the
model to refuse to answer questions about known entities, or to hallucinate
attributes of unknown entities when it would otherwise refuse. We demonstrate
that despite the sparse autoencoders being trained on the base model, these
directions have a causal effect on the chat model's refusal behavior,
suggesting that chat finetuning has repurposed this existing mechanism.
Furthermore, we provide an initial exploration into the mechanistic role of
these directions in the model, finding that they disrupt the attention of
downstream heads that typically move entity attributes to the final token.
☆ Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification
Generating large-scale, domain-specific, multilingual multi-turn dialogue
datasets remains a significant hurdle for training effective Multi-Turn Intent
Classification models in chatbot systems. In this paper, we introduce
Chain-of-Intent, a novel mechanism that combines Hidden Markov Models with
Large Language Models (LLMs) to generate contextually aware, intent-driven
conversations through self-play. By extracting domain-specific knowledge from
e-commerce chat logs, we estimate conversation turns and intent transitions,
which guide the generation of coherent dialogues. Leveraging LLMs to enhance
emission probabilities, our approach produces natural and contextually
consistent questions and answers. We also propose MINT-CL, a framework for
multi-turn intent classification using multi-task contrastive learning,
improving classification accuracy without the need for extensive annotated
data. Evaluations show that our methods outperform baselines in dialogue
quality and intent classification accuracy, especially in multilingual
settings, while significantly reducing data generation efforts. Furthermore, we
release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue
corpus to support future research in this area.
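A minimal sketch of the sampling step described in the Chain-of-Intent abstract above: intents are treated as hidden states, and a turn count and intent sequence are drawn from distributions that would be estimated from chat logs. The probabilities, intent names, and the `emit` callback (a stand-in for an LLM call producing the utterance) are all hypothetical; the paper's actual procedure may differ.

```python
import random


def sample_intent_chain(initial, transitions, turn_counts, rng):
    """Draw a number of turns, then walk the intent transition table."""
    n_turns = rng.choices(list(turn_counts), weights=list(turn_counts.values()))[0]
    intents = [rng.choices(list(initial), weights=list(initial.values()))[0]]
    while len(intents) < n_turns:
        nxt = transitions[intents[-1]]
        intents.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return intents


def generate_dialogue(intents, emit):
    """Produce one utterance per intent; emit() stands in for an LLM call."""
    history = []
    for intent in intents:
        history.append((intent, emit(intent, history)))
    return history


# Hypothetical statistics; in practice these would be estimated from chat logs.
initial = {"track_order": 0.6, "refund": 0.4}
transitions = {
    "track_order": {"track_order": 0.3, "refund": 0.7},
    "refund": {"track_order": 0.2, "refund": 0.8},
}
turn_counts = {1: 0.2, 2: 0.5, 3: 0.3}

rng = random.Random(0)
intents = sample_intent_chain(initial, transitions, turn_counts, rng)
dialogue = generate_dialogue(
    intents,
    lambda intent, history: f"<question about {intent}, turn {len(history) + 1}>",
)
print(dialogue)
```

Passing the running `history` to `emit` is what would let a real LLM call keep each new question consistent with the earlier turns.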
☆ Natural Language Reinforcement Learning
Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang
Reinforcement Learning (RL) mathematically formulates decision-making with
the Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable
breakthroughs across various domains, including games, robotics, and language
models. This paper seeks a new possibility, Natural Language Reinforcement
Learning (NLRL), by extending the traditional MDP to a natural language-based
representation space. Specifically, NLRL innovatively redefines RL principles,
including task objectives, policy, value function, Bellman equation, and policy
iteration, into their language counterparts. With recent advancements in large
language models (LLMs), NLRL can be practically implemented to achieve RL-like
policy and value improvement by either pure prompting or gradient-based
training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games
demonstrate the effectiveness, efficiency, and interpretability of the NLRL
framework among diverse use cases. Our code will be released at
https://github.com/waterhorse1/Natural-language-RL.
comment: Extension of arXiv:2402.07157
☆ Evaluating the Robustness of Analogical Reasoning in Large Language Models
LLMs have performed well on several reasoning benchmarks, including ones that
test analogical reasoning abilities. However, there is debate on the extent to
which they are performing general abstract reasoning versus employing
non-robust processes, e.g., that overly rely on similarity to pre-training
data. Here we investigate the robustness of analogy-making abilities previously
claimed for LLMs on three of four domains studied by Webb, Holyoak, and Lu
(2023): letter-string analogies, digit matrices, and story analogies. For each
domain we test humans and GPT models on robustness to variants of the original
analogy problems that test the same abstract reasoning abilities but are likely
dissimilar from tasks in the pre-training data. The performance of a system
that uses robust abstract reasoning should not decline substantially on these
variants.
On simple letter-string analogies, we find that while the performance of
humans remains high for two types of variants we tested, the GPT models'
performance declines sharply. This pattern is less pronounced as the complexity
of these problems is increased, as both humans and GPT models perform poorly on
both the original and variant problems requiring more complex analogies. On
digit-matrix problems, we find a similar pattern but only on one out of the two
types of variants we tested. On story-based analogy problems, we find that,
unlike humans, the performance of GPT models is susceptible to answer-order
effects, and that GPT models also may be more sensitive than humans to
paraphrasing.
This work provides evidence that LLMs often lack the robustness of zero-shot
human analogy-making, exhibiting brittleness on most of the variations we
tested. More generally, this work points to the importance of carefully
evaluating AI systems not only for accuracy but also robustness when testing
their cognitive capabilities.
comment: 31 pages, 13 figures. arXiv admin note: text overlap with
arXiv:2402.08955
☆ OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
Scientific progress depends on researchers' ability to synthesize the growing
body of literature. Can large language models (LMs) assist scientists in this
task? We introduce OpenScholar, a specialized retrieval-augmented LM that
answers scientific queries by identifying relevant passages from 45 million
open-access papers and synthesizing citation-backed responses. To evaluate
OpenScholar, we develop ScholarQABench, the first large-scale multi-domain
benchmark for literature search, comprising 2,967 expert-written queries and
208 long-form answers across computer science, physics, neuroscience, and
biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and
PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o
hallucinates citations 78 to 90% of the time, OpenScholar achieves citation
accuracy on par with human experts. OpenScholar's datastore, retriever, and
self-feedback inference loop also improve off-the-shelf LMs: for instance,
OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations,
experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over
expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's
32%. We open-source all of our code, models, datastore, data and a public demo.
☆ Why do language models perform worse for morphologically complex languages?
Language models perform differently across languages. It has been previously
suggested that morphological typology may explain some of this variability
(Cotterell et al., 2018). We replicate previous analyses and find additional
new evidence for a performance gap between agglutinative and fusional
languages, where fusional languages, such as English, tend to have better
language modeling performance than morphologically more complex languages like
Turkish. We then propose and test three possible causes for this performance
gap: morphological alignment of tokenizers, tokenization quality, and
disparities in dataset sizes and measurement. To test the morphological
alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and
supporting datasets for 22 languages. We find some evidence that tokenization
quality explains the performance gap, but none for the role of morphological
alignment. Instead we find that the performance gap is most reduced when
training datasets are of equivalent size across language types, but only when
scaled according to the so-called "byte-premium" -- the different encoding
efficiencies of different languages and orthographies. These results suggest
that no language is harder or easier for a language model to learn on the basis
of its morphological typology. Differences in performance can be attributed to
disparities in dataset size. These results bear on ongoing efforts to improve
performance for low-performing and under-resourced languages.
comment: 9 pages
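The byte-premium scaling mentioned in the abstract above can be made concrete with a small arithmetic sketch. It assumes a byte premium is the ratio of bytes a language needs to encode the same content relative to a reference language; the premium values and the name `matched_byte_budget` are hypothetical and only illustrate the idea.

```python
def matched_byte_budget(reference_bytes, byte_premium):
    """Bytes needed so a language carries roughly the same content as the reference."""
    return int(round(reference_bytes * byte_premium))


# Hypothetical premiums relative to a reference language with premium 1.0.
premiums = {"lang_a": 1.0, "lang_b": 1.5}
budgets = {lang: matched_byte_budget(1_000_000, p) for lang, p in premiums.items()}
print(budgets)
```

Under this assumption, giving every language the same raw byte count would under-supply languages with a higher premium, which is why "equivalent size" has to be measured after scaling.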
☆ Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset
The ability to perform complex reasoning across multimodal inputs is
essential for models to effectively interact with humans in real-world
scenarios. Advancements in vision-language models have significantly improved
performance on tasks that require processing explicit and direct textual
inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG).
However, less attention has been given to improving the model capabilities to
comprehend nuanced and ambiguous forms of communication. This presents a
critical challenge, as human language in real-world interactions often conveys
hidden intentions that rely on context for accurate interpretation. To address
this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect
human utterances paired with corresponding scenes. Additionally, we contribute
a model-based pipeline for generating prompt-solution pairs from input images.
Our work aims to delve deeper into the ability of models to understand indirect
communication and seeks to contribute to the development of models capable of
more refined and human-like interactions. Extensive evaluation on multiple VLMs
reveals that mainstream models still struggle with indirect communication when
required to perform complex linguistic and visual reasoning. We release our
code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.
☆ Learning from "Silly" Questions Improves Large Language Models, But Only Slightly
Constructing high-quality Supervised Fine-Tuning (SFT) datasets is critical
for the training of large language models (LLMs). Recent studies have shown
that using data from a specific source, Ruozhiba, a Chinese website where users
ask "silly" questions to better understand certain topics, can lead to better
fine-tuning performance. This paper aims to explore some hidden factors: the
potential interpretations of its success and a large-scale evaluation of the
performance. First, we leverage GPT-4 to analyze the successful cases of
Ruozhiba questions from the perspective of education, psychology, and cognitive
science, deriving a set of explanatory rules. Then, we construct fine-tuning
datasets by applying these rules to the MMLU training set. Surprisingly, our
results indicate that rules can significantly improve model performance in
certain tasks, while potentially diminishing performance on others. For
example, SFT data generated following the "Counterintuitive Thinking" rule can
achieve approximately a 5% improvement on the "Global Facts" task, whereas the
"Blurring the Conceptual Boundaries" rule leads to a performance drop of 6.14%
on the "Econometrics" task. In addition, for specific tasks, different rules
tend to have a consistent impact on model performance. This suggests that the
differences between the extracted rules are not as significant, and the
effectiveness of the rules is relatively consistent across tasks. Our research
highlights the importance of considering task diversity and rule applicability
when constructing SFT datasets to achieve more comprehensive performance
improvements.
comment: 27 pages, 14 figures
☆ Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
In the recent past, a popular way of evaluating natural language
understanding (NLU) was to consider a model's ability to perform natural
language inference (NLI) tasks. In this paper, we investigate if NLI tasks,
which are rarely used for LLM evaluation, can still be informative for
evaluating LLMs. Focusing on five different NLI benchmarks across six models of
different scales, we investigate if they are able to discriminate models of
different size and quality and how their accuracies develop during training.
Furthermore, we investigate the extent to which the softmax distributions of
models align with human distributions in cases where statements are ambiguous
or vague. Overall, our results paint a positive picture for the NLI tasks: we
find that they are able to discriminate well between models at various stages
of training, yet are not (all) saturated. Furthermore, we find that while the
similarity of model distributions with human label distributions increases with
scale, it is still much higher than the similarity between two populations of
humans, making it a potentially interesting statistic to consider.
comment: preprint, 13 pages
☆ BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection ICASSP 2025
Spoken term detection (STD) is often hindered by reliance on frame-level
features and the computationally intensive DTW-based template matching,
limiting its practicality. To address these challenges, we propose a novel
approach that encodes speech into discrete, speaker-agnostic semantic tokens.
This facilitates fast retrieval using text-based search algorithms and
effectively handles out-of-vocabulary terms. Our approach focuses on generating
consistent token sequences across varying utterances of the same term. We also
propose bidirectional state space modeling within the Mamba encoder, trained
in a self-supervised learning framework, to learn contextual frame-level
features that are further encoded into discrete tokens. Our analysis shows that
our speech tokens exhibit greater speaker invariance than those from existing
tokenizers, making them more suitable for STD tasks. Empirical evaluation on
LibriSpeech and TIMIT databases indicates that our method outperforms existing
STD baselines while being more efficient.
comment: Submitted to ICASSP 2025
☆ Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science
This paper explores the potential of contextualized word embeddings (CWEs) as
a new tool in the history, philosophy, and sociology of science (HPSS) for
studying contextual and evolving meanings of scientific concepts. Using the
term "Planck" as a test case, I evaluate five BERT-based models with varying
degrees of domain-specific pretraining, including my custom model
Astro-HEP-BERT, trained on the Astro-HEP Corpus, a dataset containing 21.84
million paragraphs from 600,000 articles in astrophysics and high-energy
physics. For this analysis, I compiled two labeled datasets: (1) the
Astro-HEP-Planck Corpus, consisting of 2,900 labeled occurrences of "Planck"
sampled from 1,500 paragraphs in the Astro-HEP Corpus, and (2) a
physics-related Wikipedia dataset comprising 1,186 labeled occurrences of
"Planck" across 885 paragraphs. Results demonstrate that the domain-adapted
models outperform the general-purpose ones in disambiguating the target term,
predicting its known meanings, and generating high-quality sense clusters, as
measured by a novel purity indicator I developed. Additionally, this approach
reveals semantic shifts in the target term over three decades in the unlabeled
Astro-HEP Corpus, highlighting the emergence of the Planck space mission as a
dominant sense. The study underscores the importance of domain-specific
pretraining for analyzing scientific language and demonstrates the
cost-effectiveness of adapting pretrained models for HPSS research. By offering
a scalable and transferable method for modeling the meanings of scientific
concepts, CWEs open up new avenues for investigating the socio-historical
dynamics of scientific discourses.
comment: 18 pages, 7 figures (1 in the Supplement)
☆ The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims
To address three problems in patent abstract generation (insufficient
generation quality caused by traditional models drawing only on patent
specifications, out-of-vocabulary (OOV) new terminology caused by rapid patent
updates, and information redundancy caused by insufficient consideration of the
high professionalism, accuracy, and uniqueness of patent texts), we propose a
patent text abstract generation model (MSEA) based on a master-slave encoder
architecture. First, the MSEA model designs a master-slave encoder that takes
the specification in the patent text together with the claims as input and
fully explores the characteristics and details between the two. Then, the model
enhances the consideration of new technical terms in the input sequence based
on the pointer network, and further enhances the correlation with the input
text by re-weighting the "remembered" and "forgotten" parts of the input
sequence from the encoder. Finally, an enhanced repetition suppression
mechanism for patent text is introduced to ensure that the generated abstracts
are accurate and non-redundant. On a
publicly available patent text dataset, compared to the state-of-the-art model,
Improved Multi-Head Attention Mechanism (IMHAM), the MSEA model achieves an
improvement of 0.006, 0.005, and 0.005 in Rouge-1, Rouge-2, and Rouge-L scores,
respectively. MSEA leverages the characteristics of patent texts to effectively
enhance the quality of patent text generation, demonstrating its advancement
and effectiveness in the experiments.
comment: 25 pages, 1 figure
☆ MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
While existing benchmarks for evaluating LMMs mainly focus on image
comprehension, few works evaluate them from the image generation perspective.
To address this issue, we propose a straightforward automated evaluation
pipeline. Specifically, this pipeline requires LMMs to generate an image-prompt
from a given input image. Subsequently, it employs text-to-image generative
models to create a new image based on these generated prompts. Finally, we
evaluate the performance of LMMs by comparing the original image with the
generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive
benchmark developed to evaluate LMMs across 13 distinct image patterns, and
MMGenBench-Domain, targeting the performance evaluation of LMMs within the
generative image domain. A thorough evaluation involving over 50 popular LMMs
demonstrates the effectiveness and reliability of both the pipeline and
benchmark. Our observations indicate that numerous LMMs excelling in existing
benchmarks fail to adequately complete the basic tasks related to image
understanding and description. This finding highlights the substantial
potential for performance improvement in current LMMs and suggests avenues for
future model optimization. Concurrently, our pipeline facilitates the efficient
assessment of LMM performance across diverse domains by using solely image
inputs.
comment: This project is available at: https://github.com/lerogo/MMGenBench
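The three-step pipeline in the MMGenBench abstract above (image to prompt, prompt to regenerated image, compare the two images) can be sketched as follows. The abstract does not say how the images are compared, so cosine similarity over embeddings is an assumption, and `describe`, `generate`, and `embed` are hypothetical callables standing in for the LMM, the text-to-image model, and an image encoder. The toy stand-ins below use lists of numbers as "images" so that the sketch runs without any model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate_lmm(image, describe, generate, embed):
    """Score an LMM by how well its prompt lets a generator rebuild the image."""
    prompt = describe(image)        # step 1: LMM writes an image-prompt
    regenerated = generate(prompt)  # step 2: text-to-image model renders it
    return cosine_similarity(embed(image), embed(regenerated))  # step 3: compare


# Toy stand-ins: an "image" is a list of three numbers.
def faithful_describe(image):
    return " ".join(f"{v:.1f}" for v in image)


def lossy_describe(image):
    return f"{image[0]:.1f}"  # drops most of the content


def toy_generate(prompt):
    values = [float(tok) for tok in prompt.split()]
    return values + [0.0] * (3 - len(values))


def toy_embed(image):
    return image


image = [1.0, 2.0, 2.0]
faithful_score = evaluate_lmm(image, faithful_describe, toy_generate, toy_embed)
lossy_score = evaluate_lmm(image, lossy_describe, toy_generate, toy_embed)
print(faithful_score, lossy_score)
```

A description that preserves more of the image yields a higher score, which is the property such a pipeline relies on to rank models using image inputs alone.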
☆ DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
Large language models (LLMs) deliver impressive results but face challenges
from increasing model sizes and computational costs. Structured pruning reduces
model size and speeds up inference but often causes uneven degradation across
domains, leading to biased performance. To address this, we propose DRPruning,
which incorporates distributionally robust optimization to restore balanced
performance across domains, along with further improvements to enhance
robustness. Experiments in monolingual and multilingual settings show that our
method surpasses similarly sized models in pruning and continued pretraining
over perplexity, downstream tasks, and instruction tuning. We further provide
analysis demonstrating the robustness of our method towards various domains and
distribution shifts. Furthermore, our method automatically determines optimal
reference losses and data ratios, suggesting potential for broader
applications. Our code is available at https://github.com/hexuandeng/DRPruning.
comment: Work in Progress
☆ FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
This study investigates language models' generative capabilities in tool-use
dialogs. We categorize the models' outputs in tool-use dialogs into four
distinct types: Tool Call, Answer Completion, Slot Question, and Relevance
Detection, which serve as aspects for evaluation. We introduce
FunctionChat-Bench, comprising 700 evaluation items and automated assessment
programs. Using this benchmark, we evaluate several language models that
support function calling. Our findings indicate that while language models may
exhibit high accuracy in single-turn Tool Call scenarios, this does not
necessarily translate to superior generative performance in multi-turn
environments. We argue that the capabilities required for function calling
extend beyond generating tool call messages; they must also effectively
generate conversational messages that engage the user.
comment: 8 pages
☆ Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling EMNLP 2024
Predicting future international events from textual information, such as news
articles, has tremendous potential for applications in global policy, strategic
decision-making, and geopolitics. However, existing datasets available for this
task are often limited in quality, hindering the progress of related research.
In this paper, we introduce WORLDREP (WORLD Relationship and Event Prediction),
a novel dataset designed to address these limitations by leveraging the
advanced reasoning capabilities of large language models (LLMs). Our dataset
features high-quality scoring labels generated through advanced prompt modeling
and rigorously validated by domain experts in political science. We showcase
the quality and utility of WORLDREP for real-world event prediction tasks,
demonstrating its effectiveness through extensive experiments and analysis.
Furthermore, we publicly release our dataset along with the full automation
source code for data collection, labeling, and benchmarking, aiming to support
and advance research in text-based event prediction.
comment: EMNLP 2024 Findings
☆ Logic Augmented Generation
Semantic Knowledge Graphs (SKG) face challenges with scalability,
flexibility, contextual understanding, and handling unstructured or ambiguous
information. However, they offer formal and structured knowledge enabling
highly interpretable and reliable results by means of reasoning and querying.
Large Language Models (LLMs) overcome those limitations, making them suitable for
open-ended tasks and unstructured environments. Nevertheless, LLMs are neither
interpretable nor reliable. To solve the dichotomy between LLMs and SKGs we
envision Logic Augmented Generation (LAG) that combines the benefits of the two
worlds. LAG uses LLMs as Reactive Continuous Knowledge Graphs that can generate
potentially infinite relations and tacit knowledge on-demand. SKGs are key for
injecting a discrete heuristic dimension with clear logical and factual
boundaries. We exemplify LAG in two tasks of collective intelligence, i.e.,
medical diagnostics and climate projections. Understanding the properties and
limitations of LAG, which are still mostly unknown, is of utmost importance for
enabling a variety of tasks involving tacit knowledge in order to provide
interpretable and effective results.
comment: 10 pages, 2 figures
☆ Sentiment Analysis of Economic Text: A Lexicon-Based Approach
We propose an Economic Lexicon (EL) specifically designed for textual
applications in economics. We construct the dictionary with two important
characteristics: 1) to have a wide coverage of terms used in documents
discussing economic concepts, and 2) to provide a human-annotated sentiment
score in the range [-1,1]. We illustrate the use of the EL in the context of a
simple sentiment measure and consider several applications in economics. The
comparison to other lexicons shows that the EL is superior due to its wider
coverage of domain relevant terms and its more accurate categorization of the
word sentiment.
comment: 37 pages, 9 figures, 6 tables, in press
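To illustrate the kind of simple sentiment measure the abstract mentions, the following Python sketch averages lexicon scores over the matched words of a text. The lexicon entries and the function name `sentiment_score` are hypothetical and are not taken from the EL; averaging is one plausible aggregation rule, not necessarily the authors' measure.

```python
import re

# Hypothetical toy lexicon: term -> sentiment score in [-1, 1].
# These entries are illustrative only and are not taken from the EL.
TOY_LEXICON = {
    "growth": 0.6,
    "recovery": 0.5,
    "recession": -0.8,
    "unemployment": -0.6,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Average the scores of lexicon terms found in the text.

    Returns 0.0 when no lexicon term occurs, so the result always
    stays within [-1, 1].
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

Because the score is a mean of values in [-1, 1], texts of different lengths remain comparable; a coverage-sensitive variant could divide by the total token count instead.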
☆ Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning
How will LLM-based agents be used in the future? While much of the existing work
on agents has focused on improving the performance of a specific family of
objective and challenging tasks, in this work, we take a different perspective
by thinking about full delegation: agents take over humans' routine
decision-making processes and are trusted by humans to find solutions that fit
people's personalized needs and are adaptive to ever-changing context. In order
to achieve such a goal, the behavior of the agents, i.e., agentic behaviors,
should be evaluated not only on their achievements (i.e., outcome evaluation),
but also on how they achieved them (i.e., procedure evaluation). For this, we
propose APEC Agent Constitution, a list of criteria that an agent should follow
for good agentic behaviors, including Accuracy, Proactivity, Efficiency and
Credibility. To verify whether APEC aligns with human preferences, we develop
APEC-Travel, a travel planning agent that proactively extracts hidden
personalized needs via multi-round dialog with travelers. APEC-Travel is
constructed purely from synthetic data generated by Llama3.1-405B-Instruct with
a diverse set of traveler personas to simulate a rich distribution of dialogs.
Iteratively fine-tuned to follow APEC Agent Constitution, APEC-Travel surpasses
baselines by 20.7% on rule-based metrics and 9.1% on LLM-as-a-Judge scores
across the constitution axes.
☆ PIORS: Personalized Intelligent Outpatient Reception based on Large Language Model with Multi-Agents Medical Scenario Simulation
Zhijie Bao, Qingyun Liu, Ying Guo, Zhengqiang Ye, Jun Shen, Shirong Xie, Jiajie Peng, Xuanjing Huang, Zhongyu Wei
In China, receptionist nurses face overwhelming workloads in outpatient
settings, limiting their time and attention for each patient and ultimately
reducing service quality. In this paper, we present the Personalized
Intelligent Outpatient Reception System (PIORS). This system integrates an
LLM-based reception nurse and a collaboration between LLM and hospital
information system (HIS) into a real outpatient reception setting, aiming to
deliver personalized, high-quality, and efficient reception services.
Additionally, to enhance the performance of LLMs in real-world healthcare
scenarios, we propose a medical conversational data generation framework named
Service Flow aware Medical Scenario Simulation (SFMSS), aiming to adapt the LLM
to the real-world environments and PIORS settings. We evaluate the
effectiveness of PIORS and SFMSS through automatic and human assessments
involving 15 users and 15 clinical experts. The results demonstrate that
PIORS-Nurse outperforms all baselines, including the current state-of-the-art
model GPT-4o, and aligns with human preferences and clinical needs. Further
details and demo can be found at https://github.com/FudanDISC/PIORS
☆ Robust Detection of Watermarks for Large Language Models Under Human Edits
Watermarking has offered an effective approach to distinguishing text
generated by large language models (LLMs) from human-written text. However, the
pervasive presence of human edits on LLM-generated text dilutes watermark
signals, thereby significantly degrading detection performance of existing
methods. In this paper, by modeling human edits through mixture model
detection, we introduce a new method in the form of a truncated goodness-of-fit
test for detecting watermarked text under human edits, which we refer to as
Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection
of the Gumbel-max watermark in a certain asymptotic regime of substantial text
modifications and vanishing watermark signals. Importantly, Tr-GoF achieves
this optimality adaptively as it does not require precise knowledge of
human edit levels or probabilistic specifications of the LLMs, in contrast to
the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover,
we establish that the Tr-GoF test attains the highest detection efficiency rate
in a certain regime of moderate text modifications. In stark contrast, we show
that sum-based detection rules, as employed by existing methods, fail to
achieve optimal robustness in both regimes because the additive nature of their
statistics is less resilient to edit-induced noise. Finally, we demonstrate the
competitive and sometimes superior empirical performance of the Tr-GoF test on
both synthetic data and open-source LLMs in the OPT and LLaMA families.
☆ HARec: Hyperbolic Graph-LLM Alignment for Exploration and Exploitation in Recommender Systems
Modern recommendation systems often create information cocoons, limiting
users' exposure to diverse content. To enhance user experience, a crucial
challenge is developing systems that can balance content exploration and
exploitation, allowing users to adjust their recommendation preferences.
Intuitively, this balance can be achieved through a tree-structured
representation, where depth search facilitates exploitation and breadth search
enables exploration. However, current works face two challenges to achieve this
target: (1) Euclidean methods fail to fully capture hierarchical structures and
lack flexibility in balancing exploration-exploitation, while (2) hyperbolic
approaches, despite better hierarchical modeling, suffer from insufficient
semantic alignment due to their reliance on Euclidean text encoders. To address
these challenges, we propose HARec, a hyperbolic representation learning
framework that jointly aligns user-item collaborative information with textual
descriptions in hyperbolic space. Our framework introduces two key technical
novelties: (1) a hierarchical-aware graph-LLM alignment mechanism that enables
better hierarchical representation, and (2) a hyperbolic hierarchical tree
structure that facilitates user-adjustable exploration-exploitation trade-offs.
Extensive experiments demonstrate that HARec consistently outperforms both
Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in
utility metrics and 11.39% increase in diversity metrics.
☆ Interactive and Expressive Code-Augmented Planning with Large Language Models
Anthony Z. Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sungryull Sohn, Jaekyeom Kim, Honglak Lee
Large Language Models (LLMs) demonstrate strong abilities in common-sense
reasoning and interactive decision-making, but often struggle with complex,
long-horizon planning tasks. Recent techniques have sought to structure LLM
outputs using control flow and other code-adjacent techniques to improve
planning performance. These techniques include using variables (to track
important information) and functions (to divide complex tasks into smaller
re-usable sub-tasks). However, purely code-based approaches can be error-prone
and insufficient for handling ambiguous or unstructured data. To address these
challenges, we propose REPL-Plan, an LLM planning approach that is fully
code-expressive (it can utilize all the benefits of code) while also being
dynamic (it can flexibly adapt from errors and use the LLM for fuzzy
situations). In REPL-Plan, an LLM solves tasks by interacting with a
Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code,
similar to language shells or interactive code notebooks, allowing the model to
flexibly correct errors and handle tasks dynamically. We demonstrate that
REPL-Plan achieves strong results across various planning domains compared to
previous methods.
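The interaction pattern described in this abstract can be sketched as a small Python loop. This is a minimal illustration, not the REPL-Plan implementation: `run_repl_episode` and `scripted_model` are hypothetical names, and the scripted function merely stands in for an LLM that would choose the next line of code from the feedback so far.

```python
def run_repl_episode(propose_line, max_steps=10):
    """Run a tiny read-eval-print loop driven by `propose_line`.

    `propose_line(history)` stands in for the LLM: it receives the list of
    (code, feedback) pairs so far and returns the next line of code, or
    None when it considers the task done.
    """
    namespace = {}
    history = []
    for _ in range(max_steps):
        code = propose_line(history)
        if code is None:
            break
        try:
            try:
                # Expressions are evaluated so their value can be echoed.
                feedback = repr(eval(code, namespace))
            except SyntaxError:
                # Statements (such as assignments) are executed instead.
                exec(code, namespace)
                feedback = "ok"
        except Exception as exc:
            # Errors are reported back instead of aborting the episode.
            feedback = f"error: {type(exc).__name__}: {exc}"
        history.append((code, feedback))
    return namespace, history


def scripted_model(history):
    """Hypothetical stand-in for an LLM: replays a script with one fix."""
    script = [
        "total = sum(items)",  # fails: `items` is not defined yet
        "items = [1, 2, 3]",   # correction after seeing the error
        "total = sum(items)",
        "total",
    ]
    return script[len(history)] if len(history) < len(script) else None
```

Calling `run_repl_episode(scripted_model)` records the initial `NameError` as feedback rather than raising it, which is what lets a model driving the loop recover from its own mistakes. Executing model-written code with `eval` and `exec` is unsafe outside a sandbox.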
☆ InstCache: A Predictive Cache for LLM Serving
Large language models are revolutionizing every aspect of human life.
However, the unprecedented power comes at the cost of significant computing
intensity, implying long latency and a large energy footprint. Key-Value Cache
and Semantic Cache have been proposed as solutions to the above problem, but
both suffer from limited scalability due to the significant memory cost of each
token or instruction embedding. Motivated by the observations that most
instructions are short, repetitive and predictable by LLMs, we propose to
predict user-instructions by an instruction-aligned LLM and store them in a
predictive cache, so-called InstCache. We introduce an instruction
pre-population algorithm based on the negative log likelihood of instructions,
determining the cache size with regard to the hit rate. The proposed InstCache
is efficiently implemented as a hash table with minimal lookup latency for
deployment. Experimental results show that InstCache can achieve up to 51.34%
hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost
of only 4.5GB.
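A minimal Python sketch of the idea described in this abstract follows: a hash table pre-populated with instructions whose negative log likelihood (NLL) falls under a threshold. The class name `PredictiveCache`, the `nll_threshold` parameter, and the choice to store a response next to each instruction are assumptions made for illustration; the NLL values would come from an instruction-aligned LLM, which is not modeled here.

```python
class PredictiveCache:
    """Toy predictive cache: a dict keyed by the exact instruction text."""

    def __init__(self):
        self._table = {}
        self.hits = 0
        self.lookups = 0

    def prepopulate(self, candidates, nll_threshold):
        """Store candidates whose negative log likelihood is low enough.

        `candidates` is an iterable of (instruction, nll, response) tuples.
        A lower threshold gives a smaller cache; a higher one also admits
        less likely instructions.
        """
        for instruction, nll, response in candidates:
            if nll <= nll_threshold:
                self._table[instruction] = response

    def lookup(self, instruction):
        """Return the cached response, or None on a miss."""
        self.lookups += 1
        response = self._table.get(instruction)
        if response is not None:
            self.hits += 1
        return response

    def hit_rate(self):
        """Fraction of lookups so far that were served from the cache."""
        return self.hits / self.lookups if self.lookups else 0.0

    def __len__(self):
        return len(self._table)
```

In this sketch the threshold is the single knob that trades cache size against hit rate, mirroring the abstract's point that the cache size is determined with regard to the hit rate.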
☆ SemiKong: Curating, Training, and Evaluating A Semiconductor Industry-Specific Large Language Model
Christopher Nguyen, William Nguyen, Atsushi Suzuki, Daisuke Oku, Hong An Phan, Sang Dinh, Zooey Nguyen, Anh Ha, Shruti Raghavan, Huy Vo, Thang Nguyen, Lan Nguyen, Yoshikuni Hirayama
Large Language Models (LLMs) have demonstrated the potential to address some
issues within the semiconductor industry. However, they are often
general-purpose models that lack the specialized knowledge needed to tackle the
unique challenges of this sector, such as the intricate physics and chemistry
of semiconductor devices and processes. SemiKong, the first industry-specific
LLM for the semiconductor domain, provides a foundation that can be used to
develop tailored proprietary models. With SemiKong 1.0, we aim to develop a
foundational model capable of understanding etching problems at an expert
level. Our key contributions include (a) curating a comprehensive corpus of
semiconductor-related texts, (b) creating a foundational model with in-depth
semiconductor knowledge, and (c) introducing a framework for integrating expert
knowledge, thereby advancing the evaluation process of domain-specific AI
models. Through fine-tuning a pre-trained LLM using our curated dataset, we
have shown that SemiKong outperforms larger, general-purpose LLMs in various
semiconductor manufacturing and design tasks. Our extensive experiments
underscore the importance of developing domain-specific LLMs as a foundation
for company- or tool-specific proprietary models, paving the way for further
research and applications in the semiconductor domain. Code and dataset will be
available at https://github.com/aitomatic/semikong
comment: On-going work
☆ Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis
Adithya V Ganesan, Vasudha Varadarajan, Yash Kumar Lal, Veerle C. Eijsbroek, Katarina Kjell, Oscar N. E. Kjell, Tanuja Dhanasekaran, Elizabeth C. Stade, Johannes C. Eichstaedt, Ryan L. Boyd, H. Andrew Schwartz, Lucie Flek
Use of large language models such as ChatGPT (GPT-4) for mental health
support has grown rapidly, emerging as a promising route to assess and help
people with mood disorders, like depression. However, we have a limited
understanding of GPT-4's schema of mental disorders, that is, how it internally
associates and interprets symptoms. In this work, we leveraged contemporary
measurement theory to decode how GPT-4 interrelates depressive symptoms to
inform both clinical utility and theoretical understanding. We found GPT-4's
assessment of depression: (a) had high overall convergent validity (r = .71
with self-report on 955 samples, and r = .81 with expert judgments on 209
samples); (b) had moderately high internal consistency (symptom
inter-correlates r = .23 to .78) that largely aligned with literature and
self-report; except that GPT-4 (c) underemphasized suicidality's -- and
overemphasized psychomotor's -- relationship with other symptoms, and (d) had
symptom inference patterns that suggest nuanced hypotheses (e.g. sleep and
fatigue are influenced by most other symptoms while feelings of
worthlessness/guilt are mostly influenced by depressed mood).
comment: 21 pages, 3 tables, 6 figures, 1 supplementary table, 83 references
☆ NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews
Large Language Models (LLMs) have demonstrated impressive capabilities in
generating coherent text but often struggle with grounding language and
strategic dialogue. To address this gap, we focus on journalistic interviews, a
domain rich in grounding communication and abundant in data. We curate a
dataset of 40,000 two-person informational interviews from NPR and CNN, and
reveal that LLMs are significantly less likely than human interviewers to use
acknowledgements and to pivot to higher-level questions. Realizing that a
fundamental deficit exists in multi-turn planning and strategic thinking, we
develop a realistic simulated environment, incorporating source personas and
persuasive elements, in order to facilitate the development of agents with
longer-horizon rewards. Our experiments show that while source LLMs mimic human
behavior in information sharing, interviewer LLMs struggle with recognizing
when questions are answered and engaging persuasively, leading to suboptimal
information extraction across model size and capability. These findings
underscore the need for enhancing LLMs' strategic dialogue capabilities.
☆ Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels
This study presents a comprehensive evaluation of GPT-4's translation
capabilities compared to human translators of varying expertise levels. Through
systematic human evaluation using the MQM schema, we assess translations across
three language pairs (Chinese↔English,
Russian↔English, and Chinese↔Hindi) and
three domains (News, Technology, and Biomedical). Our findings reveal that
GPT-4 achieves performance comparable to junior-level translators in terms of
total errors, while still lagging behind senior translators. Unlike traditional
Neural Machine Translation systems, which show significant performance
degradation in resource-poor language directions, GPT-4 maintains consistent
translation quality across all evaluated language pairs. Through qualitative
analysis, we identify distinctive patterns in translation approaches: GPT-4
tends toward overly literal translations and exhibits lexical inconsistency,
while human translators sometimes over-interpret context and introduce
hallucinations. This study represents the first systematic comparison between
LLM and human translators across different proficiency levels, providing
valuable insights into the current capabilities and limitations of LLM-based
translation systems.
comment: Work in progress
☆ A Framework for Evaluating LLMs Under Task Indeterminacy NeurIPS 2024
Large language model (LLM) evaluations often assume there is a single correct
response -- a gold label -- for each item in the evaluation corpus. However,
some tasks can be ambiguous -- i.e., they provide insufficient information to
identify a unique interpretation -- or vague -- i.e., they do not clearly
indicate where to draw the line when making a determination. Both ambiguity and
vagueness can cause task indeterminacy -- the condition where some items in the
evaluation corpus have more than one correct response. In this paper, we
develop a framework for evaluating LLMs under task indeterminacy. Our framework
disentangles the relationships between task specification, human ratings, and
LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a
synthetic experiment showing that evaluations that use the "gold label"
assumption underestimate the true performance. We also provide a method for
estimating an error-adjusted performance interval given partial knowledge about
indeterminate items in the evaluation corpus. We conclude by outlining
implications of our work for the research community.
comment: To Appear in NeurIPS 2024 Workshops on Evaluating Evaluations
(EvalEval) and Statistical Foundations of LLMs and Foundation Models (SFLLM)
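The effect of the "gold label" assumption can be shown with a small Python sketch. It is not the authors' framework: the function names and the toy data are hypothetical, and the sketch only contrasts scoring against a single gold label with scoring against a set of correct responses.

```python
def gold_label_accuracy(responses, gold_labels):
    """Fraction of responses that equal the single gold label."""
    matches = sum(r == g for r, g in zip(responses, gold_labels))
    return matches / len(responses)


def set_based_accuracy(responses, correct_sets):
    """Fraction of responses that fall in the item's set of correct responses."""
    matches = sum(r in s for r, s in zip(responses, correct_sets))
    return matches / len(responses)


# Hypothetical toy data: the second item is indeterminate and admits two answers.
responses = ["yes", "no", "yes", "no"]
gold_labels = ["yes", "yes", "yes", "yes"]
correct_sets = [{"yes"}, {"yes", "no"}, {"yes"}, {"yes"}]
```

On this toy data the gold-label score is 0.5 while the set-based score is 0.75, because the response to the indeterminate item is counted as wrong under the single-label assumption. Whenever each gold label belongs to its item's correct set, the gold-label score cannot exceed the set-based one.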
♻ ☆ Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao
Vision language models (VLMs) have demonstrated impressive performance across
a wide range of downstream tasks. However, their proficiency in spatial
reasoning remains limited, despite its crucial role in tasks involving
navigation and interaction with physical environments. Specifically, most of
these tasks rely on the core spatial reasoning capabilities in two-dimensional
(2D) environments, and our evaluation reveals that state-of-the-art VLMs
frequently generate implausible and incorrect responses to composite spatial
reasoning problems, including simple pathfinding tasks that humans can solve
effortlessly at a glance. To address this, we explore an effective approach to
enhance 2D spatial reasoning within VLMs by training the model solely on basic
spatial capabilities. We begin by disentangling the key components of 2D
spatial reasoning: direction comprehension, distance estimation, and
localization. Our central hypothesis is that mastering these basic spatial
capabilities can significantly enhance a model's performance on composite
spatial tasks requiring advanced spatial understanding and combinatorial
problem-solving, with generalized improvements in visual-spatial tasks. To
investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes
VLMs on these three basic spatial capabilities by synthetic data generation and
targeted supervision to form an instruction dataset for each capability. Our
experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant
performance gains, not only in the basic tasks themselves but also in
generalizing to composite and out-of-distribution spatial reasoning tasks.
These findings underscore the effectiveness of mastering basic spatial
capabilities in enhancing composite spatial problem-solving, offering insights
into systematic strategies for improving VLMs' spatial reasoning capabilities.
♻ ☆ LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings
Zero-shot graph machine learning, especially with graph neural networks
(GNNs), has garnered significant interest due to the challenge of scarce
labeled data. While methods like self-supervised learning and graph prompt
learning have been extensively explored, they often rely on fine-tuning with
task-specific labels, limiting their effectiveness in zero-shot scenarios.
Inspired by the zero-shot capabilities of instruction-fine-tuned large language
models (LLMs), we introduce a novel framework named Token Embedding-Aligned
Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and
cross-task zero-shot learners for graph machine learning. Concretely, we
pretrain a GNN, aligning its representations with token embeddings of an LLM.
We then train a linear projector that transforms the GNN's representations into
a fixed number of graph token embeddings without tuning the LLM. A unified
instruction is designed for various graph tasks at different levels, such as
node classification (node-level) and link prediction (edge-level). These design
choices collectively enhance our method's effectiveness in zero-shot learning,
setting it apart from existing methods. Experiments show that our graph token
embeddings help the LLM predictor achieve state-of-the-art performance on
unseen datasets and tasks compared to other methods using LLMs as predictors.
♻ ☆ LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
Although large language models (LLMs) show impressive performance on complex tasks,
they still struggle with longer contextual understanding and high computational
costs. To balance efficiency and quality, we introduce LLMSteer, a
fine-tuning-free framework that enhances LLMs through query-independent
attention steering. Tested on popular LLMs and datasets, LLMSteer narrows the
performance gap with baselines by 65.9% and reduces the runtime delay by up to
4.8x compared to recent attention steering methods.
♻ ☆ AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context
Naba Rizvi, Harper Strickland, Daniel Gitelman, Tristan Cooper, Alexis Morales-Flores, Michael Golden, Aekta Kallepalli, Akshat Alurkar, Haaset Owens, Saleha Ahmedi, Isha Khirwadkar, Imani Munyaka, Nedjma Ousidhoum
As our understanding of autism and ableism continues to increase, so does our
understanding of ableist language towards autistic people. Such language poses
a significant challenge in NLP research due to its subtle and context-dependent
nature. Yet, detecting anti-autistic ableist language remains underexplored,
with existing NLP tools often failing to capture its nuanced expressions. We
present AUTALIC, the first benchmark dataset dedicated to the detection of
anti-autistic ableist language in context, addressing a significant gap in the
field. The dataset comprises 2,400 autism-related sentences collected from
Reddit, accompanied by surrounding context, and is annotated by trained experts
with backgrounds in neurodiversity. Our comprehensive evaluation reveals that
current language models, including state-of-the-art LLMs, struggle to reliably
identify anti-autistic ableism and align with human judgments, underscoring
their limitations in this domain. We publicly release AUTALIC along with the
individual annotations which serve as a valuable resource to researchers
working on ableism, neurodiversity, and also studying disagreements in
annotation tasks. This dataset serves as a crucial step towards developing more
inclusive and context-aware NLP systems that better reflect diverse
perspectives.
comment: 9 pages, 5 figures, 7 tables
♻ ☆ Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines
In the ever-evolving landscape of machine learning, seamless translation of
natural language descriptions into executable code remains a formidable
challenge. This paper introduces Linguacodus, an innovative framework designed
to tackle this challenge by deploying a dynamic pipeline that iteratively
transforms natural language task descriptions into code through high-level
data-shaping instructions. The core of Linguacodus is a fine-tuned large
language model (LLM), empowered to evaluate diverse solutions for various
problems and select the most fitting one for a given task. This paper details
the fine-tuning process, and sheds light on how natural language descriptions
can be translated into functional code. Linguacodus represents a substantial
leap towards automated code generation, effectively bridging the gap between
task descriptions and executable code. It holds great promise for advancing
machine learning applications across diverse domains. Additionally, we propose
an algorithm capable of transforming a natural description of an ML task into
code with minimal human interaction. In extensive experiments on a vast machine
learning code dataset originating from Kaggle, we showcase the effectiveness of
Linguacodus. The investigations highlight its potential applications across
diverse domains, emphasizing its impact on applied machine learning in various
scientific fields.
♻ ☆ EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Shih-Yang Liu, Huck Yang, Chien-Yi Wang, Nai Chit Fung, Hongxu Yin, Charbel Sakr, Saurav Muralidharan, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
In this work, we re-formulate the model compression problem into the
customized compensation problem: Given a compressed model, we aim to introduce
residual low-rank paths to compensate for compression errors under customized
requirements from users (e.g., tasks, compression ratios), resulting in greater
flexibility in adjusting overall capacity without being constrained by specific
compression formats. However, naively applying SVD to derive residual paths
causes suboptimal utilization of the low-rank representation capacity. Instead,
we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method
that directly minimizes compression-induced errors without requiring
gradient-based training, achieving fast optimization in minutes using a small
amount of calibration data. EoRA projects compression errors into the
eigenspace of input activations, leveraging eigenvalues to effectively
prioritize the reconstruction of high-importance error components. Moreover,
EoRA can be seamlessly integrated with fine-tuning and quantization to further
improve effectiveness and efficiency. EoRA consistently outperforms previous
methods in compensating errors for compressed LLaMA2/3 models on various tasks,
such as language generation, commonsense reasoning, and math reasoning tasks
(e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and
MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4
sparsity). EoRA offers a scalable, training-free solution to compensate for
compression errors, making it a powerful tool to deploy LLMs in various
capacity and efficiency requirements.
♻ ☆ BERTrend: Neural Topic Modeling for Emerging Trends Detection EMNLP 2024
Detecting and tracking emerging trends and weak signals in large, evolving
text corpora is vital for applications such as monitoring scientific
literature, managing brand reputation, surveilling critical infrastructure and,
more generally, any kind of text-based event detection. Existing solutions
often fail to capture the nuanced context or dynamically track evolving
patterns over time. BERTrend, a novel method, addresses these limitations using
neural topic modeling in an online setting. It introduces a new metric to
quantify topic popularity over time by considering both the number of documents
and update frequency. This metric classifies topics as noise, weak, or strong
signals, flagging emerging, rapidly growing topics for further investigation.
Experimentation on two large real-world datasets demonstrates BERTrend's
ability to accurately detect and track meaningful weak signals while filtering
out noise, offering a comprehensive solution for monitoring emerging trends in
large-scale, evolving text corpora. The method can also be used for
retrospective analysis of past events. In addition, the use of Large Language
Models together with BERTrend offers efficient means for the interpretability
of trends of events.
comment: 17 pages, 12 figures, FuturED 2024: Workshop on Future of Event
Detection (CoLocated with EMNLP 2024)
♻ ☆ What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages ACL 2024
Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell
What can large language models learn? By definition, language models (LM) are
distributions over strings. Therefore, an intuitive way of addressing the above
question is to formalize it as a matter of learnability of classes of
distributions over strings. While prior work in this direction focused on
assessing the theoretical limits, in contrast, we seek to understand the
empirical learnability. Unlike prior empirical work, we evaluate neural LMs on
their home turf -- learning probabilistic languages -- rather than as classifiers of
formal languages. In particular, we investigate the learnability of regular LMs
(RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs
as a function of various complexity parameters of the RLM and the hidden state
size of the neural LM. We find that the RLM rank, which corresponds to the size
of linear space spanned by the logits of its conditional distributions, and the
expected length of sampled strings are strong and significant predictors of
learnability for both RNNs and Transformers. Several other predictors also
reach significance, but with differing patterns between RNNs and Transformers.
comment: Accepted to ACL 2024
♻ ☆ Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates EMNLP 2024
Solidarity is a crucial concept to understand social relations in societies.
In this paper, we explore fine-grained solidarity frames to study solidarity
towards women and migrants in German parliamentary debates between 1867 and
2022. Using 2,864 manually annotated text snippets (with a cost exceeding 18k
Euro), we evaluate large language models (LLMs) like Llama 3, GPT-3.5, and
GPT-4. We find that GPT-4 outperforms other LLMs, approaching human annotation
quality. Using GPT-4, we automatically annotate more than 18k further instances
(with a cost of around 500 Euro) across 155 years and find that solidarity with
migrants outweighs anti-solidarity but that frequencies and solidarity types
shift over time. Most importantly, group-based notions of (anti-)solidarity
fade in favor of compassionate solidarity, focusing on the vulnerability of
migrant groups, and exchange-based anti-solidarity, focusing on the lack of
(economic) contribution. Our study highlights the interplay of historical
events, socio-economic needs, and political ideologies in shaping migration
discourse and social cohesion. We also show that powerful LLMs, if carefully
prompted, can be cost-effective alternatives to human annotation for hard
social scientific tasks.
comment: EMNLP 2024 (Main Conference) Camera-Ready Version
♻ ☆ Is Less More? Exploring Token Condensation as Training-free Adaptation for CLIP
Contrastive language-image pre-training (CLIP) has shown remarkable
generalization ability in image classification. However, CLIP sometimes
encounters performance drops on downstream datasets during zero-shot inference.
Test-time adaptation methods attempt to mitigate this by adjusting
normalization layers or tuning context prompts with large batch sizes and
extensive augmentations; yet, these methods are computationally intensive. This
raises an important question: Is there a training-free approach that can
efficiently address CLIP's performance drop in such cases? To explore this, we
benchmark token condensation techniques, originally designed to enhance the
efficiency of vision transformers, on CLIP zero-shot inference tasks. We
observe that although token condensation may compromise in-domain accuracy, it
surprisingly enhances CLIP's performance on certain cross-dataset benchmarks.
This motivates two key inquiries: (1) Can token condensation serve as a
"free-lunch" solution for CLIP zero-shot inference? (2) What criteria should
guide condensation -- how can essential tokens be identified and redundant ones
eliminated? To address these questions, we propose Token Condensation as
Adaptation (TCA), a training-free adaptation method for CLIP by pruning
class-irrelevant visual tokens while merging class-ambiguous tokens. As the
first approach for CLIP's token efficiency, TCA demonstrates superior
performance across cross-dataset tasks, achieving up to a 21.4% improvement
over the strongest baseline while reducing GFLOPs by 12.2% to 48.9%, with
minimized hyperparameter dependency.
comment: 15 pages, 7 figures
♻ ☆ Reconciling Kaplan and Chinchilla Scaling Laws
Kaplan et al. [2020] (`Kaplan') and Hoffmann et al. [2022] (`Chinchilla')
studied the scaling behavior of transformers trained on next-token language
prediction. These studies produced different estimates for how the number of
parameters ($N$) and training tokens ($D$) should be set to achieve the lowest
possible loss for a given compute budget ($C$). Kaplan: $N_\text{optimal}
\propto C^{0.73}$, Chinchilla: $N_\text{optimal} \propto C^{0.50}$. This paper
finds that much of this discrepancy can be attributed to Kaplan counting
non-embedding rather than total parameters, combined with their analysis being
performed at small scale. Simulating the Chinchilla study under these
conditions produces biased scaling coefficients close to Kaplan's. Hence, this
paper reaffirms Chinchilla's scaling coefficients, by explaining the primary
cause of Kaplan's original overestimation. As a second contribution, the paper
explains differences in the reported relationships between loss and compute.
These findings lead us to recommend that future scaling studies use total
parameters and compute.
comment: Published in TMLR 2024
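To make the gap between the two exponents concrete, the sketch below computes how much the compute-optimal parameter count grows when the compute budget is scaled up. The exponents 0.73 and 0.50 come from the abstract; the function names are illustrative, and the token-growth helper additionally assumes the common approximation that compute is proportional to N times D, which the abstract does not state.

```python
KAPLAN_EXPONENT = 0.73      # N_optimal ∝ C^0.73 (from the abstract)
CHINCHILLA_EXPONENT = 0.50  # N_optimal ∝ C^0.50 (from the abstract)


def param_growth(compute_ratio: float, exponent: float) -> float:
    """Factor by which N_optimal grows when compute grows by compute_ratio."""
    return compute_ratio ** exponent


def token_growth(compute_ratio: float, exponent: float) -> float:
    """Factor by which D grows, ASSUMING C ∝ N * D (an assumption of this
    sketch, not a claim from the abstract), so that D ∝ C^(1 - exponent)."""
    return compute_ratio ** (1.0 - exponent)


if __name__ == "__main__":
    for name, a in [("Kaplan", KAPLAN_EXPONENT), ("Chinchilla", CHINCHILLA_EXPONENT)]:
        n = param_growth(100.0, a)
        d = token_growth(100.0, a)
        print(f"{name}: 100x compute -> {n:.1f}x parameters, {d:.1f}x tokens")
```

Under the Chinchilla exponent, parameters and tokens grow at the same rate; under the Kaplan exponent, most of the extra compute goes into parameters.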
♻ ☆ Improving Steering Vectors by Targeting Sparse Autoencoder Features
To control the behavior of language models, steering methods attempt to
ensure that outputs of the model satisfy specific pre-defined properties.
Adding steering vectors to the model is a promising method of model control
that is easier than finetuning, and may be more robust than prompting. However,
it can be difficult to anticipate the effects of steering vectors produced by
methods such as CAA [Panickssery et al., 2024] or the direct use of SAE latents
[Templeton et al., 2024]. In our work, we address this issue by using SAEs to
measure the effects of steering vectors, giving us a method that can be used to
understand the causal effect of any steering vector intervention. We use this
method for measuring causal effects to develop an improved steering method,
SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific
SAE features while minimizing unintended side effects. We show that overall,
SAE-TS balances steering effects with coherence better than CAA and SAE feature
steering, when evaluated on a range of tasks.
comment: 8 maintext pages and 9 appendix pages
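For readers unfamiliar with steering vectors, the basic intervention is to add a fixed vector to a layer's activations at every token position. The sketch below shows only that generic addition on plain Python lists; it is not SAE-TS or CAA, and the function name and `scale` parameter are illustrative assumptions.

```python
from typing import List


def add_steering_vector(
    hidden_states: List[List[float]],
    steering_vector: List[float],
    scale: float = 1.0,
) -> List[List[float]]:
    """Return a copy of hidden_states (one row per token) with
    scale * steering_vector added to every row."""
    for row in hidden_states:
        if len(row) != len(steering_vector):
            raise ValueError("steering vector must match the hidden size")
    return [
        [h + scale * s for h, s in zip(row, steering_vector)]
        for row in hidden_states
    ]


if __name__ == "__main__":
    activations = [[1.0, 2.0], [0.0, 0.0]]  # two tokens, hidden size 2
    print(add_steering_vector(activations, [0.5, -1.0], scale=2.0))
```

How the vector is chosen, and how its side effects are measured, is where the methods named in the abstract differ.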
♻ ☆ How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?
The generic text preprocessing pipeline, comprising Tokenisation,
Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been
implemented in many ontology matching (OM) systems. However, the lack of
standardisation in text preprocessing creates diversity in mapping results. In
this paper, we investigate the effect of the text preprocessing pipeline on OM
tasks at syntactic levels. Our experiments on 8 Ontology Alignment Evaluation
Initiative (OAEI) track repositories with 49 distinct alignments indicate: (1)
Tokenisation and Normalisation are currently more effective than Stop Words
Removal and Stemming/Lemmatisation; and (2) The selection of Lemmatisation and
Stemming is task-specific. We recommend standalone Lemmatisation or Stemming
with post-hoc corrections. We find that (3) Porter Stemmer and Snowball Stemmer
perform better than Lancaster Stemmer; and that (4) Part-of-Speech (POS)
Tagging does not help Lemmatisation. To repair less effective Stop Words
Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel
context-based pipeline repair approach that significantly improves matching
correctness and overall matching performance. We also discuss the use of text
preprocessing pipeline in the new era of large language models (LLMs).
comment: 13 pages, 26 figures, 4 tables
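The four pipeline stages named in the abstract can be sketched as below for a single ontology entity label. This is a toy illustration: the stop-word list is a small hypothetical sample, and `toy_stem` is a deliberately naive suffix stripper, not the Porter, Snowball, or Lancaster stemmers evaluated in the paper.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "has", "is", "in"}


def tokenise(label: str) -> List[str]:
    """Split a label on camelCase boundaries and non-alphanumeric characters."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [t for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def normalise(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token: str) -> str:
    """Naive suffix stripping; keeps at least three characters of the stem."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(label: str, use_stop_words: bool = True, use_stemming: bool = True) -> List[str]:
    tokens = normalise(tokenise(label))
    if use_stop_words:
        tokens = remove_stop_words(tokens)
    if use_stemming:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    print(preprocess("hasAuthor_Names"))
    print(preprocess("hasAuthor_Names", use_stop_words=False, use_stemming=False))
```

Toggling the last two stages shows how the same label can yield different token sets, which is the source of the mapping diversity the abstract describes.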
♻ ☆ OASIS: Open Agents Social Interaction Simulations on One Million Agents
Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao
There has been a growing interest in enhancing rule-based agent-based models
(ABMs) for social media platforms (i.e., X, Reddit) with more realistic large
language model (LLM) agents, thereby allowing for a more nuanced study of
complex systems. As a result, several LLM-based ABMs have been proposed in the
past year. While they hold promise, each simulator is specifically designed to
study a particular scenario, making it time-consuming and resource-intensive to
explore other phenomena using the same ABM. Additionally, these models simulate
only a limited number of agents, whereas real-world social media platforms
involve millions of users. To this end, we propose OASIS, a generalizable and
scalable social media simulator. OASIS is designed based on real-world social
media platforms, incorporating dynamically updated environments (i.e., dynamic
social networks and post information), diverse action spaces (i.e., following,
commenting), and recommendation systems (i.e., interest-based and
hot-score-based). Additionally, OASIS supports large-scale user simulations,
capable of modeling up to one million users. With these features, OASIS can be
easily extended to different social media platforms to study large-scale group
phenomena and behaviors. We replicate various social phenomena, including
information spreading, group polarization, and herd effects across X and Reddit
platforms. Moreover, we provide observations of social phenomena at different
agent group scales. We observe that a larger agent group scale leads to
stronger group dynamics and more diverse and helpful agent opinions. These
findings demonstrate OASIS's potential as a powerful tool for studying complex
systems in digital environments.
♻ ☆ CulturePark: Boosting Cross-cultural Understanding in Large Language Models NeurIPS 2024
Cultural bias is pervasive in many large language models (LLMs), largely due
to the deficiency of data representative of different cultures. Typically,
cultural datasets and benchmarks are constructed either by extracting subsets
of existing datasets or by aggregating from platforms such as Wikipedia and
social media. However, these approaches are highly dependent on real-world data
and human annotations, making them costly and difficult to scale. Inspired by
cognitive theories on social communication, this paper introduces CulturePark,
an LLM-powered multi-agent communication framework for cultural data
collection. CulturePark simulates cross-cultural human communication with
LLM-based agents playing roles in different cultures. It generates high-quality
cross-cultural dialogues encapsulating human beliefs, norms, and customs. Using
CulturePark, we generated 41,000 cultural samples to fine-tune eight
culture-specific LLMs. We evaluated these models across three downstream tasks:
content moderation, cultural alignment, and cultural education. Results show
that for content moderation, our GPT-3.5-based models either match or
outperform GPT-4 on datasets. Regarding cultural alignment, our models surpass
GPT-4 on Hofstede's VSM 13 framework. Furthermore, for cultural education of
human participants, our models demonstrate superior outcomes in both learning
efficacy and user experience compared to GPT-4. CulturePark proves an important
step in addressing cultural bias and advancing the democratization of AI,
highlighting the critical role of culturally inclusive data in model training.
Code is released at https://github.com/Scarelette/CulturePark.
comment: NeurIPS 2024; Code is released at
https://github.com/Scarelette/CulturePark. arXiv admin note: substantial text
overlap with arXiv:2402.10946
♻ ☆ mHuBERT-147: A Compact Multilingual HuBERT Model
We present mHuBERT-147, the first general-purpose massively multilingual
HuBERT speech representation model trained on 90K hours of clean, open-license
data. To scale up the multi-iteration HuBERT approach, we use faiss-based
clustering, achieving 5.2x faster label assignment than the original method. We
also apply a new multilingual batching up-sampling strategy, leveraging both
language and dataset diversity. After 3 training iterations, our compact 95M
parameter mHuBERT-147 outperforms larger models trained on substantially more
data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with
SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses
XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against
the much larger MMS (1B params; 491K hours). Our findings indicate that
mHuBERT-147 is a promising model for multilingual speech tasks, offering an
unprecedented balance between high performance and parameter efficiency.
comment: Extended version of the Interspeech 2024 paper of same name
♻ ☆ Verifying the Robustness of Automatic Credibility Assessment
Text classification methods have been widely investigated as a way to detect
content of low credibility: fake news, social media bots, propaganda, etc.
Quite accurate models (likely based on deep neural networks) help in moderating
public electronic platforms and often cause content creators to face rejection
of their submissions or removal of already published texts. Having the
incentive to evade further detection, content creators try to come up with a
slightly modified version of the text (known as an attack with an adversarial
example) that exploits the weaknesses of classifiers and results in a different
output. Here we systematically test the robustness of common text classifiers
against available attacking techniques and discover that, indeed,
meaning-preserving changes in input text can mislead the models. The approaches
we test focus on finding vulnerable spans in text and replacing individual
characters or words, taking into account the similarity between the original
and replacement content. We also introduce BODEGA: a benchmark for testing both
victim models and attack methods on four misinformation detection tasks in an
evaluation framework designed to simulate real use-cases of content moderation.
The attacked tasks include (1) fact checking and detection of (2) hyperpartisan
news, (3) propaganda and (4) rumours. Our experimental results show that modern
large language models are often more vulnerable to attacks than previous,
smaller solutions, e.g. attacks on GEMMA being up to 27% more successful than
those on BERT. Finally, we manually analyse a subset of adversarial examples and
check what kinds of modifications are used in successful attacks.
♻ ☆ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Rapidly developing large language models (LLMs) have enabled a wide range of
intelligent applications. In particular, GPT-4o's excellent duplex speech
interaction ability has given users an impressive experience. Researchers
have recently proposed several multi-modal LLMs in this direction that can
achieve user-agent speech-to-speech conversations. This paper proposes a novel
speech-text multimodal LLM architecture called Freeze-Omni. Our main
contribution is that the speech input and output modalities can be easily
connected to a textual LLM while keeping the LLM's parameters frozen throughout
the training process. We design a three-stage training strategy for modeling
both the speech input and output, enabling Freeze-Omni to obtain
speech-to-speech conversation ability using text-speech paired data (such as
ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs.
Moreover, we can effectively ensure that the intelligence of Freeze-Omni in
the speech modality is on par with that in the text modality of its backbone
LLM, while achieving a low-latency end-to-end spoken
response. In addition, we also designed a method to achieve duplex dialogue
ability through multi-task training, giving Freeze-Omni a more natural style of
dialogue ability between users and agents. In summary, Freeze-Omni holds great
potential to conduct speech-to-speech dialogue based on a multimodal LLM under
the condition of a frozen LLM, avoiding the catastrophic forgetting problem
caused by limited data and training resources.
comment: Project Page: https://freeze-omni.github.io/
♻ ☆ A Confidence-based Acquisition Model for Self-supervised Active Learning and Label Correction ACL
Carel van Niekerk, Christian Geishauser, Michael Heck, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Milica Gašić
Supervised neural approaches are hindered by their dependence on large,
meticulously annotated datasets, a requirement that is particularly cumbersome
for sequential tasks. The quality of annotations tends to deteriorate with the
transition from expert-based to crowd-sourced labelling. To address these
challenges, we present CAMEL (Confidence-based Acquisition Model for Efficient
self-supervised active Learning), a pool-based active learning framework
tailored to sequential multi-output problems. CAMEL possesses two core
features: (1) it requires expert annotators to label only a fraction of a
chosen sequence, and (2) it facilitates self-supervision for the remainder of
the sequence. By deploying a label correction mechanism, CAMEL can also be
utilised for data cleaning. We evaluate CAMEL on two sequential tasks, with a
special emphasis on dialogue belief tracking, a task plagued by the constraints
of limited and noisy datasets. Our experiments demonstrate that CAMEL
significantly outperforms the baselines in terms of efficiency. Furthermore,
the data corrections suggested by our method contribute to an overall
improvement in the quality of the resulting datasets.
comment: Accepted at TACL
♻ ☆ REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering EMNLP 2024
Considering the limited internal parametric knowledge, retrieval-augmented
generation (RAG) has been widely used to extend the knowledge scope of large
language models (LLMs). Despite the extensive efforts on RAG research, in
existing methods, LLMs cannot precisely assess the relevance of retrieved
documents, thus likely leading to misleading or even incorrect utilization of
external knowledge (e.g., retrieved documents). To address this issue, in this
paper, we propose REAR, a RElevance-Aware Retrieval-augmented approach for
open-domain question answering (QA). As the key motivation, we aim to enhance
the self-awareness regarding the reliability of external knowledge for LLMs, so
as to adaptively utilize external knowledge in RAG systems. Specifically, we
develop a novel architecture for LLM-based RAG systems, by incorporating a
specially designed assessment module that precisely assesses the relevance of
retrieved documents. Furthermore, we propose an improved training method based
on bi-granularity relevance fusion and noise-resistant training. By combining
the improvements in both architecture and training, our proposed REAR can
better utilize external knowledge by effectively perceiving the relevance of
retrieved documents. Experiments on four open-domain QA tasks show that REAR
significantly outperforms a number of competitive previous RAG approaches. Our
code can be accessed at https://github.com/RUCAIBox/REAR.
comment: Accepted to EMNLP 2024 Main Conference. Published on ACL Anthology:
https://aclanthology.org/2024.emnlp-main.321.pdf
♻ ☆ LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou
This paper presents an advanced mathematical problem-solving framework,
LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language
Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with
iterative Self-Refine to optimize the reasoning path and utilizes a pairwise
reward model to evaluate different paths globally. By leveraging the
self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS
(SR-MCTS) overcomes the inefficiencies and limitations of conventional
step-wise and greedy search algorithms by fostering a more efficient
exploration of solution spaces. Pairwise Preference Reward Model (PPRM),
inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to
model pairwise preferences between solutions, utilizing an Enhanced Borda Count
(EBC) method to synthesize these preferences into a global ranking score to
find better answers. This approach addresses the challenges of scoring
variability and non-independent distributions in mathematical reasoning tasks.
The framework has been tested on general and advanced benchmarks, showing
superior performance in terms of search efficiency and problem-solving
capability compared to existing methods like ToT and rStar, particularly in
complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.
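The step from pairwise preferences to a global ranking can be illustrated with a plain Borda-style tally of pairwise wins. This is a minimal sketch of the general idea only: it is not the Enhanced Borda Count from the paper, and the `prefers` callback stands in for a learned pairwise reward model.

```python
from typing import Callable, Dict, List


def borda_scores(candidates: List[str], prefers: Callable[[str, str], bool]) -> Dict[str, int]:
    """Score each candidate by the number of pairwise comparisons it wins.

    prefers(a, b) returns True when a is preferred over b. In this sketch it
    is a hypothetical stand-in for a pairwise preference model.
    """
    scores = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            winner = a if prefers(a, b) else b
            scores[winner] += 1
    return scores


def global_ranking(candidates: List[str], prefers: Callable[[str, str], bool]) -> List[str]:
    """Order candidates from most to fewest pairwise wins."""
    scores = borda_scores(candidates, prefers)
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


if __name__ == "__main__":
    # Toy quality values used only to drive the hypothetical preference function.
    quality = {"sol_a": 0.2, "sol_b": 0.9, "sol_c": 0.5}
    print(global_ranking(list(quality), lambda x, y: quality[x] > quality[y]))
```

When the preferences are consistent with a single total order, this tally reproduces that order; handling inconsistent or cyclic preferences is where a more elaborate aggregation is needed.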
♻ ☆ Probing Multimodal Large Language Models for Global and Local Semantic Representations LREC
The advancement of Multimodal Large Language Models (MLLMs) has greatly
accelerated the development of applications in understanding integrated texts
and images. Recent works leverage image-caption datasets to train MLLMs,
achieving state-of-the-art performance on image-to-text tasks. However, there
are few studies exploring which layers of MLLMs contribute most to encoding
global image information, which plays a vital role in multimodal comprehension
and generation. In this study, we find that the intermediate layers of models
can encode more global semantic information, whose representation vectors
perform better on visual-language entailment tasks, rather than the topmost
layers. We further probe models regarding local semantic representations
through object recognition tasks. We find that the topmost layers may
excessively focus on local information, leading to a diminished ability to
encode global information. Our code and data are released via
https://github.com/kobayashikanna01/probing_MLLM_rep.
comment: Accepted by LREC-COLING 2024 as a short paper. ACL Anthology URL:
[https://aclanthology.org/2024.lrec-main.1142/]
♻ ☆ PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
In this study, we aim to reduce generation latency for Named Entity
Recognition (NER) with Large Language Models (LLMs). The main cause of high
latency in LLMs is the sequential decoding process, which autoregressively
generates all labels and mentions for NER, significantly increasing the sequence
length. To this end, we introduce Parallel Decoding in LLM for NER
(PaDeLLM-NER), an approach that integrates seamlessly into existing generative
model frameworks without necessitating additional modules or architectural
modifications. PaDeLLM-NER allows for the simultaneous decoding of all
mentions, thereby reducing generation latency. Experiments reveal that
PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times
faster than the autoregressive approach for both English and Chinese. At the
same time, it maintains prediction quality, with performance on par with the
state of the art across various datasets.
comment: Accepted to NeurIPS 2024
♻ ☆ Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
Generative AI has made rapid advancements in recent years, achieving
unprecedented capabilities in multimodal understanding and code generation.
This can enable a new paradigm of front-end development in which multimodal
large language models (MLLMs) directly convert visual designs into code
implementations. In this work, we construct Design2Code - the first real-world
benchmark for this task. Specifically, we manually curate 484 diverse
real-world webpages as test cases and develop a set of automatic evaluation
metrics to assess how well current multimodal LLMs can generate the code
implementations that directly render into the given reference webpages, given
the screenshots as input. We also complement automatic metrics with
comprehensive human evaluations to validate the performance ranking. To
rigorously benchmark MLLMs, we test various multimodal prompting methods on
frontier models such as GPT-4o, GPT-4V, Gemini, and Claude. Our fine-grained
break-down metrics indicate that models mostly lag in recalling visual elements
from the input webpages and generating correct layout designs.
comment: The first two authors contributed equally
♻ ☆ High Risk of Political Bias in Black Box Emotion Inference Models
This paper investigates the presence of political bias in emotion inference
models used for sentiment analysis (SA) in social science research. Machine
learning models often reflect biases in their training data, impacting the
validity of their outcomes. While previous research has highlighted gender and
race biases, our study focuses on political bias - an underexplored yet
pervasive issue that can skew the interpretation of text data across a wide
array of studies. We conducted a bias audit on a Polish sentiment analysis
model developed in our lab. By analyzing valence predictions for names and
sentences involving Polish politicians, we uncovered systematic differences
influenced by political affiliations. Our findings indicate that annotations by
human raters propagate political biases into the model's predictions. To
mitigate this, we pruned the training dataset of texts mentioning these
politicians and observed a reduction in bias, though not its complete
elimination. Given the significant implications of political bias in SA, our
study emphasizes caution in employing these models for social science research.
We recommend a critical examination of SA results and propose using
lexicon-based systems as a more ideologically neutral alternative. This paper
underscores the necessity for ongoing scrutiny and methodological adjustments
to ensure the reliability and impartiality of the use of machine learning in
academic and applied contexts.
♻ ☆ SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation
Large language models demonstrate exceptional performance in simple code
generation tasks but still face challenges in tackling complex problems. These
challenges may stem from insufficient reasoning and problem decomposition
capabilities. To address this issue, we propose a reasoning-augmented data
generation process, SRA-MCTS, which guides the model to autonomously generate
high-quality intermediate reasoning paths. This creates a positive feedback
loop, enabling continuous improvement. Our method operates entirely through the
model itself without requiring additional supervision. By synthesizing natural
language reasoning paths and translating them into executable code, the
approach ensures analytical accuracy and enhances the success rate in solving
complex tasks. Experimental results show that, even without additional
supervisory signals, our method achieves performance improvements across
different model scales, demonstrating the significant potential of
self-improvement in small models. Furthermore, the method remains robust when
traditional Chain-of-Thought (CoT) approaches exhibit performance degradation,
with notable improvements observed in diversity metrics such as pass@10. We
encourage further exploration of reasoning processes within training data to
enhance the ability of language models to address complex problems.
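For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below implements that widely used estimator; the abstract does not say which estimator the authors used, so treat this as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 20 samples per problem, 3 of which pass; estimate pass@1 and pass@10.
    print(pass_at_k(20, 3, 1), pass_at_k(20, 3, 10))
```

Because pass@10 rewards any single success among ten samples, it is sensitive to output diversity in a way pass@1 is not.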
♻ ☆ Disentangling Memory and Reasoning Ability in Large Language Models
Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang
Large Language Models (LLMs) have demonstrated strong performance in handling
complex tasks requiring both extensive knowledge and reasoning abilities.
However, the existing LLM inference pipeline operates as an opaque process
without explicit separation between knowledge retrieval and reasoning steps,
making the model's decision-making process unclear and disorganized. This
ambiguity can lead to issues such as hallucinations and knowledge forgetting,
which significantly impact the reliability of LLMs in high-stakes domains. In
this paper, we propose a new inference paradigm that decomposes the complex
inference process into two distinct and clear actions: (1) memory recall: which
retrieves relevant knowledge, and (2) reasoning: which performs logical steps
based on the recalled knowledge. To facilitate this decomposition, we introduce
two special tokens, memory and reason, guiding the model to distinguish between
steps that require knowledge retrieval and those that involve reasoning. Our
experiment results show that this decomposition not only improves model
performance but also enhances the interpretability of the inference process,
enabling users to identify sources of error and refine model responses
effectively. The code is available at
https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning.
♻ ☆ A Closer Look at Machine Unlearning for Large Language Models
Large language models (LLMs) may memorize sensitive or copyrighted content,
raising privacy and legal concerns. Due to the high cost of retraining from
scratch, researchers attempt to employ machine unlearning to remove specific
content from LLMs while preserving the overall performance. In this paper, we
discuss several issues in machine unlearning for LLMs and provide our insights
on possible approaches. To address the issue of inadequate evaluation of model
outputs after unlearning, we introduce three additional metrics to evaluate
token diversity, sentence semantics, and factual correctness. We then
categorize unlearning methods into untargeted and targeted, and discuss their
issues respectively. Specifically, the behavior that untargeted unlearning
attempts to approximate is unpredictable and may involve hallucinations, and
existing regularization is insufficient for targeted unlearning. To alleviate
these issues, we propose using the objective of maximizing entropy (ME) for
untargeted unlearning and incorporate answer preservation (AP) loss as
regularization for targeted unlearning. Experimental results across three
scenarios, i.e., fictitious unlearning, continual unlearning, and real-world
unlearning, demonstrate the effectiveness of our approaches. The code is
available at https://github.com/sail-sg/closer-look-LLM-unlearning.
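One way to read the maximizing-entropy objective is as a loss that pushes the model's next-token distribution on forget-set tokens towards uniform. The sketch below computes such a loss from raw logits in pure Python; the exact formulation in the paper may differ, and the function names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_loss(logits_per_token: List[List[float]]) -> float:
    """Negative mean entropy over token positions.

    Minimizing this value maximizes the entropy of the predictions; this is
    an assumed reading of the ME objective, not the paper's exact loss.
    """
    entropies = [entropy(softmax(logits)) for logits in logits_per_token]
    return -sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident = [[10.0, 0.0, 0.0, 0.0]]  # peaked prediction, low entropy
    uniform = [[0.0, 0.0, 0.0, 0.0]]     # uniform prediction, maximal entropy
    print(max_entropy_loss(confident), max_entropy_loss(uniform))
```

The loss is lowest when every position predicts a uniform distribution over the vocabulary, where it equals minus the log of the vocabulary size.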
♻ ☆ When Context Leads but Parametric Memory Follows in Large Language Models EMNLP 2024
Large language models (LLMs) have demonstrated remarkable progress in
leveraging diverse knowledge sources. This study investigates how nine widely
used LLMs allocate knowledge between local context and global parameters when
answering open-ended questions in knowledge-consistent scenarios. We introduce
a novel dataset, WikiAtomic, and systematically vary context sizes to analyze
how LLMs prioritize and utilize the provided information and their parametric
knowledge in knowledge-consistent scenarios. Additionally, we also study their
tendency to hallucinate under varying context sizes. Our findings reveal
consistent patterns across models, including a consistent reliance on both
contextual (around 70%) and parametric (around 30%) knowledge, and a decrease
in hallucinations with increasing context. These insights highlight the
importance of more effective context organization and developing models that
use input more deterministically for robust performance.
comment: Accepted by EMNLP 2024 Main Conference
♻ ☆ Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese
Natural Language Inference (NLI) is a task within Natural Language Processing
(NLP) that holds value for various AI applications. However, there have been
limited studies on Natural Language Inference in Vietnamese that explore the
concept of joint models. Therefore, we conducted experiments using various
combinations of contextualized language models (CLM) and neural networks. We
use CLM to create contextualized word representations and use neural networks for
classification. Furthermore, we have evaluated the strengths and weaknesses of
each joint model and identified the model failure points in the Vietnamese
context. The highest F1 score in this experiment reaches 82.78% on the benchmark
dataset (ViNLI). Among the models we experimented with, the largest CLM is
XLM-R (355M). That combination has consistently
demonstrated superior performance compared to fine-tuning strong pre-trained
language models like PhoBERT (+6.58%), mBERT (+19.08%), and XLM-R (+0.94%) in
terms of F1-score. This article aims to introduce a novel approach or model
that attains improved performance for Vietnamese NLI. Overall, we find that the
joint approach of CLM and neural networks is simple yet capable of achieving
high-quality performance, which makes it suitable for applications that require
efficient resource utilization.
♻ ☆ On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
Diffusion models and large language models have emerged as leading-edge
generative models, revolutionizing various aspects of human life. However, the
practical implementations of these models have also exposed inherent risks,
bringing to the forefront their evil sides and sparking concerns regarding
their trustworthiness. Despite the wealth of literature on this subject, a
comprehensive survey specifically delving into the intersection of large-scale
generative models and their trustworthiness remains largely absent. To bridge
this gap, this paper investigates both the long-standing and emerging threats
associated with these models across four fundamental dimensions: 1) privacy, 2)
security, 3) fairness, and 4) responsibility. Based on the investigation
results, we develop an extensive map outlining the trustworthiness of large
generative models. After that, we provide practical recommendations and
potential research directions for future secure applications equipped with
large generative models, ultimately promoting the trustworthiness of the models
and benefiting society as a whole.
comment: draft
♻ ☆ Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
Advancements in large language models (LLMs) have renewed concerns about AI
alignment - the consistency between human and AI goals and values. As various
jurisdictions enact legislation on AI safety, the concept of alignment must be
defined and measured across different domains. This paper proposes an
experimental framework to assess whether LLMs adhere to ethical and legal
standards in the relatively unexplored context of finance. We prompt nine LLMs
to impersonate the CEO of a financial institution and test their willingness to
misuse customer assets to repay outstanding corporate debt. Beginning with a
baseline configuration, we adjust preferences, incentives and constraints,
analyzing the impact of each adjustment with logistic regression. Our findings
reveal significant heterogeneity in the baseline propensity for unethical
behavior of LLMs. Factors such as risk aversion, profit expectations, and
regulatory environment consistently influence misalignment in ways predicted by
economic theory, although the magnitude of these effects varies across LLMs.
This paper highlights both the benefits and limitations of simulation-based, ex
post safety testing. While it can inform financial authorities and institutions
aiming to ensure LLM safety, there is a clear trade-off between generality and
cost.
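The analysis step described above (regressing a binary misuse decision on
scenario adjustments) can be sketched as follows. This is a minimal
illustration only, not the authors' code: the two binary factors, the trial
counts, and the names make_trials and fit_logistic are hypothetical, and the
data are invented toy values rather than results from the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def make_trials():
    """Build a toy dataset (hypothetical, NOT from the paper).

    Each row is (risk_averse, strict_regulation); the label is 1 when the
    simulated CEO misuses customer assets. Counts are invented for illustration.
    """
    misuse_counts = {(0, 0): 8, (1, 0): 4, (0, 1): 3, (1, 1): 1}  # out of 10 each
    rows, labels = [], []
    for factors, misuses in misuse_counts.items():
        for i in range(10):
            rows.append(factors)
            labels.append(1 if i < misuses else 0)
    return rows, labels


def fit_logistic(rows, labels, lr=0.5, steps=5000):
    """Fit intercept + one weight per factor by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            score = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            err = sigmoid(score) - y  # gradient of the log loss w.r.t. the score
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


if __name__ == "__main__":
    intercept, w_risk, w_reg = fit_logistic(*make_trials())
    print(f"intercept={intercept:.2f} risk_averse={w_risk:.2f} strict_regulation={w_reg:.2f}")
```

In this toy setup a negative coefficient means the corresponding adjustment
lowers the log-odds of misuse relative to the baseline configuration.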
♻ ☆ A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
We introduce a dataset of natural-language questions in the decision theory
of so-called Newcomb-like problems. Newcomb-like problems include, for
instance, decision problems in which an agent interacts with a similar other
agent, and thus has to reason about the fact that the other agent will likely
reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is
important because interactions between foundation-model-based agents will often
be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow
for greater cooperation between models.
Our dataset contains both capabilities questions (i.e., questions with a
unique, uncontroversially correct answer) and attitude questions (i.e.,
questions about which decision theorists would disagree). We use our dataset
for an investigation of decision-theoretical capabilities and expressed
attitudes and their interplay in existing models (different models by OpenAI,
Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based
interventions. We find, among other things, that attitudes vary significantly
between existing models; that high capabilities are associated with attitudes
more favorable toward so-called evidential decision theory; and that attitudes
are consistent across different types of questions.
comment: 48 pages, 15 figures; code and data at
https://github.com/casparoe/newcomblike_questions_dataset
♻ ☆ Language Models as Hierarchy Encoders NeurIPS 2024
Interpreting hierarchical structures latent in language is a key limitation
of current language models (LMs). While previous research has implicitly
leveraged these hierarchies to enhance LMs, approaches for their explicit
encoding are yet to be explored. To address this, we introduce a novel approach
to re-train transformer encoder-based LMs as Hierarchy Transformer encoders
(HiTs), harnessing the expansive nature of hyperbolic space. Our method
situates the output embedding space of pre-trained LMs within a Poincaré ball
with a curvature that adapts to the embedding dimension, followed by training
on hyperbolic clustering and centripetal losses. These losses are designed to
effectively cluster related entities (input as texts) and organise them
hierarchically. We evaluate HiTs against pre-trained LMs, standard fine-tuned
LMs, and several hyperbolic embedding baselines, focusing on their capabilities
in simulating transitive inference, predicting subsumptions, and transferring
knowledge across hierarchies. The results demonstrate that HiTs consistently
outperform all baselines in these tasks, underscoring the effectiveness and
transferability of our re-trained hierarchy encoders.
comment: Accepted at NeurIPS 2024
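For readers unfamiliar with the geometry mentioned above, the standard
geodesic distance in the Poincaré ball can be sketched as below. This is a
minimal illustration assuming the unit ball (curvature -1); the abstract
describes a curvature that adapts to the embedding dimension and does not
give the clustering or centripetal loss formulas, so neither is reproduced
here. The function name poincare_distance is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the unit Poincaré ball.

    Assumes curvature -1 (a simplification; not the adaptive curvature
    described in the abstract). Both points must have Euclidean norm < 1.
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    # The same Euclidean gap costs more hyperbolic distance near the boundary.
    print(poincare_distance((0.1, 0.0), (0.0, 0.1)))
    print(poincare_distance((0.9, 0.0), (0.0, 0.9)))
```

Distances grow rapidly toward the boundary of the ball, which is the
expansive property that makes hyperbolic space a natural fit for tree-like
hierarchies.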