- Where is the signal in tokenization space?
Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck
EMNLP 2024.
[abstract]
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values.
One common assumption is that the probability of a piece of text is the probability of its canonical token sequence.
However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes "Tokens" as [Tok,ens], but [Tok,en,s] also represents the same text.
In this paper, we study non-canonical tokenizations.
We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations.
We then show how the marginal is, in most cases, indistinguishable from the canonical probability.
Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space.
Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
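A minimal sketch of the quantity the abstract marginalizes over, under assumptions made only for illustration: a toy vocabulary (not Llama2's) and a placeholder `seq_logprob` standing in for an autoregressive LLM's score of a token sequence. Every segmentation of the string into tokens is enumerated and their probabilities aggregated; in practice the number of tokenizations grows exponentially, which is why the exact marginal is hard.
```python
# Illustrative sketch only: enumerate every tokenization of a short string under a
# toy vocabulary and aggregate their probabilities.
import math

VOCAB = {"Tok", "en", "ens", "s"}        # toy vocabulary (assumption)

def tokenizations(text: str):
    """Yield every segmentation of `text` into vocabulary tokens."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        if text[:i] in VOCAB:
            for rest in tokenizations(text[i:]):
                yield [text[:i]] + rest

def seq_logprob(tokens):
    """Placeholder for an LLM's log-probability of a token sequence."""
    return -2.0 * len(tokens)            # dummy score, for illustration only

def marginal_logprob(text: str) -> float:
    """log sum_t p(t) over all tokenizations t of `text`."""
    return math.log(sum(math.exp(seq_logprob(t)) for t in tokenizations(text)))

print(list(tokenizations("Tokens")))     # [['Tok', 'en', 's'], ['Tok', 'ens']]
print(marginal_logprob("Tokens"))        # aggregated (marginal) log-probability
```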
- Scaling Tractable Probabilistic Circuits: A Systems Perspective
Anji Liu, Kareem Ahmed, Guy Van den Broeck
ICML 2024.
[abstract]
Probabilistic Circuits (PCs) are a general framework for tractable deep generative models, which support exact and efficient probabilistic inference on their learned distributions.
Recent modeling and training advancements have enabled their application to complex real-world tasks. However, the time and memory inefficiency of existing PC implementations hinders further scaling up.
This paper proposes PyJuice, a general GPU implementation design for PCs that improves prior art in several regards. Specifically, PyJuice is 1-2 orders of magnitude faster than existing systems (including very recent ones) at training large-scale PCs.
Moreover, PyJuice consumes 2-5x less GPU memory, which enables us to train larger models.
At the core of our system is a compilation process that converts a PC into a compact representation amenable to efficient block-based parallelization, which significantly reduces IO and makes it possible to leverage Tensor Cores available in modern GPUs.
Empirically, PyJuice can be used to improve state-of-the-art PCs trained on image (e.g., ImageNet32) and language (e.g., WikiText, CommonGen) datasets.
We further establish a new set of baselines on natural image and language datasets by benchmarking existing PC structures but with much larger sizes and more training epochs, with the hope of incentivizing future research.
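For context, a bare-bones sketch of what a probabilistic circuit computes. This is deliberately naive, node-by-node Python and is not the PyJuice API; compiling such DAGs into block-parallel GPU kernels is precisely what removes this per-node overhead.
```python
# Naive sketch of probabilistic-circuit evaluation (sum/product/Bernoulli-leaf nodes).
import torch

class Leaf:                                   # Bernoulli leaf over one binary variable
    def __init__(self, var, p): self.var, self.p = var, p
    def forward(self, x): return x[:, self.var] * self.p + (1 - x[:, self.var]) * (1 - self.p)

class Product:                                # product node: independent children
    def __init__(self, children): self.children = children
    def forward(self, x):
        out = torch.ones(x.shape[0])
        for c in self.children:
            out = out * c.forward(x)
        return out

class Sum:                                    # sum node: weighted mixture of children
    def __init__(self, children, weights): self.children, self.weights = children, weights
    def forward(self, x): return sum(w * c.forward(x) for w, c in zip(self.weights, self.children))

# p(x0, x1) as a mixture of two fully factorized components (toy parameters)
pc = Sum([Product([Leaf(0, 0.9), Leaf(1, 0.2)]),
          Product([Leaf(0, 0.1), Leaf(1, 0.7)])], weights=[0.6, 0.4])
x = torch.tensor([[1, 0], [0, 1]])
print(pc.forward(x))                          # exact likelihoods in one pass over the DAG
```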
- Probabilistically Rewired Message-Passing Neural Networks
Chendi Qian*, Andrei Manolache*, Kareem Ahmed, Zhe Zeng, Guy Van den Broeck, Mathias Niepert*, Christopher Morris*
ICLR 2024.
[abstract]
Message-passing graph neural networks (MPNNs) emerged as powerful tools for
processing graph-structured input. However, they operate on a fixed input graph
structure, ignoring potential noise and missing information. Furthermore, their
local aggregation mechanism can lead to problems such as over-squashing and
limited expressive power in capturing relevant graph structures. Existing solutions
to these challenges have primarily relied on heuristic methods, often disregarding
the underlying data distribution. Hence, devising principled approaches for learning
to infer graph structures relevant to the given prediction task remains an open
challenge. In this work, leveraging recent progress in exact and differentiable k-subset sampling, we devise probabilistically rewired MPNNs (PR-MPNNs), which
learn to add relevant edges while omitting less beneficial ones. For the first time,
our theoretical analysis explores how PR-MPNNs enhance expressive power, and
we identify precise conditions under which they outperform purely randomized
approaches. Empirically, we demonstrate that our approach effectively mitigates
issues like over-squashing and under-reaching. In addition, on established real-world datasets, our method exhibits competitive or superior predictive performance
compared to traditional MPNN models and recent graph transformer architectures.
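A hedged sketch of the rewiring step described above: candidate edges are scored and exactly k are kept. Gumbel-top-k is used here only as a simple stand-in for the exact, differentiable k-subset samplers PR-MPNNs rely on; `edge_scores` and `k` are illustrative names.
```python
# Sketch of probabilistic rewiring: keep exactly k candidate edges based on learned scores.
import torch

def rewire(edge_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Return a {0,1} mask over candidate edges with exactly k ones."""
    gumbel = -torch.log(-torch.log(torch.rand_like(edge_scores)))
    mask = torch.zeros_like(edge_scores)
    mask[torch.topk(edge_scores + gumbel, k).indices] = 1.0
    return mask                               # the MPNN then runs on the rewired graph

scores = torch.randn(10)                      # scores for 10 candidate edges (assumption)
print(rewire(scores, k=3))
```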
- A Pseudo-Semantic Loss for Deep Generative Models with Logical Constraints
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
NeurIPS 2023.
[abstract]
Neuro-symbolic approaches bridge the gap between purely symbolic and neural approaches to learning.
This often requires maximizing the probability of a symbolic constraint in the neural network's output. However, output distributions are typically assumed to be fully-factorized, which prohibits the application of neuro-symbolic learning to more expressive output
distributions, such as autoregressive deep generative models. There, such probability computation is #P-hard,
even for simple constraints.
Instead, we propose to locally approximate the probability of the symbolic constraint under the pseudolikelihood distribution—the product
of its full conditionals given a sample from the model.
This allows our pseudo-semantic loss function to enforce the symbolic constraint.
Our method is related to several classical approximation schemes, including hogwild Gibbs sampling, consistent pseudolikelihood learning, and contrastive divergence.
We test our proposed approach on three distinct settings: Sudoku, shortest-path prediction, and detoxifying
large language models. Experiments show that pseudo-semantic loss greatly improves upon the base model's ability to satisfy the desired logical
constraint in its output distribution.
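A brute-force sketch of the quantity being approximated, with invented names (`conditionals`, `constraint`): around a model sample, each variable keeps its full conditional, and the constraint's probability is computed under the resulting product distribution. Real constraints would be handled with a circuit rather than enumeration.
```python
# Illustrative sketch, not the paper's implementation: probability of a constraint
# under the pseudolikelihood surrogate, i.e. the product of full conditionals around
# a model sample, computed here by brute force over a handful of binary variables.
import itertools
import torch

def pseudo_constraint_prob(conditionals: torch.Tensor, constraint) -> torch.Tensor:
    """conditionals[i] = p(y_i = 1 | y_{-i}); constraint maps a {0,1}-tuple to bool."""
    total = torch.tensor(0.0)
    for y in itertools.product([0, 1], repeat=conditionals.shape[0]):
        if constraint(y):
            on = torch.tensor(y, dtype=torch.bool)
            total = total + torch.where(on, conditionals, 1 - conditionals).prod()
    return total

conds = torch.tensor([0.8, 0.3, 0.1])        # toy full conditionals (assumption)
p = pseudo_constraint_prob(conds, lambda y: sum(y) == 1)   # "exactly one variable on"
print(-torch.log(p))                         # the pseudo-semantic loss for this sample
```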
- A Unified Approach to Count-Based Weakly-Supervised Learning
Vinay Shukla, Zhe Zeng*, Kareem Ahmed*, Guy Van den Broeck
NeurIPS 2023.
[abstract]
We observe that in weakly supervised learning for binary classification, the weak supervision can be seen as a constraint on label counts.
Inspired by this observation, we aim to develop a unified approach to count-based weakly supervised learning.
From first principles, we derive objectives for three of the most common weakly supervised learning paradigms for binary classification
and show that these objectives share the same computational building block, which is the probability of an arithmetic constraint defined
over the count of positive instances, i.e., the summation of binary labels. We further derive an algorithm to compute the count
probabilities in an exact and tractable way, which allows our proposed objectives to be integrated into the end-to-end training of neural
network models. A thorough empirical evaluation across all three weakly supervised learning paradigms shows that our proposed
objectives, computed exactly and tractably, achieve state-of-the-art or highly comparable results on various tasks.
We also provide experimental analysis illustrating that models learn effectively with our proposed objectives.
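The shared building block mentioned in the abstract lends itself to a compact sketch: the exact distribution of the label count under per-instance probabilities, obtained by a dynamic program. Names and the toy probabilities below are illustrative, not taken from the paper.
```python
# Sketch of the count-probability building block: q[k] = P(sum_i y_i = k) computed by a
# dynamic program over per-instance probabilities. Being pure tensor ops, it is
# differentiable and can sit inside a training loss.
import torch

def count_distribution(p: torch.Tensor) -> torch.Tensor:
    """p[i] = P(y_i = 1); returns q with q[k] = P(sum_i y_i = k)."""
    q = torch.zeros(p.shape[0] + 1)
    q[0] = 1.0
    for pi in p:                           # fold in one instance at a time
        q = torch.cat([q[:1] * (1 - pi),
                       q[1:] * (1 - pi) + q[:-1] * pi])
    return q

probs = torch.tensor([0.9, 0.2, 0.7])      # hypothetical classifier outputs
q = count_distribution(probs)
print(q, q.sum())                          # P(count = 0..3); sums to 1
```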
- SIMPLE: A Gradient Estimator for k-Subset Sampling
Kareem Ahmed*, Zhe Zeng*, Mathias Niepert, Guy Van den Broeck
ICLR 2023.
[abstract]
k-subset sampling is ubiquitous in machine learning, enabling regularization and interpretability through sparsity. The challenge lies
in rendering k-subset sampling amenable to end-to-end learning. This has typically involved relaxing the reparameterized samples to allow
for backpropagation, with the risk of introducing high bias and high variance. In this work, we fall back to discrete k-subset sampling on
the forward pass. This is coupled with using the gradient with respect to the exact marginals, computed efficiently, as a proxy for the true
gradient. We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators, including the
straight-through Gumbel estimator when k=1. Empirical results show improved performance on learning to explain and sparse linear regression.
We provide an algorithm for computing the exact ELBO for the k-subset distribution, obtaining significantly lower loss compared to SOTA.
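A hedged sketch of the estimator's wiring, not the released implementation: the forward pass outputs a hard k-hot sample (Gumbel-top-k here is only a stand-in for exact k-subset sampling), while the backward pass routes gradients through the exact marginals, obtained by differentiating the log-partition function of the k-subset distribution.
```python
# Sketch of a SIMPLE-style estimator: discrete sample forward, exact marginals backward.
import torch

def log_partition(logits: torch.Tensor, k: int) -> torch.Tensor:
    """log e_k(exp(logits)): sum over all k-subsets of the products of weights."""
    w = logits.exp()
    e = torch.zeros(k + 1)
    e[0] = 1.0
    for wi in w:                            # DP over elementary symmetric polynomials
        e = torch.cat([e[:1], e[1:] + wi * e[:-1]])
    return e[k].log()

def simple_like_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
    # exact marginals mu_i = P(x_i = 1 | k ones) as the gradient of log Z_k
    (mu,) = torch.autograd.grad(log_partition(logits, k), logits, create_graph=True)
    with torch.no_grad():                   # hard k-hot sample (Gumbel-top-k stand-in)
        gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
        hard = torch.zeros_like(logits)
        hard[torch.topk(logits + gumbel, k).indices] = 1.0
    return hard + mu - mu.detach()          # value: hard sample; gradient: through mu

theta = torch.randn(8, requires_grad=True)
z = simple_like_sample(theta, k=3)
(z * torch.arange(8.0)).sum().backward()    # a downstream loss; gradients flow via mu
print(z, theta.grad)
```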
- Semantic Strengthening of Neuro-Symbolic Learning
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
AISTATS 2023.
[abstract]
Numerous neuro-symbolic approaches have recently been proposed, typically with the goal of adding symbolic knowledge to the output layer
of a neural network. Ideally, such losses maximize the probability that the neural network's predictions satisfy the underlying domain constraints.
Unfortunately, this type of probabilistic inference is often computationally infeasible. Neuro-symbolic approaches therefore commonly
resort to fuzzy approximations of this probabilistic objective, sacrificing sound probabilistic semantics, or to sampling, which is very
seldom feasible. We approach the problem by first assuming the constraint decomposes conditioned on the features learned by the network.
We iteratively strengthen our approximation, restoring the dependence between the constraints most responsible for degrading the quality
of the approximation. This corresponds to computing the mutual information between pairs of constraints conditioned on the network's
learned features, and may be construed as a measure of how well aligned the gradients of two distributions are. We show how to compute
this efficiently for tractable circuits. We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a
minimum-cost perfect matching, and solving Sudoku puzzles, observing that it improves upon the baselines while sidestepping intractability.
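In assumed notation (the paper's symbols may differ), the starting approximation and the conditional mutual information that guides which pair of constraints to re-couple are:
```latex
% Independence approximation over constraints alpha_1, ..., alpha_m given features z,
% and the conditional mutual information used to decide which pair to merge.
\[
  p(\alpha_1 \wedge \cdots \wedge \alpha_m \mid \mathbf{z})
  \;\approx\; \prod_{i=1}^{m} p(\alpha_i \mid \mathbf{z}),
  \qquad
  I(\alpha_i ; \alpha_j \mid \mathbf{z})
  \;=\; \sum_{a_i, a_j \in \{0,1\}} p(a_i, a_j \mid \mathbf{z})
        \log \frac{p(a_i, a_j \mid \mathbf{z})}{p(a_i \mid \mathbf{z})\, p(a_j \mid \mathbf{z})}.
\]
```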
- Semantic Probabilistic Layers for Neuro-Symbolic Learning
Kareem Ahmed, Stefano Teso, Kai-Wei Chang, Guy Van den Broeck, Antonio Vergari
NeurIPS 2022.
[abstract]
We design a predictive layer for structured-output prediction (SOP) that can be plugged into any neural network, guaranteeing that
its predictions are consistent with a set of predefined symbolic constraints. Our Semantic Probabilistic Layer (SPL) can model
intricate correlations, and hard constraints, over a structured output space all while being amenable to end-to-end learning
via maximum likelihood. SPLs combine exact probabilistic inference with logical reasoning in a clean and modular way, learning
complex distributions and restricting their support to solutions of the constraint. As such, they can faithfully, and efficiently,
model complex SOP tasks beyond the reach of alternative neuro-symbolic approaches. We empirically demonstrate that SPLs outperform
these competitors in terms of accuracy on challenging SOP tasks including hierarchical multi-label classification, pathfinding and
preference learning, while retaining perfect constraint satisfaction.
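In assumed notation, the layer's predictive distribution multiplies an expressive (circuit) distribution q by a constraint circuit c acting as an indicator, renormalized over valid outputs:
```latex
% SPL's predictive distribution: g(x) are the network's features; the normalizer
% stays tractable when q and c are compatible circuits.
\[
  p_\Theta(\mathbf{y} \mid \mathbf{x})
  \;=\;
  \frac{q_\Theta\!\left(\mathbf{y} \mid g(\mathbf{x})\right)\, c(\mathbf{x}, \mathbf{y})}
       {\sum_{\mathbf{y}'} q_\Theta\!\left(\mathbf{y}' \mid g(\mathbf{x})\right)\, c(\mathbf{x}, \mathbf{y}')}
\]
```
Any prediction violating the constraint thus receives probability zero by construction.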
- Neuro-Symbolic Entropy Regularization
Kareem Ahmed, Eric Wang, Kai-Wei Chang, Guy Van den Broeck
UAI 2022.
Oral Presentation.
[abstract]
In structured prediction, the goal is to jointly predict many output variables that together encode a structured object—
a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires
vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach—entropy
regularization—posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled
examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that
not every prediction corresponds to a valid structure in the output space. Yet, they do not further restrict the learned output
distribution. This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization,
that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution
over only valid structures. This loss is efficiently computed when the output constraint is expressed as a tractable logic circuit. Moreover,
it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on
a series of semi-supervised and fully-supervised structured-prediction experiments, where we find that it leads to models whose predictions
are more accurate and more likely to be valid.
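Written out in assumed notation, the proposed regularizer is the entropy of the output distribution restricted to structures satisfying the constraint alpha:
```latex
% Neuro-symbolic entropy regularization: entropy of the model's distribution
% conditioned on satisfying the constraint alpha.
\[
  \mathcal{L}_{\mathrm{ent}}(\mathbf{x})
  \;=\;
  -\sum_{\mathbf{y} \,\models\, \alpha}
     p(\mathbf{y} \mid \mathbf{x}, \mathbf{y} \models \alpha)\,
     \log p(\mathbf{y} \mid \mathbf{x}, \mathbf{y} \models \alpha),
  \qquad
  p(\mathbf{y} \mid \mathbf{x}, \mathbf{y} \models \alpha)
  = \frac{p(\mathbf{y} \mid \mathbf{x})\,\mathbf{1}[\mathbf{y} \models \alpha]}
         {\sum_{\mathbf{y}' \models \alpha} p(\mathbf{y}' \mid \mathbf{x})}.
\]
```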
- PYLON: A PyTorch Framework for Learning with Constraints
Kareem Ahmed, Tao Li, Thy Ton, Quan Guo, Kai-Wei Chang, Parisa Kordjamshidi, Vivek Srikumar, Guy Van den Broeck, Sameer Singh
AAAI 2021, NeurIPS 2021.
[abstract]
Deep learning excels at learning low-level task information from large amounts of data, but struggles with learning high-level domain knowledge,
which can often be directly and succinctly expressed. In this work, we introduce Pylon, a neuro-symbolic training framework that builds on PyTorch
to augment procedurally trained neural networks with declaratively specified knowledge. Pylon allows users to programmatically specify constraints
as PyTorch functions, and compiles them into a differentiable loss, thus training predictive models that fit the data whilst satisfying the specified
constraints. Pylon includes both exact and approximate compilers to efficiently compute the loss, employing fuzzy logic, sampling methods,
and circuits, ensuring scalability even to complex models and constraints. A guiding principle in designing Pylon has been the ease with which any
existing deep learning codebase can be extended to learn from constraints using only a few lines: a function expressing the constraint and a single
line of code to compile it into a loss. We include case studies from natural language processing, computer vision, logical games, and knowledge graphs
that can be interactively trained and that highlight Pylon's usage.
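To convey the programming model only (the actual Pylon API and compilers may differ), here is a toy sketch in which the constraint is an ordinary function over decoded predictions and a naive sampling-based "compiler" turns it into a differentiable loss; `xor_constraint` and `sampling_loss` are invented names for illustration.
```python
# Not the Pylon API: a minimal sampling-based compilation of a user-written constraint
# into a differentiable loss over a factorized output distribution.
import torch

def xor_constraint(y: torch.Tensor) -> torch.Tensor:
    """User-written constraint: the two binary outputs must disagree."""
    return y[..., 0] != y[..., 1]

def sampling_loss(constraint, logits: torch.Tensor, num_samples: int = 128) -> torch.Tensor:
    """Naive surrogate: raise the likelihood of sampled assignments satisfying the constraint."""
    probs = torch.sigmoid(logits)
    with torch.no_grad():
        samples = torch.bernoulli(probs.expand(num_samples, -1))
        ok = constraint(samples).float()          # 1 where the sample satisfies it
    logp = (samples * probs.log() + (1 - samples) * (1 - probs).log()).sum(-1)
    return -(ok * logp).sum() / ok.sum().clamp(min=1.0)

logits = torch.zeros(2, requires_grad=True)       # two binary output variables
loss = sampling_loss(xor_constraint, logits)
loss.backward()                                   # nudges the outputs toward satisfying XOR
print(loss.item(), logits.grad)
```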
- Leveraging Unlabeled Data for Entity-Relation Extraction through Probabilistic Constraint Satisfaction
Kareem Ahmed, Eric Wang, Kai-Wei Chang, Guy Van den Broeck
arXiv preprint, 2021.
[abstract]
We study the problem of entity-relation extraction in the presence of symbolic domain knowledge. Such knowledge takes the
form of an ontology defining relations and their permissible arguments. Previous approaches set out to integrate such
knowledge into learning either through self-training or through approximations that lose the precise meaning
of the logical expressions. By contrast, our approach employs the semantic loss, which captures the precise meaning of a logical
sentence by maintaining a probability distribution over all possible states and guiding the model toward solutions that
minimize constraint violations. With a focus on low-data regimes, we show that semantic loss outperforms the baselines
by a wide margin.
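For reference, the semantic loss used here can be written as follows, in standard notation: alpha is the logical encoding of the ontology constraints and the p_i are the network's per-variable probabilities of a fully-factorized output distribution.
```latex
% Semantic loss: negative log-probability, under the factorized output distribution,
% of drawing an assignment that satisfies the logical sentence alpha.
\[
  \mathcal{L}_{\mathrm{sem}}(\alpha, \mathbf{p})
  \;=\;
  -\log \sum_{\mathbf{y} \,\models\, \alpha}
     \prod_{i}\, p_i^{\,y_i}\,(1 - p_i)^{\,1 - y_i}.
\]
```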