What is a foundation model, and what can we use it for?

2026-02-19

Foundation models for natural language, the basis for today's highly successful LLM applications, are among the most impactful technical achievements of this century. But as they suffuse economic life vertically, radically restructuring text-based professions, a different question poses itself: how and where do we scale them horizontally, across domains and data types? This blog post will characterise foundation models in a general way, and then address this question.

What is a foundation model?

To understand what makes a foundation model distinct from other AI or machine learning models, it is important to understand the conventional distinctions and concepts in machine learning. Historically, machine learning and AI researchers grouped learning algorithms into three categories: supervised learning, unsupervised learning and reinforcement learning. We will briefly discuss each of them to give context to the subsequent development of the foundation model approach.

Classical machine learning

#TLDR: classifiers, clustering, RL

In supervised learning, the algorithm learns a statistical relationship between a set of input values and an output value or set of values. For example, a deep neural network might take the RGB values of an image of fixed dimensions as input and output a classification of the contents of that image across a fixed set of categories, such as “muffin” and “chihuahua”.
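
As a minimal sketch of this setup, assuming scikit-learn is available, the snippet below trains a simple classifier on synthetic data; the feature vectors stand in for image pixels, and the two classes for “muffin” and “chihuahua”.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 examples, 20 input features, 2 classes.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a classifier on the labelled training split, evaluate on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```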

In unsupervised learning there are no output variables or values; instead, the goal is to extract some of the structure in the input variables, such as their cross-correlations and any groupings or clusters that might be contained in the data. In analysing users' movie ratings, for example, we might find groups with distinct tastes, each liking, disliking or simply not engaging with different genres.
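
A minimal sketch of this idea, again assuming scikit-learn, with randomly generated stand-in data in place of real user-by-genre ratings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: 200 users rating 5 genres, ratings in [0, 1]. With real
# ratings, the clusters would correspond to groups with similar tastes.
rng = np.random.default_rng(0)
ratings = rng.random((200, 5))

# No labels anywhere: the algorithm only groups similar rating vectors.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ratings)
print(np.bincount(labels))  # number of users assigned to each taste cluster
```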

In reinforcement learning, the algorithm is not fed all of the data at once, but is instead exposed to a system that feeds it data in the form of rewards and punishments, given in response to actions the algorithm takes in various contexts. Imagine an algorithm playing Pac-Man and learning that moving left whenever a ghost approaches from the right is better than moving right, through the bitter experience of losing whenever it does the opposite.
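
The toy tabular Q-learning loop below illustrates this reward-driven learning; the two-state, two-action “world” and its rewards are made-up stand-ins for the Pac-Man example, not a real environment.

```python
import random

# Toy Q-learning: action 0 ("move left") is rewarded, action 1 is punished.
q = {(s, a): 0.0 for s in range(2) for a in range(2)}
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

state = 0
for step in range(1000):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        action = random.randint(0, 1)
    else:
        action = max((0, 1), key=lambda a: q[(state, a)])
    reward = 1.0 if action == 0 else -1.0  # hard-coded toy reward signal
    next_state = random.randint(0, 1)
    # Standard Q-learning update from the observed transition.
    best_next = max(q[(next_state, a)] for a in (0, 1))
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = next_state

print(q)  # entries with action 0 end up with higher learned values
```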

Foundation models

#TLDR: foundation models are characterised by self-supervised learning, large data and multiple downstream applications, NOT any specific architecture

Self-supervised learning

For a long time, approaches that didn’t or couldn’t leverage labels or rewards to teach the model what to look for were considered unsupervised. A subset of these, however, were mathematically identical to supervised learning, in that the signal that teaches the model was computed by comparing a predicted value with some “ground-truth” value. Only because the ground-truth values weren’t “labels” as such, but all or part of the input data itself, were these methods considered unsupervised. A new term was coined to identify these ‘in-between’ models: they are self-supervised models, and they learn through self-supervised learning. Self-supervised learning learns the data distribution by pushing the data through a model and reconstructing the whole or a part of it on the other side. In between, it compels the model to learn representations that are useful for that reconstruction task, and to compress as much information as possible into the parameters of the model. Self-supervised learning is therefore also closely related to compression. This is the first characteristic of a foundation model: it is trained using self-supervised learning.
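
To make this concrete, here is a minimal sketch of the most common self-supervised objective for text, next-token prediction, in PyTorch. The “labels” are nothing but the input sequence shifted by one position; the tiny embedding-plus-linear model is an illustrative stand-in for a real network.

```python
import torch
import torch.nn.functional as F

# Stand-in "model": embeds token ids and maps them to vocabulary logits.
vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),  # per-position vocabulary logits
)

tokens = torch.randint(0, vocab_size, (4, 16))   # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted by one
logits = model(inputs)                           # (4, 15, vocab_size)

# Supervised-style loss, but the "ground truth" is just the data itself.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```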

Training on large data

Large language models (LLMs) are the paradigmatic products of foundation model training, and they are trained on very large data: typically the number of text tokens (single- or multi-character subunits of words) on which a model is trained lies between 2 and 20 trillion. This corresponds to between 8 and 80 terabytes of data, without compression. The size of the training data and the self-supervised learning approach are mutually enabling: if self-supervised learning weren’t feasible, it would be impossible to label an appropriate dataset of anything like the size we are talking about here. On the other hand, it is only because the data is so massive that the self-supervised approach works as well as it does: by feeding an enormous amount of data through (in relative terms) few model weights, the model weights are forced to encode dense representations of the input and output space. This is the second essential attribute of a foundation model: the data has to be large enough that labelling is completely out of the question, and large relative to the number of parameters. In essence, foundation models enable learning from data so large that no other approach can be seriously considered.
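
The byte figures follow from a back-of-the-envelope calculation, assuming roughly 4 bytes of raw text per token; the exact ratio depends on the tokeniser and the language mix:

```python
# Back-of-the-envelope check of the dataset sizes above. The 4 bytes per
# token is an assumption, not a universal constant.
BYTES_PER_TOKEN = 4

for tokens in (2e12, 20e12):  # 2 and 20 trillion tokens
    terabytes = tokens * BYTES_PER_TOKEN / 1e12
    print(f"{tokens / 1e12:.0f}T tokens ~ {terabytes:.0f} TB uncompressed")
# -> 2T tokens ~ 8 TB, 20T tokens ~ 80 TB
```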

Diverse downstream tasks

Once a foundation model has been trained on all this data, it can be adapted for many narrower tasks: for summarisation or text classification, for creating engaging dialogue in a chat interface, for mathematics or coding, and many others. All of these applications leverage the representations that were learnt at the foundation model training stage, and augment them with more specific examples and objectives that teach the model more specific behaviours and the distinctions that enable it to perform the task at hand. For example, to teach a model to behave as a useful assistant, giving it examples of how such a dialogue would proceed will move the weights in a direction that makes it more likely for the model itself to produce such behaviour. Enabling diverse downstream tasks is the third important feature of foundation models. The aim of foundation model training is to learn representations from the data, which can subsequently be used for various downstream tasks.
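
Sketched below, under the same stand-in assumptions as before, is what this adaptation looks like mechanically: the same next-token loss, but computed on curated dialogue examples, starting from pretrained weights and using a low learning rate so that the weights are nudged rather than overwritten.

```python
import torch
import torch.nn.functional as F

# Stand-in for a *pretrained* model; in practice you would load real weights.
vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # low LR: nudge, don't overwrite

for _ in range(10):  # each iteration: one batch of (stand-in) dialogue examples
    tokens = torch.randint(0, vocab_size, (4, 16))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```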

The transformer architecture

#TLDR: nonetheless, the transformer is very good

There is a fourth attribute that most LLMs and foundation models share: the transformer architecture, in its many variants, with attention that is quadratic in sequence length. The transformer architecture was published in 2017, and initially aimed to improve the performance of translation systems. A variant called the “decoder-only” or “causal” transformer became very successful for modelling a single language, or multiple languages without explicitly distinguishing between them. Subsequently, many changes to the specifics, and even to larger architectural components, have been found to improve performance in this single-stream language modelling paradigm, but all of them still use the core innovation of the original transformer: repeated attention layers with attention heads that learn non-linear and content-dependent relationships between input tokens at various “positions” within the input. Given this context, it is natural to expect the transformer to be a core attribute of a foundation model, but I would argue it is not: the three criteria laid out above are sufficient to functionally define what a foundation model is, and the specific architecture that performs that function is incidental. Recently, we have seen moves in this direction, with the training of foundation models for text that use diffusion, state-space modelling, or recurrent neural networks. Variational auto-encoders trained on massive data are also foundation models by this definition. That being said, the causal transformer (a model that predicts the next item in a sequence) remains the foundation model architecture to beat in many application domains.
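
For illustration, here is a minimal causal self-attention sketch using PyTorch’s built-in multi-head attention module; the sizes are illustrative assumptions. The causal mask is what makes each position attend only to itself and earlier positions, which is what enables next-token prediction.

```python
import torch

# Illustrative sizes for a single causal self-attention layer.
batch, seq_len, d_model, n_heads = 2, 16, 64, 4

attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)

# Boolean causal mask: True entries are *blocked*, so each position can
# only attend to itself and earlier positions.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=mask)  # self-attention: queries = keys = values
print(out.shape)  # torch.Size([2, 16, 64])
```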

What can we use foundation models for?

#TLDR: almost everything.

There are only two requirements for the foundation model approach to be potentially fruitful:

  1. There must be enough data
  2. There must be enough interesting structure within the data that can be learnt by the algorithm selected for that task

In practice, a third might be added:

  3. There are sufficient downstream tasks that can leverage the representations learned at the foundation model stage, and that are of sufficient value to justify the investment in training a foundation model

A large number of domains offer promising applications of this technology, where all three criteria hold: in the biological sciences and biochemistry, in economic and financial modelling, in robotics and industrial automation, and in the behavioural data of various types that companies and organisations hold in varied formats. In each of these domains, there are myriad datasets with different characteristics, and different downstream tasks a foundation model could enable or improve upon.

Sequifier

#TLDR: a package for configurable, reproducible foundation model training across domains

The number of potential applications and ways of structuring learning problems can feel overwhelming: it is a task for multiple research communities to figure out where this approach is fruitful, and the best way to go about training models in these domains. Even then, it will take several years of concerted effort to converge on best practices for each individual domain, just as it took years of work by thousands of researchers and engineers to develop modern LLM technologies.

The mission of sistemalabs is to shorten these journeys in the general direction of global optima by making experimentation on these data reliable, structured and comparable. This allows for quicker development cycles, lower likelihood of bugs, and within- and cross-domain learning about training recipes and hyperparameter configurations.

The means to these ends is sequifier, the configurable causal transformer training and inference framework. It makes the input and output variables, the model hyperparameters, the loss calculation and the learning rate scheduling configurable, while implementing a modern multivariate causal transformer on the inside. Writing the configuration that produces the exact approach you have in mind is not without complexity, but it is still much faster than starting with an empty repository. We have done our best to make it as usable as possible, and the remaining complexity aims to give domain researchers the flexibility to pursue their ideas fully.
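
To give a feel for what “configurable” means here, the sketch below shows the kind of training configuration such a framework might accept. The field names and values are purely illustrative assumptions, not sequifier’s actual schema; consult the project’s documentation for the real configuration format.

```python
# Purely illustrative: a hypothetical training configuration of the kind
# a configurable transformer framework might accept. These field names
# are assumptions for illustration, NOT sequifier's actual schema.
training_config = {
    "data_path": "data/events.parquet",        # hypothetical preprocessed dataset
    "input_columns": ["item_id", "category"],  # multivariate inputs
    "target_columns": ["item_id"],             # what the model predicts
    "model": {
        "seq_length": 128,
        "d_model": 256,
        "n_heads": 8,
        "n_layers": 6,
    },
    "optimizer": {"lr": 3e-4, "scheduler": "cosine"},
    "epochs": 10,
}
```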

The vision does not end with sequifier: once the statistical properties of a dataset or problem domain are better understood, it might well make sense to implement a diffusion model, a recurrent neural network (RNN) or something else. But as a first pass at a sequence modelling problem, the causal transformer is promising, and sequifier is here to make it as fast and painless as possible to train one.

To learn more about sequifier, check it out on GitHub :)