Downsizing AI: when computation is too much computation
Does improving the performance of machine-learning foundation models necessarily require a proportional increase in the consumption of computing resources? This assumption feeds the current narrative around generative artificial intelligence. But what if it is also digging its grave? While investors express doubts about the soundness of the hyperscalers’ capex race, it is time to focus on the next challenges for value-creating, economically sustainable AI.

The artificial intelligence (henceforth: AI) models most in vogue today are those that generate output based on probabilistic predictions from large amounts of secondary data derived from external sources. Most representative of this approach are transformer-type neural networks, on which the main large language models (LLMs) are based.
Among the actors who have embarked on this path with the greatest determination is Google, which is convinced that, with sufficient depth, these deep learning models could solve almost any problem. See, for instance, what Julian Horsey writes in ‘Geeky Gadgets’ (How Google DeepMind is Redefining AI Problem-Solving with Transformers, 23 September 2024). The key concept here is the chain of thought, i.e. the segmentation of more complex reasoning into intermediate tokens. This is the same approach followed by OpenAI with its most recent reasoning models, the o1 series.
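To give a concrete, if simplified, picture of what a chain of thought looks like (the prompt below is a made-up illustration, not taken from any model’s documentation), compare a direct question with one that asks for intermediate steps:

```python
# A minimal sketch of the chain-of-thought idea: instead of asking for the
# final answer in one step, the prompt asks the model to emit intermediate
# reasoning tokens. The prompts are illustrative and not tied to any API.

direct_prompt = (
    "A train travels 120 km in 90 minutes. "
    "What is its average speed in km/h? Answer:"
)

cot_prompt = (
    "A train travels 120 km in 90 minutes. What is its average speed in km/h?\n"
    "Let's reason step by step:\n"
    "1. Convert the time to hours: 90 minutes = 1.5 hours.\n"
    "2. Divide distance by time: 120 km / 1.5 h = 80 km/h.\n"
    "Answer: 80 km/h"
)

print(cot_prompt)
```

The gain in reliability on multi-step problems is paid for in extra generated tokens, and therefore in extra computation, a point this article returns to below.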
Google is also at the forefront of attempts to employ transformer-type models in new areas, such as support for quantum computing. News has recently arrived of the first tests of AlphaQubit, an AI system developed by Google DeepMind and Google Quantum AI to tackle one of the main challenges of quantum computing: correcting the errors produced by phenomena such as microscopic defects in hardware, heat, vibrations, electromagnetic interference and cosmic rays. Using an architecture based on transformer neural networks and trained on real experimental samples from Google’s third-generation quantum processor, Sycamore, AlphaQubit is proving capable of accurately decoding quantum errors (Johannes Bausch et al., Learning high-accuracy error decoding for quantum processors, in ‘Nature’, 20 November 2024).
Beyond transformers
It is risky to argue that transformer networks have no future. The objective of this article, however, is to highlight two open questions, which are connected to each other:
- the risk of considering transformer models the best answer to any type of problem;
- the need, in any case, to put a stop to the computational waste characteristic of transformers.
Let’s start with the first of the two questions.
We are perhaps losing sight of the possibility of creating conceptually different AI models, capable of learning from primary data collected directly from the real world, without being retrained or constantly supervised by humans. Models that, among other things, would be far more energy efficient. The idea, then, would be to go beyond the limits of the current course not by adding computational capacity, but by designing neural networks that more closely resemble the human brain and can therefore learn in real time from primary data, with a strong capacity for adaptation and self-limitation. More specifically, this means models endowed with the cognitive capacity known as ‘inhibitory control’, by which the brain suppresses or regulates automatic responses and constructs new ones using attention and reasoning.
Call it BIAI
This is the world of brain-inspired AI, known by its acronym BIAI. It is a strand that is still little explored and, in any case, forced to come to terms with a conceptual limitation: imitating the mechanisms of the human brain is difficult, since little is yet known about them. An overview of studies in this direction and their applications is offered by Jing Ren and Feng Xia (Brain-inspired Artificial Intelligence: A Comprehensive Review, in ‘arXiv’, 27 August 2024).
It must be said that, in this field, hypotheses are formulated and experiments are carried out, but a solid paradigm does not yet exist, not least because the debate is often conditioned by ideological schemes. In both camps the ranks of the reductionists are large: they do not limit themselves to trying to solve complex problems ever more efficiently, but claim to be approaching the mirage of an artificial mind capable of thinking.
The debate around so-called artificial general intelligence (AGI), which may well have its own theoretical dignity, is nowadays conducted in mostly superficial terms, so much so that it seems wise to steer clear of it. One question nevertheless remains legitimate: is the AI on which OpenAI, Google, Meta and Anthropic are staking all their cards really the best possible AI? The current foundation models, transformer-type networks in particular, exhibit two weaknesses.
The first problem is the enormous computing resource requirement associated with developing and deploying this technology, which translates into a potentially devastating energy balance. Of course, the assumption is that this can be resolved by introducing new sources of supply, such as small modular nuclear reactors (SMRs). But, for the time being, the question of the economic and environmental sustainability of the model remains (on the risks of waste management in modular nuclear reactors, see Lindsay M. Krall et al., Nuclear waste from small modular reactors, in ‘Proc. Natl. Acad. Sci. U.S.A.’, 31 May 2022).
‘Incredibly dumb’ models
The second problem is that LLMs may be reaching a performance ceiling. Or, to put it differently, in many application fields the performance of LLMs is more modest than one would like to believe. After all, even Sam Altman, CEO of OpenAI, acknowledges that current LLMs are ‘incredibly dumb’ despite appearances (James O’Donnell, Sam Altman says helpful agents are poised to become AI’s killer function, in ‘MIT Technology Review’, 1 May 2024).
The contribution of Iman Mirzadeh and others (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, in ‘arXiv’, 7 October 2024) has caused much discussion in this regard. It evaluates the mathematical capabilities of LLMs using the GSM-Symbolic benchmark, proposed as an alternative to the more traditional GSM8K. The study seems to suggest that predicting probability distributions over word sequences is not a reliable way of doing mathematics. Some have questioned a certain lack of statistical rigor in Mirzadeh’s study (see, for example, Desi R. Ivanova, On Some (Fixable) Limitations of ‘Understanding the Limitations of Mathematical Reasoning in LLMs’, 31 October 2024). But the question remains legitimate: have LLMs, dumb by Altman’s own admission, reached the limit of their capabilities?
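The intuition behind GSM-Symbolic can be conveyed with a small sketch of my own (an illustrative template, not the authors’ code or data): generate many surface variants of the same grade-school problem by swapping names and numbers, and check whether a model’s accuracy holds up across variants that leave the underlying reasoning unchanged.

```python
import random

# Sketch of the GSM-Symbolic idea: many surface variants of one word problem.
# If a model truly "reasons", its accuracy should not depend on which names
# and numbers happen to appear. (Illustrative template, not from the dataset.)

TEMPLATE = ("{name} has {a} apples. {name} buys {b} more bags with {c} apples "
            "each. How many apples does {name} have now?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Ada", "Bruno", "Chiara", "Dev"])
    a, b, c = rng.randint(2, 20), rng.randint(1, 5), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # ground truth computed symbolically
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

The study reports that performance varies, and tends to drop, across precisely this kind of variant, which is what fuels the doubt about genuine mathematical reasoning.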
A sledgehammer to crack a nut
The issue can also be seen in another way. Perhaps the point is that we are becoming intoxicated with the idea of ever greater computing capacity, when such capacity is not always necessary. And perhaps the time has come to move from bulimia to algorithmic frugality. Think of solvers for optimisation problems in which the objective function and constraints are linear expressions. The challenge today is not to achieve a ‘better’ optimum: by definition, the optimal solution is already the best of all possible solutions, given the chosen criteria and the constraints imposed. The challenge, if anything, is to identify the linear optimum with ever less computational effort.
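For a sense of how little machinery this requires, here is a toy linear programme (an arbitrary two-variable production-planning example of mine) solved with SciPy’s off-the-shelf linprog routine:

```python
from scipy.optimize import linprog

# Toy linear programme: maximise profit 3x + 5y subject to resource limits.
# linprog minimises, so the objective is negated.
c = [-3.0, -5.0]                 # objective coefficients (negated for maximisation)
A_ub = [[1.0, 2.0],              # resource 1: x + 2y <= 14
        [3.0, 1.0]]              # resource 2: 3x + y <= 18
b_ub = [14.0, 18.0]
bounds = [(0, None), (0, None)]  # x >= 0, y >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("optimal plan:", res.x, "maximum profit:", -res.fun)
```

Modern solvers handle much larger problems of this kind routinely; the frontier is solving them with less time and energy, not finding a ‘better’ optimum.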
It will be argued that the problems of our times are particularly complex and non-linear in nature, and that this is precisely why machine learning is needed: an approach that proves effective when it comes to making accurate predictions from huge data sets in unstable, multifaceted scenarios. Good point. It is no coincidence that decision intelligence today tends, in most cases, to combine mathematical optimisation techniques and machine learning. However, this does not always justify the use of huge models, especially self-supervised ones. And in any case, the observation does not conflict with the objective of improving the performance of such models while adapting their size to actual usage requirements and thus reducing computational waste.
LLMs do not only carry high computational costs; they also require large amounts of memory for training and subsequent operation. This has two consequences: high energy consumption and the need for specialized hardware.
How to measure model size
To express the size of LLMs, we generally refer to parameters. GPT-4, for instance, is estimated to have around 1.76 trillion parameters (a trillion being a thousand billion). But what is a parameter? Deep neural networks contain many nodes arranged in layers, and each node in one layer is connected to all nodes in the next. Each connection carries a weight, and each node a bias. Weights and activation values, typically represented as 16-bit (FP16) or 32-bit (FP32) floating-point numbers, contribute decisively to the size of the model, with counts reaching into the hundreds of billions. Weights and biases, together with embeddings, are known as the model parameters (an embedding is a way of representing objects, such as text, images and audio, as points in a vector space).
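As a back-of-the-envelope illustration (the parameter counts below are indicative, and the figures cover the weights alone, ignoring optimiser states and activations), the memory needed just to store the weights is the number of parameters multiplied by the bytes per parameter:

```python
# Rough memory footprint of model weights: parameters x bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Gigabytes needed to store the weights alone in the given format."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for n_params, label in [(7e9, "7B model"), (70e9, "70B model"), (1.76e12, "1.76T model")]:
    for fmt in ("FP32", "FP16", "INT8"):
        print(f"{label:>12} in {fmt}: ~{weight_memory_gb(n_params, fmt):,.0f} GB")
```

A 70-billion-parameter model in FP16, for example, already needs some 140 GB for its weights alone, which is why inference requires specialised accelerators or aggressive compression.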
However, defining the computational cost of training a model by the number of parameters alone is imprecise. It is more correct to say that the cost depends on two factors: the size of the model, expressed in parameters, and the number of training tokens. Until recently, it was usual to keep the size of the training data fixed at around 300 billion tokens and to allocate additional computational resources to increasing the number of parameters. More recently, however, researchers have begun to examine the most appropriate trade-off between model size and amount of training data for a fixed amount of computation (see, for example, Jordan Hoffmann et al., An empirical analysis of compute-optimal large language model training, in ‘Google DeepMind’, 12 April 2022). In other words: what are the optimal model size and number of training tokens for a given compute budget?
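To fix ideas, the trade-off can be sketched with two rules of thumb that circulate in this literature (approximations, not exact laws): training compute of roughly 6 × parameters × training tokens, and a compute-optimal recipe, associated with Hoffmann and colleagues’ results, of roughly twenty training tokens per parameter.

```python
import math

# Sketch of the compute-optimal trade-off, using two common approximations:
#   training compute  C ~ 6 * N * D   (N = parameters, D = training tokens)
#   compute-optimal   D ~ 20 * N      (roughly the ratio reported for Chinchilla)
# Both are rules of thumb, not exact laws.

def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    n = math.sqrt(c_flops / (6.0 * tokens_per_param))  # from C = 6 * N * (k * N)
    d = tokens_per_param * n
    return n, d

for budget in (1e21, 1e23, 1e25):  # training budgets in FLOPs
    n, d = compute_optimal(budget)
    print(f"budget {budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
```

The practical consequence is that, for a fixed budget, a smaller model trained on more tokens can outperform a larger one trained on too few.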
Different ‘slimming’ strategies
Efforts to make artificial intelligence models smaller and more efficient can be classified into four areas:
- Quantization
- Low-rank decomposition
- Pruning
- Distillation
Quantization converts weights and activation values from high-precision formats, usually FP32 or FP16, into lower-precision formats such as 8-bit integers (INT8). This is one of the most promising avenues. The different quantization approaches are well described by Arun Nanda in Reducing the Size of AI Models (‘Towards Data Science’, 7 September 2024).
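A minimal sketch of the principle, using simple symmetric post-training quantization of a single weight matrix to INT8 in NumPy (an illustration, not a production scheme of the kind surveyed by Nanda):

```python
import numpy as np

# Symmetric per-tensor quantization: map FP32 weights to INT8 with one scale.
def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    scale = np.abs(w).max() / 127.0                       # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)          # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory: FP32", w.nbytes // 1024, "KiB -> INT8", q.nbytes // 1024, "KiB")
print("mean absolute quantization error:", float(np.abs(w - w_hat).mean()))
```

Moving from FP32 to INT8 cuts the memory for those weights by a factor of four, at the cost of a small, measurable reconstruction error.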
Conceptually similar to quantization is low-rank decomposition, in which the model is compressed by replacing its weight matrices with low-rank factorizations. Here the most interesting reference is the recent work of Rajarshi Saha et al., Compressing Large Language Models using Low Rank and Low Precision Decomposition, in ‘arXiv’, 3 November 2024.
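The principle can be sketched with a plain truncated SVD of a weight matrix (the work cited above combines low-rank with low-precision factors; this sketch, run on a random stand-in matrix, shows only the low-rank part):

```python
import numpy as np

# Low-rank compression: replace a weight matrix W (m x n) with two thin
# factors A (m x r) and B (r x n) built from the top-r singular values.
def low_rank_factors(w: np.ndarray, r: int) -> tuple[np.ndarray, np.ndarray]:
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :r] * s[:r]          # m x r
    b = vt[:r, :]                 # r x n
    return a, b

m, n, r = 1024, 1024, 64
w = np.random.randn(m, n).astype(np.float32)   # random stand-in; real weight
a, b = low_rank_factors(w, r)                  # matrices usually compress better

original = m * n
compressed = m * r + r * n
print(f"parameters: {original} -> {compressed} ({compressed / original:.1%} of original)")
print("relative reconstruction error:", float(np.linalg.norm(w - a @ b) / np.linalg.norm(w)))
```

The compression comes from storing the two thin factors instead of the full matrix; how much accuracy survives depends on how quickly the singular values of the real weights decay.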
Then there is pruning, which consists of eliminating unnecessary connections in the neural network. In the approach of Hanjuan Huang and Hao-Jia Song, redundant neurons are identified and removed using an estimate based on mutual information (Large Language Model Pruning, in ‘arXiv’, undated).
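As an illustration of the general idea, the sketch below uses simple magnitude pruning, zeroing the weights with the smallest absolute value, rather than the mutual-information criterion of the cited paper:

```python
import numpy as np

# Magnitude pruning: zero out the fraction `sparsity` of weights with the
# smallest absolute value. A simpler criterion than mutual information, but
# the same underlying idea: remove what contributes least.
def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask

w = np.random.randn(1024, 1024).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.7)
print("fraction of zero weights:", float((w_pruned == 0).mean()))
```

In practice, the pruned matrix must then be stored in a sparse format, or the pruning applied structurally (whole neurons or attention heads), for the savings to materialise on real hardware.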
Finally, knowledge distillation refers to the idea of training smaller models to replicate the behaviour of larger ones. A detailed review of distillation techniques is presented by Xiaohan Xu et al. in A Survey on Knowledge Distillation of Large Language Models (in ‘arXiv’, 21 October 2024).
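The core training signal can be sketched as follows (classic soft-label distillation with a temperature, written in plain NumPy to stay self-contained; real LLM distillation pipelines of the kind covered by the survey are considerably more elaborate):

```python
import numpy as np

# Soft-label knowledge distillation: the student is trained to match the
# teacher's softened output distribution. A higher temperature T spreads the
# probability mass, exposing more of the teacher's relative preferences.
def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T: float = 2.0) -> float:
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    # KL(teacher || student), averaged over the batch, scaled by T^2 as is customary
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    return float(T * T * kl.mean())

teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])  # stand-in outputs
student_logits = np.array([[2.0, 1.5, 0.5], [0.5, 2.0, 0.8]])
print("distillation loss:", distillation_loss(student_logits, teacher_logits))
```

This loss is typically combined with the ordinary training objective on ground-truth labels, so the student learns both from the data and from the teacher.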
A winter that, in spite of ourselves, we see ahead
Of course, the techniques just mentioned can also be used in combination, following hybrid approaches. Be that as it may, the effort to reduce the computational weight of LLMs represents one of the most important trends. In the background remains the question of exploring alternatives to transformers and, more generally, to AI based on probabilistic predictions from large amounts of secondary data derived from external sources.
Finding brilliant solutions to the problem of the sustainability of large neural networks could help build up reserves against a new AI winter. A winter which, by certain signs, and in spite of ourselves (to paraphrase Marguerite Yourcenar in Memoirs of Hadrian), we see ahead.