Predictive coding I: Introduction
2024 Philosophy Compass, e12950
Last updated 20 March 2024
Predictive coding – sometimes also known as ‘predictive processing’, ‘free energy minimisation’, or ‘prediction error minimisation’ – claims to offer a complete, unified theory of cognition that stretches all the way from cellular biology to phenomenology. However, the exact content of the view, and how it might achieve its ambitions, is not clear. This series of articles examines predictive coding and attempts to identify its key commitments and justification. The present article begins by focusing on possible confounds with predictive coding: claims that are often identified with predictive coding, but which are not predictive coding. These include the idea that the brain employs an efficient scheme for encoding its incoming sensory signals; that perceptual experience is shaped by prior beliefs; that cognition involves minimisation of prediction error; that the brain is a probabilistic inference engine; and that the brain learns and employs a generative model of the world. These ideas have garnered widespread support in modern cognitive neuroscience, but it is important not to conflate them with predictive coding.
1 Introduction
Predictive coding is a computational model of cognition. Like other computational models, it attempts to explain human thought and behaviour in terms of computations performed by the brain. It differs from more traditional approaches in at least three respects. First, it aspires to be comprehensive: it aims to explain, not just one domain of human cognition, but all of it – perception, motor control, decision making, planning, reasoning, attention, and so on. Second, it aims to unify: rather than explain cognition in terms of many different kinds of computation, it explains by appeal to a single, unified computation – one computational task and one computational algorithm are claimed to underlie all aspects of cognition. Third, it aims to be complete: it offers not just part of the story about cognition, but one that stretches all the way from the details of neuromodulator release to abstract principles of rational action governing whole agents.1
However, understanding precisely what predictive coding says, and whether it can achieve these ambitions, is not straightforward. For one thing, the term ‘predictive coding’ means different things to different people.2 For another, important features of the view, whatever its name, are liable to change or are underspecified in important respects. In this article and those that follow it, my aim is to sketch what predictive coding is, and how it might fulfil these ambitions.
I argue that predictive coding should be understood as a loose alliance of three claims. These claims, each of which may be precisified or qualified in a variety of ways, are made at Marr’s computational, algorithmic, and implementation levels of description.3 At Marr’s computational level, the claim is that the computational task facing the brain is to minimise sensory prediction error. At the algorithmic level, the claim is that the algorithm by which our brain attempts to solve this task involves the action of a hierarchical network of abstract prediction and error units. This network may be viewed, in a further step, as running a variational algorithm for approximate Bayesian inference. At Marr’s implementation level, the claim is that the physical resources that implement the algorithm are primarily located in the neocortex: anatomically distinct cell populations inside neocortical areas implement distinct prediction and error units.
Each of these claims needs to be qualified in certain respects and supplemented by further details. Each needs to be stated more precisely and ideally associated with a quantitative mathematical formalisation. A path needs to be forged from the claims to supporting empirical evidence. Finally, one needs to show that the resultant model delivers the kinds of benefits originally promised – a comprehensive, unifying, and complete account of cognition. Different researchers within the predictive coding community have different opinions about how to do this, and many details are currently left open. This means that the exact commitments of predictive coding are, to put it mildly, contentious. For these reasons, it is more accurate to think of predictive coding as an ongoing research programme rather than a mature theory that can be fully stated now. The aim of the research programme is to articulate and defend some sophisticated – likely heavily modified and precisified – descendant of the three claims above. As with any such programme, the merits of predictive coding should be judged in the round and, to some degree, prospectively: not just in terms of the raw predictive power and confirmation of what it says now, but also in terms of its future potential, and its ability to inspire and guide fruitful research.4
Before saying what predictive coding is, it is first helpful to say what it is not. In this article, I outline five ideas that are often presented alongside predictive coding, but which should be distinguished from it. In the three articles that follow, I focus primarily on the positive content of the view. These explore predictive coding’s claims at Marr’s computational, algorithmic, and implementation levels respectively (Sprevak forthcoming, forthcoming, forthcoming). As we will see, there are many ways in which its basic ideas may be elaborated and refined. My strategy is to present what, in my opinion, are the ‘bare bones’ of the approach. For readers new to this topic, I hope that this will provide you with a scaffold on which to drape a more nuanced future understanding of the view.5
For the remainder of this article, I focus on five ideas that feature prominently in expositions of predictive coding, but which should be distinguished from predictive coding. These ideas are: (i) that the brain employs an efficient coding scheme; (ii) that perception has top-down, expectation-driven effects; (iii) that cognition involves minimisation of prediction error; (iv) that cognition is a form of probabilistic inference; (v) that cognition makes use of generative models. All these ideas are used by predictive coding but, I argue, they are also shared by a variety of other computational approaches. They do not reflect – taken either singly or jointly – what is distinctive about predictive coding’s research programme. If one wishes to know what is special about predictive coding, these ideas, whatever their intrinsic value, can function as potential distractors. A corollary of this is that evidence for predictive coding does not necessarily flow from evidence that supports these more general ideas. Evidence for predictive coding should aim to selectively support predictive coding with respect to plausible contemporary rivals, not merely to confirm ideas that are shared by a wide variety of other approaches.
The literature on predictive coding is vast. In what follows, I ignore many interesting developments, proposals, and applications. My description is also inevitably partisan: there is too much disagreement within the primary literature to be able to characterise the view in a wholly uncontroversial way. If you disagree with my description, I hope that what I say at least provides a foil by which to triangulate your own views.
In both the present article and those that follow, I only consider predictive coding as a theory of subpersonal cognitive processing. I do not consider how its computational model might be adapted or extended to account for personal-level thought or conscious experience. Explaining conscious experience with predictive coding is a relatively recent development. However, it is a project that assumes we have a prior understanding of what predictive coding’s computational model is. That question is the focus of this review.6
2 Efficient neural coding
A key idea that predictive coding employs is that the brain’s coding scheme for storing and transmitting sensory information is, in a certain sense, efficient. The relevant form of efficiency is quantified by the degree to which the brain compresses incoming sensory information (measured in terms of Shannon information theory). To compress information, the sensory system should aim to transmit only what is ‘new’ or ‘unexpected’ or ‘unpredicted’ relative to its expectations. If the sensory system were to encode certain assumptions about its incoming sensory data, these would enable it to predict bits of that incoming sensory stream. This means that fewer bits would need to be stored or transmitted inwards from the sensory boundary, yielding a potential reduction in the costs of the brain physically storing and transmitting that data. The more accurately the brain’s internal assumptions reflect its incoming sensory stream, the less information would need to be stored or transmitted inwards from the sensory periphery. All that would need to be sent inwards would be an error signal – what is new or unexpected – with respect to those predictions. A similar idea underlies coding schemes that allow electronic computers to store and transmit images and videos across the Internet (e.g. JPEG or MPEG).
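To make the idea concrete, here is a toy sketch of a predictive coder (the signal and the ‘previous value’ predictor are invented for illustration; nothing is claimed here about the brain’s actual code). Because sender and receiver share the same predictive model, only the residuals – the prediction errors – need to be transmitted:

```python
import numpy as np

# Toy predictive coder: transmit only the prediction error (residual).
# Predictor: "the next sample equals the previous one" -- a crude stand-in
# for the assumptions a sensory system might hold about its input stream.
signal = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 7])

def encode(signal):
    """Send the first sample, then only the residuals."""
    predictions = np.concatenate(([0], signal[:-1]))
    return signal - predictions          # mostly zeros if the signal is predictable

def decode(residuals):
    """Receiver reconstructs the signal by adding residuals to its own predictions."""
    return np.cumsum(residuals)

residuals = encode(signal)
assert np.array_equal(decode(residuals), signal)
print(residuals)  # [5 0 0 1 0 0 0 1 0 0] -- few non-zero entries to transmit
```

When the signal matches the predictor’s assumptions, most residuals are zero and little needs to be sent; when those assumptions fail, the error signal grows and more must be transmitted.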
The notion that our brains use a sensory coding scheme that is efficient in this respect dates back at least to the work of Attneave (1954) and Barlow (1961). They argued that the brain uses a compressing, ‘redundancy reducing’ code for encoding sensory information based partly on the grounds that neurons in the early visual system have a limited physical dynamic range: the action potentials they send inwards to cortical centres are precious and should not be squandered to send information that those cortical centres already have.7 Predictive coding adopts the same basic perspective, but elevates it to a universal design principle: not only the early stages of vision, but every aspect of cognition, should be viewed as an attempt by the brain to compress its incoming sensory data. To this, predictive coding adds a range of further assumptions about (i) the algorithm by which the incoming sensory data are compressed; (ii) how assumptions used for sensory compression are changed during learning; (iii) where physically in the brain all this takes place.
Predictive coding has particular views about how compression of sensory signals works – see (i)–(iii) above. It also adopts the rather extreme position that sensory compression is the brain’s only goal. As Barlow made clear in his later work, even if one thinks that compressing incoming sensory data is one thing that the brain does, it is not obvious that it is the only thing. In some circumstances, it may pay the brain not to compress:
The point Attneave and I failed to appreciate is that the best way to code information depends enormously on the use that is to be made of it … if you simply want to transmit information to another location, then redundancy-reducing codes economizing channel capacity are what you need … But the brain is not just a communication system, and we now need to survey cases where compression is not the best way to exploit statistical structure. (Barlow 2001 p. 246).
One can appreciate Barlow’s point by considering what would count as ‘efficient’ coding for image data on a PC. If all one wishes to do is to transmit an image across the Internet, then compressing it using a redundancy reducing code (e.g. JPEG) might be a good solution, since it would reduce the number of physical signals one would need to send. Similarly, if one only wishes to store the image on a hard disk drive, then compressing it would mean that fewer physical resources would be required for its storage.8 However, if one wishes to transform the image or perform an inference over it, then a redundancy reducing code like JPEG may not be the best or most efficient solution. Compressed data are often harder to work with. If you ask a PC to rotate an image \(23^\circ\) clockwise, the machine will generally not attempt to execute this operation on a compressed encoding of the image data. Instead, it will switch to an uncompressed version of the image (e.g. a two-dimensional array of RGB values at X, Y pixel locations). Image processing algorithms defined over uncompressed data tend to be shorter, simpler, and faster than those defined over their more compressed counterparts.9 Uncompressed images have extra structure, and that structure can make the job of an algorithm that operates on them easier, even if it adds extra overhead to store or transmit.10
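A minimal sketch of this trade-off, using zlib as a stand-in for a redundancy-reducing code (the ‘image’ is an invented array): compression shrinks storage, but the data must be decompressed before any image operation can be applied.

```python
import zlib
import numpy as np

# An "image": a uniform array of pixel values -- highly redundant.
image = np.zeros((100, 100), dtype=np.uint8)
raw = image.tobytes()
compressed = zlib.compress(raw)
print(len(raw), len(compressed))   # 10000 bytes vs. a few dozen bytes

# To flip the image (a stand-in for rotation), we must first decompress:
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.uint8).reshape(100, 100)
flipped = np.flipud(restored)      # operations are defined over the uncompressed array
```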
If all that matters to the brain in cognition are the costs of transmitting and storing incoming sensory data, then it may make sense for the brain to aim to maximally compress that incoming sensory data. However, if speed, simplicity, and ease of inference matter, then it may make sense to add or preserve redundant structure within incoming sensory data.11 Reducing redundancy is not the only possible objective for a cognitive system that aims at efficient sensory coding.
It is common for contemporary work on efficient coding to acknowledge this point.12 Predictive coding, in its strongest and purest form, adopts a rather extreme view: it equates efficiency with sensory redundancy reduction, and it claims that the entire brain (not just certain areas in the sensory cortex) is devoted to this task; it also claims that this sensory compression is accomplished by a specific algorithm and representational scheme. Although predictive coding employs the idea of efficient coding, the general idea is not unique to predictive coding. Similarly, although evidence for efficient sensory coding in, for example, the early stages of the visual cortex may be compatible with predictive coding, it may also be compatible with a range of other, more modest proposals about efficient coding in cognition.
3 Top-down, expectation-driven effects in perception
Top-down, expectation-driven effects in perception are instances in which an agent’s prior beliefs systematically affect that agent’s perceptual experience. Top-down, expectation-driven effects are sometimes presented as a hallmark feature of predictive coding. Predictive coding’s computational model is thought to imply that perception is top-down or expectation-laden: ‘What we perceive (or think we perceive) is heavily determined by what we know’ (Clark 2011). Evidence for top-down effects in perception is also thought to somehow confirm predictive coding’s computational model: we should give higher credence to predictive coding’s computational proposal based on observation of top-down effects in perception.13
However, the relationship between predictive coding and top-down, expectation-driven effects in perception is more complex and less direct than this.
First, top-down effects in perception are standardly defined in terms of a relationship between an agent’s personal-level states: what an agent believes affects their perceptual experience.14 Predictive coding, at least in the first instance, makes a claim about the agent’s subpersonal computational states and processes. The ‘top’ and ‘bottom’ in predictive coding’s computational model refer, as we will see, to subpersonal computational states of the agent. ‘High-level’ neural representations (implemented deep in the cortical hierarchy) are assumed to have a ‘top-down’ influence on ‘low-level’ representations (implemented in the early sensory system). How this kind of subpersonal ‘top-down effect’ relates to personal-level top-down effects observed in psychology is presently unclear.
One might argue that, at a minimum, personal-level top-down effects require some subpersonal information to flow from high-level cognitive centres to low-level sensory systems. However, it is difficult to know what can be inferred from this assumption regarding personal-level experience. Not every piece of subpersonal information posited by predictive coding’s computational model features in the contents of either personal-level belief or perceptual experience. Only a tiny fraction of subpersonal information appears to be present at the personal level. For predictive coding to say something specific about the existence or character of top-down effects at the personal level, it would need to say which aspects of that subpersonal information give rise to which personal-level states (beliefs and perceptual contents). These assumptions – which connect the subpersonal level to the personal level – are currently not to be found anywhere within predictive coding’s computational model. Ideas about these connections have been proposed, but exactly how subpersonal states of the computational model map onto personal-level beliefs and perceptual experiences remains a highly speculative matter.15 Absent confidence in such assumptions, however, it is simply unclear how predictive coding’s computational architecture bears, or whether it bears at all, on personal-level top-down effects observed in psychology.16
Second, positing top-down subpersonal information flow inside a computational model is not unique to predictive coding. Almost any plausible computational model of cognition is likely to claim that information flows both ‘upwards’ (from lower-level sensory systems to high-level cognitive centres) and ‘downwards’ (from high-level cognitive centres to lower-level sensory systems). As Ira Hyman observed in his introduction to the reprinting of Neisser’s classic 1967 textbook: ‘Cognitive psychology has been and always will be an interaction of bottom-up and top-down influences’.17 This could even be said of so-called ‘bottom-up’ computational models, such as the account of vision proposed by Marr (1982). Those models might appear to ignore top-down processes, but this is not because they hold that top-down influences do not exist in the brain or are unimportant, but rather because they are not necessary to explain a particular phenomenon of interest.18 Indeed, it has long been standard practice in cognitive science to invoke top-down information flow to account for endogenous attention and semantic priming, and to explain how the brain handles ambiguity, noise, and uncertainty in its sensory input.19 The mammalian brain contains a huge number of ‘backward’ cortical connections, which suggests that signals carried from cortical centres to peripheral sensory areas have a significant computational role in cognitive processing. Even if one were to ignore these connections, Firestone & Scholl (2016) observe that there are many other causal routes by which high-level cognitive centres should be expected to systematically affect processing in low-level sensory systems – for example, the decision to ‘shut one’s eyes’ causes one’s eyelids to close, which changes low-level sensory inputs and thereby systematically affects the contents of states in subpersonal low-level sensory systems.20 When advocates of predictive coding suggest that their model has a special relationship with top-down, expectation-driven effects observed at the personal level, a challenge they face is to explain why predictive coding’s specific set of top-down computational pathways is uniquely or best suited to explain these effects.
To be clear, predictive coding’s computational model is compatible with personal-level top-down effects in perception occurring; it is also broadly suggestive that such effects would occur. What is not clear is that it is better suited to account for these effects than any number of other models that also incorporate subpersonal top-down information flow (e.g. other kinds of recurrent neural networks or classical computational models with loops). For these reasons, it is not clear that personal-level top-down effects are distinctively associated with, or selectively confirm, predictive coding.
4 Minimising prediction error
It is common in contemporary artificial intelligence (AI) to characterise learning and inference in terms of minimising prediction error. During learning, an AI system might attempt to change its parameters to better predict its training data. During inference, an AI system might search for values of its variables that would result in it generating predictions that minimise prediction errors – that are as close to ‘ground truth’ as possible.21 Different AI systems might differ in the types of data they try to predict, the mathematical model they use for prediction, or the way they revise parameters of that model during learning.22 Prediction error might also be measured in a number of ways. A common formalisation is mean-squared error – the average of the squares of the differences between the predicted values and the true values of the data.23
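Written out, for \(N\) data points with predicted values \(\hat{y}_i\) and true values \(y_i\), this is:

\[ \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \]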
The logical space of possible computational systems that aim to minimise their prediction error is vast. One can get some idea of the size and diversity of that space by opening up any current textbook on machine learning or statistics.24 A maximally simple example of a system that aims to minimise its prediction error would be one that performs linear regression on its training data. Here, minimising prediction error reduces to just fitting a straight-line mathematical model to the training data and using that straight-line model to make predictions about unseen cases. Learning consists in finding the value of two parameters (slope and \(y\)-intercept) that would define a straight line that minimises mean-squared error over the training data. Classical statistics contains many algorithms for finding those values (e.g., the ordinary least squares algorithm). Deep neural networks provide more complicated examples of computational systems that aim to minimise their prediction error. Here, learning consists in finding the values of not just two, but millions or billions of parameters. Algorithms like backpropagation are commonly used to find these values. During inference, a deep neural network might execute a long sequence of mathematical operations over many variables in an effort to yield an output that is as close to the ground truth as possible.
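A sketch of the simplest case just described, with invented training data: ordinary least squares finds the slope and intercept that minimise mean-squared prediction error over the training data.

```python
import numpy as np

# Invented training data: noisy points scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.shape)

# Ordinary least squares: closed-form values of the two parameters
# (slope and intercept) that minimise mean-squared error.
slope, intercept = np.polyfit(x, y, deg=1)

predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse:.3f}")
```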
Predictive coding suggests that the brain, like many other computational systems, aims to minimise a measure of prediction error. What distinguishes predictive coding from other proposals is that it makes specific claims about the data, model, and algorithm used in this task; a distinctive claim is also made about the role of this instance of prediction error minimisation within the brain’s wider cognitive economy.
Regarding the data, predictive coding claims that the brain aims to minimise prediction error concerning incoming sensory signals. This should be distinguished from other approaches that claim that the brain aims to minimise prediction error concerning other forms of data, such as reward signals.25 The mathematical model the brain uses to generate its predictions is encoded in an abstract hierarchical network containing prediction and error units linked by weighted connections. This network is similar to the connectionist networks found in deep learning, although the behaviour of individual units and the overall topology of the network differs from those commonly used in deep learning. The algorithm that adjusts the parameters of the network during learning is also different. Deep learning tends to use some version of backpropagation; predictive coding suggests that the brain uses a Hebbian learning algorithm.26 Finally, a special role is accorded to prediction error minimisation in cognition. Predictive coding holds that minimising prediction error over sensory signals is not just one among many objectives undertaken by the brain, but its only or fundamental objective.
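The following is a heavily simplified sketch of a single ‘layer’ of such a network, loosely in the spirit of tutorial treatments such as Bogacz (2017); the numbers are invented and all variances are set to one. A prediction unit settles on a value that balances sensory prediction error against its prior expectation, and the weight then changes by a Hebbian product of locally available activities:

```python
# One "layer": a prediction unit with activity phi predicts the input u
# through a weight w; an error unit carries the residual e = u - w*phi.
w, prior = 1.0, 2.0          # invented weight and prior expectation
u = 3.5                      # invented sensory input
phi = prior                  # start inference at the prior

lr = 0.05
for _ in range(200):
    e_sense = u - w * phi                 # sensory prediction error (error unit)
    e_prior = phi - prior                 # error relative to the prior expectation
    phi += lr * (w * e_sense - e_prior)   # prediction unit settles to a compromise

# Hebbian-style learning: the weight change is a product of locally
# available activities (error unit x prediction unit), rather than a
# gradient backpropagated through the whole network.
w += 0.01 * e_sense * phi
print(f"phi={phi:.2f}, remaining error={u - w * phi:.2f}")
```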
It is common to find prediction error minimisation occurring inside a computational model of cognition. What marks out predictive coding as special is the claim that cognition exclusively involves prediction error minimisation over a specific set of data, with a specific mathematical model, and using a specific algorithm for learning and inference. Evidence for prediction error minimisation occurring in the brain, although it may be compatible with predictive coding, may also be compatible with any number of other computational models that also employ prediction error minimisation.
5 Cognition as a form of probabilistic inference
Brains receive noisy, incomplete, and sometimes contradictory information via their sensory organs. They need to weigh this information rapidly and integrate it with (sometimes conflicting) background knowledge in order to reach a decision and generate behaviour. Probabilistic models of cognition provide a broad framework by which to understand how brains do this. According to these models, brains do not represent the world in a purely categorical way (e.g. ‘the person facing me is my father’), but instead represent multiple possibilities (e.g. ‘the person facing me is my father, my uncle, his cousin, \(\ldots\)’) along with some measure of uncertainty regarding those outcomes.27 Computational models typically formalise this by ascribing mathematical subjective probability distributions to brains. These probability distributions measure the brain’s degree of confidence in a range of different possibilities.28 Cognitive processing is then modelled as a series of operations in which one subjective probability distribution conditions, or updates, another. The exact manner in which this happens may vary between different computational models. In principle, cognitive processing may maintain this probabilistic character until the brain is forced to plump for a specific outcome in action (e.g. the agent is required to respond ‘yes’/‘no’ in a forced-choice task).
A particularly influential example of this approach is the Bayesian brain hypothesis.29 On this view, Bayes’ rule, or some approximation to it, is assumed to describe how the brain combines and updates its subjective probability distributions.30 Because exact Bayesian inference is computationally intractable, advocates of the Bayesian brain hypothesis generally assume that the brain implements some version of approximate Bayesian inference. Approximate Bayesian inference can be achieved in a variety of ways, the most popular of which are sampling algorithms (which use multiple categorical samples to create an empirical distribution that approximates the true Bayesian posterior) and variational algorithms (which change the parameters of some simpler, more computationally tractable distribution in order to find a posterior distribution that is close to the true Bayesian posterior).31 Both forms of approximate Bayesian inference are common in AI and machine learning. Proponents of the Bayesian brain hypothesis do not agree about whether the brain uses a sampling method, a variational method, or something else entirely.32
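The two families can be illustrated with a toy one-dimensional problem (all numbers invented): a Gaussian prior over a latent variable, a Gaussian likelihood, and a single observation, for which the exact posterior is known. The sketch approximates the posterior mean first by weighted sampling, then by iteratively adjusting the mean of a candidate Gaussian; in this conjugate case the variational mean coincides with the mode of the log-joint, so simple gradient steps suffice:

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy problem: prior theta ~ N(0, 1), data x ~ N(theta, 1).
# Observing x = 2 gives an exact Bayesian posterior of N(1, 0.5).
x = 2.0

# Sampling: draw from the prior, weight each draw by its likelihood
# (self-normalised importance sampling); the weighted draws form an
# empirical approximation of the posterior.
samples = rng.normal(0, 1, size=100_000)
weights = np.exp(-0.5 * (x - samples) ** 2)
weights /= weights.sum()
post_mean_sampling = np.sum(weights * samples)

# Variational: adjust the mean of a candidate Gaussian q = N(mu, 0.5)
# by gradient steps on the log-joint (valid here because the posterior
# mean and mode coincide for this Gaussian model).
mu = 0.0
for _ in range(100):
    mu += 0.1 * ((x - mu) - mu)

print(post_mean_sampling, mu)   # both approach the exact posterior mean, 1.0
```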
Predictive coding is one example of a probabilistic model of cognition and an instance of the Bayesian brain hypothesis. Predictive coding identifies the task the brain faces in cognition as that of minimising sensory prediction error. If combined with appropriate simplifying assumptions, this task can be shown to entail approximate Bayesian inference.33 The numerical values that feature in predictive coding’s artificial neural network can be interpreted as parameters of subjective probability distributions (namely, as the means and variances of Gaussian distributions). Predictive coding’s algorithm can be interpreted as a particular version of variational Bayesian inference.34 Predictive coding proposes that these numerical parameters, and hence the subjective probability distributions manipulated in cognition, are encoded in the average firing rates of neural populations of layers in the neocortex, and the manner in which these subjective probability distributions condition one another in inference is encoded in the strength of the synaptic connections between distinct neocortical areas.35
Someone might endorse the idea that the brain engages in probabilistic inference, or even the Bayesian brain hypothesis, but reject some or all of these further assumptions. For example, someone might not accept that a single probabilistic model underlies every aspect of cognition, or that the subjective probability distributions in the brain are always Gaussian, or that the brain uses the specific version of variational Bayesian inference proposed by predictive coding, or that the brain’s subjective probability distributions are encoded in the neocortex.36 Predictive coding is an example of a probabilistic model of cognition, but there are many possible alternative probabilistic models. Endorsement of, or evidence for, a probabilistic approach to cognition cannot straightforwardly be read as endorsement of, or evidence for, predictive coding as opposed to any number of other views.
6 Cognition uses a generative model
A generative model is a special kind of representation that describes how observations are produced by unobserved (‘latent’) variables in the world. If a generative model were supplied with the information that your best friend has entered the room, it might predict which sights, sounds, and smells you would experience. At the highest level of abstraction, you might conceive of a generative model as a black box that takes, as input, a hidden state of the world and yields, as output, the sensory signals that would be likely to be observed. It is widely thought that generative models – and in particular, probabilistic generative models – play an important role in cognition. This is for at least three reasons.
First, a generative model could help the brain to distinguish between changes to its sensory data that are self-generated and externally generated. When our eyes move, our sensory input changes. How does the brain know which changes are due to movement of our sensory organs and which are due to movement of external objects in the environment? von Helmholtz (1867) proposed that our brain makes a copy of its upcoming motor plans and uses this copy (the ‘efference copy’) to predict how its plans are likely to affect incoming sensory data. A generative model (the ‘forward motor model’) predicts the likely sensory consequences of a planned movement (e.g. how sensory data would be likely to change if the eyeballs rotate). These predictions are then fed back to the sensory system and ‘subtracted away’ from incoming sensory data. This would allow the brain to compensate for changes its own movement introduces into its sensory data stream.37
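A toy sketch of the subtraction (every quantity invented for illustration): a forward model predicts the sensory shift caused by a planned movement, and subtracting that prediction from the observed shift leaves only the externally generated change.

```python
# Efference-copy sketch: all numbers invented for illustration.
def forward_model(motor_command):
    """Predict how the sensory input will shift given a planned movement."""
    return 10.0 * motor_command          # e.g. degrees of retinal image shift

motor_command = 0.5                      # planned eye rotation
external_motion = 2.0                    # movement of an object in the world

observed_shift = forward_model(motor_command) + external_motion
predicted_shift = forward_model(motor_command)   # from the efference copy

residual = observed_shift - predicted_shift
print(residual)   # 2.0 -- only the externally generated change survives
```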
Second, a generative model would help the brain to overcome some of the inherent latency, noise, and gaps in its sensory data. When you execute a complex, rapid motion – e.g. a tennis serve – your brain needs to have accurate, low-latency sensory feedback. It needs to know where your limbs are, how its motor plan is unfolding, whether any unexpected resistance is being met, and how external objects (like the tennis ball) are moving. Due to the limits of the brain’s physical hardware, this sensory feedback is likely to arrive late, with gaps, and with noise. A generative model would help the brain to alleviate these problems by regulating its motor control based, not on actual sensory feedback, but on expected sensory feedback. When the incoming sensory data do arrive, the brain could then integrate them into its predictions in a way that takes into account any background information that it has about bias, noise, and uncertainty in that sensory signal. Franklin & Wolpert (2011) argue that this would allow the brain to make ‘optimal’ use of its sensory input during motor control – optimal in the sense that the brain would make use of all its available information.38
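A one-dimensional sketch of such integration, in the spirit of Kalman filtering (the variances are invented): the forward model’s prediction and the noisy, late-arriving feedback are combined in proportion to their precisions, so the less reliable source is down-weighted.

```python
# Precision-weighted integration of prediction and delayed feedback.
# All numbers are invented for illustration.
prediction, var_prediction = 1.2, 0.04   # forward model's estimate of hand position
feedback, var_feedback = 1.5, 0.16       # noisy, late-arriving sensory measurement

# Weight the correction by the prediction's share of the total uncertainty:
k = var_prediction / (var_prediction + var_feedback)   # Kalman gain
estimate = prediction + k * (feedback - prediction)
var_estimate = (1 - k) * var_prediction

print(estimate, var_estimate)   # 1.26, 0.032 -- pulled towards the reliable source
```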
Third, if a generative model takes a probabilistic form, it could, in principle, be inverted to produce a discriminative model.39 Discriminative models are of obvious utility in many areas of cognition. A discriminative model tells the cognitive system, given some sensory signal, which state(s) of the world are most likely to be responsible for its observations.40 Discriminative models are needed in visual perception, object categorisation, speech recognition, detection of causal relations, and social cognition. A discriminative model and a generative model facilitate inference in opposite directions: whereas a discriminative model tells the cognitive system how to make the inferential leap from sensory data to the value of latent unobserved variables, a generative model tells the cognitive system how to make the inferential leap from the value of latent variables to sensory observations. The latter form of inference might not initially appear to be useful, but if the system applies Bayes’ theorem, a generative model can be used to infer a discriminative model. Moreover, this may be a computationally attractive strategy because generative models are often easier to learn, more compact to represent, and less liable to break when background conditions change.41 In AI, a common strategy for tackling a discriminative problem is to first learn a generative model of the domain and then invert it using Bayes’ theorem. This strategy is frequently suggested as the way in which the brain tackles discriminative problems in certain domains of cognition.42
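A minimal sketch of the inversion step (all numbers invented): given a generative model – priors over two hidden states plus the likelihood of an observation under each – Bayes’ theorem yields the discriminative quantity, the probability of each state given the observation.

```python
# Invert a toy generative model with Bayes' theorem. All numbers invented.
prior = {"cat": 0.7, "dog": 0.3}          # P(Y): priors over hidden states
likelihood = {"cat": 0.2, "dog": 0.6}     # P(x | Y) for one observation x

# Posterior P(Y | x) -- the discriminative direction -- via Bayes' theorem:
evidence = sum(prior[y] * likelihood[y] for y in prior)          # P(x)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}
print(posterior)   # {'cat': 0.44, 'dog': 0.56} (approximately)
```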
Generative models are a common feature of modern computational models of cognition. A generative model’s content and structure, the methods by which it is updated, and how it might be physically implemented in the brain might be filled out in many ways, including ways that depart substantially from those suggested by predictive coding. In the context of predictive coding, a single probabilistic generative model is claimed to be employed across all domains of cognition. This generative model is claimed to have a specific hierarchical structure and content, and to be implemented in a specific way in the brain.
Someone might accept that generative models play a role in cognition, but reject these further assumptions. For example, they might hold that multiple distinct generative models exist in the brain in relative functional isolation from each other – e.g., there might be a domain-specific generative model dedicated to motor control.43 They might hold that the brain does not use a generative model to solve every inference problem – the brain might sometimes attempt to learn and use a discriminative model of a domain directly, or employ some other, non-model-based strategy to reach a decision.44 They might disagree about the content of the generative model or how the generative model is physically implemented in the brain.45
Generative models appear in many computational accounts of cognition. Predictive coding employs the idea, but that idea is not unique to predictive coding. The proposal that the brain uses a generative model should not simply be equated with predictive coding and one should not assume that empirical evidence that favours the hypothesis that the brain employs a generative model is also evidence that supports predictive coding’s specific proposal about the character and role of a generative model in cognition.
7 Conclusion
The aim of this paper is to separate five influential ideas about cognition from predictive coding. Many philosophers first encounter these ideas in the context of predictive coding. However, it is important to recognise that those ideas exist in a broader intellectual landscape and they are employed by approaches that have little or nothing to do with predictive coding. Accepting one or more of these ideas does not constitute an endorsement of predictive coding. Similarly, evidence that supports one or more of the ideas should not be taken as evidence that unambiguously supports predictive coding. If one wants to understand the distinctive content of predictive coding, or to evaluate the empirical evidence for it, one needs to disentangle it from these other ideas.
Of course, there is nothing to stop someone defining the term ‘predictive coding’ to refer to some broad, non-specific synthesis of these five ideas. On such a deflationary reading, one could say, without fear of contradiction, that predictive coding is already widely accepted and empirically confirmed. However, there are good reasons to resist such a move. Advocates of predictive coding are keen to stress both that their view is novel and that it faces genuine jeopardy with respect to future evidence. If these claims are to be taken seriously, one would need to show (i) that the view departs from plausible rivals; and (ii) that it is not so anodyne as to be consistent with any likely empirical evidence. To this end, Clark warns against interpreting predictive coding as an ‘extremely broad vision’; it should be interpreted as a ‘specific proposal’ (Clark 2016 p. 10). Hohwy observes that there is often an ambiguity which renders presentations of predictive coding ‘both mainstream and utterly controversial’ (Hohwy 2013 p. 7). He argues that in order for it to meaningfully make contact with empirical evidence, it should be understood as a specific, detailed proposal (Hohwy 2013 pp. 7–8).46
What is that specific, detailed version of predictive coding? In what follows, I argue that what distinguishes predictive coding from contemporary rivals is a combination of three claims, each of which may be precisified or qualified in various ways. These claims concern how cognition works at Marr’s computational, algorithmic, and implementation levels.
It is worth tempering what follows with a cautionary note. As already mentioned, the specific, detailed content of predictive coding is in no way a settled matter. Researchers disagree about which features of the view are essential, whether the model should be applied to all domains of cognition, whether the computational, algorithmic, and implementation level claims should be combined, and the exact form each of these claims should take. Cutting across this disagreement and uncertainty, however, is a set of ideas that has inspired many researchers: a simple, bold, and unifying picture of the mind, its abstract computational structure, and its physical implementation. This somewhat idealised version of predictive coding will be the focus of the next three papers.
Acknowledgements
I would like to thank Jonathan Birch, Matteo Colombo, Matt Crosby, Krzysztof Dolega, Jonny Lee, Edouard Machery, Christian Michel, Nina Poth, Wolfgang Schwarz, Dan Williams, Wanja Wiese, and Sam Wilkinson for helpful comments and discussion on earlier drafts of this paper.
Bibliography
For examples of these broad claims, see Clark (2013); Clark (2016); Hohwy (2013); Friston (2009); Friston (2010).↩︎
Some authors use ‘predictive coding’ to refer to only one aspect of the view: for example, to the efficient coding strategy described in section 2, or to the algorithm described in Section 2 of Sprevak (forthcoming). Some authors call the overall research programme ‘predictive processing’, ‘prediction error minimisation’, or ‘free energy minimisation’. In what follows, I use the term ‘predictive coding’ to refer to the overall research programme.↩︎
See Marr (1982), Ch. 1 for a description of these levels.↩︎
The term ‘research programme’ is used here to indicate that the precise details, goals, and conditions of correct application of a scientific model are often not to be decided in advance and are liable to change over time. It is not meant to indicate commitment to a specific philosophical understanding of a scientific research programme (e.g. that of Lakatos (1978) or Laudan (1977)). In what follows, I use the terms ‘framework’, ‘approach’, ‘view’, ‘account’, ‘theory’, and ‘model’ interchangeably with ‘research programme’, with alternative uses flagged along the way.↩︎
To help build that understanding, helpful reviews include Aitchison & Lengyel (2017); Friston (2003); Friston (2005); Friston (2009); Friston (2010); Kanai et al. (2015); Keller & Mrsic-Flogel (2018). For reviews that focus on describing the mathematical and computational framework, see Bogacz (2017); Gershman (2019); Jiang & Rao (2022); Spratling (2017); Sprevak & Smith (2023). For reviews that focus on the possible neural implementation, see Bastos et al. (2012); Jiang & Rao (2022); Lange et al. (2018); Kok & Lange (2015). For reviews that focus on philosophical issues and possible applications to existing problems in philosophy, see Clark (2013); Clark (2016); Friston et al. (2018); Hohwy (2013); Hohwy (2020); Metzinger & Wiese (2017); Roskies & Wood (2017).↩︎
For examples of work that applies predictive coding’s computational model to explain conscious experience, see Clark (2019); Clark (2023); Dolega & Dewhurst (2021); Hohwy (2012); Kirchhoff & Kiverstein (2019); Seth (2017); Seth (2021).↩︎
See Simoncelli & Olshausen (2001); Sterling & Laughlin (2015); Stone (2018) for reviews of efficient coding in the sensory system.↩︎
Other coding schemes such as wavelet-based codes (Usevitch 2001) or deep neural networks (Bühlmann 2022; Toderici et al. 2017) would outperform JPEG in these respects. However, these schemes tend to impose even higher computing burdens than JPEG if one wishes to decode or transform an image.↩︎
This is an instance of a more general trade-off in computer science between optimising for time and optimising for space. Compressing data saves space, but generally has an adverse effect on the time (number of computing cycles) required to do inference on that data to accomplish certain tasks. You have experienced this trade-off any time you waited for a ‘.zip’ archive to uncompress before being able to work on its contents.↩︎
A related point is that uncompressed data are more resistant to noise during storage and transmission.↩︎
Gardner-Medwin & Barlow (2001) list examples in which adding redundancy to sensory signals produces faster and more reliable inference over sensory data.↩︎
For example, Simoncelli & Olshausen (2001) suggest that the nature of the downstream task a cognitive system faces in a specific context should be considered when measuring the overall efficiency of a coding scheme, not merely the degree of compression of the incoming sensory signal (p. 1210).↩︎
For examples of this kind of reasoning, see Clark (2013), p. 190; Lupyan (2015).↩︎
See characterisations in Macpherson (2012); Firestone & Scholl (2016). One could also define a ‘top-down effect’ in terms of how various high-level states in predictive coding’s subpersonal computational model change the subject’s physically (non-intentionally) characterised behaviour (e.g. physical button presses by a subject during a psychophysics experiment). Such a claim would plausibly fall within the scope of predictive coding’s model, but its relationship to top-down effects as standardly defined is not obvious. Thanks to Matteo Colombo for this point.↩︎
For critical discussion of this point with respect to Seth (2021)’s proposals about personal-level experience, see Sprevak (2022).↩︎
See Macpherson (2017); Drayson (2017) for further development of this line of argument. They suggest that predictive coding’s computational model is compatible with no top-down effects occurring at the personal level at all.↩︎
Neisser (2014), p. xvi.↩︎
For example, Marr (1982): ‘… top-down information is sometimes used and necessary … The interpretation of some images involves more complex factors as well as more straightforward visual skills. This image [a black-and-white picture of a Dalmatian] devised by R. C. James may be one example. Such images are not considered here.’ (pp. 100–101).↩︎
See Gregory (1997); Poeppel & Bever (2010); Yuille & Kersten (2006). Firestone & Scholl (2016) suggest that endogenous attention requires subpersonal top-down information flow inside a computational model (p. 14).↩︎
Dennett (1991) argues that these kinds of external ‘virtual wires’, which loop into the environment, can enable sophisticated forms of top-down information processing, including those characteristic of rational thought (pp. 193–199).↩︎
For example, see Bishop (2006), pp. 1–12 and Hohwy (2013), pp. 42–46.↩︎
Note that a ‘prediction’ need not be about the future. A prediction is an estimate concerning something that the system does not already know. In principle, a prediction might concern what happened in the past, what is happening in the present, or what will happen in the future. For a helpful review of the relevant notion of prediction, see Lange et al. (2018), p. 766, Box 2 and Forster (2008).↩︎
Strictly speaking, AI systems aim to minimise a cost function, which combines prediction error with other factors. A common cost function is the prediction error plus the sum of the squares of the model’s parameters. The latter serves as a regularisation term that penalises more complex models. For discussion, see Russell & Norvig (2010), pp. 709–713.↩︎
For example, Bishop (2006); MacKay (2003); Barber (2012); Matloff (2017).↩︎
There are a wide range of computational models of learning and decision-making that attribute the goal of minimising prediction error over reward signals to the brain (Niv & Schoenbaum 2008; Schultz et al. 1997). Although these models bear a family resemblance to predictive coding, advocates of predictive coding are generally clear that the two approaches are distinct (Friston 2009). However, see Friston et al. (2013); Schwartenbeck et al. (2015) for an attempt to show that minimising reward prediction error can be reconceptualised as minimising a measure of expected free-energy that is also associated with sensory prediction error.↩︎
See Sprevak (forthcoming), Section 2.3.↩︎
For examples, see Chater et al. (2006); Danks (2019).↩︎
The subjective probabilities in question are formally handled in a similar manner to subjective probabilities inside classical formulations of Bayesianism – i.e. as degrees of belief or credences of some reasoning agent (de Finetti 1990; Ramsey 1990). However, unlike in traditional treatments, these subjective probabilities need not be ascribed to the entire agent; they may be ascribed to subpersonal parts of the agent (e.g. to individual brain regions, neural populations, or single neurons) (for example, see Deneve 2008; Pouget et al. 2013). For discussion of how the concept of subjective probability should be applied to subpersonal parts of agents, see Icard (2016); Rescorla (2020).↩︎
Chater & Oaksford (2008); Knill & Pouget (2004).↩︎
Bayesian updating is not the only option for handling inference under uncertainty. Plenty of rules and heuristics do not fit the Bayesian norms but still generate adaptive behaviour (Bowers & Davis 2012; Colombo et al. 2021; Eberhardt & Danks 2011; Rahnev & Denison 2018). Rahnev (2017) considers the possibility that brains do not store full probability distributions, but only a few categorical samples or summary statistics (e.g. variance, skewness, kurtosis) and use these partial measures to generate adaptive behaviour.↩︎
For an introduction to sampling methods (e.g. Markov chain Monte Carlo methods or particle filtering), see Bishop (2006), Ch. 11. For an introduction to variational methods, see Bishop (2006), Ch. 10.↩︎
For exploration of the idea that the brain uses a sampling method, see Fiser et al. (2010); Griffiths et al. (2012); Hoyer & Hyvärinen (2003); Moreno-Bote et al. (2011); Sanborn & Chater (2016); Sanborn & Chater (2017). Predictive coding is an example of a view that holds that the brain uses a variational method for approximate Bayesian inference.↩︎
Sprevak (forthcoming), Section 8; Sprevak & Smith (2023).↩︎
Sprevak (forthcoming), Section 5.↩︎
Sprevak (forthcoming), Section 3.↩︎
Aitchison & Lengyel (2017) consider how predictive coding’s proposals might be changed if its algorithm for variational Bayesian inference were replaced with a sampling algorithm (pp. 223–224).↩︎
Keller & Mrsic-Flogel (2018), pp. 424–425. Blakemore et al. (1999) use a model of this kind to explain why it is difficult to tickle yourself.↩︎
Grush (2004); Körding & Wolpert (2004); Körding & Wolpert (2006); Rescorla (2018).↩︎
Bayes’ theorem is \(P(Y \mid X) = P(X \mid Y)P(Y)/P(X)\), and follows from standard axioms and definitions of probability theory. Bayes’ rule (referenced in Section 5) says that an agent’s subjective probabilities should be updated using Bayesian conditionalisation, \(P_{t+1}(Y) = P_{t}(Y \mid X)\); its justification does not follow from the axioms of probability (Strevens 2017).↩︎
A discriminative model estimates the probability of a latent variable, \(Y\), given an observation, \(x\), i.e. \(P(Y \mid X=x)\). A generative model is defined either as the likelihood function, i.e. the probability of an observation, \(X\), given some hidden state of the world, \(y\), \(P(X \mid Y=y)\); or, as the full joint probability distribution, \(P(X, Y)\). The difference between these rarely matters in practice as the joint probability distribution equals the product of the likelihood and the system’s priors over those unobserved states, \(P(X, Y) = P(X \mid Y) P(Y)\), and both likelihood and priors need to be known to invert the model under Bayes’ theorem.↩︎
The reasons why generative models provide these advantages are complex and depend partly on the contingent way our world is structured. For a brief intuitive explanation, see Russell & Norvig (2010), pp. 497, 516–517.↩︎
See Bishop (2006), Ch. 4 on creating discriminative classifiers using generative models. See Chater & Manning (2006); Kriegeskorte (2015); Poeppel & Bever (2010); Tenenbaum et al. (2011); Yuille & Kersten (2006) for various proposals about how the brain uses generative models to answer discriminative queries in cognition.↩︎
Wolpert et al. (2001); Grush (2004) suggest this. They also suggest that this motor model is not implemented in the neocortex but in the cerebellum.↩︎
Ng & Jordan (2002) consider conditions under which it is more efficient to learn a discriminative model of a domain directly than learn a generative model first and then invert it. Raina et al. (2003); Lasserre et al. (2006) examine a range of hybrid discriminative-generative approaches to inference.↩︎
See Sprevak (forthcoming), Section 2.5; Sprevak (forthcoming), Section 6.↩︎
Colombo (2017) argues that Clark sometimes interprets predictive coding as a broad vision.↩︎