You’re navigating the convoluted aisles of an ethnic market, heart pounding, hoping that it’s there. Suddenly, you see it – that beautiful bag of rare, powerful, soul-shakingly delicious coffee beans that you’ve been craving for years.
While you grab the bag, weeping with joy, your brain is busy learning. Striatal dopamine-secreting neurons fire in unison, forming a single clear-cut message that is broadcast across reward-related circuits to signal “pleasant surprise (score!)”. In technical terms, “surprise” is just a reward prediction error: the difference between the reward we expect and the one we actually get. Since the action of going to the market INCREASED the future expected reward (sipping on a cup of joe), phasic dopamine firing signals a positive reward prediction error that strengthens this action, telling you to return for more. You learn.
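For the computationally inclined, the reward prediction error described above maps onto the "delta" term of temporal-difference learning. Here's a minimal, illustrative sketch (the function and parameter names are mine, not from the paper):

```python
# Reward prediction error (RPE) as a temporal-difference (TD) update.
# Illustrative sketch only: alpha (learning rate) and gamma (discount)
# are arbitrary example values, not anything measured in the study.

def td_update(value, reward, next_value, alpha=0.1, gamma=0.9):
    """One TD(0) step. 'delta' plays the role of the phasic
    dopamine burst: actual outcome minus expected outcome."""
    delta = reward + gamma * next_value - value  # the prediction error
    return value + alpha * delta, delta

# An unexpected reward (we expected 0, got 1) yields a positive RPE,
# and the value of the preceding action is strengthened.
new_value, delta = td_update(value=0.0, reward=1.0, next_value=0.0)
print(new_value, delta)
```

A positive delta nudges the value of the action upward, which is the "return for more" signal in the story above.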
However, life’s not always so straightforward. From bean to coffee, we navigate many steps – grinding, measuring, heating, brewing – before we finally cup the reward in our hands. Herein lies the conundrum: to which of these (often) inseparable actions do we attribute the reward? How can spurts of dopamine – which signal the unexpected – teach us that EVERY step is crucial? How do we know that we’re getting close to the final goal?
Mark W. Howe et al. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500, 575–579. DOI: 10.1038/nature12475
The answer may lie in a couple of chocolate-craving rats. Researchers measured dopamine levels in the striatum of rats as they navigated a simple left-or-right T-maze to hunt down their reward – chocolate milk. Some trials started with a clicking sound, signalling to the rats that the reward was available. Standard dopamine-dependent learning theory predicts a dopamine spike at the click (“cue”) or at the delivery of the reward – and this is indeed what the authors saw. But something more intriguing also showed up. As you can see below, dopamine levels gradually crept up as the rats set off, increasing steadily until they reached their prize. Dopamine levels didn’t differ between correct and incorrect choices BEFORE the rats got their reward, but diverged after they got the treat.
What is this ramp encoding? Perhaps it acts like a stopwatch, tracking how much time has elapsed since the start of the trial. Or maybe it tells the rat how close the reward is, like hikers urging themselves on just before the summit. Or it could simply be the sum of countless transient dopamine spikes reflecting fixed “checkpoints” in the maze, letting the rat know that it’s on the right path.
To tease out the answer, the researchers looked at how the ramps changed with elapsed time, distance travelled and the magnitude of the reward. As shown above, peak dopamine levels didn’t change with trial length, nor did they correlate with how fast the rats accelerated or how quickly they ran. However, they DID predict how spatially close the rat was to the reward, relative to the total length of the run. For example, although rats had to run further to get their treat in the M-maze than in the T-maze, their final dopamine levels were the same, with the ramp in the M-maze stretched out to cover the entire run (it’s almost as if dopamine is shouting “you’re at the half-way point!” regardless of whether you’re running 1 km or a marathon). In other words, the ramp may be interpreted as reward anticipation or expectation – “hey, I’m close!” Indeed, when a rat paused to think and rest, the signal showed a transient dip, which picked up again when the rat once more took off.
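This "same peak, stretched ramp" pattern is exactly what a discounted value function looks like in standard reinforcement learning. A toy sketch, assuming the value of each position is just the discounted future reward (the discount factor and track lengths are arbitrary illustrations, not the paper's model):

```python
# Value along a linear track: V(position) = reward * gamma ** (steps left).
# Value ramps up as the goal nears and peaks at the SAME level regardless
# of track length -- loosely mirroring the stretched M-maze vs T-maze ramps.
gamma = 0.9  # illustrative discount factor

def track_values(length, reward=1.0):
    """Discounted value at each position on a track of `length` steps."""
    return [reward * gamma ** (length - pos) for pos in range(length + 1)]

short = track_values(5)    # a short run to the reward
long_ = track_values(10)   # a run twice as far
print(short[-1], long_[-1])  # identical peak value at the goal
print(short[0], long_[0])    # but a lower starting value when further away
```

The longer the track, the lower the starting value, so the same peak is reached by a shallower, longer ramp.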
Just like transient spikes, the SLOPE of dopamine ramps also reflected the value of the anticipated reward – the bigger the reward, the larger the slope. When researchers teasingly swapped the location of two rewards of different values in the T-maze (left arm in the bottom graph first, then right arm), the ramps also shifted, so that the slope was still larger for the bigger reward, even though it was in the opposite position.
Even more surprisingly, the ramps didn’t disappear as the rats became well versed in the task, hinting that they weren’t encoding reward prediction errors, as the element of surprise would surely have faded with repeated trials.
So in sum, ramping dopamine signals seem to track your progress towards the goal in a lengthy task. How does this fit into the theory of dopamine-mediated reward learning? Well, it doesn’t. The ramps are highly consistent and predictable, hence according to the theory they should disappear – but they remained observable even after rats learned to predict the reward. Nor did they correspond to another mode of dopamine signalling – background “tonic” dopamine – which is thought to dictate how vigorously one should pursue a goal.
Instead, the authors hypothesized that these dopamine ramps may work “behind the scenes”, actively directing motivation and attention to a previously correct choice when multiple options are available (“should I go left or right?”). This way, they may maintain and promote long-term actions that will eventually lead to predicted rewards (“go right, that’s what I did last time to get the chocolate”). Without sufficient sustained dopamine ramping, we may be easily distracted and venture off the beaten track, wobbling away from the ultimate reward.
It’s a very believable theory, albeit just that. There’s no causative evidence in the data to support the hypothesis – that’s left for future studies. I find it incredible that dopamine ramping has never been seen in previous studies involving tasks requiring a series of actions. Is it because other studies directly recorded dopaminergic cell responses, while this study measured dopamine with fast-scan cyclic voltammetry? Are dopamine ramps only observable during spatial navigation, or do they generalize to any type of extended work? What mechanism generates the ramp? How does it relate to “classical” tonic and phasic dopamine firing – i.e. can changes in one influence the others? Does it exist outside the striatum? Can it be manipulated to influence stimulus-reward learning?
And a naïve one for the computer scientists out there: does the discovery of dopamine ramping help with the problem of credit assignment in reinforcement learning algorithms?
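For context, reinforcement learning already has one standard answer to credit assignment: eligibility traces, which spread a reward's prediction error backwards over the sequence of steps that led to it. A minimal TD(lambda) sketch on a five-step "maze" (all parameters are illustrative; this is a textbook mechanism, not the paper's model):

```python
# Credit assignment via eligibility traces (TD(lambda)).
# A reward at the END of a 5-step run gradually teaches the agent
# that EVERY step along the way has value -- producing a value
# "ramp" toward the goal. Parameters are arbitrary examples.
alpha, gamma, lam = 0.5, 0.9, 0.8
n_states = 5                       # positions 0..4; reward earned at state 4
V = [0.0] * n_states               # learned value of each position

for episode in range(50):
    e = [0.0] * n_states           # eligibility trace: "recently visited?"
    for s in range(n_states):
        reward = 1.0 if s == n_states - 1 else 0.0
        next_v = V[s + 1] if s + 1 < n_states else 0.0
        delta = reward + gamma * next_v - V[s]   # prediction error here
        e[s] += 1.0                              # mark this state eligible
        for i in range(n_states):
            V[i] += alpha * delta * e[i]         # credit ALL recent states
            e[i] *= gamma * lam                  # fade credit for older ones

# After learning, value ramps up along the track toward the reward.
print([round(v, 2) for v in V])
```

Whether a slowly ramping dopamine signal implements something like these traces in the brain is, as the post notes, an open question.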