Artificial intelligence has been successfully used for the first time to predict how the building blocks of life – the strings of amino acids that make the proteins in our bodies – fold up from a linear chain into the complex three-dimensional shapes that allow them to carry out life’s tasks.
DeepMind, a U.K.-based AI company, hopes that this could aid the pandemic effort, though the feat also has much wider implications. I talk to the head of the team, John Jumper. His edited responses are in italics.
Why is this important?
Proteins are arrangements of atoms at the nanoscale (billionths of a metre) and include the most spectacular machines ever created from moving atoms.
Some can do chemistry orders of magnitude more efficiently than anything that we’ve built or used in the lab. Until now, these self-assembling machines have been somewhat inscrutable.
Proteins are central to life.
All living things on Earth are built from proteins created from the same 20 chemical units, called amino acids (though scientists, such as at the Medical Research Council’s Laboratory of Molecular Biology, Cambridge, are now experimenting with new amino acids beyond those canonical 20).
Because proteins are the building blocks of all living things, and because their shape is central to how they work, efforts have been under way for decades to study their three-dimensional structures.
This field of structural biology was pioneered in the UK, notably at the Medical Research Council’s Laboratory of Molecular Biology, where X-rays and forms of electron microscopy have been used to investigate crystals made of a target protein (obtaining these crystals is in itself not an easy task) in order to work out their atomic structure.
In 1957, the structure of myoglobin, a protein that provides oxygen to working muscles, was determined by John Kendrew and colleagues using X-rays. The Science Museum Group has in its collections Kendrew’s first plasticine model of the protein, along with another, remarkable, ‘forest of rods’ model.
Since Kendrew’s pioneering work, only 170,000 protein structures have been solved. It can take years of painstaking and laborious research to crack each structure, and this requires the use of multi-million-dollar specialised equipment.
AlphaFold’s ability to predict protein structures could prove important to biology and medicine, from developing vaccines and drugs to diagnostics. Many other challenges, like finding ways to break down industrial waste, are also tied to understanding proteins, notably enzymes, which are proteins that can accelerate chemical reactions.
Earlier this year, we predicted several protein structures of the SARS-CoV-2 virus. We think that protein structure prediction could be useful in future pandemic response efforts, as one of many tools developed by the scientific community.
This is not like typical machine learning where you’re taking human judgement and putting it into, for example, an AI ‘cat recognition network’ that tries to give the same answer as humans to figure out if a picture is of a cat. Here we are talking about a task that humans can’t do.
We are getting results that are very close to what a doctoral trainee’s experiments are going to produce in a year or two, and that’s still, to me, unbelievable.
We can cover a huge fraction of proteins. We can do large proteins out to maybe 2000 amino acid residues. We can do novel proteins with high accuracy, such as membrane proteins that aren’t as common.
We are training our AI on existing protein data but we extracted pretty generalisable rules so we can tackle novel proteins at high accuracy. Not 100 per cent but this generalisability is surprising.
How can AI tackle the boggling diversity of life?
Board games such as Go are highly constrained because they have limited rules, board size and pieces, but are computationally hard because there are too many possible moves for a computer to explore all of them.
Now DeepMind’s AlphaFold offers a solution to one of biology’s great challenges. However, we are still in the era of ‘narrow AI’, that is, AI which exceeds human abilities on one task, rather than the general intelligence between our ears or the ‘general AI’ that is so adored by Hollywood, such as in the Terminator movies (a depiction of malevolent AI that AI aficionados dislike for obvious reasons).
Ultimately, DeepMind wants to “solve intelligence” by developing artificial general intelligence, or AGI.
How do we know this AI works?
This AI system – AlphaFold – has been acknowledged as having solved this grand challenge by the organisers of the biennial Critical Assessment of protein Structure Prediction, or CASP, which was set up to catalyse research and establish the state-of-the-art in protein structure prediction.
CASP measured the accuracy of the AI’s predictions in terms of the percentage of amino acids (think of them as beads in the protein chain) within a threshold distance from the correct position.
They found the AI’s predictions have an average error of approximately 1.6 Angstroms, which is comparable to the width of an atom (about 0.1 of a nanometre). That is impressive.
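The idea of scoring a prediction by the percentage of amino acids that land within a threshold distance of their correct positions can be sketched in a few lines of Python. This is only an illustration: the function name and the coordinates are hypothetical, the two structures are assumed to be already superimposed, and CASP’s actual scoring is more involved than a single cut-off.

```python
import math

def fraction_within_threshold(predicted, reference, threshold):
    """Percentage of residues whose predicted position lies within
    `threshold` (same units as the coordinates) of the reference position."""
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of residues")
    if not predicted:
        raise ValueError("structures must not be empty")
    hits = sum(
        1 for p, r in zip(predicted, reference)
        if math.dist(p, r) <= threshold
    )
    return 100.0 * hits / len(predicted)

# Hypothetical coordinates in Angstroms, one (x, y, z) point per amino acid.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 0.0, 1.0), (7.6, 3.0, 0.0), (11.4, 0.0, 0.2)]

# Three of the four beads are within 2 Angstroms of where they should be.
print(fraction_within_threshold(predicted, reference, threshold=2.0))  # 75.0
```

Tightening the threshold makes the test harsher, which is why a score of this kind is usually read alongside the cut-off that was used.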
How will AlphaFold help combat COVID?
DeepMind used AlphaFold to predict the structure of several under-studied proteins associated with SARS-CoV-2, the virus that causes COVID-19, with the scientific names ORF8, a membrane protein, protein 3a, NSP2, NSP4, NSP6 (NSP stands for non-structural protein), and Papain-like proteinase (C terminal domain).
We were 90% sure that the system was really good, but we weren’t 100%. Even so, we intentionally went after the six SARS-CoV-2 proteins about which almost nothing was known because they weren’t well studied in other coronaviruses.
We confirmed that our system provided an accurate prediction for the experimentally determined SARS-CoV-2 spike protein structure in the Protein Data Bank, and this gave us confidence that our model predictions on other proteins may be useful.
Jumper and his colleagues published predictions in March 2020, but traditional experiment-led molecular biology studies of one target protein, ORF3a, revealed errors. This particular protein was challenging for AlphaFold due to the small number of related sequences that were available to train the AI.
They refined the AlphaFold model and in August, Jumper, Kathryn Tunyasuvunakool, Pushmeet Kohli, Demis Hassabis and the team released the most up-to-date predictions of these five understudied SARS-CoV-2 protein targets. Experimentalists have since confirmed the structures of both ORF3a (protein 3a) and ORF8.
What is Artificial Intelligence?
There are lots of kinds of artificial intelligence, from those based on knowledge and rules to the kind used by DeepMind, which is a neural network.
A hand-waving explanation is that neural networks draw inspiration from the fields of mathematics, biology, physics, and machine learning, as well as of course the work of many scientists in the protein folding field over the past half-century.
A neural network contains a network of processors, a tiny bit like the network of neurons in the brain, though to liken it to the most complex known object is a stretch, given a single one of your 86 billion brain cells is more complex than the most complex neural network.
The processors in the network are organized into layers of nodes. So-called deep learning neural networks have many layers (that is why they are ‘deep’), and the interconnections change as the training data is fed to the bottom layer – the input layer – and passes through the succeeding layers.
During its training, weights and thresholds of connections change – that is how the neural net learns.
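To make the picture of layers, weights and thresholds a little more concrete, here is a minimal sketch of data passing through a tiny layered network. Everything in it is an assumption for illustration: the layer sizes, the numbers, the function names and the choice of a sigmoid ‘squashing’ function. It bears no relation to the size or design of DeepMind’s networks.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs, adds a bias
    (which plays the role of a threshold) and squashes the result."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Pass the data through each layer in turn, input layer first."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical toy network: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = network_forward([1.0, 2.0], layers)
print(output)  # a single number between 0 and 1
```

Training amounts to nudging the numbers in `layers` so that the output moves closer to the desired answer.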
In the case of AlphaGo, for example, DeepMind used ‘deep reinforcement learning’ where they trained the algorithm by showing it very large numbers of possible Go scenarios and rewarding it for recognising patterns and taking more effective decisions, until it surpassed a human’s ability.
How does this particular neural net work?
The details are somewhat arcane.
First, the team had to couch the protein folding problem in mathematical terms – a folded protein can be represented in what mathematicians might call a “spatial graph” that shows the relationship – connections – between the amino acids in three dimensions.
By connections, we mean the amino acids that are close in space, not necessarily that are chemically linked to each other.
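A spatial graph of this kind can be sketched as follows. The coordinates, the 6 Angstrom cut-off and the function name are all hypothetical choices made for illustration; they are not taken from AlphaFold.

```python
import math
from itertools import combinations

def contact_graph(coords, cutoff, min_separation=2):
    """Return the set of pairs (i, j) of amino acids that are close in space.

    Pairs closer than `min_separation` along the chain are skipped, because
    chemically linked neighbours are always close and tell us nothing new."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if j - i >= min_separation and math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges

# Hypothetical hairpin: the chain runs out along x, then doubles back on itself.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
contacts = contact_graph(coords, cutoff=6.0)
print(sorted(contacts))  # [(0, 5), (1, 4)]
```

The first and last amino acids are as far apart as they can be along the chain, yet they end up as neighbours in space, which is exactly the kind of connection the graph is meant to capture.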
The DeepMind team trained the AlphaFold network to predict the shapes using sequences of amino acids and the resulting protein shapes in publicly available data consisting of around 170,000 protein structures.
Instead of predicting relationships between amino acids, the network predicts the final structure. We did not program it in a specific way. Instead, we designed the structure of the network to learn from its training what amino acids lie near each other.
We used an ‘attention based neural network,’ a new type of network.
Attention, one part of the neural network, is one of the things we found very important when trying to understand a protein.
The problem is that at the start of the training process, when you have the sequence of amino acids, you do not know what is nearby.
You start with the amino acid sequence of the protein and you don’t know what parts are going to later be in contact when it folds.
What ‘attention’ allows the network to do is to dynamically push information around, according to what it has learned already.
We see that as the neural network starts to learn which parts of the protein are close, then it is able to establish essentially a link to pass information between different amino acid pieces.
So, ‘attention’ is, in a sense, each part of the protein attends to, or kind of communicates with, the parts of the protein that the network has determined might be close.
So, you see it kind of build knowledge of the structure of this protein and then use that knowledge to build even more knowledge about how it folds.
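The general idea of attention, in which each position gathers information from the positions it judges most relevant, can be sketched with the textbook ‘scaled dot-product’ form. This is a generic illustration and not AlphaFold’s architecture: real attention layers also apply learned transformations to produce the queries, keys and values, which are omitted here, and the feature vectors are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Each position receives a weighted mix of every position's features,
    weighted by how similar the two feature vectors are."""
    dim = len(features[0])
    updated, all_weights = [], []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ]
        updated.append(mixed)
        all_weights.append(weights)
    return updated, all_weights

# Hypothetical features for three amino acids; the first two are similar.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
updated, weights = self_attention(features)
print(weights[0])  # position 0 attends more to position 1 than to position 2
```

In a trained network the features themselves change from layer to layer, so who attends to whom shifts as the network’s picture of the protein improves.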
What goes on inside AlphaFold?
What goes on inside ‘deep learning’ algorithms is opaque and there is no general theory of how they work, so their development is rather suck-it-and-see.
In other words, how AlphaFold ‘reasons’ from sequence to structure remains a black box.
We have been looking very carefully at the internals of the network to make sure we understand, and it seems to reflect our understanding, but it’s a much softer and less dogmatic view which tends to work much better in machine learning systems.
Can you extract nuggets of generalisable wisdom from AlphaFold?
We think so, and we think that we have an idea of how it’s creating protein pieces and joining those pieces together. We know what it knows in layer 10 and in layer 20. We’ve been putting a lot of effort into it and have made a fair amount of progress thus far.
How long have we dreamed of solving this challenge?
This grand challenge dates back to the acceptance speech for the 1972 Nobel Prize in Chemistry of Christian Anfinsen, who postulated that, in theory, a protein’s amino acid sequence should fully determine its structure. In other words, how can a protein’s three-dimensional structure be figured out from its one-dimensional amino acid sequence?
Why is it such a tough problem?
The number of ways that you can fold a chain of amino acids into a three-dimensional structure is astronomical.
In 1969 Cyrus Levinthal noted that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute force calculation – Levinthal estimated 10 to the power of 300 possible conformations (shapes) for a typical protein.
At the atomic level, molecules jig about due to thermal energy (the higher the temperature, the more frenetic their movements). While it would take a conventional computer aeons to check all the possibilities, under the jiggle of heat energy proteins fold spontaneously, some within milliseconds.
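Levinthal’s point can be checked with back-of-the-envelope arithmetic. The sketch below takes the 10 to the power of 300 figure quoted above and assumes, purely for illustration, an imaginary computer that checks a million billion shapes every second; the age of the universe is taken as roughly 13.8 billion years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years

def universe_ages_to_enumerate(conformations, checks_per_second):
    """How many ages of the universe a brute-force search would need."""
    seconds_needed = conformations // checks_per_second  # exact integer maths
    years_needed = seconds_needed / SECONDS_PER_YEAR
    return years_needed / AGE_OF_UNIVERSE_YEARS

# 10**300 conformations (Levinthal's estimate), 10**15 checks per second
# (an assumed, very generous search speed).
ages = universe_ages_to_enumerate(10 ** 300, 10 ** 15)
print(f"{ages:.1e} ages of the universe")  # on the order of 1e267
```

Even speeding the imaginary computer up a billion-fold barely dents the exponent, which is why brute force is hopeless and why it is so striking that real proteins fold in milliseconds.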
What happens now?
AI now promises to be an important tool to study proteins and will be especially helpful for important classes of proteins, such as membrane proteins, that are tricky to crystallise and thus more challenging to study.
Andrei Lupas, Director of the Max Planck Institute for Developmental Biology and a CASP assessor, commented, ‘AlphaFold’s astonishingly accurate models have allowed us to solve a protein structure we were stuck on for close to a decade, relaunching our effort to understand how signals are transmitted across cell membranes…This will change medicine. It will change research. It will change bioengineering. It will change everything’.
What problems will AlphaFold focus on?
DeepMind is excited about the potential for these techniques to explore the hundreds of millions of proteins of unknown structure – a vast terrain of unknown and unexplored biology.
While we have 180 million protein sequences and counting in the Universal Protein database, there are only around 170,000 protein structures in the Protein Data Bank, and this database is bound to be skewed towards the structures that have been easiest for scientists to solve.
Among the undetermined proteins may be some with new and exciting functions and – just as a telescope helps us see deeper into the unknown universe – techniques like AlphaFold may help us find them.
Neural networks sometimes produce rubbish – do you constrain your predictions with theory?
Proteins adopt their shape without help, guided only by the laws of physics and chemistry. However, AlphaFold does not take account of these laws.
The model is constrained, not so much by physics, but by geometry.
We don’t even tell the neural network explicitly that the amino acids exist in a chain – it just learns to place them this way. We actually leave the network to learn that and that much simplifies the problem. That’s a weird non-physicality that turns out to matter.
Did you train the model on virus protein structures to deduce COVID-19 virus structures?
Quite the opposite. One of the interesting things about neural networks is that, to be anthropomorphic, they want to memorise what you show them.
If you show one ten data points, it will want to memorise those data. The skill of fitting a neural network is to convince it to learn the right thing instead of overfitting (the AI equivalent of not being able to see the wood – the big principles in this case – for the trees).
To help do that, we show it an enormous variety of information. When we’re training the neural network, we don’t just show it coronavirus proteins, we show it the wide gamut of proteins, so it is easier for it to learn generalisable rules than to try to memorise the great diversity of structures.
It’s also very important exactly what’s going on in that network, and this, in the language of neural networks, is called inductive bias. What are the things that are easy for the network to learn and how do you arrange the connections of the pieces in the network, so it is easy to learn about proteins?
We think that’s really at the core of what we’ve done better: we figured out how to arrange the network, in a sense, so that it wants to learn in a way that generalises really well to novel proteins, such as those of SARS-CoV-2.
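The difference between memorising examples and learning a generalisable rule can be shown with a deliberately tiny toy. The data, the ‘memoriser’ and the straight-line model are all hypothetical and have nothing to do with proteins; the point is only what happens when each is asked about something it has never seen.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical training examples that happen to follow y = 2x + 1.
training = {0.0: 1.0, 1.0: 3.0, 2.0: 5.0, 3.0: 7.0}

# Model 1: a look-up table that simply memorises what it was shown.
memoriser = dict(training)

# Model 2: a two-parameter rule fitted to the same examples.
slope, intercept = fit_line(list(training), list(training.values()))

unseen_x = 10.0
print(memoriser.get(unseen_x))         # None: the table has no answer
print(slope * unseen_x + intercept)    # 21.0: the rule extends to new cases
```

Both models are perfect on the training examples; only the second has learned anything that carries over to a new one.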
How long does it take to train and how many parameters does it optimise?
Training takes a few weeks, and it uses hundreds of millions of parameters.
Physicists like to joke: ‘With four parameters I can fit an elephant, and with five I can make him wiggle his trunk’. With this number of parameters perhaps you can give it an inner life and consciousness and sense of purpose too!
Though this comparison is a little unfair, it is tempting to say that you can fit this number of parameters to anything by statistical methods, like a glorified look-up table, rather than something smart that can generalise from a few examples.
I am not absolutely sure, but I wouldn’t be surprised if we have more parameters than the entirety of the protein database, which sounds like madness. More is better here.
What’s really surprising about these kinds of neural networks is if you give them a great many parameters, then they actually work really well.
What people have found is that if you train a neural network forever, till the ‘heat death’ of the universe, it will just remember all the examples you showed it and not generalize.
Those driving the neural net revolution find that the net tends to learn the right things at first in the training, the generalizable facts in this vast set of data that work well on new examples. It is totally weird to my classically-trained brain.
His DeepMind colleague Clemens Meyer added: ‘John said we’re using around 10 to the power of 8 parameters. A human brain has something like 10 to the power of 11 neurons with thousands of synapses each, so approximately 10 to the power of 14 synapses, which could be seen as most comparable to parameters. Elephants won’t have quite as many, but still a few orders of magnitude more than 10 to the power of 8!’
Are others doing this work?
Yes. Having a reliable method to predict 3D structures of proteins is very, very good and potentially extremely powerful. But these structures are merely approximations and static snapshots of a protein that is dynamic and at much higher temperatures.
Many teams are working on how to combine machine learning with more accurate ‘physics based’ methods for drug discovery.
The next generation of ‘exascale’ supercomputers, such as Aurora in the United States, will also depend much more on AI.
Can you make predictions from the other codes of nature?
The amino acid sequence is defined by the DNA code, which is first transcribed into another code, called RNA, and then translated into proteins.
It is trivial to extend this to DNA, though we put this in terms of the protein sequence that the DNA codes for.
We haven’t tried RNA structure prediction yet. I wouldn’t be surprised if these ideas are applicable to RNA, which does its own folding, but we don’t know because we have not looked into it.
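The route from DNA to RNA to a chain of amino acids can be sketched in a few lines. The codon table below is only a small excerpt of the standard genetic code, just enough for the made-up example sequence, and the function names are hypothetical.

```python
# Excerpt of the standard genetic code: RNA triplet -> one-letter amino acid.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start signal
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UGG": "W",                                      # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop signals
}

def transcribe(dna_coding_strand):
    """DNA -> RNA: the message reads like the coding strand with T swapped for U."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(rna):
    """RNA -> protein: read three letters at a time until a stop signal."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]  # KeyError if outside the excerpt
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGTTTGGCAAATGGTAA"   # made-up example gene
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # AUGUUUGGCAAAUGGUAA
print(protein)  # MFGKW
```

The resulting string of letters is the one-dimensional amino acid sequence that a structure predictor takes as its starting point.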
How easy is it to create a drug to interfere with a target protein?
DeepMind is looking into how protein structure predictions could aid our understanding of specific diseases, for example by helping to identify proteins that have malfunctioned. These insights could enable more precise work on drug development, complementing existing experimental methods, to find promising treatments faster.
However, chemists have talked about computer-aided drug design for half a century, the dream being that you could take a target protein and work out the drug molecule – they call them ligands – to interfere with that molecule.
It has proved much harder to do than expected because proteins are floppy, they have charges distributed over them and chaos – in the strict mathematical sense of a system where the outcome is highly sensitive to initial conditions – rules.
In other words, the crystal structures in the protein database are highly artificial: in cells, proteins are much more flexible, and the pockets and targets in a protein can change shape.
It is telling that AlphaFold’s predictions were poor matches (see page 11 and page 23 of these reports) to experimental structures determined by a technique called nuclear magnetic resonance spectroscopy which can study proteins in solution, as they are in cells.
This is a complex question, and it’s a good one. The neural network is trained to give answers to the protein database, most of which are crystal structures. Mostly the differences in the cell are small. We can tell if a structure is disordered but we can’t say much about the range of structures it adopts in the cellular milieu. That is future work.
In addition to providing answers, we do provide good estimates of our own errors. So, we also have a good idea when we are not succeeding.
These protein crystal structures are still used to create drugs. I find it amazing that we are designing the arrangement of 30 atoms to figure out how to be healthy.
Quite a lot of what is hard about drug design is figuring out what you should be drugging. You have to figure out which of at least 20,000 human proteins (those described by all the genes in the human DNA sequence) is associated with a disease like Alzheimer’s. We still think it is an important step.
Biology is extraordinarily difficult.
Do you run your neural net on a supercomputer?
No. It uses tens of tensor processing units – TPUs, which are integrated circuits customised by Google for machine learning – over a few weeks, a relatively modest amount of computer power in the context of most large state-of-the-art models used in machine learning today.
The neural net learns to predict structures from known protein sequence/protein structure pairs again and again and again. For each round of predictions, it automatically gets a score for accuracy and learns what is correct from these scores.
This is a typical “supervised machine learning” setup – what’s new, as discussed, is the special architecture of the AlphaFold neural net (plus engineering to make this work).
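The loop of predicting, scoring and adjusting that defines supervised learning can be sketched with the simplest possible model, a straight line with two parameters. The data, learning rate and number of rounds are hypothetical, and the method shown (plain gradient descent on the mean squared error) is a textbook stand-in rather than a description of how AlphaFold is trained.

```python
def train(pairs, learning_rate=0.05, rounds=1000):
    """Fit prediction = w * x + b to known (input, answer) pairs."""
    w, b = 0.0, 0.0
    history = []  # the score (mean squared error) after each round
    n = len(pairs)
    for _ in range(rounds):
        grad_w = grad_b = total_error = 0.0
        for x, answer in pairs:
            error = (w * x + b) - answer   # how wrong was this prediction?
            total_error += error ** 2
            grad_w += 2 * error * x        # which way to nudge w
            grad_b += 2 * error            # which way to nudge b
        history.append(total_error / n)
        w -= learning_rate * grad_w / n    # adjust parameters to reduce error
        b -= learning_rate * grad_b / n
    return w, b, history

# Hypothetical known pairs that follow answer = 0.5 * x + 0.5.
pairs = [(0.0, 0.5), (1.0, 1.0), (2.0, 1.5), (3.0, 2.0)]
w, b, history = train(pairs)
print(round(w, 3), round(b, 3))   # 0.5 0.5
print(history[0] > history[-1])   # True: the score improves with training
```

Swap the two parameters for hundreds of millions and the straight line for a deep network, and the same predict, score and adjust cycle is what runs again and again on the TPUs.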
Although training this model takes a couple of weeks, they have tried quite a few variants over the last four years.
Can anyone use this AI?
DeepMind will publish a peer reviewed paper on AlphaFold to share the details with the scientific community. Demis Hassabis, DeepMind’s co-founder, also says that the company plans to make AlphaFold available for drug discovery and protein design.
Hassabis was inspired to apply AI to this problem as an undergraduate, more than two decades ago, by Foldit, an online puzzle video game about protein folding. ‘It turned out that through playing this game… they actually discovered a couple of very important structures for real proteins.’
How can I find out more?
There is much more information in our Coronavirus blog series (including some in German by focusTerra, ETH Zürich, with additional information on Switzerland), from the UKRI, the EU, US Centers for Disease Control, WHO, on this COVID-19 portal and Our World in Data.
The Science Museum Group is also collecting objects and ephemera to document this health emergency for future generations.