AlphaFold: The First Artificial Intelligence Program That Can Predict How Proteins Organise Themselves

Posted by Phil Heler on March 9, 2021

A key scientific holy grail has been to accurately predict and understand how proteins fold and tangle themselves into the shapes that hold the key to how they carry out their vital functions (like Watson and Crick’s double helix). Even working out the shape of just one requires expensive equipment and can take years. Now an Artificial Intelligence program called AlphaFold appears to have cracked the problem.

There are a few scientific anniversaries that are worth celebrating in 2021. Two of these are truly relevant to the present day, as they mark discoveries that established the underlying principles behind the development of our COVID-19 vaccines (whether mRNA or DNA based).

It was 150 years ago that Johann Friedrich Miescher discovered the existence of DNA. Biochemistry was a young science at the time, as biologists were only just beginning to describe and understand the contents of living cells and how they interacted. Miescher discovered a substance in the cell nucleus that was dramatically different from the proteins and all the other contents, and he called it ‘nuclein’ (this was later identified as DNA). He had no idea how important his findings were. Fast forward to 1953.

If this were written on February 27th, 2021, that would not have made any sense, as that day was International Polar Bear Day (I am not pulling your leg; you can check that out at www.wwf.org.uk). However, February 28th ties in very nicely. February 28th, 1953 was the day when two young Cambridge University scientists, James Watson and Francis Crick, solved one of life’s great enigmas and the scientific Everest of their day. They announced that they had determined the double-helix structure of DNA, the molecule that contains our genes (‘the molecule of life’).


The discovery was made on the morning of that day and it was a watershed moment in scientific history, as it would revolutionise biochemistry. Research teams in London, Europe and California had at the time been struggling with this mystery for at least a decade, but Watson and Crick pipped them all to the post. Nothing would be the same again and in the present day we have clearly reaped the benefit.

James Watson (who was American) later wrote a best-selling book entitled ‘The Double Helix’. In his book he claimed that Crick, in a truly typical English way, announced their discovery by walking into a nearby pub called ‘The Eagle’ and declaring in casual conversation that they had discovered ‘the secret of life’. I am not sure anyone would bother listening if you did that in our local pub.

He most definitely deserved his pint, as the chances of success had at various points looked uncertain. Watson himself writes in his book that the prospect ‘that anyone on the British side of the Atlantic would crack DNA looked dim’. The pub ‘The Eagle’ is now considered a national treasure and it was refurbished in 2018. Another reason for its fame is that its ceiling is covered in the signatures of Allied fighter pilots from World War Two.

Predicting How Proteins Fold Has Up Until Now Only Been Theoretically Possible

Since that time, another scientific holy grail has been to accurately predict and understand how proteins fold and tangle themselves into the shapes that hold the key to how they carry out their vital functions (like Watson and Crick’s double helix). There are, for instance, three long protein chains that fold and tangle to form the spikes on the COVID-19 virus, and how these spikes interact with receptors in human cells is a vital component in our fight against the disease.

In 1972, Christian Anfinsen was jointly awarded a Nobel Prize in Chemistry for his work showing that it should be theoretically possible to determine the shape of proteins based on the sequence of their building blocks (amino acids). The theory is one thing but doing it is another. How proteins fold to create exquisitely unique three-dimensional structures is one of biology’s biggest mysteries.

Inside every cell, thousands of different proteins form the machinery that keeps all living things – from humans and plants to microscopic bacteria – alive and well. Almost all diseases, including cancer, dementia and even infectious diseases such as COVID-19, are related to the way these proteins function. Predicting how a protein folds into a unique three-dimensional shape has puzzled scientists for half a century. Proteins, as we know, are composed of amino acids (there are twenty different amino acids) which link together to form protein chains.

Protein chains have an almost incomprehensibly large number of possible folded shapes. There is a bewildering array of energetic factors that allow proteins to accomplish these magic tricks, all governed only by the laws of physics. The mere mention of some of the factors involved would send you to sleep immediately (examples being hydrogen bonding, hydrophobic surfaces, and pi-stacking interactions, to name but a few).
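To give a feel for just how large these numbers get, here is a rough back-of-the-envelope sketch in Python. The figure of twenty amino acids comes from the text above. Everything else is an assumption made purely for illustration: the idea that each amino acid in a chain can adopt about three local shapes, and the rate of ten trillion shapes tried per second, are not measured values, and the function names are made up for this example.

```python
AMINO_ACIDS = 20  # the twenty different amino acids mentioned above


def sequence_count(length):
    """Number of distinct chains of a given length built from 20 amino acids."""
    return AMINO_ACIDS ** length


def conformation_count(length, states_per_residue=3):
    """Crude count of folded shapes, ASSUMING each position in the chain
    can independently adopt `states_per_residue` local shapes."""
    return states_per_residue ** length


def years_to_enumerate(length, states_per_residue=3, rate_per_second=1e13):
    """Years needed to try every shape one by one at an ASSUMED sampling rate."""
    seconds = conformation_count(length, states_per_residue) / rate_per_second
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    chain_length = 100  # a fairly small protein, chosen for illustration
    print(f"Possible sequences: {sequence_count(chain_length):.3e}")
    print(f"Possible shapes:    {conformation_count(chain_length):.3e}")
    print(f"Years to try all:   {years_to_enumerate(chain_length):.3e}")
```

Even with these generous assumptions, trying every shape of a 100-unit chain one at a time would take vastly longer than the age of the universe, which is why a real protein cannot be finding its shape by trial and error.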

The best simile I can think of is that it is almost like a pile of wood spontaneously arranging itself into an IKEA wardrobe, or a dining table or a rowing boat (it would be great if all flat pack furniture assembled itself like proteins do!).

Many diseases are linked to the roles of proteins in catalysing chemical reactions (enzymes), fighting disease (antibodies) or acting as chemical messengers (hormones such as insulin). Even minute changes in the shape of a protein can have catastrophic effects on our health. Therefore, clearly one of the best ways to understand disease and find new treatments is to study how proteins work and organise themselves.

There are quite literally tens of thousands of human proteins and many billions in other species, including bacteria and viruses (such as our COVID-19 example). Even working out the shape of just one requires expensive equipment and can take years. So, anything that helps streamline this process is hugely significant. A better understanding of protein structures and the ability to predict them means a better understanding of life, evolution and, of course, our health.

The CASP Competition

As it happens, there is a competition called CASP (Critical Assessment of Protein Structure Prediction) that has run every two years since 1994. In this competition teams from all over the world use computer modelling to predict the shape of 100 proteins using their amino acid sequences alone. In the meantime, another set of teams use traditional gold standard techniques (like X-ray crystallography and NMR spectroscopy) to painstakingly work out what the 3-D structures actually look like.


This is done even to the extent of working out the location of each atom relative to the others (but this takes time and is expensive). A team of scientists from CASP then compares the computer predictions with the 3-D structures solved using traditional laboratory methods and decides which computer prediction is the closest. Then we have our winner.
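For the curious, the comparison step can be sketched in a few lines of Python. This is only an illustration and not CASP's actual scoring code: as far as I know the real assessment relies on a measure called GDT_TS, which also searches for the best way of overlaying the two structures. The simplified version below assumes the predicted and experimental structures have already been overlaid, that each is just a list of (x, y, z) positions in ångströms, one per amino acid, and that the function names and distance cut-offs are my own choices.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equally long lists of 3-D points.
    ASSUMES the two structures are already superimposed."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("structures must be non-empty and the same length")
    total = sum(math.dist(p, e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(total / len(predicted))


def gdt_like_score(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score from 0 to 100: the average, over several
    distance cut-offs, of the percentage of positions that land within
    that distance of the experimental position."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("structures must be non-empty and the same length")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Made-up coordinates, purely for demonstration.
    experimental = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
    print("RMSD:", rmsd(predicted, experimental))
    print("GDT-like score:", gdt_like_score(predicted, experimental))
```

A perfect prediction scores 100 on the second measure and 0 on the first. Averaging over several cut-offs means one badly placed region does not wipe out credit for getting the rest of the protein right.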

To a certain extent this acts as a barometer, as it gives an indication of how much progress is being made over time. Amazingly, last year’s award (in 2020) went to an entrant which appears to have, at long last, largely cracked the gargantuan problem. This passed by virtually unnoticed under the shadow of COVID-19. What is even more surprising is that the winner was an artificial intelligence company called DeepMind, which is owned by Google (they bought the company for £400 million in 2014).

AlphaFold

The program that DeepMind used is called ‘AlphaFold’. It can make predictions quickly and accurately, and has the incredible potential to revolutionise life sciences. In last year’s competition, the AlphaFold program determined the shape of around two thirds of the proteins with accuracy comparable to gold standard laboratory methods. AlphaFold’s accuracy with most of the other proteins was also high, though not quite at that level.

In early 2020, DeepMind released predictions using AlphaFold of the structures of a handful of proteins from the COVID-19 virus that had not yet been determined experimentally. DeepMind’s prediction for one particular protein (called Orf3a) ended up being remarkably similar to the structure later determined through laboratory techniques.

 


AlphaFold is based on a concept called deep learning. In this process, the 3-D structure of a folded protein is represented as a spatial graph. The program then learns using information on the 3-D shapes of known proteins held in the Worldwide Protein Data Bank (or PDB). The AlphaFold program was able to do in a matter of days what might take years at the laboratory bench.
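To make the idea of a ‘spatial graph’ a little more concrete, here is a minimal Python sketch. It is emphatically not how AlphaFold itself is built. It simply treats each amino acid as a dot in space and joins up any two dots that sit close together. The function name, the 8 ångström cut-off and the coordinates are all assumptions chosen for illustration.

```python
import math


def build_spatial_graph(coords, cutoff=8.0):
    """Build a simple spatial graph from 3-D positions.

    Each position (one per amino acid) becomes a node, and two nodes are
    connected when they lie within `cutoff` ångströms of each other.
    Returns a dict mapping each node index to a list of its neighbours.
    """
    graph = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph


if __name__ == "__main__":
    # Four made-up positions along a line, purely for demonstration.
    toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for node, neighbours in build_spatial_graph(toy_chain).items():
        print(node, "->", neighbours)
```

The appeal of a graph like this is that it records which parts of the chain end up near each other once the protein has folded, regardless of how far apart they are along the chain itself.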

This progress is hugely significant. Janet Thornton, a structural biologist at the European Molecular Biology Laboratory and a past CASP assessor, comments: ‘this is a problem that I was beginning to think would not get solved in my lifetime’. This type of breakthrough demonstrates the impact AI can have on scientific discovery and its potential to dramatically accelerate progress in some of the most fundamental fields that explain and shape our world.