- Are Deep Learning Methods Useful for Docking?
2022 saw the emergence of deep learning (DL) methods for docking. These methods, trained on data from the PDB, learned to predict ligand poses from the interactions observed in known protein-ligand complexes. There were papers on DiffDock, EquiBind, TANKBind, and more. In 2023, these methods underwent additional scrutiny, and it turned out that they weren’t quite as good as originally reported. Criticism of DL docking methods fell into three categories: the methods used for comparison, biases in the datasets used for evaluation, and the quality of the generated structures.
1.1 Are the Comparisons Fair?
One potential advantage of DL docking programs is their ability to perform “blind docking”. Unlike conventional docking programs, the DL methods don’t require the specification of a binding site; they use what they learned from training data to infer both the binding site and the ligand pose. In earlier comparative studies, conventional docking programs were simply given an entire protein structure without a binding site specification. Since this is not how they were designed to operate, the conventional methods were slow and inaccurate. A preprint by Yu and coworkers at DP Technology decomposed blind docking into two subproblems: finding the pocket and docking into a predefined pocket. The authors found that DL docking programs excelled at pocket finding but didn’t perform as well as conventional methods when the pocket was predefined.
Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking? https://arxiv.org/abs/2302.07134
1.2 Training/Test Set Bias
Most DL docking programs were trained and tested on time splits from the PDB. For instance, DiffDock was trained on structures deposited in the PDB before 2019 and tested on structures deposited in 2019 and later. Quite a few structures in the test set are similar to those in the training set. In these cases, prediction becomes a simple table lookup. One way to address this bias is to create train/test splits that don’t contain similar structures.
A paper by Kanakala and coworkers from IIT analyzed several datasets commonly used for affinity prediction, including PDBBind and KIBA, and found that typical splitting methods overestimate model performance. The authors proposed a clustered cross-validation strategy that provides more realistic estimates of model performance.
Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets https://pubs.acs.org/doi/10.1021/acsomega.2c06781
A preprint by Li and coworkers from UC Berkeley described a similar effort. The authors cleaned the PDBBind dataset and divided it into segments that minimized leakage between the training and test sets. This new dataset was then used to retrain and evaluate several widely used scoring functions.
Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction https://arxiv.org/abs/2308.09639
1.3 Structure Quality
The third problem with many DL docking programs is the quality of the generated structures. To put it technically, the structures were really messed up. Bond lengths and angles were off, and there were often steric clashes with the protein. To address these challenges, Buttenschoen and colleagues from Oxford University developed PoseBusters, a Python package for evaluating the quality of docked poses. PoseBusters performs a series of geometry checks on docked poses and also evaluates intra- and intermolecular interactions. The authors used the Astex Diverse Set and a newly developed PoseBusters benchmark set to evaluate five popular deep learning docking programs and two conventional docking approaches. The conventional docking programs dramatically outperformed the deep learning methods on both datasets. In most cases, more than half of the solutions generated by the DL docking programs failed the PoseBusters validity tests. In contrast, only 2-3% of the poses from the conventional docking programs failed to validate.
PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences https://pubs.rsc.org/en/content/articlepdf/2024/sc/d3sc04185a
Many of the same problems encountered with DL methods for docking can also impact generative models that produce structures in the context of a protein binding site. A paper by Harris and coworkers from the University of Cambridge describes PoseCheck, a tool similar to PoseBusters for identifying unrealistic structures. PoseCheck evaluates steric clashes, ligand strain energy, and protein-ligand interactions to identify problematic structures. In addition, structures are redocked with AutoDock Vina to confirm the validity of the proposed binding mode. In evaluating several recently published generative models, the authors identify failure modes that will hopefully influence future work on structure-based generative design.
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models https://arxiv.org/abs/2308.07413
1.4 Reporting Scientific Advances in Press Releases
The other (potentially) significant docking developments in 2023 weren’t reported in detailed papers; they were announced in what can best be described as press releases. In early October, the Baker group at the University of Washington posted a short preprint previewing RoseTTAFold All-Atom, the latest incarnation of their RoseTTAFold software for protein structure prediction. In a brief section entitled “Predicting Protein-Small Molecule Complexes”, the authors mention their efforts to generate structures of bound non-covalent and covalent small-molecule ligands. On benchmark structures from the CAMEO blind docking competition, RoseTTAFold All-Atom generated high-quality structures (<2Å RMSD) in 32% of cases, compared favorably with an 8% success rate for the conventional docking program AutoDock Vina.
Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full.pdf
In late October, the DeepMind group published a blog post entitled “A glimpse of the next generation of AlphaFold,” where, among other things, they made this statement.
“Our latest model sets a new bar for protein-ligand structure prediction by outperforming the best reported docking methods, without requiring a reference protein structure or the location of the ligand pocket — allowing predictions for completely novel proteins that have not been structurally characterized before.”
The accompanying whitepaper provided impressive performance statistics for the PoseBusters set described above. The AlphaFold method achieved a 73.6% success rate compared to 52.3% for the conventional docking program AutoDock Vina. The AlphaFold performance was even more impressive when considering how the comparison was performed. While Vina was provided protein coordinates and a binding site as input, AlphaFold was only given the protein sequence and a SMILES string for the ligand.
A glimpse of the next generation of AlphaFold https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold/
Performance and structural coverage of the latest, in-development AlphaFold model https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf
Unfortunately, neither the RoseTTAFold All-Atom preprint nor the DeepMind whitepaper contained any details on the methodology. In addition, at the time I’m writing this, neither group has released the code for their methods. Hopefully, papers with details on the methods will appear soon, along with public code releases. It’s safe to assume that others, such as the OpenFold consortium, who tend to be more forthcoming with their code and methods, are working on similar ideas.
Perspective: Like many other areas in AI, DL docking programs began with a period of exuberance. The community was excited, and everyone thought the next revolution was imminent. As people started using these methods, they discovered multiple issues that needed to be resolved. We’re not necessarily in the valley of despair, but this is definitely a “hey, wait a second” moment. I’m confident that, with time, these methods will improve, and I wouldn’t be surprised to see DL docking methods incorporating ideas from more traditional, physics-inspired approaches. Hopefully, newly developed, unbiased training and test sets and tools like PoseBusters will enable a more rigorous evaluation of docking and scoring methods. With the co-folding approaches in RoseTTAFold All-Atom and AlphaFold, we’ll have to wait and hope for the code to be released so that the community can evaluate the practical utility of these methods.
- Can We Use AlphaFold2 Structures for Ligand Discovery and Design?
2.1 Experimentally Evaluating AlphaFold2 Structures
Since it took the CASP14 competition by storm in 2020, AlphaFold2 (AF2) has been of great interest to people involved in drug discovery and numerous other fields. In addition to benchmark comparisons with the PDB, there have been several efforts to experimentally evaluate the structural models generated by AF2. Rather than simply comparing the atomic coordinates of AF2 structures with corresponding PDB structures, a paper by Terwilliger and colleagues from Los Alamos National Laboratory compares AF2 structures with the reported crystallographic electron density maps. The authors argue that this approach puts less weight on loops and sidechains that are poorly resolved experimentally. They found that prediction accuracy varied across individual structures, and that regions with a per-residue confidence score (pLDDT) > 90 deviated by less than 0.6Å from the deposited model. They suggest that even inaccurate regions of AF2 structures can provide plausible hypotheses for experimental refinement.
AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination https://www.nature.com/articles/s41592-023-02087-4
A paper by McCafferty and coworkers from UT Austin used mass spec data from protein cross-linking experiments to evaluate the ability of AF2 to model intracellular protein conformations. The authors compared experimentally observed distances in cross-linked proteins from eukaryotic cilia with corresponding distances from AF2 structures and found an 86% concordance. In 42% of cases, all distances within the predicted structure were consistent with those observed in cross-linking experiments.
Does AlphaFold2 model proteins’ intracellular conformations? An experimental test using cross-linking mass spectrometry of endogenous ciliary proteins https://www.nature.com/articles/s42003-023-04773-7
2.2 Generating Multiple Protein Conformations with AlphaFold2
In 2023, there was great interest in AF2’s ability to generate multiple relevant protein conformations. A paper by Wayment-Steele and coworkers showed that clustering the multiple sequence alignment (MSA) used by AF2 enabled the program to sample alternative conformational states.
Predicting multiple conformations via sequence clustering and AlphaFold2 https://www.nature.com/articles/s41586-023-06832-9
These ideas have spurred additional investigations and stirred up a bit of controversy. A paper by Chakravarty and coworkers from NCBI and NIH examined the performance of AF2 on 93 fold-switching proteins. The authors found that AF2 only identified the switched conformation in 25% of the proteins in the AF2 training set and 14% of proteins not in the training set.
AlphaFold2 has more to learn about protein energy landscapes https://www.biorxiv.org/content/10.1101/2023.12.12.571380v1
Wayment-Steele and coworkers proposed that their clustering of the MSAs captured the coevolution of related proteins. A subsequent preprint from Porter and coworkers at NCBI challenged this assumption and demonstrated that multiple protein conformations could be generated from single sequences.
ColabFold predicts alternative protein structures from single sequences, coevolution unnecessary for AF-cluster https://www.biorxiv.org/content/10.1101/2023.11.21.567977v2
2.3 Docking into AlphaFold2 Structures
After the publication of the AF2 paper and the subsequent release of the code, many groups began experiments to determine whether structures generated by AF2 and related methods could be used for ligand design. The initial results weren’t promising. Díaz-Rovira and coworkers from the Barcelona Supercomputing Center compared virtual screens using protein crystal structures and structures predicted by AF2 for 11 proteins. The authors found that the average enrichment factor at 1% for the X-ray structures was double that of the AF2 structures.
Are Deep Learning Structural Models Sufficiently Accurate for Virtual Screening? Application of Docking Algorithms to AlphaFold2 Predicted Structures https://pubs.acs.org/doi/10.1021/acs.jcim.2c01270
Holcomb and coworkers from Scripps took a different approach and compared the performance of AutoDock-GPU on AF2 structures with its performance on the corresponding crystal structures from the PDBBind set. The authors noted a significant loss in docking accuracy with the AF2 structures: AutoDock-GPU generated poses within 2Å of the experimental pose in 41% of cases when docking into the crystal structures, but the success rate dropped to 17% for the AF2 structures. On a brighter note, the authors reported that the docking success rate for AF2 structures was better than that for the corresponding apo structures.
Evaluation of AlphaFold2 structures as docking targets https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4530
A paper by Karelina and coworkers from Stanford examined the utility of AF2 for modeling the structures of GPCRs. While the authors found that AF2 could model structures and binding pockets with high fidelity, the docking performance of the models was poor. The results of this study were consistent with those in the papers described above. In this case, the success rate for docking into AF2 structures (16%) was a third of that for experimentally determined structures (48%). As mentioned above, it was encouraging that the docking performance of the AF2 structures was better than that of structures with other ligands bound.
How accurately can one predict drug binding modes using AlphaFold models? https://www.biorxiv.org/content/10.1101/2023.05.18.541346v2
While the results in the papers above aren’t encouraging, all hope may not be lost. In the last week of 2023, a paper from Brian Shoichet, Bryan Roth, and coworkers reported successful prospective virtual screening results with AF2 structures of the sigma2 and 5-HT2A receptors. The odd bit here is that while the AF2 models performed well prospectively, their retrospective performance on prior screens of the same targets wasn’t good. To demonstrate that they got the right answer for the right reason, the authors solved a cryo-EM structure of one of the 5-HT2A agonists bound to the receptor and found that the docked pose was consistent with the experimental structure. The authors suggest that AF2 structures may sample the underlying manifold of conformations and posit that retrospective screening studies such as those described above may not predict prospective performance.
AlphaFold2 structures template ligand discovery https://www.biorxiv.org/content/10.1101/2023.12.20.572662v1
Many earlier papers describing the use of AF2 structures for docking suggested that performance could be improved by refining the predicted structures. Zhang and coworkers at Schrödinger compared virtual screening performance using holo structures, apo structures, and AF2 structural models. Across 27 targets from the DUD-E set, the authors found that the enrichment factor at 1% (EF1%) for AF2 structures (13) was similar to that for apo structures (11). However, EF1% increased to 18 when the AF2 structures were refined using induced-fit docking.
Benchmarking Refined and Unrefined AlphaFold2 Structures for Hit Discovery https://pubs.acs.org/doi/10.1021/acs.jcim.2c01219
Perspective: The publication of the AF2 paper and the subsequent release of the code have sparked work in numerous areas. There are already more than 17,000 citations to the original AF2 paper. Protein structure prediction has become an integral component of experimental structural biology; programs like Phenix can generate AF2 models that are subsequently fit to experimental data. While there is still work to be done, AF2 may be capable of generating ensembles of relevant protein conformations, and it’s exciting to think about how this work will progress as we achieve tighter integration between protein structure prediction and physics-based modeling. For now, the jury is still out on the utility of predicted protein structures for drug design. While the results of retrospective evaluations are somewhat disappointing, the recent prospective success from the Shoichet and Roth labs is encouraging.