Unlock Molecular Secrets: Predict Structures with Chai-1’s AI Model

Discover Chai-1, the powerful AI model for molecular structure prediction! Ideal for drug discovery, it excels at proteins, DNA, and more.

Sep 12, 2024

Chai Discovery has launched Chai-1, a multimodal foundation model for molecular structure prediction aimed at tasks such as drug discovery. Chai-1 makes unified predictions across proteins, small molecules, DNA, RNA, and other biomolecules, and performs well on benchmarks such as PoseBusters and CASP15.

Chai-1 achieves a 77% success rate on the PoseBusters benchmark (compared to 76% for AlphaFold 3) and a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (compared to 0.801 for ESM3-98B).

Unlike many models that rely on multiple sequence alignment (MSA), Chai-1 can be run without MSA and still maintain high accuracy. For multimeric structures, Chai-1 can even outperform AlphaFold-Multimer.

Chai-1 is available through a free web interface, which can also be used for commercial applications such as drug discovery, and its code base is released to support non-commercial use. The launch aims to advance the wider ecosystem through collaboration with the research and industry communities.

Main Functions of Chai-1

Biomolecule structure prediction

Chai-1 can predict the three-dimensional structure of biological molecules such as proteins and nucleic acids directly from the raw molecular sequence and chemical information. This matters for studying how molecules fold, how they interact with each other, and what they do in cells.

Protein-ligand structure prediction

Chai-1 is good at predicting the interaction structure between proteins and drug molecules (ligands), helping researchers understand how drugs bind to proteins and providing a reference for drug design.

Protein complex prediction

This model can predict the three-dimensional structure of protein-protein complexes, especially the interactions between protein multimers, which is crucial for studying protein functions and designing protein drugs.

Single sequence structure prediction

Chai-1 can perform highly accurate structure prediction from a single sequence input without multiple sequence alignment (MSA), which enables it to maintain excellent performance even when there is insufficient data or no relevant sequence information.

Accurate prediction based on experimental data

Chai-1 can use the constraint information provided by experimental data (such as mass spectrometry data or epitope mapping) to further improve the accuracy of structure prediction, especially for complex molecular interactions.

Antibody-antigen interaction prediction

Chai-1 has a very high prediction accuracy for antibody-antigen interactions, which can help researchers accurately predict the binding mode between antibodies and antigens and promote the design and development of antibody drugs.

Multimodal input support

Chai-1 supports multiple input forms, including protein sequences, chemical ligand information, experimental data, etc., making it more capable of predicting complex molecular structures and suitable for a wide range of biological and drug development tasks.
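To make the idea of a multi-entity input concrete, here is a minimal sketch that assembles a protein and a small-molecule ligand into a single FASTA-style text. The header layout (`>kind|name=...`), the helper name `build_fasta`, and the example sequence are assumptions for illustration only; check the Chai-1 code base for the input format it actually expects.

```python
def build_fasta(entities):
    """Build a FASTA-style string from (kind, name, payload) tuples.

    kind    -- entity type label, e.g. "protein" or "ligand" (assumed labels)
    name    -- free-text identifier for the entity
    payload -- amino-acid sequence for proteins, SMILES string for ligands
    """
    lines = []
    for kind, name, payload in entities:
        lines.append(f">{kind}|name={name}")  # hypothetical header layout
        lines.append(payload)
    return "\n".join(lines) + "\n"


# Toy input: a short made-up peptide plus aspirin given as SMILES.
example = build_fasta([
    ("protein", "target", "MKTAYIAKQR"),
    ("ligand", "aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),
])
print(example)
```

Keeping every entity in one file is what lets a single prediction cover the whole complex rather than each molecule separately.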

Architecture of Chai-1 Model

Overall Architecture

Chai-1 is a deep neural network whose architecture resembles earlier biomolecular structure prediction models, with several key improvements. The design accepts multiple inputs, including protein sequences, language model embeddings, and experimental constraint data, which makes predictions more flexible and accurate.

Language Model Embeddings

Chai-1 adds protein language model embeddings to the architecture: a per-residue embedding computed from the protein sequence. The embeddings come from a protein language model with 3 billion parameters, which is designed to capture grammatical and structural information in the sequence. This lets Chai-1 make high-accuracy predictions in single sequence mode, so the model still performs well when no multiple sequence alignment (MSA) information is available.

Constraint features

Chai-1 supports experimental constraint input, such as structural data or epitope mapping information obtained through mass spectrometry experiments. The constraint features of the model include the following:

Pocket constraints: By providing distance constraints on the molecular binding pocket, the model is able to better predict the location of intermolecular interactions.
Contact constraints: By specifying the contact distances between molecular residues, the model is able to predict the relative positions of residues in multi-molecular systems.
Docking constraints: The model predicts the docking pattern of a molecular system based on the distance constraints between different chains or groups of molecules.

These constraint features are randomly dropped out during training, ensuring that the model does not over-rely on specific constraints and stays general at inference time.
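The random dropping of constraints during training can be pictured with a short sketch. This is not Chai-1's training code: the constraint dictionaries, their field names, the distance values, and the 0.5 keep probability are all hypothetical, chosen only to show each constraint being kept or discarded independently per training step.

```python
import random


def drop_constraints(constraints, keep_prob, rng):
    """Keep each constraint independently with probability keep_prob."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be between 0 and 1")
    # rng.random() is in [0, 1), so keep_prob=1.0 keeps all, 0.0 keeps none.
    return [c for c in constraints if rng.random() < keep_prob]


# Hypothetical constraint records, one per constraint type described above.
example_constraints = [
    {"kind": "pocket", "chain": "A", "ligand_chain": "B", "max_distance": 6.0},
    {"kind": "contact", "residues": ("A:45", "B:12"), "max_distance": 8.0},
    {"kind": "docking", "chains": ("A", "B")},
]

rng = random.Random(0)
for step in range(3):
    kept = drop_constraints(example_constraints, keep_prob=0.5, rng=rng)
    print(step, [c["kind"] for c in kept])
```

Because the model sees many different subsets, including the empty one, it has to learn to predict well both with and without constraints.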

Multimodal input and optional structure templates

In addition to language model embeddings and experimental constraints, Chai-1 also supports multimodal inputs such as multiple sequence alignments (MSA) and structural templates. MSA information is often used to capture co-evolutionary signals in protein sequences, while structural templates provide additional spatial constraint information, which helps improve the prediction accuracy of complex structures.

Combining these inputs allows Chai-1 to stay accurate and flexible even when some kinds of experimental data or structural information are scarce.

Improved training and inference strategies

Chai-1 was trained on a large body of protein and biomolecular structure data using large-scale GPU parallelism. The training data, with a 2021 cutoff, came from the Protein Data Bank (PDB) and the AlphaFold Database (AFDB), with structural templates drawn from the PDB70 database.

During inference, the model can generate multiple predicted structures through random sampling and extended search strategies, then select the best prediction based on its confidence score. Dropout can be disabled during inference to make results more consistent and repeatable.
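The sample-then-rank idea can be sketched as follows. The sampler here is a toy stand-in that returns a random confidence value, not a real model call, and the names `toy_sampler` and `sample_and_select` are hypothetical. The fixed seed illustrates why removing uncontrolled randomness makes runs repeatable.

```python
import random


def toy_sampler(rng):
    """Stand-in for one stochastic model run (hypothetical output fields)."""
    return {"structure_id": rng.randrange(10**6), "confidence": rng.random()}


def sample_and_select(sampler, num_samples, seed):
    """Draw several candidates and return the most confident one."""
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    candidates = [sampler(rng) for _ in range(num_samples)]
    best = max(candidates, key=lambda c: c["confidence"])
    return best, candidates


best, candidates = sample_and_select(toy_sampler, num_samples=5, seed=0)
print(len(candidates), "candidates; best confidence:", round(best["confidence"], 3))
```

Drawing more samples raises the chance that at least one candidate is good, at the cost of more compute per prediction.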

Modular Design

Chai-1's architecture is modular: individual input features can be enabled or disabled at inference time to suit the task. For example, users can rely on language model embeddings when MSA data is unavailable, or supply experimental constraints to improve accuracy on specific molecular systems.
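One way to picture this kind of feature switching is a small configuration object. The class, field names, and selection rule below are hypothetical and do not mirror the real Chai-1 interface; they only illustrate falling back to language model embeddings when no MSA is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputConfig:
    """Hypothetical switches for optional input features."""
    use_lm_embeddings: bool = True
    use_msa: bool = False
    use_templates: bool = False
    use_constraints: bool = False


def choose_config(msa_available: bool, constraints_available: bool) -> InputConfig:
    """Enable only the optional inputs that are actually on hand."""
    return InputConfig(
        use_lm_embeddings=True,  # always available: needs only the sequence
        use_msa=msa_available,
        use_templates=False,
        use_constraints=constraints_available,
    )


def active_features(cfg: InputConfig) -> list:
    """List the names of the features that are switched on."""
    names = ["lm_embeddings", "msa", "templates", "constraints"]
    flags = [cfg.use_lm_embeddings, cfg.use_msa, cfg.use_templates, cfg.use_constraints]
    return [name for name, flag in zip(names, flags) if flag]


print(active_features(choose_config(msa_available=False, constraints_available=True)))
# prints ['lm_embeddings', 'constraints']
```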

Experimental Results of Chai-1

Chai-1 has demonstrated excellent performance in a variety of biomolecule prediction tasks. The following is a summary of the results of key experiments:

Protein-ligand prediction: On the PoseBusters benchmark, Chai-1 achieves a 77% prediction success rate, comparable to AF3. When combined with docking constraints, the success rate increases to 81%.
Protein multimer prediction: Chai-1 in single sequence mode, without MSA, performs comparably to the AF2.3 model with MSA, and even surpasses AF2.3 in some evaluations.
Antibody-protein prediction: Chai-1 excels at predicting antibody-antigen interactions, with significantly higher accuracy when constraints are used, achieving a higher DockQ success rate than AF2.3.
Protein monomer prediction: Without MSA, Chai-1 is slightly less accurate than AF2.3, but with MSA it performs better than AF2.3.

1. Protein-ligand prediction

Test set: The PoseBusters benchmark test set is used for evaluation, which includes 427 protein-ligand structures.

Evaluation metric: Based on the ligand root mean square deviation (RMSD); a prediction counts as successful when the RMSD is below 2 Å.
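The metric above can be illustrated with a minimal sketch. It assumes the predicted and reference ligand atoms are already matched one-to-one and expressed in the same coordinate frame; a full benchmark evaluation also has to handle alignment and ligand symmetry, which this toy version ignores. The function names and coordinates are made up for illustration.

```python
import math


def ligand_rmsd(pred, ref):
    """RMSD in Å between two matched lists of (x, y, z) atom coordinates."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pred, ref)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pred))


def success_rate(rmsds, threshold=2.0):
    """Fraction of predictions whose ligand RMSD is below the threshold."""
    return sum(r < threshold for r in rmsds) / len(rmsds)


# Toy example: every atom is displaced by exactly 1 Å along z.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(ligand_rmsd(pred, ref))               # prints 1.0
print(success_rate([1.0, 2.5, 0.5, 3.0]))   # prints 0.5
```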

Result:

Chai-1’s prediction success rate is 77.05%, comparable to AlphaFold3 (AF3)’s 76.34%.
When docking constraints are used, the success rate of Chai-1 increases to 81.20%.
In some cases, Chai-1 predicted deeper ligand binding pockets than the true structure, suggesting that the model is able to capture potential binding sites.

……
