EmbedDiff-ESM2 is a comprehensive protein sequence generation pipeline that combines large-scale pretrained protein embeddings (ESM-2) with a latent diffusion model to explore and sample from the vast protein sequence space. It generates novel sequences that preserve semantic and evolutionary properties without relying on explicit structural data, and evaluates them through a suite of biologically meaningful analyses including logistic regression classification.
To run the entire EmbedDiff pipeline from end to end:
python run_embeddiff_pipeline.py
EmbedDiff-ESM2 uses ESM-2 (Evolutionary Scale Modeling v2) to project protein sequences into a high-dimensional latent space rich in evolutionary and functional priors. A denoising latent diffusion model is trained to learn the distribution of these embeddings and generate new ones from random noise. These latent vectors represent plausible protein-like states and are decoded into sequences using a Transformer decoder with configurable stochastic sampling ratios.
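The embedding step can be reproduced with the public `fair-esm` package. The sketch below is illustrative only (function names and the mean-pooling choice are assumptions); the pipeline's own implementation lives in `scripts/esm2_embedder.py`.

```python
# Minimal sketch of ESM-2 embedding with the fair-esm package (pip install fair-esm).
# Names here are illustrative; the pipeline's own logic lives in scripts/esm2_embedder.py.
import torch
import esm

def embed_sequences(seqs):
    """Return one mean-pooled 1280-d embedding per (name, sequence) pair."""
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter(seqs)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]              # (batch, seq_len, 1280)
    embeddings = []
    for i, (_, seq) in enumerate(seqs):
        # Skip the BOS token and average over residue positions only.
        embeddings.append(reps[i, 1:len(seq) + 1].mean(dim=0))
    return torch.stack(embeddings)

emb = embed_sequences([("prot1", "MKTAYIAKQR"), ("prot2", "GSHMSLYDDD")])
print(emb.shape)  # torch.Size([2, 1280])
```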
The pipeline includes logistic regression analysis to evaluate embedding quality and domain separation, followed by comprehensive sequence validation via entropy analysis, cosine similarity, BLAST alignment, and embedding visualization (t-SNE, MDS). A final HTML report presents all figures and results in an interactive format.
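A domain-separation probe on the real embeddings could look like the scikit-learn sketch below. The label file `embeddings/domain_labels.npy` is hypothetical; the pipeline's actual probe is `scripts/logistic_regression_probe_esm2.py`.

```python
# Sketch of a logistic-regression probe for domain separation (scikit-learn).
# Assumes embeddings/esm2_embeddings.npy exists; the label file below is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X = np.load("embeddings/esm2_embeddings.npy")      # (n_sequences, 1280)
labels = np.load("embeddings/domain_labels.npy")   # hypothetical per-sequence domain labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(classification_report(y_te, pred))   # per-class precision/recall
print(confusion_matrix(y_te, pred))
```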
The full EmbedDiff-ESM2 pipeline is modular and proceeds through the following stages:
Real sequences are first embedded with the pretrained esm2_t33_650M_UR50D model.

Architecture Details:
Diffusion Process:
- Forward Process: gradual addition of Gaussian noise following q(x_t | x_0) = √(ᾱ_t)·x_0 + √(1 - ᾱ_t)·ε
- Reverse Process: learned denoising using p_θ(x_{t-1} | x_t) with noise prediction
Training Configuration:
- Loss Function: mean squared error (MSE) between predicted and actual noise, L = ‖ε - ε_θ(x_t, t)‖²
This architecture enables the model to learn the complex distribution of protein embeddings and generate novel, biologically plausible latent representations through iterative denoising.
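The forward process and training objective above can be sketched in a few lines of PyTorch. The MLP noise predictor and the linear beta schedule below are stand-ins for illustration; the pipeline's actual model lives in `models/latent_diffusion.py`.

```python
# Minimal PyTorch sketch of the forward noising step and MSE objective.
# The MLP and schedule are illustrative; the real model is models/latent_diffusion.py.
import torch
import torch.nn as nn

T, DIM = 1000, 1280
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product, alpha-bar_t

eps_model = nn.Sequential(                        # predicts epsilon from (x_t, t)
    nn.Linear(DIM + 1, 2048), nn.SiLU(), nn.Linear(2048, DIM)
)
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

def training_step(x0):
    """One diffusion training step on a batch of clean embeddings x0."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # forward process q(x_t | x_0)
    t_feat = (t.float() / T).unsqueeze(1)             # crude timestep conditioning
    loss = nn.functional.mse_loss(eps_model(torch.cat([x_t, t_feat], dim=1)), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(training_step(torch.randn(32, DIM)))
```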
The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.
Current Configuration:
This configuration produces sequences with approximately 30-55% sequence identity to known proteins, striking a practical balance between novelty and plausibility.
This decoding step is fully configurable:
All decoding parameters, including the stochastic ratio, live in scripts/transformer_decode_esm2.py. Local BLAST alignment (Step 7d) uses blastp.

Project Structure:

EmbedDiff_ESM/
├── README.md                              # Project overview and documentation
├── requirements.txt                       # Python dependencies
├── run_embeddiff_pipeline.py              # Master pipeline script
│
├── data/                                  # Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta    # Input protein sequences
│   ├── decoded_embeddiff_esm2.fasta       # Generated sequences
│   ├── decoder_dataset_esm2.pt            # Decoder training dataset
│   └── blast_results/                     # BLAST analysis results
│       ├── blast_summary_local_esm2.csv   # BLAST summary
│       └── [individual BLAST XML and FASTA files]
│
├── embeddings/                            # Latent vector representations
│   ├── esm2_embeddings.npy                # Real sequence embeddings
│   ├── esm2_stats.npz                     # Embedding statistics
│   ├── sampled_esm2_embeddings.npy        # Generated embeddings
│   ├── tsne_coords_esm2.npy               # t-SNE coordinates
│   └── tsne_labels_esm2.npy               # t-SNE labels
│
├── figures/                               # All generated plots and reports
│   ├── fig_tsne_by_domain_esm2.png        # t-SNE by domain
│   ├── logreg_per_class_recall_esm2.png   # Logistic regression recall
│   ├── logreg_confusion_matrix_esm2.png   # Logistic regression confusion matrix
│   ├── fig2b_loss_esm2.png                # Diffusion training loss
│   ├── fig3a_generated_tsne_esm2.png      # Generated embeddings t-SNE
│   ├── fig5a_decoder_loss_esm2.png        # Decoder training loss
│   ├── fig5a_real_real_cosine_esm2.png    # Real-Real cosine similarity
│   ├── fig5b_gen_gen_cosine_esm2.png      # Generated-Generated cosine similarity
│   ├── fig5c_real_gen_cosine_esm2.png     # Real-Generated cosine similarity
│   ├── fig5b_identity_histogram_esm2.png  # Identity histogram
│   ├── fig5c_entropy_scatter_esm2.png     # Entropy vs. Identity scatter
│   ├── fig5d_all_histograms_esm2.png      # All histograms
│   ├── fig5f_tsne_domain_overlay_esm2.png # t-SNE domain overlay
│   ├── logreg_classification_results_esm2.csv  # Logistic regression results
│   └── embeddiff_esm2_summary_report.html # Final HTML report
│
├── scripts/                               # Core processing scripts
│   ├── esm2_embedder.py                   # Step 2a: ESM-2 embedding
│   ├── logistic_regression_probe_esm2.py  # Step 2b: Logistic regression analysis
│   ├── first_tsne_embedding_esm2.py       # Step 2c: t-SNE of real embeddings
│   ├── train_embeddiff_esm2.py            # Step 3: Train latent diffusion model
│   ├── sample_embeddings_esm2.py          # Step 4: Sample new embeddings
│   ├── build_decoder_dataset_esm2.py      # Step 5a: Build decoder training set
│   ├── train_transformer_esm2.py          # Step 5b: Train decoder
│   ├── transformer_decode_esm2.py         # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_domain_overlay_esm2.py   # Step 7a: t-SNE comparison
│   ├── cosine_similarity_esm2.py          # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity_esm2.py      # Step 7c: Entropy vs. identity filter
│   ├── blastlocal_esm2.py                 # Step 7d: Local BLAST alignment
│   └── generate_esm2_report.py            # Step 8: Generate final HTML report
│
├── models/                                # ML model architectures
│   ├── latent_diffusion.py                # EmbedDiff-ESM2 diffusion model
│   └── decoder_transformer.py             # Transformer decoder
│
├── utils/                                 # Utility and helper functions
│   └── esm2_embedder.py                   # ESM-2 embedding utilities
│
└── checkpoints/                           # Model checkpoints
    ├── best_embeddiff_mlp_esm2.pth        # Best diffusion model
    ├── decoder_transformer_best_esm2.pth  # Best decoder model
    └── decoder_transformer_last_esm2.pth  # Last decoder checkpoint
# Clone the repository
git clone <repository-url>
cd EmbedDiff_ESM
# Install dependencies
pip install -r requirements.txt
Place your input protein sequences in data/curated_thioredoxin_reductase.fasta, then:

# Run complete pipeline
python run_embeddiff_pipeline.py
# Or skip specific steps
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion
After the run completes:
- Generated sequences are written to data/decoded_embeddiff_esm2.fasta
- All plots are saved in the figures/ directory
- Open figures/embeddiff_esm2_summary_report.html for comprehensive results

To adjust the decoding behavior, edit scripts/transformer_decode_esm2.py:
STOCHASTIC_RATIO = 0.6 # 60% stochastic, 40% reference-guided
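As an illustration only (not the project's exact decode logic), a ratio like this could be applied per decoding step: with probability STOCHASTIC_RATIO the next token is sampled from the decoder's distribution, otherwise the argmax (reference-guided) token is taken.

```python
# Illustrative sketch of how a STOCHASTIC_RATIO could mix stochastic sampling with
# reference-guided (greedy) decoding. The real logic is in scripts/transformer_decode_esm2.py.
import torch

def pick_token(logits, stochastic_ratio=0.6, temperature=1.0):
    """Sample from the decoder's next-token logits with probability
    `stochastic_ratio`; otherwise take the argmax (reference-guided)."""
    if torch.rand(1).item() < stochastic_ratio:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    return int(logits.argmax())

# Example with dummy logits over a 20-letter amino-acid vocabulary.
print(pick_token(torch.randn(20)))
```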
Use the --skip flag to skip specific steps:
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion sample decoder_data decoder_train decode tsne_overlay cosine entropy blast html
Generated sequences can additionally be assessed for structural plausibility with external structure-prediction tools. The table below summarizes the key pipeline settings and results:
| Metric | Value | Description |
|---|---|---|
| Generated Sequences | 240 | High-quality synthetic proteins with domain-specific conditioning |
| Sequence Identity | 37-49% | Range of similarity to real sequences (BLAST validation) |
| Training Epochs | 300 | Diffusion model training with early stopping |
| Batch Size | 32 | Optimized for training stability |
| Learning Rate | 1e-4 | Adam optimizer configuration |
| Timesteps | 1000 | Diffusion process steps for smooth noise scheduling |
| Embedding Dimension | 1280 | ESM-2 latent space size |
| Data Split | 80/10/10 | Train/validation/test ratio with stratified sampling |
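The 80/10/10 stratified split in the table could be produced with two chained scikit-learn splits, as in the sketch below; the label file is hypothetical and the pipeline's own splitting code may differ.

```python
# Sketch of an 80/10/10 stratified split using two chained scikit-learn splits.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.load("embeddings/esm2_embeddings.npy")
labels = np.load("embeddings/domain_labels.npy")   # hypothetical per-sequence labels

# First split off 20% for validation + test, then split that half-and-half.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)
print(len(X_train), len(X_val), len(X_test))       # ~80% / 10% / 10%
```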
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.