EmbedDiff_ESM

🧬 EmbedDiff-ESM2: Latent Diffusion Pipeline for De Novo Protein Sequence Generation


EmbedDiff-ESM2 is a comprehensive protein sequence generation pipeline that combines large-scale pretrained protein embeddings (ESM-2) with a latent diffusion model to explore and sample from the vast protein sequence space. It generates novel sequences that preserve semantic and evolutionary properties without relying on explicit structural data, and evaluates them through a suite of biologically meaningful analyses including logistic regression classification.


🚀 Quick Start (one-liner)

To run the entire EmbedDiff pipeline from end to end:

python run_embeddiff_pipeline.py

πŸ” What Is EmbedDiff-ESM2?

EmbedDiff-ESM2 uses ESM-2 (Evolutionary Scale Modeling v2) to project protein sequences into a high-dimensional latent space rich in evolutionary and functional priors. A denoising latent diffusion model is trained to learn the distribution of these embeddings and generate new ones from random noise. These latent vectors represent plausible protein-like states and are decoded into sequences using a Transformer decoder with configurable stochastic sampling ratios.
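For concreteness, here is a minimal sketch of embedding a sequence with ESM-2 via the fair-esm package. The model variant (esm2_t33_650M_UR50D, which produces the 1280-dimensional vectors used throughout this project) and the mean-pooling step are assumptions; the pipeline's own implementation lives in scripts/esm2_embedder.py.

```python
# Hedged sketch: embedding one sequence with ESM-2 via the fair-esm package.
# Model choice and mean pooling are assumptions; see scripts/esm2_embedder.py
# for the pipeline's actual implementation.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # 1280-d embeddings
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
per_residue = out["representations"][33]  # (batch, seq_len + 2, 1280)

# Mean-pool over residues, excluding the BOS/EOS positions
embedding = per_residue[0, 1 : len(strs[0]) + 1].mean(0)  # shape (1280,)
```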

The pipeline includes logistic regression analysis to evaluate embedding quality and domain separation, followed by comprehensive sequence validation via entropy analysis, cosine similarity, BLAST alignment, and embedding visualization (t-SNE, MDS). A final HTML report presents all figures and results in an interactive format.


📌 Pipeline Overview

The full EmbedDiff-ESM2 pipeline is modular and proceeds through the following stages:

Step 1: Input Dataset

A curated FASTA file of protein sequences (data/curated_thioredoxin_reductase.fasta) is the sole input; no structural data is required.


Step 2a: ESM-2 Embedding

Each sequence is embedded into a 1280-dimensional vector with ESM-2 (scripts/esm2_embedder.py), producing embeddings/esm2_embeddings.npy.


Step 2b: Logistic Regression Probe Analysis
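A logistic regression probe is fit on the real embeddings to test how linearly separable the protein domains are in ESM-2 space; strong per-class recall indicates the embeddings preserve domain identity. Below is a minimal scikit-learn sketch; the label file (tsne_labels_esm2.npy used as per-sequence domain labels) is an assumption, and the real analysis lives in scripts/logistic_regression_probe_esm2.py.

```python
# Hedged sketch: logistic regression probe on the real ESM-2 embeddings.
# The label source is an assumption based on the project layout.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = np.load("embeddings/esm2_embeddings.npy")   # (n_sequences, 1280)
y = np.load("embeddings/tsne_labels_esm2.npy")  # domain labels (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, probe.predict(X_te)))
```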


Step 2c: t-SNE of Real Embeddings
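The real embeddings are projected to 2-D with t-SNE to visualize domain clustering before any generation happens. A minimal sketch is shown below; the perplexity value and plot styling are illustrative assumptions, and the pipeline's version is scripts/first_tsne_embedding_esm2.py.

```python
# Hedged sketch: 2-D t-SNE of the real embeddings with scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.load("embeddings/esm2_embeddings.npy")
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("t-SNE of real ESM-2 embeddings")
plt.savefig("figures/fig_tsne_by_domain_esm2.png", dpi=200)
```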


Step 3: Train EmbedDiff-ESM2 Latent Diffusion Model

Architecture Details:

- Denoising network: an MLP operating on the 1280-dimensional ESM-2 latent space (models/latent_diffusion.py, saved as checkpoints/best_embeddiff_mlp_esm2.pth)

Diffusion Process:

- 1000 timesteps with a smooth noise schedule; at each step the model learns to predict the noise that was added

Training Configuration:

- Up to 300 epochs with early stopping, batch size 32, Adam optimizer with a learning rate of 1e-4

Loss Function: mean squared error (MSE) between the actual and predicted noise: L = ‖ε − ε_θ(xₜ, t)‖²

This architecture enables the model to learn the complex distribution of protein embeddings and generate novel, biologically plausible latent representations through iterative denoising.
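The following is a hedged sketch of one DDPM-style training step implementing the loss above: noise a clean embedding x0 at a random timestep t, then regress the noise with MSE. The linear beta schedule and the eps_theta signature are assumptions; the project's model is defined in models/latent_diffusion.py.

```python
# Hedged sketch of one DDPM training step on protein embeddings.
# Schedule and network signature are assumptions.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def training_step(eps_theta, x0, optimizer):
    """One optimization step on a batch of clean embeddings x0: (B, 1280)."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward q(x_t | x0)
    loss = F.mse_loss(eps_theta(x_t, t), eps)             # ||eps - eps_theta||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```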


Step 4: Sample Synthetic Embeddings
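Starting from pure Gaussian noise, the trained model is run in reverse to produce new latent vectors (saved as embeddings/sampled_esm2_embeddings.npy). Below is a hedged sketch of ancestral DDPM sampling, reusing T, betas, and alphas_bar from the training sketch above; the project's sampler is scripts/sample_embeddings_esm2.py.

```python
# Hedged sketch: ancestral DDPM sampling from pure noise down to x0.
# Reuses T, betas, alphas_bar from the training sketch above.
import torch

@torch.no_grad()
def sample_embeddings(eps_theta, n, dim=1280):
    alphas = 1.0 - betas
    x = torch.randn(n, dim)                      # start from Gaussian noise
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((n,), t, dtype=torch.long)
        coef = (1.0 - alphas[t]) / (1.0 - alphas_bar[t]).sqrt()
        mean = (x - coef * eps_theta(x, t_batch)) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * z           # one reverse step
    return x                                     # (n, 1280) synthetic embeddings
```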


Step 5a: Build Decoder Dataset

Pairs of real ESM-2 embeddings and their source sequences are packaged into data/decoder_dataset_esm2.pt as decoder training data (scripts/build_decoder_dataset_esm2.py).


Step 5b: Train Transformer Decoder
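A Transformer decoder is trained to reconstruct amino acid sequences from their ESM-2 embeddings, so it can later decode synthetic embeddings. Below is a heavily hedged sketch of one possible embedding-conditioned decoder; all layer sizes and the conditioning scheme (projecting the latent to a single memory token) are assumptions, and the real architecture is models/decoder_transformer.py.

```python
# Hedged sketch: an embedding-conditioned Transformer decoder. The 1280-d
# latent becomes a single "memory" token that the decoder cross-attends to
# while predicting amino acid tokens. Sizes/conditioning are assumptions.
import torch
import torch.nn as nn

class EmbeddingToSequenceDecoder(nn.Module):
    def __init__(self, emb_dim=1280, vocab=21, d_model=256, nhead=4, layers=4):
        super().__init__()
        self.cond = nn.Linear(emb_dim, d_model)  # latent -> memory token
        self.tok = nn.Embedding(vocab, d_model)  # amino acid tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, emb, tgt_tokens):
        memory = self.cond(emb).unsqueeze(1)     # (B, 1, d_model)
        tgt = self.tok(tgt_tokens)               # (B, L, d_model)
        L = tgt_tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)                      # (B, L, vocab) logits
```

Training would then minimize teacher-forced cross-entropy between these logits and the shifted target tokens.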


Step 6: Decode Synthetic Sequences

The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.

Current Configuration:

- STOCHASTIC_RATIO = 0.6: 60% of positions are sampled stochastically and 40% are reference-guided (see Configuration Options below, and the sketch that follows)

This configuration produces sequences with approximately 30-55% sequence identity to known proteins, striking a practical balance between novelty and plausibility.
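Here is a toy sketch of the hybrid idea, interpreting "reference-guided" as a greedy argmax choice; the project's actual notion of reference guidance in scripts/transformer_decode_esm2.py may differ, so treat this purely as an illustration.

```python
# Toy sketch of hybrid decoding: at each position, sample from the decoder's
# distribution with probability STOCHASTIC_RATIO, otherwise take the argmax.
# Treating "reference-guided" as greedy decoding is an assumption.
import torch

STOCHASTIC_RATIO = 0.6  # 60% stochastic, 40% reference-guided

def choose_next_token(logits: torch.Tensor) -> int:
    """Pick the next amino acid index from a (vocab,) logits vector."""
    probs = torch.softmax(logits, dim=-1)
    if torch.rand(()).item() < STOCHASTIC_RATIO:
        return int(torch.multinomial(probs, 1))  # stochastic sample
    return int(probs.argmax())                   # greedy choice
```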

💡 Modular and Adjustable

This decoding step is fully configurable: adjust STOCHASTIC_RATIO in scripts/transformer_decode_esm2.py to trade novelty against fidelity (see Configuration Options below).


Step 7a: t-SNE Overlay

Real and generated embeddings are projected together with t-SNE to check whether synthetic samples occupy the same regions of latent space (scripts/plot_tsne_domain_overlay_esm2.py).


Step 7b: Cosine Similarity Analysis
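Pairwise cosine similarity is computed within and across the real and generated embedding sets to quantify how closely synthetic vectors track the real distribution. A minimal sketch follows; file names match the project layout, and the pipeline's version is scripts/cosine_similarity_esm2.py.

```python
# Minimal sketch of the three cosine-similarity comparisons (fig5a-c).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

real = np.load("embeddings/esm2_embeddings.npy")
gen = np.load("embeddings/sampled_esm2_embeddings.npy")

rr = cosine_similarity(real, real)  # real vs. real
gg = cosine_similarity(gen, gen)    # generated vs. generated
rg = cosine_similarity(real, gen)   # real vs. generated

# Histogram the off-diagonal values to compare the three distributions
print("real-real:", rr[np.triu_indices_from(rr, k=1)].mean())
print("gen-gen:  ", gg[np.triu_indices_from(gg, k=1)].mean())
print("real-gen: ", rg.mean())
```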


Step 7c: Entropy vs. Identity Analysis
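Each generated sequence's amino acid entropy is plotted against its identity to real sequences, flagging low-complexity or near-duplicate outputs. Below is a hedged sketch of the entropy computation itself; the pipeline's full analysis is scripts/plot_entropy_identity_esm2.py.

```python
# Hedged sketch: per-sequence Shannon entropy over amino acid composition.
# The maximum is log2(20) ~= 4.32 bits for a uniform composition.
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```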


Step 7d: Local BLAST Validation
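Generated sequences are aligned against a local protein database with BLAST to measure identity to known proteins. Below is a hedged sketch of one blastp call; -query, -db, -outfmt, and -out are standard NCBI BLAST+ flags, but the database name is an assumption, and the pipeline's own logic (including result parsing) lives in scripts/blastlocal_esm2.py.

```python
# Hedged sketch: one local blastp call via subprocess.
# The database name "my_local_db" is an assumption.
import subprocess

subprocess.run(
    [
        "blastp",
        "-query", "data/decoded_embeddiff_esm2.fasta",
        "-db", "my_local_db",                # assumed local BLAST database
        "-outfmt", "5",                      # XML output, for downstream parsing
        "-out", "data/blast_results/decoded_vs_db.xml",
    ],
    check=True,
)
```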


Step 8: HTML Summary Report
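All figures and tables are collected into a single self-contained HTML page. The toy sketch below illustrates the general approach of embedding PNGs as base64; the real generator is scripts/generate_esm2_report.py, and its layout is not reproduced here.

```python
# Toy sketch: assemble a standalone HTML report with base64-embedded PNGs.
import base64
import glob

parts = ["<html><body><h1>EmbedDiff-ESM2 Summary</h1>"]
for png in sorted(glob.glob("figures/*_esm2.png")):
    with open(png, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    parts.append(f'<h2>{png}</h2><img src="data:image/png;base64,{data}">')
parts.append("</body></html>")

with open("figures/embeddiff_esm2_summary_report.html", "w") as f:
    f.write("".join(parts))
```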


📂 Project Structure

EmbedDiff_ESM/
├── README.md                                    # 📘 Project overview and documentation
├── requirements.txt                             # 📦 Python dependencies
├── run_embeddiff_pipeline.py                    # 🚀 Master pipeline script
│
├── data/                                        # 📁 Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta      # Input protein sequences
│   ├── decoded_embeddiff_esm2.fasta             # Generated sequences
│   ├── decoder_dataset_esm2.pt                  # Decoder training dataset
│   └── blast_results/                           # BLAST analysis results
│       ├── blast_summary_local_esm2.csv         # BLAST summary
│       └── [individual BLAST XML and FASTA files]
│
├── embeddings/                                  # 📁 Latent vector representations
│   ├── esm2_embeddings.npy                      # Real sequence embeddings
│   ├── esm2_stats.npz                           # Embedding statistics
│   ├── sampled_esm2_embeddings.npy              # Generated embeddings
│   ├── tsne_coords_esm2.npy                     # t-SNE coordinates
│   └── tsne_labels_esm2.npy                     # t-SNE labels
│
├── figures/                                     # 📁 All generated plots and reports
│   ├── fig_tsne_by_domain_esm2.png              # t-SNE by domain
│   ├── logreg_per_class_recall_esm2.png         # Logistic regression recall
│   ├── logreg_confusion_matrix_esm2.png         # Logistic regression confusion matrix
│   ├── fig2b_loss_esm2.png                      # Diffusion training loss
│   ├── fig3a_generated_tsne_esm2.png            # Generated embeddings t-SNE
│   ├── fig5a_decoder_loss_esm2.png              # Decoder training loss
│   ├── fig5a_real_real_cosine_esm2.png          # Real-Real cosine similarity
│   ├── fig5b_gen_gen_cosine_esm2.png            # Generated-Generated cosine similarity
│   ├── fig5c_real_gen_cosine_esm2.png           # Real-Generated cosine similarity
│   ├── fig5b_identity_histogram_esm2.png        # Identity histogram
│   ├── fig5c_entropy_scatter_esm2.png           # Entropy vs. identity scatter
│   ├── fig5d_all_histograms_esm2.png            # All histograms
│   ├── fig5f_tsne_domain_overlay_esm2.png       # t-SNE domain overlay
│   ├── logreg_classification_results_esm2.csv   # Logistic regression results
│   └── embeddiff_esm2_summary_report.html       # Final HTML report
│
├── scripts/                                     # 📁 Core processing scripts
│   ├── esm2_embedder.py                         # Step 2a: ESM-2 embedding
│   ├── logistic_regression_probe_esm2.py        # Step 2b: Logistic regression analysis
│   ├── first_tsne_embedding_esm2.py             # Step 2c: t-SNE of real embeddings
│   ├── train_embeddiff_esm2.py                  # Step 3: Train latent diffusion model
│   ├── sample_embeddings_esm2.py                # Step 4: Sample new embeddings
│   ├── build_decoder_dataset_esm2.py            # Step 5a: Build decoder training set
│   ├── train_transformer_esm2.py                # Step 5b: Train decoder
│   ├── transformer_decode_esm2.py               # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_domain_overlay_esm2.py         # Step 7a: t-SNE comparison
│   ├── cosine_similarity_esm2.py                # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity_esm2.py            # Step 7c: Entropy vs. identity filter
│   ├── blastlocal_esm2.py                       # Step 7d: Local BLAST alignment
│   └── generate_esm2_report.py                  # Step 8: Generate final HTML report
│
├── models/                                      # 📁 ML model architectures
│   ├── latent_diffusion.py                      # EmbedDiff-ESM2 diffusion model
│   └── decoder_transformer.py                   # Transformer decoder
│
├── utils/                                       # 📁 Utility and helper functions
│   └── esm2_embedder.py                         # ESM-2 embedding utilities
│
└── checkpoints/                                 # 📁 Model checkpoints
    ├── best_embeddiff_mlp_esm2.pth              # Best diffusion model
    ├── decoder_transformer_best_esm2.pth        # Best decoder model
    └── decoder_transformer_last_esm2.pth        # Last decoder checkpoint

🚀 Quick Start

1. Setup Environment

# Clone the repository
git clone <repository-url>
cd EmbedDiff_ESM

# Install dependencies
pip install -r requirements.txt

2. Prepare Data

Place your input protein sequences in FASTA format under data/ (the pipeline ships with data/curated_thioredoxin_reductase.fasta as its input).

3. Run Full Pipeline

# Run complete pipeline
python run_embeddiff_pipeline.py

# Or skip specific steps
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion

4. View Results

Open figures/embeddiff_esm2_summary_report.html in a browser to review all figures and metrics.


🔧 Configuration Options

Stochastic Ratio Adjustment

Edit scripts/transformer_decode_esm2.py:

STOCHASTIC_RATIO = 0.6  # 60% stochastic, 40% reference-guided

Pipeline Step Control

Use the --skip flag to skip specific steps:

python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion sample decoder_data decoder_train decode tsne_overlay cosine entropy blast html

📊 Key Features

- Structure-free generation: latent diffusion over ESM-2 embeddings, with no explicit structural data required
- Modular pipeline: every stage can be run, skipped, or rerun independently via --skip
- Configurable decoding: a tunable stochastic/reference-guided sampling ratio
- Built-in validation: logistic regression probing, cosine similarity, entropy vs. identity, and local BLAST alignment
- Interactive HTML report summarizing all figures and results

🧪 Optional: Structural Validation

Generated sequences can additionally be assessed for structural plausibility with external structure prediction tools; this step is not part of the automated pipeline.


📊 Performance Metrics

| Metric | Value | Description |
|---|---|---|
| Generated sequences | 240 | High-quality synthetic proteins with domain-specific conditioning |
| Sequence identity | 37-49% | Range of similarity to real sequences (BLAST validation) |
| Training epochs | 300 | Diffusion model training with early stopping |
| Batch size | 32 | Optimized for training stability |
| Learning rate | 1e-4 | Adam optimizer configuration |
| Timesteps | 1000 | Diffusion process steps for smooth noise scheduling |
| Embedding dimension | 1280 | ESM-2 latent space size |
| Data split | 80/10/10 | Train/validation/test ratio with stratified sampling |

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.