LMLF — LLM-Driven Molecule & Materials Generation

Semi-automatic pipeline using LLMs with logical feedback loops to propose, filter, and validate novel molecules and materials

Motivation

LLMs trained on chemistry literature have absorbed an enormous amount of structure-activity relationship (SAR) knowledge. The question is whether they can serve as hypothesis generators in drug discovery and materials science: not as oracles, but as a creative front end to a rigorous validation pipeline.

The naive approach (prompt → SMILES → done) doesn’t work: LLMs hallucinate invalid molecules, ignore synthesizability, and don’t respect target-specific constraints. We needed an architecture that closes the loop.

Architecture

Target spec (protein / material property)
         ↓
LLM candidate generation (GPT-4 / Claude)
         ↓
Validity filter (RDKit, Lipinski, SCScore)
         ↓
Retrosynthesis check (ASKCOS / similar)
         ↓
Quantitative validation (docking / MD / DFT)
         ↓
Feedback prompts → LLM (next iteration)
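
The filter stages above can be sketched as a chain of checks, each returning either nothing (pass) or a short rejection reason. This is a minimal illustration, not the project's actual code: the Candidate fields, the function names, the 5-step route limit, and the -8.0 docking cutoff are all assumptions. The descriptors are taken as precomputed (in the real pipeline they would come from RDKit, a retrosynthesis tool, and a docking run), and any single Lipinski violation is treated as a rejection.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    # Hypothetical record; every field is assumed to be computed upstream.
    smiles: str
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int
    route_steps: Optional[int]  # None means no synthetic route was found
    docking_score: float        # more negative = stronger predicted binding


# A stage returns None on success or a human-readable rejection reason.
Stage = Callable[[Candidate], Optional[str]]


def lipinski(c: Candidate) -> Optional[str]:
    """Rule of 5 limits: MW <= 500, logP <= 5, donors <= 5, acceptors <= 10."""
    if c.mol_weight > 500:
        return "violates Lipinski Rule of 5 on molecular weight"
    if c.logp > 5:
        return "violates Lipinski Rule of 5 on logP"
    if c.h_donors > 5:
        return "violates Lipinski Rule of 5 on H-bond donors"
    if c.h_acceptors > 10:
        return "violates Lipinski Rule of 5 on H-bond acceptors"
    return None


def retrosynthesis(c: Candidate, max_steps: int = 5) -> Optional[str]:
    # "Under max_steps" is read strictly: a route of exactly max_steps fails.
    if c.route_steps is None or c.route_steps >= max_steps:
        return f"no synthetic route under {max_steps} steps"
    return None


def docking(c: Candidate, threshold: float = -8.0) -> Optional[str]:
    # The threshold is an illustrative placeholder, not a project setting.
    if c.docking_score > threshold:
        return "predicted binding affinity below threshold"
    return None


STAGES = [lipinski, retrosynthesis, docking]  # cheapest check first


def first_failure(c: Candidate, stages=STAGES) -> Optional[str]:
    """Run the stages in order and stop at the first rejection."""
    for stage in stages:
        reason = stage(c)
        if reason is not None:
            return reason
    return None
```

Running the cheap checks first means the expensive simulation stage only sees candidates that already passed the earlier checks, which matches the order of the diagram.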

The feedback loop is the key. Failed candidates are returned to the LLM with structured reasons (“violates Lipinski Rule of 5 on H-bond donors”, “no synthetic route under 5 steps”, “predicted binding affinity below threshold”). The LLM uses these to refine its next round.
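
As a rough sketch of how such a loop might be wired together: rejected candidates and their reasons from one round are folded into the prompt for the next. The names build_feedback_prompt and run_loop, the prompt wording, and the generate/evaluate callables are all hypothetical. Here generate stands in for a call to an LLM API, and evaluate stands in for the filter stack, returning None when a candidate passes.

```python
from typing import Callable, List, Optional, Tuple

Rejection = Tuple[str, str]  # (SMILES string, structured rejection reason)


def build_feedback_prompt(target: str, rejections: List[Rejection]) -> str:
    """Compose the next-round prompt, listing last round's failures if any."""
    lines = [
        f"Target: {target}",
        "Propose new candidate molecules as SMILES, one per line.",
    ]
    if rejections:
        lines.append("Avoid the problems found in the previous round:")
        lines += [f"- {smiles}: {reason}" for smiles, reason in rejections]
    return "\n".join(lines)


def run_loop(
    target: str,
    generate: Callable[[str], List[str]],        # stand-in for the LLM call
    evaluate: Callable[[str], Optional[str]],    # None = passed all filters
    rounds: int = 3,
) -> List[str]:
    """Alternate generation and evaluation, feeding failures back each round."""
    accepted: List[str] = []
    rejections: List[Rejection] = []
    for _ in range(rounds):
        prompt = build_feedback_prompt(target, rejections)
        rejections = []  # only the most recent round's failures are fed back
        for smiles in generate(prompt):
            reason = evaluate(smiles)
            if reason is None:
                accepted.append(smiles)
            else:
                rejections.append((smiles, reason))
    return accepted
```

Feeding back only the latest round's failures keeps the prompt short. Accumulating reasons across rounds is an equally plausible choice if the model keeps repeating old mistakes.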

Two deployments

Drug discovery (APPCAIR, BITS Pilani)

Targeting JAK2 kinase and dopamine beta-hydroxylase (DBH):

  • Generated 20+ novel inhibitor candidates passing all filters
  • Achieved a 15–20% success rate (candidates passing the validity, retrosynthesis, and docking thresholds)
  • Inhibitor scaffolds proposed by the system have been forwarded for experimental validation

Low-k dielectric materials (Deep Forest Sciences)

Targeting novel materials for semiconductor interlayer dielectrics:

  • LLM-generated material candidates validated via Molecular Dynamics + DFT simulations
  • Screened 40+ candidate materials
  • Reduced screening cycle time by about 83% (from 30+ days to 5 days)

What came of it

  • Production deployment at Deep Forest Sciences for their Prithvi drug discovery platform
  • Inhibitor candidates from the JAK2/DBH pipeline forwarded for wet-lab synthesis
  • Methodology generalizes: the same feedback-loop pattern is now being adapted for catalyst design

Stack: Python · RDKit · GPT-4 / Claude APIs · ASKCOS · Gnina (docking) · LAMMPS (MD)

Collaborators: Prof. Ashwin Srinivasan (APPCAIR) · Bharath Ramsundar (Deep Forest Sciences)