
AI Protein Polymer Platform

A DARPA-funded LLM-and-transformer pipeline that mines protein polymer designs from the scientific literature. The platform turns thousands of fragmented papers into a structured dataset, with an architecture designed to scale to over a million papers.

Protein polymers are the building blocks of some of nature's most remarkable materials: spider silk, mussel adhesive, cephalopod reflectin, insect resilin. They are also a vast, fragmented research domain. Sequences and properties are scattered across tens of thousands of papers and patents with no unified vocabulary, no shared dataset, and no consistent representation. The accessible design space is whatever fits in one expert's head.

The AI Protein Polymer Platform is a pipeline for turning that literature into a structured, queryable, ML-trainable dataset. We built and validated a scalable extraction pipeline, comparing three LLM-based approaches (prompt engineering, reasoning models, and vision-language models) against a hand-curated ground-truth corpus. The best-performing method was productionized and fine-tuned for material property extraction, scaling toward a database that can cover over a million papers.
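As a rough illustration of how extraction methods can be compared against a hand-curated ground-truth corpus, the sketch below scores a set of extracted records with exact-match precision, recall, and F1. Everything in it is an assumption made for illustration: the `Record` fields, the `normalize` rules, and the exact-match criterion are hypothetical and do not describe the project's actual schema or evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical extracted fact: one property reported for one sequence."""
    paper_id: str
    sequence: str
    property_name: str


def normalize(record: Record) -> Record:
    # Assumed normalization: trim whitespace, upper-case sequences,
    # lower-case property names, so trivial formatting differences
    # do not count as extraction errors.
    return Record(
        record.paper_id.strip(),
        record.sequence.strip().upper(),
        record.property_name.strip().lower(),
    )


def score(predicted: list[Record], truth: list[Record]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over normalized records."""
    pred = {normalize(r) for r in predicted}
    gold = {normalize(r) for r in truth}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented purely to exercise the function.
gold_records = [
    Record("paper-1", "GPGGA", "tensile strength"),
    Record("paper-1", "GPGGA", "extensibility"),
]
method_output = [
    Record("paper-1", "gpgga", "Tensile Strength"),  # matches after normalization
    Record("paper-1", "GPGGA", "melting point"),     # not in the ground truth
]

print(score(method_output, gold_records))
```

Running the same scoring function over each candidate method's output gives one comparable set of numbers per method. A real evaluation would likely need fuzzier matching for numeric values and units than the exact-match rule assumed here.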

The work was funded in 2024 by DARPA's AI x BTO Catalyst award, in the Biomanufacturing / Synthetic Biology category. I served as Principal Investigator and technical lead. The work belongs to a broader thesis: applying AI to fragmented but knowledge-rich scientific corpora is where some of the largest near-term gains in materials science live. The natural next step is a generative model that designs novel protein polymers from peptide-level embeddings, with the dataset built here as its foundation.