
How I Found 71 Natural Alternatives to a $200 Million Drug

February 11, 2026

By Author

Not random images either: there's a chance you'll find these compounds in all of the organisms shown.


Table of Contents #

  1. Exploratory Data Analysis for COCONUT — Profiling ~100K natural products to understand drug-likeness, physicochemical space, and screening viability.
  2. Chemical Similarity Search — Fingerprint-based filtering (ECFP, MACCS, pharmacophore, USRCAT) to reduce the database to high-similarity candidates.
  3. Protein Model Selection, Cleanup, and Preprocessing — Selecting acarbose-bound crystal structures and preparing them for docking.
  4. Protein–Ligand Docking & Assessment — Running GNINA-based virtual screening and scoring with CNN-VS.
  5. Lead Validation and Controls — Decoys, redocking replicates, statistical filtering, and stability analysis to remove false positives.
  6. Candidate Overview & Key Takeaways — Final shortlist by target protein and reflections on the workflow.

Flavonoids are a large, diverse group of secondary plant metabolites that play a role in pigmentation, UV protection, and insecticidal activity. Specifically, some anthocyanins (a subclass) can inhibit insect digestive enzymes, preventing nutrient uptake and development, serving as a defence mechanism against predators.

Obviously, there’s potential here for bug Ozempic (or insecticides, you pick). But research on Allium mongolicum flowers has also revealed that their flavonoids may inhibit human digestive enzymes, including our very own starch-metabolizing α-glucosidase. These compounds could be used to treat diabetes, as inhibiting starch breakdown reduces postprandial blood glucose levels (Li et al., 2025).

Initial docking analysis revealed that A. mongolicum-derived isoquercetin (already in clinical trials) can inhibit α-glucosidase activity with an effect similar to that of acarbose, a common α-glucosidase inhibitor (AGI) derived from Actinoplanes bacteria. Chemical similarity search revealed that troxerutin (aka vitamin P4, a flavonoid you can buy over the counter) shares some substructural motifs with isoquercetin, and its docking analysis showed similar binding affinity.

Docking results from DiffDock-Pocket.

Note: troxerutin’s slightly higher affinity may be attributed to steric interactions introduced by the moieties found on the larger molecule.

I wanted to take this approach and extend it to a larger database of natural compounds to find other biochemicals that could hypothetically outperform acarbose — after all, the drug is projected to have a $220 million global market by 2035.

Here’s my overall process:

  1. Exploratory data analysis of COCONUT, a natural products database.

  2. Chemical similarity search of acarbose against COCONUT to find structurally similar NPs.

  3. Protein model selection, cleanup, and preprocessing.

  4. Protein-ligand docking and assessment, where I run a virtual screen against human α-glucosidases.

  5. Lead validation and controls to identify any false positives.

You can see all the code (in progress) here.

Workflow diagram

1. Exploratory data analysis for COCONUT #

In this section, I profile the database to understand drug-likeness, physicochemical space, and screening viability.

There are a little over 100,000 natural products in the COCONUT database. I started by doing some simple feature visualization:

For a closer look at the dataset, I examined the following derived ratios:

Research indicates that a TPSA/MW ratio >= 0.2 is ideal for good solubility; approximately half of the dataset meets this criterion (Whitty et al., 2018).

This metric stems from two findings:

Here, I take the ratio of both in the dataset. Two populations emerge: many NPs with no saturated carbons and a smaller group with some balance.

Takeaways:

Chemical similarity is generally determined by generating chemical fingerprints for a molecule, fixed-size identifiers that capture information about atomic structure and chemistry. Several fingerprinting protocols exist, each with its own strengths and weaknesses. Here are the ones I chose and why:

By generating fingerprints for acarbose, we can use Tanimoto similarity to assess the degree of agreement between our database compounds and the lead target. USR is the exception; we use inverse Manhattan distance instead.
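As an illustrative sketch (toy bit sets and descriptor vectors, not the real acarbose fingerprints), the two comparison metrics look like this:

```python
# Illustrative sketch of the two similarity metrics used above.
# Fingerprints are modeled as sets of "on" bit indices; USR descriptors
# as fixed-length numeric vectors. Toy data, not real fingerprints.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| over on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def usr_score(desc_a, desc_b):
    """USR-style similarity: inverse of the normalized Manhattan
    distance, bounded in (0, 1]; identical shapes score 1.0."""
    n = len(desc_a)
    manhattan = sum(abs(x - y) for x, y in zip(desc_a, desc_b))
    return 1.0 / (1.0 + manhattan / n)

# Toy example: 3 shared bits out of 5 total on-bits
print(tanimoto({1, 5, 9, 12}, {1, 5, 12, 20}))  # 0.6
```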

To trim the dataset down, I used the following cutoffs for Tanimoto similarity:

This gave me a dataset of 10,341 high-similarity candidate molecules to explore. Ligands were generated using RDKit and converted to .pdbqt using OpenBabel.

3. Protein model selection, cleanup, and preprocessing #

Acarbose is a bit of a rolling stone; it can’t be tied down. By which I mean it interacts with multiple proteins. To be thorough, I wanted to explore this virtual screening process across all of its protein targets at once. In hindsight, this made it much harder to keep track of (I’ll take it one target at a time from now on).

I made my selections based on protein targets listed in DrugBank and the resolution of the corresponding protein model. I also specifically selected PDB entries of the target protein *in complex with acarbose* so I’d a) have a defined binding pocket and b) have an experimental reference point for future validations. Finally, all proteins were put through PDB-redo before being converted to .pdbqt for docking:

  1. 2QMJ : The N-terminal subunit of human maltase-glucoamylase in complex with acarbose. As you can see from the PDB page, there’s room for growth in model quality, but PDB-redo led to some minor improvements. (Gene: MGAM).
  2. 3BAJ : Human pancreatic alpha-amylase in complex with nitrate and acarbose. PDB-redo led to more significant improvements here. (Gene: AMY2A).
  3. 5NN8 : Human lysosomal acid-alpha-glucosidase in complex with acarbose (Gene: GAA).

Below, acarbose is designated in light green with binding residues (within 4 Å of acarbose) in red. All other observed elements are ions and ligands, which are removed for docking.

Note: one of the highlighted light green ligands in 5NN8 is an acarbose-derived trisaccharide; the other binding pocket (containing true alpha-acarbose) is used for docking configuration — Images generated via The Protein Imager.

4. Protein-ligand docking & assessment #

Figures generated using ChimeraX. Can you tell I tried (unsuccessfully) to create Goodsell-like graphics?

For protein-ligand docking, I used GNINA, an open-source AutoDock Vina fork that uses CNNs to score ligand poses.

Since we know where acarbose is supposed to bind (the protein models include acarbose), we can guide the docking process by setting an autobox (basically a mini 3D search space) around its coordinates. In the diagrams above, I’ve indicated the autobox GNINA draws around the ligands (4 Å outward from the farthest corners of the molecule) in red.
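For reference, a single docking run looks roughly like the call sketched below. The file names are placeholders, and the flags (`--autobox_ligand`, `--autobox_add`) follow GNINA's smina-style CLI:

```python
# Sketch of one GNINA invocation with an autobox drawn around the
# reference ligand. File names are placeholders; --autobox_add pads
# the box 4 Å outward from the reference ligand's bounding box.

def gnina_command(receptor, ligand, ref_ligand, out_poses, seed=0):
    return [
        "gnina",
        "-r", receptor,                   # prepared protein (.pdbqt)
        "-l", ligand,                     # candidate ligand (.pdbqt)
        "--autobox_ligand", ref_ligand,   # crystal acarbose defines the box
        "--autobox_add", "4",
        "--seed", str(seed),
        "-o", out_poses,
    ]

cmd = gnina_command("2qmj_redo.pdbqt", "candidate.pdbqt",
                    "acarbose_ref.pdb", "poses.sdf")
print(" ".join(cmd))
```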

While it may be the most computationally expensive step, this stage of the project was actually the easiest. I ran GNINA on an HPC cluster in parallel fashion to save time — although GNINA wasn’t designed specifically for high-throughput virtual screening1, this step was completed in approximately one day.

5. Lead validation and controls #

In this section, I go over:

GNINA generates several metrics for binding poses, including CNN pose score (the model’s assessment of pose quality) and CNN binding affinity (the AI-predicted kcal/mol). Benchmarking studies with GNINA have indicated that the product of these two values, CNN_VS (virtual screening), can be used to identify potentially active ligands when it exceeds 6.30.

Horizontal lines represent the CNN-VS threshold of 6.30.

Taking the CNN_VS of all ligands docked against each protein, I found that 71 ligands passed the threshold.
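The composite score itself is trivial to compute; here's a sketch with hypothetical pose-score and affinity values:

```python
# CNN_VS = CNN pose score × CNN predicted affinity, per the GNINA
# benchmarking heuristic described above. Values are hypothetical.

ACTIVE_THRESHOLD = 6.30

def cnn_vs(pose_score, predicted_affinity):
    return pose_score * predicted_affinity

score = cnn_vs(0.92, 7.4)
print(round(score, 3), score > ACTIVE_THRESHOLD)  # 6.808 True
```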

Okay, the headline’s promise has been paid off, but we’re not done yet. After collecting a set of leads, I needed to validate the findings. There were a few ways I went about this, with each control telling me something different:

Here are the results for each:

Molecular decoys #

Decoys help us determine whether our docking results are false positives. By matching our lead molecules against decoy molecules with similar properties (e.g., molecular weight, number of rotatable bonds, log P), we can assess whether our leads have real potential.

I used LUDe (LIDEB’s Useful Decoys) to generate decoys for all (putatively) biologically active ligands. LUDe provides fifty decoys per ligand, all of which were docked against the protein models using GNINA under the same parameters, with CNN-VS taken as well.

Comparative overall performance (measured via CNN-VS) of leads vs. property-matched decoys.

Alternatively:

As you can see, nearly all decoy poses scored below our activity threshold. However, to be more robust, I did the following checks:

P-values & z-scores #

For each ligand, I compared CNN-VS scores with ~50 matched decoys (in some cases, RDKit was unable to generate conformers for the decoys) and computed a one-sided empirical p-value. I also determined how many robust standard deviations the ligand lay above the median and adjusted all p-values using the Benjamini-Hochberg false discovery rate.

Some results for 3BAJ are shown below.
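A minimal sketch of those three statistics, assuming only per-ligand score lists rather than the real pipeline data:

```python
from statistics import median

def empirical_p(lead_score, decoy_scores):
    """One-sided empirical p-value with add-one smoothing: the fraction
    of decoys that score at least as well as the lead."""
    ge = sum(d >= lead_score for d in decoy_scores)
    return (ge + 1) / (len(decoy_scores) + 1)

def robust_z(lead_score, decoy_scores):
    """Robust standard deviations above the decoy median, using the
    MAD scaled by 1.4826 to approximate a standard deviation."""
    med = median(decoy_scores)
    mad = median(abs(d - med) for d in decoy_scores)
    return (lead_score - med) / (1.4826 * mad)

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment (step-up procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):   # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```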

Ligand dominance, or rank-based enrichment #

Realistically, the ligand vs. decoy sets are too small for any real statistical robustness. A more useful heuristic is to compare the ligands’ ranks relative to their matched decoys when ordered by CNN-VS.

For all three proteins, all leads ranked at or above the 98th percentile relative to their decoys. Actually, all except 2 leads for 3BAJ ranked first relative to their decoys. Some results for 3BAJ are shown below:

You’ll notice very few decoys were generated for CNP0064234.2; it was not considered a candidate for 3BAJ.
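The rank heuristic reduces to a percentile of the lead among its own decoys; a sketch:

```python
def percentile_vs_decoys(lead_score, decoy_scores):
    """Percentile of the lead relative to its matched decoys:
    100 × the fraction of decoys it strictly outscores."""
    beaten = sum(lead_score > d for d in decoy_scores)
    return 100.0 * beaten / len(decoy_scores)

# A lead beating 49 of its 50 decoys sits at the 98th percentile
print(percentile_vs_decoys(7.1, [6.0] * 49 + [7.5]))  # 98.0
```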

Redocking with variant seeds #

One of the more important validation tests, this process is meant to tell us whether the leads we found were flukes or are consistently high-CNN-VS across replicates, an attribute we’ll refer to as stability from here on.

Once again, this control helps us filter out false positives. If expanded, it can help us identify false negatives: ligands that didn’t pass the CNN-VS threshold in the original screen but would otherwise. I chose not to do that in this project due to time and computational cost constraints, but it would be a logical next step.

For this control, I simply took all of our leads and docked them again with GNINA using 5 variant seeds. Below are the distributions of CNN-VS scores for the initial run compared with the redocked runs:

As we can see, the CNN-VS for the redocked ligands exhibits a wider distribution (as expected — we haven’t artificially subset them to CNN-VS >= 6.30, and there are 25 times as many data points).

Let’s take a closer look:

5.1. Jittered Box-Plot of CNN-VS Distribution by Identifier Across All Seeds

What this shows:

Each box shows the spread of CNN-VS scores for a given lead identifier across all seeds, including the original and redocking runs.

Already, we can see several leads do not pass muster.

Why it matters:

These charts help us see which identifiers have the most consistent CNN-VS scores and whether their distributions exceed the biologically active threshold.

5.2. Spaghetti Plot of Best CNN-VS Score by Identifier Across All Seeds

What this shows:

This is a closer look than the previous boxplot and shows us seeds on the x-axis. Once again, it shows us the consistency and potency of each identifier:

Note: It might make more sense to use a spaghetti plot for a smaller set of labelled samples, but here it’s meant to provide a broader overview.

Why it matters:

These plots show whether CNN-VS scores are consistent across runs and how high or low those scores are. We’re looking for consistent scoring that’s also above the threshold.

5.3. Scatter Plot of Cross-Seed Median CNN-VS Score & Cross-Seed Interquartile Range

What this shows:

The median CNN-VS across seeds provides a general idea of the CNN-VS for a particular ligand; the IQR indicates the stability of that CNN-VS. A lower IQR means a tighter distribution.

Why it matters:

This is another way of visualizing consistency (low IQR) and quality (median CNN-VS).

5.4. Median Comparison Between Originals and Redock CNN-VS by Identifier

What this shows:

One of the simpler visualizations; all we see here is the difference between the original run’s CNN-VS and the median redock runs’ CNN-VS.

This plot becomes more useful if we expand our initial redocking set to include those leads with CNN-VS scores below 6.30 — a jump would indicate a false negative.

Why it matters:

A companion to IQR vs. CNN-VS. A snapshot of pose stability.

Taking all of this into account, my final filtration step was to keep the leads that scored CNN-VS > 6.30 in 5 of 6 replicates, then rank them by the ratio of median CNN-VS to interquartile range. With that, the top 3 candidates for each protein were as follows:

2QMJ :

3BAJ :

5NN8 :
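The filtration described above can be sketched in a few lines, assuming a dict of identifier → six replicate scores (the quartile-split convention here is a simple one, not necessarily the exact one my pipeline used):

```python
from statistics import median

def iqr(scores):
    """Interquartile range via a simple half-split quartile convention."""
    s = sorted(scores)
    n = len(s)
    lower, upper = s[:n // 2], s[(n + 1) // 2:]
    return median(upper) - median(lower)

def shortlist(scores_by_id, threshold=6.30, min_passes=5):
    """Keep leads above the CNN-VS threshold in >= min_passes replicates,
    ranked by median / IQR (high, stable scores rank first)."""
    ranked = {}
    for lig, scores in scores_by_id.items():
        if sum(s > threshold for s in scores) >= min_passes:
            ranked[lig] = median(scores) / iqr(scores)
    return sorted(ranked, key=ranked.get, reverse=True)

replicates = {  # hypothetical identifiers and scores
    "CNP_A": [6.5, 6.6, 6.7, 6.8, 6.9, 7.0],  # stable, all 6 pass
    "CNP_B": [6.5, 6.4, 6.1, 5.0, 6.6, 6.7],  # only 4/6 pass -> dropped
    "CNP_C": [6.5, 6.6, 6.7, 7.0, 8.0, 9.0],  # passes, but noisier
}
print(shortlist(replicates))  # ['CNP_A', 'CNP_C']
```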

We’ll take a closer look at these soon — but first, let’s do one final validation.

Structural interaction fingerprint comparison #

The goal of this step is more exploratory than confirmatory. My aim here was to determine whether the lead ligands interact with the target proteins similarly to experimental acarbose. We don’t exactly need to prioritize the ligands with similar binding profiles, as it could be expected that a more performant ligand would behave differently.

We can compare ligand-protein structural interactions across a diverse set of ligand sizes using structural interaction fingerprints. These go one step beyond typical molecular fingerprints and encode molecular interactions between ligands and binding-pocket residues, hashing them to preset sizes to enable comparisons.
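To make the hash-then-compare idea concrete, here's a toy sketch; the residue names and interaction types are made up, and real interaction fingerprints such as PLEC encode far richer atomic environments:

```python
# Toy structural interaction fingerprint: hash each (residue,
# interaction-type) contact into a fixed-size bit vector, then compare
# fingerprints with Tanimoto. Contacts below are hypothetical.

FP_SIZE = 64  # real SIFs use far larger sizes (thousands of bits)

def sif_bits(contacts):
    """contacts: iterable of (residue, interaction) tuples,
    e.g. ("ASP123", "hbond"). Returns the set of on-bit indices."""
    return {hash(contact) % FP_SIZE for contact in contacts}

def tanimoto(bits_a, bits_b):
    if not bits_a and not bits_b:
        return 0.0
    return len(bits_a & bits_b) / len(bits_a | bits_b)

acarbose_sif = sif_bits([("ASP123", "hbond"), ("TRP456", "stacking")])
lead_sif = sif_bits([("ASP123", "hbond"), ("PHE789", "hydrophobic")])
print(tanimoto(acarbose_sif, lead_sif))
```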

The best-performing fingerprint so far seems to be PLEC FP, which combines the best features of previous methods (atomic environment capture, an expanded capture radius, etc.). It’s shown promise in AI tasks, lead optimization, and scaffold hopping, which is why I chose it for my initial pass:

As you can see, our similarity scores are practically zero. (Full disclosure: this was a pretty disappointing chart to generate).

These results are a little surprising — the initial list of 10,000 candidates was selected because they shared similarities with acarbose, so we’d expect at least some overlap.

Moreover, when creating a residue contact map for each ligand against its target protein, the interacting residues seem quite similar:

2QMJ :

3BAJ :

5NN8 :

I’m still not sure why this might be, so please let me know if you have an explanation!

Also, out of curiosity, I determined pairwise Tanimoto similarity across all leads2. There seems to be clustering among some 3BAJ leads, which may merit further investigation.

Some of the lead ligands for 3BAJ, such as CNP0146238.29, appear to share binding behaviours with other leads, such as CNP0156494.

Candidate overview & key takeaways #

Let’s start by interpreting our shortlist from earlier:

2QMJ #

3BAJ #

5NN8 #

Some honourable mentions:

If you’re in a lab, made it this far, and have access to such delightful flora as A. toxicaria or a library of traditional Ethiopian medicine, feel free to take this list of lead candidates and get to testing.

Some of my key takeaways:

Thank you for reading! Am I a genius? A modern Prometheus? Have I made grievous and unforgivable scientific errors? Please let me know if you have any feedback.

1

They actually specifically recommend against using GNINA for high-throughput screening applications. However, simple parallelization allows you to spawn multiple GNINA processes on a single GPU node, so you can take advantage of the program’s speed.

2

2QMJ :

3BAJ :

5NN8:
