Biotech’s AI Revolution Will Be Won in the Lab

Biotech is learning, again, that data provenance is not a detail. It is the product.

Imagine that you are working in a lab and the experiment you are about to run depends on a simple act of trust. You open a supplier’s catalogue, choose an antibody, and look at a validation image called a Western blot. For a non-specialist, a Western blot is basically a lab test that shows whether an antibody recognizes the right protein: if the right dark band appears in the right place, scientists take it as evidence that the antibody works.

Now imagine someone zooms in, adjusts the contrast, and discovers that some of those bands may have been copied, flipped, painted over, or reused across different product pages. Suddenly, the problem is no longer just one questionable image.

It is a much larger question: how much of the scientific supply chain are we trusting without truly verifying?

A recent Nature article reported that catalog entries for more than 100 Thermo Fisher Scientific antibodies contained images that appeared to have been manipulated, including images intended to demonstrate antibody quality and performance (Read more here: https://www.nature.com/articles/d41586-026-01706-2).

Nature also noted that image alteration does not automatically mean the underlying products are defective, but the episode is still a reminder that even trusted commercial data streams can contain uncertainty.

*The bands labeled 1 through 4 are all identical to one another after a vertical flip, a horizontal flip or a 180 degree rotation. Source:* **How much of Thermo Fisher’s antibody data has been manipulated? by** Reese Richardson May 28, 2026.

The Thermo Fisher case is troubling because the reported problems were not limited to one questionable figure or one isolated product page. Reese Richardson’s analysis says his team identified “more than 100 images bearing signs of manipulation” in Thermo Fisher’s online primary-antibody catalog verification data [Please read Reese’s excellent blog post that I used a lot in jotting this post down].

These images were not decorative marketing assets; they were presented as verification data intended to show that specific antibodies worked as advertised.

The reported manipulation also spans several categories of concern. Richardson describes Western blot bands that appear identical after flipping and rotation, images with conspicuous brushstroke-like edits after contrast adjustment, repetitive blocks of background noise, and abrupt discontinuities in background texture. In one recurring pattern, the post says dozens of Thermo Fisher antibody pages showed verification Western blots containing the same background pattern, with 50 instances documented at the time of writing!!!

In academia, a pattern of apparent image manipulation on this scale would likely be career-ending.

A principal investigator facing allegations involving duplicated bands, rotated blot features, brushstroke-like edits, and repeated background patterns across many figures would expect institutional investigation, paper retractions, grant scrutiny, reputational damage, and possibly the collapse of their laboratory’s credibility.

So the uncomfortable question for industry is this:

If comparable problems appear in commercial verification data that thousands of scientists rely on, what are the consequences for the broader biotech ecosystem, including small biotechs, CROs, platform companies, and large pharmaceutical organizations whose experimental decisions, model-training pipelines, reagent choices, and therapeutic programs may depend on that data?

Sad truth: the reproducibility problem is not new, and we have learned to live with it

Nature’s broader coverage of reproducibility shows that this is a structural issue, not a one-off scandal. In its 2016 reproducibility survey, Nature reported that two-thirds of responding researchers viewed current levels of reproducibility as a major problem, and that pressure to publish, selective reporting, poor statistics, and finicky protocols can all contribute to unreliable work (Read more: https://www.nature.com/articles/533437a).

The issue is especially relevant in preclinical biology. Nature Reviews Drug Discovery reported that Bayer could replicate only 25% of the preclinical academic projects it took on, while Amgen reported an 11% success rate when trying to recreate findings from cancer papers (Read more: https://www.nature.com/articles/nrd.2017.19).

Pre-clinical research reproducibility issues reported by Big Pharma companies

Source: Nature Reviews Drug Discovery; https://www.nature.com/articles/nrd.2017.19

Those numbers should matter to every company building computational platforms for drug discovery.

For AI-first biotech companies, the Amgen and Bayer reproducibility findings mean something very practical:

Pharma will not simply trust the model output because the model looks sophisticated.

If large companies have previously struggled to reproduce a major fraction of published preclinical biology, then any AI-first company selling predictions, targets, mechanisms, or assets should expect tougher diligence. The first consequence is more friction in every deal process. Pharma partners will ask for more raw data, more assay details, more reagent provenance, more replication experiments, and more evidence that the model is learning real biology rather than artifacts.

That means more steps before a deal can close. MTAs may need to be negotiated so Pharma can test materials directly. POCs may become longer and more demanding. External validation may be required before moving from interest to partnership. Legal, scientific, translational, and procurement teams may all get involved earlier.

In practice, weak trust in data reproducibility extends the sales cycle.

Why is external data not enough?

External CROs are essential partners in biotechnology. But if a tech-first company relies entirely on outsourced experiments, it risks becoming a passive consumer of data rather than an active builder of scientific truth.

The problem is not that CRO data is bad. The problem is that outsourced data often arrives detached from the tacit experimental context that makes biology interpretable: reagent history, protocol drift, failed runs, operator effects, batch artifacts, and subtle assay behaviors. These details often determine whether a model learns biology or learns noise.

For AI-first drug discovery, this distinction is existential. Models trained on weakly controlled or poorly understood experimental data can become very good at reproducing hidden artifacts. They can produce confident predictions that look impressive in silico but fail when confronted with real biophysical complexity.

What independent data generation gives Peptone

A bespoke robotic and ultra-fast mixing hydrogen-deuterium exchange-enabled mass spectrometry system for structural characterisation of Intrinsically Disordered Proteins, developed in Peptone Switzerland AG laboratories.

Our goal is not to replace every external partner. It is to own the critical feedback loops.

By generating key datasets internally, Peptone can design experiments around the questions our models actually need answered. We can control assay conditions, rerun ambiguous measurements, test failure modes, and connect raw experimental observations directly to model development. That gives us a tighter loop between hypothesis, measurement, model update, and decision.

This matters particularly for the kinds of biological systems Peptone works on: dynamic proteins, conformational ensembles, intrinsically disordered regions, and therapeutic mechanisms that are not always captured by static structures. In these systems, the important signal is often subtle. If we cannot interrogate the data ourselves, we cannot know whether the model is learning the right physics.

Implications for tech-first biotech companies without laboratories

Companies that rely only on external CROs may face several strategic risks.

Data provenance risk: They may not fully know how critical training data was produced, cleaned, normalized, or rejected.

Reproducibility risk: They may struggle to distinguish a true biological effect from a protocol-specific artifact.

Model-risk amplification: AI systems can scale errors quickly if the underlying data are biased or weakly validated.

Slower learning cycles: Every experimental iteration requires external scheduling, contracting, and interpretation.

Lower defensibility: If competitors can buy similar CRO datasets, the moat shifts away from proprietary biology.

Reduced scientific intuition: Without direct experimental ownership, teams may lose contact with the messy details that often reveal the next breakthrough.

My view

The future of computational drug discovery is not “models instead of labs.” It is models with laboratories that are purpose-built for learning.

At Peptone, we are building independent data-generation capacity because we want our models to be trained on data we understand deeply, can reproduce rigorously, and can improve continuously. The lesson from Nature’s reporting is not that biotechnology should trust less. It is that we should build systems where trust is earned experimentally, repeatedly, and transparently.

That is why data generation is not a support function for Peptone. It is core infrastructure.

If you want to learn more about what we are building at Peptone, please see this short flick put together with NVIDIA and AWS.

Biotech’s AI Revolution Will Be Won in the Lab

Biotech is learning, again, that data provenance is not a detail. It is the product.

It is a much larger question: how much of the scientific supply chain are we trusting without truly verifying?

These images were not decorative marketing assets; they were presented as verification data intended to show that specific antibodies worked as advertised.

In academia, a pattern of apparent image manipulation on this scale would likely be career-ending.

Sad truth: the reproducibility problem is not new, and we have learned to live with it

Pre-clinical research reproducibility issues reported by Big Pharma companies

Those numbers should matter to every company building computational platforms for drug discovery.

Pharma will not simply trust the model output because the model looks sophisticated.

In practice, weak trust in data reproducibility extends the sales cycle.

Why is external data not enough?

What independent data generation gives Peptone

Our goal is not to replace every external partner. It is to own the critical feedback loops.

Implications for tech-first biotech companies without laboratories

My view

The future of computational drug discovery is not “models instead of labs.” It is models with laboratories that are purpose-built for learning.

That is why data generation is not a support function for Peptone. It is core infrastructure.

Say hello here.

Elsewhere –

Goodreads
LinkedIn
Twitter

Biotech’s AI Revolution Will Be Won in the Lab

Biotech is learning, again, that data provenance is not a detail. It is the product.

It is a much larger question: how much of the scientific supply chain are we trusting without truly verifying?

These images were not decorative marketing assets; they were presented as verification data intended to show that specific antibodies worked as advertised.

In academia, a pattern of apparent image manipulation on this scale would likely be career-ending.

Sad truth: the reproducibility problem is not new, and we have learned to live with it

Pre-clinical research reproducibility issues reported by Big Pharma companies

Those numbers should matter to every company building computational platforms for drug discovery.

Pharma will not simply trust the model output because the model looks sophisticated.

In practice, weak trust in data reproducibility extends the sales cycle.

Why is external data not enough?

What independent data generation gives Peptone

Our goal is not to replace every external partner. It is to own the critical feedback loops.

Implications for tech-first biotech companies without laboratories

My view

The future of computational drug discovery is not “models instead of labs.” It is models with laboratories that are purpose-built for learning.

That is why data generation is not a support function for Peptone. It is core infrastructure.

Peptone at NVIDIA GTC 2026: Tackling the "Undruggable" Proteome

Say hello here.

Elsewhere –

GoodreadsLinkedInTwitter

Goodreads
LinkedIn
Twitter