Herman, I saw a headline this morning that makes me want to never look at an X-ray again without a private investigator and a digital forensic team standing by. Apparently, we have reached the point where deepfake medical images are so good that even the experts—and I mean the people who spent a decade in med school—cannot tell if that broken rib is real or just a very talented hallucination from a neural network.
It is a genuine crisis of trust, Corn. My name is Herman Poppleberry, and we are diving headfirst into the messy, high-stakes world of medical AI today. This topic actually comes from a prompt Daniel sent us. He wants us to look at the landscape of fine-tuned medical models as of today, March twenty-six, twenty-twenty-six. Specifically, how we moved from simple detection tools to these massive, multimodal systems that are trying to act as a second pair of eyes for doctors.
Today's prompt from Daniel is about the shift from what he calls point-solution detection to these workflow-native systems. He is asking us to look at the history, the technical battle between text-based and vision-native models, and whether the future belongs to specialized tools or general-purpose models with very expensive guardrails. But before we get into the "how," we have to talk about the "why." Why are we even having this conversation when the FDA seems to be approving these things like they are handing out stickers at a dentist's office?
That is the perfect place to start because we are seeing this massive paradox in the industry right now. On one hand, the volume of innovation is staggering. As of late twenty-twenty-five, the Food and Drug Administration had authorized a cumulative one thousand four hundred fifty-one AI and machine learning enabled medical devices. Seventy-six percent of those, or one thousand one hundred four devices, are in radiology alone. Companies like GE HealthCare are leading with one hundred twenty authorizations, followed by Siemens Healthineers at eighty-nine and Philips at fifty.
That sounds like a lot of progress on paper. If I am a regulator, I am feeling pretty good about myself. But I saw that report from the ARISE Network that came out about two weeks ago, on March tenth. It was pretty damning, Herman. They found that ninety-five percent of those FDA-cleared devices have never reported a single patient health outcome. Not one. So, we have over a thousand approved tools, but we basically have no proof that they are actually making people healthier or extending lives. That is what you call the AI Chasm, right?
That is exactly the term. It is the gap between a model performing well on a test set in a lab—where the data is clean and the stakes are low—and a model actually improving the life of a patient in a hospital bed. For years, the history of AI in medicine was defined by these point solutions. You would have one model that was incredibly good at finding a specific type of lung nodule, and another that was great at spotting a fracture in a wrist. But the problem is that radiologists do not just look for one thing. They look at the whole picture.
Right, if I go in for a chest X-ray because I have a cough, and the AI is only trained to look for pneumonia, it might miss the fact that my heart is twice the size it should be. It is like having a specialist for every single bolt on a car, but no one who knows how the engine actually works. If the doctor has to open ten different apps to check for ten different things, they are just going to stop using the apps. It is a friction nightmare.
And that brings us to the shift toward workflow-native systems. We are moving away from those isolated apps and into systems that live inside the PACS, which is the Picture Archiving and Communication System. That is the software environment where radiologists spend their entire day. The goal now is not just to detect one pathology, but to integrate into the entire diagnostic workflow. This is where people like Dr. Gelareh Sadigh come in. She is the Associate Editor at the Journal of the American College of Radiology, and she has been leading the charge on AI workflow optimization. Her research shows that if the AI is not part of the primary screen the doctor is already looking at, it might as well not exist.
Which leads us to this technical fork in the road Daniel mentioned in his prompt. We have the text-based models like Med-Gemini or the latest versions of GPT-five point four Mini, and then we have the vision-native models like Pillar-zero. Herman, I know you have been reading the papers on Pillar-zero. I bet you have the architecture diagrams printed out and framed on your wall.
I will neither confirm nor deny the framing, but Pillar-zero is fascinating because it represents a major architectural shift. It was released in late twenty-twenty-five and just got a major update this month. It is an open-source three-D vision-language model out of UC Berkeley and UCSF, with Adam Yala as the senior author. Most previous models would look at a three-D scan, like a CT or an MRI, as a series of two-D slices. They were essentially looking at a flipbook and trying to remember what they saw three pages ago. Pillar-zero interprets the full three-D volume directly.
Wait, why is that such a big deal? If I look at the slices one by one, I eventually see the whole thing, right? I mean, that is how human radiologists were trained for decades.
Not necessarily with the same spatial context. When you interpret the volume as a whole, the model understands the connectivity of structures in three-dimensional space. It is not just looking at a circle on a slice; it is looking at a tube winding through the body. Pillar-zero achieved an area under the curve, or AUC, of zero point eight seven across more than three hundred fifty different findings. That is significantly better than Google's MedGemma or Microsoft's MI-two. It is the difference between reading a description of a room and actually standing in the middle of it.
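For listeners who want that metric made concrete, here is a tiny sketch, with entirely made-up labels and scores, of what a macro-averaged AUC "across 350+ findings" typically means: one ROC-AUC per finding, averaged over the findings that have both positive and negative cases. This is not Pillar-zero's actual evaluation code, just an illustration of the bookkeeping.

```python
# Toy illustration of macro-averaged AUC across many binary findings.
# NOT Pillar-zero's evaluation code; labels and scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_scans, n_findings = 200, 350                         # hypothetical evaluation set size
labels = rng.integers(0, 2, size=(n_scans, n_findings))            # ground truth per finding
scores = labels * 0.6 + rng.random((n_scans, n_findings)) * 0.8    # fake model scores

per_finding_auc = []
for f in range(n_findings):
    y = labels[:, f]
    if y.min() == y.max():                             # AUC is undefined without both classes
        continue
    per_finding_auc.append(roc_auc_score(y, scores[:, f]))

print(f"macro-averaged AUC over {len(per_finding_auc)} findings: "
      f"{np.mean(per_finding_auc):.2f}")
```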
But even these vision-native models have a major problem with what they call grounding. I read about a study from the University at Buffalo published just a few days ago, on March twenty-fourth. It deals with hallucinations, but not the kind where the AI thinks there is a dragon in the room. It is more subtle and, frankly, more dangerous. Their proposed fix is called Category-Wise Contrastive Decoding, or CWCD.
I am glad you brought that up. Hallucinations in medical AI are terrifying because they often look like logical deductions. For example, a model might see an enlarged heart, which is cardiomegaly, and its training data says that cardiomegaly often appears alongside pulmonary edema, which is fluid in the lungs. So, the model predicts edema even if it cannot actually see any fluid in the pixels. It is essentially stereotyping the patient based on one finding. It is "drifting" away from the image and relying on its internal statistical map of how diseases usually hang out together.
It stops looking at the pixels and starts guessing based on statistics. That is where CWCD comes in, right? It forces the model to justify its prediction based on specific visual cues rather than just probabilistic association. It is like a teacher telling a student, "Do not just give me the answer; show me exactly where in the text you found it."
It uses contrastive decoding to penalize the model when it makes a prediction that is not grounded in the actual visual evidence of that specific scan. This kind of grounding is critical for models like Microsoft's MAIRA-two. That model is designed to generate grounded reports, linking specific sentences in the text to spatial coordinates on the image. If the report says there is a mass in the left upper lobe, the model has to be able to point to the exact pixels that represent that mass. If it cannot point to them, it should not be allowed to say it.
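To make the grounding idea concrete, here is a minimal sketch of the general contrastive intuition, not the published CWCD algorithm: score each finding with the real image and again with the visual evidence removed, and only keep findings whose score actually depends on the pixels. The `model` callable, its output shape, and the margin threshold are all assumptions for illustration.

```python
import torch

def grounded_findings(model, image, finding_names, margin=0.15):
    """Illustrative sketch of the contrastive-grounding idea (not the published
    CWCD method): a finding is only reported if its score drops noticeably when
    the image evidence is taken away, i.e. the prediction depends on the pixels
    and not just on label co-occurrence statistics baked into the model.
    Assumes `model(image)` returns one logit per finding, shape (n_findings,)."""
    with torch.no_grad():
        scores_with_image = torch.sigmoid(model(image))
        blank = torch.zeros_like(image)                 # remove the visual evidence
        scores_without_image = torch.sigmoid(model(blank))

    report = []
    for i, name in enumerate(finding_names):
        evidence_gain = (scores_with_image[i] - scores_without_image[i]).item()
        if scores_with_image[i] > 0.5 and evidence_gain > margin:
            report.append(name)                         # grounded: the pixels support it
    return report

# example usage with a stand-in "model": flatten the image and project to 3 finding logits
dummy_model = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(64 * 64, 3))
xray = torch.rand(64, 64)
print(grounded_findings(dummy_model, xray, ["cardiomegaly", "edema", "effusion"]))
```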
It is like a lawyer who has to provide an exhibit for every claim they make in court. No evidence, no claim. But let's go back to the deepfake thing I mentioned at the start. This is the news from the last forty-eight hours. Dr. Mickael Tordjman published this study in the journal Radiology showing that multimodal LLMs and even experienced radiologists are struggling to tell the difference between real X-rays and synthetic ones. Herman, how does this happen? Is someone just sitting there with Photoshop?
It is much more sophisticated than Photoshop. These are Generative Adversarial Networks or Diffusion Models trained specifically on medical imaging. They can generate a perfect-looking X-ray of a tumor that looks indistinguishable from a real one. It is a nightmare for the integrity of medical records. If a bad actor can generate a fake scan to commit insurance fraud, or if a clinical trial is poisoned with synthetic data to make a drug look more effective than it is, the whole system collapses. This is why we are hearing these urgent calls for invisible watermarking and cryptographic signatures at the hardware level.
So, when an X-ray machine takes a picture, it needs to sign that file with a unique key so we know it came from a physical sensor and not a GPU in someone's basement. It is basically a blockchain for your ribs.
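Here is a minimal sketch of what that sign-at-the-sensor idea could look like in code, using an Ed25519 signature over a hash of the raw pixel buffer. Key distribution, DICOM metadata, and actual watermarking schemes are all glossed over, and the workflow is an assumption for illustration, not any vendor's real implementation.

```python
# Minimal sketch of a "chain of custody for pixels": the acquisition device signs
# a hash of the raw pixel buffer with a private key held in its hardware, and the
# viewer verifies it against the device's public key.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature
import hashlib

device_key = Ed25519PrivateKey.generate()         # would live in the scanner's secure element
device_pub = device_key.public_key()              # distributed to the PACS / viewers

def sign_acquisition(pixel_bytes: bytes) -> bytes:
    digest = hashlib.sha256(pixel_bytes).digest()
    return device_key.sign(digest)

def verify_acquisition(pixel_bytes: bytes, signature: bytes) -> bool:
    digest = hashlib.sha256(pixel_bytes).digest()
    try:
        device_pub.verify(signature, digest)
        return True
    except InvalidSignature:
        return False                               # tampered, or not from this device

pixels = b"\x00\x01\x02\x03"                       # stand-in for the raw detector readout
sig = sign_acquisition(pixels)
print(verify_acquisition(pixels, sig))             # True
print(verify_acquisition(pixels + b"x", sig))      # False: any edit breaks the signature
```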
In a way, yes. We need a chain of custody for pixels. But let's look at the other side of the multimodal coin, which is less about faking and more about seeing things humans simply cannot. Have you looked at Microsoft's GigaTIME? They launched it on March fifteenth in partnership with Providence.
That is the one that turns cheap pathology slides into high-resolution spatial proteomics, right? That sounds like something out of a science fiction movie where they enhance the image and suddenly see the DNA of the killer.
It is not quite that, but it is close. They are taking standard, inexpensive tissue slides—the kind that have been used for a hundred years—and using the model to identify protein-survival associations. They identified one thousand two hundred thirty-four of these associations across twenty-four different types of cancer. It is transforming a static image into a map of biological activity. That is information a human pathologist could never extract just by looking through a microscope. It is finding the "invisible" signals that correlate with how a patient will actually respond to treatment.
So, we have these vision-native powerhouses like Pillar-zero and GigaTIME, but then we have the general-purpose side. Daniel's prompt asks if the future is specialized or just general models with guardrails. If I have GPT-five or the latest Gemini, and it is trained on basically all of human knowledge, why do I need a specialized model for prostate MRIs? Why can't I just give GPT-five a very long set of instructions?
Well, that is the big debate of twenty-twenty-six. The consensus is actually moving toward specialized modular pipelines. There are a few reasons for that. First, there is the cost. Running a massive general-purpose LLM for every single diagnostic check is incredibly expensive in terms of compute. But more importantly, there is the issue of clinical guardrails. General models are "jack of all trades, master of none" when it comes to the extreme precision required in a basement radiology lab.
You mean the rules that keep the AI from suggesting a lobotomy for a headache?
Among other things. Specialized models allow you to encode domain-specific constraints directly into the architecture. Think about Siemens' AI Rad Companion for Prostate MRI. They are testing this at the Cleveland Clinic right now with Dr. Andrei Purysko. That model isn't trying to write poetry or summarize news. It is specifically tuned to the PI-RADS standards, which are the professional guidelines for prostate imaging. It knows the exact dosage limits, the anatomical landmarks, and the specific reporting requirements. It is a tool, not a chatbot.
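As a concrete illustration of encoding constraints into the pipeline, here is a hypothetical output guardrail that rejects any model finding whose PI-RADS category or anatomical zone is impossible before it ever reaches the report. The data structure and rules are illustrative assumptions, not Siemens' actual AI Rad Companion logic.

```python
from dataclasses import dataclass

@dataclass
class ProstateFinding:
    lesion_id: int
    zone: str          # anatomical zone the model reports the lesion in
    pi_rads: int       # PI-RADS assessment category

VALID_ZONES = {"peripheral", "transition", "central", "anterior fibromuscular"}

def validate_finding(f: ProstateFinding) -> list[str]:
    """Hypothetical clinical guardrail: hard constraints checked on every
    model output before it is written into a report."""
    errors = []
    if not 1 <= f.pi_rads <= 5:                       # PI-RADS categories run 1 through 5
        errors.append(f"lesion {f.lesion_id}: PI-RADS {f.pi_rads} out of range")
    if f.zone not in VALID_ZONES:
        errors.append(f"lesion {f.lesion_id}: unknown zone '{f.zone}'")
    return errors

print(validate_finding(ProstateFinding(1, "peripheral", 4)))   # []
print(validate_finding(ProstateFinding(2, "liver", 7)))        # two violations
```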
It is the difference between a general practitioner and a surgeon who has performed the same specific procedure ten thousand times. The surgeon might not know much about your seasonal allergies, but you want them for that specific surgery. And you mentioned those governance frameworks—NIST and ISO. How do they fit into this?
The NIST AI Risk Management Framework and ISO twenty-three eight hundred ninety-four have become the industry standards for how you implement those guardrails. They provide a structured way to measure and manage the risks of these models in a clinical setting. General-purpose models are great for administrative tasks, like summarizing a patient's history or drafting a discharge note—things where a small error is annoying but not fatal. But when it comes to the actual diagnosis, doctors want something that was built for that specific purpose and validated against those specific standards.
It feels like we are seeing a "Death of the Generalist" moment in the high-stakes sectors, which I think we actually talked about in an earlier episode, number eight hundred sixty-nine. The "Data Wall" we hit in early twenty-twenty-six forced everyone to stop focusing on just making the models bigger and start focusing on the "shape" of the data. If you can't get more data, you have to get better data.
That is exactly what is happening. We are moving from quantity to quality and specificity. And that brings us to the NVIDIA GTC conference that just wrapped up a few days ago, on March nineteenth. They announced two major things: Dynamo one point zero and NemoClaw. These are tools specifically designed for agentic reasoning in real-time.
Agentic reasoning. That is the new buzzword, isn't it? It sounds like the AI is going to start filing its own taxes and booking its own vacations.
In a medical context, it means the model can perform a multi-step workflow. For instance, it sees an abnormality on a scan, realizes it needs to compare it to a scan from three years ago, fetches that scan from the archive, notes the change in size, and then looks up the patient's latest lab results to see if their white blood cell count is elevated. It is acting as an active participant in the diagnostic process rather than just a passive filter that says "I see a spot."
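A rough sketch of that agentic pattern might look like the following, where fetch_prior_scan, measure_lesion, and fetch_labs are hypothetical stand-ins, stubbed out so the sketch runs, for whatever PACS and EHR integrations a real system would use.

```python
# Sketch of the multi-step agentic workflow described above. The three helper
# functions are hypothetical stubs, not real PACS / EHR APIs.
from datetime import date

def fetch_prior_scan(patient_id: str, before: date):
    return {"study_date": date(2023, 2, 14)}                    # stub: pretend we found a prior

def measure_lesion(scan, lesion) -> float:
    # stub: fake diameters in millimetres; newer scans show a bigger lesion
    return 14.0 if scan["study_date"].year >= 2026 else 11.5

def fetch_labs(patient_id: str, tests):
    return {"WBC": 12.3}                                        # stub: fake lab panel

def review_abnormality(patient_id: str, current_scan, lesion):
    prior = fetch_prior_scan(patient_id, before=date(2023, 3, 26))  # step 1: comparison study
    size_now = measure_lesion(current_scan, lesion)                 # step 2: measure today
    size_then = measure_lesion(prior, lesion) if prior else None    # step 3: measure the prior
    labs = fetch_labs(patient_id, tests=["WBC"])                    # step 4: pull relevant labs
    wbc_elevated = labs.get("WBC", 0.0) > 11.0                      # rough adult upper limit

    growth = None if size_then is None else size_now - size_then
    return {
        "lesion_size_mm": size_now,
        "interval_growth_mm": growth,
        "wbc_elevated": wbc_elevated,
        "flag_for_radiologist": (growth or 0.0) > 2.0 or wbc_elevated,
    }

print(review_abnormality("patient-001", {"study_date": date(2026, 3, 24)}, lesion="L1"))
```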
And they mentioned something called MONAI updates for real-time active learning. This part actually sounds useful for the doctors who are worried about AI taking their jobs. It is more like the AI is an apprentice that learns from them.
Precisely. In these updated MONAI workflows, the model performs an organ segmentation, which is basically drawing an outline around the heart or the liver to measure its volume. If the doctor sees that the outline is slightly off—maybe it missed the edge of the diaphragm—they correct it manually. The system then uses that correction to retrain the model in real time. It is a continuous feedback loop. The model gets smarter with every single patient, tailored to the nuances of how that particular hospital or that particular doctor works.
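Here is a minimal sketch of that correction-driven feedback loop using MONAI building blocks, assuming a 3-D UNet and a Dice loss; a production active-learning setup would batch corrections, validate before redeploying, and version the model, none of which is shown here.

```python
# Minimal sketch of the feedback loop: one gradient step on a radiologist-corrected mask.
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

model = UNet(spatial_dims=3, in_channels=1, out_channels=2,
             channels=(16, 32, 64, 128), strides=(2, 2, 2))
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # tiny LR: nudge, don't overwrite

def learn_from_correction(volume: torch.Tensor, corrected_mask: torch.Tensor) -> float:
    """volume: (1, 1, D, H, W) CT/MR patch; corrected_mask: (1, 1, D, H, W)
    the radiologist-edited segmentation used as the new ground truth."""
    model.train()
    optimizer.zero_grad()
    prediction = model(volume)
    loss = loss_fn(prediction, corrected_mask)
    loss.backward()
    optimizer.step()
    return loss.item()

# example usage with a dummy 32^3 patch (a real patch would come from the PACS)
volume = torch.randn(1, 1, 32, 32, 32)
corrected_mask = torch.randint(0, 2, (1, 1, 32, 32, 32))
print(learn_from_correction(volume, corrected_mask))
```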
That seems like it would help with the "AI Chasm" problem we talked about earlier. If the model is constantly being validated and corrected by a human expert, you have a much better chance of it actually improving outcomes. But we still have that ninety-five percent statistic hanging over our heads. We have all this tech, we have NVIDIA's latest chips, we have three-D vision models, but we still aren't sure if we are actually saving lives. Why is it so hard to prove?
Because proving a patient outcome takes time and a lot of messy variables. You have to follow a patient for months or years to see if the AI's early detection of a nodule actually led to a better survival rate compared to a human-only detection. Maybe the AI found it early, but the treatment didn't work. Or maybe the human would have found it two weeks later anyway. The industry is finally starting to realize that FDA clearance is just the beginning of the journey, not the finish line. We need longitudinal studies, not just accuracy percentages.
So, if you are a developer or a clinician listening to this, what is the actual takeaway? Because it feels like a very confusing time to be in medical AI. You have deepfakes on one side, brilliant three-D models on the other, and a whole lot of regulatory paperwork in the middle.
The first takeaway is that workflow-native is the only way forward. If your AI tool requires a doctor to open a separate window, log into a different system, or even click an extra button, it is probably going to fail. It has to live inside the PACS. It has to be part of the air they breathe.
And don't ignore the security side. If you aren't thinking about cryptographic signatures and watermarking for your imaging data, you are leaving the door wide open for some very high-tech fraud or even medical malpractice. We need to be able to prove that the pixels are real and that they haven't been tampered with between the machine and the doctor's screen.
Also, focus on the shape of the data. As we move further into twenty-twenty-six, the models that are winning are the ones that use specialized, high-quality datasets like the ones used for Pillar-zero. General-purpose models are becoming the utility players for office work—writing emails, summarizing meetings—but the diagnostic heavy lifting is being done by specialized, modular pipelines that respect clinical constraints.
I think we also need to keep an eye on that "agentic" shift. If you missed our episode fifteen hundred on the era of agentic AI, you should go back and listen to that. It explains the transition from chatbots to these multi-surface operating layers. In medicine, that means the AI isn't just a box you talk to; it is a background process that is constantly connecting the dots between your imaging, your labs, and your history. It is the difference between a search engine and a research assistant.
It is a lot to take in, but it is an incredibly exciting time. We are finally moving past the hype of "AI is going to replace radiologists" and into the reality of "AI is going to give radiologists superpowers they never imagined." We are moving from the era of detection to the era of understanding.
As long as those superpowers include the ability to spot a deepfake rib fracture, I am all for it. But seriously, the Pillar-zero stuff is the most promising thing I have seen in a while. Being able to interpret the whole volume directly feels like the way it should have always been done. It is more natural.
It is the natural evolution. We had to start with slices because that is how our hardware and our brains worked. We could only process so much information at once. But now that we have the compute and the architectures to handle three-D volumes, going back to two-D feels like trying to understand a statue by looking at a thousand polaroids of it. You lose the essence of the form.
Well, I for one am glad I am a sloth and not a radiologist. My diagnostic workflow mostly involves deciding which branch is the sturdiest for a nap. But for the humans out there, this transition into vision-native, workflow-integrated AI is going to be the defining story of the next few years. It is about closing that AI Chasm and making sure the tech actually helps the person in the hospital bed.
It certainly will. We have to bridge that gap. We have the tools; now we just need the evidence that they work for the people who need them most. And we need the governance to make sure they are used safely.
That feels like a good place to wrap this one up. Thanks to Daniel for the prompt. It really forced us to look at the current state of the art, which is moving faster than my brain can usually handle on a Tuesday. Or a Wednesday. Or any day, really.
It was a great deep dive. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes and making sure our audio doesn't have any deepfake artifacts.
And a big thanks to Modal for providing the GPU credits that power this show and let us run these models to see what they are actually doing. Without them, we would just be two guys talking about papers we haven't actually tested.
This has been My Weird Prompts. If you are enjoying the show, a quick review on your podcast app really helps us reach more people who are interested in this kind of deep-tech exploration. It helps us climb the charts and find more curious minds.
You can find us at myweirdprompts dot com for the full archive and all the ways to subscribe. We will be back soon with another prompt from Daniel, hopefully one that involves less terrifying deepfakes and more cool robots.
See you then.
Stay real. Or at least, stay more real than a deepfake X-ray. Goodbye.