#2919: How CPR Guidelines Actually Get Updated

The surprising data loop that turns a single study into what millions learn to do with their hands.

Episode Details

Episode ID: MWP-3089
Published: May 19
Duration: 28:13
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: emergency-preparedness medical-history public-health

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The process for updating first aid and CPR protocols is far more rigorous—and more fascinating—than most people realize. At the center of it all is ILCOR, the International Liaison Committee on Resuscitation, formed in 1992 with seven member organizations spanning North America, Europe, Asia, Africa, and Australia. ILCOR produces CoSTR documents—Consensus on Science with Treatment Recommendations—through systematic evidence reviews conducted every three to five years. Each regional body, like the American Heart Association or the European Resuscitation Council, then adapts those consensus findings into localized guidelines.

The methodology driving this system is called GRADE—Grading of Recommendations, Assessment, Development, and Evaluations. It replaced an older, more opinion-driven classification system in 2010. GRADE forces reviewers to assess studies across five dimensions: risk of bias, inconsistency, indirectness, imprecision, and publication bias. Each study receives a certainty rating—high, moderate, low, or very low—which then determines whether the resulting recommendation is strong ("we recommend") or weak ("we suggest"). A famous example: the 2015 change in compression depth guidelines, from 1.5–2 inches to 2–2.4 inches, came from a meta-analysis of over 7,000 cardiac arrest cases that scored high across all five GRADE dimensions and showed a 22% improvement in survival to discharge.
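
As a rough illustration of the rating logic described above, here is a minimal Python sketch. It assumes evidence starts at high certainty and drops one level per serious concern in any of the five domains, which is the general GRADE convention for randomized evidence. The function names and the `min_for_strong` rule are hypothetical simplifications: real task forces also weigh harms, values, and resources before choosing the wording.

```python
from enum import IntEnum


class Certainty(IntEnum):
    VERY_LOW = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3


# The five GRADE domains named in the paragraph above.
DOMAINS = (
    "risk_of_bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication_bias",
)


def rate_certainty(downgrades, start=Certainty.HIGH):
    """Subtract one level per serious concern (two per very serious concern).

    `downgrades` maps a domain name to 0, 1, or 2; omitted domains count as 0.
    """
    unknown = set(downgrades) - set(DOMAINS)
    if unknown:
        raise ValueError(f"unknown GRADE domain(s): {sorted(unknown)}")
    total = sum(downgrades.values())
    return Certainty(max(int(start) - total, int(Certainty.VERY_LOW)))


def wording(certainty, min_for_strong=Certainty.HIGH):
    """Assumed rule: only certainty at or above `min_for_strong` earns 'we recommend'."""
    return "we recommend" if certainty >= min_for_strong else "we suggest"


clean = rate_certainty({})                    # no concerns in any domain
noisy = rate_certainty({"inconsistency": 1})  # trials disagree with each other
print(clean.name, "->", wording(clean))
print(noisy.name, "->", wording(noisy))
```

With these assumptions, a single serious inconsistency concern is enough to move a recommendation from "we recommend" to "we suggest", which mirrors the naloxone example discussed in the transcript.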

The system isn't static. The AHA is currently in its 2025–2027 review cycle, integrating COVID-era field data and piloting "living guidelines"—continuous evidence surveillance that can update recommendations within weeks when practice-changing trials emerge. Data now flows not just from paramedics but from layperson apps like PulsePoint and GoodSAM, which have contributed hundreds of thousands of real-world events to registries like CARES. This field feedback loop recently prompted a simplification of the recovery position protocol from seven steps to four, after a study found that 34% of untrained bystanders performed the standard instructions incorrectly. The system's honesty about uncertainty—sometimes new evidence weakens rather than strengthens a recommendation—is a feature, not a bug.

#2919: How CPR Guidelines Actually Get Updated

Daniel sent us this one — he's asking how first aid protocols actually get written, updated, and revised as new data comes in. The CPR guidelines you learned five years ago might already be wrong, and that's by design. He wants to know what the data loop looks like from start to finish — how a single study in a journal somewhere eventually changes what a bystander does with their hands on someone's chest. There's a lot to unpack here.

The timing on this question is actually perfect, because right now — May twenty twenty-six — the American Heart Association is smack in the middle of its twenty twenty-five to twenty twenty-seven evidence review cycle. They're integrating data from the COVID-era out-of-hospital cardiac arrest surge, which produced an enormous amount of field data that's only now being fully analyzed. So the loop is actively spinning as we speak.

Which raises the question most people never ask — who actually decides what goes on that poster in your office break room?

That's exactly where we should start. The short answer is an organization called ILCOR — the International Liaison Committee on Resuscitation. Formed in nineteen ninety-two, it now includes seven member organizations covering North America, Europe, Asia, Africa, and Australia. The American Heart Association is one member. The European Resuscitation Council is another. And ILCOR's whole purpose is to produce something called CoSTR documents — Consensus on Science with Treatment Recommendations. They conduct systematic evidence reviews every three to five years, and those reviews become the foundation that each member organization adapts into its own guidelines.

ILCOR is the engine room, and then each regional body does its own translation for local context.

And calling it an engine room is more literal than you might think, because the methodology they use — it's called GRADE, which stands for Grading of Recommendations, Assessment, Development, and Evaluations — this is a rigorous, almost industrial process for turning raw studies into actionable recommendations. GRADE was first applied to resuscitation science in twenty ten, replacing an older system that used Class One, Two-A, Two-B, and Three designations. The old system was more opinion-driven. GRADE forces you to quantify uncertainty.

Walk me through it. A study drops showing that compression-only CPR improves survival by twelve percent. What happens next?

First, that study doesn't just get read and nodded at. It gets fed into a systematic review. ILCOR has topic-specific task forces — there's one for basic life support, one for advanced life support, one for first aid, one for pediatrics, and so on. Each task force includes methodologists, clinicians, and increasingly, patient representatives. They take that study and they assess it across five dimensions: risk of bias, inconsistency, indirectness, imprecision, and publication bias.

Break those down for me.

Risk of bias is straightforward — was the study well-designed? Was randomization proper? Were the groups comparable at baseline? Inconsistency asks whether different studies point in different directions. If three trials say compression-only CPR helps and two say it doesn't, that's inconsistency, and it lowers your certainty. Indirectness is about whether the study population matches the population you're making a recommendation for — if the study was done on healthy twenty-five-year-olds but your guideline is for seventy-year-olds with comorbidities, that's indirect. Imprecision is about sample size and confidence intervals — if the twelve percent improvement has a confidence interval ranging from negative three to plus twenty-seven, you can't be confident it's real. And publication bias asks whether negative studies might have been buried in file drawers somewhere.
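
The imprecision point can be made concrete with a tiny Python sketch. It uses the interval quoted above (a twelve percent improvement with a range of negative three to plus twenty-seven) and checks whether that range includes zero, meaning no effect. The function name and the tighter comparison interval are hypothetical, added only for contrast.

```python
def crosses_no_effect(ci_low, ci_high, null_value=0.0):
    """True when the confidence interval includes the no-effect value."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return ci_low <= null_value <= ci_high


# Numbers from the example above: interval from -3 to +27 around a 12% gain.
wide = crosses_no_effect(-3.0, 27.0)

# Hypothetical tighter interval around the same estimate, for contrast.
tight = crosses_no_effect(5.0, 19.0)

print(wide, tight)  # True False
```

An interval that crosses zero is compatible with the technique helping, doing nothing, or slightly harming, which is why reviewers would mark the evidence down for imprecision.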

The file drawer problem. Of course they are.

It's a real issue in resuscitation science, actually, because negative trials are harder to publish. Nobody wants to read "new CPR technique doesn't work." But those null results matter enormously for evidence synthesis. Anyway, after scoring all five dimensions, the task force assigns a certainty rating — high, moderate, low, or very low. That certainty rating then drives the strength of the recommendation. High-certainty evidence can support a strong recommendation. Low-certainty evidence typically produces a weak recommendation, even if the effect size looks promising.

A weak recommendation means what in practice? "We think this might help, but don't bet the farm"?

The formal language is "we suggest" versus "we recommend." A strong recommendation means the benefits clearly outweigh the harms for almost all patients in almost all settings. A weak recommendation means there's a closer balance, and the right choice might depend on context, patient preference, or resource availability. And this is where things get interesting, because some of the most famous guideline changes in recent years came from weak recommendations that later strengthened as evidence accumulated.

Give me a case study.

The twenty fifteen AHA guideline change on compression depth is the textbook example. Before twenty fifteen, the recommendation was one and a half to two inches of compression depth. That was based on older, lower-quality evidence. Then a massive meta-analysis came out — over seven thousand out-of-hospital cardiac arrest cases — showing that deeper compressions, in the two to two-point-four inch range, improved survival to hospital discharge by twenty-two percent. That meta-analysis scored high on the GRADE assessment across all five dimensions. It was consistent across multiple studies, the populations matched, the effect size was precise, and there was no evidence of publication bias. So in twenty fifteen, the guideline changed from one-point-five-to-two inches to two-to-two-point-four inches. That's a concrete, measurable change in what millions of people are taught to do with their hands, driven directly by data.

Twenty-two percent improvement in survival to discharge. That's not marginal.

It's enormous. And it came from looking at the totality of evidence, not just one trial. That's the key thing about GRADE — it forces you to synthesize across studies rather than cherry-picking the most exciting result. The flip side is what happens when two high-quality trials contradict each other. That actually happened with naloxone administration protocols for opioid overdose. In the twenty fifteen guidelines, the AHA gave a strong recommendation for layperson naloxone use based on observational data. But when the twenty twenty review cycle came around, two new randomized trials had produced conflicting results — one showed clear benefit, the other showed no statistically significant difference. The task force downgraded the certainty of evidence from high to moderate, and the recommendation shifted from strong to weak. Not because naloxone stopped working, but because the evidence base got noisier.

That's counterintuitive to most people. New data made the recommendation weaker, not stronger.

That's actually a sign of a healthy evidence system. A weak recommendation isn't a failure — it's an honest representation of what we actually know. The problem is that weak recommendations are harder to communicate. A first aid instructor wants to say "do this." They don't want to say "well, the evidence is mixed, but on balance we suggest you consider this approach." That's where the implementation science gap starts to bite.

Let's hold that thought, because I want to get into the feedback loop from the field. But first — the actual mechanics of an ILCOR review. How many people are we talking about? How long does it take?

A full ILCOR review cycle typically involves hundreds of reviewers across the seven member organizations. The process starts with topic prioritization — the task forces identify questions that need updating based on new published evidence or identified gaps. Then they commission systematic reviews, which can take six to eighteen months each. The draft CoSTR documents go through internal review, then a public comment period where anyone can submit feedback. After revisions, the final CoSTR is published simultaneously in multiple journals — usually Circulation for AHA, Resuscitation for ERC — and the member organizations then spend another six to twelve months adapting the consensus science into their own localized guidelines.

From a study being published to a guideline changing, we're talking two to four years minimum.

And that's the tension at the heart of this whole system. The rigor takes time, but patients are having cardiac arrests right now. Which is why the concept of living guidelines has become such a big deal.

Define living guidelines.

Instead of updating every five years, you maintain a continuous evidence surveillance system. When new evidence meets a pre-specified threshold — say, a large randomized trial with practice-changing implications — the guideline gets updated within weeks or months, not years. The AHA launched a pilot program for this in twenty twenty-four, focused initially on COVID-nineteen and cardiac arrest, where the evidence was moving so fast that a five-year cycle was laughable. Recommendations were being updated quarterly. The twenty twenty-five to twenty twenty-seven cycle I mentioned earlier is actually a hybrid — some topics are on the traditional track, others are on the living guideline track.
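
A pre-specified threshold like the one described above could be sketched in Python as follows. Everything here is hypothetical: the `NewEvidence` fields, the 1,000-participant cutoff, and the two track names are illustrative assumptions, not the AHA's actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewEvidence:
    title: str
    randomized: bool
    participants: int
    practice_changing: bool


def meets_threshold(item, min_participants=1000):
    """Assumed trigger: a large randomized trial with practice-changing implications."""
    return (
        item.randomized
        and item.participants >= min_participants
        and item.practice_changing
    )


def triage(items, min_participants=1000):
    """Split incoming evidence between the living track and the scheduled cycle."""
    tracks = {"rapid_update": [], "next_scheduled_review": []}
    for item in items:
        key = "rapid_update" if meets_threshold(item, min_participants) else "next_scheduled_review"
        tracks[key].append(item.title)
    return tracks


# Made-up inbox of three studies.
inbox = [
    NewEvidence("Large RCT", randomized=True, participants=4200, practice_changing=True),
    NewEvidence("Small pilot RCT", randomized=True, participants=80, practice_changing=True),
    NewEvidence("Registry analysis", randomized=False, participants=9000, practice_changing=False),
]
print(triage(inbox))
```

The design point is that the threshold is fixed in advance, so the decision to fast-track a study does not depend on how exciting its result looks.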

This is where the field data feedback loop comes in, right? Because living guidelines need real-time data to work.

And this is the part of the data loop most people never see. Let me trace it for you. An out-of-hospital cardiac arrest happens. EMS responds, and the crew records everything — compression depth, rate, hands-off time, defibrillation timing, drugs administered, outcomes. That data flows into a registry. In the US, the big ones are the Cardiac Arrest Registry to Enhance Survival — CARES — and the Resuscitation Outcomes Consortium. But here's where it gets interesting: data is also coming from layperson apps now. PulsePoint, which alerts trained bystanders to nearby cardiac arrests, has contributed data on over two hundred thousand events to the AHA's registry since twenty eighteen. GoodSAM, a similar platform used in the UK and Australia, feeds into the same ecosystem. So you're getting data not just from paramedics, but from ordinary people who happened to be nearby with a phone.

Which means the evidence base now includes data from non-clinical settings — offices, schools, homes, grocery stores.

That matters enormously, because a protocol that works when performed by a paramedic in an ambulance might fail completely when performed by a panicked office worker in a break room. The twenty twenty-four recovery position study is the perfect example. Researchers ran twelve hundred layperson simulations — just regular people, no medical training — and found that thirty-four percent of them performed the recovery position incorrectly when given the standard seven-step instructions. The most common errors were failing to tilt the head back adequately to maintain an open airway, and rolling the person too far forward or backward. Thirty-four percent failure rate on a technique that's supposed to prevent airway obstruction in an unconscious person.

One in three people were doing it wrong.

Worse — one in three were doing it wrong in a way that could compromise the airway, which is exactly what the recovery position is supposed to protect. The AHA responded in the twenty twenty-five revision by simplifying the protocol from seven steps to four, with a new emphasis on what they call the log roll technique — keeping the head, neck, and torso aligned as a single unit during the roll, rather than manipulating each body part separately. They also changed the instructional graphics to show the hand placement more clearly, because the simulations revealed that people were consistently misinterpreting the old diagrams.
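
For a sense of scale, the arithmetic behind the figures quoted above can be checked in a few lines of Python. The count follows directly from thirty-four percent of twelve hundred. The confidence interval is an added assumption: it uses the simple normal (Wald) approximation and treats the simulations as an independent random sample, which the episode does not state.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


n = 1200
incorrect = n * 34 // 100  # 34% of 1,200 simulations
low, high = wald_interval(incorrect, n)

print(incorrect)                         # 408
print(round(low, 3), round(high, 3))     # 0.313 0.367
print(low <= 1 / 3 <= high)              # True: "one in three" sits inside the interval
```

Under those assumptions, the "one in three" shorthand is a fair reading of the result and not a rounding stretch.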

That's a feedback loop in action. Field data revealed a failure mode, and the protocol changed. But here's what I'm wondering — who's actually watching for these failure modes? Is there someone whose job it is to notice that thirty-four percent of people are getting the recovery position wrong?

This is where the implementation science gap comes in, and it's a real problem. The evidence review process is excellent at evaluating clinical efficacy — does this intervention work under ideal conditions? But it's much weaker at evaluating implementation effectiveness — does this intervention work when deployed at scale to a diverse population with varying levels of training, motivation, and resources? There's a whole academic discipline called implementation science that studies exactly this, but it's only been systematically integrated into guideline development in the last five to seven years.

You can have a Grade-A, high-certainty recommendation that completely fails in practice because nobody considered the cognitive load of performing it under stress.

The twenty twenty-three ERC guideline on tourniquet use for hemorrhage control is a case study in exactly this problem. The clinical evidence for tourniquets is strong — they save lives in severe extremity bleeding. High-certainty evidence, strong recommendation. But when the ERC looked at real-world implementation data, they found multiple failure points. First, many commercially available tourniquets are counterfeit or poorly manufactured, and laypeople can't tell the difference. Second, the training requirements are significant — you need hands-on practice to apply a tourniquet correctly under stress, and most people never get that practice. Third, there are cultural barriers in some regions where tourniquet use is associated with military combat, making civilians reluctant to apply one. So the twenty twenty-three guideline had to add a whole section on equipment verification, training frequency, and public education campaigns — none of which came from the clinical efficacy data. It all came from the implementation feedback loop.

That's the messy part. The clinical evidence says "tourniquets work." The real world says "yes, but."

The "yes, but" is where information gets lost or distorted. Let me map out the failure modes in the data loop. The first is selection bias in what gets studied at all. Most resuscitation research is conducted in high-income countries with advanced EMS systems. The results may not generalize to low-resource settings where response times are longer, equipment is scarce, and bystander training is minimal. ILCOR has been trying to address this by including member organizations from Asia and Africa, but the evidence base is still heavily skewed toward North America and Europe.

What's the second failure mode?

Reporting bias in the field data. EMS agencies that are well-resourced and well-run are more likely to submit data to registries like CARES. The agencies that are struggling — the ones where protocols might be failing most dramatically — often have the least capacity for data reporting. So the registry data looks better than reality, because the worst-performing sites are underrepresented.

Survivorship bias, basically. The data that survives to be analyzed comes from the systems that are already functioning well enough to collect it.
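
The reporting-bias effect described above can be shown with a toy Python calculation. The four agencies and all of their numbers are invented purely for illustration and are not drawn from CARES or any real registry. The sketch assumes the two weaker agencies do not submit data.

```python
# (agency, cardiac arrest cases, survivors to discharge, reports to registry)
# All values are made up for illustration only.
AGENCIES = [
    ("A", 400, 48, True),
    ("B", 300, 33, True),
    ("C", 200, 12, False),
    ("D", 100, 5, False),
]


def survival_rate(rows):
    """Pooled survival: total survivors divided by total cases."""
    cases = sum(row[1] for row in rows)
    survivors = sum(row[2] for row in rows)
    return survivors / cases


true_rate = survival_rate(AGENCIES)
registry_rate = survival_rate([row for row in AGENCIES if row[3]])

print(f"all agencies:  {true_rate:.1%}")
print(f"registry only: {registry_rate:.1%}")
```

With these invented inputs the registry view overstates survival, because the agencies with the worst outcomes are the ones missing from it.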

And the third failure mode is the time lag we already discussed. Even with living guidelines, there's a gap between when a problem is identified in the field and when the guideline changes. During that gap, people are being trained on outdated protocols. And once a protocol is embedded in training materials, certification requirements, and legal standards of care, it's incredibly sticky. Changing the guideline is step one. Getting every first aid instructor, every workplace safety officer, every CPR card holder to update their knowledge — that's a whole separate challenge.

This is where Herman the pediatrician has opinions, I suspect.

The distribution of updated first aid guidelines to parents is severely broken. When I was practicing, the standard was that new parents got a five-minute briefing before discharge from the hospital, maybe a pamphlet, and that was it. No systematic update mechanism. The choking protocol changed in twenty twenty-four — the AHA unified what had been separate age-based guidelines into a single five-and-five approach. Back blows and abdominal thrusts for everyone over one year old. That change was evidence-based and important. But how many parents who were trained before twenty twenty-four know about it? There's no push notification for "the way you were taught to save your choking child has been updated."

Which means the data loop is technically closed — the evidence was reviewed, the guideline was changed — but the last mile, the actual human being whose behavior needs to change, is a black hole.

This is where the apps might actually help. PulsePoint and GoodSAM don't just collect data — they also push updates to registered users. If you're a PulsePoint user and the CPR compression depth guideline changes, the app can update its in-app instructions immediately. That's a direct digital pipeline from guideline to end user that bypasses the traditional multi-year training cycle entirely.

That only works for people who have the app.

Which is a tiny fraction of the population. The app users are self-selected — they're already motivated, already trained, already engaged. They're not the people who most need the update. The person who took a CPR class ten years ago and hasn't thought about it since — that person is invisible to the data loop in both directions. They're not generating field data because they're not reporting their experiences, and they're not receiving updates because there's no channel to reach them.

We've got this incredibly sophisticated, multi-organizational, international evidence review apparatus — GRADE methodology, systematic reviews, public comment periods, living guideline pilots — and the whole thing dead-ends at a pamphlet that someone might or might not have read in twenty nineteen.

I want to push back slightly on that framing, because it's not quite that bleak. The guidelines do propagate through institutional channels — hospitals, EMS agencies, workplace safety programs, professional certifications. If you're a paramedic, your protocols are updated regularly because your medical director is accountable for keeping them current. If you're a lifeguard, your employer updates your training every season. The system works reasonably well for professionals. It's the layperson gap that's so frustrating.

Which brings us to what people can actually do with this information. But before we get to takeaways, I want to hit one more thing — the twenty twenty-seven ILCOR meeting in Amsterdam. What's on the agenda?

The big item is expected to be a formal framework for incorporating machine-learning-derived evidence from wearable devices into guideline development. Smartwatches now generate continuous ECG data from millions of people. When someone has a cardiac event while wearing an Apple Watch or a Fitbit, that device captures data that no clinical study could ever replicate — the actual physiological transition from normal rhythm to arrest, in real time, in the wild. The question for Amsterdam is how to evaluate that data within the GRADE framework. It's not a randomized controlled trial. It's not even a prospective observational study. It's passively collected, algorithmically processed, and subject to all kinds of selection biases — people who wear smartwatches are wealthier, healthier, and more tech-savvy than the general population.

You've got a data source that's incredibly rich but doesn't fit any of the existing evidence categories.

That's exactly the tension they'll be wrestling with. The working group has been exploring whether a new evidence category is needed — something like "real-world data with algorithmic adjudication" — that would have its own set of quality criteria distinct from traditional study designs. If they propose that framework in twenty twenty-seven, it could be the biggest change to evidence evaluation methodology since GRADE was adopted in twenty ten.

Which raises the question that I think sits underneath this whole conversation. If AI and wearables are generating real-time physiological data during emergencies, does the five-year review cycle become obsolete entirely?

I think it becomes obsolete for certain types of recommendations. For things where the evidence base is stable — compression depth, defibrillation timing — a five-year cycle is probably fine. But for things where the evidence is evolving rapidly — drug therapies during arrest, post-resuscitation care, anything involving new technology — continuous surveillance is the only model that makes sense. The challenge is that continuous surveillance requires continuous funding, continuous staffing, and continuous methodological rigor. It's much easier to convene a task force every five years than to maintain a standing committee that reviews new evidence every month.

The bottleneck isn't methodological, it's institutional. The will and the money.

As it so often is. ILCOR's annual budget is not public, but these are largely volunteer organizations. The task force members are mostly academics who do this on top of their regular jobs. The systematic review infrastructure is cobbled together from grants and institutional support. The whole system runs on goodwill and professional obligation, which is inspiring and terrifying in equal measure.

Like adopting a feral cat.

I'm not sure that analogy holds, but I take your point.

What does this mean for someone listening who's not a researcher or a guideline developer? If you're a first aid instructor, a workplace safety officer, or just someone who wants to actually know what to do in an emergency, what do you do with all this?

Three concrete things. First, if you're responsible for training others — and this includes workplace safety officers, scout leaders, coaches, anyone who maintains first aid certification for a group — you should be checking the AHA or ERC website quarterly, not every five years when your certification expires. The living guideline updates are posted as they happen, and some of them are practice-changing. You can set up email alerts for the Circulation journal's guideline updates, or follow the AHA's CPR guidelines Twitter account, which posts updates within days of publication.

That's a specific, actionable rhythm.

Second, the next time you see a first aid poster in a workplace, a school, a gym — look for two things: the publication date and the evidence grade. A proper guideline poster will say something like "Class One, Level B-R" or it'll reference the specific CoSTR document it's based on. If neither of those is present, the poster is probably outdated or was produced without evidence review. I've seen posters from the nineteen nineties still hanging in church basements.

Class One, Level B-R — what does that actually decode to?

Under the older system that's still used alongside GRADE in some AHA materials, Class One means the benefit greatly outweighs the risk — it's a strong recommendation. Level B-R means the evidence comes from one or more randomized controlled trials, but they're of moderate quality. It's a useful shorthand, and if you learn to read it, you can tell at a glance how much confidence to place in the recommendation.
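
Reading those labels is essentially a table lookup, which a short Python sketch can illustrate. The descriptions below are paraphrased summaries of the AHA's Class of Recommendation and Level of Evidence categories, not official wording, and the `decode` helper is hypothetical. Check the current AHA tables before relying on any of it.

```python
# Paraphrased summaries, not the AHA's official wording.
CLASS_OF_RECOMMENDATION = {
    "I": "strong: benefit far outweighs risk",
    "IIa": "moderate: benefit outweighs risk",
    "IIb": "weak: benefit may only match or slightly exceed risk",
    "III: No Benefit": "moderate: benefit roughly equals risk",
    "III: Harm": "strong: risk outweighs benefit",
}

LEVEL_OF_EVIDENCE = {
    "A": "high-quality evidence from more than one randomized trial",
    "B-R": "moderate-quality evidence from one or more randomized trials",
    "B-NR": "moderate-quality evidence from nonrandomized studies",
    "C-LD": "limited data",
    "C-EO": "consensus of expert opinion",
}


def decode(cor, loe):
    """Turn a label pair such as ('I', 'B-R') into a plain-language summary."""
    try:
        return f"{CLASS_OF_RECOMMENDATION[cor]}; {LEVEL_OF_EVIDENCE[loe]}"
    except KeyError as missing:
        raise ValueError(f"unrecognized label: {missing}") from None


print(decode("I", "B-R"))
```

The first half of the label says how firmly the action is advised, and the second half says what kind of evidence stands behind it, so the two should always be read together.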

The third thing?

You can contribute data yourself. If you witness or respond to a medical emergency and you're willing to share what happened, the AHA's Get With The Guidelines registry and the Resuscitation Outcomes Consortium both accept case reports from laypeople. There's also an app called the First Aid Report that lets you document what you did and what the outcome was, and that data gets anonymized and fed into the evidence base. It's not just for professionals. The thirty-four percent failure rate on the recovery position came from layperson simulations — ordinary people agreeing to be observed. Your experience, especially if something went wrong or you had to improvise, is genuinely valuable data.

The loop isn't closed until people feed their experiences back in.

That's the whole point. The data loop only works if data flows in both directions. Guidelines go out, outcomes come back. If the outcomes don't come back — if nobody reports that the recovery position is failing in the field — the guideline doesn't change, and more people get hurt. The system is only as good as the feedback it receives.

Most people don't know they're part of the system at all.

They don't. But they are. Every time someone performs CPR, every time someone puts a person in the recovery position, every time someone applies a tourniquet — they're generating data. The question is whether that data gets captured or evaporates.

Now — Hilbert's daily fun fact.

Hilbert, what do you have for us today?

Hilbert: In nineteen seventy-three, radio astronomers using the Algonquin Radio Observatory detected an unexplained narrowband signal at one-point-four-two gigahertz originating from the direction of the Comoros archipelago. The signal lasted seventeen seconds, never repeated, and remains catalogued as source SHGb zero-two-plus-fourteen-a in the set of candidate interstellar transmissions — though later analysis suggested it may have been a terrestrial reflection off a geostationary satellite passing through the beam.

A seventeen-second mystery from the Comoros. I'll take it.

Here's the thought I'll leave listeners with. We've traced this entire data loop — from a single study in a journal, through the GRADE machinery, through ILCOR's systematic reviews, through the adaptation into regional guidelines, through the implementation gap, through the field data feedback from apps and registries, and back into the evidence base. It's an impressive system. It's also slow, patchy, underfunded, and fails to reach the people who most need its output. The twenty twenty-seven Amsterdam meeting might accelerate parts of it. But the last mile problem — getting updated protocols into the hands and heads of ordinary people — that's going to take more than a new evidence framework. It's going to take a cultural shift in how we think about first aid training. Not as something you learn once and certify once, but as something you stay connected to.

That's the open question I think we're left with. In a world where your phone can alert you to a cardiac arrest happening fifty meters away, why can't it also tell you that the protocol you learned has changed? The technology exists. The institutional will is what's lagging.

This has been My Weird Prompts. Thanks to our producer, Hilbert Flumingtop, for keeping the ship pointed vaguely forward. If you enjoyed this episode, we'd love a review wherever you get your podcasts — it helps other people find the show. Find more at myweirdprompts dot com. I'm Corn.

I'm Herman Poppleberry. Check your first aid posters.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
