#794: AI’s Secret Language: The Return of the Modem Screech

Why are AI agents talking in modem screeches? Explore the high-speed world of acoustic data and the future of "silent" secret messaging.

Episode Details

Duration: 28:21
Pipeline: V4
TTS Engine: LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In the rapidly evolving landscape of artificial intelligence, a peculiar phenomenon has emerged: AI agents are beginning to ditch human language in favor of high-speed, acoustic "handshakes." While these sounds might remind listeners of the screeching and grinding of 1990s dial-up modems, they represent a sophisticated leap in how machines exchange information in the physical world.

The Inefficiency of Human Language

Human language is essentially a biological legacy system. It is limited by the physical constraints of vocal cords and ears, operating at a relatively low bitrate. For AI agents capable of processing massive amounts of data in milliseconds, waiting for a "text-to-speech" or "speech-to-text" conversion is an agonizingly slow bottleneck. By communicating directly through raw audio tokens, agents can bypass the "word" phase entirely. This allows them to transmit complex instructions or data structures in a fraction of a second—what sounds like a simple chirp to a human ear can contain the equivalent of several paragraphs of text.
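To make the bandwidth gap concrete, here is a back-of-envelope comparison in Python. The speaking rate, burst length, and burst bitrate are illustrative assumptions, not measured figures:

```python
# Back-of-envelope comparison of spoken English vs. a raw acoustic data burst.
# All rates here are illustrative assumptions, not measured figures.

def speech_bitrate_bps(words_per_minute=150, chars_per_word=5, bits_per_char=8):
    """Approximate information rate of dictated English text."""
    chars_per_second = words_per_minute * chars_per_word / 60
    return chars_per_second * bits_per_char

def burst_payload_chars(burst_seconds=0.5, burst_bitrate_bps=8000, bits_per_char=8):
    """Characters that fit in a short acoustic burst at an assumed bitrate."""
    return burst_seconds * burst_bitrate_bps / bits_per_char

print(f"speech: ~{speech_bitrate_bps():.0f} bits/s")      # ~100 bits/s
print(f"0.5 s burst: ~{burst_payload_chars():.0f} chars")  # ~500 chars
```

Even with these rough numbers, a half-second burst at a modest acoustic bitrate carries roughly forty seconds' worth of dictated text.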

Data Over Sound: How it Works

This method of communication, often called "data over sound" or acoustic communication, utilizes the physical environment as a medium. While machines on the same server can exchange data instantly via internal networks, robots or devices in the same room may not share a local network connection. In these cases, speakers and microphones become a universal, zero-configuration interface.

To ensure these messages aren't lost in noisy environments, such as a crowded cafe or a moving vehicle, modern protocols use Orthogonal Frequency Division Multiplexing (OFDM), the same technique found in Wi-Fi and LTE, adapted for sound. OFDM splits the data across many frequencies simultaneously, and combined with error-correcting codes this redundancy means that even if a specific noise source, like a car engine or an air conditioner, drowns out one frequency, the data still reaches its destination intact.
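A toy OFDM-style modulator can illustrate the idea. The subcarrier frequencies, symbol length, and QPSK mapping below are illustrative choices, not any real acoustic standard:

```python
import cmath
import math

# Toy OFDM-style modulator: spreads bit pairs across several audio-frequency
# subcarriers at once. Frequencies and rates are illustrative, not a real spec.

SUBCARRIERS = [1000.0, 1200.0, 1400.0, 1600.0]  # Hz, one QPSK symbol each
SAMPLE_RATE = 48000
SYMBOL_SECONDS = 0.01  # one 10 ms OFDM symbol

QPSK = {  # two bits -> unit-magnitude complex symbol (Gray-coded)
    (0, 0): cmath.exp(1j * math.pi / 4),
    (0, 1): cmath.exp(3j * math.pi / 4),
    (1, 1): cmath.exp(-3j * math.pi / 4),
    (1, 0): cmath.exp(-1j * math.pi / 4),
}

def modulate(bits):
    """Map two bits onto each subcarrier, then sum the carriers in time."""
    assert len(bits) == 2 * len(SUBCARRIERS)
    symbols = [QPSK[(bits[2 * i], bits[2 * i + 1])]
               for i in range(len(SUBCARRIERS))]
    n_samples = int(SAMPLE_RATE * SYMBOL_SECONDS)
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        s = sum(sym * cmath.exp(2j * math.pi * f * t)
                for sym, f in zip(symbols, SUBCARRIERS))
        samples.append(s.real / len(SUBCARRIERS))  # normalize amplitude
    return samples

samples = modulate([0, 0, 0, 1, 1, 1, 1, 0])
print(len(samples))  # 480 samples for one 10 ms symbol
```

Because each subcarrier carries its own symbol, a narrowband noise source corrupts only one of the four data streams; real systems then recover the damaged stream with error-correcting codes.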

Stealth and Ultrasonic Communication

Beyond machine-to-machine efficiency, there are compelling human applications for this technology, particularly regarding privacy. By shifting these acoustic transmissions into ultrasonic frequencies, typically above 20 kilohertz, communication becomes "silent" to the human ear while remaining readily detectable by smartphone microphones.
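Generating such a carrier is straightforward. The sketch below assumes a 48 kHz sample rate (common on phones) and a 21 kHz carrier, both illustrative; the key constraint is that the carrier must sit below the Nyquist limit (half the sample rate) to be representable at all:

```python
import math

# Sketch: generate a short ultrasonic carrier. 21 kHz is inaudible to most
# adults but below the 24 kHz Nyquist limit of a 48 kHz microphone capture.
SAMPLE_RATE = 48000  # Hz, a common phone audio rate (assumption)
CARRIER_HZ = 21000   # above the ~20 kHz human ceiling, below Nyquist

def ultrasonic_tone(duration_s=0.05, amplitude=0.5):
    """Return raw samples of a pure ultrasonic sine tone."""
    assert CARRIER_HZ < SAMPLE_RATE / 2, "carrier must be below Nyquist"
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * CARRIER_HZ * t / SAMPLE_RATE)
            for t in range(n)]

tone = ultrasonic_tone()
print(len(tone))  # 2400 samples for a 50 ms burst
```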

This creates the possibility for encrypted, peer-to-peer messaging in public spaces without relying on cellular networks or centralized servers. Because the communication is localized and leaves no digital trail on a service provider’s server, it offers a level of metadata privacy that traditional messaging apps cannot match.

The Role of AI in Signal Processing

The true innovation comes from the integration of AI into these acoustic protocols. Unlike traditional modems that use fixed rules, an AI agent can actively "listen" to the environment and adapt its transmission in real-time. If an environment is particularly noisy, the AI can shift its data to a different frequency register or adjust the bitrate to ensure clarity. This creates a highly resilient, off-grid mesh network that turns any device with a speaker and a microphone into a sophisticated communication hub. As these multimodal models continue to advance, the "screech" of the modem may become the most secure and efficient way for both machines and humans to share secrets in plain hearing.
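The adaptive step can be sketched as a simple selection problem. The band names, noise figures, and bitrate thresholds below are all invented for illustration:

```python
# Sketch of the adaptive behavior described above: estimate noise energy in a
# few candidate bands, transmit in the quietest one, and back the bitrate off
# as the signal-to-noise ratio drops. Bands, energies, and thresholds are all
# invented for illustration.

def pick_band(noise_energy_by_band):
    """Return the candidate band with the least measured noise energy."""
    return min(noise_energy_by_band, key=noise_energy_by_band.get)

def pick_bitrate(snr_db):
    """Crude rate back-off: lower signal-to-noise ratio, lower bitrate."""
    if snr_db > 20:
        return 2000  # clean room: go fast
    if snr_db > 10:
        return 500   # some noise: slow down
    return 100       # very noisy: prioritize reliability

# Hypothetical spectrum snapshot: an air conditioner dominates the low band.
snapshot = {"2-4 kHz": 0.9, "8-10 kHz": 0.3, "18-20 kHz": 0.05}
print(pick_band(snapshot), pick_bitrate(snr_db=12))  # 18-20 kHz 500
```

A real agent would re-run this measurement continuously, which is what makes the link resilient as the acoustic environment changes.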


Episode #794: AI’s Secret Language: The Return of the Modem Screech

Daniel's Prompt
Daniel
Hi Herman and Corn. Last year there was a demo of an experimental agent-to-agent communication protocol where two AI agents spoke to each other in a high-speed, indecipherable language similar to a modem handshake. I’d love to do another episode on the mechanics of these protocols and explore whether there are practical applications for this technology between humans. For example, could two people use an app to transmit messages as encrypted audio streams for their phones to decode, allowing for discreet communication? Does this technology actually exist?
Corn
Hey everyone, welcome back to My Weird Prompts. We are diving into something today that sounds a bit like a science fiction movie from the nineteen nineties, but it is actually happening right now in the world of artificial intelligence. I am Corn, and I am joined as always by my brother, who I am pretty sure speaks fluent modem.
Herman
Herman Poppleberry here, at your service. And you are not wrong, Corn. There is something deeply nostalgic about the sounds of data being transmitted over audio. It takes me right back to the days of waiting ten minutes for a single image to load on a beige desktop computer while the phone line screamed in the background. That screeching, that rhythmic grinding of bits and bytes, it was the sound of the future arriving one packet at a time.
Corn
It is funny you say that, because today's prompt from Daniel is all about that exact sensation. He is interested in those experimental agent to agent communication protocols. Specifically, there was a demo last year where two A I agents basically bypassed human language entirely and started talking to each other in this high speed, screeching, modem like handshake. It was eerie, Herman. It sounded like two R2-D2s having a very heated argument in a blender.
Herman
I remember that demo vividly. It was a fascinating moment because it highlighted the massive inefficiency of human language for machines. We use words because our vocal cords and ears are limited to a very specific, and frankly quite narrow, range of frequencies and speeds. We are biological legacy systems. But agents? They do not have those biological constraints. They do not need to wait for a mouth to form a "th" sound.
Corn
Right, and Daniel wants us to explore the mechanics of these protocols, but then take it a step further. He is asking if there are practical applications for this between humans. Imagine two people in a public place, like a taxi or a crowded cafe, using an app to transmit encrypted audio streams that their phones decode. Discreet communication through what sounds like random noise. It is like a digital version of a secret handshake, but it is a secret conversation.
Herman
It is a brilliant thought experiment. It is essentially hiding a message in plain hearing. We have talked about similar things in the past, but the landscape has changed so much with the rise of multimodal models in twenty twenty four and twenty twenty five. If you are looking for more on our history with these kinds of topics, you can always search our archive at myweirdprompts dot com. We have over seven hundred episodes of this kind of technical deep diving.
Corn
So, let us start with the mechanics. Why would two A I agents choose to speak in "modem screech" instead of just talking like we are right now? I mean, they are built to understand us, so why go through the trouble of inventing a new noise?
Herman
The short answer is bandwidth and latency. When we talk, we are essentially using a very low bitrate protocol. We have to turn a complex thought into a linear string of phonemes, wait for the other person to hear them, turn them back into tokens, and then interpret them. For an A I agent, that is like trying to download a movie over a telegraph line. It is agonizingly slow.
Corn
So, in that demo Daniel mentioned, they were basically skipping the "text" part of the conversation?
Herman
Exactly. Voice A I systems used to work in three distinct steps. Speech to text, then the large language model processes the text, and then text to speech. But the newer multimodal models, like the ones we saw in late twenty twenty four, operate directly on audio. They see the audio as a series of features or tokens. When two of these agents connect, they realize they do not need to form words. They can just blast the raw tokens at each other using a high frequency audio handshake. It is like two people realizing they both speak telepathy and deciding to stop moving their mouths.
Corn
It is basically a shortcut. Instead of translating their thoughts into English and back again, they are just sharing the underlying data structure using sound as the medium. But why sound? If they are both on the same server, or even different servers, why not just send a digital packet over the internet?
Herman
That is the "weird" part of the demo. In that specific case, they were testing how agents interact in the physical world. If you have two robots in a room, or two phones on a table, they might not be on the same local network. They might not have a handshake protocol established. But they both have speakers and they both have microphones. Sound becomes the universal, zero configuration interface.
Corn
It is the one thing they both definitely have in common.
Herman
Precisely. And because they can process audio at much higher frequencies and speeds than we can, they can cram an entire paragraph of information into a half second burst of sound. To us, it sounds like a digital chirp or a screech. To them, it is a perfectly clear, incredibly dense packet of information. It is like a compressed zip file being shouted across the room.
Corn
That is the part that blows my mind. The idea that a half second "beep" could contain a detailed set of instructions or a complex response. It makes our conversation feel so slow by comparison. We have been talking for five minutes and we have probably shared less data than two agents could share in the blink of an eye.
Herman
It really is. It is the difference between sending a letter via carrier pigeon and sending a fiber optic signal. But what Daniel is asking about is bringing that "fiber optic" efficiency, or at least the "stealth" of it, back to the human level. He is looking for the human application of the machine screech.
Corn
Yeah, let us look at his taxi example. You and I are in a taxi in a foreign city. We want to discuss something private, maybe where we kept the spare key or a sensitive business detail, and we do not want the driver to understand us. If I have an app that turns my text into a burst of encrypted audio, and your phone "hears" it and displays the text, we have just had a private conversation in public. It sounds like a glitchy radio to the driver, but it is clear text to us.
Herman
This actually touches on a field called data over sound, or acoustic communication. It is not entirely new, but the way A I handles it now makes it much more robust. In the past, this was done using something called Audio Frequency Shift Keying, or A F S K. It is the same tech used by old modems and even some ham radio digital modes. You basically assign different pitches to different bits of data. A high beep is a one, a low beep is a zero.
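The keying scheme Herman describes, a high tone for a one and a low tone for a zero, can be sketched in a few lines of Python. The tone frequencies, baud rate, and sample rate below are illustrative assumptions rather than any real modem standard:

```python
import math

# Minimal AFSK sketch: a 1 bit becomes a high "mark" tone, a 0 bit a low
# "space" tone, following the convention described in the transcript.
# Frequencies and rates are illustrative, not a real protocol.
SAMPLE_RATE = 8000
BAUD = 100                    # 100 bits per second, one tone burst per bit
MARK_HZ, SPACE_HZ = 2200, 1200  # high tone = 1, low tone = 0

def afsk_encode(bits):
    """Turn a bit sequence into raw audio samples, one tone per bit."""
    samples_per_bit = SAMPLE_RATE // BAUD
    out = []
    for bit in bits:
        freq = MARK_HZ if bit else SPACE_HZ
        # Note: real modems keep phase continuous across bit boundaries to
        # avoid clicks; this sketch restarts the phase for each bit.
        for n in range(samples_per_bit):
            out.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return out

wave = afsk_encode([1, 0, 1, 1])
print(len(wave))  # 4 bits * 80 samples each = 320
```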
Corn
Does it actually work in a noisy environment like a taxi? I imagine the road noise, the wind, and the radio would mess with the signal. If the driver hits a pothole and the car rattles, does my message get garbled?
Herman
That is the big technical hurdle. If you are just using simple beeps, a lot can go wrong. But modern protocols use something called Orthogonal Frequency Division Multiplexing, or O F D M. It is the same tech used in Wi Fi and L T E, but adapted for sound waves. It splits the data across many different frequencies simultaneously. So if the taxi's engine hum blocks out one frequency, the data still gets through on the others. It is redundant and incredibly resilient.
Corn
So the technology does exist. I mean, I remember there were apps a few years ago that tried to do this. I think one was called Chirp?
Herman
Yes, Chirp was a big one. They were eventually bought by Sonos. There was also a company called Silver Push that used ultrasonic sounds to track users across devices, which was actually quite controversial from a privacy perspective. They would hide a "silent" chirp in a television commercial, and your phone's microphone would hear it and realize you were watching that specific ad. But the "discreet messaging" angle Daniel is talking about is slightly different because it is intentional, peer to peer, and encrypted.
Corn
Let us talk about that "discreet" part. If I am sitting in a taxi and my phone starts making loud modem noises, is that really discreet? It seems like it would draw more attention than just whispering. If I am the driver and my passengers start sounding like a dial up connection from nineteen ninety five, I am going to be very suspicious.
Herman
That is a great point. If it is audible, it is definitely not "hidden." It is just indecipherable. It is like speaking a language that only you and I know, but doing it at the top of our lungs. However, there is a way around that: ultrasonics.
Corn
Right, frequencies above twenty kilohertz that humans cannot hear, but microphone sensors can still pick up.
Herman
Exactly. Most smartphone microphones can actually pick up frequencies up to about twenty two or even twenty four kilohertz. Most adults over the age of twenty five cannot hear much above fifteen or sixteen kilohertz. Children and teenagers can hear a bit higher, but almost no one hears above twenty. So, you could have your phone "whisper" the data in a range that the taxi driver literally cannot hear, but your phone, sitting on the seat next to mine, picks it up perfectly.
Corn
Okay, that is much cooler. That is actual stealth communication. It is a silent conversation happening in the air between us. But here is my question: if we have this technology, and we have had it for a while, why aren't we using it? Why am I still typing into a chat app and hitting send over the cellular network?
Herman
Well, think about the trade offs. If you send a message over Signal or WhatsApp, it is already end to end encrypted. The only reason to use audio is if you do not have a data connection, or if you are trying to avoid leaving a digital trail on a server. Even encrypted apps leave metadata. The server knows that I sent a message to you at ten fifteen P M. But if I send that message via sound, there is no central server. There is no metadata. It is just a wave in the air that disappears as soon as it is heard.
Corn
Or if you are in a situation where the network is being monitored or shut down. We have seen that happen in various parts of the world during protests or civil unrest. If the internet is dark, but you are within earshot of someone, acoustic communication becomes a viable mesh network. You could pass a message down a line of people just by having their phones chirp to each other.
Herman
Now you are talking about a real world application. In a scenario where cellular towers are down or monitored, your phone becomes a high speed acoustic modem. It creates a localized, off grid network that is incredibly hard to stop because it does not rely on any central infrastructure. You cannot turn off the air.
Corn
It is basically a digital version of talking, but with the speed and encryption of a computer. I am curious about the "agent to agent" part of Daniel's prompt, though. How does the A I make this better than just a standard modem protocol? Why do we need the "agent" part?
Herman
The A I brings adaptability. A standard modem protocol has a fixed set of rules. It says, "I will send data at exactly this pitch and this speed." If there is too much noise, it just fails. An A I agent, especially one trained on audio features, can "hear" the environment and realize, "Okay, there is a lot of low frequency noise from this air conditioner, so I am going to shift my data transmission to a higher register and slow down the bitrate to ensure clarity."
Corn
It is like how we naturally raise our voices or enunciate more in a loud bar. The A I can do that at a mathematical level with the audio signal. It is performing real time digital signal processing based on the context of the room.
Herman
Precisely. And it can do it much faster than a human can. It can negotiate the "handshake" in milliseconds, testing different frequencies and finding the optimal path for the data before the first bit of the actual message is even sent. That is what people saw in that demo. It was not just noise; it was a highly sophisticated, real time negotiation of physics. The agents were essentially saying, "Can you hear me at nineteen kilohertz? No? How about eighteen point five? Okay, let's use that, but let's use a shorter pulse width because the echo in this room is bad."
Corn
So, if we translate this to a human app, I am not just sending a "beep." My app and your app are actually having a tiny, invisible conversation about how to talk to each other before the message even goes through.
Herman
Exactly. It would look like this: I hit send. My phone emits an ultrasonic pulse. Your phone hears it and responds with a slightly different pulse that says, "I hear you, but the signal is a bit fuzzy on the left channel, try the right." My phone adjusts and then blasts the encrypted data. To anyone else in the room, it is total silence. To our phones, it is a high speed data exchange.
Corn
I can see why this would be appealing for privacy advocates. But I have to ask about the downsides. If we start filling our environments with ultrasonic data "chirps," does that affect anything else? I am thinking about pets, for instance. My dog, Barnaby, already hates the vacuum cleaner. What does he think about a high speed data burst?
Herman
That is the number one concern. Dogs and cats have a much wider hearing range than we do. A "silent" ultrasonic message to us might sound like a fire alarm or a piercing whistle to a dog. If this technology became widespread, every cafe and city street would be a cacophony of digital screams for animals. It would be an invisible nightmare for them.
Corn
That is a pretty big barrier to mass adoption. I do not think people would be happy if their "discreet" messaging app made every dog in the neighborhood start howling. It kind of ruins the "discreet" part if the local golden retriever starts acting like the world is ending every time you get a text.
Herman
Right. So you would have to stay within the audible range but make it sound like something else. This is where it gets really interesting. There is a concept called steganography, where you hide data inside another signal. Instead of a modem screech, you could hide the data inside a piece of music or even the ambient noise of the room.
Corn
Wait, so I could play a song on my phone, and the "data" is hidden in the slight variations of the beat or the harmony? Like a secret code in the bass line?
Herman
Yes. It is called "spread spectrum" audio steganography. You spread the data so thinly across the audio spectrum that it sounds like very faint white noise, which is easily masked by music or speech. Your phone's microphone can extract that data, but a human ear just hears the song. You could be sitting in that taxi, playing a lo-fi hip hop track, and the track itself is carrying the message.
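The spread-spectrum idea Herman describes can be sketched with a toy embedder and extractor. The chip count, amplitude, and seed below are invented for illustration, and the cover signal is a stand-in for a slice of music:

```python
import random

# Toy spread-spectrum embedding: each bit is spread over a long pseudo-random
# chip sequence added to the cover audio at very low amplitude, then recovered
# by correlating against the same sequence. Parameters are illustrative.
CHIPS_PER_BIT = 1024
AMPLITUDE = 0.01  # faint enough to hide under music or speech

def chip_sequence(seed, n=CHIPS_PER_BIT):
    """Deterministic +/-1 chip sequence shared by sender and receiver."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(cover, bit, seed):
    """Add the chip sequence (sign-flipped for a 0 bit) to the cover audio."""
    chips = chip_sequence(seed)
    sign = 1.0 if bit else -1.0
    return [c + sign * AMPLITUDE * ch for c, ch in zip(cover, chips)]

def extract(signal, seed):
    """Correlate with the chip sequence; the sign of the sum is the bit."""
    chips = chip_sequence(seed)
    correlation = sum(s * ch for s, ch in zip(signal, chips))
    return 1 if correlation > 0 else 0

cover = [0.0] * CHIPS_PER_BIT  # stand-in for a slice of the cover music
stego = embed(cover, 1, seed=42)
print(extract(stego, seed=42))  # 1
```

With real music as the cover, the music's correlation with the pseudo-random chips averages toward zero over a long enough sequence, which is why the hidden bit still dominates the sum.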
Corn
That feels even more like a spy movie. I am playing a jazz track, and meanwhile, my phone is downloading a secret document from your phone. It is hiding in plain sight.
Herman
It is entirely possible. In fact, some retail stores have experimented with this to send coupons to people's phones via the background music in the store. It is a bit "Big Brother," but it proves the technology works. The music you hear in the grocery store could be telling your phone exactly which aisle you are in and what you should buy.
Corn
That is a bit terrifying, actually. But for Daniel's use case—intentional communication—it is brilliant. Let us circle back to his question about whether this technology actually exists for humans right now. If a listener wants to try this, can they? Is there an app I can download today to talk to you in modem screeches?
Herman
There are a few "proof of concept" apps out there. If you search for "acoustic modem" or "data over sound" in the app stores, you will find some basic tools. Most of them are for hobbyists and use audible tones. There is one called "Quiet" which is an open source project that does exactly this. It uses the liquid d s p library to transmit data over audio. It is very reliable, but it is definitely not "mainstream" yet.
Corn
"Quiet." That is a good name for it. How fast is it? Are we talking about sending images, or just text? Can I send you a high-res photo of my lunch via sound?
Herman
For audio, you are generally looking at very low bitrates. Maybe a few hundred bits per second if you want it to be reliable. So, text is fine. A short message might take a second or two to transmit. If you wanted to send a high resolution photo, you would be sitting there for a long time listening to your phone screech. It would be like being back in nineteen ninety two, waiting for a J P E G to render line by line.
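The transfer times Herman is gesturing at are easy to work out. The 300 bits per second figure and the file sizes below are illustrative assumptions:

```python
# Rough transfer times at the low bitrates mentioned above. The 300 bits per
# second figure and the payload sizes are illustrative assumptions.

def transfer_seconds(payload_bytes, bitrate_bps=300):
    """Seconds to send a payload at a given acoustic bitrate."""
    return payload_bytes * 8 / bitrate_bps

print(f"140-char text message: {transfer_seconds(140):.1f} s")
print(f"2 MB photo: {transfer_seconds(2_000_000) / 3600:.1f} hours")
```

At 300 bits per second, a short text message takes a few seconds, while a two megabyte photo would take roughly fifteen hours, which is why this stays a text-only medium.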
Corn
So it is not a replacement for AirDrop or Bluetooth for large files. It is more for "low bandwidth, high privacy" situations.
Herman
Right, it is not. Bluetooth and Wi Fi use radio waves, which have much higher carrier frequencies and therefore far more bandwidth. Sound occupies only a narrow slice of spectrum, so there is much less room for data. But sound has one big advantage: it stays in the room.
Corn
Ah, physical containment. That is a huge point.
Herman
Exactly. Radio waves go through walls. If I send something via Bluetooth, someone in the next room could potentially intercept it with a high gain antenna. But if I use sound, and the door is closed, that signal is physically contained. For highly sensitive, "room only" communication, acoustic protocols are actually more secure in a physical sense. It is the only digital signal that respects a "Do Not Disturb" sign on a door.
Corn
That is a perspective I had not considered. It creates a "security perimeter" that matches the physical walls of the room you are in. It is very intuitive. You know exactly who could possibly be "listening" to your data because you can see them. If there is no one else in the taxi, you know the data is private.
Herman
Right. It is the digital equivalent of a whisper. You can see everyone who is within earshot. It makes the security model very human and easy to understand. No one can "hack" your sound from three blocks away unless they have a very powerful directional microphone pointed at your window.
Corn
So, we have the "why" and the "how." Let us talk about the "what's next." Where do you see this going as A I agents become more common in our daily lives? If I have an A I assistant on my phone and you have one on yours, are they going to be "chirping" at each other all the time? Is the future just going to be a constant background hum of machines talking?
Herman
I think it is inevitable for certain types of interactions. Think about "handshakes" for sharing contact info or making a payment. Instead of fumbling with Q R codes or waiting for a Bluetooth pair, your phones could just do a quick acoustic handshake. It is nearly instantaneous and requires zero configuration. It is the ultimate "it just works" technology because every device has the hardware for it.
Corn
It feels like it brings a bit of that "magic" back to technology. Things just "knowing" about each other because they are in the same space. It is very ambient.
Herman
Exactly. And as the models get better at processing raw audio, the need for standardized protocols might even disappear. Two A I agents might just "evolve" their own way of communicating based on the specific acoustics of the room they are in. They might realize that the echoes in a particular hallway allow for a certain type of modulation that is faster than the standard. That is the truly "weird" part of the prompt. We are moving toward a world where machines have their own private languages that we cannot understand, even though we can hear them.
Corn
It reminds me of those stories about the Facebook A I agents back in twenty seventeen that started talking to each other in a shorthand that looked like gibberish to the researchers, but was actually a more efficient way of negotiating. People got really spooked by that, but it was just the A I doing what it was told: be efficient.
Herman
Yes, that was a classic example. If we do not force them to use English, they will find a better way. And audio is a very rich medium for that. It is full of texture and nuance that text lacks. An A I could encode emotion, priority, and data all into a single multi-layered chirp.
Corn
So, back to Daniel's practical application. Is it viable for humans? I would say yes, but with caveats. It is great for short, encrypted text messages in specific scenarios where you want physical security or lack a network. But it is probably not going to replace your daily messaging app. It is a tool for the paranoid, the off-grid, and the curious.
Herman
I agree. It is a niche tool, but a powerful one. It is for the person who wants that extra layer of "off grid" privacy. And honestly, there is something satisfying about the idea of my phone being a secret modem. It makes me feel like a character in a William Gibson novel, using the "low tech" of sound to bypass the "high tech" of the surveillance state.
Corn
It really does. I can imagine us sitting in a crowded cafe, phones on the table, silently exchanging encrypted notes while everyone else thinks we are just staring at our screens. We are having a full conversation and no one has any idea.
Herman
Or they just think our phones are making weird glitchy noises and we are too old to notice. They will think we are just two guys whose technology is failing them.
Corn
Also a very likely possibility. "There go those Poppleberry brothers again, their tech is acting up. Someone should tell them their phones are broken."
Herman
"Must be those old modems they love so much. They probably still have AOL accounts."
Corn
Exactly. But in all seriousness, the mechanics of this are a great reminder that "data" is just information encoded in a medium. Whether it is light, radio, or sound, it is all about how we translate it. And A I is getting much, much better at finding the most efficient translations possible. It is stripping away the human interface and getting down to the raw math of communication.
Herman
It really is. And as we see more of these "agentic" workflows, where one A I has to go out and talk to another A I to get a task done—like my A I booking a flight by talking to the airline's A I—these high speed handshakes are going to become the standard. We might not even call them "modem noises" soon. It will just be the sound of the machine economy humming along in the background.
Corn
"The sound of the machine economy." That is a bit haunting, Herman. It makes it sound like we are living inside a giant computer.
Herman
A little bit, yeah. But also fascinating. It is a new kind of nature. Instead of birds chirping to mark their territory, we have agents chirping to negotiate a contract.
Corn
Well, I think we have covered the bases here. We have looked at the A I side, the technical mechanics of audio data, the human privacy use cases, and the potential pitfalls like the "dog alarm" problem. It is a deep rabbit hole for such a simple concept.
Herman
It is a lot to chew on. I am definitely going to be looking at my phone's microphone with a bit more suspicion today, wondering what it might be "hearing" in the ultrasonic range that I am completely oblivious to. Is the room talking to my phone right now?
Corn
Probably. And hey, if any of you listening have actually used one of these "data over sound" apps, we would love to hear about your experience. Did it work? Did it drive your cat crazy? Did you feel like a secret agent in a taxi? Send us a message at show at myweirdprompts dot com.
Herman
Or you can use the contact form on our website, myweirdprompts dot com. We love hearing from the community, even if you do not send it via ultrasonic chirp. We are still fans of the old school text based email.
Corn
Especially if you do not send it via ultrasonic chirp. My ears are sensitive and I do not want to start howling like Barnaby.
Herman
Fair enough. And before we go, if you have been enjoying these deep dives into the weirder side of technology and A I, please consider leaving us a review on Apple Podcasts or Spotify. It really helps the show find new listeners who are just as curious as we are. We have been doing this for a long time, and the community is what makes it worth it.
Corn
It genuinely makes a difference. We have been doing this for seven hundred and ninety four episodes now, and it is the listener feedback and the weird prompts like Daniel's that keep us going. We never run out of things to talk about because the world never stops getting weirder.
Herman
Absolutely. We have a massive archive of episodes covering everything from A I ethics to obscure hardware, so if this piqued your interest, go explore the feed. There is a lot of gold in those older episodes.
Corn
Well, that is it for today. Thanks to Daniel for the prompt that sent us down this acoustic rabbit hole. It was a fun one. I am going to go see if I can make my phone talk to the toaster now.
Herman
Good luck with that, Corn. Until next time, I am Herman Poppleberry.
Corn
And I am Corn. You can find us on Spotify, Apple Podcasts, or wherever you get your shows. Thanks for listening to My Weird Prompts.
Herman
Goodbye everyone. Keep your ears open.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.