#1600: The Digital Tofu Crisis: Saving the World’s Scripts

Why does Bigfoot have an emoji while millions can’t type in their native script? Explore the hidden battle to save the world's writing systems.

Episode Details

Duration: 18:59
Pipeline: V5
TTS Engine: chatterbox-regular
AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Invisible Writing Systems of the Digital Age

In the modern world, digital communication is often taken for granted. However, for a significant portion of the global population, typing in a native script is an impossibility. When a computer or smartphone cannot render a specific character, it displays a small empty rectangle, colloquially known in the industry as "tofu." Today, roughly 41% of the world’s 293 known writing systems are served with a side of tofu, leaving them effectively invisible to the digital record.
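Whether a codepoint even exists in the standard can be checked programmatically. Here is a minimal sketch using Python's standard `unicodedata` module; note that this only reports whether a codepoint is assigned in the Unicode version bundled with the interpreter, and an assigned character can still render as tofu if no installed font carries a glyph for it:

```python
import unicodedata

def describe(char: str) -> str:
    """Return the codepoint and official Unicode name of a character,
    or flag it as unassigned. An *assigned* character can still show
    up as tofu if no installed font has a glyph for it."""
    cp = f"U+{ord(char):04X}"
    try:
        return f"{cp} {unicodedata.name(char)}"
    except ValueError:  # raised for unassigned or unnamed codepoints
        return f"{cp} <unassigned -- no script has claimed this codepoint>"

print(describe("A"))           # U+0041 LATIN CAPITAL LETTER A
print(describe("\uABC0"))      # a Meetei Mayek letter, encoded in Unicode 5.2
print(describe("\U000D0000"))  # an unassigned codepoint in Plane 13
```

Scripts that have not yet been encoded fall entirely on the `ValueError` side of this check: there is simply no codepoint to name.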

This "digital extinction" risk is not merely a technical glitch; it is a profound cultural challenge. Of the hundreds of scripts used throughout human history, only 172 are currently supported by the Unicode Standard. The remaining 121 scripts represent thousands of linguistic communities that are digitally disenfranchised, forced to use foreign alphabets to communicate in a world that has not yet built the bridge for their spoken word.

The Gatekeepers of Language

The responsibility for encoding these scripts falls largely on the Unicode Consortium and the Script Encoding Initiative (SEI) at UC Berkeley. Adding a new script is an arduous process that can take decades. It requires proving a script’s stability, defining the technical properties of every character—such as how they sort and wrap—and documenting how they interact with other symbols.
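A few of those per-character properties are machine-readable today. The small illustration below uses Python's stdlib `unicodedata` module to show three of the fields the standard records for every encoded character; the real encoding process involves far more than these:

```python
import unicodedata

# Three characters with very different behavior: a Latin letter,
# a right-to-left Arabic letter, and a combining accent that must
# attach to whatever base character precedes it.
samples = ["a", "\u0627", "\u0301"]  # 'a', ARABIC LETTER ALEF, COMBINING ACUTE ACCENT

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        f"category={unicodedata.category(ch)}",    # Ll = lowercase letter, Mn = nonspacing mark
        f"bidi={unicodedata.bidirectional(ch)}",   # L = left-to-right, AL = Arabic letter (right-to-left)
        f"combining={unicodedata.combining(ch)}",  # nonzero means it stacks onto a base character
    )
```

For an unencoded script, every one of these answers must be researched, proposed, and ratified before software can treat its text correctly.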

Recent updates have seen progress, with scripts like Beria Erfe from Africa and Tolong Siki from India finally receiving official codepoints. However, the pace remains slow. While new emojis like trombones and mythical creatures are added annually, ancient systems like Maya Hieroglyphs and modern minority scripts often wait in a long bureaucratic queue.

The Prioritization Debate

A central tension exists between the demand for popular emojis and the need for linguistic preservation. Emojis are backed by major tech corporations and drive frequent software updates. In contrast, scripts for minority languages often lack corporate muscle. This creates a "chicken-and-egg" problem: Unicode requires evidence of modern usage before encoding a script, but communities cannot easily use a script that isn't supported by standard keyboards or search engines.

This cycle of exclusion forces a choice between preserving ancient history and supporting living cultures. Scholars need digital support for ancient scripts to modernize libraries and museums, while living communities need their scripts encoded to perform basic modern tasks, from sending a text to buying a bus ticket.

The Future of Digital Infrastructure

Digital infrastructure is a human choice rather than a natural law. As we look toward the future, the work of linguists and technical architects remains vital. Projects like the Missing Scripts Project are bypassing official hurdles by using the "Private Use Area" of Unicode to provide interim fonts for unencoded scripts.
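A brief sketch of why the Private Use Area can serve as an interim home, again using Python's stdlib `unicodedata`: PUA codepoints are valid in any Unicode text, but the standard deliberately assigns them no name or meaning, so their interpretation is left entirely to private agreement, in this case between a project's fonts and keyboards:

```python
import unicodedata

pua = "\uE000"  # first codepoint of the BMP Private Use Area (U+E000..U+F8FF)

# Valid in any Unicode string, with its own general category...
print(f"U+{ord(pua):04X} category={unicodedata.category(pua)}")  # Co = "Other, private use"

# ...but no official name or semantics attached by the standard:
print(unicodedata.name(pua, "<no official name: meaning is set by private agreement>"))
```

The trade-off is real: PUA text only makes sense to software that shares the same private convention, which is why it is a bridge to official encoding rather than a substitute for it.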

Ultimately, the goal is to fill in the digital map. Supporting these scripts is about more than just technology; it is about ensuring that the digital world reflects the full diversity of human thought and history, rather than just the systems that are most profitable to support.


Episode #1600: The Digital Tofu Crisis: Saving the World’s Scripts

Daniel's Prompt
Daniel
Custom topic: how many discrete writing systems exist at the moment and how many of them have supported Unicode characters? Are there any that don't?
Corn
You know, Herman, I was looking through the new emoji set on my phone this morning, and I realized we have reached a very specific, and frankly bizarre, peak in human civilization. We now have a dedicated character for a trombone and something that looks suspiciously like a hairy Bigfoot, but if you want to type a message in a script used by millions of people in West Africa or the indigenous communities of North America, your phone basically just shrugs and gives you a bunch of empty boxes. We call those boxes tofu in the industry, and right now, the digital world is served with a side of tofu for nearly half of humanity's writing systems. It is like the digital world has decided that a forest monster is more essential to global communication than forty-one percent of the world’s actual writing systems.
Herman
It is a stark contrast, Corn. Herman Poppleberry here, and I have been diving into the data Daniel sent over in his prompt today regarding the digital divide in human writing systems. When you look at the numbers, it is actually quite staggering. We have roughly two hundred ninety-three discrete writing systems known to exist in human history, both ancient and modern. But as of right now, only one hundred seventy-two of them are supported by the Unicode Standard. That leaves one hundred twenty-one writing systems effectively invisible to the digital world. We are talking about a massive chunk of human literacy that simply does not exist as far as your processor is concerned. This is what experts are calling a digital extinction risk. If a script is not encoded, it is essentially being erased from the modern record. It is the twenty-first-century equivalent of burning a library, just through neglect rather than fire.
Corn
One hundred twenty-one scripts left in the dark. That feels like a lot of missing history, or at the very least, a lot of very frustrated people who just want to send a text without it looking like a glitch. But before we get into the weeds of why this is happening, I think we need to clear something up for the folks listening. People often use language and writing system interchangeably, but they are not the same thing, are they? I mean, I speak English, but I am using the Latin script. If the Latin script disappeared, a whole lot of languages would go down with it.
Herman
That is a crucial distinction that people often miss. There are over seven thousand living languages on Earth, but only about two hundred ninety-three ways we have ever come up with to write them down. One writing system, like Latin, supports hundreds of languages. Cyrillic supports dozens. Arabic supports dozens more. So when we say one hundred twenty-one writing systems are missing, we are not just talking about one hundred twenty-one languages. We are talking about thousands of potential linguistic communities that are digitally disenfranchised because the bridge between their spoken word and the digital screen has not been built yet. If you speak a language that uses the Latin alphabet, you are fine. But if your language uses a unique script that has not been invited to the Unicode party, you are essentially forced to use someone else's alphabet to communicate online.
Corn
It is like having a road but no car that can drive on it. And the gatekeeper for all of this is the Unicode Consortium. I always pictured them as this secret council of robed linguists meeting in a basement in Switzerland, but it is actually a non-profit based in California, led by CEO Toral Cowieson. They are the ones who decide which characters get a codepoint and which ones have to wait. Today’s prompt from Daniel really highlights that this is not just a technical issue; it is a prioritization issue. Why is it so hard to get a script added? Is it just a matter of drawing the letters and hitting save?
Herman
I wish it were that simple. Adding a script is an incredibly rigorous technical and bureaucratic process. It usually starts with the Script Encoding Initiative, or SEI, based at the University of California Berkeley. Anushah Hossain recently took over as director there, succeeding the founder Deborah Anderson, and their job is basically the linguistic detective work. They have to prove that a script is stable, that it has a community of users, and they have to define every single technical property of every character. We are talking about how they sort, how they wrap at the end of a line, and how they interact with other characters. In the latest Unicode seventeen point zero update from last September, they managed to add four new scripts: Beria Erfe from Africa, Tolong Siki from India, Tai Yo from Vietnam, and Sidetic from Turkey. But each of those represented years, sometimes decades, of research and negotiation.
Corn
Years for four scripts? At that rate, my great-great-grandchildren might finally be able to type in Maya Hieroglyphs while they are vacationing on Mars. It feels like there is a massive bottleneck here. And I love that you mentioned stability and adoption. That sounds like a classic catch-twenty-two. The Unicode Technical Committee says you need to show the script is being used before they encode it, but how are people supposed to use it if they cannot type it on a phone or a computer? It is like telling someone they cannot have a driver's license until they prove they have already driven five hundred miles on the highway.
Herman
The chicken-and-egg problem is a major point of contention in the industry. Critics of the current process argue that by requiring evidence of modern usage for new scripts, or neographies, Unicode is inadvertently stifling the growth of minority cultures. Take the Oduduwa script in Nigeria or Masaba in Mali. These are modern scripts designed specifically to give their communities a unique cultural identity. But without a Unicode codepoint, you cannot build a keyboard for them on a standard smartphone. You cannot search for them on Google. You cannot post in them on social media. If you try, it just looks like broken code. So the community struggles to adopt it, and then Unicode says, see, nobody is using it. It is a cycle of digital exclusion.
Corn
It is digital extinction by committee. It is funny, in a dark way, that we spent so much time in episode seven hundred seventy-five talking about the BiDi Battle and the chaos of mixing right-to-left and left-to-right text. That felt like a massive technical hurdle, but at least those scripts were in the building. For these missing one hundred twenty-one scripts, they are not even at the front door. They are still out in the parking lot trying to find the entrance. And while they are waiting, we are getting emojis for Bigfoot. I checked the news, and just a few days ago on March twenty-fourth, Apple released iOS twenty-six point four with those new emojis. The trombone, which was actually designed by students at the University of Maryland, the ballet dancer, the bigfoot. It feels like the technical debt of the one hundred fifty-nine thousand eight hundred one characters already in the standard is so high that we are just adding fluff instead of fixing the foundation.
Herman
The character count is actually a great way to look at the scale of the debt. Over one hundred thousand of those characters are CJK ideographs, which are Chinese, Japanese, and Korean characters. The complexity of managing that is enormous. But the emoji versus script debate is a bit of a lightning rod. Emojis are popular, they drive software updates, and they are backed by big tech companies like Apple and Google who want to keep their users engaged. Scripts for minority languages often lack that corporate muscle. However, it is not all bad news. The Missing Scripts Project is a massive collaboration between UC Berkeley, the National Association of Type Research in Nancy, and the Institute for Design in Mainz. They are working to create fonts for all of these unencoded scripts right now, even before they get official Unicode status. They use what is called the Private Use Area of Unicode to give these communities a way to communicate digitally in the interim.
Corn
So they are basically squatting in the unused rooms of the Unicode building until they get a formal lease? I like that. It is a very hacker-spirit approach to linguistics. But what about the ancient stuff? Daniel’s prompt mentioned that the Script Encoding Initiative just published a big strategy update in February called Scripts, Strategies, and Open Questions. They are talking about breakthroughs in Linear Elamite and Proto-Sinaitic. Is there a conflict between encoding ancient scripts that only a handful of scholars will ever use and modern scripts that could save a living culture?
Herman
That is the core of the prioritization debate within the Unicode Technical Committee. On one hand, you have scholars who need ancient scripts for digital preservation and research. If you want to digitize the world's museums and libraries, you need those codepoints. On the other hand, you have living communities whose languages are at risk of vanishing. We talked about this in episode thirteen hundred seventy-three, how a language dies every two weeks. When a language loses its digital footprint, that process accelerates. If you cannot use your native script to buy a bus ticket or send a WhatsApp message, you eventually stop using it. The SEI tries to balance both, but the resource constraints are real. It takes an expert to map out Maya Hieroglyphs, which are still a massive, unencoded challenge because they are so complex and visual. They are more like a puzzle than a linear alphabet.
Corn
Maya Hieroglyphs sound like the final boss of Unicode. I can only imagine the rendering engine trying to figure out how to stack those glyphs properly. It is not just a line of text; it is a three-dimensional grid of meaning. But it brings up an interesting point about what we value as a global society. If we can find the time to encode a trombone, we should probably find the time to encode the writing system of the people who built Chichen Itza. Or the Tulu-Tigalari script from Southern India, which has this beautiful, intricate structure. By the way, I practiced that pronunciation. Too-loo tee-gah-LAH-ree. Did I get it?
Herman
You nailed it. And that script is actually a perfect example of the technical complexity we are talking about. It is not just about the shapes; it is about the ligatures and the way characters change shape depending on what is next to them. It is what we call complex script rendering. If the software does not know the rules, the text is unreadable. This is where the work of people like Anushah Hossain becomes so vital. They are not just linguists; they are essentially technical architects for human thought. They have to write the rules for how these characters behave in a digital environment. If you get it wrong, the script is useless.
Corn
It makes me wonder what the digital landscape will look like in another ten years. We have the Unicode eighteen point zero alpha review closing in just a few days, on March thirty-first. From what I have seen, they are looking at adding the Seal script, which is that ancient Chinese script you see on red ink stamps. And they are expanding Egyptian Hieroglyphs even further. It feels like we are slowly filling in the map, but forty-one percent is still a huge blind spot. If you are a developer or a designer listening to this, what is the actual takeaway? Besides feeling slightly guilty about using the Bigfoot emoji?
Herman
The takeaway is that digital infrastructure is a choice. We often treat things like Unicode as if they are just part of the natural world, like gravity or the weather. But every codepoint is a result of human effort and political will. For developers, supporting the Missing Scripts Project is a great way to start. That means building software that is flexible enough to handle non-standard fonts and being aware of the limitations of the current systems. It also means realizing that the paperless dream we discussed in episode eleven hundred fifty-five is still out of reach for a lot of the world because they literally do not have the characters to write on a digital screen. If you are a font designer, there is a massive need for open-source fonts for these unencoded scripts.
Corn
It is the analog-digital paradox again. We have all this technology, but we are leaving the most fundamental human technology—writing—behind for nearly half the world's systems. It is wild to think that in twenty twenty-six, we are still arguing over whether a script used by thousands of people is stable enough to be allowed on a keyboard. I think the most important thing for people to realize is that Unicode support is not just a software update. It is a form of digital preservation. Every time a script gets encoded, a library is saved from the fire. It gives a culture a seat at the table of the modern world.
Herman
And every time we delay, we risk a form of digital extinction. The fact that we have one hundred fifty-nine thousand eight hundred one characters is impressive, but it is the one hundred twenty-one missing scripts that define the work left to do. Even something like the Gurung Khema script in Nepal or the Oduduwa script we mentioned earlier. These are living, breathing parts of human identity. When we encode them, we are not just adding data; we are acknowledging that those people exist in the digital age. Without that acknowledgment, they are effectively erased from the global conversation.
Corn
I also find it fascinating that we are seeing these breakthroughs in ancient scripts like Linear Elamite. It is like we are finally building the bridge back to the dawn of civilization at the same time we are trying to build the bridge to the future for modern minority scripts. It is a two-front war for human literacy. And Herman, I have to ask, since you are the one who reads all the papers. Does the rise of AI change this? I mean, we are sitting here as a human-AI collaboration. Can AI help bridge this gap, or does it just make it worse by focusing on the languages with the most data?
Herman
That is the double-edged sword. AI models, like the ones that power our scripts, are trained on existing digital data. If a script is not encoded, it does not exist in the training set. This creates what people call the Tokenization Tax, which we dug into in episode six hundred sixty-six. If your script is not in Unicode, you are effectively locked out of the AI revolution. You cannot train a large language model on your native script if the computer cannot even read the characters. However, on the flip side, vision-based AI is getting much better at recognizing and translating unencoded scripts from images. So AI might actually be the tool that helps scholars decipher things like Maya Hieroglyphs faster than ever before, providing the data needed to get them encoded. It can help us map the rules of these scripts so the Unicode Consortium can approve them faster.
Corn
So AI might be the scout that goes out and finds the missing scripts, but we still need the Unicode Consortium to build the actual road. It is a collaborative effort, much like this show. I am glad Daniel sent this prompt in. It is one of those topics that you do not think about until you realize your phone is basically a linguistic filter that only lets certain cultures through. It makes me look at that trombone emoji a little differently. It is a symbol of our technical capability, but also a reminder of our priorities. We can make a trombone work perfectly across every device on Earth, but we cannot do the same for the writing system of a sovereign nation?
Herman
It really is a question of priorities. And as we look toward Unicode eighteen point zero and beyond, I hope we see a shift toward more of these modern minority scripts. The public review closing on March thirty-first is a big milestone. It is where the community can actually weigh in on these proposals. If you are a linguist or a technical expert, that is your chance to have a say in how the world writes. We need more voices advocating for the scripts that do not have a multi-billion dollar tech company behind them.
Corn
Well, I for one am going to spend the rest of the day trying to figure out how to draw a Maya Hieroglyph for sloth. I am sure it is very regal and involves a lot of sleeping. But seriously, this has been a fascinating deep dive. It is easy to take for granted the fact that we can type these words and they show up on your screen, but for a huge portion of the world, that is still a dream. We are talking about forty-one percent of the world's writing systems being left in the dark. That is not just a technical gap; it is a cultural chasm.
Herman
A dream that is slowly becoming a reality, one script at a time. It is a long tail, as we discussed in episode thirteen hundred seventy-three, but every script added is a victory for human diversity. The work being done by the Missing Scripts Project and the SEI is some of the most important work in the digital age, even if it does not get the same headlines as a new iPhone release.
Corn
Or rather, as you would say, that is the reality of the situation. I am learning to avoid those empty affirmations, Herman. I am building on your points. See? I am evolving.
Herman
The evolution of the sloth. I love it. And I think that is a perfect place to wrap this up. We have covered the numbers, the technical hurdles, and the ethical crossroads of digital literacy.
Corn
We certainly have. Thanks to Daniel for the prompt that got us thinking about the invisible walls of the internet. And a huge thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes.
Herman
And of course, a big thank you to Modal for providing the GPU credits that power the AI side of this collaboration. This has been My Weird Prompts. If you enjoyed this exploration of the digital divide, please consider leaving us a review on your favorite podcast app. It really does help other curious minds find the show.
Corn
You can also find our full archive and search for those related episodes at myweirdprompts dot com. We have got over fifteen hundred episodes now, so if you have a niche interest, we have probably talked about it at some point.
Herman
Until next time, I am Herman Poppleberry.
Corn
And I am Corn. Keep asking the weird questions, everyone.
Herman
Goodbye.
Corn
See ya.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.