You're deep in a Kdenlive edit. Forty-five minutes of timeline surgery. And the cursor freezes. Not the app. The whole desktop. That moment, the next thirty seconds, that's where people lose hours or keep their work. Herman, set this up.
The gap between "system crashed" and "I'm back to work" is where most people hemorrhage productivity. And with more professionals running Linux as their daily driver for creative work, video editing, development, and data science, the old "just reboot" mentality doesn't cut it anymore. The prompt today gets into crash prevention and recovery on Linux desktop workstations, and the listener's coming from a real place. He's on Ubuntu, sixty-four gigs of RAM, i seven desktop, Kdenlive as the canary in the coal mine. He's asking three things. What prevention harnesses exist for memory, CPU, and GPU crashes? How do you calibrate them without false positives? And once everything's frozen, what's the recovery key combo, and can you use Claude Code to get the graphical session back?
It's not a server hardening episode. It's keeping a graphical workstation alive under load. And the insight about Claude Code being part of the recovery posture changes the calculus. If you can get a terminal and networking up, you can recover from just about anything.
That's the thing. Linux desktop crashes rarely come from a single catastrophic failure. A runaway Python script eats RAM. The OOM killer panics and kills Xorg or your compositor. Now you're staring at a TTY with no way to save your work, and the thing that actually caused the problem is still running.
The cascade is the actual enemy. Not the initial crash.
There are three distinct failure domains here, and each needs a different prevention strategy. Memory exhaustion, that's the OOM killer scenario. CPU lockup, runaway processes, kernel soft lockups. And GPU slash VRAM overcommit, which is driver hangs, Mesa crashes, display server freezes. The listener mentioned all three, and he's right that GPU has its own failure modes. They're fundamentally different from memory and CPU problems.
Let's start with the one that bites most people. The OOM killer. Why does a single app crash sometimes take down the entire desktop session?
Because the default OOM killer behavior on Linux is to target the largest memory consumer. And on a desktop workstation, the largest memory consumer is often your compositor, your desktop environment, or the X server itself. So a Python script leaks memory for thirty seconds, and the kernel's response is to shoot the compositor in the head. Everything else that was fine is now dead because the thing managing your display just got killed.
The OOM killer is basically a bouncer who ejects the biggest person in the room rather than the one starting the fight.
And here's the deeper problem. On Ubuntu twenty-five ten, there are two main alternatives to the kernel's default OOM killer. Systemd hyphen oomd, which is the default now on recent Ubuntu versions, and earlyoom, which you install separately. They work in fundamentally different ways. Systemd hyphen oomd uses cgroup v two pressure stall information, PSI, to preemptively kill processes before memory is completely exhausted. It's reading how long tasks are stalled waiting for memory and making decisions based on that. Earlyoom uses a simpler percentage based threshold. When available memory drops below a certain percentage and swap is filling up, it acts.
Which one actually works better for a desktop workstation?
For a workstation, I lean toward earlyoom, and I'll tell you why. Systemd hyphen oomd uses PSI, which is more adaptive and sophisticated, but it's harder to tune and harder to predict what it's going to do when you're in the middle of a render. Earlyoom is simpler. You set a threshold, you set prefer and avoid lists, and you know exactly what it's going to kill. The listener mentioned not wanting to be too aggressive. Earlyoom gives you that control. Systemd hyphen oomd can feel like a black box.
Walk through what an earlyoom configuration actually looks like for someone editing video in Kdenlive.
The key flags are prefer and avoid, and the names are easy to get backwards. Prefer tells earlyoom these are the processes to kill first. Avoid tells it to leave these alone unless nothing else is left to kill. For a desktop workstation, you want something like this. Earlyoom dash dash avoid, and then a regex pattern that matches Xorg, gnome shell, kwin underscore x eleven, or plasma shell. These are your display server and compositor. You want them alive at all costs. The same avoid pattern can include Firefox, Chrome, Kdenlive. These are your work applications. You'd rather kill the memory leaking Python script than your browser with thirty tabs. Then dash dash prefer with a pattern matching the disposable jobs, the Python interpreter running that script, for example. Then dash m ten and dash s ten.
What do those numbers mean?
Earlyoom's two threshold flags are dash m and dash s. Dash m ten means earlyoom starts killing when available memory drops below ten percent, and dash s ten adds the condition that free swap is also below ten percent. Both have to be true. Those values match earlyoom's documented defaults, so the real tuning question is whether to raise them. Earlyoom doesn't read PSI. It only looks at available memory and free swap. If it waits too long, the kernel OOM killer beats it to the punch, and ten percent gives you a buffer.
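To make the prefer and avoid semantics concrete, here is a minimal Python sketch that checks which process names a pair of patterns would spare or target, and builds an argument string from them. The patterns and the process names are illustrative assumptions, not a recommended configuration. The flags used (`-m`, `-s`, `--avoid`, `--prefer`) are the ones earlyoom documents, and on Debian and Ubuntu packages the arguments are usually placed in `EARLYOOM_ARGS` in `/etc/default/earlyoom`.

```python
import re

# Hypothetical patterns for a GNOME or KDE workstation; adjust to your own process names.
# --avoid: processes earlyoom should spare. --prefer: processes it should kill first.
AVOID = r"^(Xorg|Xwayland|gnome-shell|kwin_x11|kwin_wayland|plasmashell|firefox|chrome|kdenlive)$"
PREFER = r"^(python3?|stress)$"


def classify(comm: str) -> str:
    """Return how a process name would be treated under the patterns above."""
    if re.search(AVOID, comm):
        return "avoid"
    if re.search(PREFER, comm):
        return "prefer"
    return "neutral"


def earlyoom_args(mem_pct: int = 10, swap_pct: int = 10) -> str:
    """Build an argument string, e.g. for EARLYOOM_ARGS in /etc/default/earlyoom."""
    return f"-m {mem_pct} -s {swap_pct} --avoid '{AVOID}' --prefer '{PREFER}'"


if __name__ == "__main__":
    for name in ["gnome-shell", "kdenlive", "python3", "bash"]:
        print(f"{name}: {classify(name)}")
    print(earlyoom_args())
```

Running the patterns against the names you actually see in `ps -e -o comm` before restarting the service is a cheap way to catch a regex that matches nothing.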
This is where the calibration question comes in. The listener asked specifically, how do you know if your thresholds are too aggressive before you crash?
You monitor slash proc slash pressure slash memory for a week under your normal workload. That file has two lines, some and full, and each line carries three averages. Avg ten, avg sixty, and avg three hundred. They represent the percentage of time tasks were stalled waiting for memory over the last ten seconds, sixty seconds, and five minutes. Earlyoom doesn't read this file itself, but it tells you how close to the edge your normal workload runs. If your avg ten regularly exceeds ten percent during normal work, your earlyoom threshold is too aggressive and you need to back it off. If it never goes above three percent even under heavy load, you can probably tighten the threshold further.
You're building a baseline of what normal pressure looks like for your specific workflow.
There's no one size fits all setting. The listener has sixty four gigs of RAM and says he rarely goes above thirty two. That means his normal PSI values are probably very low. He could set threshold fifteen or even twenty and still catch the runaway process before it exhausts all memory. But someone with sixteen gigs running multiple VMs might be cruising at forty percent PSI regularly. Their threshold needs to be lower, maybe five or six.
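The baseline described above can be collected with a few lines of Python. This is a sketch that assumes a kernel with PSI enabled, so that `/proc/pressure/memory` exists. The sampling interval and sample count are arbitrary choices for illustration.

```python
import time
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {"avg10": ...}, "full": {...}}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {key: float(value) for key, value in (f.split("=") for f in fields)}
    return result


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict[str, dict[str, float]]:
    """Read and parse the kernel's memory pressure file."""
    return parse_psi(Path(path).read_text())


if __name__ == "__main__":
    peak = 0.0
    for _ in range(6):  # six samples, ten seconds apart; extend for a real baseline
        some = read_memory_pressure()["some"]
        peak = max(peak, some["avg10"])
        print(f"avg10={some['avg10']:.2f} avg60={some['avg60']:.2f} peak avg10={peak:.2f}")
        time.sleep(10)
```

Logging the peak over a full working week, rather than a minute, is what gives the number any meaning for threshold decisions.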
There's a real world case that illustrates why this matters.
Imagine a machine learning training script that leaks memory. It's allocating tensors and not freeing them. Without earlyoom, the kernel OOM killer wakes up, sees Firefox using twelve gigs with thirty tabs and an unsaved document, and kills it. The Python script survives, or it gets killed and the training loop simply restarts it and the leak starts over. With earlyoom configured with avoid for Firefox and the compositor, and prefer for the Python process, it kills the Python process at ninety percent RAM usage. Firefox stays alive. The desktop stays up. You lose the training run, not your whole session.
That's the difference between losing five minutes and losing an hour. So we've covered what happens when memory runs out. But what about when the CPU itself locks up? That's a different beast entirely.
A CPU lockup doesn't mean your processor is using a hundred percent. It means a task is stuck in what's called the D state, uninterruptible sleep. This usually happens when a process is waiting on I O that never completes. A failing hard drive, a network filesystem that's gone away, a kernel bug in a driver. The kernel has built in detection for this. Hung task timeout secs, default a hundred and twenty seconds, and soft lockup detection. If a task is stuck in D state for longer than that timeout, the kernel logs a warning and you see those hung task messages in dmesg.
Two minutes is a long time to wait when your desktop is frozen.
It is, and you can lower it, but you have to be careful. Some legitimate operations, like writing a very large file to a slow USB drive, can put tasks in D state for tens of seconds. If you set hung task timeout secs too low, you'll get false positives. I wouldn't go below thirty seconds for a desktop.
There's a more practical prevention step for CPU related lockups that isn't about kernel parameters at all. It's about how the kernel handles dirty pages.
This is where VM dot dirty ratio and VM dot dirty background ratio come in. When a process writes data, it doesn't go straight to disk. It goes to the page cache, those are dirty pages, and then a background process flushes them to disk. The default dirty ratio is usually twenty percent of RAM. On a machine with sixty four gigs, that's almost thirteen gigs of data that can be sitting in memory waiting to be written. When the kernel decides to flush that, it can cause massive I O stalls that look and feel exactly like a system freeze. Your mouse stops moving, your keyboard doesn't respond, and you think the system is dead.
You're sitting there with a frozen cursor and it's not actually crashed. It's just flushing thirteen gigs to disk.
Lowering those ratios prevents that. Set VM dot dirty ratio to ten and VM dot dirty background ratio to five. That means the kernel starts flushing when five percent of RAM is dirty, and it blocks writes when ten percent is reached. On sixty four gigs, that's about three gigs of dirty pages before background flushing starts and about six and a half before writes are throttled, versus thirteen with the default. Much less disruptive. This is especially important for video editing where Kdenlive is writing large render files.
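The arithmetic behind those figures is simple enough to check in Python. This sketch treats the ratios as a percentage of total RAM, which is an approximation: the kernel applies them to available memory, so the real limits are somewhat lower. The default values shown, ten for the background ratio and twenty for the dirty ratio, are the usual kernel defaults.

```python
def dirty_limits_gib(ram_gib: float, background_ratio: int, dirty_ratio: int) -> tuple[float, float]:
    """Approximate dirty-page limits in GiB.

    Returns (point where background flushing starts, point where writers are throttled).
    """
    return ram_gib * background_ratio / 100, ram_gib * dirty_ratio / 100


if __name__ == "__main__":
    ram = 64
    for label, bg, hard in [("default", 10, 20), ("tuned", 5, 10)]:
        start, throttle = dirty_limits_gib(ram, bg, hard)
        print(f"{label}: flush starts near {start:.1f} GiB, writes throttled near {throttle:.1f} GiB")
```

The tuned values would go in a file under `/etc/sysctl.d/` as `vm.dirty_background_ratio = 5` and `vm.dirty_ratio = 10`.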
Another practical CPU guard is using chrt to set real time priorities for audio and video apps.
The kernel's default scheduler, the Completely Fair Scheduler on older kernels and EEVDF since kernel six point six, tries to give every process a fair share of CPU time. But for real time workloads like audio processing or video rendering, fairness isn't what you want. You want that process to get the CPU when it needs it, without being preempted. Chrt lets you set a real time scheduling policy for a specific process. You can launch Kdenlive with chrt dash dash fifo and a priority, or use it on an already running process. This prevents scheduling starvation where your video editor is ready to render a frame but the CPU is busy with a background update check.
That covers system RAM and CPU. But there's another memory domain that's even more fragile on Linux. Your GPU's VRAM. Let's talk about why GPU crashes feel different.
GPU crashes on Linux are fundamentally different from CPU and memory crashes. They often manifest as a driver hang. You'll see GPU HANG in dmesg, or the entire display server will freeze while the rest of the system is technically still running. The kernel's GPU scheduler can recover from some hangs via GPU reset, but here's the problem. A GPU reset often kills all OpenGL and Vulkan contexts. Every application using the GPU loses its state. Your compositor dies, your browser windows go black, your video editor's preview window goes blank. It's not a system crash, but it might as well be from the user's perspective.
VRAM doesn't have an OOM killer equivalent.
That's the critical difference. System RAM has the OOM killer. VRAM has nothing. When a process allocates more VRAM than is available, one of two things happens. The GPU driver either swaps to system RAM, which causes massive performance degradation and can look like a freeze, or it crashes outright. There's no graceful handling. The driver just gives up. For AMD GPUs, you can set the amdgpu dot lockup timeout kernel parameter. The default is two thousand milliseconds. Two seconds before the driver declares a hang. You can lower that to a thousand or even five hundred milliseconds to catch hangs before they cascade into a full desktop freeze.
You're telling the driver to give up faster rather than waiting and hoping.
Which sounds counterintuitive, but a faster hang detection means a faster GPU reset, which means you get your display back sooner. The alternative is the driver waiting two seconds while the compositor is frozen, and by then other things have started timing out and the cascade is underway.
For Nvidia users, the situation is different because of the proprietary driver.
Nvidia's driver has its own hang detection and recovery mechanisms, but they're less transparent than the open source AMD and Intel drivers. You're mostly relying on nvidia smi to monitor VRAM usage in real time and hoping the driver handles things gracefully. For Intel integrated graphics, the i nine fifteen dot enable hangcheck equals one parameter enables GPU hang detection. It's usually on by default on recent kernels.
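For AMD cards, the VRAM numbers can also be read straight from sysfs without a monitoring tool. The sketch below assumes the amdgpu driver and that the GPU is `card0`, which is not guaranteed on multi-GPU machines. Nvidia and Intel expose this information differently, so this only covers the AMD case.

```python
from pathlib import Path


def vram_percent(used_bytes: int, total_bytes: int) -> float:
    """Return VRAM usage as a percentage of the total."""
    return 100.0 * used_bytes / total_bytes


def read_amdgpu_vram(card: str = "card0") -> tuple[int, int]:
    """Read (used, total) VRAM in bytes from the amdgpu sysfs files."""
    device = Path("/sys/class/drm") / card / "device"
    used = int((device / "mem_info_vram_used").read_text())
    total = int((device / "mem_info_vram_total").read_text())
    return used, total


if __name__ == "__main__":
    try:
        used, total = read_amdgpu_vram()
    except FileNotFoundError:
        print("No amdgpu VRAM counters found for card0")
    else:
        print(f"VRAM: {used / 2**20:.0f} MiB of {total / 2**20:.0f} MiB ({vram_percent(used, total):.1f}%)")
```

Sampling this once a minute during a normal editing session gives the VRAM baseline that makes a leak visible.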
There are environment variables that can prevent GPU crashes before they happen.
For certain problematic applications, yes. MESA underscore SHADER underscore CACHE underscore DISABLE equals true, which older Mesa releases spell MESA underscore GLSL underscore CACHE underscore DISABLE, disables the shader cache. This can prevent shader compilation storms that exhaust VRAM during application startup. Vblank underscore mode equals zero disables VSync, which can prevent some timing related GPU hangs. These are band aids, not solutions, but if you have a specific application that reliably triggers GPU hangs, they're worth trying.
We've covered prevention. Memory, CPU, GPU. But let's be realistic. Stuff still crashes. When it does, there's a secret weapon on your keyboard that most Linux users have never touched.
The magic SysRq key. Most people who know about it know the REISUB sequence. Raising Elephants Is So Utterly Boring. UnRaw, tErminate, kIll, Sync, Unmount, reBoot. That's the nuclear option. It reboots your system cleanly when everything is frozen. But the listener is asking about something different. He wants to recover without rebooting. And for that, the key is Alt plus SysRq plus K.
Secure Attention Key.
Originally designed for secure attention sequences on multi user systems. The idea was that you press a key combination that the kernel intercepts directly, and it kills all programs on the current virtual terminal and gives you a fresh login prompt. It was meant to prevent spoofed login screens from stealing passwords. But for desktop recovery, it's a happy accident. Alt plus SysRq plus K kills everything on your current VT. Your graphical session, your compositor, Xorg or Wayland, all of it. And it drops you at a TTY login prompt. The kernel is still running. Your network stack is still up. Your SSH sessions are still alive. You just lost your graphical session.
Which is exactly the state the listener wants to be in. Terminal, networking, Claude Code available.
That's the recovery posture he described. If I can get a terminal and keep networking up, I can recover from just about anything. And SysRq plus K gives him exactly that.
Is it enabled by default on Ubuntu?
This is where it gets interesting. Kernel dot sysrq is a bitmask, and you can read the current value with sysctl kernel dot sysrq. Ubuntu has traditionally shipped one seventy six, which enables only sync, remount read only, and reboot. Debian's default, four thirty eight, adds the keyboard control bit that covers K, the SAK key, but still leaves out F, which is the manual OOM kill, and T, which is the task dump. So check before you rely on it. If the value on your machine doesn't include bit four, Alt plus SysRq plus K will do nothing until you enable it.
If he wants the full suite of SysRq functions, including manual OOM kill, he needs to change that value.
To enable everything, you set kernel dot sysrq equals one. You can do that temporarily, as root, by echoing one to slash proc slash sys slash kernel slash sysrq, or permanently by creating a file in slash etc slash sysctl dot d, say ninety nine hyphen sysrq dot conf, with kernel dot sysrq equals one. For the listener's use case, the only bit that matters is four, keyboard control, which covers K. Confirm it's set before you need it.
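Since the sysrq value is a bitmask, a short decoder makes it easier to see what a given number allows. The bit meanings below follow the kernel's sysrq documentation. The short labels are paraphrased for illustration.

```python
from pathlib import Path

# Bit values as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "console log level",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps",
    16: "sync",
    32: "remount read-only",
    64: "process signalling (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of RT tasks",
}


def decode_sysrq(mask: int) -> list[str]:
    """List the function groups a kernel.sysrq value enables."""
    if mask == 0:
        return []
    if mask == 1:  # 1 is a special value meaning "everything"
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if mask & bit]


def allows_sak(mask: int) -> bool:
    """True if Alt+SysRq+K is permitted from the keyboard."""
    return mask == 1 or bool(mask & 4)


if __name__ == "__main__":
    current = int(Path("/proc/sys/kernel/sysrq").read_text())
    print(f"kernel.sysrq = {current}")
    for name in decode_sysrq(current):
        print(f"  enabled: {name}")
    print(f"Alt+SysRq+K allowed: {allows_sak(current)}")
```

Adding 4 to an existing mask, rather than setting the value to 1, is the narrower change if K is the only function needed.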
Walk through the actual recovery sequence. Kdenlive freezes the desktop. What happens next?
Step one, press Alt plus SysRq plus K. On most keyboards, SysRq is the same key as Print Screen. So you're holding Alt and Print Screen and pressing K. The screen goes black and you're looking at a TTY one login prompt. Step two, log in. Step three, the first command you run is sudo systemctl restart gdm three, or whatever your display manager is. For Ubuntu with GNOME, it's gdm three. For KDE, it's sddm. That restarts the graphical login manager, and you're back at your login screen. All your services are still running. Your SSH sessions are intact. You log back in and you're back to work.
If restarting the display manager doesn't work?
Then you isolate to multi user target and back. Sudo systemctl isolate multi hyphen user dot target, then sudo systemctl isolate graphical dot target. That's a more thorough restart of the entire graphical stack. It's still not a reboot. Your network services, your databases, your Docker containers, they all stay up.
This is where Claude Code enters the picture.
This is the listener's insight that I think is genuinely novel. Once you're at that TTY, you have a terminal and networking. You can run Claude Code. You can say, my graphical session crashed, here's the output of journalctl dash xe from the last two minutes, what happened and how do I fix it? Claude Code can read the logs, identify that Kdenlive triggered a GPU hang because of a shader compilation storm, and tell you to set that MESA environment variable before restarting the display manager. Or it can see that the OOM killer fired and killed the wrong process, and recommend adjusting your earlyoom configuration. The key is that you're not debugging from memory. You're feeding Claude Code the actual system state.
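Before handing logs to any assistant, a rough first pass can already sort a crash into one of the three failure domains. This is a sketch only: the substrings are illustrative assumptions about common kernel log wording, not an exhaustive or authoritative list, and the `journalctl` call assumes a systemd journal is available.

```python
import subprocess

# Illustrative substrings only; real log wording varies by kernel and driver version.
SIGNATURES = {
    "oom": ("out of memory", "oom-kill"),
    "gpu": ("gpu hang", "gpu reset"),
    "hung_task": ("blocked for more than",),
}


def triage(lines: list[str]) -> dict[str, int]:
    """Count log lines that match each failure domain's signatures."""
    counts = {domain: 0 for domain in SIGNATURES}
    for line in lines:
        lowered = line.lower()
        for domain, needles in SIGNATURES.items():
            if any(needle in lowered for needle in needles):
                counts[domain] += 1
    return counts


def recent_journal(minutes: int = 2) -> list[str]:
    """Return journal lines from the current boot for the last few minutes."""
    result = subprocess.run(
        ["journalctl", "-b", "--no-pager", f"--since={minutes}min ago"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for domain, count in triage(recent_journal()).items():
        print(f"{domain}: {count} matching lines")
```

The counts are a hint about where to look first, and the full log excerpt is still what you would paste into the diagnosis step.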
The recovery workflow is SysRq plus K, TTY login, Claude Code diagnosis, then restart the display manager with whatever fix Claude recommends.
That changes the whole calculus of crash prevention. If you can reliably recover from a full desktop crash in under sixty seconds, do you still need earlyoom? The answer is yes, because prevention is still better than recovery. But the stakes are lower. A crash isn't a disaster. It's an inconvenience. And that psychological shift matters. You work differently when you're not afraid of losing everything.
The listener also asked about mapping SysRq plus K to a physical button. A literal recover system button.
You can do this, but not with a udev rule on its own. Udev rules fire when devices are added or removed, not on individual key presses. What you need is a small root daemon that reads the keyboard's input device and reacts to one keycode. First, find the keycode of the button you want to use. Evtest will show you keycodes when you press buttons. The Scroll Lock key is keycode seventy in the kernel's input event codes, and almost nobody uses Scroll Lock anymore. Once you have the keycode, the daemon watches the device under slash dev slash input for that key and, when it sees a press, writes k to slash proc slash sysrq hyphen trigger as root. That requests the same action as Alt plus SysRq plus K. Triggerhappy is an existing daemon built for this kind of key to command binding.
You press Scroll Lock and your graphical session dies and you're at a TTY.
Which sounds terrifying, but that's the point. It's a last resort button. Everything else is frozen, you press the button, and you get a terminal. You could also use a dedicated macro key if your keyboard has one, or even a USB foot pedal if you want to get creative. The approach is the same regardless of the input device, as long as it shows up under slash dev slash input.
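A recovery button of this kind can be sketched as a small Python daemon that reads raw input events and writes to the sysrq trigger. The assumptions here are a 64-bit Linux system, where an input event is 24 bytes, a device path that you pass in yourself, and root privileges. Scroll Lock as the trigger key is just the example from the discussion.

```python
import struct
import sys

# struct input_event on 64-bit Linux: timeval (two longs), type, code, value.
EVENT = struct.Struct("llHHi")
EV_KEY = 1            # event type for key presses and releases
KEY_SCROLLLOCK = 70   # keycode from the kernel's input event codes
KEY_PRESS = 1         # value 1 = press, 0 = release, 2 = autorepeat


def is_trigger(raw: bytes, code: int = KEY_SCROLLLOCK) -> bool:
    """True if the raw event is a press of the chosen key."""
    _sec, _usec, ev_type, ev_code, value = EVENT.unpack(raw)
    return ev_type == EV_KEY and ev_code == code and value == KEY_PRESS


def watch(device: str, code: int = KEY_SCROLLLOCK) -> None:
    """Block on the input device and request SysRq K when the key is pressed."""
    with open(device, "rb", buffering=0) as dev:
        while True:
            raw = dev.read(EVENT.size)
            if len(raw) == EVENT.size and is_trigger(raw, code):
                with open("/proc/sysrq-trigger", "w") as trigger:
                    trigger.write("k")


if __name__ == "__main__":
    # Hypothetical usage: sudo python3 recover_button.py /dev/input/event3
    watch(sys.argv[1])
```

One design limit worth knowing: this daemon runs in userspace, so it only helps while the kernel can still schedule it. The keyboard combination is handled inside the kernel and remains the fallback when userspace is completely stalled.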
There's one more SysRq function worth mentioning. Alt plus SysRq plus F.
The manual OOM kill. It invokes the OOM killer immediately, regardless of memory pressure. If your system is thrashing and you know it's a memory issue, Alt plus SysRq plus F will kill the process with the highest OOM score. But remember, on Ubuntu twenty five ten with the default sysrq mask, F is not enabled. You'd need to set kernel dot sysrq equals one to use it. And honestly, if you have earlyoom running, you probably don't need manual OOM kill. Earlyoom already did what F would do, just sooner.
Let's talk about the misconception that GPU crashes on Linux are always driver bugs.
Many GPU hangs are caused by VRAM overcommit or shader compilation storms, not driver bugs. A shader compilation storm happens when an application triggers the compilation of hundreds or thousands of shader programs simultaneously. Each compilation allocates memory, and if the total exceeds available VRAM, the driver hangs. This is preventable. Setting MESA underscore SHADER underscore CACHE underscore DIR to a location on a fast SSD can help. Disabling the cache entirely with MESA underscore SHADER underscore CACHE underscore DISABLE equals true, or MESA underscore GLSL underscore CACHE underscore DISABLE on older Mesa releases, can prevent the storm from happening at all, at the cost of slightly longer shader compilation times when the application starts.
The tradeoff is startup time versus stability.
For a video editor like Kdenlive, where you're going to be in the application for hours, a few extra seconds of startup is worth it if it prevents a crash forty five minutes into an edit.
Another misconception. That SysRq plus REISUB is the only way to recover a frozen Linux system.
REISUB is a full reboot. It's clean, it syncs your disks, it unmounts filesystems, but it's still a reboot. SysRq plus K is a session reset. Your graphical session dies, but the system keeps running. For a workstation user with unsaved work and running services, K is almost always the better first step. If K doesn't work, then you escalate to REISUB. But starting with REISUB is like calling a demolition crew when you might just need to reset a circuit breaker.
The misconception that crash prevention tools like earlyoom are only for servers.
Workstations benefit more because the cost of a crash is higher. On a server running automated workloads, if a process dies, the orchestrator restarts it. You might lose a few seconds of throughput. On a workstation, if your desktop session dies, you lose unsaved work, your creative flow is disrupted, and you spend twenty minutes getting back to where you were. The listener said it exactly. You can accumulate quite a lot of work within a session and saved context that if your computer goes down, it can often derail your workday. That's not a server problem. That's a human problem.
Let's put together the actionable checklist. The listener asked for concrete steps.
Step one, install earlyoom. Sudo apt install earlyoom. Configure it with the prefer and avoid patterns we discussed. Put your compositor, display server, and work applications on the avoid list, and put disposable batch jobs on the prefer list. Start with dash m ten and dash s ten. Step two, monitor slash proc slash pressure slash memory for a week. Establish your baseline PSI values under normal workload. Adjust thresholds if needed. Step three, set kernel dot sysrq equals one in slash etc slash sysctl dot d slash ninety nine hyphen sysrq dot conf. This enables all SysRq functions including K, F, and T. Step four, test SysRq plus K in a safe session. Save everything, close anything important, and press Alt plus SysRq plus K. Watch what happens. Practice the recovery sequence. Step five, configure VRAM monitoring. Use radeontop for AMD, nvidia smi for Nvidia, or intel underscore gpu underscore top for Intel. Know what your VRAM baseline looks like so you can spot leaks. Step six, write down the recovery sequence somewhere you can access it when the system is frozen. SysRq plus K, TTY login, sudo systemctl restart gdm three, and if that fails, Claude Code for deeper diagnostics.
The mindset shift here is important. Crash prevention isn't about preventing crashes. It's about preventing cascading failures.
A well configured Linux workstation should degrade gracefully. The misbehaving process dies. The desktop stays up. You're back to work in thirty seconds. The difference between a process crash and a system crash is configuration. Most Linux users are running with the defaults, and the defaults are tuned for server workloads where a compositor doesn't exist and the largest memory consumer is usually the actual problem. On a desktop, the defaults are wrong. But they're fixable with about twenty minutes of configuration.
The listener's insight about Claude Code changes the recovery posture fundamentally. If you can get a terminal and networking, you have an AI assistant that can read logs, diagnose problems, and walk you through fixes. That didn't exist in a practical way two years ago.
It's a new layer of resilience. Prevention handles the common cases. SysRq plus K handles the cases prevention misses. And Claude Code handles the cases where you don't know what went wrong and need help diagnosing. Three layers of defense, each one catching what the previous layer couldn't.
There's a forward looking question here. As AI assisted recovery tools become more capable, does the calculus change for how aggressively we should configure crash prevention? If you can reliably recover from a full desktop crash in under a minute, do you still need earlyoom?
I think you do, but the answer might be different in five years. Right now, earlyoom prevents the crash from happening in the first place, and that's always better than recovering from one. But as recovery tools get faster and more reliable, the cost benefit analysis shifts. If Claude Code can restore your exact session state, all your windows, all your unsaved work, in thirty seconds, then maybe you accept a higher crash rate in exchange for fewer false positives from earlyoom. We're not there yet, but the listener's workflow is pointing in that direction.
There's also the Ubuntu direction to consider. With Ubuntu moving toward immutable and atomic updates, desktop snap confinement, the crash prevention and recovery patterns will need to adapt.
Snap confined apps can't be killed by earlyoom in the same way because they're sandboxed. The SysRq key may become more important as users have less direct control over the system. When your applications are running in containers and the host system is immutable, the recovery patterns change. You might not be able to kill a specific process. You might need to restart the entire graphical stack. SysRq plus K becomes the primary recovery mechanism rather than a last resort.
The practical challenge for listeners. Try the SysRq plus K recovery sequence this week. Deliberately freeze your desktop with a stress test.
Stress dash dash vm eight dash dash vm bytes six G will allocate forty eight gigs of memory across eight processes, six gigs each, because vm bytes is a per worker figure. That'll trigger the OOM killer or earlyoom if you have it configured. Or if you want to test GPU recovery, run a shader compilation benchmark that pushes your VRAM. The point is to practice the recovery in a controlled environment. Knowing you can recover without rebooting changes how you work. You stop fearing crashes and start treating them as manageable events.
One thing we should mention. The SysRq plus K recovery doesn't save your unsaved work in Kdenlive. If the application crashed, that work is gone. The recovery gets you back to a working desktop quickly, but it doesn't recover the application state.
And that's why prevention is still the first line of defense. Earlyoom and GPU hang detection are about stopping the crash before it kills your application. SysRq plus K is about recovering when prevention fails. And Claude Code is about diagnosing why prevention failed so you can fix it for next time. It's a cycle. Prevent, recover, diagnose, improve.
The listener's system, sixty four gigs of RAM, i seven, desktop workstation, is actually in a sweet spot for this kind of configuration. He has enough headroom that he can set higher thresholds, ones that trigger earlier, without false positives. Someone on a laptop with eight gigs has a much harder calibration problem.
The headroom is key. With sixty four gigs and typical usage around thirty two, the listener can set earlyoom threshold to fifteen or even twenty and still have plenty of warning before memory exhaustion. The false positive risk is low because there's so much buffer. On a system with eight gigs, the gap between normal usage and OOM is much narrower, and the calibration has to be much more precise.
The advice scales with hardware. More RAM means you can set higher thresholds and step in earlier.
Which is counterintuitive. You'd think more RAM means you don't need OOM protection. But actually, more RAM means you can afford more aggressive protection without false positives. It's a luxury that lets you be safer.
Let's wrap with the open question. As AI assisted recovery tools like Claude Code become more capable, does the calculus change for crash prevention? If you can recover from a full desktop crash in under a minute, do you still need earlyoom?
I think the answer is yes, but the priority shifts. Prevention becomes about protecting unsaved work rather than protecting uptime. You still don't want Kdenlive to crash forty five minutes into an edit. But if it does, the recovery is fast enough that you're not losing your flow state for half an hour. You're back in the application in two minutes, and you've learned something about why it crashed that you can apply next time.
The listener's recovery posture, if I can get a terminal and keep networking up, I can recover from just about anything, is the right way to think about it. SysRq plus K is the key that unlocks that posture. Everything else is optimization.
The configuration is straightforward. Earlyoom for memory, dirty ratio tuning for I O stalls, GPU hang timeout for VRAM, and SysRq plus K for when all of that fails. Twenty minutes of setup for a system that degrades gracefully instead of cascading into oblivion. That's the value proposition.
Now, Hilbert's daily fun fact.
Hilbert: By the early fifteen hundreds, when Portuguese sailors first reached Mauritius, the island was home to an estimated zero human inhabitants, making it one of the last habitable landmasses of its size to be colonized by humans, even though it lies only about seven hundred miles east of Madagascar, which Austronesian seafarers had settled centuries earlier.
Zero human inhabitants. That is a population statistic, technically.
The best kind of correct.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you want more episodes, find us at myweirdprompts dot com.
Try SysRq plus K this week. Deliberately break something and fix it. You'll work differently once you know you can.