Daniel sent us this one — he's been recording prompts on an Android app he otherwise likes, and ran into one of those maddening little UI bugs where a dropdown menu gets obscured on his OnePlus. The functionality just breaks. Not a weird phone, not an edge-case screen size. And his question is, essentially, what automated tooling exists to catch this stuff before users quietly abandon your app because a button doesn't work on their device? Not the manual responsive testing tools where you click through breakpoints one by one — he's asking about frameworks that can programmatically catch these layout and interaction failures across device types. Feels like fertile ground for automation, and honestly, I think he's right.
Oh, there's so much to dig into here. And before we get into the tooling, I want to flag something about why this problem is harder than it looks. The specific bug Daniel hit — a dropdown being obscured — that's not just a CSS media query issue. That's a stacking context problem, a z-index problem, potentially a viewport calculation problem. It's in this weird intersection between layout and interactivity that a lot of testing frameworks historically just didn't handle well.
That's where elements layer on top of each other in ways the developer didn't intend?
Exactly. And on mobile, you've got the added fun of the on-screen keyboard, system UI overlays, notches, punch-hole cameras, gesture navigation bars. All of these eat into the actual viewport in ways that vary device to device. A dropdown that works beautifully on a Pixel might be completely hidden behind the keyboard on a OnePlus because the keyboard height calculation differs. Or it might render underneath a parent container because of how that specific Android version handles overflow.
It's not just screen dimensions. It's the entire environment the browser or WebView is operating in.
Which means any automated solution has to test in something approximating a real device environment, not just resized browser windows. But let's get to what Daniel actually asked about. The automated frameworks. And there are a few major categories here. The first one, and I think the most directly relevant to the bug he described, is visual regression testing.
Which is what — the tool takes screenshots and compares them?
That's the basic version, yeah. But the modern tools have gotten a lot more sophisticated. The big players are Percy, which is now part of BrowserStack, and Applitools. And they approach the problem differently. Percy does pixel-level screenshot comparison — you set a baseline, run your tests across different viewport sizes, and it flags any visual diffs. So if that dropdown is cut off on a specific device width, Percy would catch it because the screenshot wouldn't match the baseline.
Wouldn't that generate a lot of false positives? If you change literally anything in your UI, every screenshot breaks.
That's the old criticism of pixel-diffing tools, and it was valid five years ago. But Percy has gotten smarter about anti-aliasing differences, sub-pixel rendering variations between browsers, and dynamic content stabilization. You can also set ignore regions and sensitivity thresholds. The real workflow is, you approve the diffs that are intentional, and the tool learns. It's not perfect, but for catching obscured elements, it's genuinely good.
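To make that concrete, here is roughly what the Percy workflow looks like inside a Playwright test. This is a minimal sketch assuming the @percy/playwright SDK and a run wrapped in `percy exec`; the URL and element names are hypothetical.

```ts
// A minimal sketch of Percy visual snapshots inside a Playwright test.
// Assumes @percy/playwright is installed and the suite runs via `percy exec`.
// The page URL and element names are hypothetical.
import { test } from '@playwright/test';
import percySnapshot from '@percy/playwright';

test('settings dropdown renders across widths', async ({ page }) => {
  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  // Percy renders this DOM snapshot at each width and diffs it
  // against the approved baseline for that width.
  await percySnapshot(page, 'Settings - dropdown open', {
    widths: [360, 414, 768],
  });
});
```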
Applitools takes a different approach?
Applitools uses what they call Visual AI — it's not comparing pixels, it's using computer vision models to understand the page semantically. It knows what a button is, it knows what a dropdown is, it knows if text is being truncated. So instead of flagging every pixel change, it flags things that look wrong to a human viewer. Overlapping elements, elements that extend beyond the viewport, text that's cut off. For Daniel's exact bug — a dropdown menu being obscured — Applitools would likely flag it even if the pixel diff was subtle, because the model understands that a menu should be fully visible when opened.
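For comparison, an Applitools check is a sketch along these lines, assuming the @applitools/eyes-playwright SDK and an API key in the environment; the app and test names are placeholders.

```ts
// A rough sketch of an Applitools Visual AI check from a Playwright test.
// Assumes @applitools/eyes-playwright and APPLITOOLS_API_KEY in the env.
import { test } from '@playwright/test';
import { Eyes, Target } from '@applitools/eyes-playwright';

test('dropdown is fully visible when opened', async ({ page }) => {
  const eyes = new Eyes();
  await eyes.open(page, 'My App', 'Dropdown visibility');

  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  // Visual AI flags overlapping, truncated, or clipped elements
  // rather than raw pixel differences.
  await eyes.check('Dropdown open', Target.window().fully());
  await eyes.close();
});
```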
The AI is essentially doing what a human QA person would do — looking at the screen and going, that doesn't look right.
That's the pitch. And they've published benchmarks showing their model catches something like ninety-three percent of visual bugs with near-zero false positives on certain test suites. The trade-off is cost. Applitools is not cheap. Percy's pricing is more accessible, especially if you're already in the BrowserStack ecosystem.
Neither of those actually interacts with the dropdown. They're looking at static states, right?
This is the key distinction, and I'm glad you pushed on it. Visual regression tools catch layout problems in whatever state you capture. But you still need something to drive the interactions — to actually click the dropdown, to scroll, to trigger the state where the bug manifests. That's where end-to-end testing frameworks come in. Playwright and Cypress are the two dominant ones right now for web apps.
Daniel mentioned he might be working with a consolidated codebase — web app and Android app. Does that change the recommendation?
It does, and we should talk about that. But let me lay out the web side first, because that's where the tooling is most mature. Playwright, which is Microsoft's project, has a capability that I think directly addresses Daniel's pain point: device emulation. You can parameterize your test suites to run across a matrix of device configurations. Not just screen sizes, but full device emulation profiles that include pixel density, touch versus mouse input, and browser-specific behaviors.
You write one test that clicks the dropdown, and Playwright runs it against twenty device profiles?
Exactly. And critically, Playwright has experimental Android support that can attach to real WebViews, not just desktop browsers. So if your Android app is wrapping a web view, you can test in something that approximates the actual runtime environment. Cypress has historically been weaker on mobile emulation — they've improved it, but Playwright's device support is more comprehensive. Playwright also has built-in assertions for visibility. You can assert that an element is in the viewport, that it's not obscured by another element, that it's actually interactable, not just technically visible in the DOM.
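For a sense of what that matrix looks like in practice, here is a sketch of a playwright.config.ts that fans the same test suite out across several built-in device profiles. The device names come from Playwright's bundled registry; the selection is illustrative.

```ts
// playwright.config.ts — run every test against a matrix of emulated
// mobile profiles. Device names come from Playwright's built-in
// registry; pick whichever models matter to your users.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'pixel-7',   use: { ...devices['Pixel 7'] } },
    { name: 'galaxy-s9', use: { ...devices['Galaxy S9+'] } },
    { name: 'iphone-14', use: { ...devices['iPhone 14'] } },
    // Each profile sets viewport, deviceScaleFactor, isMobile,
    // hasTouch, and userAgent together.
  ],
});
```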
Wait, that's interesting. The difference between "visible in the DOM" and "actually visible to a user."
This is where most automated testing falls apart, and I think it's worth spending a moment on because it's the core of Daniel's frustration. A lot of test suites check whether an element exists and whether its CSS display property isn't set to none. That's it. They don't check whether the element is positioned off-screen, whether its z-index places it behind a modal overlay, whether it's zero pixels tall because of some flexbox bug, whether it's covered by the keyboard. Playwright's actionability checks — that's their term — verify that the element is stable, visible, enabled, and not obscured before performing any action on it.
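Concretely, a test for Daniel's bug might assert visibility explicitly instead of trusting the DOM. A sketch, with hypothetical selectors:

```ts
// A sketch of visibility assertions beyond "exists in the DOM".
// Selectors and the URL are hypothetical.
import { test, expect } from '@playwright/test';

test('dropdown menu is actually usable', async ({ page }) => {
  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  const menu = page.getByRole('menu');
  await expect(menu).toBeVisible();     // rendered, not display:none
  await expect(menu).toBeInViewport();  // not pushed off-screen

  // trial: true runs the full actionability checks (stable, visible,
  // enabled, not obscured by another element) without clicking.
  await menu.getByRole('menuitem', { name: 'Newest first' })
    .click({ trial: true });
});
```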
If Daniel's team had been running Playwright tests with actionability checks on a OnePlus device profile, the test would have failed when it tried to click the dropdown and the framework realized something was in the way.
That's the theory. In practice, emulation isn't perfect. A Playwright device profile emulates the viewport dimensions, the user agent string, the pixel ratio, and some touch behaviors. But it doesn't emulate the actual rendering engine quirks of a specific Android version on a specific manufacturer's hardware. OnePlus uses OxygenOS, which has its own modifications to how WebViews render and how the system UI interacts with app content. No emulator fully replicates that.
We're back to the problem Daniel identified — you can't predict every device. Emulation gets you maybe eighty, ninety percent of the way there, but the truly weird bugs still slip through.
Which is why the serious approach combines multiple layers. You use Playwright or Cypress for fast, automated functional testing across emulated device profiles. You layer on Percy or Applitools for visual regression across those same profiles. And then — this is the part a lot of teams skip — you run a subset of your critical user flows on real devices.
Real device clouds. Like BrowserStack, Sauce Labs.
BrowserStack has the largest real device cloud — they've got something like twenty thousand real mobile devices in data centers. You can run your Playwright or Selenium scripts against actual physical phones and tablets, not emulators. The tests are slower and more expensive, but they catch the rendering quirks and OEM-specific bugs that emulators miss.
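The wiring for that is roughly this shape. This sketch follows BrowserStack's documented CDP endpoint for Playwright, but the exact capability keys and supported devices are theirs to define, so treat every name here as illustrative and check their current docs.

```ts
// A sketch of pointing an existing Playwright script at a real device
// in BrowserStack's cloud. The endpoint and capability names follow
// BrowserStack's Playwright integration, but treat the exact keys as
// illustrative — verify against their current documentation.
import { chromium } from 'playwright';

async function main() {
  const caps = {
    browser: 'chrome',
    os_version: '13.0',
    device: 'OnePlus 11R', // a real phone, not an emulator
    real_mobile: 'true',
    'browserstack.username': process.env.BROWSERSTACK_USERNAME,
    'browserstack.accessKey': process.env.BROWSERSTACK_ACCESS_KEY,
  };

  const browser = await chromium.connect(
    'wss://cdp.browserstack.com/playwright?caps=' +
      encodeURIComponent(JSON.stringify(caps)),
  );
  const page = await browser.newPage();
  await page.goto('https://example.com/settings');
  // ...same dropdown assertions as the emulated run...
  await browser.close();
}

main();
```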
For a team that can't afford twenty thousand real devices, this is the practical compromise. Run the emulated tests on every commit, run the real device tests on every release, or nightly.
That's exactly the pattern most mid-size teams land on. And here's something specific that I think Daniel would appreciate. BrowserStack acquired Percy a few years back, and they've been integrating visual regression directly into their real device testing pipeline. So you can run a test on a real Samsung Galaxy, a real OnePlus, a real Pixel, capture Percy screenshots at each step, and the visual diffs are flagged per device. You're not guessing whether the bug reproduces on real hardware — you can see it.
That integration seems like the thing a lot of people don't know exists. They think of these as separate tools — the device cloud over here, the visual testing over there — and they never connect them.
The fragmentation of the testing tool ecosystem is a real problem. Every vendor wants to be the platform, and stitching them together often requires custom scripting. But the BrowserStack plus Percy integration is turnkey at this point. And Applitools has their own device cloud integration — they call it Ultrafast Grid — which runs visual tests across multiple browsers and viewports in parallel.
Let me pull us back to something Daniel mentioned specifically. He said he's developing an app that's both a web app and an Android app, possibly a consolidated code base. To me that sounds like either a progressive web app, or something like React Native, or Flutter. Does the testing story change depending on which of those it is?
It changes quite a bit, and this is where I see teams make expensive mistakes. If you're building with React Native, you're not rendering HTML and CSS in a WebView by default. You're rendering native components. Playwright and Cypress cannot test native components. Percy and Applitools can still do screenshot-based visual regression on native apps, but they need a different integration path — usually through Appium or a native testing framework.
The whole web testing stack we just described becomes partially irrelevant.
For the native parts, yes. For React Native specifically, the testing story has improved a lot in the last couple of years. Detox is the main end-to-end framework — it's maintained by the Wix engineering team, and it's designed specifically for React Native. It handles the asynchronous bridge between JavaScript and native code properly, which was a huge source of flaky tests in earlier frameworks. You can run Detox tests on emulators or on real devices through a device cloud.
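A Detox test for the same class of bug is a sketch like this, assuming an already-configured Detox project; the testIDs are hypothetical.

```ts
// A sketch of a Detox end-to-end test for React Native.
// Assumes a configured Detox project; all testIDs are hypothetical.
import { device, element, by, expect } from 'detox';

describe('Sort dropdown', () => {
  beforeAll(async () => {
    await device.launchApp({ newInstance: true });
  });

  it('shows the full menu when opened', async () => {
    await element(by.id('sort-dropdown')).tap();

    // Detox's toBeVisible() requires the element to actually be
    // on screen, not merely mounted in the component tree.
    await expect(element(by.id('sort-menu'))).toBeVisible();
    await expect(element(by.id('sort-option-newest'))).toBeVisible();
  });
});
```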
For visual regression on native?
Applitools has a native mobile SDK that works with both React Native and fully native apps. Percy has a mobile offering too, though it's newer and less battle-tested than their web product. The workflow is similar — you take screenshots at key points in your test flow, and the visual AI compares them to baselines. But the setup is more involved than the web version. You're not just injecting a JavaScript snippet — you're integrating a native library.
What about Flutter? That's the other big consolidated codebase option.
Flutter has its own testing framework built in — it's actually quite good for unit and widget testing. For integration testing, Flutter's integration_test package lets you run tests on emulators or real devices. The challenge with Flutter and visual regression is that Flutter renders everything on its own canvas — it doesn't use platform UI components. So traditional screenshot comparison tools work fine at the pixel level, but semantic understanding tools like Applitools might not recognize Flutter widgets as buttons or dropdowns the way they recognize native or web elements.
Because to the visual AI, a Flutter-rendered button is just pixels. It doesn't have the accessibility metadata or the DOM structure that the AI relies on to understand what it's looking at.
Flutter's accessibility tree is separate from its rendering, and most visual testing tools don't tap into it. There are some community efforts to bridge this — there's a package called Golden Toolkit for Flutter that does visual regression at the widget level — but it's not as mature as the web ecosystem.
If Daniel is on React Native, the path is Detox plus Applitools or Percy Mobile, running on real devices through BrowserStack or Sauce Labs. If he's on Flutter, the path is more fragmented and may require more custom work. If he's on a web app wrapped in a WebView or a PWA, Playwright plus Percy on BrowserStack real devices is the most mature path.
That's a good summary. And I want to add one more layer that I think is under-discussed. Daniel mentioned CSS standards and the idea that modern CSS should handle responsive layouts without breaking. And he's not wrong — CSS Grid and Flexbox and container queries have made responsive design dramatically more robust than the float-based layouts of ten years ago. But the bug he described — an obscured dropdown — that's rarely a CSS layout problem. It's usually a component library bug, or a custom implementation that doesn't account for viewport edge cases.
The tooling can catch the bug, but the root cause is often deeper in the component architecture.
And this is where I get excited about some newer approaches. There's a category of tools emerging that I'd call accessibility-driven testing. The idea is that if an element is properly accessible — if it has the right ARIA attributes, if it's reachable via keyboard navigation, if it's announced correctly by screen readers — it's almost certainly visible and interactable for sighted users too.
Testing for accessibility as a proxy for general UI correctness.
Because the overlap is huge. An obscured dropdown isn't just a visual bug — it's an accessibility failure. A screen reader user can't interact with it either. Tools like Axe-core, which is the engine behind a lot of accessibility testing libraries, can be integrated into Playwright and Cypress tests. They catch a surprising number of the same bugs that visual regression tools catch, plus additional accessibility issues.
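Wiring that into an existing Playwright suite is only a few lines. A sketch assuming the @axe-core/playwright package:

```ts
// A sketch of running axe-core accessibility checks inside a
// Playwright test, via the @axe-core/playwright package.
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('settings page has no accessibility violations', async ({ page }) => {
  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa']) // limit the scan to WCAG A/AA rules
    .analyze();

  expect(results.violations).toEqual([]);
});
```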
They wouldn't catch every case. A dropdown might be visually obscured but still present in the accessibility tree and technically reachable.
Accessibility testing is a complement, not a replacement. But here's what's clever — some teams are now running what they call "semantic snapshot testing." Instead of comparing pixel screenshots, they compare the accessibility tree across device profiles. If an element disappears from the accessibility tree on a specific viewport, or if its position in the tree changes unexpectedly, that gets flagged. It's faster than visual regression and produces fewer false positives, but it misses purely visual bugs like color contrast issues.
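One way to sketch that idea with plain Playwright: serialize the accessibility tree per device profile and treat it as a snapshot. This is a DIY illustration of the approach, not a specific product.

```ts
// An illustrative sketch of "semantic snapshot testing": capture the
// accessibility tree per device profile and diff it against a stored
// baseline. A DIY illustration, not a named tool.
import { test, expect } from '@playwright/test';

test('accessibility tree is stable per device', async ({ page }, testInfo) => {
  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  // Serialize the accessibility tree; if the dropdown vanishes from it
  // on one viewport, the snapshot for that project name will diff.
  const tree = await page.accessibility.snapshot();
  expect(JSON.stringify(tree, null, 2))
    .toMatchSnapshot(`a11y-tree-${testInfo.project.name}.json`);
});
```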
Let me ask a practical question. Daniel's a developer — he's not running a massive QA team. He wants to catch these bugs without spending forty hours a week on test infrastructure. What's the minimum viable setup that would have caught his dropdown bug?
For a solo developer or a small team, I'd say start with Playwright. It's free, it's open source, it's well-documented. Write tests for your critical user flows — the top five or six things users do in your app — and parameterize them across the top ten mobile device profiles. Playwright has a device descriptor list built in. Add the actionability checks, which are on by default in Playwright. That alone would catch a significant percentage of these bugs.
For visual regression on a budget?
Playwright has built-in screenshot comparison. It's not as sophisticated as Percy or Applitools — it does pixel-level comparison without the AI smarts — but it's free and it's integrated. You can write assertions like "this element's screenshot should match the baseline" and it'll flag diffs. For a small app, that might be sufficient. The next step up, dramatically better in results, is Percy's free tier, which gives you something like five thousand screenshots per month. For a solo developer, that might be plenty.
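That built-in comparison looks like this. A sketch; the screenshot name and the diff threshold are whatever you choose.

```ts
// A sketch of Playwright's built-in screenshot assertions: free,
// pixel-level comparison against a checked-in baseline image.
import { test, expect } from '@playwright/test';

test('dropdown matches its baseline screenshot', async ({ page }) => {
  await page.goto('https://example.com/settings');
  await page.getByRole('button', { name: 'Sort by' }).click();

  // Element-level screenshot; baselines are stored per project, so
  // each device profile keeps its own reference image.
  await expect(page.getByRole('menu')).toHaveScreenshot('sort-menu.png', {
    maxDiffPixelRatio: 0.01, // tolerate roughly 1% pixel drift
  });
});
```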
The real device piece?
That's where the cost jumps. BrowserStack's lowest paid plan starts around twenty-nine dollars a month, and that gives you limited real device minutes. For a solo developer, I'd say use Playwright on emulated devices for every commit, and then once a week or before releases, run the same tests on BrowserStack real devices manually. You don't even need to automate the real device runs at first — just running your test suite once on a handful of real phones would catch the OnePlus-specific rendering quirks that emulators miss.
There's something about this that bugs me, though. We're telling developers to buy device clouds, subscribe to visual testing services, integrate multiple frameworks — all to catch bugs that users will still find. The economics of this don't fully work for small teams.
I think that's a fair frustration. And it connects to something larger about the state of mobile development. The fragmentation isn't getting better. Every year there are more device models, more screen aspect ratios, more OEM Android modifications. Google has tried to rein this in with things like Project Treble and Mainline modules, but the reality is that a OnePlus phone renders web content differently than a Samsung phone, and both differ from a Pixel. The testing burden keeps growing.
Which is exactly why Daniel's instinct about automation being the right approach is correct. You can't test manually across a hundred device types. But even automation has limits — you're automating tests across emulated profiles that don't perfectly match real devices, or you're paying for real device access that's expensive and slow.
There's a middle ground that I haven't mentioned yet, and it's actually quite clever. Firebase Test Lab — it's Google's device cloud, and it has a free tier. You get something like fifteen test executions per day, roughly split between virtual and physical devices, and up to ten device models per test matrix. It's not as extensive as BrowserStack, but for an Android developer, it's useful and the free tier is generous. You can run your Espresso tests if you're native, or your Appium tests if you're hybrid, across a range of real devices.
Daniel's app is on Android. That seems directly relevant.
Firebase Test Lab also has something called Robo Test. You don't even write test scripts — it crawls your app automatically, tapping on things, filling in text fields, navigating around, and it takes screenshots at each step. It'll flag crashes, ANRs, and some visual issues. It's not going to catch every obscured dropdown, but it catches a surprising amount for zero scripting effort.
The truly minimal setup for a small Android developer: Playwright or Appium tests on emulated devices locally, Firebase Test Lab Robo Tests on real devices for free, and maybe Percy's free tier if visual regression matters. That's not nothing.
I'd add one more thing that costs zero dollars and catches a lot. Use your app on a cheap, low-end device. Not your daily driver. Go buy a used, two-year-old mid-range phone for a hundred bucks. The kind of phone your users actually have. Developers tend to test on flagship devices — fast processors, lots of RAM, latest OS updates. The obscured dropdown bug Daniel described? It might only manifest on devices with slightly different aspect ratios, or on older Android versions with different overflow handling. Testing on a device you don't care about is surprisingly effective.
The "burner phone" testing strategy. I like it.
It's not automated, but it catches the class of bugs that automation misses because the test environment doesn't perfectly replicate the real-world conditions.
Let me take us in a slightly different direction. Daniel mentioned that these are "surprisingly common pain points" in well-funded apps. And he's right — I see these bugs in apps from major companies all the time. Is the tooling we're describing actually being used by these companies? Or are they just not investing in this?
Oh, the big companies are absolutely using these tools. Google uses Percy internally. Microsoft built Playwright because they needed better testing for their own web properties. The issue isn't that the tools don't exist or that companies don't use them. The issue is test coverage and prioritization.
What do you mean by prioritization?
A well-funded app might have ten thousand test cases. They're running Playwright across twenty device profiles. That's two hundred thousand test executions. Even with parallelization, that's a lot of time and compute. So teams prioritize. They test the happy path on every device, but edge cases — like "what happens when the user opens the dropdown while the keyboard is visible on a OnePlus in dark mode" — those don't make the cut. The test matrix explodes combinatorially, and nobody can afford to test every permutation.
The tooling exists, the tests exist, but the specific combination of device, OS version, app state, and user action that triggers the bug isn't in the test matrix.
And this is where I think the next generation of tools is heading. There are startups working on what they call "exploratory testing agents" — AI systems that don't follow scripted test cases but instead explore the app like a curious user would, trying unusual combinations of actions, looking for things that seem broken. It's early, but the idea is to catch the bugs that aren't in your test plan because you never thought to test for them.
An AI that just pokes at your app and goes, hey, this looks wrong.
Applitools has been moving in this direction with their Visual AI. But the really ambitious version is fully autonomous testing agents that understand app functionality and can generate their own test cases. It's not production-ready for most teams yet, but the direction is clear.
Which brings us back to Daniel's original point — this is fertile ground for automation precisely because the combinatorial explosion makes manual test planning impossible. The only way to cover the long tail of device and state combinations is with tools that explore automatically.
I think we're closer to that than a lot of people realize. The building blocks are there. You've got Playwright for reliable browser automation, Applitools and Percy for visual AI, real device clouds for hardware diversity, and now the first generation of AI testing agents. The integration isn't seamless yet, but a motivated developer can stitch together something quite powerful.
Let me ask one more question, and then we should start wrapping up. If Daniel is building this app right now, today, and he wants to add automated testing that catches these UI bugs — what's the one thing he should do first? The highest-leverage step?
Write five Playwright tests for his most critical user flows, parameterize them across the top ten mobile viewports, and run them on every commit. That's step one. It's free, it takes an afternoon to set up, and it will catch the most egregious bugs — including the kind of obscured dropdown he described, as long as the test actually clicks the dropdown and asserts visibility. Everything else — visual regression, real device clouds, AI testing — builds on that foundation. But that foundation has to be there first.
If he's React Native or Flutter rather than web?
Same principle, different tools. React Native: Detox with multi-device configuration. Flutter: integration tests with the built-in framework, running on multiple emulator profiles. The principle is the same — automate your critical flows across device profiles. The specific framework matters less than the discipline of actually doing it.
That feels like a good place to land. The tools exist, they're more mature than a lot of developers realize, and the barrier to entry is lower than it's ever been. The hard part isn't the tooling — it's the commitment to writing and maintaining the tests.
Accepting that you'll never catch everything. No test suite is exhaustive. The goal isn't zero bugs — it's catching the bugs that would drive users away before they report them. Daniel's instinct about automation being the right approach is spot on. The frustrating part is that the tools can't fully replace the diversity of real-world devices and user behavior. But they can shrink the problem from "we have no idea what's broken" to "we know exactly which permutations we haven't tested."
Now: Hilbert's daily fun fact.
Hilbert: The Greenland shark can live for over four hundred years, making it the longest-living vertebrate known to science. Researchers determine their age by radiocarbon dating the eye lens nuclei, which are formed before birth and never regenerate.
Four hundred years. That shark was swimming around when the Mayflower was crossing the Atlantic.
It's still got terrible eyesight from all those centuries of eye lens nuclei accumulating. On that note — thanks to our producer Hilbert Flumingtop, and this has been My Weird Prompts. If you want more episodes, find us at myweirdprompts.com or on Spotify. We'll be back next time.