
How Screen Readers Work

Screen readers are assistive technologies that convert visual interfaces into spoken or braille output, enabling people who are blind or have low vision to navigate software independently. They don't "see" your page the way a sighted user does. Instead, they rely on a structured representation of your content built by the browser. Understanding this pipeline (how HTML becomes speech) helps developers write more accessible markup from the start. When you know what a screen reader actually receives, you can anticipate problems before they reach users: missing labels, broken relationships, invisible states. This page walks through that pipeline end to end, from raw HTML to spoken output, and shows how Speakable fits into the picture as a static analysis tool that mirrors part of the same process.

The Accessibility Tree

When a browser parses your HTML and builds the DOM (Document Object Model), it also constructs a parallel data structure called the accessibility tree. This tree is a simplified version of the DOM that strips away everything visual (colors, font sizes, layout) and exposes only what assistive technologies need to communicate your interface to users.

The accessibility tree exposes four categories of information for each node:

  • Role: what the element is (button, link, heading, list, navigation landmark)
  • Name: the accessible name, computed from text content, labels, or ARIA attributes
  • State: current status like checked, expanded, disabled, required
  • Relationships: how elements connect to each other (label-to-input, description-to-control, group membership)
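As a rough illustration, the four categories can be modelled as fields on a node type. This is a minimal Python sketch, not how any browser stores its tree; the AccessNode class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessNode:
    """One node of a simplified accessibility tree (illustrative only)."""
    role: str                                                   # what the element is
    name: str = ""                                              # computed accessible name
    states: dict[str, bool] = field(default_factory=dict)       # e.g. {"required": True}
    relations: dict[str, list[str]] = field(default_factory=dict)  # e.g. {"describedby": ["hint"]}
    children: list["AccessNode"] = field(default_factory=list)


# A required email field that is described by a separate hint element.
email = AccessNode(
    role="textbox",
    name="Email",
    states={"required": True},
    relations={"describedby": ["email-hint"]},
)
form = AccessNode(role="form", name="Contact", children=[email])

print(form.children[0].role, form.children[0].name)
```

Keeping relationships as lists of element ids mirrors how attributes such as aria-describedby reference other elements by id.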

Key concept

The accessibility tree is not a theoretical model. It's an actual data structure that browsers expose through platform accessibility APIs. You can inspect it in Chrome DevTools (the Accessibility pane of the Elements panel), Firefox (Accessibility Inspector), or Safari (Web Inspector > Node > Accessibility).

Here's a comparison showing how a navigation structure maps from DOM to accessibility tree:

DOM vs Accessibility Tree
DOM:                          Accessibility Tree:
<nav aria-label="Main">       navigation "Main"
  <ul>                          list
    <li>                          listitem
      <a href="/">Home</a>          link "Home"
    </li>                         
  </ul>                         
</nav>

Notice how the tree keeps the nesting of the DOM but replaces tags with roles and names. The <nav> element becomes a navigation role with the name "Main" (from aria-label). The anchor tag becomes a link role with the name "Home" (from its text content). Structural elements like <ul> and <li> map to list and listitem roles.
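The mapping in the comparison above can be approximated in a few lines with Python's standard html.parser module. This is a sketch under stated assumptions: the role table covers only the tags in the snippet, the RoleTreeBuilder and outline names are hypothetical, and void elements such as <img> are not handled. Browsers follow a much larger mapping.

```python
from html.parser import HTMLParser

# Hand-picked subset of implicit roles, just enough for the snippet above.
IMPLICIT_ROLES = {"nav": "navigation", "ul": "list", "li": "listitem"}


def implicit_role(tag, attrs):
    # An <a> is only exposed as a link when it has an href.
    if tag == "a":
        return "link" if "href" in attrs else "generic"
    return IMPLICIT_ROLES.get(tag, "generic")


class RoleTreeBuilder(HTMLParser):
    """Builds a nested role tree. Assumes well-formed HTML without void elements."""

    def __init__(self):
        super().__init__()
        self.roots = []
        self.stack = []  # currently open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        node = {
            "role": attrs.get("role") or implicit_role(tag, attrs),
            "label": attrs.get("aria-label") or "",
            "text": "",
            "children": [],
        }
        siblings = self.stack[-1]["children"] if self.stack else self.roots
        siblings.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def outline(node, depth=0):
    # Name comes from aria-label, or from text content for links.
    name = node["label"] or (node["text"] if node["role"] == "link" else "")
    lines = ["  " * depth + node["role"] + (f' "{name}"' if name else "")]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


builder = RoleTreeBuilder()
builder.feed('<nav aria-label="Main"><ul><li><a href="/">Home</a></li></ul></nav>')
builder.close()
tree_lines = outline(builder.roots[0])
print("\n".join(tree_lines))
```

For this input the printed outline matches the right-hand column of the comparison: navigation "Main", then list, listitem, and link "Home", each indented one level deeper.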

Not every DOM node appears in the accessibility tree. Elements that exist purely for visual purposes (decorative images with empty alt="", wrapper <div>s and <span>s used for layout) are pruned from the tree. Similarly, elements hidden with display: none, visibility: hidden, or aria-hidden="true" are excluded entirely. This is by design: the tree represents what should be communicated, not what is rendered visually.

Conversely, elements hidden visually but present for screen readers (using techniques like the "visually-hidden" CSS pattern) remain in the accessibility tree. This asymmetry is intentional: it gives developers fine-grained control over what assistive technology users perceive versus what sighted users see.
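The inclusion rules from the last two paragraphs can be written as a single predicate. This is a simplified sketch: the is_exposed function is hypothetical, it looks at one element in isolation, and it ignores inheritance from ancestors, which real browsers take into account.

```python
def is_exposed(tag: str, attrs: dict[str, str], style: dict[str, str]) -> bool:
    """Rough check for whether an element stays in the accessibility tree."""
    if attrs.get("aria-hidden") == "true":
        return False  # explicitly hidden from assistive technology
    if style.get("display") == "none" or style.get("visibility") == "hidden":
        return False  # hidden for everyone
    if tag == "img" and attrs.get("alt") == "":
        return False  # decorative image
    return True


# Typical declarations of a "visually-hidden" utility class: off-screen for
# sighted users, but none of the exclusion rules above apply.
visually_hidden = {"position": "absolute", "width": "1px", "height": "1px", "overflow": "hidden"}

print(is_exposed("span", {}, visually_hidden))
print(is_exposed("img", {"alt": ""}, {}))
```

The visually hidden span stays exposed because clipping an element to a one-pixel box does not trigger any of the exclusion rules, while display: none does.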

Generic elements like <div> and <span> without explicit roles are typically represented as generic containers or flattened entirely. Their children are exposed but the container itself carries no semantic meaning. This is why "div soup" is problematic for accessibility: without semantic elements, the accessibility tree becomes a flat list of text nodes with no structural information for the screen reader to convey to users. Using native HTML elements like <nav>, <main>, <button>, and <h1> to <h6> ensures the tree carries meaningful structure automatically.

From HTML to Speech: The Pipeline

Understanding the full path from HTML source code to spoken words helps demystify screen reader behavior. The pipeline has distinct stages, each owned by a different piece of software:

1. Browser parses HTML → DOM

The browser reads your HTML source and constructs the Document Object Model, a live, in-memory tree of elements, attributes, and text nodes.

2. Browser builds Accessibility Tree from DOM

Using ARIA mappings, HTML semantics, and computed styles, the browser generates the accessibility tree. This is where roles, names, states, and relationships are determined.

3. Screen reader queries via Platform Accessibility API

The screen reader doesn't read the DOM directly. It communicates with the browser through the operating system's accessibility API: MSAA/UIA on Windows, NSAccessibility on macOS, AT-SPI on Linux.

4. Screen reader converts to speech/braille output

The screen reader takes the role, name, state, and value information and formats it into natural language that gets sent to a speech synthesizer or braille display.

Here's the full pipeline as a flow:

Pipeline Flow
HTML Source → DOM → Accessibility Tree → Platform API → Screen Reader → Speech

Each stage introduces potential failure points. If your HTML lacks semantic structure, the DOM won't carry meaningful information forward. If ARIA attributes are misused, the accessibility tree will contain incorrect data. If the platform API has bugs (they do), even correct HTML might be announced poorly. And each screen reader has its own interpretation logic. NVDA, JAWS, VoiceOver, and Narrator can all announce the same accessibility tree differently.

How Speakable mirrors this

Speakable's analysis pipeline mirrors this flow statically: HTML → parse → extract → model → render. It parses HTML into a DOM-like structure, extracts an accessibility tree representation, models how each screen reader would interpret it, and renders the predicted speech output. This gives you a prediction of what steps 1-4 would produce without needing to run an actual screen reader.

API Reference - Pipeline stages

See the parse, extract, and render functions that mirror each pipeline stage.

Browse Mode vs Focus Mode

Screen readers operate in two primary interaction modes. Understanding these modes is essential because they determine what gets announced and how users navigate your content.

Browse Mode (Virtual Cursor)

In browse mode, the screen reader maintains a virtual cursor that moves through all content on the page: headings, paragraphs, links, images, lists, everything. Users navigate linearly with arrow keys, or jump between elements using single-letter shortcut keys (in NVDA, for example, H for headings, K for links, T for tables). The exact keys differ between screen readers.

This is the default mode. Everything in the accessibility tree is reachable. The screen reader announces each element as the virtual cursor lands on it, reading out the role, name, and any relevant state.

Focus Mode (Forms Mode)

In focus mode, keyboard events pass directly to interactive controls. Users interact with form fields, type text, select options, and press buttons. Only focusable elements are reachable; static text and headings between form controls are skipped.

The switch from browse to focus mode happens automatically when a user activates certain controls: pressing Enter on a text input, entering a composite widget like a listbox, or activating an application role region.

The mode switch changes the announcement model dramatically. In browse mode, a form might be announced as: "heading level 2, Contact Form, name, edit text, required, email, edit text, required, submit, button." Every label, every piece of descriptive text between fields is read. In focus mode, the user tabs between controls and hears only: "name, edit text, required" → "email, edit text, required" → "submit, button." The surrounding prose vanishes from the interaction model.

This distinction matters for developers because it affects how you structure forms. Instructions placed as plain text between inputs will be heard in browse mode but missed in focus mode. To ensure critical instructions reach users in both modes, associate them with controls using aria-describedby. This makes the description part of the control's announced information regardless of mode.
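The difference between the two modes, and the effect of aria-describedby, can be sketched with a flat reading order. The Item type and both traversal functions are hypothetical simplifications: a real screen reader works on the full tree and has more switching rules.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str                 # what the item contributes when announced
    focusable: bool = False   # reachable with Tab?
    description: str = ""     # text pulled in via aria-describedby, if any


def browse_mode(items):
    """Virtual cursor: every item is read, in document order."""
    return [item.text for item in items]


def focus_mode(items):
    """Tab navigation: only focusable controls, plus their linked descriptions."""
    return [
        ", ".join(part for part in (item.text, item.description) if part)
        for item in items
        if item.focusable
    ]


hint = "Use your work address."

# Hint is plain text between controls.
unlinked_form = [
    Item("heading level 2, Contact Form"),
    Item(hint),
    Item("email, edit text, required", focusable=True),
    Item("submit, button", focusable=True),
]

# Same form, but the input references the hint with aria-describedby.
linked_form = [
    Item("heading level 2, Contact Form"),
    Item(hint),
    Item("email, edit text, required", focusable=True, description=hint),
    Item("submit, button", focusable=True),
]

print(focus_mode(unlinked_form))
print(focus_mode(linked_form))
```

In the unlinked form the hint appears only in the browse-mode reading. Once the input references it, the hint becomes part of the control's own announcement and is heard in both modes.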

Screen readers can also switch modes automatically. JAWS calls this "Auto Forms Mode" and enables it by default, so focus mode activates when the virtual cursor lands on an edit field. NVDA switches to focus mode by default when keyboard focus moves to a control (for example with Tab), while switching on virtual cursor movement is an optional setting. This means that some users will never hear the browse-mode version of your form. They'll land on the first input and immediately enter focus mode. Design your forms with this in mind: every input needs a properly associated label, and critical instructions should be linked via aria-describedby rather than relying on surrounding text.

Speakable and browse mode

Speakable's output corresponds to browse mode announcements. It walks through all content linearly, announcing every accessible element in document order. This is the most comprehensive view of what a screen reader user would encounter when first landing on your page and reading through it.

What Gets Announced

When a screen reader encounters an element, it assembles an announcement from the accessibility tree node's role, name, state, and value. A common pattern is Name → Role → State → Value, but the exact ordering and phrasing vary between screen readers.

Example 1

A submit button

HTML: <button>Submit</button>

NVDA: "Submit, button"

VoiceOver: "Submit, button"

role=button, name="Submit"

Example 2

A required email input

HTML: <input type="email" aria-label="Email" required />

NVDA: "Email, edit, required"

VoiceOver: "Email, required, text field"

role=textbox, name="Email", state=required

Example 3

An expanded disclosure widget

HTML: <button aria-expanded="true">Menu</button>

NVDA: "Menu, button, expanded"

JAWS: "Menu, button, expanded"

VoiceOver: "Menu, expanded, pop-up button"

role=button, name="Menu", state=expanded

Notice the differences between screen readers. NVDA typically follows a name → role → state pattern. VoiceOver sometimes places role last, sometimes inserts state between name and role, and uses different terminology (e.g., "pop-up button" instead of just "button"). JAWS closely matches NVDA for many elements but diverges on complex widgets. Narrator has its own patterns on Windows, often being more verbose about states.

These differences aren't bugs. They're design decisions by each screen reader vendor based on user research and conventions in their user communities. As a developer, your job isn't to control the exact output phrasing. Instead, focus on providing the right inputs: correct roles (use semantic HTML), meaningful names (labels and text content), and accurate states (ARIA attributes that reflect current UI state).
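To make the ordering differences concrete, here is a small formatter whose two profiles are read directly off Examples 1 and 2 above. It is only an approximation: the PROFILES table and the announce function are hypothetical, and real screen readers apply many more rules than a fixed field order.

```python
# Field order and role wording taken from Examples 1 and 2 above.
PROFILES = {
    "NVDA": {"order": ("name", "role", "state"), "roles": {"textbox": "edit"}},
    "VoiceOver": {"order": ("name", "state", "role"), "roles": {"textbox": "text field"}},
}


def announce(reader: str, role: str, name: str, states: tuple[str, ...] = ()) -> str:
    """Formats one node's role, name, and states for the given reader profile."""
    profile = PROFILES[reader]
    fields = {
        "name": [name],
        "role": [profile["roles"].get(role, role)],  # fall back to the raw role
        "state": list(states),
    }
    parts = [part for key in profile["order"] for part in fields[key] if part]
    return ", ".join(parts)


for reader in PROFILES:
    print(reader, "->", announce(reader, "textbox", "Email", ("required",)))
```

The same inputs (role, name, states) feed both profiles, which is the point of the paragraph above: developers control the inputs, and each screen reader decides the phrasing.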

Beyond individual elements, screen readers also announce context changes: entering and leaving landmarks ("navigation region"), list boundaries ("list, 5 items"), table dimensions ("table, 3 columns, 10 rows"), and heading levels ("heading level 2"). These structural announcements help users build a mental model of your page's organization.
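The structural announcements above are derived from the shape of the tree rather than from any single attribute. A minimal sketch, assuming nodes are plain dictionaries and that rows are direct children of the table (real tables may nest rows inside row groups); the describe_container function is hypothetical and the phrasing follows the examples in the text.

```python
def describe_container(node: dict) -> str:
    """Returns the context announcement for a list, table, or heading node."""
    role = node["role"]
    children = node.get("children", [])
    if role == "list":
        count = sum(1 for child in children if child["role"] == "listitem")
        return f"list, {count} item{'s' if count != 1 else ''}"
    if role == "table":
        rows = [child for child in children if child["role"] == "row"]
        columns = max((len(row.get("children", [])) for row in rows), default=0)
        return f"table, {columns} columns, {len(rows)} rows"
    if role == "heading":
        return f"heading level {node['level']}"
    return role


menu = {"role": "list", "children": [{"role": "listitem"} for _ in range(5)]}
grid = {
    "role": "table",
    "children": [
        {"role": "row", "children": [{"role": "cell"} for _ in range(3)]}
        for _ in range(10)
    ],
}

print(describe_container(menu))
print(describe_container(grid))
print(describe_container({"role": "heading", "level": 2}))
```

Because the counts come from the tree, markup that breaks the structure (for example list items outside a list element) also breaks these announcements.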

Understanding what gets announced also means understanding what does not get announced. Decorative images (with alt="") and elements with aria-hidden="true" are skipped entirely. Elements with role="presentation" lose their own semantics, although their text content is still read. CSS-generated content (via ::before and ::after) is handled inconsistently: some screen readers announce it, others don't. Avoid relying on pseudo-elements for meaningful content.

How Speakable Fits In

Speakable performs static analysis of HTML. It replicates step 2 of the pipeline described above and then approximates step 4: given HTML input, it builds an accessibility tree representation, computes accessible names using the same algorithm browsers use (the Accessible Name and Description Computation), and models how each screen reader would format the resulting information into speech.
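The precedence at the heart of accessible name computation can be sketched as follows. This is a heavily reduced illustration, not Speakable's implementation and not the full algorithm: the accessible_name function is hypothetical, elements are plain dictionaries, and the "label" key stands for the text of an associated <label> that is assumed to be resolved already.

```python
# Roles that may take their name from text content (small illustrative subset).
NAME_FROM_CONTENT = {"button", "link", "heading"}


def accessible_name(el: dict, by_id: dict) -> str:
    """Returns the first non-empty name source, in precedence order."""
    # 1. aria-labelledby: join the text of every referenced element.
    ids = el.get("aria-labelledby", "").split()
    referenced = " ".join(by_id[i].get("text", "") for i in ids if i in by_id).strip()
    if referenced:
        return referenced
    # 2. aria-label.
    if el.get("aria-label", "").strip():
        return el["aria-label"].strip()
    # 3. Native labelling, e.g. the text of an associated <label>.
    if el.get("label", "").strip():
        return el["label"].strip()
    # 4. Text content, only for roles that allow naming from content.
    if el.get("role") in NAME_FROM_CONTENT and el.get("text", "").strip():
        return el["text"].strip()
    # 5. title attribute as a last resort.
    return el.get("title", "").strip()


page = {"billing": {"text": "Billing address"}}
field = {"role": "textbox", "aria-labelledby": "billing", "aria-label": "Ignored"}

print(accessible_name(field, page))
print(accessible_name({"role": "button", "text": "Submit"}, page))
```

An empty result from a function like this is the static signal behind a "missing label" finding: an interactive element that reaches the end of the list without a name.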

Best for

  • Catching structural issues (missing labels, broken landmark hierarchy, invalid ARIA)
  • Regression testing: detecting when a refactor changes what screen readers would announce
  • CI/CD integration: automated checks on every pull request
  • Fast iteration: immediate feedback without launching a screen reader
  • Education: understanding what screen readers receive from your HTML
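The regression-testing idea in the list above amounts to diffing predicted announcements against a stored baseline. The sketch below uses only Python's standard difflib module; the announcement lists are hand-written stand-ins, the announcement_diff helper is hypothetical, and Speakable's actual API and output format are not shown here.

```python
import difflib


def announcement_diff(baseline: list[str], current: list[str]) -> list[str]:
    """Returns unified-diff lines; an empty list means nothing changed."""
    return list(
        difflib.unified_diff(
            baseline, current, fromfile="baseline", tofile="current", lineterm=""
        )
    )


# Stand-in data: a refactor replaced <button> with a <div>, dropping the role.
baseline = ["Menu, button, expanded", "Submit, button"]
current = ["Menu, button, expanded", "Submit"]

diff = announcement_diff(baseline, current)
print("\n".join(diff))
```

In a CI job, a non-empty diff would fail the check and show exactly which announcement changed, which is easier to review than a raw HTML diff.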

Limitations

  • Doesn't execute JavaScript: dynamic content changes won't be captured
  • Doesn't simulate user interaction (focus, click, type): no focus-mode testing
  • Can't detect timing issues (live regions that fire too fast, race conditions)
  • Screen reader heuristics vary by version: predictions are approximations

Speakable is a complement to, not a replacement for, testing with actual screen readers and real users. Think of it like a linter for accessibility: it catches a wide class of issues automatically and gives you fast feedback, but it can't fully replicate the experience of navigating your app with assistive technology. The testing pyramid for accessibility looks like this:

Accessibility Testing Pyramid
                    ╱╲
                   ╱  ╲         User testing with disabled people
                  ╱    ╲        (highest fidelity, lowest frequency)
                 ╱──────╲
                ╱        ╲      Manual screen reader testing
               ╱          ╲    (NVDA, VoiceOver, JAWS)
              ╱────────────╲
             ╱              ╲   Automated tools: Speakable, axe, Lighthouse
            ╱                ╲  (fast feedback, CI/CD, broad coverage)
           ╱──────────────────╲
          ╱                    ╲ Semantic HTML & ARIA linting
         ╱______________________╲(foundation — catches basics early)

Speakable sits in the automation layer. It gives development teams the confidence to ship accessible HTML by providing continuous validation. When it flags an issue, you know something is wrong structurally. When it passes, you have a strong baseline, but you should still validate complex interactions with manual testing.

For a practical guide on integrating Speakable into your workflow, including snapshot testing, CI pipeline configuration, and combining it with manual testing, see the Usage Guide and CI/CD Integration pages.
