How to use
- Enter text in the text box (a sentence, paragraph, or longer passage).
- Select a voice from the Voice list. The list is populated from speechSynthesis.getVoices() and may load a moment after the page opens.
- Adjust Rate, Pitch, and Volume using the sliders.
- Click Speak to start reading.
- Use Pause/Resume to temporarily stop and continue playback.
- Use Stop to cancel speech immediately and clear the queue.
Tip: adding punctuation (commas, periods, parentheses, and line breaks) often improves rhythm and
pronunciation because it gives the synthesizer clearer phrasing boundaries.
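Because the voice list can arrive after the page opens, a script usually has to wait for it. The following is a minimal sketch of one way to do that, not necessarily how this tool is written. The function names whenVoicesReady and voiceLabels are hypothetical; getVoices() and the voiceschanged event are part of the Web Speech API.

```javascript
// Call `onReady(voices)` once the voice list is non-empty.
// `synth` is expected to behave like window.speechSynthesis.
function whenVoicesReady(synth, onReady) {
  const voices = synth.getVoices();
  if (voices.length > 0) {
    onReady(voices); // Already populated, no need to wait.
    return;
  }
  const handler = () => {
    const loaded = synth.getVoices();
    if (loaded.length === 0) return; // Still empty, keep listening.
    synth.removeEventListener("voiceschanged", handler);
    onReady(loaded);
  };
  synth.addEventListener("voiceschanged", handler);
}

// Build display labels such as "Example Voice (en-US)" for a voice list.
function voiceLabels(voices) {
  return voices.map((voice) => `${voice.name} (${voice.lang})`);
}

// Browser-only usage; skipped where speechSynthesis is not defined.
if (typeof speechSynthesis !== "undefined") {
  whenVoicesReady(speechSynthesis, (voices) => {
    console.log(voiceLabels(voices).join("\n"));
  });
}
```

If a target browser never fires voiceschanged, the callback in this sketch would not run, so a short polling fallback may be worth adding after testing on that browser.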
What the browser is doing (high-level)
When you click Speak, the script creates a SpeechSynthesisUtterance object and assigns your
chosen settings. The browser then queues that utterance and streams synthesized audio to your speakers.
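The Speak, Pause/Resume, and Stop behavior described above can be sketched as a small controller. This is an illustration under assumptions, not this page's actual script: createPlayer is a hypothetical name, and cancelling before each new Speak is a design choice of the sketch. The calls speak(), pause(), resume(), and cancel(), and the paused and speaking flags, are part of the Web Speech API.

```javascript
// Minimal playback controller. `synth` should behave like window.speechSynthesis
// and `createUtterance` like (text) => new SpeechSynthesisUtterance(text).
function createPlayer(synth, createUtterance) {
  return {
    speak(text, settings = {}) {
      // Assumption: a new Speak click restarts instead of appending to the queue.
      synth.cancel();
      const utterance = createUtterance(text);
      Object.assign(utterance, settings); // e.g. { voice, rate, pitch, volume }
      synth.speak(utterance); // Adds the utterance to the browser's queue.
      return utterance;
    },
    togglePause() {
      if (synth.paused) synth.resume();
      else if (synth.speaking) synth.pause();
    },
    stop() {
      synth.cancel(); // Stops current speech and empties the queue.
    },
  };
}

// Browser-only usage; skipped where speechSynthesis is not defined.
if (typeof speechSynthesis !== "undefined") {
  const player = createPlayer(
    speechSynthesis,
    (text) => new SpeechSynthesisUtterance(text)
  );
  // Call from a click handler, since some browsers require a user gesture:
  // player.speak("Hello from the Web Speech API.", { rate: 1, pitch: 1, volume: 1 });
}
```

Passing the synthesizer and the utterance factory in as arguments keeps the controller easy to exercise with stand-in objects outside a browser.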
Under the hood, most engines follow a pipeline that looks like this:
- Text normalization: numbers, abbreviations, and symbols are expanded into words (for example, “12” becomes “twelve” in many voices; “Dr.” may become “doctor”).
- Linguistic analysis: the engine identifies words, parts of speech, and likely pronunciations. This is why “lead” can sound different in “lead pipe” vs “lead the team.”
- Phoneme generation: words are mapped to phonemes (the small sound units of a language).
- Prosody: timing, stress, and intonation are applied. Prosody is where punctuation and sentence structure matter most.
- Waveform synthesis: the final audio is generated and played. Modern systems may use neural models; others use concatenative or parametric approaches.
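To make the first stage of that pipeline concrete, here is a toy sketch of text normalization. It is not how any real voice engine works internally: the lookup tables and the function name normalizeForSpeech are invented for illustration, and real engines use much larger, language-specific rules that the browser does not expose.

```javascript
// Toy lookup tables; real engines use far larger, language-specific rules.
const ABBREVIATIONS = new Map([
  ["Dr.", "doctor"],
  ["Mr.", "mister"],
]);
const SMALL_NUMBERS = [
  "zero", "one", "two", "three", "four", "five", "six",
  "seven", "eight", "nine", "ten", "eleven", "twelve",
];

// Expand known abbreviations and whole numbers from 0 to 12 into words.
// Anything else passes through unchanged.
function normalizeForSpeech(text) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => {
      if (ABBREVIATIONS.has(token)) return ABBREVIATIONS.get(token);
      if (/^\d+$/.test(token) && Number(token) < SMALL_NUMBERS.length) {
        return SMALL_NUMBERS[Number(token)];
      }
      return token;
    })
    .join(" ");
}

console.log(normalizeForSpeech("Dr. Lee has 12 cats"));
// prints "doctor Lee has twelve cats"
```

You do not need to do this yourself before calling the API; the sketch only shows the kind of rewriting the synthesizer applies before it reaches pronunciation and prosody.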
Because the Web Speech API is a browser interface to platform voices, the exact behavior varies by operating
system, installed voice packs, and browser version. That variability is normal: the same text can sound
noticeably different across devices.
Understanding the controls
The three sliders map directly to properties on SpeechSynthesisUtterance. They are not “effects”
applied after the fact; they influence how the voice engine generates speech.
- Rate: a multiplier for speaking speed, with 1.0 as the voice's normal pace. Values below 1.0 slow speech down; values above 1.0 speed it up. The API accepts 0.1 to 10, but extremely fast rates can reduce intelligibility.
- Pitch: a relative pitch setting from 0 to 2, with 1.0 as the voice's default. Some voices respond strongly; others change only subtly. Pitch changes can also affect perceived emotion.
- Volume: output amplitude from 0.0 (mute) to 1.0 (max). The property cannot exceed 1.0, so if the maximum is still too quiet, raise your system volume.
Practical workflow: start by choosing a voice that pronounces your language well, then set Rate for
comprehension, then adjust Pitch slightly if needed, and finally set Volume to a comfortable level.
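Slider values arrive as strings, so a script typically converts and range-checks them before assigning them to the utterance. The sketch below assumes the ranges given in the Web Speech API specification (rate 0.1 to 10, pitch 0 to 2, volume 0 to 1, each defaulting to 1). The names LIMITS and clampSettings are hypothetical.

```javascript
// Allowed ranges and defaults, per the Web Speech API specification.
const LIMITS = {
  rate: { min: 0.1, max: 10, fallback: 1 },
  pitch: { min: 0, max: 2, fallback: 1 },
  volume: { min: 0, max: 1, fallback: 1 },
};

// Convert raw slider values (often strings) into numbers inside the allowed
// ranges. Missing or non-numeric values fall back to the default of 1.
function clampSettings(raw) {
  const result = {};
  for (const [name, { min, max, fallback }] of Object.entries(LIMITS)) {
    const value = Number(raw[name]);
    result[name] = Number.isFinite(value)
      ? Math.min(max, Math.max(min, value))
      : fallback;
  }
  return result;
}

// Browser-only usage; skipped where SpeechSynthesisUtterance is not defined.
if (typeof SpeechSynthesisUtterance !== "undefined") {
  const utterance = new SpeechSynthesisUtterance("Testing the sliders.");
  Object.assign(utterance, clampSettings({ rate: "1.2", pitch: "1", volume: "0.8" }));
}
```

Many voices support a narrower usable range than the specification allows, so a tool's sliders may reasonably expose less than the full span.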
The SpeechSynthesis API does not provide an exact duration estimate for an utterance. If you need a rough
planning estimate (for example, timing a narration), you can approximate the reading time from word count.
Let w be the number of words and r be the speaking rate in words per minute (WPM). Then:
Estimated time (seconds)
t = (w / r) × 60
Assumptions: this estimate ignores pauses, punctuation timing, and voice-specific pacing. The Rate slider in
this tool is a multiplier (not WPM), so treat the formula as a conceptual guide unless you calibrate WPM for a
specific voice on your device.
Worked example
Suppose you have a 120-word paragraph and you want to know roughly how long it will take to read aloud.
If you assume a comfortable narration speed of 150 WPM:
- w = 120 words
- r = 150 words/min
- t = (120 / 150) × 60 ≈ 48 seconds
You can use this to decide whether a script fits into a time slot, then fine-tune the Rate slider by ear.
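The same calculation is easy to script. This sketch applies t = (w / r) × 60 and counts words by splitting on whitespace, which is a simplifying assumption. The function names countWords and estimateSeconds are hypothetical.

```javascript
// Count words by splitting on runs of whitespace (a rough approximation).
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// t = (w / r) * 60, where w is the word count and r is words per minute.
function estimateSeconds(wordCount, wordsPerMinute) {
  if (!(wordsPerMinute > 0)) {
    throw new RangeError("wordsPerMinute must be a positive number");
  }
  return (wordCount / wordsPerMinute) * 60;
}

// The worked example: 120 words at 150 WPM gives roughly 48 seconds.
console.log(Math.round(estimateSeconds(120, 150)));
```

In practice you would pass countWords(yourText) as the first argument. The result inherits every caveat of the formula: pauses and punctuation timing are not included.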
If you want to calibrate your device, read a known 200-word sample at Rate 1.0, time it with a stopwatch,
and compute your effective WPM. Then you can reuse that baseline for future estimates.
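The calibration step can be sketched as follows. The function names are hypothetical, and estimateSecondsAtRate assumes that speaking speed scales linearly with the Rate multiplier, which may not hold for every voice. The figure of 80 seconds is an invented stopwatch reading used only for illustration.

```javascript
// Effective words per minute from a timed reading at Rate 1.0.
function effectiveWpm(wordCount, elapsedSeconds) {
  if (!(elapsedSeconds > 0)) {
    throw new RangeError("elapsedSeconds must be a positive number");
  }
  return (wordCount * 60) / elapsedSeconds;
}

// Estimate duration at another Rate setting.
// Assumption: effective WPM is baselineWpm multiplied by rate.
function estimateSecondsAtRate(wordCount, baselineWpm, rate) {
  if (!(baselineWpm > 0) || !(rate > 0)) {
    throw new RangeError("baselineWpm and rate must be positive numbers");
  }
  return (wordCount / (baselineWpm * rate)) * 60;
}

// Hypothetical measurement: a 200-word sample took 80 seconds at Rate 1.0.
const baseline = effectiveWpm(200, 80); // 150 WPM
console.log(baseline, estimateSecondsAtRate(120, baseline, 1.25).toFixed(1));
```

If estimates at other Rate values drift from what you hear, time a second sample at that Rate and use its measured WPM directly instead of relying on the linear assumption.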
These are simple estimates for planning. Actual playback time varies by voice, language, punctuation, and how
the browser implements timing.
Small edits to your text can dramatically improve how it sounds when synthesized. If a sentence feels rushed
or robotic, try one or more of the following adjustments and listen again.
If you are proofreading, a useful technique is to listen at a slightly faster Rate (for example 1.1–1.3) to
catch repeated words, then slow down (0.9–1.0) to evaluate clarity and emphasis.
Text-to-speech is a key accessibility technology, but it is not a substitute for a full screen reader.
Screen readers provide navigation, focus management, and semantic interpretation (headings, landmarks,
form controls, and ARIA roles). This tool is best viewed as a supplement for listening to content.
If you are a developer using TTS in an application, consider these inclusive design practices:
This page does not include analytics scripts and does not transmit your typed text to a server as part of its
core functionality. However, the voices themselves are provided by your operating system and browser.
Some platforms may download voice data packs or use cloud-enhanced voices depending on system settings.
If you have strict privacy requirements, test on the exact device and configuration you plan to use.
For sensitive content, consider using offline voices and disabling any optional cloud voice features at the OS
level. Also remember that audio can be overheard; use headphones when reading confidential material.
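Each voice object in the Web Speech API carries a localService flag that reports whether the browser considers the voice to be provided locally rather than by a remote service. A minimal sketch for narrowing the list to such voices follows. The function name pickLocalVoices is hypothetical, and the flag is only the browser's own report, so it does not replace testing on your actual device and settings.

```javascript
// Keep voices the browser reports as local, optionally filtered by a
// language prefix such as "en" or "en-GB" (case-insensitive).
function pickLocalVoices(voices, langPrefix = "") {
  const prefix = langPrefix.toLowerCase();
  return voices.filter(
    (voice) =>
      voice.localService === true &&
      voice.lang.toLowerCase().startsWith(prefix)
  );
}

// Browser-only usage; skipped where speechSynthesis is not defined.
// Note: the list may be empty until the browser has finished loading voices.
if (typeof speechSynthesis !== "undefined") {
  const localEnglish = pickLocalVoices(speechSynthesis.getVoices(), "en");
  console.log(localEnglish.map((voice) => voice.name));
}
```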
If you are exploring speech synthesis for development, the Web Speech API is a practical starting point.
It is easy to prototype with, requires no server, and demonstrates core concepts like voice selection,
utterance configuration, and playback control. For production-grade narration (especially for long-form audio),
you may also evaluate dedicated TTS services, but for quick local reading and accessibility experiments,
browser synthesis is often enough.