The world’s writing systems encompass thousands of symbols. Early computers, built around limited memory and designed primarily for English, used encodings like ASCII that mapped just 128 characters to numbers. As computing spread globally, this proved insufficient. The Unicode Consortium, founded in 1991, set out to assign every character used in writing, mathematics, and technical notation a unique number called a code point. Today, Unicode specifies more than one hundred forty thousand characters across scripts from Latin and Cyrillic to Egyptian hieroglyphs and emoji. Each code point provides a stable identifier enabling consistent text interchange across platforms, programming languages, and operating systems.
A Unicode code point is conventionally written as U+XXXX, where XXXX is a hexadecimal number. For example, the smiling face emoji 😀 corresponds to U+1F600. Internally, computers encode these numbers using schemes such as UTF‑8 or UTF‑16, which break the code point into bytes or 16‑bit units. JavaScript strings use UTF‑16, meaning each character is stored as one or two 16‑bit code units. Characters beyond the Basic Multilingual Plane (with code points above U+FFFF) require a pair of surrogate code units. This inspector illustrates these relationships by displaying binary, decimal, and hexadecimal values for the character you enter.
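The difference between code units and code points is easy to see in a JavaScript console. A minimal sketch, assuming a Node or browser environment:

```javascript
const face = "😀"; // U+1F600, outside the Basic Multilingual Plane

// charCodeAt reads individual 16-bit code units: here, the surrogate pair.
console.log(face.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(face.charCodeAt(1).toString(16)); // "de00" (low surrogate)

// codePointAt reassembles the pair into the full code point.
console.log(face.codePointAt(0).toString(16)); // "1f600"
```

Both methods are standard `String` methods; `codePointAt` is the one to reach for when you care about characters rather than storage units.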
Suppose you type the letter A. Its code point is U+0041, and UTF‑16 stores it as a single 16‑bit unit, 0x0041. Entering the musical score emoji 🎼 (U+1F3BC) is more complex: UTF‑16 represents it as the two surrogate units 0xD83C and 0xDFBC. To convert a code point above U+FFFF into a surrogate pair, subtract 0x10000 from it; the high surrogate is 0xD800 plus the top ten bits of the result, and the low surrogate is 0xDC00 plus the bottom ten bits.
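The surrogate-pair conversion can be sketched in a few lines of JavaScript; the function name `toSurrogatePair` is illustrative, not part of the inspector's actual code:

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1f3bc); // 🎼
console.log(hi.toString(16), lo.toString(16)); // "d83c" "dfbc"
console.log(String.fromCharCode(hi, lo)); // "🎼"
```

`String.fromCharCode` accepts raw code units, so feeding it the pair reconstructs the original character.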
Understanding the distinction between code points and code units helps avoid bugs when slicing strings or counting characters. JavaScript's `length` property reports the number of code units, not code points, so characters outside the Basic Multilingual Plane count as two.
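The counting pitfall described above, sketched concretely:

```javascript
const s = "A🎼";

// length counts 16-bit code units: 1 for "A", 2 for the surrogate pair.
console.log(s.length); // 3

// Spreading iterates by code point, giving the count most users expect.
console.log([...s].length); // 2

// Slicing by code unit can split the surrogate pair,
// leaving "A" followed by an unpaired high surrogate.
console.log(s.slice(0, 2));
```

Iterating with `for...of` or the spread operator is the safest way to walk a string character by character.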
The form above accepts either a character or its code point. If both are provided, the code point takes precedence. When you press the button, the script parses the input. Hexadecimal values may include an optional `0x` prefix; decimal values are also accepted. The resulting code point appears in decimal, hexadecimal, and binary. The corresponding character is displayed, and a simple classification identifies whether it is a letter, digit, whitespace, control character, or symbol using regular expressions with Unicode property escapes (`/\p{L}/u` for letters, `/\p{Nd}/u` for digits, and so forth). These classifications demonstrate Unicode's rich property system that goes beyond mere numbering.
The table below lists a few representative characters to illustrate the diversity of Unicode. Each row shows the glyph, its code point, decimal value, and category.
Character | Code Point | Decimal | Category
---|---|---|---
A | U+0041 | 65 | Letter
Ω | U+03A9 | 937 | Letter
Я | U+042F | 1071 | Letter
中 | U+4E2D | 20013 | Letter
😀 | U+1F600 | 128512 | Symbol
While Unicode assigns numbers and properties, it does not dictate appearance. Rendering a character involves selecting a font that contains a glyph for the code point. If a font lacks a glyph, systems display a placeholder such as a box or question mark. This explains why some emoji appear in color on certain platforms but monochrome elsewhere. In web development, specifying a font stack ensures that common characters render consistently. Fonts like Noto attempt to cover the entire range of Unicode, but no font includes every glyph. The inspector relies on your system’s fonts, so results may vary.
Many characters can be represented in multiple ways. For example, the character “é” may appear as the single code point U+00E9 or as U+0065 (the letter e) followed by the combining acute accent U+0301. Unicode normalization forms such as NFC (Normalization Form C) and NFD (Normalization Form D) provide standardized ways to compose or decompose such sequences. Normalization is critical when comparing user input, processing search queries, or ensuring consistent storage. The inspector applies NFC normalization before analysis so that visually identical strings produce the same code point sequence.
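JavaScript exposes normalization through the standard `String.prototype.normalize` method, so the behavior described above can be sketched directly:

```javascript
const composed = "\u00e9";    // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

// The two render identically but are different code point sequences.
console.log(composed === decomposed); // false

// After NFC normalization, both collapse to the same sequence.
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true

// NFD decomposes the single code point into two.
console.log(composed.normalize("NFD").length); // 2
```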
Although the inspector works with abstract code points, actual data transmission relies on encodings. UTF‑8 encodes numbers using variable-length byte sequences. For code points below U+0080, UTF‑8 uses one byte identical to ASCII. Higher values require two, three, or four bytes. UTF‑16, used by many programming languages, employs one or two 16‑bit units. UTF‑32 represents each code point with a fixed 32‑bit integer, simplifying indexing but wasting space. The choice of encoding balances storage efficiency with ease of processing. This calculator does not display encoding bytes, but understanding them helps diagnose issues when transferring text between systems.
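The variable byte lengths are easy to verify with the standard `TextEncoder` API, which always produces UTF‑8:

```javascript
const encoder = new TextEncoder(); // encodes to UTF-8

console.log(encoder.encode("A").length);      // 1 byte (below U+0080, same as ASCII)
console.log(encoder.encode("\u00e9").length); // 2 bytes (é)
console.log(encoder.encode("中").length);     // 3 bytes
console.log(encoder.encode("😀").length);     // 4 bytes
```

`TextEncoder` is available globally in browsers and in Node.js, which makes it a convenient way to diagnose the byte-level issues mentioned above.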
Unicode’s expressiveness can be exploited for mischief. Attackers may use visually similar characters (homoglyphs) to spoof URLs or variable names, a technique known as an “IDN homograph attack.” Mixed-script confusables, such as a Latin a paired with a Cyrillic а, can deceive users. Unicode also contains directional control characters that can reorder text, potentially obfuscating source code. Recognizing the numeric identity of each character helps detect such tricks. When displaying user-generated content, it is prudent to normalize strings and, in security-sensitive contexts, restrict allowed scripts.
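A sketch of how code point identity exposes a homoglyph, using the Script property escape available in modern JavaScript regexes; the example string is hypothetical:

```javascript
const latinA = "a";        // U+0061 LATIN SMALL LETTER A
const cyrillicA = "\u0430"; // U+0430 CYRILLIC SMALL LETTER A, visually identical

console.log(latinA === cyrillicA); // false
console.log(cyrillicA.codePointAt(0).toString(16)); // "430"

// A simple mixed-script check: flag Cyrillic letters hiding in Latin text.
const spoofed = "p\u0430ypal"; // second letter is Cyrillic
console.log(/\p{Script=Cyrillic}/u.test(spoofed)); // true
```

Real confusable detection uses curated data tables, but even this crude check catches the most common Latin/Cyrillic spoofs.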
Unicode is not static. New versions appear regularly, adding scripts, symbols, and emoji. Each update reflects cultural needs and technological developments. For instance, recent releases have incorporated pictographs representing diverse professions, accessibility symbols, and new scientific notation. When processing text, it is wise to reference the version of Unicode supported by your environment. Browsers generally track the latest standard, but embedded systems may lag. This inspector assumes current support yet operates gracefully if a glyph is missing, displaying a fallback character.
The effort to encode humanity’s writing and symbols is monumental, intertwining linguistics, typography, and computing. By exploring code points interactively, you appreciate the structure behind everyday text. Whether you are debugging a string bug, internationalizing an application, or satisfying curiosity about a symbol’s identity, the Unicode Code Point Inspector offers a convenient window into the heart of digital text.