Unicode Code Point Inspector

JJ Ben-Joseph

What this Unicode Code Point Inspector does

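The Unicode Code Point Inspector lets you move back and forth between characters (like A, é, or 😀) and their underlying numeric identifiers. For any valid Unicode character or code point it shows the normalized code point, its decimal, hex, and binary values, its UTF‑16 code units, and a basic category label.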

This is useful when you are debugging encoding problems, writing regular expressions with Unicode support, working with emoji, or just learning how Unicode works internally.

How to use this inspector

The form accepts either a single character or a numeric code point. You can use whichever is more convenient.

1. Enter a character

In the Character field, type or paste exactly one user‑perceived character. Examples:

  • Latin letter: A, é, ß
  • Emoji: 😀, 🎼, 🚀
  • Non‑Latin script: م, Ж
  • Common symbol: ©

The inspector will derive the code point from that character and show all numeric representations.
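As a rough sketch of that derivation (the function name is hypothetical and this is not necessarily how the inspector itself is written), JavaScript can read the code point with the built-in codePointAt method. Note that this sketch checks for exactly one code point, which is a stricter test than one user‑perceived character:

```javascript
// Hypothetical helper: returns the code point of a one-code-point input.
function codePointOfSingleCharacter(input) {
  // Spreading a string iterates by code point, so a surrogate pair counts once.
  const codePoints = [...input];
  if (codePoints.length !== 1) {
    throw new RangeError(`expected exactly one code point, got ${codePoints.length}`);
  }
  return codePoints[0].codePointAt(0);
}
```

Calling codePointOfSingleCharacter("A") returns 65, and an input such as "ab" is rejected with a RangeError.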

2. Enter a code point

In the Code Point field, you can enter the numeric value directly in one of several formats:

  • Standard Unicode notation: U+1F600, U+00E9
  • Hex with 0x prefix: 0x1F600, 0x41
  • Plain hexadecimal: 1F600, 0041
  • Decimal: 128512, 65

If you fill in both fields, the code point takes precedence. This is helpful if the character field contains multiple characters or whitespace by accident.
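The accepted formats can be sketched as a small parser. This is an illustration under stated assumptions, not the inspector's actual code: the function name and the bareRadix parameter are hypothetical, and because an unprefixed value such as 65 is ambiguous between hexadecimal and decimal, the sketch asks the caller to choose instead of guessing which rule the tool applies.

```javascript
// Hypothetical parser for the formats listed above.
// Assumption: unprefixed digits are ambiguous, so the caller picks the radix
// (16 or 10); the tool's own disambiguation rule is not documented here.
function parseCodePointInput(text, bareRadix = 16) {
  const trimmed = text.trim();
  // "U+1F600" and "0x1F600" are always hexadecimal.
  const prefixed = /^(?:U\+|0x)([0-9a-f]+)$/i.exec(trimmed);
  let value;
  if (prefixed) {
    value = parseInt(prefixed[1], 16);
  } else if (bareRadix === 10 && /^[0-9]+$/.test(trimmed)) {
    value = parseInt(trimmed, 10);
  } else if (bareRadix === 16 && /^[0-9a-f]+$/i.test(trimmed)) {
    value = parseInt(trimmed, 16);
  } else {
    throw new SyntaxError(`unrecognized code point: "${text}"`);
  }
  // Unicode ends at U+10FFFF; larger values are invalid.
  if (value > 0x10FFFF) {
    throw new RangeError(`code point out of range: ${value}`);
  }
  return value;
}
```

For example, parseCodePointInput("U+1F600"), parseCodePointInput("1F600"), and parseCodePointInput("128512", 10) all yield the same number.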

3. Run the inspection

After entering your value, activate the Inspect button. The page will show:

  • The normalized code point (e.g., U+1F600).
  • Decimal, hex, and binary values.
  • UTF‑16 code units (one value for BMP characters, two for supplementary characters, which are stored as a surrogate pair).
  • A basic category label derived from Unicode properties.
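The numeric fields in this list are all views of one integer. A minimal sketch, assuming a hypothetical describeCodePoint helper and property names chosen only for illustration:

```javascript
// Hypothetical helper: formats one code point in the notations listed above.
function describeCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase();
  return {
    // U+ notation pads to at least four hex digits (U+0041, U+1F600).
    notation: "U+" + hex.padStart(4, "0"),
    decimal: cp.toString(10),
    hex,
    binary: cp.toString(2),
  };
}
```

describeCodePoint(0x1F600) gives the notation U+1F600 and the decimal string 128512; describeCodePoint(0xE9) gives U+00E9.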

What is a Unicode code point?

Unicode assigns every character a unique number called a code point. Conceptually, the Unicode space is a numbered list from U+0000 to U+10FFFF. Each position may represent a letter, digit, punctuation mark, symbol, emoji, or a special non‑printing control code.

By convention, code points are written as U+HHHH where HHHH is a hexadecimal number. For example:

  • A → U+0041
  • é → U+00E9
  • 😀 (grinning face) → U+1F600

Internally, computers still store bytes, not abstract code points. Encoding schemes such as UTF‑8 and UTF‑16 map each code point to one or more underlying code units, which are then expressed as bytes in memory or on disk.
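To make the difference between code points, code units, and bytes concrete, here is a small sketch using the standard TextEncoder API (available in browsers and recent Node.js versions). The helper names are hypothetical:

```javascript
// Hypothetical helpers comparing the two encodings of the same text.
function utf8Bytes(text) {
  // TextEncoder always produces UTF-8.
  return Array.from(new TextEncoder().encode(text));
}

function utf16Units(text) {
  const units = [];
  // charCodeAt reads one 16-bit code unit at a time.
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i));
  }
  return units;
}
```

With these helpers, é (U+00E9) is one UTF‑16 unit but two UTF‑8 bytes (C3 A9), while 😀 (U+1F600) is two UTF‑16 units and four UTF‑8 bytes (F0 9F 98 80).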

Formula for surrogate pairs (UTF‑16)

For code points above U+FFFF, UTF‑16 uses surrogate pairs. If CP is a code point in the range U+10000 to U+10FFFF, the transformation from CP to the two UTF‑16 units can be expressed formally.

The core relationship can be written as:

$$v = \mathrm{CP} - 65536$$

Then the high and low surrogates are:

$$\mathrm{high} = \mathtt{0xD800} + \left\lfloor \frac{v}{1024} \right\rfloor, \qquad \mathrm{low} = \mathtt{0xDC00} + (v \bmod 1024)$$

In words:

  1. Subtract 0x10000 (65536) from the code point.
  2. Divide the result by 1024, discarding any fraction. The integer quotient, added to 0xD800, gives the high surrogate.
  3. The remainder, added to 0xDC00, gives the low surrogate.

The inspector applies this logic under the hood when it displays UTF‑16 units for characters outside the BMP.
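The three steps translate directly into code. The following is a sketch with hypothetical function names, not the inspector's own source; the inverse function is included only to show that the mapping is reversible:

```javascript
// Hypothetical helper: splits a supplementary code point into UTF-16 surrogates.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("surrogate pairs only encode U+10000 to U+10FFFF");
  }
  const v = cp - 0x10000;                      // step 1: subtract 0x10000
  const high = 0xD800 + Math.floor(v / 1024);  // step 2: integer quotient
  const low = 0xDC00 + (v % 1024);             // step 3: remainder
  return [high, low];
}

// Hypothetical inverse: recombines a surrogate pair into a code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + (high - 0xD800) * 1024 + (low - 0xDC00);
}
```

For U+1F600 this produces the pair 0xD83D, 0xDE00, and fromSurrogatePair(0xD83D, 0xDE00) returns 0x1F600 again.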

Worked example: 😀 (U+1F600)

Suppose you paste 😀 into the Character field and click Inspect. The tool will find its code point and derive related data:

  1. The Unicode code point is U+1F600.
  2. In decimal, this is 128512.
  3. In binary, it is 0001 1111 0110 0000 0000 (17 significant bits, zero‑padded to 20 and grouped in fours for readability).
  4. Because it is greater than U+FFFF, UTF‑16 uses a surrogate pair.

Following the surrogate pair formula:

  • CP = 0x1F600.
  • v = CP − 0x10000 = 0xF600.
  • high = 0xD800 + floor(v / 0x400) = 0xD800 + 0x3D = 0xD83D.
  • low = 0xDC00 + (v mod 0x400) = 0xDC00 + 0x200 = 0xDE00.

The inspector presents these UTF‑16 units so that you can see why JavaScript sees this character as length 2, and how it will look when encoded as bytes.
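You can cross‑check this example against the JavaScript runtime itself. The short sketch below uses only built‑in string methods; the variable names are illustrative:

```javascript
// Build the character from its code point, then read back its UTF-16 units.
const grinning = String.fromCodePoint(0x1F600);
const observedUnits = [grinning.charCodeAt(0), grinning.charCodeAt(1)]
  .map((unit) => unit.toString(16).toUpperCase());

// grinning.length is 2 and observedUnits is ["D83D", "DE00"].
```

Reading the string with grinning.codePointAt(0) returns the original code point, 0x1F600, because codePointAt combines the two surrogates.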

Interpreting the inspector output

Once you run the tool, you will usually see several distinct fields in the results. Here is how to read them:

  • Character — The visual representation of the code point. Non‑printing characters may be shown as a placeholder or may appear blank.
  • Unicode code point — The normalized U+HHHH form. Use this for documentation or specifications.
  • Decimal value — The base‑10 version of the same number, useful for APIs or protocols that expect decimal input.
  • Hex value — The base‑16 representation (without U+), often used in programming, terminals, and encoding tables.
  • Binary value — The bit pattern of the code point, which can help you understand how it maps to lower‑level encodings.
  • UTF‑16 units — One or two 16‑bit values that show how this character appears in JavaScript strings and many internal APIs.
  • Category — A simplified label such as letter, digit, whitespace, control, or symbol, based on Unicode property tests.
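One way such a simplified category label could be produced is with Unicode property escapes in regular expressions. The sketch below is an assumption about the approach, not the inspector's actual logic: the function name, the set of labels, and the order of the tests are all hypothetical.

```javascript
// Hypothetical classifier using Unicode property escapes (requires the u flag).
function simpleCategory(cp) {
  const ch = String.fromCodePoint(cp);
  // Control is tested first because line breaks are also White_Space.
  if (/^\p{Cc}$/u.test(ch)) return "control";
  if (/^\p{White_Space}$/u.test(ch)) return "whitespace";
  if (/^\p{L}$/u.test(ch)) return "letter";
  if (/^\p{Nd}$/u.test(ch)) return "digit";
  if (/^\p{P}$/u.test(ch)) return "punctuation";
  if (/^\p{S}$/u.test(ch)) return "symbol";
  return "other";
}
```

Under these rules A is a letter, 5 is a digit, a line feed is a control, and 😀 is a symbol. A different ordering would label line breaks as whitespace instead, which is why the real tool's labels may differ.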

Common representations compared

The same Unicode character can appear in different notations depending on context. The table below outlines some of the most common ways to express a code point and how they relate to the inspector’s outputs.

Context | Example notation | Relationship to inspector output
Unicode standard | U+1F600 | Matches the inspector’s normalized code point field.
Hex literal in code | 0x1F600 | Same numeric value as the hex output, using a language‑specific prefix.
Decimal code | 128512 | Matches the inspector’s decimal value.
JavaScript escape | "\u{1F600}" | Uses the same code point; older syntax may show surrogate units like "\uD83D\uDE00".
HTML entity | &#128512; or &#x1F600; | These numeric character references are based on the decimal and hex outputs.
UTF‑16 units | D83D DE00 | Corresponds to the inspector’s UTF‑16 code unit field.
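The notations in the table can all be generated from the code point. A minimal sketch, with a hypothetical function name and property names:

```javascript
// Hypothetical helper: builds the escape notations from the table for one code point.
function escapeForms(cp) {
  const hex = cp.toString(16).toUpperCase();
  const ch = String.fromCodePoint(cp);
  const unitEscapes = [];
  // One \uXXXX escape per UTF-16 code unit (two for supplementary characters).
  for (let i = 0; i < ch.length; i++) {
    const unitHex = ch.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
    unitEscapes.push("\\u" + unitHex);
  }
  return {
    jsCodePointEscape: "\\u{" + hex + "}",
    jsCodeUnitEscape: unitEscapes.join(""),
    htmlDecimal: "&#" + cp + ";",
    htmlHex: "&#x" + hex + ";",
  };
}
```

For U+1F600 the four properties hold the text \u{1F600}, \uD83D\uDE00, the decimal reference &#128512; and the hex reference &#x1F600; respectively.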

Practical uses and limitations

This inspector is designed as a lightweight, browser‑based helper. It is powerful enough for everyday development and learning tasks, but it does make some assumptions.

Practical uses

  • Debugging encoding problems — When a character does not display as expected, check whether the code point and UTF‑16 units match what you expect.
  • Working with emoji — See why emoji and other supplementary characters occupy two UTF‑16 units, and verify that your tooling handles them correctly.
  • Regular expressions with Unicode — Use the category information to decide whether to use constructs like \p{L} (letters) or \p{Nd} (decimal digits) in Unicode‑aware regex engines.
  • Generating escape sequences — Convert a visible character into the numeric form required by HTML entities, JavaScript, or other languages.

Limitations and assumptions

  • Supported range — The tool targets the standard Unicode range from U+0000 to U+10FFFF. Values outside this range are considered invalid.
  • Unassigned or deprecated code points — If you manually enter a code point that is not currently assigned to a character in Unicode, the tool will still treat it as a numeric value. The character display may be empty or show a generic replacement symbol, depending on your system fonts.
  • Non‑printing characters — Control codes (such as line breaks) and formatting characters do not have a visible glyph. The inspector may appear to show “nothing” in the character field even though the code point is recognized.
  • No normalization — The tool does not perform Unicode normalization (NFC, NFD, etc.). Canonically equivalent sequences (for example, a precomposed é versus e plus combining accent) are treated as distinct sequences of code points.
  • Single code point focus — The inspection logic is oriented around a single code point at a time. Grapheme clusters made of multiple code points (such as flags, family emoji, or letter+combining mark) will only be inspected one code point at a time, based on the script’s parsing rules.
  • Environment‑dependent fonts — Whether a character looks correct, appears as a colored emoji, or shows a missing‑glyph box depends on your browser and installed fonts, not on the inspector itself.

Keeping these limitations in mind will help you interpret the results accurately and avoid misdiagnosing encoding issues.
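The normalization and single‑code‑point limitations are easy to reproduce yourself. The sketch below uses only built‑in string features; the variable and function names are illustrative, and the normalize call is something you would run in your own code, since the inspector does not do it for you:

```javascript
const precomposed = "\u00E9";        // é as one code point
const decomposed = "e\u0301";        // e followed by a combining acute accent
const flag = "\u{1F1EF}\u{1F1F5}";   // one flag emoji built from two regional indicators

// Hypothetical helper: lists the code points of a string in U+ notation.
function codePointsOf(text) {
  return [...text].map((ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
```

Here precomposed === decomposed is false until you normalize one side, for example with decomposed.normalize("NFC"). codePointsOf(decomposed) lists U+0065 and U+0301, and codePointsOf(flag) lists two code points even though a single flag is displayed.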

Code points vs. UTF‑16 code units

This tool focuses on how a code point is represented in UTF‑16, the encoding used by JavaScript strings and many APIs. In UTF‑16:

  • The Unicode range is split into the Basic Multilingual Plane (BMP), from U+0000 to U+FFFF, and supplementary planes, from U+10000 to U+10FFFF.
  • BMP characters use one 16‑bit code unit. For example, A (U+0041) is stored as a single unit 0041.
  • Supplementary characters (including most emoji) use two 16‑bit code units called a surrogate pair.

JavaScript’s string.length property counts UTF‑16 code units, not code points. That means characters above U+FFFF are counted as length 2. The inspector helps you see this clearly by displaying the UTF‑16 units next to the code point.
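The difference between the two counts can be seen with a short sketch (the function name is hypothetical). Spreading a string iterates by code point, while length counts code units:

```javascript
// Hypothetical helper: compares UTF-16 code units with code points.
function countUnitsAndCodePoints(text) {
  return {
    codeUnits: text.length,        // what string.length reports
    codePoints: [...text].length,  // string iteration steps by code point
  };
}
```

For "A" both counts are 1. For "😀" the result is 2 code units and 1 code point.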

Next steps

Once you are comfortable with the numeric side of Unicode, you can apply what you learn here directly in code, markup, and debugging sessions. Use the inspector as a quick reference whenever you need to confirm a code point, generate an escape sequence, or understand how a particular character is stored in UTF‑16.
