Unicode General Categories

Every Unicode character is assigned a general category: a high-level classification such as Letter, Number, or Symbol. Regular expression engines that support Unicode property escapes (\p{L}, \p{N}) and many text processing algorithms rely on these categories.
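As a minimal sketch of looking up a category programmatically, the following uses Python's standard unicodedata module, whose category() function returns the two-letter abbreviation of a character's general category (for example Ll for Lowercase Letter). The helper names general_category and major_class are made up for illustration. Since not every regular expression engine accepts \p{...}, the sketch compares abbreviations directly.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter general category abbreviation of one character."""
    return unicodedata.category(ch)


def major_class(ch: str) -> str:
    """Return the first letter of the abbreviation: L, M, N, P, S, Z, or C."""
    return general_category(ch)[0]


if __name__ == "__main__":
    for ch in ["a", "A", "5", "_", "$", " ", "\t"]:
        print(repr(ch), general_category(ch), major_class(ch))
```

The first letter of the abbreviation identifies the top-level group (Letter, Mark, Number, Punctuation, Symbol, Separator, Other), which is roughly what a single-letter property escape such as \p{L} selects.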

Letter

Lowercase Letter

Letters that are lowercase, such as a, b, c.

Modifier Letter

Spacing (non-combining) letters that typically modify the preceding letter, such as ʰ (U+02B0).

Other Letter

Letters that are not uppercase, lowercase, titlecase, or modifier letters, such as ideographs and letters of scripts without case.

Titlecase Letter

Digraph letters in titlecase form, where the first component is uppercase and the rest lowercase, such as ǅ (U+01C5).

Uppercase Letter

Letters that are uppercase, such as A, B, C.

Mark

Spacing Mark

Marks that take up space when rendered, such as some vowel signs.

Enclosing Mark

Marks that surround or enclose other characters.

Non-Spacing Mark

Marks that do not take up space, typically accents and diacritics.
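One common use of the mark categories is removing accents. The sketch below assumes Python's standard unicodedata module; the function name strip_non_spacing_marks is hypothetical. It decomposes the text with NFD normalization so that accents become separate characters, then drops every character whose category is Mn. The result is left in decomposed form.

```python
import unicodedata


def strip_non_spacing_marks(text: str) -> str:
    """Decompose text (NFD) and drop characters in category Mn."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    print(strip_non_spacing_marks("caf\u00e9"))   # e with acute accent
    print(unicodedata.category("\u0301"))          # COMBINING ACUTE ACCENT
    print(unicodedata.category("\u093e"))          # DEVANAGARI VOWEL SIGN AA
    print(unicodedata.category("\u20dd"))          # COMBINING ENCLOSING CIRCLE
```

Only Mn is filtered here. Marks that take up space often carry meaning, such as vowel signs, so removing them usually changes the text rather than simplifying it.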

Number

Decimal Number

Decimal digit characters, such as 0–9 and their equivalents in other scripts.

Letter Number

Number characters that look like letters, such as Roman numerals.

Other Number

Number characters that are not decimal digits or letter numbers.
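The difference between the three number categories can be illustrated with a short sketch, again assuming Python's unicodedata module. The helper describe_number is a made-up name; it pairs a character's category abbreviation (Nd, Nl, or No) with the numeric value recorded for it in the Unicode database.

```python
import unicodedata


def describe_number(ch: str) -> tuple:
    """Return (category abbreviation, numeric value) for a number character."""
    return unicodedata.category(ch), unicodedata.numeric(ch)


if __name__ == "__main__":
    samples = {
        "\u0663": "ARABIC-INDIC DIGIT THREE",
        "\u216b": "ROMAN NUMERAL TWELVE",
        "\u00bd": "VULGAR FRACTION ONE HALF",
    }
    for ch, name in samples.items():
        print(name, describe_number(ch))
```

Only decimal digits are suitable for positional parsing: unicodedata.decimal() and str.isdecimal() accept the Arabic-Indic digit but reject the Roman numeral and the fraction.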

Punctuation

Connector Punctuation

Punctuation marks that connect words, such as the underscore.

Dash Punctuation

Punctuation marks that separate words or clauses, such as hyphens and dashes.

Close Punctuation

Closing punctuation marks, such as brackets and parentheses.

Final Punctuation

Closing quotation marks.

Initial Punctuation

Opening quotation marks.

Other Punctuation

Punctuation marks that are not connectors, dashes, brackets, or quotes.

Open Punctuation

Opening punctuation marks, such as brackets and parentheses.
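The punctuation categories above can be told apart by their two-letter abbreviations. The following is a small sketch under the assumption that Python's unicodedata module is available; the lookup table PUNCTUATION_NAMES and the function punctuation_kind are illustrative names, not part of any library.

```python
import unicodedata
from typing import Optional

# Abbreviation -> category name, as listed in this section.
PUNCTUATION_NAMES = {
    "Pc": "Connector Punctuation",
    "Pd": "Dash Punctuation",
    "Pe": "Close Punctuation",
    "Pf": "Final Punctuation",
    "Pi": "Initial Punctuation",
    "Po": "Other Punctuation",
    "Ps": "Open Punctuation",
}


def punctuation_kind(ch: str) -> Optional[str]:
    """Return the punctuation category name, or None if ch is not punctuation."""
    return PUNCTUATION_NAMES.get(unicodedata.category(ch))


if __name__ == "__main__":
    for ch in ["_", "-", ")", "\u201d", "\u201c", "!", "("]:
        print(repr(ch), punctuation_kind(ch))
```

Note that the ASCII quotation mark (") is Other Punctuation; only directional quotes such as U+201C and U+201D fall under Initial and Final Punctuation.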

Symbol

Currency Symbol

Currency symbols such as $, £, €, and ¥.

Modifier Symbol

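Non-letterlike modifier symbols that occupy their own space rather than combining with a base character, such as the circumflex accent (^) and grave accent (`).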

Math Symbol

Mathematical symbols such as +, =, <, and >.

Other Symbol

Symbols that are not math, currency, or modifier symbols.
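To see how the symbol categories divide up a piece of text, here is a hedged sketch that tallies symbol characters by abbreviation (Sc, Sk, Sm, So). It assumes Python's unicodedata and collections modules; count_symbols is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def count_symbols(text: str) -> Counter:
    """Count characters whose general category starts with S, by abbreviation."""
    categories = (unicodedata.category(ch) for ch in text)
    return Counter(cat for cat in categories if cat.startswith("S"))


if __name__ == "__main__":
    # Contains +, =, $, the copyright sign, and the circumflex accent.
    print(count_symbols("a+b=$5 \u00a9 ^"))
```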

Separator

Line Separator

The Unicode line separator character.

Paragraph Separator

The Unicode paragraph separator character.

Space Separator

Space characters of various widths.
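A brief sketch of the separator categories, assuming Python's unicodedata module, shows that the familiar newline is not a separator in this classification: it is a control character, while U+2028 and U+2029 are the dedicated line and paragraph separators. The function separator_code is an illustrative name.

```python
import unicodedata
from typing import Optional


def separator_code(ch: str) -> Optional[str]:
    """Return Zl, Zp, or Zs for separator characters, otherwise None."""
    cat = unicodedata.category(ch)
    return cat if cat.startswith("Z") else None


if __name__ == "__main__":
    for ch in [" ", "\u00a0", "\u2003", "\u2028", "\u2029", "\n"]:
        print(repr(ch), separator_code(ch))
```

Python's str.splitlines() treats U+2028 and U+2029 as line boundaries, so "a\u2028b".splitlines() yields two items.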

Other

Control

Control characters such as carriage return, tab, and null.

Format

Non-visible formatting characters such as the zero-width joiner.

Private Use

Code points reserved for private use by applications.

Surrogate

High and low surrogate code points used in UTF-16 encoding.
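Characters in these categories are often filtered out when sanitizing input. The sketch below assumes Python's unicodedata module; remove_format_chars is a hypothetical helper that drops only format characters (abbreviation Cf) and leaves everything else untouched.

```python
import unicodedata


def remove_format_chars(text: str) -> str:
    """Drop characters whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    print(unicodedata.category("\r"))       # carriage return
    print(unicodedata.category("\u200d"))   # ZERO WIDTH JOINER
    print(unicodedata.category("\ue000"))   # first private use code point
    print(unicodedata.category("\ud800"))   # first high surrogate
    print(remove_format_chars("a\u200db"))
```

Removing format characters is not always safe: the zero-width joiner, for instance, affects how some scripts and emoji sequences render, so a filter like this is best applied only where plain identifiers or search keys are needed.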