Our website is made possible by displaying online advertisements to our visitors.
Please consider supporting us by disabling your ad blocker.

Responsive image


Unicode and HTML

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.

In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standard defaults to Windows-1252 encoding). It was extended to ISO 10646 (which is basically equivalent to Unicode) by RFC 2070. It does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.

The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, font, and varying levels of support by web browsers.


Previous Page Next Page








Responsive image

Responsive image