Character encoding determination Unicode and HTML

- September 15, 2015

1 Character encoding determination

1.1 Encoding information
1.2 Encoding defaults
1.3 Encoding trends
1.4 Byte order mark/Unicode sniffing
1.5 Encoding overriding

Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. To do this, the web browser must know which encoding was used.

Encoding information

When a document is transmitted via a MIME message, or via a transport that uses MIME content types such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=UTF-8. Other external means of declaring the encoding are permitted but rarely used. If the document uses a Unicode encoding, the encoding information might also be present in the form of a byte order mark. Finally, the encoding can be declared via the HTML syntax. For the text/html serialization, as long as the page is encoded in an extension of ASCII (such as UTF-8, and thus not if the page uses UTF-16), a meta element such as <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or (starting with HTML5) <meta charset="UTF-8"> can be used. For HTML pages serialized as XML, the declaration options are either to rely on the encoding default (which for XML documents is UTF-8) or to use an XML encoding declaration. The meta element plays no role in HTML served as XML.
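The two declaration mechanisms for text/html described above can be sketched in Python. This is a minimal illustration, not the prescan algorithm a real browser runs: the function names, the regular expression, and the 1024-byte scan limit are assumptions made for this example.

```python
import re
from email.message import Message

# Simplified pattern: finds "charset=..." inside a <meta ...> tag. It covers both
# the http-equiv form and the HTML5 <meta charset="..."> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_:.\-]+)', re.IGNORECASE
)


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased by the standard library


def charset_from_meta(html_bytes, scan_limit=1024):
    """Return the charset declared in a meta element near the start, or None.

    Works on raw bytes, which is only meaningful when the page is encoded
    in an extension of ASCII (so not for UTF-16 or UTF-32).
    """
    match = _META_CHARSET.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None
```

For example, charset_from_content_type("text/html; charset=UTF-8") gives "utf-8", and charset_from_meta(b'<meta charset="utf-8">') gives "utf-8".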

Encoding defaults

An encoding default applies when there is no external or internal encoding declaration and also no byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular web page (that is, an HTML page serialized as text/html) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be Windows-1252. For Cyrillic alphabet locales, the default is typically Windows-1251. For a browser from a location where legacy multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.
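The fallback rule can be sketched as a small lookup. The table below is purely illustrative: the language codes, the choice of Japanese as the auto-detection case, and the Windows-1252 fallback for unlisted locales are assumptions for this example, not a list taken from any browser.

```python
# Illustrative mapping only; real browsers ship longer, browser-specific tables.
_LEGACY_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ru": "windows-1251",
    "bg": "windows-1251",
}

# Assumed example of a locale where multi-byte legacy encodings are common.
_AUTODETECT_LANGUAGES = {"ja"}


def default_encoding(content_type, locale):
    """Return the default encoding when nothing declares one.

    Returns None to mean "run auto-detection instead of using a fixed default".
    """
    if content_type == "application/xhtml+xml":
        return "utf-8"  # the XML default is fixed, whatever the locale
    language = locale.replace("_", "-").split("-")[0].lower()
    if language in _AUTODETECT_LANGUAGES:
        return None
    # Assumption: unlisted locales fall back to windows-1252.
    return _LEGACY_DEFAULTS.get(language, "windows-1252")
```

With this table, default_encoding("text/html", "ru-RU") gives "windows-1251", while the same locale with application/xhtml+xml gives "utf-8".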

Encoding trends

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea which encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration effects a change in the actual encoding (whereas it is just a label that could be inaccurate), are also a reason for this editor attitude. A factor contributing in the same direction is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors therefore tend to default to UTF-8, as recommended by the HTML5 specification.
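The point that a declaration is only a label can be shown with a short Python sketch. The sample word and the helper name decode_with_label are assumptions for this example.

```python
# The bytes an editor actually wrote to disk, here in windows-1252.
raw = "caf\u00e9".encode("windows-1252")  # b'caf\xe9'


def decode_with_label(data, label):
    """Decode data according to its declared label; return None if it does not fit."""
    try:
        return data.decode(label)
    except UnicodeDecodeError:
        return None


# Declaring "utf-8" does not convert anything: the stored bytes are unchanged,
# and here they are not even valid UTF-8.
mislabelled = decode_with_label(raw, "utf-8")
correct = decode_with_label(raw, "windows-1252")
```

Here mislabelled is None because the lone byte 0xE9 is not a complete UTF-8 sequence, while correct holds the original text.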

Byte order mark/Unicode sniffing

For both serializations of HTML (content type text/html and content type application/xhtml+xml), the byte order mark (BOM) is the most effective way to transmit encoding information within an HTML document. For UTF-8 the BOM is optional, while it is required for the UTF-16 and UTF-32 encodings. (Note: UTF-16 and UTF-32 without the BOM are formally known under different names; they are different encodings and thus need some form of encoding declaration. See UTF-16BE, UTF-16LE, UTF-32LE and UTF-32BE.) The use of the BOM character (U+FEFF) means that the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte stream to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively (the UTF-32 and UTF-16 values shown are the big-endian forms; little-endian documents start with the byte-swapped sequences FF FE 00 00 and FF FE). No additional metadata mechanisms are required for these encodings, since the byte order mark includes all of the information necessary for processing applications. In most circumstances the byte order mark is handled by editing applications separately from the other characters, so there is little risk of an author removing or otherwise changing the byte order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.
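Both checks can be sketched in Python using the BOM constants from the standard codecs module. The function names are assumptions for this example, and the "<" check is deliberately simplified: it only looks at the very first character and does not skip leading blanks.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Byte patterns produced by a leading "<" (U+003C); four-byte forms are tested first.
_LT_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<", "utf-16-be"),
    (b"<\x00", "utf-16-le"),
    (b"<", "utf-8"),  # more precisely: some ASCII-compatible encoding
]


def sniff_bom(data):
    """Return the encoding signalled by a leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_leading_lt(data):
    """Guess the encoding family from how an initial "<" is encoded, or None."""
    for pattern, name in _LT_PATTERNS:
        if data.startswith(pattern):
            return name
    return None
```

A caller would typically try sniff_bom first and fall back to sniff_leading_lt only when no BOM is present. The last pattern cannot distinguish UTF-8 from other ASCII-compatible encodings, so its result is only a hint.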

Encoding overriding

Many HTML documents are served with inaccurate encoding information, or with no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding auto-detection algorithm that works in concert with the manual override or, in the case of the BOM and of HTML served as XML, against it.

For HTML documents serialized as text/html, manual override may apply to all documents, or only to those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the web; the problem is therefore likely to persist. Note, however, that Internet Explorer, Chrome and Safari do not permit the encoding to be overridden whenever the page includes a BOM, for both the XML and text/html serializations.

For HTML documents serialized with the preferred XML label, application/xhtml+xml, manual encoding override is not permitted. Overriding the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas the bulk of the other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari), do allow the encoding of XHTML documents to be manually overridden.
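The strictest combination of the rules in this section can be written as a single predicate. This is a sketch of that policy only, not a description of any one browser: as noted above, actual browsers differ, and the function name is an assumption for this example.

```python
def manual_override_allowed(content_type, has_bom):
    """Decide whether a user-selected encoding may replace the detected one.

    Strict policy assumed here: a BOM always wins, and documents served
    as application/xhtml+xml are never overridden.
    """
    if has_bom:
        return False
    if content_type == "application/xhtml+xml":
        return False
    return True
```

Under this policy, only text/html documents without a BOM remain open to manual override.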
