Appendix:Unicode normalization

See also Unicode normalization considerations on the MediaWiki website.

Wikimedia, like most servers on the internet, stores Unicode strings in Normalization Form C (NFC), also called Normalization Form Canonical Composition. As a result, several different Unicode strings are often mapped to the same canonical form. When you enter a Unicode string and save the page, it is automatically converted to the normalized form, so non-normalized strings cannot be saved in a Wiktionary page.
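As a rough illustration of the conversion on save, the sketch below uses Python's standard `unicodedata` module. The function name `as_saved` is hypothetical and only models the NFC step described above; it is not MediaWiki's actual code.

```python
import unicodedata

def as_saved(text: str) -> str:
    """Model the save step as plain NFC normalization (an assumption)."""
    return unicodedata.normalize("NFC", text)

def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

typed = "C\u0327"          # C followed by COMBINING CEDILLA
stored = as_saved(typed)   # what would end up in the page

print(code_points(typed))   # ['U+0043', 'U+0327']
print(code_points(stored))  # ['U+00C7']
```

Text that is already in NFC passes through `as_saved` unchanged.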

Equivalence

Type | Canonically equivalent forms
Combining sequence | C + ◌̧ | Ç
Ordering of combining marks | q + ̣ + ̇ | q + ̇ + ̣
Hangul | ᄀ + ᅡ | 가
Singleton | Å (U+212B) | Å (U+00C5)
Hebrew | ל ָ ֽ ִ | ל ִ ָ ֽ
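The rows above can be checked mechanically. The following is a minimal sketch, assuming Python's `unicodedata` module; the helper `canonically_equivalent` is a hypothetical name, and the code points chosen for each row are my reading of the table (for example U+0323 and U+0307 for the dots below and above q).

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFC forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# (description, first form, second form); code points are assumptions
rows = [
    ("Combining sequence", "C\u0327", "\u00C7"),
    ("Ordering of combining marks", "q\u0323\u0307", "q\u0307\u0323"),
    ("Hangul", "\u1100\u1161", "\uAC00"),
    ("Singleton", "\u212B", "\u00C5"),
    ("Hebrew", "\u05DC\u05B8\u05BD\u05B4", "\u05DC\u05B4\u05B8\u05BD"),
]

for name, first, second in rows:
    print(name, canonically_equivalent(first, second))
```

Comparing NFD forms instead of NFC forms would give the same answers, since both are canonical normalization forms.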

Issues

Most of the time NFC makes processing text easier, but some oddities, both semantic and non-semantic, do appear. There are four cases in which a single character is not its own NFC form.

  1. Sometimes an alternative single character is the canonical composed form.
    Example: U+212B (Å - ANGSTROM SIGN) is converted to U+00C5 (Å - LATIN CAPITAL LETTER A WITH RING ABOVE).
  2. For some scripts, precomposed characters are not preferred.
    Example: U+0958 (क़ - DEVANAGARI LETTER QA) is converted to the decomposed क़, which is U+0915 (क - DEVANAGARI LETTER KA) + U+093C (◌़ - DEVANAGARI SIGN NUKTA).
  3. Where a precomposed character was added after its decomposition already existed (pre-Unicode 3.0), the decomposition is preferred.
    Example: U+2ADC (⫝̸ - FORKING) is converted to ⫝̸, which is U+2ADD (⫝ - NONFORKING) + U+0338 (◌̸ - COMBINING LONG SOLIDUS OVERLAY).
  4. A decomposition is preferred to precomposed characters where the decomposition begins with a non-starter.
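The four cases can be observed directly with Python's `unicodedata` module. This is a minimal sketch: the first three inputs are the examples listed above, while U+0344 (COMBINING GREEK DIALYTIKA TONOS) for the fourth case is my own choice of example and does not come from this page.

```python
import unicodedata

def nfc_code_points(ch: str) -> list[str]:
    """Return the NFC form of a string as a list of U+XXXX code points."""
    normalized = unicodedata.normalize("NFC", ch)
    return [f"U+{ord(c):04X}" for c in normalized]

cases = {
    "1. singleton":             "\u212B",  # ANGSTROM SIGN
    "2. script exclusion":      "\u0958",  # DEVANAGARI LETTER QA
    "3. later addition":        "\u2ADC",  # FORKING
    "4. non-starter (assumed)": "\u0344",  # COMBINING GREEK DIALYTIKA TONOS
}

for label, ch in cases.items():
    print(label, f"U+{ord(ch):04X}", "->", nfc_code_points(ch))
```

Only the first case yields a single code point; the other three come out as two-code-point sequences.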

In a number of common cases, Unicode's canonical ordering of two diacritics is counterintuitive, interoperates poorly with certain existing software, or both. In other, less common cases, the problem is that the diacritics should not have a canonical ordering at all, because the two orderings are not actually equivalent: the two diacritics should have the same value for the Canonical_Combining_Class (ccc) property, but instead they have different ones. For example, Hebrew lamed + patah + hiriq (U+05DC U+05B7 U+05B4, "lai") is mistakenly normalized to lamed + hiriq + patah (U+05DC U+05B4 U+05B7, "lia"), which displays as לִַ.
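The reordering comes from sorting adjacent combining marks by their ccc value. A minimal sketch, assuming Python's `unicodedata` module, whose `combining()` function returns the Canonical_Combining_Class of a character; the code points for patah (U+05B7) and hiriq (U+05B4) are my reading of the Hebrew example.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

# ccc values: a base letter has 0, and marks are sorted in ascending order
ccc = {name: unicodedata.combining(ch)
       for name, ch in [("lamed", LAMED), ("patah", PATAH), ("hiriq", HIRIQ)]}
print(ccc)

typed = LAMED + PATAH + HIRIQ                     # intended reading "lai"
normalized = unicodedata.normalize("NFC", typed)  # hiriq now comes first

print([f"U+{ord(c):04X}" for c in typed])
print([f"U+{ord(c):04X}" for c in normalized])
```

Because hiriq has a lower ccc than patah, normalization always places it first, regardless of the order in which the two marks were entered.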

As the conversion is automatic, pages cannot exist for the non-NFC form. An explicit link to the non-NFC form will display the non-NFC form, but clicking it takes the user to the NFC page Å.

Display

Non-NFC characters can be displayed on a page using {{HTML char}} (for example, {{HTML char|212B}} will show Å). To note canonical equivalence between two single characters, use {{normalization}} in the caption field of the appropriate {{character info}} template on the NFC character (see Å for an example). To note that the NFC of a precomposed character is a decomposition, use {{decomposed}} in the caption field of the appropriate {{character info}} template on the NFC decomposition (see क़ for an example).

Notes

Wikimedia does not enforce compatibility equivalence, which combines even more forms together (such as N and ).
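The difference between canonical and compatibility normalization can be sketched with Python's `unicodedata` module. The two sample characters, U+FB01 (LATIN SMALL LIGATURE FI) and U+00B2 (SUPERSCRIPT TWO), are my own illustrative choices and are not taken from this page.

```python
import unicodedata

samples = ["\uFB01", "\u00B2"]  # fi ligature, superscript two (assumed examples)

for ch in samples:
    nfc = unicodedata.normalize("NFC", ch)    # canonical: leaves these alone
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility: folds them
    print(f"U+{ord(ch):04X}", "NFC unchanged:", nfc == ch, "NFKC:", repr(nfkc))
```

Under NFC both characters survive as they are, which is why such forms can still appear in page text and titles.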

See also