Wiktionary:Languages

This is a Wiktionary policy, guideline or common practices page. Specifically, it is a policy think tank, working to develop a formal policy.
Policies: CFI - ELE - BLOCK - REDIR - BOTS - QUOTE - DELETE - NPOV - AXX
For a list of all language codes, see Wiktionary:List of languages.
For information on how to add or remove a language from Wiktionary, see Wiktionary:Guide to adding and removing languages.

Wiktionary includes many words in many languages. To distinguish languages, Wiktionary gives each a unique name and a unique code, which identify it.

Language names

Wiktionary calls each language it includes by a distinct name; these names are used in headers, translation tables, categories, appendices, and some other places. (When a single language is known by multiple names, only one is used in those places.) Language names are chosen by consensus. Whenever possible, common English names of languages are used, and diacritics are avoided. Attested names (names which meet CFI) are strongly preferred.

When two or more languages are commonly known by the same name, Wiktionary distinguishes them by using synonyms for one or all of them, if possible. Thus the language of the Pyu city-states, though called "Pyu" by some scholars, is called "Tircul" (code: pyx) on Wiktionary, to distinguish it from the language of Papua New Guinea which is called "Pyu" (code: pby). Variant spellings are one source of alternative names: thus the Riang language of India and Bangladesh (code: ria) goes by the name "Reang" on Wiktionary, to distinguish it from the "Riang" of Burma/Myanmar (code: ril).

If languages cannot be distinguished by alternative names, the place where each language is spoken is appended in parentheses after its name: thus the Ghanaian language called "Buli" is referred to as "Buli (Ghana)" on Wiktionary and represented by the code "bwu", while the Indonesian Buli is referred to as "Buli (Indonesia)" and represented by the code "bzq".

If languages go by the same name and are spoken in the same place, they can be disambiguated by their linguistic families, like "Austronesian Mor" (code: mhz) and "Papuan Mor" (code: moq).
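The order of preference described above (alternative name, then place, then family) can be sketched in a few lines of Python. This is only an illustration: the function `disambiguated_name` and its parameters are hypothetical and are not part of any Wiktionary module; the names passed to it are the examples given on this page.

```python
from typing import Optional


def disambiguated_name(
    name: str,
    alternative: Optional[str] = None,
    place: Optional[str] = None,
    family: Optional[str] = None,
) -> str:
    """Return a display name, trying the qualifiers in order of preference.

    Hypothetical helper: the caller supplies whichever qualifier applies.
    """
    if alternative:
        # 1. A synonym or variant spelling replaces the shared name.
        return alternative
    if place:
        # 2. Otherwise the place where the language is spoken goes in parentheses.
        return f"{name} ({place})"
    if family:
        # 3. If even the place is shared, the family name is prefixed.
        return f"{family} {name}"
    return name


print(disambiguated_name("Pyu", alternative="Tircul"))
print(disambiguated_name("Buli", place="Ghana"))
print(disambiguated_name("Mor", family="Austronesian"))
```

In practice the names are chosen by consensus, so a helper like this could at most suggest a candidate name.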

Language codes

Wiktionary has an intricate system for determining which string of letters (code) represents each language and language family, and for storing other information associated with a particular language or family. Language codes are used in naming categories and are used by many templates. The module Module:languages is used to retrieve all language-related information, while Module:families covers language families. Module:languages has a number of data modules (see Category:Language data modules) which store different pieces of information, such as the names, the family and the scripts of each language. Module:languages cannot be used directly in a template, so there is another module, Module:language utilities, which allows templates to access the information.

If you know the name of a language, you can determine its code by using {{langrev}} with the language's name as a parameter: the template will return the language's code if it can find it. (Type {{langrev|English}}, for example, in the Sandbox or Special:ExpandTemplates, and it will return "en".)
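The idea of a name-to-code lookup can be illustrated with a minimal Python sketch. It does not show how {{langrev}} is actually implemented; the table `LANGUAGE_NAMES` and the functions below are hypothetical, and the table holds only a few of the codes quoted on this page.

```python
from typing import Dict, Optional

# Hypothetical excerpt of a code -> name table (codes as quoted on this page).
LANGUAGE_NAMES: Dict[str, str] = {
    "en": "English",
    "de": "German",
    "eo": "Esperanto",
    "abe": "Abenaki",
    "pyx": "Tircul",
    "pby": "Pyu",
}


def build_reverse_index(names: Dict[str, str]) -> Dict[str, str]:
    """Invert code -> name into name -> code, insisting that names are unique."""
    reverse: Dict[str, str] = {}
    for code, name in names.items():
        if name in reverse:
            raise ValueError(f"name {name!r} is used by {reverse[name]!r} and {code!r}")
        reverse[name] = code
    return reverse


def code_for_name(name: str, reverse: Dict[str, str]) -> Optional[str]:
    """Return the code for a language name, or None if the name is unknown."""
    return reverse.get(name)


REVERSE_INDEX = build_reverse_index(LANGUAGE_NAMES)
print(code_for_name("English", REVERSE_INDEX))
```

The reverse lookup only works because every language has a unique name, which is why the inversion step rejects duplicates.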

Wiktionary also has a simple system for recording which family individual languages belong to, and which scripts they are written in.

Wiktionary represents individual languages as follows:

  1. Languages which were assigned two-letter codes in the international standard ISO 639-1 are generally represented on Wiktionary by those codes. The individual codes are stored in submodules of Module:languages. English, for example, is represented by en. German is represented by de. Esperanto is represented by eo. Wikipedia has a list of ISO 639-1 codes.
    1. A few languages are represented on Wiktionary by 639-1 codes the ISO has deprecated. (This is generally the case when the ISO has come to consider a lect a group of languages, but Wiktionary still considers it a single language.) Serbo-Croatian, for example, is represented by sh.
  2. Languages which were not assigned codes by ISO 639-1, but which were assigned three-letter codes (based on Ethnologue codes) in the international standard ISO 639-3 are generally represented on Wiktionary by those codes. Abenaki, for example, is represented by abe. Wikipedia has a list of ISO 639-3 codes.
  3. A few languages are represented by other, "exceptional" codes. Exceptional codes are chosen as follows:
    1. A few are ISO 639-2 codes. (This is the case, for example, for languages which were not assigned specific, single codes by either ISO 639-1 or ISO 639-3.) Nahuatl, for example, is represented by nah.
    2. A few are codes devised by the Wikimedia Foundation Language Committee. (This is the case when a Wikimedia project is begun in a language which was not assigned a code by any ISO standard.) Zamboanga Chavacano, for example, is represented by cbk-zam. Wiktionary has a list of such codes in its Appendix:Wikimedia language codes.
    3. Any language which does not have an ISO or specially-devised Wikimedia code, but which is to be included in Wiktionary, is given a two-part exceptional code. The first part of this code is a relevant ISO 639-5 family code (see Wiktionary's appendix); after a hyphen, the second part of the code is a series of three lowercase letters which generally approximate the language name. (No digits, uppercase letters, etc. are used: IANA tags allow these and are case-insensitive, but the MediaWiki software is more restrictive.) For example, Samoan Plantation Pidgin is cpe-spp: "cpe" is the ISO 639-5 code for English-based creoles and pidgins, "spp" abbreviates "Samoan Plantation Pidgin". Gallo is roa-gal: "roa" is the ISO 639-5 code for Romance languages, "gal" abbreviates "Gallo".
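As an illustration of the two-part exceptional code format, here is a minimal Python sketch. The function `make_exceptional_code` is hypothetical (codes are assigned by editors, not generated), and it checks only the rule stated above: the second part must be exactly three lowercase letters.

```python
import re

# Second part of the code: exactly three lowercase ASCII letters.
_THREE_LOWERCASE = re.compile(r"[a-z]{3}")


def make_exceptional_code(family_code: str, abbreviation: str) -> str:
    """Join a family code and a three-letter abbreviation with a hyphen.

    Hypothetical helper; the family code is taken on trust and not checked
    against any list of families.
    """
    if not _THREE_LOWERCASE.fullmatch(abbreviation):
        raise ValueError(
            f"abbreviation must be three lowercase letters, got {abbreviation!r}"
        )
    return f"{family_code}-{abbreviation}"


print(make_exceptional_code("cpe", "spp"))  # Samoan Plantation Pidgin
print(make_exceptional_code("roa", "gal"))  # Gallo
```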

Constructed languages which are not widely used but which have been assigned ISO 639-3 codes are sometimes accepted by Wiktionary for inclusion in dedicated Appendices. These languages are represented by their ISO 639-3 codes. Láadan, for example, is represented by the ISO 639-3 code ldn. Such languages have their type set to appendix-constructed in the data modules of Module:languages. Some other constructed languages are also included in dedicated Appendices though they do not have ISO 639-3 codes: these languages are given codes which consist of "art-" followed by three letters.

Reconstructed languages are assigned special codes consisting of the language family's code with "-pro" added to the end. Proto-Germanic, for example, is represented by the code gem-pro. Such languages have their type set to reconstructed in the data modules of Module:languages.
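The "-pro" convention is simple enough to express directly. The following Python sketch uses hypothetical helper names and relies only on the rule stated above, that the proto-language code is the family code with "-pro" appended.

```python
from typing import Optional

PROTO_SUFFIX = "-pro"


def proto_code(family_code: str) -> str:
    """Build the code of a family's reconstructed proto-language."""
    return family_code + PROTO_SUFFIX


def family_of_proto(code: str) -> Optional[str]:
    """Return the family code behind a "-pro" code, or None for other codes."""
    if code.endswith(PROTO_SUFFIX) and len(code) > len(PROTO_SUFFIX):
        return code[: -len(PROTO_SUFFIX)]
    return None


print(proto_code("gem"))           # Proto-Germanic
print(family_of_proto("gem-pro"))
```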

Not all lects which have been assigned codes by the ISO are assigned codes or included by Wiktionary.

  1. The ISO has assigned codes to some constructed languages which Wiktionary excludes.
  2. The ISO has assigned codes to some lects which Wiktionary treats as dialects of other languages and therefore represents by those languages' codes. (This is the case, for example, with Moldovan/Moldavian: the ISO assigned the lect the 639-1 code mo, but Wiktionary regards it as a form of Romanian and represents it and Romanian by the same code ro. See Wiktionary:Language treatment.)
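A tool that imports data tagged with ISO codes would need to account for such merged lects. The Python sketch below shows one possible approach; the table `MERGED_INTO` and the function `wiktionary_code` are hypothetical, and the table contains only the single example given above.

```python
from typing import Dict

# Hypothetical table: ISO code of a merged lect -> code Wiktionary uses instead.
MERGED_INTO: Dict[str, str] = {
    "mo": "ro",  # Moldovan/Moldavian is treated as a form of Romanian
}


def wiktionary_code(iso_code: str) -> str:
    """Map an ISO code to the code Wiktionary uses, following known merges.

    Codes that are not in the table are returned unchanged; this sketch does
    not check whether Wiktionary includes the language at all.
    """
    return MERGED_INTO.get(iso_code, iso_code)


print(wiktionary_code("mo"))
print(wiktionary_code("ro"))
```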

Languages' family and script information

Wiktionary sorts languages into families. Most families are related through descent from a common ancestor, but a few are merely categories, such as "creoles and pidgins". Wiktionary records which family a language belongs to in the data modules of Module:languages. Each family is represented by a code; the family codes are explained in Wiktionary:Families.

  1. English belongs to the family of West Germanic languages; this information is recorded in the module as gmw. Serbo-Croatian is a South Slavic language, recorded as zls. Abenaki is an Algonquian language, recorded as alg. Nahuatl is a Nahuan language, recorded as azc-nah.
  2. The widely-used constructed language Esperanto has its membership in the category "Artificial languages" recorded as art.
  3. Zamboanga Chavacano has its membership in the category "Creole or pidgin languages" recorded as crp.
  4. Wiktionary even records information about appendix-only constructed languages in this way: Láadan has its membership in the category "Artificial languages" recorded as art.

Wiktionary records which script(s) a language uses through the module as well. Each script is represented by a code, which is itself the name of a template, stored in the Template: namespace. The script codes are explained in Wiktionary:Scripts.

  1. English is written in the Latin script; this is recorded as Latn.
  2. Serbo-Croatian is written in both the Latin and the Cyrillic scripts; this is recorded as Latn and Cyrl.
  3. Wiktionary even records information about appendix-only constructed languages in this way: the information that Láadan is written in the Latin script is recorded as Latn.
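The kind of record described in this section can be pictured with a small Python sketch. This is not the layout of the actual data modules: the class `LanguageRecord`, its field names, and the default type label "regular" are assumptions made for illustration. The family and script values are the ones given in the examples above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LanguageRecord:
    """Hypothetical per-language record: name, family code, script codes, type."""
    name: str
    family: str
    scripts: Tuple[str, ...]
    type: str = "regular"  # assumed label for ordinary languages


LANGUAGES: Dict[str, LanguageRecord] = {
    "en": LanguageRecord("English", "gmw", ("Latn",)),
    "sh": LanguageRecord("Serbo-Croatian", "zls", ("Latn", "Cyrl")),
    "ldn": LanguageRecord("Láadan", "art", ("Latn",), "appendix-constructed"),
}


def codes_using_script(script: str) -> List[str]:
    """Return the codes of all languages recorded as using the given script."""
    return sorted(code for code, rec in LANGUAGES.items() if script in rec.scripts)


def codes_in_family(family: str) -> List[str]:
    """Return the codes of all languages recorded under the given family code."""
    return sorted(code for code, rec in LANGUAGES.items() if rec.family == family)


print(codes_using_script("Cyrl"))
print(codes_in_family("art"))
```

Storing the scripts as a tuple lets a language such as Serbo-Croatian list more than one script without a separate record per script.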

Lects which appear only in etymology sections

Some lects (dialects, chronolects and topolects) are referred to in etymology sections without having entries. These lects are given exceptional codes which generally do not fit the pattern described above. The lects and their codes are stored in Module:etymology language/data and described in Wiktionary:Dialects.

Last modified on 28 March 2014, at 06:10