Module:languages
This module is used to retrieve and manage the languages that can have Wiktionary entries, and the information associated with them. See Wiktionary:Languages for more information.
For the languages and language varieties that may be used in etymologies, see Module:etymology languages. For language families, which sometimes also appear in etymologies, see Module:families.
This module provides access to other modules. To access the information from within a template, see Module:languages/templates.
The information itself is stored in the various data modules that are subpages of this module. These modules should not be used directly by any other module; the data should only be accessed through the functions provided by this module.
Data submodules:
- Two-letter codes
- Three-letter codes by their first letter: a b c d e f g h i j k l m n o p q r s t u v w x y z
- Codes containing hyphens (-)
Extra data submodules (for less frequently used data):
- Two-letter codes
- Three-letter codes by their first letter: a b c d e f g h i j k l m n o p q r s t u v w x y z
- Codes containing hyphens (-)
Finding and retrieving languages
The module exports a number of functions that are used to find languages.
getByCode
getByCode(code, paramForError, allowEtymLang, allowFamily)
Finds the language whose code matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless paramForError is given, in which case an error is generated. If paramForError is true, a generic error message mentioning the bad code is generated; otherwise paramForError should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If allowEtymLang is specified, etymology language codes are allowed and looked up along with normal language codes. If allowFamily is specified, language family codes are allowed and looked up along with normal language codes.
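The three outcomes described above (object, nil, or error) can be sketched in plain Lua. This is only an illustration of the documented contract, not the module's implementation: the `data` table is a hypothetical stand-in for the real data submodules, the error wording is invented, and the allowEtymLang and allowFamily options are left out.

```lua
-- Hypothetical stand-in for the data submodules; only two codes are listed.
local data = {
	fr = {"French"},
	de = {"German"},
}

-- Sketch of the documented contract of getByCode (error messages are invented).
local function getByCode(code, paramForError)
	local raw = data[code]
	if raw then
		return {
			_code = code,
			_rawData = raw,
			getCode = function(self) return self._code end,
			getCanonicalName = function(self) return self._rawData[1] end,
		}
	end
	if not paramForError then
		-- No paramForError given: an unknown code simply yields nil.
		return nil
	elseif paramForError == true then
		-- Generic message mentioning only the bad code.
		error("The language code \"" .. tostring(code) .. "\" is not valid.")
	else
		-- Message that also names the parameter the code came from.
		error("The language code \"" .. tostring(code) .. "\" in parameter "
			.. tostring(paramForError) .. " is not valid.")
	end
end
```

Calling `getByCode("xx")` in this sketch returns nil, while `getByCode("xx", 2)` raises an error that names parameter 2.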
getByCanonicalName
getByCanonicalName(name, paramForError, allowEtymLang, allowFamily)
Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a Language object representing the language. Otherwise, it returns nil, unless paramForError is given, in which case an error is generated. If allowEtymLang is specified, etymology language codes are allowed and looked up along with normal language codes. If allowFamily is specified, language family codes are allowed and looked up along with normal language codes.
The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result.
This function is powered by Module:languages/canonical names, which contains a pre-generated mapping of non-etymology-language canonical names to codes. It is generated by going through the Category:Language data modules for non-etymology languages. When allowEtymLang is specified for the above function, Module:etymology languages/by name may also be used, and when allowFamily is specified for the above function, Module:families/by name may also be used.
getByName
getByName(name)
Like getByCanonicalName(), except it also looks at the otherNames listed in the non-etymology language data modules, and does not (currently) have options to look up etymology languages and families.
getNonEtymological
getNonEtymological(lang)
If given an etymology language, this iterates through parents until a regular language or family is found, and the corresponding object is returned. If given a regular language or family, the object itself is returned.
Finding all languages
Use Module:languages/iterateAll to find all languages.
Language objects
A Language object is returned from one of the functions above. It is a Lua representation of a language and the data associated with it. It has a number of methods that can be called on it, using the : syntax. For example:
local m_languages = require("Module:languages")
local lang = m_languages.getByCode("fr")
local name = lang:getCanonicalName()
-- "name" will now be "French"
Language:getCode
:getCode()
Returns the language code of the language. Example: "fr" for French.
Language:getCanonicalName
:getCanonicalName()
Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example: "French" for French.
Language:getDisplayForm
:getDisplayForm()
Returns the display form of the language. The display form of a language, family or script is the form it takes when appearing as the SOURCE in categories such as English terms derived from SOURCE or English given names from SOURCE, and is also the displayed text in :makeCategoryLink links. For regular and etymology languages, this is the same as the canonical name, but for families, it reads "NAME languages" (e.g. "Indo-Iranian languages"), and for scripts, it reads "NAME script" (e.g. "Arabic script").
Language:getOtherNames
:getOtherNames(onlyOtherNames)
Returns a table of the "other names" that the language is known by, excluding the canonical name. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {"Manx Gaelic", "Northern Manx", "Southern Manx"} for Manx. If onlyOtherNames is given and is non-nil, only names explicitly listed in the otherNames field are returned; otherwise, names listed under otherNames, aliases and varieties are combined together and returned. For example, for Manx, Manx Gaelic is listed as an alias, while Northern Manx and Southern Manx are listed as varieties. Note that the otherNames field itself is deprecated, and entries listed there should eventually be moved to either aliases or varieties.
Language:getAliases
:getAliases()
Returns a table of the aliases that the language is known by, excluding the canonical name. Aliases are synonyms for the language in question. The names are not guaranteed to be unique, in that sometimes more than one language is known by the same name. Example: {"High German", "New High German", "Deutsch"} for German.
Language:getVarieties
:getVarieties(flatten)
Returns a table of the known subvarieties of a given language, excluding subvarieties that have been given explicit etymology language codes. The names are not guaranteed to be unique, in that sometimes a given name refers to a subvariety of more than one language. Example: {"Southern Aymara", "Central Aymara"} for Aymara. Note that the returned value can have nested tables in it, when a subvariety goes by more than one name. Example: {"North Azerbaijani", "South Azerbaijani", {"Afshar", "Afshari", "Afshar Azerbaijani", "Afchar"}, {"Qashqa'i", "Qashqai", "Kashkay"}, "Sonqor"} for Azerbaijani. Here, for example, Afshar, Afshari, Afshar Azerbaijani and Afchar all refer to the same subvariety, whose preferred name is Afshar (the one listed first). To avoid a return value with nested tables in it, specify a non-nil value for the flatten parameter; in that case, the return value would be {"North Azerbaijani", "South Azerbaijani", "Afshar", "Afshari", "Afshar Azerbaijani", "Afchar", "Qashqa'i", "Qashqai", "Kashkay", "Sonqor"}.
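The effect of the flatten parameter can be illustrated with a small standalone function. This is a sketch of the described behaviour only; the real work is delegated to Module:language-like, whose implementation may differ, and `flattenVarieties` is a hypothetical name.

```lua
-- Flatten one level of nesting, keeping the original order of names.
local function flattenVarieties(varieties)
	local flat = {}
	for _, entry in ipairs(varieties) do
		if type(entry) == "table" then
			-- A nested table lists several names for one subvariety.
			for _, name in ipairs(entry) do
				flat[#flat + 1] = name
			end
		else
			flat[#flat + 1] = entry
		end
	end
	return flat
end

-- The Azerbaijani example from the text above.
local azerbaijani = {
	"North Azerbaijani", "South Azerbaijani",
	{"Afshar", "Afshari", "Afshar Azerbaijani", "Afchar"},
	{"Qashqa'i", "Qashqai", "Kashkay"},
	"Sonqor",
}
local flat = flattenVarieties(azerbaijani)
```

After this runs, `flat` holds the ten names in a single list, with the preferred name of each subvariety still coming before its alternatives.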
Language:getType
:getType()
Returns the type of language, which can be "regular", "reconstructed" or "appendix-constructed".
Language:getWikimediaLanguages
:getWikimediaLanguages()
Returns a table containing WikimediaLanguage objects (see Module:wikimedia languages), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code sh (Serbo-Croatian) maps to four Wikimedia codes: sh (Serbo-Croatian), bs (Bosnian), hr (Croatian) and sr (Serbian).
The code for the Wikimedia language is retrieved from the wikimedia_codes property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.
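The fallback from wikimedia_codes to the language's own code can be sketched as follows. This is a plain-Lua approximation that assumes the property is either a table or a comma-separated string; `getWikimediaCodes` and the sample data are hypothetical, and the later check that each code is a valid Wikimedia code is not modelled.

```lua
-- Return the list of Wikimedia codes for a language's raw data,
-- falling back to the language's own code when the property is absent.
local function getWikimediaCodes(rawData, code)
	local codes = rawData.wikimedia_codes
	if type(codes) == "table" then
		return codes
	elseif type(codes) == "string" then
		-- Split a comma-separated list, ignoring surrounding whitespace.
		local list = {}
		for item in codes:gmatch("[^,%s]+") do
			list[#list + 1] = item
		end
		return list
	end
	return {code}
end

-- Hypothetical data shaped after the Serbo-Croatian example above.
local sh = getWikimediaCodes({wikimedia_codes = "sh, bs, hr, sr"}, "sh")
local fr = getWikimediaCodes({}, "fr")
```

Here `sh` contains four codes, while `fr` contains only the language's own code.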
Language:getWikipediaArticle
:getWikipediaArticle()
Returns the name of the Wikipedia article for the language. If the property wikipedia_article is present in the data module, it is used first; otherwise, a sitelink is generated from :getWikidataItem (if set). Failing that, :getCategoryName is used as a fallback.
Language:getWikidataItem
:getWikidataItem()
Returns the Wikidata item id for the language or nil. This corresponds to the second field in the data modules.
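As a rough illustration, the second field may hold either a bare number or a ready-made id string, and a number is given a "Q" prefix. The sketch below takes the raw data table as an argument rather than being a method, and the sample values are illustrative only.

```lua
-- Turn the second data field into a Wikidata item id, if there is one.
local function getWikidataItem(rawData)
	local item = rawData[2]
	if type(item) == "number" then
		return "Q" .. item
	end
	-- Already a string id, or nil when the field is missing.
	return item
end

-- Illustrative raw data tables (values are assumptions, not real data).
local fromNumber = getWikidataItem({"French", 150})
local fromString = getWikidataItem({"Example", "Q42"})
local missing = getWikidataItem({"Example"})
```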
Language:getScripts
:getScripts()
Returns a table of Script objects for all scripts that the language is written in. See Module:scripts.
Language:getScriptCodes
:getScriptCodes()
Returns the table of script codes in the language's data file.
Language:findBestScript
:findBestScript(text)
Given some text, this function iterates through the scripts of a given language and tries to find the script that best matches the text. It returns a Script object representing the script. If no match is found at all, it returns the None script object.
Language:getFamily
:getFamily()
Returns a Family object for the language family that the language belongs to. See Module:families.
Language:getAncestors
:getAncestors()
Returns a table of Language objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.
Language:getCategoryName
:getCategoryName(nocap)
Returns the name of the main category of that language. Example: "French language" for French, whose category is at Category:French language. Unless optional argument nocap is given, the language name at the beginning of the returned value will be capitalized. This capitalization is correct for category names, but not if the language name is lowercase and the returned value of this function is used in the middle of a sentence.
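The naming rule can be sketched in plain Lua. This version takes the canonical name as an argument instead of being a method, and it capitalizes only an ASCII first letter, as a stand-in for the content-language ucfirst call that the module uses; "examplish" is a made-up lowercase name.

```lua
-- Build the category name from a canonical name.
local function getCategoryName(canonicalName, nocap)
	local name = canonicalName
	-- Names that already end in "language" are left as they are.
	if not name:find("[Ll]anguage$") then
		name = name .. " language"
	end
	if not nocap then
		-- ASCII-only capitalization of the first letter (an assumption of this sketch).
		name = name:sub(1, 1):upper() .. name:sub(2)
	end
	return name
end

local french = getCategoryName("French")
local signed = getCategoryName("American Sign Language")
local capped = getCategoryName("examplish")
local uncapped = getCategoryName("examplish", true)
```

The last two calls show why nocap exists: the capitalized form suits a category title, while the uncapitalized form suits running text.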
Language:makeCategoryLink
:makeCategoryLink()
Creates a link to the category; the link text is the canonical name.
Language:makeEntryName
:makeEntryName(text)
Converts the given term into the form used in the names of entries. This removes diacritical marks from the term if they are not considered part of the normal written form of the language and are therefore not permitted in page names. It also removes certain punctuation characters, like final question marks or periods, which are never present in page names. Example for Latin: "amō" → "amo" (macron is removed).
The replacements made by this function are defined by the entry_name setting for each language in the data modules.
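A minimal sketch of the replacement idea, using the Latin example: each listed character is swapped for its plain counterpart. The `entry_name` table here is a hypothetical shape chosen for illustration, and the real function additionally handles scripts, formatting, punctuation and unsupported titles.

```lua
-- Hypothetical replacement lists; the real entry_name data may be shaped differently.
local entry_name = {
	from = {"ā", "ē", "ī", "ō", "ū"},
	to = {"a", "e", "i", "o", "u"},
}

-- Apply each replacement in turn (the listed characters contain no pattern magic).
local function makeEntryNameSketch(text)
	for i, from in ipairs(entry_name.from) do
		text = text:gsub(from, entry_name.to[i])
	end
	return text
end

local entry = makeEntryNameSketch("amō")
```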
Language:makeSortKey
:makeSortKey(text)
Creates a sort key for the given entry name, following the rules appropriate for the language. This removes diacritical marks from the entry name if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and any parentheses are removed as well.
The sort_key setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the entry name and returns a sortkey.
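The language-independent parts of this process (dropping an initial hyphen or asterisk, uppercasing, and removing parentheses) can be sketched in plain Lua. This is an ASCII-only approximation under the name `makeSortKeySketch`, which is hypothetical; it skips script detection, normalization and the language-specific sort_key substitutions.

```lua
-- ASCII-only sketch of the generic sort key steps.
local function makeSortKeySketch(text)
	-- Remove initial hyphens and asterisks, provided something follows them.
	text = text:gsub("^[-*]+(.)", "%1")
	-- The module lowercases, applies substitutions, then uppercases;
	-- with no substitutions this reduces to uppercasing.
	text = text:upper()
	-- Remove parentheses that are preceded or followed by something.
	text = text:gsub("(.)[()]+", "%1"):gsub("[()]+(.)", "%1")
	return text
end

local suffixKey = makeSortKeySketch("-ness")
local parenKey = makeSortKeySketch("colo(u)r")
```

Only the parenthesis characters are dropped in this sketch; the text between them is kept.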
Language:transliterate
:transliterate(text, sc, module_override)
Transliterates the text from the given script into the Latin script (see Wiktionary:Transliteration and romanization). The language must have the translit_module property for this to work; if it is not present, nil is returned.
The sc parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate nil as the script, others require it to be one of the possible scripts that the module can transliterate, and will show an error if it's not one of them. For this reason, the sc parameter should always be provided when writing non-language-specific code.
The module_override parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by Template:tracking/module_override.
Language:hasTranslit
:hasTranslit()
Returns true if the language has a transliteration module, false if it doesn't.
Language:getRawData
:getRawData()
- This function is not for use in entries or other content pages.
Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes.
Language:getRawExtraData
:getRawExtraData()
- This function is not for use in entries or other content pages.
Returns a blob of data about the language that contains the "extra data". Much like with getRawData, the format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes.
Error function
See also
{{Module:etymology languages}}
{{Module:families}}
{{Module:languages/templates}}
{{Module:languages/JSON}}
local export = {}
local function track(page, code)
local tracking_page = "languages/" .. page
if code then
require("Module:debug/track"){tracking_page, tracking_page .. "/" .. code}
else
require("Module:debug/track")(tracking_page)
end
return true
end
-- Temporarily convert various formatting characters to PUA to prevent them from being disrupted by the substitution process.
-- TODO: Handle arbitrary number of capture groups.
local function doTempSubstitutions(text, subbedChars, keepCarets, noTrim)
-- Cloning the table locally is much faster.
local patterns = mw.clone(require("Module:languages/data/patterns"))
if keepCarets then
table.insert(patterns, "((\\\\)%^)")
table.insert(patterns, "((\\)%^)")
table.insert(patterns, "((%^))")
end
local i, pe = #subbedChars, require("Module:utilities").pattern_escape
for _, pattern in ipairs(patterns) do
for m1, m2, m3, m4 in text:gmatch(pattern) do
local m, m1New = {m1, m2, m3, m4}, m1
for j = 2, #m do
subbedChars[i+j-1] = m[j]
m1New = m1New:gsub(pe(m[j]), mw.ustring.char(0x100000+i+j-1), 1)
end
text = text:gsub(pe(m1), pe(m1New), 1)
i = i + #m - 1
end
end
-- Ensure any whitespace at the beginning and end is temp substituted, to prevent it from being accidentally trimmed. We only want to trim any final spaces added during the substitution process (e.g. by a module), which means we only do this during the first round of temp substitutions. We have to use ustring due to the PUA chars, and gsub because ustring's gmatch gets stuck in infinite loops due to a bug.
if not noTrim then
for _, pattern in ipairs{"^([-]*(%s+))", "((%s+)[-]*)$"} do
text = mw.ustring.gsub(text, pattern, function(m1, m2)
local m, m1New = {m1, m2}, m1
for j = 2, #m do
subbedChars[i+j-1] = m[j]
m1New = mw.ustring.gsub(m1New, pe(m[j]), mw.ustring.char(0x100000+i+j-1), 1)
end
i = i + #m - 1
return m1New
end)
end
end
return text, subbedChars
end
-- Reinsert any formatting that was temporarily substituted.
local function undoTempSubstitutions(text, subbedChars)
local pe = require("Module:utilities").pattern_escape
for i = 1, #subbedChars do
text = text:gsub(mw.ustring.char(0x100000+i), pe(subbedChars[i]))
end
return text
end
-- Convert any HTML entities.
local function noEntities(text)
if text:find("&[^;]+;") then
return require("Module:utilities").get_entities(text)
else
return text
end
end
-- Check if the raw text is an unsupported title, and if so return that. Otherwise, remove HTML entities. We do the pre-conversion to avoid loading the unsupported title list unnecessarily.
local function checkNoEntities(text)
local textNoEnc = noEntities(text)
if textNoEnc ~= text and mw.loadData("Module:links/data").unsupported_titles[text] then
return text
else
return textNoEnc
end
end
-- If no script object is provided (or if it's invalid or None), get one.
local function checkScript(text, self, sc)
if type(sc) ~= "table" or sc._type ~= "script object" or sc:getCode() == "None" then
return self:findBestScript(text)
else
return sc
end
end
local function normalize(text, sc)
text = sc:fixDiscouragedSequences(text)
return sc:toFixedNFD(text)
end
-- Convert risky characters to HTML entities, which minimizes interference once returned (e.g. for "sms:a", "<!-- -->" etc.).
local function escapeRiskyChars(text)
for _, pattern in ipairs(mw.clone(require("Module:languages/data/patterns"))) do
text = text:gsub(pattern, function(cap1) return mw.text.encode(cap1, "\"'") end)
end
-- Spacing characters in isolation generally need to be escaped in order to be properly processed by the MediaWiki software.
if not mw.ustring.find(text, "%S") then
return mw.text.encode(text, "%s")
else
return mw.text.encode(text, "#%%&+/:<=>@[\\%]_{|}")
end
end
-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate over each one to apply substitutions. This avoids putting PUA characters through language-specific modules, which may be unequipped for them.
local function iterateSectionSubstitutions(text, subbedChars, keepCarets, self, sc, substitution_data, function_name)
local pe = require("Module:utilities").pattern_escape
local fail, cats, sections = nil, {}
-- See [[Module:languages/data]].
if mw.loadData("Module:languages/data").contiguous_substitution[self:getCode()] then
sections = {text}
else
sections = mw.text.split(text, "[-]")
end
for i, section in ipairs(sections) do
-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated modules).
if section:gsub("%s", "") ~= "" then
local sub, sub_fail, sub_cats = require("Module:languages/doSubstitutions")(section, self, sc, substitution_data, function_name)
-- Second round of temporary substitutions, in case any formatting was added by the main substitution process. However, don't do this if the section contains formatting already (as it would have had to have been escaped to reach this stage, and therefore should be given as raw text).
if sub and subbedChars then
local noSub; for _, pattern in ipairs(mw.clone(require("Module:languages/data/patterns"))) do
if section:match(pattern) then noSub = true end
end
if not noSub then
sub, subbedChars = doTempSubstitutions(sub, subbedChars, keepCarets, true)
end
end
if (not sub) or sub_fail then
text = sub
fail = sub_fail
cats = sub_cats or {}
break
end
text = sub and text:gsub(pe(section), pe(sub), 1) or text
if type(sub_cats) == "table" then
for _, cat in ipairs(sub_cats) do
table.insert(cats, cat)
end
end
end
end
-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
text = text and mw.ustring.gsub(text, "^([-]*)%s+(%S)", "%1%2")
text = text and mw.ustring.gsub(text, "(%S)%s+([-]*)$", "%1%2")
-- Remove duplicate categories.
if #cats > 1 then
cats = require("Module:table").removeDuplicates(cats)
end
return text, fail, cats, subbedChars
end
-- Process carets (and any escapes). Default to simple removal, if no pattern/replacement is given.
local function processCarets(text, pattern, repl)
text = text
:gsub("\\\\^", "\1^")
:gsub("\\^", "\2")
return mw.ustring.gsub(text, pattern or "%^", repl or "")
:gsub("\1", "\\")
:gsub("\2", "^")
end
-- Remove carets if they are used to capitalize parts of transliterations (unless they have been escaped).
local function removeCarets(text, sc)
if not sc:hasCapitalization() and sc:isTransliterated() and text:find("%^") then
return processCarets(text)
else
return text
end
end
local Language = {}
function Language:getCode()
return self._code
end
function Language:getCanonicalName()
return self._rawData[1]
end
function Language:getDisplayForm()
return self:getCanonicalName()
end
function Language:getOtherNames(onlyOtherNames)
self:loadInExtraData()
return require("Module:language-like").getOtherNames(self, onlyOtherNames)
end
function Language:getAliases()
self:loadInExtraData()
return self._extraData.aliases or {}
end
function Language:getVarieties(flatten)
self:loadInExtraData()
return require("Module:language-like").getVarieties(self, flatten)
end
function Language:getType()
return self._rawData.type or "regular"
end
function Language:getWikimediaLanguageCodes()
if not self._wikimediaLanguageCodes then
self._wikimediaLanguageCodes = type(self._rawData.wikimedia_codes) == "table" and self._rawData.wikimedia_codes or type(self._rawData.wikimedia_codes) == "string" and mw.text.split(self._rawData.wikimedia_codes, "%s*,%s*") or {self:getCode()}
end
return self._wikimediaLanguageCodes
end
function Language:getWikimediaLanguages()
if not self._wikimediaLanguageObjects then
local m_wikimedia_languages = require("Module:wikimedia languages")
self._wikimediaLanguageObjects = {}
local wikimedia_codes = self:getWikimediaLanguageCodes()
for _, wlangcode in ipairs(wikimedia_codes) do
table.insert(self._wikimediaLanguageObjects, m_wikimedia_languages.getByCode(wlangcode))
end
end
return self._wikimediaLanguageObjects
end
function Language:getWikipediaArticle()
if self._rawData.wikipedia_article then
return self._rawData.wikipedia_article
elseif self._wikipedia_article then
return self._wikipedia_article
elseif self:getWikidataItem() and mw.wikibase then
self._wikipedia_article = mw.wikibase.sitelink(self:getWikidataItem(), 'enwiki')
end
if not self._wikipedia_article then
self._wikipedia_article = self:getCategoryName():gsub("Creole language", "Creole")
end
return self._wikipedia_article
end
function Language:makeWikipediaLink()
return "[[w:" .. self:getWikipediaArticle() .. "|" .. self:getCanonicalName() .. "]]"
end
function Language:getWikidataItem()
local item = self._rawData[2]
if type(item) == "number" then
return "Q" .. item
else
return item
end
end
function Language:getScriptCodes()
if not self._scriptCodes then
self._scriptCodes = type(self._rawData[4]) == "table" and self._rawData[4] or type(self._rawData[4]) == "string" and mw.text.split(self._rawData[4], "%s*,%s*") or {"None"}
end
return self._scriptCodes
end
function Language:getScripts()
if not self._scriptObjects then
local m_scripts = require("Module:scripts")
self._scriptObjects = {}
if self:getScriptCodes()[1] == "All" then
self._scriptObjects = mw.loadData("Module:scripts/data")
else
for _, sc in ipairs(self:getScriptCodes()) do
table.insert(self._scriptObjects, m_scripts.getByCode(sc))
end
end
end
return self._scriptObjects
end
-- Find the best script to use, based on the characters of a string. If forceDetect is set, run the detection algorithm even if there's only one possible script; in that case, if the text isn't in the script, the return value will be None.
function Language:findBestScript(text, forceDetect)
if (not text) or text == "" or text == "-" then
return require("Module:scripts").getByCode("None")
end
if table.concat(self:getScriptCodes()) == "All" then
return require("Module:scripts").findBestScriptWithoutLang(text)
end
local scripts = self:getScripts()
if not scripts[2] and not forceDetect then
-- Necessary, because Hani covers the entire Han range (while the Hant & Hans lists don't list shared characters).
if scripts[1]:getCode():match("^Han") and require("Module:scripts").getByCode("Hani"):countCharacters(text) > 0 then
return scripts[1]
elseif scripts[1]:countCharacters(text) > 0 then
return scripts[1]
else
return require("Module:scripts").getByCode("None")
end
end
return require("Module:languages/findBestScript")(export, self, text, scripts, forceDetect)
end
function Language:getFamily()
if self._familyObject then
return self._familyObject
end
if self._rawData[3] then
self._familyObject = require("Module:families").getByCode(self._rawData[3])
end
return self._familyObject
end
function Language:getAncestorCodes()
if not self._ancestorCodes then
self._ancestorCodes = type(self._rawData.ancestors) == "table" and self._rawData.ancestors or type(self._rawData.ancestors) == "string" and mw.text.split(self._rawData.ancestors, "%s*,%s*") or nil
end
return self._ancestorCodes
end
function Language:getAncestors()
if not self._ancestorObjects then
self._ancestorObjects = {}
local ancestors
if self._rawData.ancestors then
ancestors = self:getAncestorCodes()
for _, ancestor in ipairs(ancestors) do
table.insert(self._ancestorObjects, export.getByCode(ancestor) or require("Module:etymology languages").getByCode(ancestor))
end
else
local fam = self:getFamily()
local protoLang = fam and fam:getProtoLanguage() or nil
-- For the case where the current language is the proto-language
-- of its family, we need to step up a level higher right from the start.
if protoLang and protoLang:getCode() == self:getCode() then
fam = fam:getFamily()
protoLang = fam and fam:getProtoLanguage() or nil
end
while not protoLang and not (not fam or fam:getCode() == "qfa-not") do
fam = fam:getFamily()
protoLang = fam and fam:getProtoLanguage() or nil
end
table.insert(self._ancestorObjects, protoLang)
end
end
return self._ancestorObjects
end
local function iterateOverAncestorTree(node, func)
for _, ancestor in ipairs(node:getAncestors()) do
if ancestor then
local ret = func(ancestor) or iterateOverAncestorTree(ancestor, func)
if ret then
return ret
end
end
end
end
function Language:getAncestorChain()
if not self._ancestorChain then
self._ancestorChain = {}
local step = self
while true do
local ancestors = step:getAncestors()
step = #ancestors == 1 and ancestors[1] or nil
if not step then break end
table.insert(self._ancestorChain, 1, step)
end
end
return self._ancestorChain
end
function Language:hasAncestor(otherlang)
local function compare(ancestor)
return ancestor:getCode() == otherlang:getCode()
end
return iterateOverAncestorTree(self, compare) or false
end
function Language:getCategoryName(nocap)
local name = self:getCanonicalName()
-- If the name already has "language" in it, don't add it.
if not name:find("[Ll]anguage$") then
name = name .. " language"
end
if not nocap then
name = mw.getContentLanguage():ucfirst(name)
end
return name
end
function Language:makeCategoryLink()
return "[[:Category:" .. self:getCategoryName() .. "|" .. self:getDisplayForm() .. "]]"
end
function Language:getStandardCharacters()
return self._rawData.standardChars
end
-- Make the entry name (i.e. the correct page name).
function Language:makeEntryName(text, sc, escape_characters)
if (not text) or text == "" then
return text, nil, {}
end
-- Remove bold, italics, soft hyphens, strip markers and HTML tags.
text = text
:gsub("('*)'''(.-'*)'''", "%1%2")
:gsub("('*)''(.-'*)''", "%1%2")
:gsub("\194\173", "") -- U+00AD SOFT HYPHEN, written as UTF-8 bytes
text = mw.text.unstrip(text)
:gsub("<[^<>]+>", "")
-- Don't remove italics, as that would allow people to use it instead of {{m}} etc.
local textWithEnc, unsupported = text
text = mw.uri.decode(text, "PATH")
text = noEntities(text)
-- Check if the text is an interwiki link.
if text:find(":") and text ~= ":" then
local m_utildata, lower = mw.loadData("Module:utilities/data"), require("Module:string utilities").lower
-- If this is a link to another namespace or an interwiki link, ensure there's an initial colon and then return what we have (so that it works as a conventional link, and doesn't do anything weird like add the term to a category).
local prefix = mw.ustring.match(text, "^:*[-]*([^:]*)[-]*:")
prefix = prefix and lower(prefix)
if m_utildata.namespaces[prefix] or m_utildata.interwikis[prefix] then
return ":" .. text:gsub("^:+", ""), nil, {}
end
-- If it would be an interwiki link, if not for any escaped colons, then set `unsupported` as true.
prefix = mw.ustring.match(text, "^[-]*([^:]*)[-]*\\:")
prefix = prefix and lower(prefix)
if m_utildata.interwikis[prefix] or m_utildata.namespaces[prefix] then
unsupported = true
end
prefix, m_utildata = nil
end
-- Convert any escaped colons.
text = text:gsub("\\:", ":")
textWithEnc = textWithEnc:gsub("\\:", ":")
-- Check if the text is a listed unsupported title (with and without converting percent encoding/HTML entities).
local unsupportedTitles = mw.loadData("Module:links/data").unsupported_titles
if unsupportedTitles[text] or unsupportedTitles[textWithEnc] then
return "Unsupported titles/" .. (unsupportedTitles[text] or unsupportedTitles[textWithEnc]), nil, {}
end
sc = checkScript(text, self, sc)
local fail, cats
text = normalize(text, sc)
text, fail, cats = iterateSectionSubstitutions(text, nil, nil, self, sc, self._rawData.entry_name, "makeEntryName")
text = removeCarets(text, sc)
text = mw.ustring.gsub(text, "^([-]*)[¿¡]?([-]*)(.-[^%s%p].-)([-]*)%s*([-]*)[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?([-]*)$", "%1%2%3%4%5%6") or text
text = escape_characters == false and text or escapeRiskyChars(text)
text = unsupported and "Unsupported titles/" .. text or text
return text, fail, cats
end
-- Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.
function Language:generateForms(text, sc)
if self._rawData.generate_forms then
sc = checkScript(text, self, sc)
return require("Module:" .. self._rawData.generate_forms).generateForms(text, self:getCode(), sc:getCode())
else
return {text}
end
end
function Language:makeSortKey(text, sc)
if (not text) or text == "" then
return text, nil, {}
end
-- Remove soft hyphens, strip markers and HTML tags.
text = text:gsub("\194\173", "") -- U+00AD SOFT HYPHEN, written as UTF-8 bytes
text = mw.text.unstrip(text)
:gsub("<[^<>]+>", "")
text = mw.uri.decode(text, "PATH")
text = checkNoEntities(text)
-- Remove initial hyphens and *.
text = mw.ustring.gsub(text, "^([-]*)[-־ـ᠊*]+([-]*)(.)", "%1%2%3")
sc = checkScript(text, self, sc)
text = normalize(text, sc)
text = removeCarets(text, sc)
-- For languages with dotted dotless i, ensure that "İ" is sorted as "i", and "I" is sorted as "ı".
if self._rawData.dotted_dotless_i then
text = text
:gsub(mw.ustring.toNFD("İ"), "i")
:gsub("I", "ı")
text = sc:toFixedNFD(text)
end
-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is necessary so as to prevent "i" and "ı" both being sorted as "I".
local fail, cats
text = require("Module:string utilities").lower(text)
text, fail, cats = iterateSectionSubstitutions(text, nil, nil, self, sc, self._rawData.sort_key, "makeSortKey")
if self._rawData.dotted_dotless_i and not self._rawData.sort_key then
text = text
:gsub("ı", "I")
:gsub("i", "İ")
text = sc:toFixedNFC(text)
end
text = require("Module:string utilities").upper(text)
-- Remove parentheses, as long as they are either preceded or followed by something.
text = text
:gsub("(.)[()]+", "%1")
:gsub("[()]+(.)", "%1")
return escapeRiskyChars(text), fail, cats
end
-- Create the form used as a basis for display text and transliteration.
local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)
	local subbedChars = {}
	text, subbedChars = doTempSubstitutions(text, subbedChars, keepCarets)
	text = mw.uri.decode(text, "PATH")
	text = checkNoEntities(text)
	sc = checkScript(text, self, sc)
	local fail, cats
	text = normalize(text, sc)
	text, fail, cats, subbedChars = iterateSectionSubstitutions(text, subbedChars, keepCarets, self, sc, self._rawData.display_text, "makeDisplayText")
	text = removeCarets(text, sc)

	-- Remove any interwiki link prefixes (unless they have been escaped or this has been disabled).
	if text:find(":") and not keepPrefixes then
		local m_utildata, lower = mw.loadData("Module:utilities/data"), require("Module:string utilities").lower
		text = text
			:gsub("\\\\:", "\1:")
			:gsub("\\:", "\2")
		local prefix, oldText = mw.ustring.match(text, "^[-]*([^:]*)[-]*:"), text
		local lower_prefix = prefix and lower(prefix)
		while m_utildata.interwikis[lower_prefix] or prefix == "" do
			oldText = text
			-- Escape pattern magic characters in the prefix (e.g. the hyphens in "zh-min-nan") so that it is
			-- matched literally; otherwise the prefix is never removed and the loop cannot terminate.
			text = text:gsub("^([-]*)" .. prefix:gsub("%p", "%%%0") .. "([-]*):", "%1%2")
			prefix = mw.ustring.match(text, "^[-]*([^:]*)[-]*:")
			lower_prefix = prefix and lower(prefix)
		end
		-- If the whole text has been removed (i.e. the text ends with a colon), then the final prefix is not actually a prefix.
		if mw.ustring.gsub(text, "[%s-]", "") == "" then text = oldText end
		text = text
			:gsub("\1", "\\")
			:gsub("\2", ":")
	end

	return text, fail, cats, subbedChars
end
-- Make the display text (i.e. what is displayed on the page).
function Language:makeDisplayText(text, sc, keepPrefixes)
	if (not text) or text == "" then
		return text, nil, {}
	end

	local fail, cats, subbedChars
	text, fail, cats, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)
	text = escapeRiskyChars(text)
	return undoTempSubstitutions(text, subbedChars), fail, cats
end
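-- Usage sketch for Language:makeDisplayText (illustrative only; not part of the module's logic).
-- Assumptions: the call is made from another module, the variable names are hypothetical, and the
-- script argument is omitted on the assumption that checkScript resolves a script when none is given.
--   local lang = require("Module:languages").getByCode("en", true)
--   local display, fail, cats = lang:makeDisplayText("word")
--   -- `display` is the text to show on the page; `fail` and `cats` are passed through from the
--   -- display_text substitutions. A nil or empty `text` is returned unchanged, with cats = {}.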
function Language:transliterate(text, sc, module_override)
	-- If the language has no transliteration data and there is no override, return nil and flag failure.
	-- If there is no text (or it is just "-"), return it unchanged.
	if not (self._rawData.translit or module_override) then
		return nil, true, {}
	elseif (not text) or text == "" or text == "-" then
		return text, nil, {}
	end

	-- If the script is not transliteratable (and no override is given), return nil.
	sc = checkScript(text, self, sc)
	if not (sc:isTransliterated() or module_override) then
		return nil, true, {}
	end

	-- Remove any strip markers.
	text = mw.text.unstrip(text)

	-- Get the display text with the keepCarets flag set.
	local fail, cats, subbedChars
	text, fail, cats, subbedChars = processDisplayText(text, self, sc, true)

	-- Transliterate (using the module override if applicable).
	text, fail, cats, subbedChars = iterateSectionSubstitutions(text, subbedChars, true, self, sc, module_override or self._rawData.translit, "tr")

	-- Incomplete transliterations return nil.
	-- FIXME: Handle transliterations with characters that are in both Latn/Latinx and a transliteratable script (e.g. U+A700-U+A707 are in Latinx and Hani).
	if (not text) or sc:countCharacters(text) > 0 then
		return nil, true, cats
	end

	text = escapeRiskyChars(text)
	text = undoTempSubstitutions(text, subbedChars)

	-- If the script does not use capitalization, then capitalize any letters of the transliteration which are immediately preceded by a caret (and remove the caret).
	if text and not sc:hasCapitalization() and text:match("%^") then
		text = processCarets(text, "%^([-]*[^-])", require("Module:string utilities").upper)
	end

	-- Track module overrides.
	if module_override ~= nil then
		track("module_override")
	end

	return text, fail, cats
end
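-- Usage sketch for Language:transliterate (illustrative only; variable names are hypothetical).
-- Assumptions: the language chosen has transliteration data, and the script argument may be omitted
-- because checkScript resolves one from the text.
--   local lang = require("Module:languages").getByCode("ru", true)
--   local tr, fail, cats = lang:transliterate("привет")
--   if tr == nil then
--       -- For non-empty input this means one of: no transliteration data or override, a script that
--       -- is not transliterated, or an incomplete transliteration. `fail` is true in all three cases.
--   end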
function Language:overrideManualTranslit()
	return not not self._rawData.override_translit
end

function Language:hasTranslit()
	return not not self._rawData.translit
end

function Language:link_tr()
	return not not self._rawData.link_tr
end

-- Provides a way to apply a substitution method via gsub (or a series of gsubs), where the output is dependent on every substitution being successful (e.g. in a term).
function Language:gsubSubstitutions(text, sc, method, patterns)
	local get_entities = require("Module:utilities").get_entities
	local categories, section_categories, fail, fail_message = {}

	local function process_section(pre, section, post)
		section = get_entities(section)
		section, fail, section_categories = self[method](self, section, sc)
		if type(section_categories) == "table" then
			for i, category in ipairs(section_categories) do
				table.insert(categories, category)
			end
		end
		if fail then
			fail_message = section
			categories = section_categories
		end
		return (pre or "") .. (section or "") .. (post or "")
	end

	for i, pattern in ipairs(patterns) do
		text = text:gsub(pattern, process_section)
		if fail then break end
	end

	return (fail_message or text), fail, categories
end
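-- Usage sketch for Language:gsubSubstitutions (illustrative only; the pattern shown is hypothetical).
-- Each pattern is expected to have three captures, matching process_section(pre, section, post): the
-- text before the section, the section to process, and the text after it. Only the middle capture is
-- passed to the named method. `lang` and `sc` are assumed to be an existing Language object and script.
--   local text, fail, cats = lang:gsubSubstitutions(
--       "[[foo]] [[bar]]", sc, "makeDisplayText", {"(%[%[)(.-)(%]%])"})
--   -- If any call to the method fails, `text` holds that call's first return value instead of the
--   -- substituted string, and no further patterns are applied.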
function Language:toJSON(returnTable)
	local entryNamePatterns = nil
	local entryNameRemoveDiacritics = nil

	if self._rawData.entry_name then
		entryNameRemoveDiacritics = self._rawData.entry_name.remove_diacritics
		if self._rawData.entry_name.from then
			entryNamePatterns = {}
			for i, from in ipairs(self._rawData.entry_name.from) do
				table.insert(entryNamePatterns, {from = from, to = self._rawData.entry_name.to[i] or ""})
			end
		end
	end

	local ret = {
		ancestors = self:getAncestorCodes(),
		canonicalName = self:getCanonicalName(),
		categoryName = self:getCategoryName("nocap"),
		code = self:getCode(),
		entryNamePatterns = entryNamePatterns,
		entryNameRemoveDiacritics = entryNameRemoveDiacritics,
		family = self._rawData[3],
		otherNames = self:getOtherNames(true),
		aliases = self:getAliases(),
		varieties = self:getVarieties(),
		scripts = self:getScriptCodes(),
		type = self:getType(),
		wikimediaLanguages = self:getWikimediaLanguageCodes(),
		wikidataItem = self:getWikidataItem(),
	}

	if returnTable then
		return ret
	end

	return require("Module:JSON").toJSON(ret)
end
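-- Usage sketch for Language:toJSON (illustrative only; `lang` is assumed to be a Language object).
--   local json = lang:toJSON()      -- a JSON string produced by Module:JSON
--   local data = lang:toJSON(true)  -- the Lua table itself, with fields such as data.code and data.canonicalName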
-- Do NOT use these methods!
-- All uses should be pre-approved on the talk page!
function Language:getRawData()
	return self._rawData
end

function Language:getRawExtraData()
	self:loadInExtraData()
	return self._extraData
end

Language.__index = Language

function export.getDataModuleName(code)
	if code:find("^%l%l$") then
		return "languages/data/2"
	elseif code:find("^%l%l%l$") then
		local prefix = code:sub(1, 1)
		return "languages/data/3/" .. prefix
	elseif code:find("^[%l-]+$") then
		return "languages/data/exceptional"
	else
		return nil
	end
end

function export.getExtraDataModuleName(code)
	local dataModule = export.getDataModuleName(code)
	return dataModule and dataModule .. "/extra" or nil
end
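-- Examples for the two functions above, traced by hand from the patterns they test. Only the module
-- name is computed; whether the code actually exists in that data module is not checked here.
--   export.getDataModuleName("en")        --> "languages/data/2"
--   export.getDataModuleName("ang")       --> "languages/data/3/a"
--   export.getDataModuleName("gem-pro")   --> "languages/data/exceptional"
--   export.getDataModuleName("EN")        --> nil (uppercase letters match none of the patterns)
--   export.getExtraDataModuleName("ang")  --> "languages/data/3/a/extra"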
local function getRawLanguageData(code)
	local modulename = export.getDataModuleName(code)
	return modulename and mw.loadData("Module:" .. modulename)[code] or nil
end

local function getRawExtraLanguageData(code)
	local modulename = export.getExtraDataModuleName(code)
	return modulename and mw.loadData("Module:" .. modulename)[code] or nil
end

function Language:loadInExtraData()
	if not self._extraData then
		-- load extra data from module and assign to _extraData field
		-- use empty table as a fallback if extra data is nil
		self._extraData = getRawExtraLanguageData(self:getCode()) or {}
	end
end

function export.makeObject(code, data)
	if data and data.deprecated then
		require("Module:debug").track {
			"languages/deprecated",
			"languages/deprecated/" .. code
		}
	end

	return data and setmetatable({_rawData = data, _code = code, _type = "language object"}, Language) or nil
end

function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
	if type(code) ~= "string" then
		error("The function getByCode expects a string as its first argument, but received " .. (code == nil and "nil" or "a " .. type(code)) .. ".")
	end

	local retval = export.makeObject(code, getRawLanguageData(code))
	if not retval and allowEtymLang then
		retval = require("Module:etymology languages").getByCode(code)
	end
	if not retval and allowFamily then
		retval = require("Module:families").getByCode(code)
	end
	if not retval and paramForError then
		require("Module:languages/errorGetBy").code(code, paramForError, allowEtymLang, allowFamily)
	end
	return retval
end
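-- Usage sketch for export.getByCode (illustrative only; `code` and the parameter number 1 are hypothetical).
--   local m_languages = require("Module:languages")
--   local lang = m_languages.getByCode("en")               -- nil if the code is unknown
--   local lang2 = m_languages.getByCode(code, 1)           -- error mentioning parameter 1 if unknown
--   local lang3 = m_languages.getByCode(code, true, true)  -- generic error; etymology language codes also accepted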
function export.getByName(name, errorIfInvalid)
	local byName = mw.loadData("Module:languages/by name")
	local code = byName.all and byName.all[name] or byName[name]

	if not code then
		if errorIfInvalid then
			error("The language name \"" .. name .. "\" is not valid. See [[Wiktionary:List of languages]].")
		else
			return nil
		end
	end

	return export.makeObject(code, getRawLanguageData(code))
end

function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)
	local byName = mw.loadData("Module:languages/canonical names")
	local code = byName and byName[name]

	local retval = code and export.makeObject(code, getRawLanguageData(code)) or nil
	if not retval and allowEtymLang then
		retval = require("Module:etymology languages").getByCanonicalName(name)
	end
	if not retval and allowFamily then
		local famname = name:match("^(.*) languages$")
		famname = famname or name
		retval = require("Module:families").getByCanonicalName(famname)
	end
	if not retval and errorIfInvalid then
		require("Module:languages/errorGetBy").canonicalName(name, allowEtymLang, allowFamily)
	end
	return retval
end
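-- Usage sketch for export.getByCanonicalName (illustrative only; the names looked up are examples).
--   local m_languages = require("Module:languages")
--   local lang = m_languages.getByCanonicalName("English")
--   -- With allowFamily set, a trailing " languages" is stripped before the family lookup, so
--   -- "Germanic languages" is looked up in Module:families as "Germanic" (only if no language matched).
--   local fam = m_languages.getByCanonicalName("Germanic languages", nil, nil, true)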
--[[ If language is an etymology language, iterates through parent languages
	until it finds a non-etymology language. ]]
function export.getNonEtymological(lang)
	while lang:getType() == "etymology language" do
		local parentCode = lang:getParentCode()
		lang = export.getByCode(parentCode)
			or require("Module:etymology languages").getByCode(parentCode)
			or require("Module:families").getByCode(parentCode)
	end

	return lang
end
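-- Usage sketch for export.getNonEtymological (illustrative only; `code` is a hypothetical variable).
--   local m_languages = require("Module:languages")
--   local lang = m_languages.getByCode(code, true, true)  -- may return an etymology language
--   local base = m_languages.getNonEtymological(lang)     -- first ancestor whose type is not "etymology language"
--   -- Because parents are also looked up in Module:families, `base` may be a family object.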
-- For backwards compatibility only; modules should require Module:languages/error themselves.
function export.err(lang_code, param, code_desc, template_tag, not_real_lang)
	return require("Module:languages/error")(lang_code, param, code_desc, template_tag, not_real_lang)
end

return export