it This user is a native speaker of Italian.
en-3 This user is able to contribute with an advanced level of English.

Wiktionary validator

What is it?

A command-line application that parses an enwiktionary dump file and reports an error each time it encounters a formatting inconsistency (in either layout or content).

Where is it?

Well, at the moment it is just on my HDD, but if you are interested in looking at the code I'll upload it to GitHub (it is written in Go).


Some time ago I bought an electronic translator (I needed something portable and offline). It was supposed to be good (no names), but it left a lot to be desired. While looking for an alternative I found Wiktionary (I knew Wikipedia, of course, but not Wiktionary itself) and I thought it was what I was looking for: why buy a better translator when, on Wiktionary, anything that is missing can simply be added once and for all? It sounds way more efficient! Moreover, I could extract the data I needed from the dump file to have something usable offline. So I wrote a short command-line application to extract translations from enwiktionary (something like User:Matthias_Buchmeier's effort), but I soon realized that formatting is not exactly a constant here, so my original idea turned into coding an application to find problems and, at the same time, writing a unified formatting guide.

How does it work?

The command-line application requires a dump file (I'm currently testing on enwiktionary-20120505-pages-articles.xml.bz2) and returns a list of errors and miscellaneous statistics.
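As a rough illustration of the first step, here is a minimal sketch of streaming pages out of a dump using only the Go standard library. It is not the validator's actual code: the names page, scanPages and scanString are hypothetical, and the only assumption about the dump is the usual MediaWiki export layout (a page element containing title and revision/text). For the real .xml.bz2 file the reader would be wrapped in bzip2.NewReader from compress/bzip2.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// page holds the parts of a <page> element this sketch cares about
// (hypothetical type, not taken from the validator).
type page struct {
	Title string `xml:"title"`
	Text  string `xml:"revision>text"`
}

// scanPages streams <page> elements from r and calls fn for each one,
// so the whole dump never has to fit in memory.
func scanPages(r io.Reader, fn func(p page)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
			var p page
			if err := dec.DecodeElement(&p, &se); err != nil {
				return err
			}
			fn(p)
		}
	}
}

// scanString is a convenience wrapper for small in-memory samples.
func scanString(s string, fn func(p page)) error {
	return scanPages(strings.NewReader(s), fn)
}

func main() {
	// Tiny stand-in for a dump; a real run would open the .bz2 file and
	// pass bzip2.NewReader(file) to scanPages instead.
	sample := `<mediawiki><page><title>beer</title>` +
		`<revision><text>==English==</text></revision></page></mediawiki>`
	err := scanString(sample, func(p page) {
		fmt.Printf("%s: %d bytes of wikitext\n", p.Title, len(p.Text))
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Decoding one page element at a time keeps memory use flat, which matters for a dump of this size.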

Errors, and output in general, include:

  • Statistics (WIP)
    • Number of terms for each language, ISO code and sublanguage...
    • Number of templates, number of occurrences, number of transclusions and calls from other templates...
    • Number of entries by regexp for each heading...
  • Basic line format (DONE)
    • Empty pages
    • Multiple empty lines (lines containing only whitespace)
  • Page content (TESTING)
    • Invalid or unknown metadata at the beginning of a page
    • Missing headings
    • Headings without content
    • Invalid or unknown headings
  • Heading content (WIP)
    • Modules (make content validation modular) (TODO)
    • Alternative forms
    • Translations
    • Nyms
    • ...
  • Post validation (requires a complete list of terms for cross checking) (TODO)
  • Templates (needed to validate heading content)
    • Various statistics (WIP)
    • Template parser (TODO)
    • Find low-usage templates (never transcluded and never called by other templates) (WIP)
  • Bot-helper (use this application as a helper feeder for bots) (TODO)
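
Two of the simpler checks in the list above ("Basic line format" and "Headings without content") could look roughly like the sketch below. It is only an illustration under stated assumptions: the names lineIssue, checkBasicFormat and emptyHeadings are hypothetical, a heading is taken to be a line wrapped in the same number of = signs on both sides, and a deeper subheading is counted as content of its parent.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lineIssue is one problem at a 1-based line number (0 = whole page).
type lineIssue struct {
	Line int
	Msg  string
}

// checkBasicFormat reports empty pages and runs of consecutive blank
// (whitespace-only) lines.
func checkBasicFormat(text string) []lineIssue {
	if strings.TrimSpace(text) == "" {
		return []lineIssue{{0, "empty page"}}
	}
	var issues []lineIssue
	prevBlank := false
	for i, line := range strings.Split(text, "\n") {
		blank := strings.TrimSpace(line) == ""
		if blank && prevBlank {
			issues = append(issues, lineIssue{i + 1, "multiple empty lines"})
		}
		prevBlank = blank
	}
	return issues
}

// headingRe matches lines such as "===Noun===" (assumed heading syntax).
var headingRe = regexp.MustCompile(`^(=+)\s*(.*?)\s*(=+)\s*$`)

// emptyHeadings returns the names of headings that have neither text
// nor a deeper subheading below them.
func emptyHeadings(text string) []string {
	var empty []string
	name, level, hasContent := "", 0, true
	flush := func() {
		if level > 0 && !hasContent {
			empty = append(empty, name)
		}
	}
	for _, line := range strings.Split(text, "\n") {
		m := headingRe.FindStringSubmatch(line)
		if m != nil && len(m[1]) == len(m[3]) {
			newLevel := len(m[1])
			if newLevel > level {
				hasContent = true // a subheading counts as content
			}
			flush()
			name, level, hasContent = m[2], newLevel, false
			continue
		}
		if strings.TrimSpace(line) != "" {
			hasContent = true
		}
	}
	flush()
	return empty
}

func main() {
	text := "==English==\n===Noun===\n\n \n===Verb===\n# to do\n"
	for _, is := range checkBasicFormat(text) {
		fmt.Printf("line %d: %s\n", is.Line, is.Msg)
	}
	fmt.Println("headings without content:", emptyHeadings(text))
}
```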

The list is itself a WIP and will become more detailed in the future; anyway, it should give an idea of the direction of the project.

Unified formatting guide

(online and pdf version) TODO

Work in Progress

All sections under this heading are comments, doubts that need clarification, and wild brainstorming, so feel free to comment. Thank you! The application is currently tested on enwiktionary-20120505-pages-articles.xml.bz2, so results may differ from the current state of Wiktionary.

Metadata and data not under a heading

redirection (from beer-parlour)


Heading POS

(NTS) Misuse of context labels (October 2012): context usage.

Heading Alternative Forms

I'm trying to define a format for entries under the Alternative forms heading (nothing is defined in Wiktionary:ELE). Currently, out of a total of 59863 entries, the formatting breakdown is:

  • 29828 * [[wikified]] {{qualifier}}?
  • 19495 * {{l}} {{qualifier}}?

The rest are templates, wikified terms and plain text combined in various ways: {{term}}, {{sense}}, {{l-nn}}, {{l-nb}}, {{l}}, {{forms}}, {{nn-inf}}, {{pedlink}}, {{zh-ts}}, {{R:Webster 1913}}, {{alternative form of}}, {{seeCites}}, {{soplink}}

I think only formats that result in the same output (one wikified term per line with an optional qualifier) should be allowed: the two main ones (which should be the preferred one?) and similar cases such as {{l-nn}} and {{l-nb}}, which are Norwegian versions of {{l}}.
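
A breakdown like the one above can be produced by classifying each line under the heading with a couple of regular expressions. The sketch below is a guess at how that might be done, not the patterns actually used for the counts: the names wikifiedRe, lTemplateRe and classifyAltForm are hypothetical, and it assumes exactly one space after the bullet and before the optional {{qualifier}}.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// "* [[wikified]] {{qualifier}}?"
	wikifiedRe = regexp.MustCompile(
		`^\* \[\[[^\[\]]+\]\]( \{\{qualifier\|[^{}]+\}\})?$`)
	// "* {{l}} {{qualifier}}?", also accepting the Norwegian
	// variants {{l-nn}} and {{l-nb}}.
	lTemplateRe = regexp.MustCompile(
		`^\* \{\{l(-nn|-nb)?\|[^{}]+\}\}( \{\{qualifier\|[^{}]+\}\})?$`)
)

// classifyAltForm puts one line of an Alternative forms section into
// one of three buckets: "wikified", "l-template" or "other".
func classifyAltForm(line string) string {
	line = strings.TrimSpace(line)
	switch {
	case wikifiedRe.MatchString(line):
		return "wikified"
	case lTemplateRe.MatchString(line):
		return "l-template"
	default:
		return "other"
	}
}

func main() {
	lines := []string{
		"* [[colour]] {{qualifier|UK}}",
		"* {{l|en|colour}}",
		"* {{l-nn|farge}}",
		"* {{term|colour}}",
	}
	counts := map[string]int{}
	for _, l := range lines {
		counts[classifyAltForm(l)]++
	}
	fmt.Println(counts)
}
```

Anything that lands in the "other" bucket would then be a candidate for an error report if only the two main formats were allowed.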

Heading Translations

translations {{t}}

Translations should be short and "template only" (IMHO). It would be great (from a validator's point of view) to have only combinations of {{t}} and {{qualifier}} templates, with commas, semicolons and newlines (*::) to separate/group them. However, some cases are excluded (such as the example in http://en.wiktionary.org/wiki/Template:t), so the idea needs some work: * Arabic: {{t|ar|فراشة|sc=Arab|f|tr=fará:sha}}; (fertito) ''(Morocco)''; (fartattu) {{italbrac|Tunisia}} (Fedso (talk) 20:49, 27 June 2012 (UTC))

t-templates, template-only translations: I agree that templates should generally be used for translations. However, there is a problem with Sum-Of-Parts (SOP) translations, for which no standard formatting exists. In particular, it's not clear how to add transliterations to SOP translations and whether to use a template for them. Some SOP translations are formatted with {{el-p}} (Greek only), some with {{onym}}, and others have individually wiki-linked words. Anyhow, IMHO the practice of putting transliterations in round brackets is not a good idea, as in that case there is no easy way to (automatically) identify them as transliterations. Matthias Buchmeier (talk) 11:14, 28 June 2012 (UTC)
There is another thing you could validate for translations: genders. It may be a bit tricky because genders should only be present in noun translations, and generally not in adjective translations. And of course you need to identify which languages have genders and which don't. —CodeCat 11:26, 28 June 2012 (UTC)
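
The "template only" idea from this thread could be checked along the following lines. This is a sketch under assumptions, not the validator's code: isTemplateOnly, prefixRe and allowedTemplateRe are hypothetical names, the line is assumed to start with a "* Language:" (or "*: Sublanguage:") prefix, and only {{t}} and {{qualifier}} are accepted, so any other template would have to be added to the pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Bullet plus language name up to the first colon, e.g. "* Arabic: ".
	prefixRe = regexp.MustCompile(`^\*:*\s*[^:]+:\s*`)
	// The only templates accepted in a translation line in this sketch.
	allowedTemplateRe = regexp.MustCompile(`\{\{(t|qualifier)\|[^{}]*\}\}`)
)

// isTemplateOnly reports whether a translation line consists solely of
// {{t}} and {{qualifier}} templates separated by spaces, commas and
// semicolons, with at least one {{t}} present.
func isTemplateOnly(line string) bool {
	loc := prefixRe.FindStringIndex(line)
	if loc == nil {
		return false
	}
	body := line[loc[1]:]
	if !strings.Contains(body, "{{t|") {
		return false
	}
	rest := allowedTemplateRe.ReplaceAllString(body, "")
	return strings.Trim(rest, " ,;") == ""
}

func main() {
	lines := []string{
		"* Italian: {{t|it|birra|f}}, {{t|it|cervogia|f}} {{qualifier|archaic}}",
		"* Arabic: {{t|ar|فراشة|sc=Arab|f|tr=fará:sha}}; (fertito) ''(Morocco)''",
	}
	for _, l := range lines {
		fmt.Printf("%v  %s\n", isTemplateOnly(l), l)
	}
}
```

With this rule the Arabic example quoted above is rejected, because the bracketed transliteration and the italic region label sit outside any template.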


If multi-line is not necessary (or not to be used, since it can get messy if {{trans-mid}} ends up in the wrong place), a semicolon could be used instead of a newline (as in yours#Translations -> possessive pronoun -> Italian). I'd like to see a single way to format translations; I like the "* language, *: sublanguage, *:: newline" format, as in cousin#Translations -> "nephew or niece of a parent" -> Chinese. Only the language line uses a bullet list, and definitions are one per line for complex cases (as in Mandarin) and on a single line for simple cases (as in Min Nan). (Fedso (talk) 20:49, 27 June 2012 (UTC))

multi-line: As far as I understand, multi-line translations are discouraged. Matthias Buchmeier (talk) 11:14, 28 June 2012 (UTC)

Templates & Magic words


(NTS) Grease pit: Template:recons or Template:proto? (proto deprecated)


There are about 15000 uses of {{DEFAULTSORT}}; isn't it possible to automatically index terms in categories with diacritics removed, instead of having to add {{DEFAULTSORT}} to each page?

{{qualifier}} / {{italbrac}}

Are (''text''), {{italbrac|text}} and {{qualifier|text}} equivalent? Wouldn't it be better to use only one? (Fedso (talk) 20:49, 27 June 2012 (UTC))

Re italics: {{qualifier}} should be used, yes. {{italbrac}} has been deleted. - -sche (discuss) 00:44, 28 June 2012 (UTC)
I have changed the four remaining mainspace uses of {{italbrac}} to {{qualifier}}, so a validator can treat any appearance of {{italbrac}} in NS:0 as an error. - -sche (discuss) 07:52, 22 September 2012 (UTC)
Great, thanks! Fedso (talk) 19:34, 24 September 2012 (UTC)
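
Following the point above that any {{italbrac}} in NS:0 can be treated as an error, such a rule might be sketched as below. The names deletedTemplates, templateNameRe and deletedTemplateErrors are hypothetical, and the table holds only the one pair mentioned in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// deletedTemplates maps a deleted template to its replacement. Only
// the pair discussed above is listed; the table itself is hypothetical.
var deletedTemplates = map[string]string{"italbrac": "qualifier"}

// templateNameRe captures the name of each template call, e.g. "t" in
// "{{t|ar|...}}".
var templateNameRe = regexp.MustCompile(`\{\{\s*([^|{}]+?)\s*[|}]`)

// deletedTemplateErrors returns one message per use of a deleted
// template, but only for pages in the main namespace (ns == 0).
func deletedTemplateErrors(ns int, text string) []string {
	if ns != 0 {
		return nil
	}
	var errs []string
	for _, m := range templateNameRe.FindAllStringSubmatch(text, -1) {
		if repl, ok := deletedTemplates[m[1]]; ok {
			errs = append(errs,
				fmt.Sprintf("{{%s}} is deleted, use {{%s}}", m[1], repl))
		}
	}
	return errs
}

func main() {
	text := "* Arabic: {{t|ar|x}} {{italbrac|Tunisia}}"
	for _, e := range deletedTemplateErrors(0, text) {
		fmt.Println(e)
	}
}
```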