
Wiktionary:Beer parlour/2013/December



Middle Chinese

There seems to be consensus on WT:RFM#Template:ltc that "Late Middle Chinese" (which has the ISO code ltc) should not be distinguished from "Middle Chinese" (which has the code zhx-mid). The question is: does that mean we should simply delete ltc, switching uses of it to zhx-mid? Or would it be better to delete the exceptional code zhx-mid and keep the (repurposed, renamed) 'official' code ltc? Both codes are used by hundreds of entries. - -sche (discuss) 19:52, 2 December 2013 (UTC)

I support the latter, and that's how I interpreted that discussion as well. —Μετάknowledgediscuss/deeds 06:20, 3 December 2013 (UTC)
AFAICT, the only person who took a position (at RFM) on which code to keep was CodeCat. - -sche (discuss) 07:10, 3 December 2013 (UTC)
Seeing some support for and no opposition to replacing all instances of zhx-mid with ltc, so I have requested that someone with a bot make the replacement: Wiktionary:Grease pit/2014/January#Bot_request:_replace_zhx-mid_with_ltc. - -sche (discuss) 22:38, 26 January 2014 (UTC)
  • As a late comment, I just want to note that I (and possibly other JA editors) have been using the ltc code for Middle Chinese ever since that discussion. So you'll probably find lots of JA entries using ltc in the etym sections, but not many using zhx-mid. FWIW. ‑‑ Eiríkr Útlendi │ Tala við mig 00:43, 28 January 2014 (UTC)
As a final update: zhx-mid has been orphaned and deleted. - -sche (discuss) 01:40, 18 February 2014 (UTC)

Oropom language

Is this real? I'm inclined to believe it is, in which case we should assign it a code, but then again it is a bit suspicious, being from only one paper, and Wikipedia calls it "dubious". —Μετάknowledgediscuss/deeds 06:19, 3 December 2013 (UTC)

If it's not simply bogus — and that's a big if — Wilson's list is such a patchwork that it would be difficult to know which of the words it contains are truly Oropom and which are "gaps in the semi-speaker's memory, which she filled with other languages when pressed to provide a word she did not know" (as Mostafa Lameen Souag puts it in his review of it).
Notably, that is the most benign of the "three theoretically possible explanations for its patchwork character" Souag lists; the others are "that the language was made up on the spot by the woman in question; [and] that Wilson made it up". Souag considers the third possibility implausible, but notes that the speaker would have had a "motive [] to please the stranger asking for words, and perhaps earn a reward; in Wilson's own article, he mentions having to deal with 'a number of charlatans' after word got around that he was 'on the look-out for people who had some knowledge of the language'."
Bernd Heine surveyed the area less than ten years after Wilson, and found no evidence of the language whatsoever, and Lionel Bender and Roger Blench are both of the opinion that it was made up as a joke. - -sche (discuss) 08:08, 3 December 2013 (UTC)

Different transliterations for the same language

We seem to be OK with using multiple transliterations for languages like Burmese and Korean, but when only one transliteration can be supplied, the same system is always chosen. How about languages with sublects that might need a different transliteration system?

For example, we merged hbo (Biblical Hebrew) with he (Hebrew), but many etymologies still use a traditional scholarly transliteration (which is only appropriate for the ancient variety of Hebrew) when showing derivations from the Torah's vocabulary. Similarly, there is a proposal to merge gkm (Byzantine Greek) into grc (Ancient Greek), and if we do that, it may be advantageous to transliterate it as if it were el (Greek), because the literary language hadn't changed as much as the pronunciation had (so since <β> is /b~β/ in grc, /β~v/ in gkm, and /v/ in el, it should be transliterated more or less accordingly).

How do we feel about how we're handling the various transliteration systems for the same L2s, and about allowing transliterations to differ by chronolect? —Μετάknowledgediscuss/deeds 01:48, 8 December 2013 (UTC)

I think the problem here is the association between transliteration and pronunciation. We should find a transliteration that would satisfy both Ancient and Byzantine Greek. Same goes for Hebrew, since all modern pronunciation does is merge some phonemes, there is nothing wrong with using a more detailed Biblical-compatible transliteration system for Modern Hebrew. Especially since, even in Modern Hebrew, these distinctions are maintained in the rare case that text is written fully vocalized. --WikiTiki89 01:54, 8 December 2013 (UTC)
I don't really agree with that, because it sounds to me like making transliterations less user-friendly with high cost, and that is a Bad Thing. Atelaes (talkcontribs) has talked about this before — transliteration is an aid or guide, and should not be exhaustive if the native script can show the intricacies better. However, that's different than a distinct sublect's case, because how can you satisfy two lects with a binary (or trinary etc) choice? How would you transliterate beta per my example above? Homer said /b/ but Greeks from the last millennium (in fact, probably last millennium and a half, at least colloquially) have all said /v/. —Μετάknowledgediscuss/deeds 02:12, 8 December 2013 (UTC)
Re: "should not be exhaustive if the native script can show the intricacies better": The Hebrew script cannot show all the intricacies, most notably the differences between shva na' and shva nach, and between kamatz katan and kamatz gadol.
Re. pronunciation differences: Nevertheless, the writing didn't change. In Spanish, "b" is often pronounced like a "v" (more precisely [β̞]), that doesn't mean we have to respell these Spanish words with a "v". So then why should we do this for Byzantine Greek? --WikiTiki89 02:36, 8 December 2013 (UTC)
@Wikitiki89: If we want to break the association between transliteration and pronunciation, then how about we just line up the Hebrew characters in Unicode order, and assign them ASCII-range characters starting at U+0021? For example, בֶּן־(bén-, son of) would become TEJbL (or TJEbL, if we undo NFC before transliterating). Much more consistent and logical, and completely Luifiable. I can't imagine why we aren't already doing it that way —RuakhTALK 03:23, 8 December 2013 (UTC)
We could do that, but it is generally a better idea to use characters that are commonly used to transliterate a particular grapheme, which is why ben- is a much more reasonable transliteration. All sarcasm aside, obviously there has to be some connection between transliteration and pronunciation; what I meant is that we should focus on what is written more than on what is pronounced. In the specific case of Hebrew, if we use the scholarly transliteration system, the modern pronunciation can be derived simply by converting "ḇ" to "v", "ḵ" to "kh", "p̄" to "f", "ḥ" to "kh", and "š" to "sh" (most of which are pretty logical and/or have precedent in many other transliteration systems we already have on Wiktionary) and then ignoring all other diacritics and symbols. Transliterations are generally not meant for people who are learning the language (those people should have learned the native script already), but for people who are curious about a few of the words or are comparing it with other languages. These people generally don't care how closely the transliteration resembles modern pronunciation (if they want to pronounce it, we have the pronunciation section), but would be more interested in the morphology of the word, which is very hard to see with our current transliteration system. Not to mention that transliterations are currently the only way we can distinguish placement of stress, the types of kamatzes, the types of shvas (which we do not currently consistently distinguish anyway, especially right before an alef or ayin), and the types of dageshes (which we currently don't distinguish at all). Someone who does not know all the nuanced rules of how to tell these apart will not be able to find some of this information on Wiktionary. Someone who does know the rules will be stumped by exceptions such as בָּתִּים(bātīm).
There is nothing wrong, per se, with our current system; it just is not sufficient, and unless we want to have two (or more) transliteration systems, we need to use something less lossy. --WikiTiki89 04:06, 8 December 2013 (UTC)
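The conversion rule described in the comment above can be sketched in Python. This is only a minimal illustration, not an existing Wiktionary module: the function name `modernize` is hypothetical, and reading "ignoring all other diacritics and symbols" as "drop every remaining non-ASCII character" (including ʾ, ʿ and ᵊ) is an assumption.

```python
import re
import unicodedata

# The five substitutions named in the comment, written against
# NFD-decomposed text (base letter + combining mark).
RULES = {
    "b\u0331": "v",   # ḇ (b + macron below)
    "k\u0331": "kh",  # ḵ (k + macron below)
    "p\u0304": "f",   # p̄ (p + macron)
    "h\u0323": "kh",  # ḥ (h + dot below)
    "s\u030c": "sh",  # š (s + caron)
}
_PATTERN = re.compile("|".join(re.escape(key) for key in RULES))


def modernize(scholarly):
    """Approximate a modern reading from a scholarly transliteration."""
    decomposed = unicodedata.normalize("NFD", scholarly)
    replaced = _PATTERN.sub(lambda m: RULES[m.group(0)], decomposed)
    # Assumption: everything that is not an ASCII letter, space or hyphen
    # counts as a diacritic or symbol to be ignored.
    return "".join(
        ch for ch in replaced
        if ch.isascii() and (ch.isalpha() or ch in " -")
    )


print(modernize("ṭōḇ šéḇeṯ"))
```

The sketch says nothing about whether the result is a good transliteration for Modern Hebrew; it only shows that the mapping in this direction is mechanical, while the reverse direction is not.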
But if we transliterate אורגניזציה‎ as ōrgānīzāṣyā, or פלירטוט‎ as p̄līrṭūṭ (or p'lirṭūṭ?), then we have the opposite problem: instead of being "lossy", the transliteration is "gainy", in that we've invented a bunch of B.S. that has no basis in any form of Hebrew. You qualify your last statement with "unless we want to have two (or more) transliteration systems", but I'm not the one saying I don't want that; you are. —RuakhTALK 08:09, 8 December 2013 (UTC)
Well do you want to have more than one transliteration everywhere? Recent loanwords are always exceptional cases and there is no reason why we can't treat them differently from other words (somewhere at Wiktionary talk:About Arabic we discussed and agreed to treating loanwords differently from native words). --WikiTiki89 18:22, 8 December 2013 (UTC)
Neither אורגניזציה‎ nor פלירטוט‎ is exceptional IMHO; they both conform to the ordinary phonotactics of Modern Hebrew. (אורגניזציה‎, for that matter, even conforms to the ordinary phonotactics of Ancient Hebrew; its morphology is bonkers, of course, but phonologically speaking the only difficulty I see is the orthographic ō in an unstressed closed syllable, which is really an issue of orthography rather than phonology. It's only slightly related to its being a loanword.) And also, if we were to treat recent loanwords differently, then wouldn't that still amount to having more than one transliteration system?
Ultimately, I'm not sure whether I would be on board with using an Ancientish/scholarly transliteration for Ancient Hebrew; but I'm quite confident that I would not be on board with one for Modern Hebrew.
RuakhTALK 18:46, 8 December 2013 (UTC)
By "exceptional" I meant that they would be exceptions to using a scholarly transliteration scheme, because a scholarly transliteration scheme doesn't make much sense for them. The essential problem here is that we are treating what would otherwise be separate languages as one language due to the fact that the orthography and grammar are compatible. If you would like to use a scholarly transliteration for old-only words, a modernized transliteration for modern-only words, and both for still-living old words (which by some definitions could be all old words), that would be fine with me (even though it would mean using two transliterations for most words). My essential point is that all old words (even still-living ones) need a scholarly transliteration, otherwise we are intentionally leaving out valuable information from our entries. --WikiTiki89 19:27, 8 December 2013 (UTC)
I'm not sure what "valuable information" is being left out. Ivan's proposal below, which you say you're on board with, would cause the supposedly-scholarly transliteration to include exactly the same information as the Hebrew-script version next to it (even down to the reliance on the Masoretes and their notation, e.g. in not distinguishing the different sh'va and kamats pronunciations). If you're saying that the scholarly transliteration itself is the "valuable information", then it would make more sense to provide all the standard transliterations that a person might be looking for, with appropriate labeling and referencing, rather than just to tack on an additional nonstandard transliteration that we have deemed "scholarly". —RuakhTALK 20:26, 8 December 2013 (UTC)
Scholarly transliterations of Hebrew generally do differentiate the shvas and kamatzes. --WikiTiki89 20:28, 8 December 2013 (UTC)
Yes, I'm aware, but I thought that you and Ivan decided below that we should transliterate solely based on attested spellings? (Hence my phrase "supposedly-scholarly transliteration".) —RuakhTALK 21:57, 8 December 2013 (UTC)
What I meant was that I'm ok with the way he wants to treat loanwords as no different from native words. --WikiTiki89 22:56, 8 December 2013 (UTC)
Could you post some examples of quotations transliterated as you propose? From a range of sources? Say, maybe, a verse/sentence/headline or two each from Torah, (l'havdil) a Haaretz article (or similar), and something like a Facebook status or Youtube clip? —RuakhTALK 23:39, 8 December 2013 (UTC)
Proverbs 25:24 (chosen by
טוֹב שֶׁבֶת עַל פִּנַּת גָּג מֵאֵשֶׁת מדונים [מִדְיָנִים] וּבֵית חָבֶר.
ṭōḇ šéḇeṯ ʿal-pinnaṯ-gāḡ mēʾḗšeṯ miḏyānīm [mdvnym] ūḇēṯ ḥā́ḇer.
First headline I found at
החלטת הנושים של דנקנר - צעד אחד קטן בדרך לתיקון שוק ההון
haḥlāṭaṯ hannōšīm šel danḳnēr - ṣáʿad ʾeḥāḏ ḳāṭān bᵊḏéreḵ lᵊṯiḳḳūn šūḳ hahōn
I expected it to be uglier, but doesn't look bad at all. --WikiTiki89 00:36, 9 December 2013 (UTC)
I think we'll have to agree to disagree. (And I can't help but notice that you didn't do a Facebook status . . .) —RuakhTALK 19:46, 9 December 2013 (UTC)
Fine, here's a random Tweet I found:
לרקוד כל הסוף שבוע. אני צריכה לשלם שכר דירה לאוויטה #שמח אז יאללה אני אחזור לקיבוץ
lirḳōḏ kol hassōp̄ šāḇū́aʿ. ʾănī ṣᵊrīḵā lᵊšallēm śᵊḵar dīrā lᵊʾēvīṭā #śāmḗaḥ ʾāz yállā ʾănī ʾeḥzōr laḳḳibbūṣ
I understand why you disagree, but you don't seem to be suggesting any sort of compromise for words that have existed since Biblical Hebrew. --WikiTiki89 20:41, 9 December 2013 (UTC)
If I thought you were joking, I would think that Tweet transliteration was a really awesome joke. (Also, how come <ʾeḥzōr> rather than <ʾaḥăzor>? I don't understand all the logic in your examples. How are you deciding when to use grammar vs. when to use pronunciation?)
Re: "you don't seem to be suggesting any sort of compromise": Right, I'm not suggesting anything at all. I'm certainly open to possibilities, but I have to see them to know how I feel about them. (And I don't feel bad asking you to type up some examples of your proposals, because transliterations should be quick and easy, and if they're not, then the proposal is a non-starter anyway.)
RuakhTALK 21:17, 9 December 2013 (UTC)
It most certainly is not <ʾaḥăzor>. It's either <ʾeḥĕzōr> or <ʾeḥzōr> (it's unpredictable as far as I can tell, but this book seems to say it's the former so I guessed wrong). It would help if you answered this question: Is there anything specific about the transliteration scheme that you dislike, or is it just the overwhelming amount of diacritics? --WikiTiki89 21:29, 9 December 2013 (UTC)
Whoops, yes, I meant <ʾeḥĕzōr>. (As for "unpredictable", I think there were two khet sounds until Late Antiquity; presumably one was gutturaler than the other.) And — the diacritics actually don't bother me in and of themselves, but the transliteration as a whole certainly gives the Tweet a Biblical flavor. It reminds me of people who go to Jerusalem, visit Ben Yehuda Street, and think they're in the Bible. (Should Wikimilon's transliterations of English represent 'gh' as 'ח'? Should they distinguish 'ea' from 'ee'?) —RuakhTALK 22:40, 9 December 2013 (UTC)
The parts of Israel that feel like you're in the Bible is everywhere where there are people around, so certainly not Ben Yehuda Street. I see what you mean by the Biblical flavor, but I don't see a problem with that. As for the dual ח issue, I've looked into it before and didn't find any correlation between how vowels were treated around it and which of ح or خ the Arabic cognate has, although the issue may have been complicated with Arabic re-borrowing words from Aramaic (in which the two ח's also merged). As for Wikimilon's "transliterations" of English, I would have said that I believe they are meant to be phonetic, but I can't seem to find any English words on Wikimilon other than he:English, he:Hebrew, and he:berg, none of which have any transliteration. Anyway, Hebrew is not the ideal alphabet for transliterations, while the Latin alphabet has become optimized for it over the past couple of centuries. --WikiTiki89 23:01, 9 December 2013 (UTC)
Re: Wikimilon: Yeah, sorry, I just meant that question as a hypothetical analogy, not as an actual policy question. (I have less than 100 contributions there, so I generally stay out of their decision-making.) —RuakhTALK 05:12, 10 December 2013 (UTC)
  • Byzantine Greek pronunciation is already shown in {{grc-ipa-rows}}, so there is no need to further abuse the Greek "transliteration". --Ivan Štambuk (talk) 03:08, 8 December 2013 (UTC)
As a person who can't read Hebrew or can read with efforts I'm interested in having more transliteration of Hebrew in translations (which often lacks), hopefully resembling closer the pronunciation, not necessarily the spelling. In fact, I don't care if transliteration doesn't match the spelling, it is often the case with Thai, Arabic, Korean, partially Russian, Japanese hiragana where knowledge of the script only confuses, when letters are not pronounced as expected. Mismatch between spellings and reading can be explained in appendices. Just my two cents. I know some people will disagree. If we were to transliterate English or French words into Cyrillic, the result would only partially resemble the original, since it's normally done phonetically. --Anatoli (обсудить/вклад) 08:51, 8 December 2013 (UTC)
Transliteration is not pronunciation. It's transliteration: a bijective conversion of one script into another (in our case, Latin). If you are interested in how a Hebrew word is pronounced, you click it and look up its pronunciation. For some languages tr= is indeed transliteration, for some others it's apparently a mixture of transliteration and transcription, and for logogram-based scripts it's necessarily a lossy Romanization due to high ambiguity. However, for alphabetic scripts it's expected that transliteration be an actual transliteration, rather than ASCII-based dumbed-down IPA. The conversion of English and French words into Russian Cyrillic is not transliteration, but a special form of transcription (see Orthographic transcription). Both transliterations and IPA transcriptions (phonemic and phonetic) should be based on schemes established by scholars and not Wiktionary-devised ones that require additional learning. If there is a need to provide a quick-and-dirty English-based phoneticlike (perhaps something similar to "spelled pronunciations") transcription for all of the words not written in Latin script in translations, headwords and elsewhere, perhaps some other facility should be created specifically for that goal - an additional parameter, or tr= should be completely abolished for transliterations and transferred to ====Transliteration(s)==== header, or something else. --Ivan Štambuk (talk) 14:30, 8 December 2013 (UTC)
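The "bijective conversion" property can be illustrated with a short Python sketch. The four-letter table is a hypothetical subset chosen only to cover the word что that comes up later in this thread; it is not a complete or official scheme, and the function names are made up for the example.

```python
# Hypothetical four-letter subset of a letter-by-letter Russian scheme.
TO_LATIN = {
    "ч": "\u010d",  # č
    "ш": "\u0161",  # š
    "т": "t",
    "о": "o",
}

# A bijection can be inverted without losing entries.
TO_CYRILLIC = {latin: cyrillic for cyrillic, latin in TO_LATIN.items()}
assert len(TO_CYRILLIC) == len(TO_LATIN)


def to_latin(word):
    """Transliterate letter by letter, ignoring pronunciation."""
    return "".join(TO_LATIN[ch] for ch in word)


def to_cyrillic(word):
    """Recover the original spelling from the transliteration."""
    return "".join(TO_CYRILLIC[ch] for ch in word)


literal = to_latin("что")
print(literal)                        # letter-by-letter form
print(to_cyrillic(literal) == "что")  # round trip recovers the spelling
print(to_cyrillic("\u0161to"))        # a pronunciation-based form does not
```

With a one-to-one table the original spelling can always be recovered; a pronunciation-based form such as "što" maps back to a different Cyrillic string.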
I agree with everything Ivan said. --Vahag (talk) 14:41, 8 December 2013 (UTC)
I agree with most of what Ivan said. But Ivan fails to take into account the unique case of recent borrowed, dialectal, or onomatopoeic words, for which, as Ruakh points out above, (in Hebrew and Arabic at least) a scholarly transcription would result in total BS. --WikiTiki89 18:22, 8 December 2013 (UTC)
It would only be "BS" from the transliteration-as-an-approximation-of-speech perspective, which is not the purpose of transliteration at all. Transliteration and phonemic transcription (within //) have absolutely nothing to do with how the word is pronounced, and the latter has absolutely nothing to do with how the word is spelled. The first recovers the original spelling, the second phonemic contrasts. Pronunciations should be listed only in []s and some form of English-friendly "spelled pronunciations" to make them more useful to the readers that can't decode obscure IPA sounds, and/or read the original scripts. Trying to fit the first two for the purpose they were not designed to represent will and does result in a horrible cross-language mess where phonemic transcriptions have phantom phonemes, and transliterations of the same spelling vary across time periods (of what we deem "different languages"), as well as different words (examples by Ruakh that you cite). --Ivan Štambuk (talk) 18:43, 8 December 2013 (UTC)
You are missing something about "abjad" languages such as Hebrew and Arabic. In the native scripts themselves, loanwords are exceptional. Native words have well-defined vocalizations, that are usually unwritten except in certain situations. Borrowings do not have well-defined vocalizations. The native Arabic word لون(l-w-n, color) is unquestionably vocalized as لَوْن (lawn) even though it is pronounced /loːn/ in most modern dialects. The word بوسطن(b-w-s-ṭ-n, Boston), which is pronounced something like /bostˤon/ in all dialects, is ambiguously either vocalized as بُوسْطُن (būsṭun), بَوْسْطُن (bawsṭun), or with some non-standard symbols, or not vocalized at all even in otherwise vocalized text. The letter و is used to represent the /o/ sound in borrowed words, because there is no native way to do so, thus transliterating it using a pseudo-native vocalization is just plain wrong, and we should transliterate such words as "bosṭon" or "bosṭun". --WikiTiki89 19:27, 8 December 2013 (UTC)
In general, loanwords are not written with taškīl, they are most often written with full vowels (ا و ي) for /a, o~u, e~i/. It also depends on the person's knowledge of the original language. Boston is mostly pronounced as /(ˈ)bo(ː)ston/, but there is a medieval-aged common practice to Arabize western languages consonants with some of their Arabic-emphatic equivalent. So, to transliterate Boston from Arabic, it would be bōsṭon, which is the same by Hans Wehr, DIN 31635 and ALA-LC. The need to transliterate Arabized loanwords is very limited, because it is already an English word, in that example. Another note is that Latin-based script is pronounced and Arabized as spelled, but with a few exceptions, like London لندن landan. The letter و is used in Boston, but it is for /o(ː)/, not /u(ː)/. The /aw/ is never used for Boston, it may be, as I mentioned, a misused spelling pronunciation for words as Dawkins داوكينز dawkinz (short a but spelled with ا) or as its pronunciation approximation would be pronounced in English language, دوكينز dōkinz. It is understood that ō, ē can never be anything but (و, ي), since there are no other Arabic letters for them. I don't understand the purpose of transliterating Arabic or Arabized words without their unspelled vowels. Hans Wehr, DIN 31635 and ALA-LC transliterations were made to easily reconstruct their Arabic spelling. --Mahmudmasri (talk) 13:40, 13 December 2013 (UTC)
I did not mean that بوسطن is pronounced with /aw/, but that it can be vowelized with "aw" because in many dialects, "aw" is pronounced /oː/. --WikiTiki89 13:59, 13 December 2013 (UTC)
It's true that etymologically many words with /oː/, originally had /aw/, but we (all Arabic speakers) never write them with fatḥa+و to pronounce them /oː/, but just with a plain و without sukūn. --Mahmudmasri (talk) 11:43, 14 December 2013 (UTC)
There is no difference between vocalized abjads and alphabets - list all of the forms attested, and transliterate accordingly. If there is no vocalized form attested or defined by orthography, transliterate only consonants. --Ivan Štambuk (talk) 19:45, 8 December 2013 (UTC)
Ok, you've convinced me. If you manage to convince everyone else, I'll go along with it. --WikiTiki89 19:54, 8 December 2013 (UTC)
Yes. Now that we can show multiple headwords, soon also with each its own transliteration, there's no problem with just showing all possibilities. It's not so different from having more than one way to put accents on a word, like some Russian words. —CodeCat 22:15, 8 December 2013 (UTC)
Perhaps someone is excited about abjads transliterated with consonants only but not me. Dictionaries and textbooks don't do that. If we are going to see طبيعة‎ transliterated as "ṭbyʿẗ" (or similar) instead of "ṭabīʿa" or Korean 녹말 (nongmal) as "nogmal" instead of "nongmal", Japanese べんきょう as "benkyou" instead of "benkyō", particles は and へ as "ha" and "he", instead of "wa" and "e" and Russian что (što) as "čto" instead of "što", then it's better to have no transliteration at all! Just give the users links to alphabets. @Wikitiki89, I thought you had agreed with Mahmudmasri (talkcontribs) re transliteration of foreign and dialectal words in Wiktionary_talk:About_Arabic? The practice of transliterating Arabic words phonetically is Hans Wehr standard, it's also the Revised Romanisation of Korean standard and Hepburn for Japanese. --Anatoli (обсудить/вклад) 23:12, 8 December 2013 (UTC)
I didn't exactly "agree". I was simply ok with it. But you seem to have misunderstood something. Ivan did not say to transliterate only consonants in all cases, but to do so only when vocalized forms are unattested and not well-defined (mostly only loanwords). And I disagree that for "что" no transliteration is better than "čto". I think "čto" is better than "što" for a number of reasons. --WikiTiki89 23:29, 8 December 2013 (UTC)
Textbooks and dictionaries transliterate "что" as "što" or "shto", if they transliterate at all (Cyrillic is considered too easy, so transliteration is usually not necessary) or provide Cyrillic respelling/transliteration "что [што]", "сего́дня [сево́дня]" and "танде́м [тандэ́м]". "čto" is only a representation of Cyrillic letters one by one (ч (č), т (t), о (o)), which may help understand what letters were used (why not use the alphabet for that?) but confuses the hell out of learners and teaches them the wrong pronunciation. For the same reason, Japanese particles は and へ are transliterated as "wa" and "e", which doesn't match what is actually written ("ha" and "he"). Call it transcription or romanisation, if you will, but literal transliteration is not very useful. "Scientific" transliterations are seldom used by Russian dictionaries, only for standardizing proper nouns, but then, phonetic respelling is used much more often, anyway, e.g. it's "Sheremetyevo", not "Šeremétʹjevo" (Шереметьево). Russian lacks a generally accepted transliteration system used by everyone, including the government, so dictionaries and textbooks use their own proprietary systems made by authors, which are explained at the beginning of textbooks or dictionaries. In the case of Wiktionary, pages such as WT:RU TR serve that purpose, which only explains what is used by Russian-speaking editors in this dictionary. --Anatoli (обсудить/вклад) 23:57, 8 December 2013 (UTC)
As I have said before. Transliterations are of very little use to learners (or even an impediment). Learners should learn the native script. If we wanted a more pronunciation-based transliteration scheme for Russian, then why don't we transliterate "щ" and "сч" as "ś" or "здра́вствуй" as "zdrástvuj" (without the "v"), "руга́ться" as "rugátsa", or even "молоко́" as "malakó"? The answer is because these things are silly and unhelpful and I don't see why the case of "что" is any different. --WikiTiki89 00:35, 9 December 2013 (UTC)
I understand your sarcasm but your examples are not really relevant because the above words follow standard phonetic rules of the Russian language, they are not exceptions. Reading "жи" as "жы" and "ши" as "шы" are not exceptions, they are a rule. Some textbooks transliterate the way you describe as well but nobody (regular and random contributors) thinks it was ever necessary (such transliterations did exist but have been corrected by Stephen G. Brown (talkcontribs) and myself). Words like "сча́стье" ([ща́стье]), "здра́вствуй ([здра́ствуй])", "руга́ться ([руга́цца])", etc. can also be respelled, which actually explains, why there are misspellings and why Belarusian is written so (about a half of Belarusian spelling reflects Russian actual pronunciation). As I said, the rules for transliteration and which exceptions need to be manually transliterated are described in WT:RU TR. I should say that only exceptions need to be manually transliterated, not expected sound changes, you forgot to mention devoicing and many other things. I have no certainty about cases like со́лнце (sólnce), which could be transliterated as "sólnce" or "sónce". Dropping "л" is not really an expected change here (consonants are usually dropped in the middle of a consonant cluster) but it's OK to leave it as "sólnce" (/l/ reappears in derivations - со́лнечный, со́лнышко). --Anatoli (обсудить/вклад) 00:56, 9 December 2013 (UTC)
In case you didn't realize, the /tɕ/ reappears in all other forms of "что" as well (чего, чему, чем, чём). --WikiTiki89 01:02, 9 December 2013 (UTC)
Oh, I do realize, of course, since I do use these words in translations all the time :). I only mentioned со́лнечный, со́лнышко for the benefit of someone interested. чего, чему use different stems but a better example is не́что (néčto) ничто́жный (ničtóžnyj), which are direct derivatives of "что" but are pronounced as expected ("the Moscow shift" hasn't affected the standard pronunciation). --Anatoli (обсудить/вклад) 01:25, 9 December 2013 (UTC)
Another point I just realized is these dictionaries and textbooks are probably using transliterations as a replacement for any other kind of pronunciation information (such as IPA transcriptions). Since we do have IPA transcriptions, why do we need to duplicate this information in the transliteration? I wonder if there are any other dictionaries that have both pronunciations and transliterations, and if there are, what transliteration schemes they use. --WikiTiki89 01:14, 9 December 2013 (UTC)
If we had adopted IPA (which I doubt) for any transliteration in any case (usexes, synonyms, etc.), it would be a different story. IPA is used occasionally in introductions to dictionaries and scientific papers, not used for headwords. IPA is a common practice for English dictionaries for Russians, though (used in headwords). --Anatoli (обсудить/вклад) 01:25, 9 December 2013 (UTC)
I said "any kind of pronunciation information such as IPA", in other words IPA was an example. I've seen many Russian dictionaries that use pronunciation respelling. My question is do any dictionaries do that and include transliterations? --WikiTiki89 01:29, 9 December 2013 (UTC)
I don't have examples handy but from memory no, one is sufficient. A textbook would transliterate a Russian text phonetically but may provide "моде́м" ([модэ́м]) in a word list. Having both /mɐˈdɛm/ and [модэ́м] together is unlikely. --Anatoli (обсудить/вклад) 01:36, 9 December 2013 (UTC)
You misunderstood me. In my mind IPA and pronunciation respellings are the same thing. My question was do any dictionaries include both a pronunciation (IPA, respelling, or other) and a transliteration (romanization)? Because we do do this and each serves its own purpose: the transliterations are to aid people who do not know the script while pronunciations are to aid people in pronouncing the word. --WikiTiki89 01:40, 9 December 2013 (UTC)
Do you mean like this: яи́чница (jaíšnica, jaíčnica) ([яи́шница] or [яи́чница])? The number of exceptions is not so big but it's hard to find examples, since Russian textbooks switch to no transliteration quite early (as I said, Cyrillic is considered an easy alphabet). Respelling is more often used by monolingual Russian dictionaries or bilingual dictionaries without transliteration. It's only beneficial to users to show how a word is really pronounced (exceptions), even if IPA, transliteration, respelling duplicate the info. Thai textbooks also provide some traditional spelling info (along with the phonetic transliteration) when a word is not pronounced by usual spelling rules. It's especially important because Thai spelling and tone rules are very complicated and pronunciation and tones have changed over time. --Anatoli (обсудить/вклад) 01:59, 9 December 2013 (UTC)
So I'll ask you this: Why do we need transliterations if we have IPA transcriptions? --WikiTiki89 02:07, 9 December 2013 (UTC)
As I said, IPA is often misunderstood and disliked and not used in all contexts and situations, transliteration seems to be liked by the majority and used much more frequently. If people favouring literal transliteration take over those favouring transcribing exceptions differently then, also as I said, I'd prefer NO transliteration over misleading transliteration. Why these questions? I think I have explained my position and reasons. --Anatoli (обсудить/вклад) 02:13, 9 December 2013 (UTC)
Why these questions? To clarify your position for me. The only way that "čto" can be misleading is if the reader assumes that it is intended to reflect pronunciation. In and of itself, it is no more misleading than the original spelling itself ("что"). As I said before, the purpose of transliteration is to aid people who do not understand a script, while the purpose of pronunciation transcription is to aid in pronouncing the word. Note that pronunciation transcriptions do not have to be IPA, we can introduce other additional systems such as respellings or Romanized respellings. But transliterations should remain faithful to the written word. Following -sche's suggestion below, we could have {{term/t|ru|что||what|pr=[што]}} display что (čto, [што], “what”). Would you be against that? --WikiTiki89 02:38, 9 December 2013 (UTC)
For Slovene, our entries use a standard pronunciation spelling system with diacritics. It's placed in the pronunciation section of each entry using the {{sl-tonal}} template. The same can be done for Russian. I don't like adding pronunciation information to templates, because then there's nothing to stop us from applying it to words in Latin script as well. Do we really want enPR spellings in {{term}}? —CodeCat 02:46, 9 December 2013 (UTC)
To be honest, I don't know why we include transliterations with every mention of foreign-script words. Transliterations are only useful in the main entry and in etymology sections, everywhere else, people are supposed to click on the link. The only reason we seem to like translation tables to have a lot of extra information that really belongs in the entry is because many of these entries do not exist and it is much easier to add a translation than a full-blown entry. This has its advantages and disadvantages, but overall I think it is bad practice. --WikiTiki89 02:53, 9 December 2013 (UTC)
We include a transliteration in every mention of a word as an accessibility aid for our readers. They shouldn’t have to learn a bunch of alphabets to understand tree#Etymology. Michael Z. 2013-12-09 19:46 z
Which is why I mentioned etymology sections as the only place other than headword lines where we should have transliterations. --WikiTiki89 15:13, 10 December 2013 (UTC)
So you think they should learn a bunch of alphabets to read all other parts of the site then? Michael Z. 2013-12-11 18:37 z
Or they could click on the entry... The Etymology section is the only place where someone who is not interested in a particular language enough to click a link still has to read words in it. --WikiTiki89 18:43, 11 December 2013 (UTC)
That's right. A simple transliteration is far more popular than IPA.
Respelling and IPA in headwords are not going to be supported by the majority, so I can't consider these as suggestions. Besides, a header style with respellings or IPA won't be a substitute for transliteration, which is used increasingly often in many situations. --Anatoli (обсудить/вклад) 12:25, 10 December 2013 (UTC)
I don't understand why you keep going on about IPA. No one mentioned it here. --WikiTiki89 15:13, 10 December 2013 (UTC)
The disagreement over whether transliteration should merely transliterate letters or whether it should change from one word to the next to represent pronunciation is a fundamental and recurring one. Following up on a suggestion Ivan made several paragraphs ago which seems to have been overlooked in favour of continuing to disagree over how to use the transliteration parameters of templates ... is there any appetite for adding, to (most of) the templates that currently have transliteration parameters, a pronunciation parameter? It would be unneeded on headword line templates, because those are used on entries which already have pronunciation sections. (I realize there are those who strenuously disagree, but I refuse to believe that anyone is stupid enough to assume that the letter-for-letter transliteration provided next to the headword of an entry, rather than the section above it explicitly labelled Pronunciation in big font, is intended to give the definitive pronunciation.) However, in translations tables and etymology sections, etc, we could have: {{term|со́лнце|tr=sólnce|pr=ˈsontsɘ|lang=ru}} which could display со́лнце (sólnce, pronounced /ˈsontsɘ/). - -sche (discuss) 01:00, 9 December 2013 (UTC)
I am in favour of such additional parameters suggested by Ivan (which can also be used for Roman based languages, which are actually disadvantaged, IMHO) but it doesn't change my position on transliterating exceptions. IPA is only used in headwords, to my knowledge it's poorly understood and disliked. There is a general agreement about transliterating exceptions in Arabic, Japanese, Korean, Thai (even if seldom mentioned), Persian, Hebrew(?) but for some reason Russian transliteration methods are always under attack. Russian uses a simpler alphabet but it uses traditional spelling in a small number of cases and uses "е" instead of "ё" in a running adult native text. Words written with "е" instead of "ё" are bad for automatic or literal transliteration, like transliterating verbally Arabic or Hebrew unvocalised words, relaxed Arabic when letters are substituted, traditional Thai words, which are pronounced very differently from the original spelling. --Anatoli (обсудить/вклад) 01:13, 9 December 2013 (UTC)
Re "I am in favour of such additional parameters [...] but it doesn't change my position on transliterating exceptions": that's a non-starter. If we have pronunciations in a pronunciation field right after a word, there is no reason to misuse the transliteration field to provide a second, dumbed-down pronunciation right next to it. - -sche (discuss) 02:22, 9 December 2013 (UTC)
Words "misuse" and "dumbed-down" just show personal attitudes to the problem at hand. To me, showing automatic transliteration for exceptions is a misuse, it's taking "transliteration" too literally. I think it's actually much clearer when you have to deal with more complicated scripts. Automatic letter-to-letter transliteration is only a choice when no native knowledge is available, like with Cyrillic Mongolian where half of the words are pronounced differently from the spelling with few rules you could follow. --Anatoli (обсудить/вклад) 02:33, 9 December 2013 (UTC)
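For illustration only, here is a minimal Python sketch of how a mention with a separate pronunciation field, as in the `pr=` proposal above, might be rendered. The function name `render_mention` and its parameters are hypothetical; this is not how {{term}} or any existing Wiktionary template is implemented, only a way to see that the transliteration and the pronunciation can be kept as two independent fields.

```python
def render_mention(term, tr=None, pr=None, gloss=None):
    """Render a term mention with optional transliteration, pronunciation and gloss.

    Hypothetical helper: `tr` holds the letter-for-letter transliteration,
    `pr` holds a pronunciation transcription, and the two are never merged.
    """
    parts = []
    if tr:
        parts.append(tr)
    if pr:
        parts.append(f"pronounced /{pr}/")
    if gloss:
        parts.append(f"“{gloss}”")
    # With no annotations, show the bare term without empty parentheses.
    return f"{term} ({', '.join(parts)})" if parts else term


print(render_mention("со́лнце", tr="sólnce", pr="ˈsontsɘ"))
# со́лнце (sólnce, pronounced /ˈsontsɘ/)
```

Because the two fields are separate, an editor who wants an orthographic `tr` and one who wants the pronunciation shown are both served by the same call.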

Straw poll: What is the basic purpose of Wiktionary transliteration?

A lot of the disagreement above seems to come from a basic disagreement on what transliteration is supposed to do on Wiktionary, and also what it's not meant to do. I think it might help if everyone lists their stance and their reason for it.

Transliterations should be orthographic

Transliterations should use each character, or combination of characters, in the target script to represent matching characters in the source. The transliteration should reflect the writing conventions that are present in the source text, even if the spelling does not faithfully reflect the pronunciation.
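As a rough sketch of what "orthographic" means in practice, the following Python maps each source letter to a fixed Latin string with no regard for pronunciation. The table is a small, partial excerpt covering only the letters needed for the words quoted in this discussion, and the names `LETTERS` and `transliterate` are assumptions made for the example, not an existing Wiktionary module.

```python
import unicodedata

# Partial, illustrative letter table in the style of scientific transliteration.
# Only the letters needed for the examples are listed (assumption: ё -> "jo").
LETTERS = {
    "б": "b", "в": "v", "г": "g", "е": "e", "ё": "jo", "к": "k",
    "л": "l", "о": "o", "т": "t", "х": "x", "ч": "č",
}


def transliterate(word):
    """Letter-for-letter transliteration; unknown characters pass through unchanged."""
    # NFC makes sure "ё" is a single code point before the table lookup.
    word = unicodedata.normalize("NFC", word.lower())
    return "".join(LETTERS.get(ch, ch) for ch in word)


print(transliterate("что"))      # čto
print(transliterate("хлеб"))     # xleb
print(transliterate("лёгкого"))  # ljogkogo
```

Note that the output for лёгкого is "ljogkogo", not "ljóxkovo": a purely orthographic scheme has nowhere to record that г is pronounced differently in this word, which is exactly the point of contention.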

  1. For the purpose of Wiktionary, I think transliterations should be orthographic. Phonetic transliterations can definitely be useful, but as a pronunciation aid. We don't need pronunciation aids on Wiktionary, or at least we don't need them for non-Latin script languages alone. Latin-script languages would benefit just as much from such pronunciation-based transliterations, but we don't have them either and I don't think we should. For example, if the purpose of transliteration is to show pronunciation details, then there's not really much of a difference between {{l|ru|лёгкого|tr=ljóxkovo}} and {{l|nl|bijzonder|tr=bizondər}} or {{l|nl|werkelijk|tr=werkələk}}. And I don't think we should do this. —CodeCat 22:47, 9 December 2013 (UTC)
Roman based languages are not transliterated here, no matter how inconsistent, complicated or different from other languages the orthography is, including languages with complex diacritics. I haven't made this rule. People generally think that mastering a Roman based script is still easier, you just need to know the rules. However, if a Roman based language gets transliterated into another script, it's a totally different matter. E.g., Japanese uses a fully phonetic system, the spelling is ignored (even though Japanese has a different phonetic system, so words sound unusual). So, in Japanese the Dutch words above become something like "ビゾンダル" (bizondaru), "ヴェルケレク" (verkereku), in Russian something like "бизондер", "веркелек". Even though it's Katakanisation or Cyrillisation, not exactly "transliteration", it gives an idea of how textbooks, phrasebooks and dictionaries often look. There are various romanisation and transliteration systems. Orthographic transliteration isn't so popular in dictionaries, textbooks and phrasebooks. When you see a word in an unknown script, you have to struggle twice - 1. understand the symbols, 2. know the reading rules. E.g. สวัสดี (sà-wàt-dii) (sawàtdee) is made of ส (sɔ̌ɔ), ว (wɔɔ), ั, ส (sɔ̌ɔ), ด (dɔɔ), ี. The first letter ส (sɔ̌ɔ) (s) doesn't have a vowel symbol to make the sound "a", it's a (predictable) exception and "swàtdee" (without "a") would only confuse readers. Particle ไหม (mǎi) is supposed to be pronounced with a rising tone, but it is pronounced with a high tone (letter ห converts the following consonant into a "high-class" consonant, which, by another rule should have a rising tone on a "live syllable"). It's an unpredictable exception.
If "лёгкого (ljóxkovo)" (also spelled "легкого") is transliterated "ljogkogo" or even "legkogo" (based on the usual relaxed spelling - "ё" as "е"), it doesn't help anyone. To know the letters, one can look up each letter in the alphabet. The difference between transliterating Roman based languages and foreign script languages is that quite a number of systems make adjustments and use fully or partially phonetic methods. The purpose is not to teach the symbols but to teach words, that's why "ljóxkovo" is the best way to transliterate both "лёгкого" and "легкого". --Anatoli (обсудить/вклад) 11:48, 10 December 2013 (UTC)
ljogkogo by itself is not helpful, no. But neither is the Russian spelling лёгкого. The only reason Russian speakers know that лёгкого is irregular is because they speak Russian in the first place. But what about, say, a Ukrainian or Bulgarian speaker? They would not be helped any more by лёгкого than by ljogkogo. I think you should also make a distinction between transliterating into a language and transliterating into a script. In Russian, you would write бизондер, but in Ukrainian this would be бізондер, and in Bulgarian бизондър. But there is no reason that a single Cyrillicization scheme couldn't be created for Dutch, one that is language-independent or based on established conventions for one or more languages, without specifically catering to one of them. The scientific transliteration of Russian (which ours is based on) is precisely that; it's a language-independent way of transliterating Russian, and therefore should not try to match the spelling conventions of any one target language. It has its own spelling conventions. Consider also that we transliterate the old letter ѣ differently from е. That's purely an orthographic difference in Russian, yet we still represent it. So if we should show pronunciation differences, then both of these should be "e". —CodeCat 14:42, 10 December 2013 (UTC)
A good example of a language-independent transliteration scheme is Pinyin / Palladius for Chinese. Neither of these schemes follow the conventions of any language, they make up their own, even if they seem counterintuitive for speakers of any language. Because of this, there is only one type of Pinyin and only one type of Palladius, no matter what the language is. We can do the same for Russian, and we can do so by following the established standard scientific transliteration. —CodeCat 14:48, 10 December 2013 (UTC)
That's why we need to transliterate exceptions differently, so that Ukrainian or Bulgarian readers don't make wrong assumptions. In Polish, "light" is "lekki" and they can assume that "лёгкий" is "ljókkij". The endings "-ого", "-его" also have cognates in most Slavic languages, but they pronounce "g" or "h"; only Russian, rather recently, has got "v". Yes, Cyrillic alphabets differ across languages but that's beside the point. The point is that in Russian transliteration it wouldn't be "бийзондер" but phonetic, employing additional symbols, if necessary. I've got a Russian textbook of Arabic, which uses various additional Cyrillic letters. --Anatoli (обсудить/вклад) 21:04, 10 December 2013 (UTC)
They won't be confused if there's a usage note. --WikiTiki89 21:11, 10 December 2013 (UTC)
I think you misunderstood a bit. Using "бийзондер" would only make sense if we transliterate each letter distinctly but that's not what I am arguing for. There's no reason an orthographic transliteration couldn't have a rule to transliterate Dutch "ij" as "ей", so I think "бейзондер" would not be a bad way to transliterate this word pronunciation-agnostically, even if Russians would not pronounce that right. Correct pronunciation isn't the point after all; how would you transliterate /œy/ into Russian? There's little chance any combination of Cyrillic letters would help Russians pronounce that right, so the emphasis should be on choosing a single unique representation, not on trying to write something that will make Russians pronounce it right. So something like "ёю" or "ёу" would be fine, even if Russians would totally mispronounce it. To give a similar example, how many English speakers would pronounce Pinyin Qín or Xī'ān correctly? —CodeCat 23:44, 10 December 2013 (UTC)
Unlike other alphabets, the Latin alphabet has been optimized for transliterating and transcribing virtually any language in the past couple centuries, so it is not helpful to compare transliteration from English to Japanese, which cannot possibly come close to representing English orthography. Exceptions do not need to be transliterated exceptionally, as you seem to think. When we refer to the English words great and meat, we don't have to write them as great (grejt) and meat (mīt) all the time. Their pronunciation is handled in the pronunciation section, as it should be for all other languages. --WikiTiki89 15:08, 10 December 2013 (UTC)
Sure but not all countries know about this, he-he. To teach Latin letters they don't use Latin letters. Japanese or Chinese transliteration may look very funny but it's heavily used to help young learners to master foreign script. Russian and Arabic textbooks use it too, to various degrees. "Should be" is your personal opinion, transliteration is quite common and popular. I'm not asking to transliterate English or any other Latin based language, nobody does. --Anatoli (обсудить/вклад) 21:04, 10 December 2013 (UTC)
Those are not transliterations but phonetic transcriptions. And you're right, it's very hard to teach English without phonetic transcriptions, which is why we have pronunciation sections. --WikiTiki89 21:11, 10 December 2013 (UTC)

Transliterations should be phonetic

Each character, or combination of characters, in the target script should match, either phonetically or using a more fundamental underlying analysis, the pronunciation of the source word. The transliteration should not necessarily reflect nuances in the source spelling, if these aren't also reflected by pronunciation differences.

  1. (I'm not a fundamentalist about it. But this is what we've ended up with for Hebrew, and I've been pretty happy with it overall.) (Actually with Hebrew we've been doing something slightly different — we've been giving transliterations that reflect the most common Modern Israeli accent, which is obviously not quite the same thing as "the pronunciation of the source word" when the transliteration is e.g. of a Bible quotation — but it's in the same general category as this.) (Actually we're not quite doing that either, in that I, at least, have generally gone with transliterations of the supposedly "correct" pronunciation, even when I rather doubt that that's what most people would say, e.g. in cases of vowel changes in clitics. But I've been doubting the merits of doing that, and if some mandate came from on high telling me that I should use educated-normal-person pronunciation rather than pedantic-new-immigrant pronunciation, I'd happily do so, my only qualm being over the verifiability of said, since most references deal in pedantry-newness-immigration.) —RuakhTALK 05:30, 10 December 2013 (UTC)
The rules for Arabic and Persian have been rather relaxed too. In a way that dialectal and loanwords are transliterated differently, absence of diacritics (which are always absent in Persian and Urdu) is not a hurdle to transliterate words phonetically. Arabic has occasional silent ا (in grammatical forms) and Persian has occasional silent و. Rules will get stricter or more standardised but it's just impossible to force abjad languages into purely orthographic transliteration. Arabic will probably be transliterated the Hans Wehr way. --Anatoli (обсудить/вклад) 12:10, 10 December 2013 (UTC)


  1. It needs elements of each. We should stick to orthography as much as possible but in many cases the original orthography is lacking and certain features should be added to the transliteration. Also, not every detail of the original orthography needs to be distinguished if such a distinction would be difficult to represent and would not add anything valuable. --WikiTiki89 23:06, 9 December 2013 (UTC)

The reality is that it's neither fully phonetic like IPA nor fully orthographic. It can be fully orthographic for some languages though. My comments on other languages are usually ignored because Russian is the target of the criticism.

  1. In Thai and Lao, many consonants are represented by different letters, depending on the "consonant class" or semantics. It's impossible and unnecessary to have each letter represented by the same letter or to allow backward transliteration. In Thai there are silent letters and traditional spellings; orthographic transliteration is very unhelpful and is not used.
  2. Korean consonants change their values a lot, depending on the position. Various transliteration systems reflect this change.
  3. Japanese おう (or any syllabic letter with -o) represent either "ou" or "ō", the reading depends on the meaning. Japanese particles は and へ are "wa" and "e", which differs from the expected "ha" and "he". This also is used by most transliteration systems.
  4. Arabic, Persian, Urdu and Hebrew without diacritics would be a mess, if we tried to represent them orthographically.
    Arabic و and ي can be vowels or consonants. ū/ī or w/y.
    The letter ا can bear no sound, just a hamza carrier or represent vowels, usually unwritten vowels are attached to it or it can be a long vowel ā.
    There are numerous relaxed spellings, like writing ى instead of final ي or ه instead of ة. In the strict spellings they represent completely different sounds.

I could continue but not sure if anyone is reading. --Anatoli (обсудить/вклад) 23:10, 9 December 2013 (UTC)

I'll give you an answer. For Japanese, your first part is a case of what I just mentioned above (the original orthography is lacking). As for the second part, は and へ, it wouldn't be a problem to transliterate them as "ha" and "he" with a usage note saying that they are not pronounced that way (such a usage note should be there regardless of our transliteration scheme). For Thai and Lao, we don't have to distinguish between all the letters that produce the same sound. And for positional variants (applying to Korean as well), they should be learned, not distinguished in transliteration. For Abjad languages, what you mention is not a problem at all. If we wanted to do a consonantal transliteration, words such as مبروك‎ can easily be, and frequently are, transliterated as "mbrwk" (details such as how to transliterate alif or other special letters would have to be worked out, but they are not a dead end). Of course, we try to include the vocalization in both the original and the transliteration, so there is no problem. --WikiTiki89 23:26, 9 December 2013 (UTC)
Forgot to mention that for cases like "traditional spelling" in Thai, there does not need to be an exception, just like in English when words like colonel and indictment maintain traditional pronunciations, but with the spelling re-borrowed. We don't need to include an extra English transliteration as coronel or inditement. --WikiTiki89 23:33, 9 December 2013 (UTC)
Re: your answer about Thai, so, you suggest that Thai exceptions like ไหน (nǎi) should be transliterated as "hnǎi" instead of "nǎi" and อยู่ (yùu) as "ʔyū" instead of "yū" because you want ALL symbols in a foreign script to be graphically represented even if they are silent or have a different value - half or completely unpredictable? Is it OK to misrepresent tones as well? Thai tones are governed by complex rules but there are exceptions, which are consistent and inconsistent. Both Thai and Korean have finals transliterated and pronounced differently from initials. Should this rule be cancelled as well? --Anatoli (обсудить/вклад) 13:04, 10 December 2013 (UTC)
In short, yes. If the Thai people care enough about the silent letter "ห (hɔ̌ɔ)" in ไหน (nǎi) in their own orthography, then why shouldn't we transliterate it? As for tones, if they are written explicitly in the orthography, then we should transliterate as written, if not then we should follow all the tone rules (and exceptions) of the language. As for positional variants (finals, initials, medials, etc.), for Russian, we transliterate "хлеб" as "xleb" (which I know you agree with). We do not transliterate it "xlep" just because the "б" is final. Same should be for Thai and Korean. Phonological variants such as this should be learned, not written (but they still should be indicated in the pronunciation section, which does not have to be only IPA as you seem to think, so I don't see any problem). —This comment was unsigned.
I forgot to tell you the whole truth about ไหน. The first symbol is actually a vowel, so it's "ǎihn". Will you make an exception for vowels that are physically written in the reverse order or even surround consonants, or are you going to let Thai teachers know that from now on the term is transliterated as "ǎihn"? Russian devoicing and Korean, Thai, etc., clipped final consonants are a different thing. There are more changes with finals, like "ch", "s" become "t", etc. So, are you going to challenge Korean, Thai, Arabic, Japanese, etc. standard and scientific rules of transliteration, which are generally accepted and favoured, many are supported and used by our editors, just to make your point about Russian transliteration, gain support and force the change? --Anatoli (обсудить/вклад) 21:23, 10 December 2013 (UTC)
I realized that when I looked up the individual letters. And no, I'm not saying we should write "ǎihn" instead, otherwise, I would have demanded that we transliterate Arabic vowels on top of the consonants. The purpose of transliteration is to convert a script to a more familiar script. If we leave the letters in the weird order of the original script, that doesn't really do the job. And to your last question, no, I'm not challenging the "standard and scientific rules of transliteration", I'm giving you my point of view on what I would consider a useful transliteration for me, as someone who knows very little about Thai. --WikiTiki89 21:36, 10 December 2013 (UTC)
But that's the thing, you always have to compromise and make decisions on what you're going to render, what to add and what to change. ห (h) is not only a consonant, it's a means to make the following consonant change the tone and is silent in this case, at the beginning of a syllable (not the physical beginning in this case) before a consonant. You'll have to make compromises and decisions all the time. คน (kon) only uses two consonants - ค (k) + น (n) but is pronounced "kon", not "kn". It's another rule for monosyllabic words.
Hindi relaxed spelling (not unlike Arabic or Russian) may spell फ़िल्म (film) as फिल्म (philm). The symbol ़ (nuqta) is often omitted like dots in Russian or Arabic. Both spellings are pronounced "film", even though फिल्म is spelled "philma". The "a" is inherent to Devanagari but chopped in Hindi and many other languages. --Anatoli (обсудить/вклад) 22:16, 10 December 2013 (UTC)
Maybe I misunderstood. If the purpose of the written ห is to change the tone, then we could write it as a change of tone rather than as "h". What I thought you meant was that in this particular word, the "h" is silent. As for the Hindi thing, just like Arabic, when diacritics are not written, they are inferred. It doesn't actually say "philma", it's just people being too lazy to write the diacritic that cancels the "a" sound. --WikiTiki89 22:21, 10 December 2013 (UTC)
The Sanskrit diacritic ् (virāma or halant) is hardly used in Hindi, only in Sanskrit words or when unable to produce a ligature (made of consonants). All Devanagari consonants are syllabic, where "a" is inherent - na, ra, ka, etc. Dropping "a" is an easy reading rule in Hindi. People just know when to drop "a" but it wasn't the case in Sanskrit, so they used a diacritic. Nuqta wasn't used in Sanskrit, it just didn't have those sounds but Hindi required it for Arabic, Persian, Urdu and English words. Lack of nuqta has created alternative forms in pronunciation and just new pronunciation, cases of overcorrection. The word "film" is just well-known, so Hindi speakers pronounce it correctly, even if they usually don't use nuqta. Bengali doesn't have nuqta and doesn't have f, z and other sounds but they know how to pronounce loanwords. In short, Hindi Devanagari is almost phonetic but not 100% and some words need to be transliterated manually. --Anatoli (обсудить/вклад) 22:48, 11 December 2013 (UTC)
You're describing your preference but not the real transliteration or even standard or scientific or what users/learners really want or what dictionaries do. Even the rather conservative w:Revised Romanization of Korean reflects sound changes. South Korea published guides for the phonetic transliteration for practical purposes, implemented in Korean Wiki (partially) and fully in French wiki. Module:ko-translit/testcases is an attempt to follow the official transcription guide. Google Translate attempted to transliterate Arabic like this - "mbrwk" (BS, IMHO) but then abandoned it, since it's useless BS. Persian and Urdu are seldom vocalised if at all. So vowels ALWAYS need to be inserted in transliteration to make it meaningful. People learn alphabets quickly; what they want is the missing/unwritten vowels in abjad languages. は and へ (as particles) or おう combinations are not transliterated with usage notes, they are exceptions, which are handled manually. "Tōkyō" is the standard transliteration of 東京 (とうきょう), not "toukyou". My strong opinion is, transliteration should be practical, call it romanisation or transcription. It shouldn't aim at representing each letter one-to-one or try to allow backward transliteration. --Anatoli (обсудить/вклад) 23:39, 9 December 2013 (UTC)
Maybe I wasn't clear. For Japanese, I was agreeing with you about おう, but disagreeing only about は and へ as particles. For Abjads, I did not suggest that we use transliterations such as "mbrwk", but only that they are not total BS as you seem to think. They are used by scholars when the consonantal spelling of a word is important. But I agree that we need to insert vowels into Arabic, Persian, Urdu, Hebrew, etc. If there is a native scheme available, which Arabic and Hebrew have and Persian and Urdu I think have, then we should use it and transliterate their scheme, otherwise we need to insert our own vowels because the native orthography is lacking. --WikiTiki89 23:48, 9 December 2013 (UTC)
Regarding your recent changes to your post, for Korean I think it's ok to transliterate the component parts differently depending on their position within a character, but each character as a whole should always be transliterated the same way. --WikiTiki89 23:58, 9 December 2013 (UTC)
I cannot agree with you or we don't understand each other. Would you suggest that こんにちは should be transliterated as "konnichi ha", rather than "konnichi wa"? As for おう (or any combination of こ, と, そ, etc. + う), I don't know what you mean. These are two hiragana symbols - o + u but whether they are pronounced and transliterated as "ou" or "ō" depends whether they belong to different morphemes and semantics. The same, more or less, applies to long vowels ああ, いい, うう ā/aa, ī/ii, ū/uu. Automatic transliteration tools take a simple approach but we don't want it, it's not standard either.
Your answer about Korean doesn't make sense to me or you need to know more on the topic, including RR standard. E.g., a simple example: 녹말 (nongmal) consists of two syllables/two hangeul characters 녹 (nok) + 말 (mal), six jamo components: ㄴ, ㅗ, ㄱ, ㅁ, ㅏ, ㄹ (n-o-g-m-a-l). Separately, they are transliterated as "nok" and "mal", together as "nongmal", if you add a particle starting with a vowel (e.g. 에 (e)), it becomes "nongmar-", e.g. 녹말에 (nongmare). (These are not my inventions, it's a standard transliteration). There are various changes with jamo ㄱ, which can be g, k, ng, depending on the position. Even more changes with . (You can ask TAKASUGI Shinji (talkcontribs) more on the topic but most of the simple stuff is on w:Revised Romanization of Korean). --Anatoli (обсудить/вклад) 11:06, 10 December 2013 (UTC)
The reason you can't understand me is maybe because you don't realize that I meant what I said. For Japanese, yes こんにちは should be transliterated as "konnichi ha" (the /wa/ pronunciation should be indicated in the pronunciation section and probably as a usage note due to the nature of the exception). As for ō/ou, I don't know why I have to keep repeating myself that I agree with you. For Korean, I know more about Hangeul than you seem to think. 녹말 (nongmal) should be "nog-mar". The fact that /g/ is pronounced as /k/ finally and as /ŋ/ before /m/ or that /r/ is pronounced /l/ finally or before a consonant is a phonological feature, not an orthographic one and it's a regular rule, just like Russian final devoicing. Pronunciation sections are where we can put phonetic transliterations like the Revised Romanization of Korean. --WikiTiki89 14:50, 10 December 2013 (UTC)
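To make the 녹말 example concrete, here is a hedged Python sketch that splits precomposed Hangul syllables into their jamo using the standard Unicode arithmetic (syllables start at U+AC00, with 19 initials, 21 vowels and 28 finals) and then romanizes jamo by jamo. The decomposition itself is standard; the per-jamo values in `JAMO_LATIN` are a partial, hypothetical table chosen only to reproduce the "nog-mar" reading argued for above, and the sketch does not implement Revised Romanization or any sound-change rule.

```python
# Jamo inventories in Unicode syllable-block order (compatibility jamo for display).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
         "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Partial, hypothetical per-jamo values: one fixed Latin letter per jamo,
# regardless of position. Only the jamo of the example word are listed.
JAMO_LATIN = {"ㄴ": "n", "ㅗ": "o", "ㄱ": "g", "ㅁ": "m", "ㅏ": "a", "ㄹ": "r"}


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail) jamo."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


def orthographic(word):
    """Romanize jamo by jamo, with no positional sound changes applied."""
    return "-".join(
        "".join(JAMO_LATIN[j] for j in decompose(s) if j) for s in word
    )


print(decompose("녹"))       # ('ㄴ', 'ㅗ', 'ㄱ')
print(orthographic("녹말"))  # nog-mar
```

Getting from "nog-mar" to the standard "nongmal" would need an extra layer of positional rules (final ㄱ before ㅁ, final ㄹ), which is the part the two sides disagree about placing in the transliteration.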
You're using this argument to express your opposition to the Russian transliteration. How about checking opinions of other editors about the Japanese and Korean transliteration you're suggesting? You probably don't realise it but this is BS, which contradicts various standards and conventions. --Anatoli (обсудить/вклад) 21:23, 10 December 2013 (UTC)
  1. (In response to the section header rather than what's just above this. And without having read all the discussion above the straw poll.) I'd definitely want a foreign-script foreign-language Wiktionary with our purview to transliterate the middle schwas of celebrate and separate differently (if feasible in that language), which I guess lands me in the "orthography" camp. I also definitely want us to transliterate broad and small kamatz in Hebrew differently from one another, which I guess lands me in the "phonetics" camp. Go figure.​—msh210 (talk) 08:58, 10 December 2013 (UTC)
  • Wikitiki, I respect your opinions, but I'm finding you to be quite unreasonable on this issue. You seem to want us to abandon established standards, like Hepburn for Japanese, which is so ridiculous that I cannot even believe that you really think that this is a wise course of action. I feel confident that not a single Japanese editor here would accept that. Moreover, you should realise that the Thai script, for example, was not even designed for Thai but for an unrelated language (Pali), and our transliterations serve an important role in making this more accessible based on Thai pronunciation; the orthography is so far beyond the point that bijectively transliterating it serves no purpose, because anyone who wants to study Thai that seriously will use the native script directly. —Μετάknowledgediscuss/deeds 07:00, 11 December 2013 (UTC)
As a person not knowing Hebrew but wanting to know more, I also support separate transliteration for gadol and qaṭan kamatz, so that e.g. תָּכְנִית‎ is transliterated as something like "tokhnit". --Anatoli (обсудить/вклад) 12:03, 10 December 2013 (UTC)
I believe a phonetic transliteration is most useful for people who have difficulties reading the original script. However, we should try to stick to some standard transliteration scheme rather than making up our own. This way people who are already somewhat familiar with the standard transliterations of a language they're trying to learn don't have to learn something new and get less confused. Matthias Buchmeier (talk) 19:49, 11 December 2013 (UTC)

Idea for compromise

As a compromise I want to suggest having a dual-transliteration system for some (not all) languages. One of these would be pronunciation-based, and the other, orthographic. Here are a few examples, some of which have two possible variants of how it could be done (please note that these are only examples of how I currently imagine it, the details do not have to be exactly like this):

As you can see, for some languages this seems to be more useful than for others. For Arabic and Hebrew especially, I think that this would help readers out a lot. For Japanese and Russian, I think it is unnecessary and would cause too much duplication, but for Russian it beats the hell out of having just "ljóxkovo", in my opinion. I think it would be helpful for Korean, but it is certainly not necessary. I can't speak for Thai as I don't know the script that well, but it may be helpful in certain cases.

Note that if we do this, it would have to be decided on a per-language basis. We could even have a system of only doing it for certain problematic words, but not for the majority of words in a language (I think that would fit well with Russian).

I would like to know what you guys think:

  • Is this a good idea that can be implemented?
  • Is this a good idea in theory, but would be too difficult to implement?
  • Is this a bad idea?

--WikiTiki89 16:35, 11 December 2013 (UTC)

A main purpose of transliteration is to uniquely identify a written word, by its form that contrasts with other words’ forms (at least, uniquely within this dictionary, and perhaps also among other sources that use a similar transliteration scheme).
Having a transliteration plus transcription, without any label, would be confusing. More so as in many cases they would be identical. Still more so when some languages have one and others both. And when a transcription is missing because an editor hasn’t entered one yet. Michael Z. 2013-12-11 18:55 z
Who said anything about not being labeled? The above is just an example, if we want it labeled we can do that. The case of being identical in some languages is solved by only doing this in exceptional cases in such languages. --WikiTiki89 18:59, 11 December 2013 (UTC)
So let’s see it with labels, but I am still skeptical that complicating the page is any improvement.
But just to be clear, what is the intent here?
  1. Using two forms of romanization: orthographic-based plus transcription-based, for non-alphabetic languages?
  2. Adding pronunciation for all words with non-phonemic spelling? If so, then are we adding pronunciation for Latin-script words too? For English words?
  3. Accommodating novel ideas that reject the utility of transliterating Russian the way the whole world does it?
 Michael Z. 2013-12-11 19:17 z
Michael, since we are a dictionary, can you kindly give an example of how "the whole world" or rather a published dictionary transliterates common Russian words such as "что", "конечно" (or any word where ч=ш), "сегодня", "его" (or any word or form where г=в), "лёгкий/мягкий" (just these two or derived adverbs where г=х), "тест" (or any word form where е=э), "лёд" (or any word with "ё"), "мед" (i. e. мёд) (or any common word where е=ё)? Please note, there is no point in referring to published standards themselves, since they are not used to transliterate a few exceptions listed. Surnames and place names with "ё" generally become Roman "e", that's why I specified common nouns. --Anatoli (обсудить/вклад) 23:56, 11 December 2013 (UTC)
The OED First Edition entries appear to use scientific transliteration (kopejka, kopjë in “copeck”). Second Edition uses British Standard transliteration (sovétskoe khozyáĭstvo in “sovkhoz”, kësterit in “kësterite”, krȳzhanovskit in “kryzhanovskite”, smert′ shpionam in “SMERSH”). We do see sovét naródnovo khozyáĭstva in “sovnarkhoz”, but I can only find the one example of a transcription, so it may be an error. The Third edition entries have gone back to scientific transliteration (megrelec, megrel′skij in “Megrel”, mončeit in “moncheite”, nenadkevičit in “nenadkevichite”, rodit′, rod″ in “Rodinia”, special′nogo in “Spetsnaz”). Michael Z. 2013-12-12 07:47 z
Thanks. I didn't understand some words or transliterations you have provided, though. It seems like a mixture of everything. It's not surprising, Russian is transliterated in too many ways, including well-known respectable dictionaries but they are not specialised for Russian. As I said, Russian dictionaries usually don't use transliteration. I've also seen "naródnovo", "jo" or "yo" for "ё" in Google Books and dictionaries. --Anatoli (обсудить/вклад) 05:17, 13 December 2013 (UTC)
The Oxford English Dictionary’s transliteration is not a mix of everything. It is two standardized transliteration systems used over a 130-year publication history. Their 2013 transliterations are consistent with their 1884 ones, and they are updating all of their 2nd-edition entries to match. Pretty good compared to Wiktionary, which can’t seem to go two weeks without amending our so-called transliteration system.
Do you think we will ever have even the one word его transliterated consistently on our pages? I don’t think it can happen unless we start by adopting a real transliteration standard. Michael Z. 2013-12-14 18:48 z
It IS a mix of everything. "ТАСС" (Телегра́фное Аге́нтство Сове́тского Сою́за) is given as "Telegrafnoye Agentstvo Sovetskovo Soyuza". With "sovetskovo", "narodnovo", "russkovo", etc. used quite frequently. None of the Oxford dictionaries are dedicated Russian-English or English-Russian dictionaries, which used some kind of transliteration, anyway. So, the Russian terms with irregular pronunciations are not really demonstrated in the Anglophone published dictionaries. --Anatoli (обсудить/вклад) 06:06, 16 December 2013 (UTC)
In case you didn't notice, Michael already pointed out that there are two systems used by the OED. The original system, used in the first and third editions, and the phonetic system they only used in the second edition (which is the latest fully published one). Your example of "ТАСС" is from the second edition. The fact that they switched back to the original system for the third edition should mean something to you. --WikiTiki89 13:23, 16 December 2013 (UTC)
Have a closer look, Anatoli. You are re-quoting the London Times of 1925. That 1986 OED entry uses British Standard transliteration in its etymology. Michael Z. 2013-12-16 15:36 z
The idea is a compromise. There are too many people here that believe that "лёгкого" should be "ljóxkovo". Since I disagree with that, I'm suggesting what I think is a reasonable compromise. --WikiTiki89 19:28, 11 December 2013 (UTC)
I have counted exactly 2 people favouring the current Titarev-Brown romanization system for Russian. --Vahag (talk) 19:31, 11 December 2013 (UTC)
You mean the two big guns of Russian, the two who really care? --Anatoli (обсудить/вклад) 23:42, 11 December 2013 (UTC)
It’s not all about size, you know. Michael Z. 2013-12-14 19:41 z
The authors of textbooks or dictionaries publish transliteration conventions used in the books they personally edit. They do what suits the work they do best. No system is perfect and can be liked by everybody. If you and other opponents of the phonetic transliteration have it your way, it doesn't necessarily mean this method will be adopted by the actual contributors into Russian. --Anatoli (обсудить/вклад) 06:12, 16 December 2013 (UTC)
Yes, you are entering pronunciations according to your own complicated rules where Wiktionary calls for transliterations. So we already know you can do whatever you want to please yourselves, and damn consistency, logic, the community, and the readers. But that makes you “big guns” the opponents. Michael Z. 2013-12-16 15:50 z
Don't forget that I'm proposing it on a per-language basis. As I said above, this would be most beneficial to Abjads like Hebrew and Arabic. --WikiTiki89 19:37, 11 December 2013 (UTC)

It looks like a compromise but it's obvious that the whole discussion is about challenging Russian transliteration. The Roman "ë" has just graphical resemblance with the Cyrillic "ё", it doesn't convey /ʲo/, /jo/ or /o/. I don't know if you can challenge the rules for other languages, whose editors are not participating and have already done the hard work of choosing transliteration, it's especially the case for Japanese. In defence of "Titarev-Brown romanization system" I should say that no published dictionary used the official, formal, standard or scientific transliteration system for Russian and dictionaries, textbooks and phrasebooks that do, use their own, customised or adjusted methods, respelling. I chose "other", a mixture of orthographic and phonetic for all languages, which is used by published standards in most cases. Cyrillic and Greek alphabets are considered "easy" and dictionaries don't transliterate these at all but if we choose to transliterate, we have to allow for a small percentage of exceptions, which are also present, apart from above mentioned languages in Greek. Maybe we shouldn't use transliteration for Russian if published dictionaries don't use them and the current practice is so hated. Scientific or standard transliterations of Russian don't deal with inflected forms like "лёгкого", genitive endings, conjunctions and pronouns - "что", "чтобы", any of these transliterations has a very limited usage, even passport offices or world maps use them to render geographical names in Russia. I am undecided about the proposal, although it looks interesting. Not sure if dual transliteration will be accepted by other editors and the public. If we do it this way, we should probably also label/display separately non-dictionary ("relaxed") common forms like "легкого" (for "лёгкого"), "عربى" (for "عربي"). --Anatoli (обсудить/вклад) 23:42, 11 December 2013 (UTC)

Re "it's obvious that the whole discussion is about challenging Russian transliteration": That is entirely not the case. I really do want to discuss the other languages. Don't forget that this is just discussion and no policies are being made here (yet), nor will they be without a vote in each language that wants to do so. Regarding Greek and Cyrillic being easy, that is true if you are learning the language, but in etymology sections foreign scripts are encountered by people who are not learning the language, and that is who our transliterations should be geared for (as I've said countless times, people learning a language don't need transliterations at all, with the exception of logographic scripts). --WikiTiki89 00:10, 12 December 2013 (UTC)
@Wikitiki89 Just to clarify, I'd like to know your view on the alternative common forms (with no
Adding to the list of phonetically transliterated languages (sorry for the quick format)

How should those be displayed in your proposition? --Anatoli (обсудить/вклад) 00:26, 12 December 2013 (UTC)

You haven't really explained why we need them. I want to point out that the letter "ё" is more analogous to Arabic vowels than to the ى vs ي question. When we write عَرَبِيٌّ(ʿarabiyyun), it is implied that the word can also be written without the vowel diacritics, similarly, when you write лёгкого (ljóxkovo), it is implied that it can be written without the two dots. --WikiTiki89 00:39, 12 December 2013 (UTC)
Diacritics and other letters used are not exactly the same thing, even if related. Dotless "ى" for "ي" and "е" for "ё" are transliterated differently - graphically or phonetically depending whether you want to stress the graphic form, the phonetic or just point to the dictionary form. It's just more complex, that's all and people can have more opinions about these. Is it ā or y for "ى", is it e or jo for "е", etc. --Anatoli (обсудить/вклад) 00:52, 12 December 2013 (UTC)
I meant they are totally different in terms of usage. In Egypt the dotless form is used always (formally and informally), while in the Levant the dotted form is used always (formally and informally). In Russian, ё is used much more ambiguously, more similar to Arabic shadda and fathataan, which are commonly but not always encountered in written text. Regardless, with the letter "ё" it is always implied that it can be written without the dots. --WikiTiki89 01:05, 12 December 2013 (UTC)
That's why I'm asking you to consider the display for these forms as well, so that they can be seen. Mzajac suggests to transliterate as "e" when "ё" has no dots. --Anatoli (обсудить/вклад) 01:15, 12 December 2013 (UTC)
Citations should be quoted and transliterated as written – the point is to demonstrate actual usage, not “improve” quotations. WT:QUOTE is clear about this. Furthermore, it would be impossible in every case to know whether, e.g., “все” really means vsë, or whether a mentioned Mr. Yozhin is really Yezhin. Michael Z. 2013-12-12 20:32 z
I was actually referring to alternative forms Wikitiki89 has provided and he has answered my question. In transliteration, especially in the Russian entries, improved transliteration is being provided with word stresses, even if the quote may retain the original spelling. Native speakers are seldom confused whether it's "все" or "всё" (in some cases, the absence of dots may slow down your reading or understanding for a second, though) or any other usage of "ё" but in case of a doubt "e" can be used as default translit. (I'd personally check for the exact meaning, if I wanted to know if the person's surname was "Ежин" or "Ёжин"). It's again the issue we can't agree on - phonetic or graphical transliteration, not so much about the graphical photographic exactness of quotes.
This too (WT:QUOTE), can be challenged for Russian or Arabic, IMO. It doesn't matter if a word is spelled with "ё" or "е" in the published book. We can provide stress marks and dots over "ё", which serve the same purpose as stress marks for the sake of learners, without considering that "лёгкий" and "легкий" are different words. It's not quite the same as English "café" and "cafe". "ё" is provided as an indicator of pronunciation, just like the stress mark. Arabs too, can write same texts with either "ى" or "ي". I may raise this discussion but anyway, it's not related to our current discussion. My point is, if a quote is exact, the transliteration doesn't have to be photographic (if a phonetic transliteration is chosen), it's a linguistic dictionary, about learning words, not facts. Someone may want to recite Tolstoy or Pushkin but can get stuck on pronunciation and it's often easier to provide quotation rather than make up own user examples. --Anatoli (обсудить/вклад) 05:17, 13 December 2013 (UTC)
I hope you don't mean that we should add accent marks to quotations. For transliterations, that's ok, but not for the original quotation (unless it's poetry, where I think it would be acceptable to add accents to the original). --WikiTiki89 14:01, 13 December 2013 (UTC)
I say it can be challenged for foreign scripts to an extent and I don't see why not. The same quotes without dots and stress marks will appear with dots and stress marks designed for young or foreign learners--Anatoli (обсудить/вклад) 06:06, 16 December 2013 (UTC)
I guess легкого (ljóxkovo [legkogo]). Though for Arabic I think they should be transliterated exactly the same way. --WikiTiki89 01:18, 12 December 2013 (UTC)
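The dual display suggested here, a pronunciation-based form followed by an orthographic one in brackets, can be illustrated with a small sketch. It assumes a partial letter table in the style of scientific transliteration (only the letters needed for this word are listed) and hypothetical helper names `orthographic` and `dual`; the pronunciation-based form is supplied by hand because it cannot be derived from the spelling alone.

```python
# Partial letter table in the style of scientific transliteration.
# Assumption for brevity: only the letters used in the example are listed.
ORTHOGRAPHIC = {"л": "l", "е": "e", "ё": "ë", "г": "g", "к": "k", "о": "o"}

def orthographic(word):
    """Letter-by-letter transliteration: a pure character mapping."""
    return "".join(ORTHOGRAPHIC[ch] for ch in word)

def dual(word, phonetic):
    """Format a word with a hand-supplied phonetic form and the
    automatically derived orthographic form in square brackets."""
    return f"{word} ({phonetic} [{orthographic(word)}])"

relaxed = dual("легкого", "ljóxkovo")
dictionary_form = dual("лёгкого", "ljóxkovo")
```

Under these assumptions the spelling without dots yields "legkogo" in the brackets and the dictionary spelling yields "lëgkogo", while the pronunciation-based part stays the same for both.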
I see. Do you see the letter ى‎ to have two graphical transliteration forms depending on whether it's an alif maqṣūra or a dotless yāʾ? If yes, what are they? --Anatoli (обсудить/вклад) 01:26, 12 December 2013 (UTC)
I was thinking something like "y" for yaa and "ỳ" for alif maqsura (which as an idea I got from the ISO romanization scheme). The trickiest thing will be all the forms of hamza (ء ئ ؤ أ إ آ) and the plain alif. --WikiTiki89 01:45, 12 December 2013 (UTC)

If we were to present two romanizations, then the first ought to follow a standardized system or generally accepted principles. (So should the second, really, but if we are putting forward the product of our unreviewed personal theories, then it should be secondary.) If a romanization is a phonemic or phonetic transcription, then it ought to be in /slashes/ or [square brackets] to convey its nature. Michael Z. 2013-12-19 15:35 z

Distinguishing words borrowed from American English, British English, etc

I noticed diff and it made me wonder: is this something we want to do more widely — distinguish foreign-language (e.g. Japanese) terms derived from one variety of English from those derived from another variety? I could see how it could be useful, and we already do something similar for some languages where the ISO gave codes to the dialects; e.g. we categorize words derived from Cajun French differently from those derived from European French. OTOH, I can also see how {{etyl|en|ja}} could be sufficient.
If we do want words to be in (subcategories of) Category:Terms derived from American English, should they also simultaneously be in subcategories of Category:Terms derived from English?
Relatedly, do we want to use {{etyl}} on, and categorize, Australian English terms derived from British English, British English terms derived from American English, and so forth? See diff. I'm on the fence about categorizing foreign-language terms based on dialect of origin, but I lean towards opposing things like Category:English terms derived from British English. - -sche (discuss) 22:16, 8 December 2013 (UTC)

If we decide to do this, any regional or temporal sublects should be codified in their own module similar to Module:languages. DTLHS (talk) 22:37, 8 December 2013 (UTC)
What's wrong with "From American English." (From American {{etyl|en|-}})? --WikiTiki89 22:54, 8 December 2013 (UTC)
The same reason we don't use "From American English" instead of "From American English"- it would be impossible to track, categorize, or standardize. DTLHS (talk) 23:08, 8 December 2013 (UTC)
{{etyl|en-US|fr}} gives <span class="etyl">[[w:American English|American English]][[Category:French terms derived from American English]]</span> and IINM has done so (or similarly) for a long time.​—msh210 (talk) 08:56, 10 December 2013 (UTC)
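The expansion quoted above can be mimicked with a short sketch. This is not the template's actual implementation: the lookup tables and the function name `render_etyl` are hypothetical, and only the output format is taken from the comment.

```python
# Hypothetical lookup tables for illustration; the real names are stored in
# Wiktionary's own templates and modules.
ETYMOLOGY_NAMES = {"en-US": "American English"}
LANGUAGE_NAMES = {"fr": "French", "ja": "Japanese"}

def render_etyl(source, target):
    """Build the link-plus-category wikitext for a source variety and a
    target language, in the format quoted in the discussion."""
    source_name = ETYMOLOGY_NAMES[source]
    target_name = LANGUAGE_NAMES[target]
    return (
        f'<span class="etyl">[[w:{source_name}|{source_name}]]'
        f"[[Category:{target_name} terms derived from {source_name}]]</span>"
    )

french = render_etyl("en-US", "fr")
japanese = render_etyl("en-US", "ja")
```

Because the category name is generated from the two codes, entries tagged this way can be tracked and categorized automatically, which is the advantage over writing "From American English" as plain text.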
@DTLHS: in fact, even if we don't decide to allow {{etyl|American English}}, we should possibly still migrate all the regional and temporal codes we do have (all in Category:Etymology-only language code templates, except Template:gkm) into a module. - -sche (discuss) 00:33, 9 December 2013 (UTC)
They've already been migrated to Module:etymology language/data. But the templates can't be deleted because {{derivcatboiler}} still needs them. I don't know why I didn't make a template-invokable function in Module:etymology language along the lines of "lookup_language" in Module:language utilities, so that this could be fixed. I'll do that tonight unless someone else wants to do it instead. —CodeCat 12:36, 9 December 2013 (UTC)

Personally, I think to distinguish foreign-language terms derived from one variety of English from another variety is beneficial, simply because, for example, 'lift' meaning 'elevator' wouldn't be used in American English, and vice versa for British English. CokeHanx (talk) 18:06, 26 December 2013 (UTC)

Rhymes categories again

Discussion moved from Wiktionary:Grease pit/2013/December#Rhymes categories again.
Previous discussion: Wiktionary:Grease_pit/2013/October#Convert rhyme pages into categories?

In October I proposed changing {{rhymes}} so that it adds entries to a category for each kind of rhyme. Some people supported, but those that opposed did so because they wanted rhymes to be organised by the number of syllables in the word. I don't see this as a particularly useful distinction; a rhyme is a rhyme and it doesn't matter which word you're using to rhyme, only the length of the rhyme itself is relevant. I would like to ask again what benefits of organising the words by syllables are so strong that they counter the benefits of using categories. Because I don't really see it, and I would still like to do this as a trial for one language, to see how well it works in practice. —CodeCat 15:28, 7 December 2013 (UTC)

  • I very much oppose, as I did at Wiktionary:Grease_pit/2013/October#Convert_rhyme_pages_into_categories.3F. To repeat myself: From browsing a random selection of Special:PrefixIndex/Rhymes:Dutch, I only found pages with one or two sections for the number of syllables. There quite possibly are pages with more sections, but my random browsing has not found them. Here is a random sample, in the collection of which I have discarded each found page that had less than 7 items: Rhymes:Dutch:-ʏs, Rhymes:Dutch:-ɑst, Rhymes:Dutch:-ɑŋk, Rhymes:Dutch:-ɪk, Rhymes:Dutch:-ɔrst, Rhymes:Dutch:-ɛi̯dən, Rhymes:Dutch:-ʏkt. I don't think one can appreciate the value of separation by the number of syllables from these Dutch rhyme pages. I don't think that the pages reflect the hypothesis that Dutch rhymes always have to match the number of syllables very closely; more likely, whoever has created the pages has not found a way of identifying more extensive lists of items that would include a larger variety of numbers of syllables, as has originally happened to me when I started to create Czech rhyme pages such as Rhymes:Czech:-atʃ or Rhymes:Czech:-ats. Again, to appreciate the separation by the number of syllables--and not by the letter of alphabet as categories automatically do--, check e.g. Rhymes:Czech:-ajiː or Rhymes:English:-eɪʃən. An aside: the Czech page uses a markup to create three column layout in Firefox and Chrome, which the English page does not.

    Furthermore, categories are limited to showing 200 items, whereas a rhyme page can easily host 5000 items, which is much more convenient. For instance, Rhymes:Czech:-ɛɲiː has over 3000 items, which requires 15 times clicking on "next" in a category; forget about pressing Control + F on a category page and finding what you want. --Dan Polansky (talk) 15:40, 7 December 2013 (UTC)
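The paging cost mentioned above is simple arithmetic. A minimal sketch, assuming the 200-item page size stated in the comment; the function names are illustrative only:

```python
import math

PAGE_SIZE = 200  # items shown per category page, as stated in the comment

def category_pages(item_count, page_size=PAGE_SIZE):
    """Number of category pages needed to list item_count entries."""
    return math.ceil(item_count / page_size)

def next_clicks(item_count, page_size=PAGE_SIZE):
    """Clicks on "next" needed to reach the last page from the first."""
    return max(category_pages(item_count, page_size) - 1, 0)

pages = category_pages(3001)
clicks = next_clicks(3001)
```

With just over 3000 items this gives 16 pages and 15 clicks on "next", which matches the figure quoted for Rhymes:Czech:-ɛɲiː.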

    • And you think that Rhymes:Czech:-ajiː is somehow useful with such a giant wall of text? I don't get it. If I were to be faced with such a page I'd just give up. —CodeCat 17:22, 7 December 2013 (UTC)
      • Sure it is useful. If you want to focus on a smaller section of the page, you can: you can for instance focus on the 19 items that have two syllables. The 3-syllable items of Rhymes:Czech:-ajiː fit to two and half screens on my laptop; I am using Firefox and seeing a 3-column layout there. If you put the content of the page into a category as you propose, you will see the first 200 items alphabetically sorted, regardless of the number of syllables. A category instead of a rhyme page does not make the volume belonging to the rhyme any less giant; it merely splits it to subpages, and removes the useful organizing principle that creates subsections: the number of syllables. --Dan Polansky (talk) 17:42, 7 December 2013 (UTC)
        • I agree with your proposal to delete all categories with more than 200 entries. DTLHS (talk) 17:51, 7 December 2013 (UTC)
          • Since I made no such proposal, should I ignore your statement? Or do you care to rephrase your remark into a coherent argument? --Dan Polansky (talk) 18:02, 7 December 2013 (UTC)
        • If the size of the categories is the problem, why not break them up into subcategories? That seems like a pretty simple solution. Break them up when needed, while keeping them unified if they're small enough, like for the Dutch categories you named. —CodeCat 18:10, 7 December 2013 (UTC)
          • If you create subcategories like Rhymes:Czech:-ajiː/2-syllable, Rhymes:Czech:-ajiː/3-syllable, Rhymes:Czech:-ajiː/4-syllable etc., the user won't be able to see e.g. 2-syllable and 3-syllable items at a glance. Having a single page makes it possible for the user to visually scan a large number of items if desired, or concentrate on subsections created by headings. I can't see a reason why the reader would not want to see 2-syllable and 3-syllable items at a glance; I for one surely want to see them at a glance. In some pages with smaller number of items, I want to see 2-syllable, 3-syllable and 4-syllable items at a glance. --Dan Polansky (talk) 18:19, 7 December 2013 (UTC)
I see clear benefits to automatically populated categories vs hand-maintained (or unmaintained) pages.
The notion that having to page through a category 200 entries at a time is somehow more inconvenient than having to scroll through 5000 items on a Rhymes page (to use the number which was used above) is strange.
To address the point that some people might want to look at all a word's rhymes at once while others would only be interested in e.g. two-syllable rhymes, the templates could automatically place entries into a general category and a syllable-count category.
I await the input of someone technically adept as to whether or not it would be possible to have the same template that adds the rhymes categories add a sort key that would sort the entry, within the rhymes category, by its number of syllables. For example, prey might sort as 1-prey, array as 2-array, assay as 2-assay, etc, and thus the words would be sorted first by their number of syllables and then alphabetically among other words with the same number of syllables.
There is one thing Rhymes pages can do that categories cannot, and that is: have {{qualifier}}s saying "only for the noun, not the verb", "only for the verb meaning be gloomy, not for the verb meaning make low", etc. (That last is taken from Rhymes:English:-aʊ.ə(ɹ).) - -sche (discuss) 22:47, 7 December 2013 (UTC)
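The sort-key idea raised above (prey as 1-prey, array as 2-array) can be sketched as follows. This assumes the syllable count is supplied for each entry and that category sort keys are compared as plain strings; the zero-padding is an addition not present in the original suggestion, included because an unpadded key such as "10-" would sort before "2-" in a string comparison. The function names are hypothetical.

```python
def rhyme_sort_key(word, syllables, width=2):
    """Build a string sort key: zero-padded syllable count, then the word.

    The padding width is an assumption; it keeps string ordering correct
    for words with ten or more syllables.
    """
    return f"{syllables:0{width}d}-{word}"

def sort_rhymes(entries):
    """Sort (word, syllable_count) pairs by syllable count, then by word."""
    return sorted(entries, key=lambda e: rhyme_sort_key(e[0], e[1]))

ordered = sort_rhymes([("assay", 2), ("prey", 1), ("array", 2)])
```

With such a key the category listing would group all one-syllable rhymes first, then two-syllable ones, and so on, with alphabetical order inside each group.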
Re: "The notion that having to page through a category 200 entries at a time is somehow more inconvenient than having to scroll through 5000 items on a Rhymes page (to use the number which was used above) is strange.": Is the notion strange or is it factually incorrect? If it is factually correct, what does it matter that it is strange to your mind? Fact is, it is crystal clear that it is more inconvenient and this is even admitted by two other editors in the #Category pages thread below on the page. A further point is that the first 200 items in a category are not going to be ordered by the number of syllables, so they are not going to be the first 200 most significant items, they are not going to be such items that, if you only were to see 200 of the whole set, they would be it. --Dan Polansky (talk) 07:54, 8 December 2013 (UTC)
Inconvenience is subjective (opinion), not objective (fact). - -sche (discuss) 18:11, 8 December 2013 (UTC)

I think it would be beneficial if entries were added to specific named rhyme categories but not sure about entries with different numbers of syllables. --Anatoli (обсудить/вклад) 06:27, 8 December 2013 (UTC)

@CodeCat: If you are going to notify people of this discussion[1], you have to do it in a fair manner, using a fair selection method of whom to notify. Instead of notifying just Atitarev, -sche, and Ruakh, you should have at least also notified msh210 who took part on the previous discussion and expressed less favorable views to your proposal. I generally cry foul about your behavior. If this thread were not noticed by me, you might have declared consensus one day after its start, as was your recent practice; the previous discussion took place less than two months ago. I herewith strenuously oppose any attempt to switch rhymes to categories without a vote, whether for a single language or several languages. --Dan Polansky (talk) 07:54, 8 December 2013 (UTC)

FWIW, I think that she notified Anatoli, -sche, and myself because we're three editors who did not participate in the previous discussion; that is, I think she was trying to get new input, rather than canvassing known supporters. So while I largely agree with you, I think we can at least A a bit more GF in this case. (Though admittedly, DCDuring also didn't participate, and she didn't notify him, either, for reasons about which we can only speculate.) —RuakhTALK 08:31, 8 December 2013 (UTC)

On another note, I do not think this should be a Grease pit discussion; this should be a Beer parlour discussion. This is not about finding technical means to do something that the community at large agrees should be done; this is about fundamentally changing the manner in which rhymes on Wiktionary are organized and presented. --Dan Polansky (talk) 07:59, 8 December 2013 (UTC)

Yes, I agree. In fact, the specific question that CodeCat poses here is "what benefits of organising the words by syllables are so strong that they counter the benefits of using categories", which doesn't seem to have any technical dimension whatsoever. I assume that she posted it in the GP either by mistake, or else simply because the previous discussion was in the GP. You should feel free to move it to the BP. —RuakhTALK 08:31, 8 December 2013 (UTC)
I didn't participate in this discussion and CodeCat didn't know my opinion. I don't see any problem in notifying and not notifying anyone. I also think {{audio}} should categorise entries, for example - entries with audio links. What is wrong with this and what is wrong with you, Dan? It's just a proposal, whether this is accepted is up to participants. I don't see a problem with a vote either. --Anatoli (обсудить/вклад) 08:39, 8 December 2013 (UTC)
Sure you don't see any problem, as your sense of fair processes is less than optimal, as evidenced in RFV. To explain it to you: if someone wants to use a discussion as evidence of consensus sufficient for an action that drastically changes how things are done in Rhymes, the choice of discussion venue and the means by which people are notified of existence of this discussion are of utter importance, since they stand a chance of significantly impacting the outcome of the discussion. --Dan Polansky (talk) 09:32, 8 December 2013 (UTC)
As far as assuming good faith on CodeCat's part, I did not make any statement about whether CodeCat chose this method of notification to acquire unjust power or by omission; I have merely pointed out that the procedure chosen appears unfair, or at least poorly thought out. Later: I am wrong, since I posted "I generally cry foul about your behavior", which suggests lack of assumption of good faith. Indeed, while I assume good faith as far as CodeCat's desire to build a good dictionary goes, I do not assume good faith as far as the choice of just methods for doing so goes, since there is enough evidence in the public record to the contrary.
Since I insist that this issue is only executed via a vote, moving this discussion to Beer parlour now is less critical, unless someone decides that this thread shows enough of a consensus that transfer to categories can start, which I am afraid is not unthinkable given recent actions of CodeCat. --Dan Polansky (talk) 09:58, 8 December 2013 (UTC)
  • Let me repeat one argument not made in this thread, that of redlinks. The Czech rhyme pages that I have created consist largely of redlinks, so they cannot be created by automatic categorization of mainspace pages. This will be the case for those languages for which no one is interested yet to create inflected-form pages. Redlinks can be manually pasted to category pages, but then in the manual manner identical or similar to placing them to rhyme pages. I surmise that I know what I am talking about, having created Category:Czech rhymes with 1,368 pages, many of which feature significant number of items, standing in stark contrast to the pages in Category:Dutch rhymes. On a related note, Category:Dutch rhymes features a category subdivision inconvenient to browse, whereas Rhymes:Czech contains tables providing a nice overview from which to jump to rhyme pages. To get an impression, try to randomly click items linked from the table in Rhymes:Czech#Disyllables. --Dan Polansky (talk) 09:45, 8 December 2013 (UTC)
I'm going to side with Dan here. His argument of red links is infallible. -- Liliana 12:56, 8 December 2013 (UTC)

Consensus at Wikipedia

Header was: Consensus

If there is a widely-held consensus for a page title on Wikipedia, should that consensus apply to Wiktionary too? Or is that a grey area? Pass a Method (talk) 17:42, 9 December 2013 (UTC)

Absolutely not. It's worth considering, because the Wikipedia consensus probably exists for a reason, but we should not blindly accept it. --WikiTiki89 17:45, 9 December 2013 (UTC)
Absolutely not. It's not even a gray area. (And pace Wikitiki89, I think it's probably not even worth considering, except in the general sense that all options are worth considering.) —RuakhTALK 17:55, 9 December 2013 (UTC)
By "worth considering", I meant that it is enough of a reason to have a discussion about it. --WikiTiki89 18:02, 9 December 2013 (UTC)
I understand, and I disagree. If someone here wants to discuss a given entry-title, then by all means; but I don't think there's any need to start a discussion here just because Wikipedia made some decision, and I'm davka enough that if "this is what Wikipedia decided" features prominently as an argument for something, then I will consider it an argument against it. :-P   —RuakhTALK 19:44, 9 December 2013 (UTC)
Yeah it's totally irrelevant. Mglovesfun (talk) 19:49, 9 December 2013 (UTC)
I agree that it's not a real argument for something, but that doesn't mean we should go out of our way to do the opposite of what Wikipedia does. (I didn't know that "davka" can also be an adjective, I always thought it was just an adverb.) --WikiTiki89 19:50, 9 December 2013 (UTC)
What is "a consensus for a page title"? That it be included in WP? That it be spelled a certain way? We have our own CFI and we accept all attestable spellings. What other matters are relevant for us? DCDuring TALK 20:51, 9 December 2013 (UTC)

"Proper" codes for etymology-only languages

In the past we've used some special "codes" for languages that are used only for etymologies ({{etyl}} and {{derivcatboiler}}), like "Vulgar Latin". But because these were so long we ended up resorting to shortcuts, which were implemented in templates as redirects, but which had their own problems. The Lua replacement Module:etymology language/data had to support both the full names and the shortcuts to account for all the existing entries. But I think we might as well look at how we can start using codes for these languages that match our existing ones:

  • Austrian German / AG. > "de-AT"
  • Viennese German / VG. > "de-AT-vie"
  • British English > "en-GB"
  • American English/AE. > "en-US"
  • Old Northern French/ONF. > "fro-nor" (note that "Modern Norman" is "roa-nor")
  • Canadian French/CF. > "fr-CA"
  • Old Latin/OL. > "la-old"
  • Late Latin/LL. > "la-lat"
  • Vulgar Latin/VL. > "la-vul"
  • Medieval Latin/ML. > "la-med"
  • Ecclesiastical Latin/EL. > "la-ecc"
  • Renaissance Latin/RL. > "la-ren"
  • New Latin/NL. > "la-new"
  • Pre-Greek/pregrc > "qfa-sub-grc" ("qfa-sub" is the family code we use for substrate languages)
  • Koine > "grc-koi"
  • Medieval Greek > "gkm" (this is actually a real existing language code; I submitted this to RFM)
  • Middle Iranian/MIr. > "ira-mid" (MIr. could also stand for Middle Irish...)
  • Old Iranian/OIr. > "ira-old" (same)
  • Lunfardo > "es-lun"
  • Sha. (Shanghainese) > "wuu-sha"

In each case, the code starts with the "parent language". I've used ISO country codes to identify national varieties of languages where I could, but that doesn't work for time periods nor subnational divisions. Is this proposal worth implementing? —CodeCat 21:51, 9 December 2013 (UTC)
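The naming scheme in the list above can be sketched as a small lookup table. This is only an illustration in Python, not the actual contents of Module:etymology language/data; the names `PROPOSED_CODES` and `parent_code` are hypothetical, and the table holds just a subset of the proposed codes.

```python
# Illustrative subset of the proposed codes, copied from the list above.
PROPOSED_CODES = {
    "Austrian German": "de-AT",
    "Viennese German": "de-AT-vie",
    "Old Northern French": "fro-nor",
    "Vulgar Latin": "la-vul",
    "Koine": "grc-koi",
    "Lunfardo": "es-lun",
}


def parent_code(code: str) -> str:
    """Return the part of a proposed code before its first hyphen.

    Assumption: the first hyphen-separated component names the parent
    language. This does not hold for codes such as "qfa-sub-grc", whose
    prefix is a family code, so those would need special handling.
    """
    return code.split("-", 1)[0]


# Map each variety name to the code of its assumed parent language.
parents = {name: parent_code(code) for name, code in PROPOSED_CODES.items()}
```

With this sketch, `parents["Viennese German"]` is `"de"` and `parents["Vulgar Latin"]` is `"la"`. A code without a hyphen, such as "gkm", is returned unchanged.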

Besides filling in gaps or reducing the number of keystrokes, what advantage is there to this for user input? Eg, why is "fro-nor" superior to "ONF."? Can't an extra column in a Lua table manage this? DCDuring TALK 22:05, 9 December 2013 (UTC)
Why can't we have both "fro-nor" and "ONF."? --WikiTiki89 22:09, 9 December 2013 (UTC)
In theory we can, but I don't think there is much benefit in having multiple codes for the same thing, and it causes a lot of headaches and complications. There was a time when we allowed both "en" and "eng" to stand for English (ISO 639-1 and -3 codes), but the longer codes were later deleted because many of our templates couldn't handle it without lots of extra code that wasn't worth the effort. The same would apply to modules as well. I think the only reason these "redirects" have survived is that they're not used by many templates, so there was little chance of things breaking. But even these did cause some problems: we had to create redirects for all the subpages too. So this kind of aliasing is not the best thing and we should only really do it if we have a proper need for it. —CodeCat 22:26, 9 December 2013 (UTC)
It would not need extra code at all, just this:
m["ONF."] = m["fro-nor"]
I don't see any downsides, while the upside is a smoother transition to the new codes. --WikiTiki89 22:36, 9 December 2013 (UTC)
If you're proposing to use both side by side as a way to ease the transition then it's not a problem. But I don't really like saying that we're going to use both side by side indefinitely. Technical issues aside (they would arise if, for whatever reason, we want to compare two codes for equality, or wanted to use the codes as part of the output text like many templates do), it would also be harder for people to remember two codes for the same thing. —CodeCat 22:52, 9 December 2013 (UTC)
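The trade-off discussed in this exchange can be illustrated with a rough Python sketch. The record fields and the `CANONICAL` map are hypothetical and are not taken from any existing module. Aliasing a key is one line, but comparing two codes for equality then needs an extra resolution step.

```python
# Hypothetical data table in the style of a language-data module.
m = {}
m["fro-nor"] = {"names": ["Old Northern French"], "parent": "fro"}
m["ONF."] = m["fro-nor"]  # alias: both keys point at the same record

# Hypothetical reverse map from every accepted key to one canonical code.
CANONICAL = {"fro-nor": "fro-nor", "ONF.": "fro-nor"}


def same_language(a: str, b: str) -> bool:
    """Compare two codes after resolving aliases to canonical codes."""
    return CANONICAL[a] == CANONICAL[b]


# The alias shares the record, but the code strings themselves differ.
shares_record = m["ONF."] is m["fro-nor"]            # True
naive_equal = "ONF." == "fro-nor"                    # False
resolved_equal = same_language("ONF.", "fro-nor")    # True
```

Any template or module that prints or compares the raw code would have to go through something like `same_language`, which is the kind of extra handling the reply above is worried about.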
Having two codes for the same language will make it easier for experienced editors (but only when they're working in familiar territory) — BUT they make it hard for newcomers, they have enough to get their heads around without having to wonder if two codes are two forms of a language or not. Once we've mastered the change unique codes will make life easier for everyone. Saltmarsh (talk) 14:29, 13 December 2013 (UTC)
To DCDuring: Can you please think further than how many letters you have to type? I understand that this might really be a concern for you, but from my point of view it's kind of shallow and nitpicky. I can't show you the advantages if you just have different priorities and different ideas of what improves Wiktionary. All I can say is that maybe you should discuss your priorities more openly with others so that, instead of getting upset when your needs are not met, you understand what other people think is important and why they don't pay much attention to what you think is important. It would reduce a lot of the friction you're causing with me, and I suspect that others may also be a bit frustrated with your attempts to oppose and complain about many of the recent changes to templates and modules, even if I don't know who because they haven't spoken out about it yet. —CodeCat 22:26, 9 December 2013 (UTC)
You're right. My expectations that, without any effort on my part, only changes would be made that had no bad consequence or some good consequence for my contributions to Wiktionary and some benefit to other contributors or users were shaped by some years of just that experience here. I thought that the topical-category edifice was a mistake not likely to be repeated. Clearly my expectations were unrealistic. I never should have hoped that such a blissful state of affairs would continue. I stand disabused. DCDuring TALK 00:31, 10 December 2013 (UTC)
What topical category edifice would that be? —CodeCat 03:15, 10 December 2013 (UTC)
NB, we also have exceptional codes for "pre-Roman (Iberia)" (und-ibe) and "pre-Roman (Balkans)" (und-bal), which declared qfa-not as their family until I updated them just now to use qfa-sub. They're currently located in Module:languages/datax, but they should probably be moved into the etym-only-language module, since AFAICT they are not allowed L2s any more than pre-Greek is. And perhaps their codes and pre-Greek's should be formed the same way, i.e. pre-Greek should be renamed und-grc or the pre-Romans should be renamed qfa-sub-. - -sche (discuss) 06:04, 10 December 2013 (UTC)

Mong and Cyrl Mongolian

Hippietrail (talkcontribs) seems to be desperately trying to add traditional script to Mongolian entries, translations or more likely numerous requests but "mn" is currently reserved for Cyrillic with automatic transliteration and other things in modules (I am to blame for this too). I think a solution is to have a separate language code for traditional Mongolian and have nested translations, even though we don't have editors skilled in traditional Mongolian, resources are extremely poor and his own interest may be fleeting. --Anatoli (обсудить/вклад) 00:49, 11 December 2013 (UTC)

Is Mongolian written in traditional script a separate language though? Or is it more like the situation with SC? DTLHS (talk) 00:55, 11 December 2013 (UTC)
It's not classified as separate AFAIK but there are some, possibly big, differences between the Inner and Outer Mongolian forms. I think they're also understudied. Cyrillic Mongolian really needs not only automatic but mandatory translit (which I have implemented) but traditional needs separate, manual transliteration or no transliteration. There is often no one-to-one correspondence between the two forms. They're presumably mutually understandable. Alternatively, a script detection for trad. Mongolian should be used, so that auto-translit is disabled. --Anatoli (обсудить/вклад) 01:15, 11 December 2013 (UTC)
Technicalities: In Module:translations/data "has_auto_translit" is set to "true" for "mn", which makes Module:mn-translit mandatory. Module:languages/data2 has translit_module = "mn-translit", which allows transliteration. --Anatoli (обсудить/вклад) 01:24, 11 December 2013 (UTC)
If they are the same language, then there is no need to create a separate language code. We have plenty of languages that allow multiple scripts. --WikiTiki89 01:57, 11 December 2013 (UTC)
A different variety of Mongolian is used as "standard" in Inner Mongolia vs Outer Mongolia. I'm sure there are some who regard them as different languages and some who regard them as dialects. I am informed by locals that they understand each other.
I do not know if the script has undergone any changes in China. They have simplified other minority writing systems so it's not out of the question. In Mongolia it's not currently widely used and definitely preserves old spellings not reflected in modern pronunciation.
If we did split the two into different languages then Inner Mongolian entries would need to be Mongolian script only as Cyrillic is not used, or known by Inner Mongolians. For Outer Mongolian (Khalkha) both scripts are needed, as the traditional has great historical and cultural importance while the Cyrillic is known and used by everybody in the present day.
I've bought dictionaries in both countries and none claim to be for a different language, standard, or dialect, just "Mongolian". — hippietrail (talk) 02:16, 11 December 2013 (UTC)
What about the transliteration issue? I don't want to disable automatic transliteration for Cyrillic Mongolian for a couple of traditional mentions. If modules can handle them differently, it's fine. --Anatoli (обсудить/вклад) 02:11, 11 December 2013 (UTC)
We'll have to change the modules to allow specifying automatic transliteration per script (unless this can already be done). --WikiTiki89 02:35, 11 December 2013 (UTC)

Info from hippietrail disturbed by edit conflict

Hi there. There was a little bit of traditional Mongolian script here that I came across before I started. Mostly red links in translation entries but I didn't keep track of any of them.
I also started a thread in one of the Mongolian script template talk pages but nobody responded yet. I brought up a few issues there.
The most important point is that while the newer Cyrillic script used in Mongolia is pretty phonetic, the older Mongolian script used in Inner Mongolia preserves old spellings that don't reflect modern pronunciations. A bit like English.
This means one cannot be automatically generated from the other. It also means that for the traditional script we must decide whether we want a true transliteration, or a phonetic transcription.
If we go with a transliteration it will not match the transliteration for the Cyrillic, but it will be able to be automatically generated. It will also help users know which characters make up a word as it's a complex cursive script where some letters look alike in medial position like with Arabic, but which requires some invisible Unicode control characters like some Indic/Southeast Asian scripts.
If we go with a phonetic transcription it would need to be manually provided for each entry, would belie all the documentation which uses the term "transliteration", and will be redundant for entries which also have a pronunciation section.
I did try to look at how the current transliterating is done but couldn't find where it's documented. I also don't know which approaches the other multi-script languages take (I bet they each have a different strategy). We either need to make the "mn" transliterator smarter to detect "Mong" vs "Cyrl" and invoke different code, or to ditch "mn" for "mn-Cyrl" and "mn-Mong". Or come up with something new. This is probably a topic for the Grease Pit.
This turns out to be very timely for just a few days ago, traditional Mongolian calligraphy won a place as a UNESCO-protected item of intangible world cultural heritage! Yesterday in Ulaanbaatar I coincidentally bumped into the former Mongolian Minister for Culture who told me this while celebrating with vodka shots with his friend. I impressed him by showing him the Mongolian script textbook I bought secondhand a few days earlier.
There is one good introductory website on traditional Mongolian script and how it is not phonetic, and a few which go into some of the Unicode quirks required to encode and render Mongolian correctly.
So far I only have the aforementioned textbook and a dictionary in traditional script. I'm on the lookout for somebody to teach me a bit while I'm here but maybe I won't find anybody until I get back to Hohhot around Christmas.
hippietrail (talk) 02:10, 11 December 2013 (UTC)
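The first option mentioned above, making the transliterator detect "Mong" versus "Cyrl", could look roughly like the following Python sketch. It is not how Module:mn-translit works; `detect_script` is a hypothetical helper, and the only facts relied on are the Unicode block ranges for Mongolian (U+1800 to U+18AF) and Cyrillic (U+0400 to U+04FF).

```python
from typing import Optional


def detect_script(text: str) -> Optional[str]:
    """Guess the script code of a Mongolian term from its code points.

    Returns "Mong" for the first character found in the Mongolian block,
    "Cyrl" for the first character found in the Cyrillic block, and None
    if the text contains neither.
    """
    for ch in text:
        cp = ord(ch)
        if 0x1800 <= cp <= 0x18AF:
            return "Mong"
        if 0x0400 <= cp <= 0x04FF:
            return "Cyrl"
    return None
```

A dispatcher could then call one routine for "Cyrl" input and a different one (or none at all) for "Mong" input, without needing separate "mn-Cyrl" and "mn-Mong" codes.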
I would support the orthographic transliteration, but I'm sure Anatoli will disagree. --WikiTiki89 02:35, 11 December 2013 (UTC)
Yes, for contextual forms, there are dual readings o/u, ö/ü, etc. anyway, which have to be determined by the context; neither reading is better than the other. Due to the lack of native knowledge, they can be assigned specific values for the time being with the ability to manually override when the method of transliteration is determined. --Anatoli (обсудить/вклад) 02:44, 11 December 2013 (UTC)
Actually there are bigger issues than dual readings for a single letter. There are also silent letters. Traditional script is taught via how to write all the possible syllables and all the agglutinative endings. It's not taught phonetically letter by letter. Or if it is that comes as a later step. — hippietrail (talk) 05:15, 11 December 2013 (UTC)

The code 'mn' should be reserved for the classical Mongolian script, which is used by a much bigger population than the Cyrillic alphabet. And mn-translit should be renamed. The two scripts are not easily convertible (the traditional script essentially writes the Mongolian language several centuries ago). It's very difficult for Cyrillic users to deduce the traditional script form, but it's much easier the other way - speakers using the traditional script know how to pronounce words in the traditional script in their native dialects. Wyang (talk) 08:31, 11 December 2013 (UTC)

Isn't this the same situation as Old English and Old Norse where Latin script replaced runes? Mglovesfun (talk) 10:54, 11 December 2013 (UTC)
Compare "ᠮᠤᠨᠭᠭᠤᠯ ᠬᠡᠯᠡ" with "монгол хэл" in Google. Hohhot, Inner Mongolia is the place people come to learn standard Mandarin and look at traditional Mongolian culture, Mongolians are in minority there but they come to Ulaanbaatar to study Mongolian. There is not much support for traditional Mongolian script by any software giant and the web penetration is close to zero. Hippietrail disagrees, there may be a surge in the future. It's not a very phonetic but a traditional script, though. --Anatoli (обсудить/вклад) 11:45, 11 December 2013 (UTC)
@Wyang, We do not reserve language codes for specific scripts. Any scripts used to write a language can be used to add entries here, which is why we have script codes in addition to language codes. We have many languages here that use multiple scripts. The preference is to have entries for all variants. --WikiTiki89 16:11, 11 December 2013 (UTC)
I agree with this. It doesn't seem that Mongolians in Mongolia cherish their script that much, though. It is complicated and there's little support. We definitely need to cater for both but we don't have access to online dictionaries, any texts with transliteration. The situation is worse than with Sinhalese. --Anatoli (обсудить/вклад) 23:49, 11 December 2013 (UTC)
Are you sure you're not confusing "cherishing" a script with "using" it online? --WikiTiki89 00:16, 12 December 2013 (UTC)
No, I meant Mongolians in Mongolia (Outer Mongolia). They haven't converted back to the traditional script and it's not clear if they ever will. Mandarin is dominating in Inner Mongolia, not Mongolian. I suspect it's a big language loss there despite what Hippietrail said. You can't blame them. Buryats can't use Buryat or Kalmyk can't use Kalmyk in all life situations. I could be wrong and my opinion doesn't change anything, we need to support both forms of Mongolian if it's technically possible but it's hard, we'll have to provide edit tools, there doesn't seem to be an input method available. —This unsigned comment was added by Atitarev (talkcontribs).
I just think "cherish" is the wrong word, I can't imagine that they don't cherish it, even if they don't know it very well. --WikiTiki89 00:50, 12 December 2013 (UTC)
  • We can use one code for Mongolian and still have entries and nested lines in translations tables for both the Mongolian-script and the Cyrillic-script forms. We have both Latin- and Arabic-script Malay and Afrikaans, for example, and we have both Latin- and Cyrillic-script Romanian, and Serbo-Croatian. (NB only Serbo-Croatian has a large number of entries in multiple scripts; in the other languages, only a few entries are not in the Latin script.) Splitting Inner and Outer Mongolian would not solve the script issue, since all varieties have been written, in the historical period, in both scripts. - -sche (discuss) 20:59, 11 December 2013 (UTC)
    • Mandarin and Japanese also have many entries in multiple scripts. Romanizations count too, because they don't work any differently from a technical point of view. —CodeCat 21:29, 11 December 2013 (UTC)
      Is there any way to enable/disable automatic transliterations on a per-script basis within a language? --WikiTiki89 21:32, 11 December 2013 (UTC)
        • Not directly, but we could introduce the convention of returning "nil" from the transliteration module if no transliteration can be done. —CodeCat 00:35, 12 December 2013 (UTC)
Kephir (talkcontribs) has done it already; see Module:mn-translit. We should thank him for that and think of the transliteration options for traditional Mongolian. --Anatoli (обсудить/вклад) 23:49, 11 December 2013 (UTC)
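The convention suggested above, returning nothing from the transliteration routine when no transliteration can be done, can be sketched as follows. This is a Python illustration only and makes no claim about what Module:mn-translit actually contains. The table covers just the letters of "монгол хэл", and its romanisation values are placeholders, not the scheme used on the wiki.

```python
from typing import Optional

# Toy table covering only the letters of "монгол хэл". The Latin values
# are placeholders chosen for the example, not an agreed romanisation.
CYRL_TO_LATN = {
    "м": "m", "о": "o", "н": "n", "г": "g", "л": "l",
    "х": "h", "э": "e", " ": " ",
}


def tr(text: str) -> Optional[str]:
    """Transliterate Cyrillic text, or return None if a character is unknown."""
    out = []
    for ch in text:
        if ch not in CYRL_TO_LATN:
            return None  # signal "no automatic transliteration available"
        out.append(CYRL_TO_LATN[ch])
    return "".join(out)


def display_translit(text: str, manual: Optional[str] = None) -> Optional[str]:
    """Use the automatic result when there is one, else the manual value."""
    auto = tr(text)
    return auto if auto is not None else manual
```

With this arrangement Cyrillic input always gets the automatic result, while traditional-script input falls through to whatever transliteration the editor supplied by hand, or to none.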

Either restore glosses for pinyin entries, or delete them entirely

The pinyin entries we have now are very much useless. I tried to look up a phrase that someone gave, and came across shāo. And I just stopped. It wasn't of any use for me at all. What was I supposed to do? I have raised this complaint before but I am kind of upset that nobody has made any effort to remedy it. Our pinyin entries have no point anymore, because they make it prohibitively inefficient to find the information you need. So they should either have the glosses added back in, or be deleted altogether. —CodeCat 23:18, 12 December 2013 (UTC)

Chinese people would also struggle in many cases to convert pinyin into hanzi without context, it's especially hard to translate names. All depends on the words, of course. You can try posting the entire string at Wiktionary:Translation requests. I am curious, what are you looking for? I can try and Wyang is a native speaker. --Anatoli (обсудить/вклад) 23:39, 12 December 2013 (UTC)
But I did have context. I knew that the phrase was some kind of greeting. I heard it only spoken, and only by a non-native speaker, so I couldn't clearly make out the tones and just had to kind of guess. The amount of searching I had to do made it pretty much impossible for me to find out what was being said. It would have been much easier to do if it had been, say, Russian. —CodeCat 23:44, 12 December 2013 (UTC)
Delete them entirely. Wyang (talk) 23:20, 12 December 2013 (UTC)
I don't see how you can think deleting them would be better than the current situation. The problem is that there is a paradox: the more characters have X as a pinyin reading, the more useful glosses would be at X and the more difficult it would be to maintain the glosses listed at X. --WikiTiki89 23:23, 12 December 2013 (UTC)
Transliteration/transcription entries should not even exist in the first place. For non-native speakers' sake, they should be put in the appendix at most, like this. Wyang (talk) 23:29, 12 December 2013 (UTC)
Delete them, our software is clearly not built for this kind of query and it likely won't be for a long time. DTLHS (talk) 23:25, 12 December 2013 (UTC)
It's impossible to keep pinyin entries accurate. Any published dictionary uses the same method of listing pinyin and characters only; to check the meaning, you need to look up the individual characters. Pinyin serves as an index only. Any mini or full definition will require maintenance and keeping in sync with character entries. It's the character definitions that need to be updated. Monosyllabic pinyin entries were imported once and never checked. They were full of ambiguity and inaccuracies, misleading people, e.g. that () means "coffee" when it's only a component, never used on its own.
To delete them entirely you need a new vote, for the structure of pinyin entries we had one - Wiktionary:Votes/2011-07/Pinyin entries.--Anatoli (обсудить/вклад) 23:31, 12 December 2013 (UTC)
As far as I can see, the vote actually supports adding glosses back in. After all, the current amount of information isn't the "modicum" needed. It didn't help me find which of the dozen "shāo" characters I was supposed to look at. —CodeCat 23:36, 12 December 2013 (UTC)
By modicum was meant links to hanzi, no definition, synonym, pronunciation, see also, etc. The model was yánlì, which had only links. --Anatoli (обсудить/вклад) 23:43, 12 December 2013 (UTC)
  • I support restoring the glosses, now as before. We already had glosses; they were feasible. --Dan Polansky (talk) 20:32, 13 December 2013 (UTC)
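The lookup problem described in this section (hearing a syllable without being sure of its tone) can be illustrated with a small toneless index. This is a Python sketch under stated assumptions: the sample table below is a tiny hand-picked set of readings, not data from any Wiktionary module, and `strip_tones` and `toneless_index` are hypothetical names.

```python
import unicodedata
from collections import defaultdict

# Small illustrative sample of pinyin readings and hanzi.
PINYIN_TO_HANZI = {
    "shāo": ["烧", "稍", "捎", "梢"],
    "sháo": ["勺"],
    "shǎo": ["少"],
    "shào": ["少", "哨", "绍"],
}


def strip_tones(pinyin: str) -> str:
    """Drop combining marks after NFD decomposition.

    Note that this also turns "ü" into "u", which a real index would
    have to handle separately.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Index every (reading, character) pair under its toneless spelling.
toneless_index = defaultdict(list)
for reading, chars in PINYIN_TO_HANZI.items():
    for ch in chars:
        toneless_index[strip_tones(reading)].append((reading, ch))
```

Looking up `toneless_index["shao"]` returns all nine (reading, character) pairs in the sample, which shows why an index without glosses leaves the reader with a long list of candidates to check one by one.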

A vote on allowing or forbidding Jyutping has begun

Pursuant to this discussion, I drafted a vote on allowing or forbidding Jyutping: wt:Votes/2013-11/Jyutping. That vote is now open. - -sche (discuss) 00:08, 13 December 2013 (UTC)

Webster 1913 headwords completed

I have finished this Webster 1913 import project after a couple of years! The last few entries I merged in (because of their frightening length) were go, hold, strike, set, on, take, and run. Next I might take a look at all the derived terms. Equinox 13:14, 13 December 2013 (UTC)

Congratulations. The end product is wonderful, particularly for the older terms that we would otherwise never have gotten around to this decade. The derived terms project will also add many expressions that we lack, both the dated and the current. You are a true Stakhanovite. Thanks. DCDuring TALK 13:37, 13 December 2013 (UTC)
Awesome! —RuakhTALK 19:32, 13 December 2013 (UTC)

You rock. Michael Z. 2013-12-14 20:12 z


Pronunciations are currently largely manually entered by editors. This leads to

  • misuse of notation, with various editors using different transcription systems; e.g. using different symbols for the same phoneme in different entries
  • misuse of [] and //. I've seen entries where a word in one language is given several different phonemic transcriptions, which is absurd.
  • lack of qualifiers (is it standard pronunciation, editor's native vernacular, or something else)

Pronunciations should all be Luafied like transliterations. These would be easy for languages whose alphabets are mostly phonemic/phonetic, but even for languages such as English or Russian which use etymological orthography generating IPA transcriptions from some "canonical transcription" being fed as input to a template would prove beneficial. Benefits:

  • unified use of symbols in IPA transcriptions with no possibility of errors.
  • extensibility. Suppose editors decided to add some regional pronunciation. It would simply be a matter of generating another row with appropriate label on the basis of well-defined rules encoded in Lua.
  • automatically categorizing words which are somehow anomalous in pronunciation (e.g. unexpectedly silent or non-silent letters)
  • easier interface (typing IPA symbols is impossible, whereas supplying the word itself as a parameter to a template call, perhaps with some modifications, is much easier).

--Ivan Štambuk (talk) 04:42, 14 December 2013 (UTC)
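For a language with a mostly phonemic spelling, the kind of generation proposed above can be sketched in a few lines. The example below is Python rather than Lua and uses a deliberately simplified rule table for Serbo-Croatian Latin spelling, chosen here only as an illustration: it ignores pitch accent, vowel length, and assimilation, and `RULES` and `to_ipa` are hypothetical names, not an existing template or module.

```python
import unicodedata

# Simplified grapheme-to-phoneme rules. Digraphs come first so that
# "nj" is not read as "n" followed by "j".
_RAW_RULES = [
    ("dž", "d͡ʒ"), ("lj", "ʎ"), ("nj", "ɲ"),
    ("č", "t͡ʃ"), ("ć", "t͡ɕ"), ("đ", "d͡ʑ"),
    ("š", "ʃ"), ("ž", "ʒ"), ("c", "t͡s"), ("h", "x"), ("v", "ʋ"),
]
# Normalise the keys so precomposed and decomposed input both match.
RULES = [(unicodedata.normalize("NFC", g), p) for g, p in _RAW_RULES]


def to_ipa(word: str) -> str:
    """Convert a spelled word to a rough phonemic transcription."""
    w = unicodedata.normalize("NFC", word.lower())
    out, i = [], 0
    while i < len(w):
        for graph, phone in RULES:
            if w.startswith(graph, i):
                out.append(phone)
                i += len(graph)
                break
        else:
            out.append(w[i])  # letters without a rule pass through unchanged
            i += 1
    return "/" + "".join(out) + "/"
```

Because every entry would pass through the same table, the symbols stay uniform across entries, and a word whose real pronunciation differs from the generated one is exactly the kind of anomaly that could be flagged for categorisation.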

Not all languages have regular pronunciations. In fact, I'm willing to say that most don't. A lot of languages that supposedly have phonetic orthography really have a lot more exceptions than people realize. Phonetic orthographies are just a non-existent ideal. --WikiTiki89 04:50, 14 December 2013 (UTC)
I don't really see how this proposal would improve anything. IPA is the only fully reliable pronunciation transcription. That's why we use it. Trying to invent some new kind of intermediate form for each language wouldn't help. IPA is already widely known so we should stick with it. —CodeCat 04:56, 14 December 2013 (UTC)
@Wikitiki89: Most languages have (mostly) phonetic/phonemic orthographies. Many "big languages" don't for historical reasons, unless they have been reformed/(re)codified recently. But even in such cases where special transcription scheme would be needed the listed benefits are still valid. It's something to think about in the long term - making appropriate architectural arrangements now so that we don't have to deal with tens of thousands of entries that should be cleaned up in the future. --Ivan Štambuk (talk) 05:13, 14 December 2013 (UTC)
Re: "I've seen entries where a word in one language is given several different phonemic transcriptions, which is absurd": I don't see what's absurd about it. It's actually pretty common that a word will have multiple different pronunciations; for example, schedule may be pronounced with either /sk/ or /ʃ/, and privacy with either /aɪ/ or /ɪ/; these are not just different realizations of the same underlying phoneme(s), since in other cases there exist shared minimal pairs (skip vs. ship, live (adjective) vs. live (verb)). Likewise, it's pretty common that different dialects will have incompatible phoneme inventories; for example, some accents distinguish /oəɹ/ from /ɔːɹ/ (e.g. hoarse from horse), while others have only a single /ɔːɹ/. —RuakhTALK 07:28, 14 December 2013 (UTC)
Multiple co-existing pronunciations reflecting different origin are a different thing and they should be transcribed differently. They are actually different words, despite them being spelled the same. Though I'm not sure about the [aɪ] : [ɪ] realizations in privacy - if they have a common ancestor from some earlier form of English then they do represent the same underlying phoneme and they should be transcribed the same, or within []. If dialects have different phonemic inventories, then they're not really dialects - they're different languages. English is in a special position due to extreme geographical distribution in many varieties and the lack of some standard idiom which is usually present in other languages, on which phonemic transcriptions are based and against which "dialects" are compared. But consider this - if regional varieties of English have predictable variations in pronunciations and those "phonemic inventories" regularly map, simply by providing a single canonical input we could simultaneously generate a dozen or so regional pronunciations. It would increase uniformization, reduce errors and inconsistencies by eliminating human factor, and provide an opportunity to catch and categorize anomalies such as pronunciation being other than the expected one and so on. --Ivan Štambuk (talk) 09:10, 14 December 2013 (UTC)
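The idea of deriving several regional rows from one canonical input can be made concrete with the hoarse/horse example given earlier in this thread. The following Python sketch assumes the canonical form keeps the distinction and that each accent is described by a list of rewrite rules; the accent labels and the names `CANONICAL`, `ACCENT_RULES` and `realise` are hypothetical.

```python
# Canonical transcriptions keep the hoarse/horse distinction cited above.
CANONICAL = {"hoarse": "hoəɹs", "horse": "hɔːɹs"}

# Hypothetical accent labels, each with ordered (from, to) rewrite rules.
ACCENT_RULES = {
    "distinguishing": [],
    "merging": [("oəɹ", "ɔːɹ")],
}


def realise(word: str, accent: str) -> str:
    """Apply an accent's rewrite rules to the canonical transcription."""
    ipa = CANONICAL[word]
    for old, new in ACCENT_RULES[accent]:
        ipa = ipa.replace(old, new)
    return "/" + ipa + "/"
```

Under the "merging" rules both words come out the same, while under "distinguishing" they stay apart. This only works in one direction: a canonical form that already has the merger cannot be expanded back into the distinction, so the canonical input would need to encode every contrast that any covered accent makes.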
[after e/c] Your comments about privacy sound like nonsense to me. To clarify that specific case — there the related form private has /aɪ/ everywhere, and the related form privy has /ɪ/ everywhere, so presumably privacy either originally had free variation that drove to fixation differently in different places, or else it originally had one vowel in all dialects until in one dialect it became contaminated by a related form. (I am guessing, for a few reasons, that it was originally /aɪ/ and then got contaminated by privy in U.K. English; but I really don't know.) Regardless, there certainly is not a shared underlying phoneme, unless you want to be very dogmatic, require a priori that there must be a shared underlying phoneme, and then invent fascinating realization rules to account for the current distribution.
Re: "If dialects have different phonemic inventories, then they're not really dialects - they're different languages": O.K., this just is nonsense. No hedge-words required. :-)   Maybe that's true in Serbo-Croatian, but in U.S. English two people can actually have nearly identical accents except that only one has a certain merger. One of my friends in college (in Ohio) was a New Englander who nonetheless sounded perfectly local, but who found it hilarious that I did not distinguish <Aaron> from <Erin>. (He never noticed that most of the locals lacked that distinction, too. See English-language vowel changes before historic /r/#Mary–marry–merry merger.)
RuakhTALK 19:34, 14 December 2013 (UTC)
For "privacy", see w:Trisyllabic laxing. —CodeCat 21:37, 14 December 2013 (UTC)
I only argued that they were the same underlying phoneme if they had the same origin. If we're dealing with different origin, such as "contamination" that you mention, then they're really different words. They should be in fact listed under different etymology sections.
Regarding your other example - If you want to analyze those two forms of English within a single phonological system (i.e. a single language), then they must be transcribed the same using phonemic transcription. Since I doubt that such trivial mergers warrant separate language/variant treatment, giving them different transcriptions is wrong and misleading. --Ivan Štambuk (talk) 23:01, 14 December 2013 (UTC)
You want our entry for privacy to say Alternative form of privacy? That wouldn't make much sense to me, and I think most editors would remove that, thinking it a mistake. —CodeCat 23:03, 14 December 2013 (UTC)
Assuming it's correct, the separate origin of those two different transcriptions should be in some way indicated. Ideally, input to dictionary would not only be spelling-based, but also transcription-based, so such redirects would be possible. /praɪvəsi/ being alternative form of /prɪvəsi/ does not look so absurd. The problem is that spelling : pronunciation/transcription : etymology are mutually orthogonal and we're stuck with two-dimensional hierarchy (spelling + indentation) and various practical concessions (every separate meaning is itself a separate "etymology" in a strict sense..). --Ivan Štambuk (talk) 23:21, 14 December 2013 (UTC)
Wiktionary is primarily spelling-based. We divide by spelling first, then by etymology, and then by pronunciation. This has worked for most cases. Trying to divide by pronunciation first doesn't really work, because you can't treat the two pronunciations of "privacy" as entirely separate things. They're clearly related because they have the exact same meaning and are in free variation with one another (alternative forms, like you said). So if we want to make any indication of the origin of these two pronunciations, then that should be handled in one and the same etymology section. This isn't as hard as it might seem... Etymology Online often discusses the etymologies of individual senses too, and also says when they were first attested. We can do the same, and we can do it for varying pronunciations too if needed. —CodeCat 23:29, 14 December 2013 (UTC)
Users are not interested in spellings, but particular meanings of words. Lexicographers OTOH are. Those L4 headers (derived terms, *nyms, translations) are all attached to individual meanings. And similarly etymologies, which are currently a mixture of phonological history of a word (which is common to all meanings) and etymologies of individual meanings (when they were invented/attested and by whom, how their semantics evolved etc.). Ideally there would be a way to relate those orthogonal concepts in a free manner, with a more specific markup (e.g. XML, and not wiki header indentation), independent of presentation. But none of that can be done without upgrading the platform (MW), so we're stuck with writing a dictionary with a software designed to write encyclopedic articles. Judging from this discussion, the traditional approach introduces more confusion than clarification. --Ivan Štambuk (talk) 02:00, 15 December 2013 (UTC)
Phonemic inventory isn't a reliable indicator of dialect-ness. In fact nothing is, otherwise people would have settled the issue by now. You're oversimplifying things. —CodeCat 19:10, 14 December 2013 (UTC)
The point was that these different "dialects" of English are not really dialects, but different languages, pluricentric variants, whatever you'd like to call them. Phonemic analysis presupposes that language is taken as a whole. By definition, different phonemic transcription should never occur for one word in one language, unless we're dealing with different analyses (e.g. disputed phonemes). But even then there are benefits from the suggested approach, because various regional variants, notations etc. regularly map into each other. --Ivan Štambuk (talk) 19:31, 14 December 2013 (UTC)
Sorry, but you're simply mistaken. —RuakhTALK 19:34, 14 December 2013 (UTC)
(after edit conflict) Except that that's not true. Take for example the word exactly. It is sometimes pronounced /ɪɡˈzæktli/ and sometimes /ɪɡˈzækli/ (often even by the same people). Yet the /t/ in intactly /ɪnˈtæktli/ is almost never dropped even in a situation where the /t/ in exactly would have been. --WikiTiki89 19:40, 14 December 2013 (UTC)
Eliding sounds in fast-paced speech has absolutely nothing to do with phonemic transcription. // is not supposed to convey exactly how the word is pronounced, only phonemic oppositions. /ɪɡˈzæktli/ and /ɪɡˈzækli/ represent two different words. --Ivan Štambuk (talk) 22:49, 14 December 2013 (UTC)
First of all, fast speech has nothing to do with it. People say /ɪɡˈzækli/ when speaking slowly. Second of all, how would you indicate the difference between a /t/ that can be elided and one that can't in a phonemic transcription if you don't consider eliding sounds to be phonemic? --WikiTiki89 22:54, 14 December 2013 (UTC)
I never said that eliding sounds is not phonemic. It could be subphonemic - but in this case it isn't. What we have here is two different words. If English spelling were not so defective, it would've been more obvious. --Ivan Štambuk (talk) 23:06, 14 December 2013 (UTC)
I think I understand your point now. But we are building a dictionary here, not a linguistic database of utterances. It would be stupid to have two separate entries for the word "exactly", when they are used exactly the same way. --WikiTiki89 23:24, 14 December 2013 (UTC)

We already have a “canonical transcription” for English, in Appendix:English pronunciation. If it is being misapplied, then maybe it needs to be clearer or more prominent. If IPA is difficult to use, then there is already a text-only version, SAMPA/X-SAMPA. Established standards may not be perfect, but launching any development of an ever-changing, committee-maintained novel system would be far worse. Michael Z. 2013-12-14 22:24 z

I don't think it was meant that IPA is too difficult to type, but that it is too complicated a system. --WikiTiki89 22:30, 14 December 2013 (UTC)
Well Ivan did mention that “typing IPA symbols is impossible.” IPA is too complicated how? In Appendix:English pronunciation we have defined a subset that is hardly more complicated than our AHD/enPR pronunciation system, but only because we have chosen to show the distinction between different accents. Michael Z. 2013-12-16 04:10 z
Too complicated in the sense that there are a lot of symbols that can be confused or misused (and not everyone regularly consults Appendix:English pronunciation). --WikiTiki89 04:12, 16 December 2013 (UTC)
What simpler alternatives are there? Finding an alternative system that is simpler but not used by everyone in the world? Launching a project to create a simpler alternative for IPA? Different systems for each language? (We do use AHD/enPR as a supplement for English, scientific transcription for Slavic languages.) None of these is a viable replacement for IPA. It is the worst system in the world, except for all of the others. Michael Z. 2013-12-16 18:31 z
I never said there are any. If I told you that the human brain is very complicated to understand, would you ask, "What simpler alternatives are there?" --WikiTiki89 18:43, 16 December 2013 (UTC)
Regarding the generation of pronunciation info from words' written forms, or "canonical transcriptions": I imagine it'd be possible to generate pronunciations automatically from spellings in artificial languages like Esperanto and Ido. For natural languages with mostly regular orthographies (especially those with regulatory bodies, like French), we might decide that the benefit of a template that auto-generated standard pronunciations (while allowing them to be overwritten, or supplemented with additional pronunciations) outweighed the risk of error. For languages like English, automatic generation of pronunciation info would be unacceptably error-prone, and the introduction of some other pronunciation transcription system besides IPA would be, IMO, unnecessary and undesirable.
Regarding the rest, including the frankly bewildering suggestion that words cannot have more than one phonemic realization and transcription: what CodeCat and Ruakh and Wikitiki said. - -sche (discuss) 22:27, 14 December 2013 (UTC)
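To make the "auto-generate but allow overriding" idea concrete, here is a minimal sketch in Python (not Lua, and not any existing template). It assumes a simplified letter-to-IPA table for Esperanto, ignores stress, and uses a hypothetical `override` parameter for manually supplied pronunciations; all names are illustrative only.

```python
import unicodedata

# Simplified, assumed table: only the Esperanto letters whose IPA symbol
# differs from the letter itself. Stress is deliberately not marked.
ESPERANTO_IPA = {
    "c": "t͡s", "ĉ": "t͡ʃ", "g": "ɡ", "ĝ": "d͡ʒ",
    "ĥ": "x", "ĵ": "ʒ", "ŝ": "ʃ", "ŭ": "w",
}

def auto_ipa(word, override=None):
    """Return a slash-delimited transcription; a manual override always wins."""
    if override is not None:
        return override
    # Normalize so that letters typed with combining accents still match.
    letters = unicodedata.normalize("NFC", word.lower())
    return "/" + "".join(ESPERANTO_IPA.get(ch, ch) for ch in letters) + "/"

print(auto_ipa("ŝi"))                        # generated from the spelling
print(auto_ipa("wind", override="/wɪnd/"))   # irregular word, entered by hand
```

For a language like English the table lookup would be wrong far too often, which is why the sketch treats the override as the escape hatch rather than the exception.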
(@ Ivan) The main problem is that real language isn't so tidy: it's a constant, dynamic negotiation between idiolects, with the language as a whole being what's common between them- sort of. You're trying to pin down these ephemera to something that Lua can convert to symbols deterministically on a one-to-one basis.
The literature is full of unresolved debates about what the underlying phonemes are in any number of lexemes in any number of languages, but you're proposing that we base all our pronunciations on some arcane derivation from them. It might be interesting to apply your analysis to a few entries in an appendix, but applying it to millions of entries on the fly doesn't sound workable- especially since you're talking about dealing with separate phonemes as the basis for separate lects.
What do you propose to do about multiple sound changes which cause redistribution of phonemes, but whose boundaries don't coincide? Are we going to have separate lects for A and B and C and D vs. A and b and C and d vs. a and B and C and D, etc? Sure, there are standards such as Received Pronunciation and General American, but we're a descriptive dictionary, and those don't accurately describe pronunciation for an awful lot of people.
The only way we could make your proposal work would be to either create an intricate system of data structures/classifications for Lua to run off of, which would require all the work of the current system while eliminating almost all of the people who are actually doing the work, or to simplify everything into a very crude simulation with all the interesting stuff removed. I'm not buying it, either way. Chuck Entz (talk) 00:21, 15 December 2013 (UTC)
Input for languages like English which use non-phonological orthography could be a "shallow" phonetic IPA transcription, which would then be used to generate regional pronunciations, RP, GA and so on. It's supposed to automate what is already done by hand, by editors. It's all a matter of will and looking up sources to abstract away differences among different systems, and will probably involve inventing some special symbols/notation for some ambiguous cases. I'm convinced that such a system will be devised sooner or later, but by then there will be too much cleanup to do (like currently with automatically vs. manually provided transliterations). --Ivan Štambuk (talk) 01:47, 15 December 2013 (UTC)
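A rough sketch of what "one canonical input, several regional outputs" might look like, written in Python for illustration. The brace-delimited lexical-set placeholders (`{BATH}`, `{LOT}`, `{GOAT}`) are an invented meta-notation, and the two rule tables cover only a handful of well-known RP/GA correspondences; nothing here reflects an existing module.

```python
# Assumed per-accent realization rules for a few lexical-set placeholders.
RULES = {
    "RP": {"{BATH}": "ɑː", "{LOT}": "ɒ", "{GOAT}": "əʊ"},
    "GA": {"{BATH}": "æ", "{LOT}": "ɑ", "{GOAT}": "oʊ"},
}

def realize(tokens, accent):
    """Expand placeholders using one accent's table; other tokens pass through."""
    table = RULES[accent]
    return "/" + "".join(table.get(tok, tok) for tok in tokens) + "/"

def all_accents(tokens):
    """Generate every configured regional transcription from one input."""
    return {accent: realize(tokens, accent) for accent in RULES}

# Canonical input for "bath": consonants are literal, the vowel is a placeholder.
print(all_accents(["b", "{BATH}", "θ"]))
```

Adding an accent means adding one table, but a word whose variants do not follow any regular correspondence (the privacy case discussed above) would still need to be listed by hand.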
Automatic IPA generation is definitely worth considering for standard and common pronunciations of various languages, even if it may not work for all languages and all variants of pronunciation. I find what Ivan Štambuk did for Ukrainian and Kephir did for Polish quite interesting if not amazing. Not sure if the automatically generated Polish IPA is considered complete (I can't find it anymore).
If a language has irregular pronunciations, the IPA may be generated from the phonetic transliterations or respellings for Latin-based languages. Not sure if it's possible to generate/find a respelling for each English word but the efforts shouldn't stop if it's not possible to do it for English.
If it were not possible, technologies such as text-to-speech wouldn't exist. We're not even trying to reproduce the sounds, just the graphical representations.
Admittedly, it's not easy and it may cause errors. For this, test cases can be generated. (I don't have the skills myself for this job but I'm a keen observer). --Anatoli (обсудить/вклад) 01:33, 16 December 2013 (UTC)
It reminds me of the proposal to use Google Translate to generate audio for Latvian entries when recordings aren't available: is it worth having a decent pronunciation most of the time if you run the risk of bits of sheer nonsense popping up unpredictably here and there?
As for English: the only way to have it deal properly with cases like wind and slough is to come up with a detailed meta-notation (SAMPA, maybe?) so that Lua can convert it to IPA. You still have the problem of getting people to enter pronunciation, but now you've excluded those who know and use IPA well, but haven't learned the meta-notation. It won't solve the problem of people adding invalid pronunciation, either, because a good percentage will continue to enter what they've been entering all along no matter what alternative you provide them. Besides, if you train people to enter the new meta-notation, what's to keep them from screwing up the meta-notation just like they were screwing up the IPA before? How do you keep it from being just one more thing for people to get wrong? It looks to me like we're just adding another layer of complexity without much difference in the actual results.
I'm not saying we should never have automated generation of pronunciation when the orthography is reliably phonemic, but I doubt that's where we're having the problems Ivan wants to solve. It also would be nice to have a convenient way to enter IPA, but only as an option- it shouldn't be mandatory. We've had enough uproar over requiring "lang=" in a few templates- do we really want to force people to completely relearn everything they know about entering pronunciations? Chuck Entz (talk) 04:02, 16 December 2013 (UTC)
You've got a point but it's similar to transliteration modules. For some languages it works 100%; for others it doesn't, or you need to take care of exceptions. For many there's still work to be done, even if modules can potentially be created. Some editors got carried away and removed manual transliterations together with word stresses for Slavic languages. I think it can work for some languages but care should be taken not to allow sheer nonsense. --Anatoli (обсудить/вклад) 04:41, 16 December 2013 (UTC)
There is no risk - there are test cases, and all pronunciations are generated by an algorithm. There is no "unpredictability". For most languages there are publicly available comprehensive lists of words from which we can easily extract those that fit a particular purpose, to test various corner cases. For example, for Ukrainian I have a 30MB XML file which contains 260k lemmas which can be loaded and queried very easily. For major languages such as English and German there are lists of transcribed pronunciations that can be mined for patterns and module's output verified against.
It would be possible to use some form of "sufficiently detailed" IPA as the input. As well as SAMPA, spelled pronunciation, or whatever - as long as these forms are mutually convertible. Yes, people uncomfortable with making the switch (and willing to waste time by making needless keystrokes instead..) should continue to use the old manual notation, but even they will sooner or later realize the benefits that an automated approach provides, or other editors who specialize in pronunciations will convert it for them.
There are fewer opportunities for "screwing up the meta-notation" because of the well-defined input alphabet and the possibility to do sanity checking of the input string.
Yes it's another layer of complexities. But think of the benefits: 1) uniform transcription across all entries without misuse of notation and symbols (this discussion alone has demonstrated that many people don't even understand what phonemic transcription stands for) 2) extensible architecture that can generate additional pronunciations varied by region/period simply by encoding more rules 3) easier maintenance - instead of modifying thousands of entries a single module can be edited, should it be decided to change something (e.g. mark some allophone, change to different IPA symbol etc.) 4) better average quality - I'd rather trust somebody who has thoroughly consulted the available literature and devised such a system rather than a random editor that has skimmed Wikipedia article on the phonology of language X (which are usually riddled with errors and OR), and decided to provide transcriptions on the basis of such limited knowledge.
Yes it should be done first for languages which are written with phonemic/phonetic alphabets. But all should switch eventually, even English. --Ivan Štambuk (talk) 23:23, 16 December 2013 (UTC)
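The two safeguards mentioned above (sanity checking against a well-defined input alphabet, and test cases checked against a reference list) could be sketched roughly as follows. This is Python for illustration only; the allowed-symbol set is a small assumed subset, and `generate` stands for whatever pronunciation function is being tested.

```python
# Assumed input alphabet: plain letters plus a few IPA symbols, written as
# escapes where they are easily confused with ASCII look-alikes.
ALLOWED = set("abdefhijklmnoprstuvwzæðŋɑɒɔəɛɪʃʊʌʒθ.") | {
    "\u0261",  # IPA ɡ, distinct from ASCII g
    "\u02c8",  # primary stress mark
    "\u02cc",  # secondary stress mark
    "\u02d0",  # length mark
}

def invalid_symbols(transcription):
    """Return the characters that fall outside the agreed input alphabet."""
    return sorted(set(transcription) - ALLOWED)

def run_test_cases(generate, reference):
    """Compare generated output with reference transcriptions; list mismatches."""
    failures = []
    for word, expected in reference.items():
        actual = generate(word)
        if actual != expected:
            failures.append((word, expected, actual))
    return failures

# An ASCII "g" typed where the IPA symbol was intended gets flagged.
print(invalid_symbols("ɪg\u02c8zæktli"))
```

A check like this only catches symbols that are not in the alphabet; a transcription that is well-formed but simply wrong would still have to be caught by the reference comparison or by a human.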
Re: "this discussion alone has demonstrated that many people don't even understand what phonemic transcription stands for": Well, it's demonstrated that you don't; but how can you be so sure that "many people" share your confusions? —RuakhTALK 23:41, 16 December 2013 (UTC)
No, it's genuine cluelessness. You don't even know what accent means. --Ivan Štambuk (talk) 23:59, 16 December 2013 (UTC)
Maybe let’s write less about one another, and more about the topic. Michael Z. 2013-12-17 00:36 z
Ivan seems to have gone ahead with this idea despite the overwhelming opposition in this discussion. See {{uk-pron}}. He's added this to some entries, replacing the existing pronunciations there even though I'm not at all pleased with the ones the template generates. Look at diff for example. It was better before the change. —CodeCat 01:56, 16 December 2013 (UTC)
This particular word is incorrect. I hope he notices and fixes. It's work in progress, IMHO. Ivan seems to be a responsible editor, most of the time. Having occasional and fixable errors is probably better than neglecting Ukrainian entries altogether. I agree that he should seek approval, though. Was the opposition specific to Ukrainian? It may work for certain languages. --Anatoli (обсудить/вклад) 02:04, 16 December 2013 (UTC)


Second thoughts. The current IPA for вухо may be even more correct, Ivan is using some good resources. It needs more checking. See also w:Ukrainian phonology. --Anatoli (обсудить/вклад) 02:28, 16 December 2013 (UTC)
What exactly is the error? --Ivan Štambuk (talk) 02:24, 16 December 2013 (UTC)
I'm not sure any more. I'm used to Russian influenced Ukrainian. --Anatoli (обсудить/вклад) 02:28, 16 December 2013 (UTC)
Ivan, your module generates correct standard Ukrainian pronunciation, as it seems! I'll keep an eye on it every now and again. You seem to have better resources than I do, though. --Anatoli (обсудить/вклад) 02:32, 16 December 2013 (UTC)
Btw, {{vi-pron}} also exists (see, for example, use in yêu nhiều thì ốm, ôm nhiều thì yếu or nhiễm sắc thể). Wyang (talk) 03:41, 16 December 2013 (UTC)
I notice that parameters are basically word letters. With a bit of Lua the template could be used without any parameters. It could be added by bot on all of the vi entries. --Ivan Štambuk (talk) 03:52, 16 December 2013 (UTC)
Yes. Was too lazy to change it. {{vi new}} is able to generate the {{vi-pron}} code for new Vietnamese entries. Wyang (talk) 03:54, 16 December 2013 (UTC)
@Wyang I was going to mention your work but you beat me :). I am a fan of your template {{cmn new}} for Mandarin pronunciation generated from toned pinyin, which I and two other users use, see 纯洁 for example. It doesn't cater for erhua and has a limitation on the number of syllables but that's okay, you should move it to Lua. --Anatoli (обсудить/вклад) 03:53, 16 December 2013 (UTC)
At the Vietnamese Wiktionary, all Vietnamese entries use vi:Bản mẫu:vie-pron, which is just a front end for vi:Mô đun:ViePron. It requires no parameters and displays pronunciations for six dialects: Hanoi, Hue, Saigon, Vinh, Thanh Chương, and Hà Tĩnh. (The former three are the major dialects; the latter three also happen to have unique tone systems. A few additional dialects are disabled due to insufficient documentation.) The module isn't perfect, but I can port it over to this wiki if there's interest. – Minh Nguyễn (talk, contribs) 07:22, 3 January 2014 (UTC)

There is also {{ka-IPA}} for Georgian, made by User:Dixtosa. Standard Eastern Armenian too has an almost regular pronunciation and I would like to switch to Ivan's system if it allows rare exceptions to be handled manually. --Vahag (talk) 15:18, 16 December 2013 (UTC)

Appendix:Glossary of collective nouns by subject

Maybe this should go under RFD, but I think this is a collection of unattested rubbish made up by smart-alecs; probably copyvio'd from somewhere like here. Some of the terms are completely unattested in real language: I could find no non-collective-noun-guide cites for a nobility of beasts and an entrance of actresses. Of those that are attested, a large portion are not real collective terms; they have specific meanings and it would be silly to use them to denote general collectives. E.g. quite apart from the fact that I couldn't find much attestation for 'alpha computer' as a common noun (some pretty firm evidence of copyvio from somewhere), cluster has a specific meaning (Wikipedia: Computer cluster). Same goes for a cache of ammunition, a choir of angels, a quiver of arrows, a belt of asteroids, a bushel of apples, a culture of bacteria. While some of the animal collective terms may be worth keeping if they can be attested, I would advocate deleting most of this.

Hyarmendacil (talk) 08:49, 15 December 2013 (UTC)

German possessives

Hmm. I've just come across the word Bettendorfschen and, as far as I can make out, it is a possessive meaning "Bettendorf's".

  • Does it meet our CfI? (I would think "yes" (all words in all languages))
  • Is it a noun, proper noun or adjective?
  • Should I define it as "# Bettendorf (attributive)"? (the "Bettendorf test" is a simple test for arsenic)
  • Do they inflect? (perhaps I ought to look at a German grammar) SemperBlotto (talk) 10:29, 16 December 2013 (UTC)
  • It's not a diminutive form is it? — Saltmarsh (talk) 11:15, 16 December 2013 (UTC)
It's a noun created by nominalising the adjective Bettendorfsch, and it's inflected like the adjective is, so the nominative singular would be Bettendorfscher. I wouldn't call it a possessive, more like general association with something. After all, Bettendorf doesn't "own" anything. German has another suffix -er with the same meaning, which is found in for example Hamburger, but I think that suffix doesn't inflect. English has cognates to both of these suffixes, -ish (English) and -er (Londoner). —CodeCat 14:37, 16 December 2013 (UTC)
  • OK. I've added the base adjective at Bettendorfsch. The bot, in the course of time, will generate Bettendorfschen as an adjective form. Perhaps I ought to then add some sort of note to it. SemperBlotto (talk) 15:11, 16 December 2013 (UTC)

Durably archiving the web

This is a two part question:

  1. If we find a durable archive for the web, should we add extra rules for how to determine whether a mistake was likely? Due to the size of the web, should we also require that, for example, one book source is equivalent to three web sources?
  2. Should we allow as durable archive for the web?

--WikiTiki89 18:55, 17 December 2013 (UTC)

Frankly, the whole issue is a mess as de facto we use it to mean published works and Usenet. I'd rather remove the durably archived bit altogether because of the possibility of complete nonsense being durably archived. But my idea isn't popular. We debate this every few months with no consensus. Mglovesfun (talk) 12:51, 18 December 2013 (UTC)
We use it to mean "published works and Usenet" because so far published works and Usenet are the only things allowed. As far as I can see, Usenet holds no special place on the web other than the fact that we have decided that it is durably archived. Usenet is just as likely to have junk as any other web source. Therefore as long as we come up with rules on how to determine whether a source is valid or complete junk, there is no reason we can't use citations from the rest of the web. Things such as how independent particular sources are will have to be part of those rules. The advantage, though, will be that we will get better resources for the latest slang. --WikiTiki89 15:09, 18 December 2013 (UTC)
A characteristic of published works and Usenet is that once published, their matter is fixed and unchangeable.
Web pages and apps are also published works, but potentially ever-changing and ephemeral. Their very form and content may depend on your browser’s user-agent string, your logged-in status, whether you are Google, the time of day, or the weather. It is debatable whether they can be archived at all. Michael Z. 2013-12-18 15:46 z
My first question assumes we already have a durable archive. So you must be responding to the second question. Do you not agree that a web page archived by is durable (durable implies lasting and unchanging)? --WikiTiki89 15:52, 18 December 2013 (UTC)
I don’t agree that it necessarily represents the original web page or app. Michael Z. 2013-12-18 16:18 z
Firstly, no one's said anything about "apps". Secondly, we don't need to represent the original web page, we only need the text. --WikiTiki89 16:40, 18 December 2013 (UTC)
Firstly, no one's said anything about "apps". – There is no clear distinction between websites and apps. Wiktionary, for example, is both. So is every blog and major news site, Facebook and Twitter.
Print publications and Usenet news posts are durably archived in their original form, as books on bookshelves, or as text-only message headers and bodies – anyone can verify the text by referring to the original item. The original form of a website potentially comprises HTML, CSS, JavaScript, interactive code on the server, and page state. Is a Facebook page that only shows its text to logged-in members, or only to the author’s “friends” a published source? Even Wiktionary has in the past had things like CSS-based bracket or apostrophe choosers that make “only the text” a relative quantity (did we get rid of that yet?).
“we only need the text” – Yeah, we can scrape text from web pages. That comprises a quotation, not an archive of the original item. We need to understand how it is fundamentally different from the durable archives that we have accepted to date. Michael Z. 2013-12-18 20:38 z
You're talking a lot of nonsense. A web page is a page on the web. A web app is an application with a web interface, which usually uses web pages, and which is completely irrelevant to the topic here. If Usenet is available in print, that's news to me (of course, I myself don't know a whole lot about Usenet). The original form of a website is also irrelevant here because the form of the website that we will be using is the archived form. Essentially all the problems you just mentioned are solved by completely ignoring the original site and only looking at and referring to the archive. Of course, we have not yet had any consensus as to what counts as a durable archive for the web, so all of this is theoretical, but we can use the as an example. --WikiTiki89 20:50, 18 December 2013 (UTC)
completely ignoring the original site – Like I said, your proposal relies on a fundamentally different definition of “archived,” relying on a transformed copy rather than the original material. Michael Z. 2013-12-18 21:08 z
And your point is? The transformed copy should be true enough to the original in terms of text, and if not, we won't use it. --WikiTiki89 21:11, 18 December 2013 (UTC)
Uh huh. What we consider a durable archive is the archived original, making it verifiable. You want to change that. Michael Z. 2013-12-18 21:22 z
The percentage of archivable web pages whose textual content would be heavily modified by an archive is tiny. --WikiTiki89 21:26, 18 December 2013 (UTC)
Verifiability is about certainty, not percentages. Once the original is gone, the chance of determining whether a particular web page was heavily modified or not will be zero. Michael Z. 2013-12-18 21:31 z
In that case nothing is durably archived. --WikiTiki89 21:32, 18 December 2013 (UTC)
I can open a book and verify with certainty that a quotation exists in it. You are proposing that we use a website copied to instead of an original web page. If you can’t acknowledge that there are several fundamental differences, then we can’t even discuss the ramifications of your proposal. Michael Z. 2013-12-18 23:19 z
is not durable. People can add a robots.txt to their page to have the archived pages removed. — Ungoliant (falai) 16:33, 18 December 2013 (UTC)
Does that remove old archives or only prevent new archives? --WikiTiki89 16:40, 18 December 2013 (UTC)
It removes the old pages. — Ungoliant (falai) 16:54, 18 December 2013 (UTC)
Perhaps we can maintain a list of websites that are very unlikely to be removed from the archive. For example, government websites, websites of large organizations, etc. --WikiTiki89 17:09, 18 December 2013 (UTC)
The British Conservative Party just deleted their archive of speeches from the Internet Archive; news article. I'm skeptical of any such list.--Prosfilaes (talk) 21:15, 18 December 2013 (UTC)
A political party is not the same thing as a government (unless it's a one-party system). --WikiTiki89 21:19, 18 December 2013 (UTC)
They're certainly a large organization. And I completely fail to see why we expect a government not to do the exact same thing; a new president gets elected, and gets a bunch of pages wiped and robots.txt amended to tell search engines they can forget all about ever seeing them again.--Prosfilaes (talk) 00:00, 19 December 2013 (UTC)
Not likely. I don't think the President has the power to singlehandedly wipe archives of And I didn't mean to imply that we should automatically trust all large organizations, but that we should consider the websites of large organizations. A political party's website is hardly going to be considered durable. --WikiTiki89 00:12, 19 December 2013 (UTC)
Most US government publications are public domain. Many of them end up recopied and archived. has a very permissive copyright statement.
This is a special case. Most other governments retain copyright, and there’s no special reason that their publications should remain available longer than any other website. I have noticed that Canadian federal, provincial, and municipal government websites rack up hundreds of dead links every few years when they are redesigned. Michael Z. 2013-12-19 15:21 z
Just like I said about large organizations, I did not mean to imply that we should automatically trust all governments' websites. --WikiTiki89 15:36, 19 December 2013 (UTC)
You don't think the president of the United States of America has the right to control the webpage that's basically his face to the world? Seriously? If we can't trust all governments' websites, or the websites of large organizations, your plan is pointless. It boils down to we should argue over every case as to whether or not it's durable. I don't know a single large organization or government website that I consider durable; they all move things around all the time. They pretty much all delete webpages on dead projects, on things they no longer consider reliable.--Prosfilaes (talk) 20:44, 20 December 2013 (UTC)
Deleting and rearranging does not affect the archives. Only specifically requesting the archives to be deleted through robots.txt will do that. --WikiTiki89 21:00, 20 December 2013 (UTC)
Which any website can do at any time. So is not a reliable durable archive of websites. Is there an exception for material released into the public domain or under a free licence, or is robots.txt the ultimate delete switch? Michael Z. 2013-12-22 16:27 z
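For readers unfamiliar with the mechanism: a robots.txt rule aimed at a crawler is only a couple of lines, which is why it is available to any site owner. The sketch below uses Python's standard `urllib.robotparser` to show how such a rule is evaluated. The crawler name `ExampleArchiveBot` and the example.org URL are hypothetical, and whether an archive also removes existing snapshots when it sees such a rule is a matter of that archive's policy, which the code cannot show.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out one (made-up) archive crawler.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /
"""

def may_archive(robots_txt, user_agent, url):
    """Return True if the given crawler is permitted to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "https://example.org/speeches/"
print(may_archive(ROBOTS_TXT, "ExampleArchiveBot", url))  # blocked crawler
print(may_archive(ROBOTS_TXT, "OtherBot", url))           # unaffected crawler
```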
  • Google Books can be blocked by ISPs or outlawed at any time, and 99% of Wiktionary's "durably archived" sources become practically unverifiable. Usenet is no different than any other web site - the only difference is that it doesn't have a single point of failure, and its distributed nature makes it theoretically less unreliable. We do have a way for 100% reliable durably archived storage - it's called Wiktionary citations namespace. The only thing needed is 1) find a way to verify citations from the Internet, should the source cease to exist (e.g. one person copies them, another verifies it) 2) agree on theoretical minimum of citations needed to ensure that cited sources represent actual speech, and were not messed with, e.g. by increasing the minimum of different sources and the time span among them to a significantly higher limit than printed sources. This whole printed sources bias is a vestige of traditional lexicography which is inapplicable in modern world where 99% of written communication is not done on paper anymore. --Ivan Štambuk (talk) 19:56, 18 December 2013 (UTC)
    Well said! --WikiTiki89 20:01, 18 December 2013 (UTC)
    Google Books being blocked or outlawed won’t cause every copy of the books it displays to suddenly burn to ashes. — Ungoliant (falai) 20:05, 18 December 2013 (UTC)
    But it would make rarer books be completely inaccessible to 99.9999999% of the world. --WikiTiki89 20:10, 18 December 2013 (UTC)
    OTOH one could easily make up citations for some obscure word, supposedly from some rare and inaccessible books which are not available on BGC or anywhere else on the Internet, and it would still pass according to your interpretation of durably archived. I'd go as far as forbidding making citations from anything not available over the Internet, or in digital form (e.g. e-books), unless published in e.g. last 50-70 years or some other similar period after which copyright of most works expires. (because it then stands a chance of being obtainable in practical terms). --Ivan Štambuk (talk) 20:31, 18 December 2013 (UTC)
You folks know about interlibrary loan?
I find myself feeling less secure about the verifiability of a source in Google Books that looks like it comes from a self-publishing service, has no page numbers, and probably went right from MS Word into PDF. Michael Z. 2013-12-18 20:47 z
Indeed. Google Books itself is not a permanent archival service, and not everything on Google Books is actually durably archived. Google Books is a great tool for finding quotations that are durably archived, but it can't be trusted blindly. —RuakhTALK 21:24, 18 December 2013 (UTC)
Agreed. I had thought of G Books as evidence that something is durably archived. But maybe that is only true of scanned print books, and not materials from digital workflows. This is counterintuitive, since digital materials tend to be the best reproduced, and it will be tempting to rely on them as their proportion of new works grows. Michael Z. 2013-12-18 21:49 z
  • Accepting citations from ephemera doesn't support the building of a long-term dictionary. And what we copy into Wiktionary citations namespace is often not enough to clarify exactly what a word means.--Prosfilaes (talk) 21:18, 18 December 2013 (UTC)
    Paper books are ephemera. Digital form is the only form that is permanent. It's ironic that what is argued as "durably archived" are in fact digital versions of paper books. Furthermore, paper has the potential to become illegal in the future due to its immoral and unecological nature. Nothing prevents us from having 1000 citations per meaning if we want to. It's all a matter of agreeing on what constitutes sufficient attestation. --Ivan Štambuk (talk) 22:04, 18 December 2013 (UTC)
What a strange, sweeping statement. Digital may not be "permanent"; formats can become almost unreadable (see e.g. BBC Domesday Project), and things like CDs, DVDs, and servers don't last forever any more than books do. Equinox 22:09, 18 December 2013 (UTC)
With all due respect, I highly doubt paper will become illegal. Digital storage is no more permanent than paper (in fact it is significantly less so). It is however more accessible and easier to duplicate, and therefore a more feasible durable system can exist on top of digital storage than can exist on paper (see w:Cloud storage). --WikiTiki89 22:16, 18 December 2013 (UTC)
Original, permanent books are durably archived on library shelves. Google Books merely serves as convenient evidence of this fact, and only for some of them. Michael Z. 2013-12-18 23:04 z
  • Digital or paper, a lot of stuff is ephemera;, whether the digital form or the book of ads they distribute every month, is ephemera. However, anything in print has a chance to survive, since all it takes is one owner to toss it in a pile instead of the trash can. Digital stuff is not distributed to people and is a lot more likely to completely disappear when the author stops making a copy. If you want a used book, Amazon can hook you up; you want a Kindle book that's no longer in print, you're pretty much out of luck.--Prosfilaes (talk) 00:00, 19 December 2013 (UTC)
The WebCite vote had a decent amount of support. With some changes (like a clause to prevent rubbish from being considered valid) it might pass. — Ungoliant (falai) 22:51, 18 December 2013 (UTC)
From what I can see, the issue is really whether sources have been tampered with. Because we're a wiki, anyone could insert a cite and call it valid, so we require the durable archiving so that anyone has the ability to verify the citation from the original source. There is another way we can work around this, though. We can "lock down" the cite once it has been peer reviewed, which would prevent anyone from messing with it. That way it would become harder for anyone to change it. I don't know if MediaWiki itself provides strong enough protection for this. Maybe a cryptographic hash would be a good way to secure it. —CodeCat 23:20, 18 December 2013 (UTC)
I was thinking the same thing. I doubt that MediaWiki would already have such a feature. A cryptographic hash would not help unless we can "lock down" the cryptographic hash, at which point we might as well lock down the quote itself. Without a lockdown feature, a cryptographic hash would only be able to tell us that it was tampered with accidentally (or by someone who doesn't know about the hash). --WikiTiki89 23:31, 18 December 2013 (UTC)
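The point about hashes made above can be sketched in a few lines of Python. This is only an illustration, not a description of any MediaWiki feature: the function names and the sample quotation are invented, and SHA-256 is an assumed choice of hash. It shows that a stored digest catches an accidental change, but not a change by someone who also rewrites the digest.

```python
import hashlib


def quote_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a quotation (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_unmodified(text: str, recorded_digest: str) -> bool:
    """True if the text still matches the digest recorded at review time."""
    return quote_digest(text) == recorded_digest


# Hypothetical quotation, as checked by a reviewer.
reviewed = "An example quotation, as checked by a reviewer."
recorded = quote_digest(reviewed)  # imagined as stored next to the cite

# A careless edit is caught, because the digest no longer matches.
careless_edit = reviewed.replace("checked", "chekced")
careless_edit_detected = not is_unmodified(careless_edit, recorded)

# A deliberate edit that also replaces the stored digest is not caught.
forged_text = reviewed.replace("reviewer", "vandal")
forged_digest = quote_digest(forged_text)
forgery_detected = not is_unmodified(forged_text, forged_digest)
```

Here careless_edit_detected comes out true and forgery_detected comes out false, which is the argument above: the digest only helps if it is itself kept somewhere that cannot be edited.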
In the same way that a balanced corpus must draw from diverse sources, a dictionary must cite diverse sources to accurately reflect the real use of language. The question should be what methods we trust most to archive them durably. Using WebCite sounds reasonable, and is very good. So yes to question number two. Locking down a cite brings up a good point because we also have to trust ourselves to preserve the cite once it's made. I'm sure we've all observed cases of anonymous users who take it upon themselves to "fix" a quote. Maybe it's done in good faith but it's a persistent strain of vandalism.
I try to avoid most of the issues with sources by using Wikisource as much as possible. Haplogy () 00:08, 19 December 2013 (UTC)
  • FWIW, I raised last year the possibility of allowing online dictionaries of LDLs to be cited as references if a number of reviewers (probably admins) "signed on an entry's talk page that they had found the entry in the dictionary [ that] if the dictionary later went offline or removed the entry, we'd have the record that it had nonetheless had the entry at the time it had been used as a reference." As precedent, I pointed out that our sister project "Wikimedia Commons copies appropriately-licensed pictures from Flickr, and is able to keep them even if a Flickr user subsequently changes a picture's license, because Commons has a record which notes that the picture had a valid license at the time it was copied." In principle, the same kind of system could allow reviewers/admins to sign off that a particular quotation using a word was an accurate quotation of a webpage. - -sche (discuss) 00:53, 19 December 2013 (UTC)
  • If we really want to grab arbitrary web sources, why don't we grab the low hanging fruit? is about as reliable a link as any we'll find on the net; if Wikipedia disappears, we'll probably go first, and Wikimedia doesn't have a habit of rearranging pages at that level. Besides which, tarballs of Wikipedia are available, so it can be archived by someone.--Prosfilaes (talk) 20:47, 20 December 2013 (UTC)
  • Excellent point. There may be reasons not to allow Wikipedia articles, but durable archival is probably not a big concern. I think the oldid links work even if the page is moved; the only time they would break is if the page or revision is actually deleted. —RuakhTALK 21:43, 20 December 2013 (UTC)
Any transclusions on the page would not necessarily reflect their content as of the archive date. Michael Z. 2013-12-21 18:08 z

accelerated creation of Mandarin pinyin and Japanese romaji

This may belong on Wiktionary:News for editors. I'm not sure. Here it is: editors of Mandarin entries can now use accelerated entry creation to create Mandarin pinyin entries, and editors of Japanese entries can do it with romaji. Let me know if there are any problems, requests, etc. Haplogy () 03:58, 18 December 2013 (UTC)

Great job, Hap! Users should be reminded of the limited structure of romaji and pinyin entries.
I'm eager to have Russian inflected forms created through the accelerator, e.g. genitive singular, nominative plural forms for nouns, comparative forms for adverbs, etc. --Anatoli (обсудить/вклад) 04:14, 18 December 2013 (UTC)

How useful is Category:Alternative forms by language?

How useful are these categories, really? Entries are placed there by {{alternative form of}}, but we only use that template to avoid having to duplicate the whole entry. There's nothing lexically significant about being an "alternative form", because the two forms are equal and interchangeable. Which one we at Wiktionary happen to call the alternative one is arbitrary, and doesn't really mean much to our users; we could have chosen the other one just as easily. I think these categories might be better off being deleted. —CodeCat 21:38, 18 December 2013 (UTC)

Japanese appears to be one of the top foreign members along with Latin, Old French, Finnish, and maybe others. I usually use it for the less common form of a term, and typically the less common forms are far less common. I don't think they're interchangeable in the case of Japanese because authors will often use an alternative form to give a term a deeper nuance or a more intellectual flavor the same way that authors of English works will choose a less common or older term, typically from Latin, for the same effect. The readings of literary words are the same as those of conversational words in these cases but the characters are different, sort of like writing "aquiline" for "eaglelike". Sometimes the nuance is just vaguely "deeper" but sometimes there are more specific differences, and saying "alternative form" is kind of a placeholder in an entry until somebody can write detailed definitions for every rare literary term. Maybe I've been using the wrong format for these terms. Is there a more appropriate way? Haplogy () 03:06, 20 December 2013 (UTC)
Relative uncommonness isn't guaranteed for alternative forms, and there shouldn't be a major difference. If there is a big difference, then "rare" is more appropriate. Alternative forms should be more or less on equal terms, not have significant differences in nuance (other than dialect), be essentially interchangeable, and should be etymologically related (it's an alternative form of the same lemma, not a different lexeme altogether). See Wiktionary:Forms and spellings#Alternative forms. So if you're using the label to find uncommon forms, you're actually not using it for what it really means. —CodeCat 03:24, 20 December 2013 (UTC)
I think Japanese has more alternatives than other languages, in extreme cases - various kanji + different okurigana (hiragana attached to the end) + hiragana + katakana.
In Russian alternative forms include words with Cyrillic "е" (common) spelled as "ё" (dictionary, encyclopaedic and strict style) (the reverse is also possible - свекла/свёкла), rare, less common or older terms (but not misspellings), forms that are used in different positions - ли/ль, бы/б, terms that can be spelled with a space, hyphen or solid (same as English). Yes, basically having alternative terms saves duplications.
Mandarin erhua terms are added to categories; they are just different forms of writing and pronouncing, so there is no need to duplicate. See Category:Mandarin erhua terms. I don't see why alternative forms like 台灣 or 臺灣 (both traditional) couldn't be included there.
Arabic alternative forms include variations in writing styles - with or without hamza, dotted/dotless yāʾ, tāʾ marbūṭa/hāʾ.
I think the category is useful but it could be further subdivided. --Anatoli (обсудить/вклад) 05:04, 20 December 2013 (UTC)
But the way you describe it, each language has its own distinct types of alternative, and that is too specific to put it in a general category. —CodeCat 05:14, 20 December 2013 (UTC)
I could still describe most of them as 1) alternative spellings (with no effect on pronunciation), btw. Mandarin erhua are more specific with pronunciation, e.g. 哥们儿 and 哥们 are pronounced the same way by northern Mandarin speakers. 2) positional (not sure if it's the right word), e.g. Ukrainian вчитель vs учитель, Belarusian ўзяць vs узяць, Russian ли/ль, бы/б, -ся/-сь. The specifics are obviously possible and often done on individual entries. The styles could be regional, strict, colloquial, etc. --Anatoli (обсудить/вклад) 05:27, 20 December 2013 (UTC)
Something systematic like в/ў and у, or ё and е, could easily be put in its own category. For example Category:Russian spellings with е instead of ё. That's actually much more useful than the vague "alternative forms" category. —CodeCat 16:50, 20 December 2013 (UTC)
  • In cases where an alternative form is equal to another entry, selectively including one of them in that category is really confusing for the native language speaker also. Even in cases where one form is used idiomatically or rarely or for special purposes. If the alternative form is a special case in that language (e.g. Esperanto "x-systemo" or Russian "е" - "ё") then a new parameter could add that entry to a specific "alternative form" category. --Xoristzatziki (talk) 05:55, 20 December 2013 (UTC)
  • I agree that these categories are not lexically meaningful, and therefore not useful to readers. They should either be hidden (if they're useful to editors) or deleted (if not). —RuakhTALK 06:14, 20 December 2013 (UTC)
It's never been useful to me, at least. Haplogy () 06:19, 20 December 2013 (UTC)
  • They are not chosen arbitrarily. (Well, perhaps for English, such as British/American spellings, which shouldn't be alternative forms in the first place but equally treated words, but that's another topic altogether.) I want to be able to see a list of entries being obsolete spellings, dialectal/regional/substandard forms, abbreviations, terms of endearment, belonging to a particular register, orthographical standard, script and so on. Furthermore, sometimes there isn't a standard on which particular form to lemmatize, and with all forms being a part of the same paradigm (in an extended sense of that term, but still sharing the same definitions) it makes sense to choose one as the base lemma, and soft-redirect others to it. This is the case with many ancient languages when rules of the grammar were not fixed, and words could freely undergo several inflectional patterns (again, still meaning the same thing). The term used as a base lemma should be the one most widely used. For pluricentric languages however, there shouldn't be any redirection and all variant forms should be full-blown entries (which mostly seems to be the practice, apart from English). --Ivan Štambuk (talk) 06:27, 20 December 2013 (UTC)
  • Don't worry, no one is suggesting eliminating the categories for obsolete forms, regionalisms, colloquialisms, etc. —RuakhTALK 06:30, 20 December 2013 (UTC)
    Obsolete forms is a subcategory of alternative forms. CodeCat is suggesting that we eliminate the entire hierarchy. There shouldn't be any entries inside "Category:X alternative forms", only categories. {{alternative form of}} should be phased out in favor of specific templates that indicate what kind of alternative form we're dealing with. But it's OK to have such catch-all template for initial edits, when editors do not know or do not want to bother with why exactly is some alternative form, an alternative form. Many of the entries inside those categories there are not even alternative forms, but have context labels such as (uncommon) that wrongly categorize them as such ("rare forms" instead of "terms with rare senses"). Context labels are meaning-bound, and meanings marked as colloquialisms, regionalisms, obsolete etc. categorize differently and do not appear and should not appear inside the alternative forms hierarchy. Full definitions are needed only for those entries that do not have standard/modern/"default" modern-day equivalent (in spelling, script, orthography etc.). I don't see a reason why alternative forms shouldn't categorize by their criteria of redirection. --Ivan Štambuk (talk) 07:02, 20 December 2013 (UTC)
  • Re: "CodeCat is suggesting that we eliminate the entire hierarchy": No she's not. She's suggesting we eliminate the 'Alternative forms' categories. She's said nothing about other categories that happen to be nested within the 'Alternative forms' categories. —RuakhTALK 07:09, 20 December 2013 (UTC)
    And where exactly would alternative-form categories by criteria such as obsolescence, spelling and so on be categorized? Perhaps in a category called...alternative forms? --Ivan Štambuk (talk) 08:09, 20 December 2013 (UTC)
    Convert to parent category only. Sure, why not? Mglovesfun (talk) 11:23, 20 December 2013 (UTC)
  • Category:Obsolete forms by language (for example) is already in Category:All lexicons, just as Category:Alternative forms by language is. —RuakhTALK 16:07, 20 December 2013 (UTC)
    Well that's a circular categorization that needs to be eliminated. (Someone should map all category hierarchies into a graph and search for cycles.) But in general, I agree that alternative forms should be phased out in favor of more specific soft-redirection templates. But these mostly do not exist, and alternative form serves as a general catch-all redirection template. Subcats should be moved to lexicons, and the remaining tens of thousands of entries categorized. --Ivan Štambuk (talk) 05:15, 23 December 2013 (UTC)
The whole point is, do these need to be categorized? Certainly they should be hidden categories, but not categorizing at all is fine too. Special:WhatLinksHere can tell us what entries use the templates in question. Using AWB this sort of request takes a few seconds, so doing it for multiple templates ({{alternative spelling of}}, {{alternative capitalization of}} for example) still only takes 30 seconds or less. Mglovesfun (talk) 11:50, 23 December 2013 (UTC)
I've made the change to {{alternative form of}}, and to the two templates Mglovesfun mentioned. They don't categorise anymore, so all the direct subcategories of Category:Alternative forms by language should become empty eventually. —CodeCat 22:42, 12 January 2014 (UTC)
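The suggestion earlier in this thread to map category hierarchies into a graph and search for cycles could look roughly like the sketch below. It is a plain depth-first search, not an existing Wiktionary tool; the function name and both toy hierarchies are invented for illustration, and real data would have to come from a dump or the API.

```python
from typing import Dict, List, Optional


def find_cycle(parents: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of category names, or None if there is none.

    `parents` maps each category to the categories it is placed in.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for parent in parents.get(node, []):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                # Reached a category already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if colour == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(parents):
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Invented toy data: one category with two parents (no cycle) ...
tree_like = {
    "Obsolete forms": ["Alternative forms", "All lexicons"],
    "Alternative forms": ["All lexicons"],
    "All lexicons": [],
}
# ... and a genuinely circular hierarchy.
looped = {"A": ["B"], "B": ["C"], "C": ["A"]}

no_cycle = find_cycle(tree_like)
one_cycle = find_cycle(looped)
```

The recursion is fine for a toy example; for a full category tree an iterative version would be safer, since Python limits recursion depth.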

Printable content

I think it would be nice to have some general guidelines for how our pages should look when printing (Tools->Printable Version), similar to w:Help:Printable. I think we have a few more concerns than enwiki as we have more dynamic content. Some questions are:

  1. Should dynamically hidden "main" content (quotations, tabbed languages) be shown, hidden, or left alone by default. Currently, in the printable version users can selectively show/hide regions. What about stuff like "Choose your target language" in translation bars? What about audio templates, videos, or image alt-text?
  2. Should any navigation templates be shown? Most are not, but {{also}} is printed. What about internal reference/appendix links such as the IPA "key" and glossary links?
  3. I think we're currently hiding all editing functionality, which I agree with.
  4. Should we change any other stylings, like background colors in tables and dropdown boxes? Do we want them to be blank backgrounds? Do we use font colors anywhere? For pages that are actually printed can we customize for whether they are color or bw?

Since I believe few people print their pages this is a low priority, but we can start the discussion and then find a home for it (maybe expand the line at WT:CSS or a new page). --Bequw τ 16:22, 20 December 2013 (UTC)

A feature I'd like to see is to generate per-language printable content in PDF of some specific list of lemmas, or all lemmas in selected language, category or group of categories, with bookmarks for entries generated and grouped under respective letters of the alphabet. (i.e. acting like an index). Without translations, categories and inflection tables, but with citations and usage examples, and expanded tables for derived terms and similar. And wikilinks for existing entries hyperlinked in PDF as well. --Ivan Štambuk (talk) 04:48, 23 December 2013 (UTC)

Entries that mix form-of definitions and lemmas

Some entries have a proper definition alongside a form-of definition. For example, ackers, which is a plural form entry that also has a definition. I asked a question about this at WT:ID, but as I've been working more with this I'm starting to think this is a bad idea. Someone who knows enough about the language's grammar will automatically be able to "undo" the inflection, but they may not realise that certain senses are restricted to certain forms of the word. So they will never think to look up the inflected form, they will go straight to the lemma form, and will therefore miss what they are actually looking for. For example, an English speaker wouldn't go to ackers if they saw the word, they'd go to acker because the singular is where the lemma is expected to be. It also leads to duplication of information when people aren't aware of this. For example, dozen has a definition labelled "as plural only", but dozens had a very similar definition listed before I removed it (because it was duplicated). I definitely think that putting all definitions at the lemma form, with labels like "in the plural", is something we should properly adopt as common practice/policy. This means that an entry can have either a form-of definition, or a real lemma-like definition, but they can't be mixed. —CodeCat 18:40, 20 December 2013 (UTC)

Then what would we do with plural-only entries, since a singular may be added in the future? --WikiTiki89 18:49, 20 December 2013 (UTC)
Then the plural-only entry becomes the lemma, and the singular links to it instead. —CodeCat 18:51, 20 December 2013 (UTC)
Hm, that's reasonable. Template:singular of does exist, though it has seen surprisingly little use. - -sche (discuss) 20:53, 20 December 2013 (UTC)
I agree with this, and have suggested it in the past on an ad hoc basis. I'd be glad to see it made official and added to language considerations pages (e.g. Wiktionary:About English). —RuakhTALK 21:57, 20 December 2013 (UTC)
Like Ruakh, I agree with this and have advocated it in the past. In cases where it hasn't seemed acceptable to centralise content, I've added Usage notes to make sure viewers of the singular entries were apprised of the plural entries (see e.g. [[message]], [[messages]] and [[pontifical]], [[pontificals]]). - -sche (discuss) 22:31, 20 December 2013 (UTC)
I don't see why plural-only or usually plural-only senses shouldn't be defined in plural entries. Singular senses should instead use {{singular of}} when plural form has much more usage. They are different words and should be treated separately. --Ivan Štambuk (talk) 04:39, 23 December 2013 (UTC)
I prefer to keep all senses at the singular (which people are more likely to look up) and use the "mostly plural" gloss if needed. Equinox 04:43, 23 December 2013 (UTC)
I don't have problems with putting definitions on plural entries, as long as all the definitions are there. We shouldn't spread definitions for the same lemma across different forms, that's just confusing. So either put them all at the singular, or all at the plural, with notes for context. —CodeCat 04:48, 23 December 2013 (UTC)
Plural-only dozens and countable dozen are really two different lemmas. From the usability perspective it also makes sense to keep the definitions at their respective most-used forms. In many languages even countable nouns are lemmatized in plural because they are most commonly found in that form (e.g. names of nationalities). Definitions should be where user most likely expects them, not where it "makes sense". --Ivan Štambuk (talk) 05:02, 23 December 2013 (UTC)
The main issue is to avoid duplication while at the same time making sure that users can find the information they're looking for. Putting all the information at the singular solves that, but it might be awkward in some cases. So what if we make a template that displays something like "for senses applying only to the plural, see (link)"? That way, users can immediately see that what they need might be found by following that link, while at the same time we avoid duplicating the information. —CodeCat 22:50, 24 December 2013 (UTC)
Yes, we should do that, at a minimum. Indeed, I suggested something similar in August, and the other users who commented, DCDuring and Furius, were supportive. - -sche (discuss) 01:54, 25 December 2013 (UTC)

Dictionary of American Regional English (DARE) online

Take a look. This could be a valuable resource for us. —Justin (koavf)TCM 18:20, 21 December 2013 (UTC)

At only $150/year the introductory rate looks like a steal. You should sign up. DCDuring TALK 22:53, 21 December 2013 (UTC)

Usenet, a possible way forward

From WT:RFV#homomarriage:

Just Usenet conversations? Does that even count? Ƿidsiþ 07:18, 21 December 2013 (UTC)
Since I've been here it always has. Mglovesfun (talk) 12:16, 21 December 2013 (UTC)
It doesn't really fill you with confidence though does it. I don't mind if one or two of them are from Usenet, but when all of them are it's a bit dispiriting. Ƿidsiþ 20:25, 21 December 2013 (UTC)
I totally agree but I have no solution to offer. Mglovesfun (talk) 22:03, 22 December 2013 (UTC)

Well, why not just allow a maximum of two Usenet cites ergo one from a published work. I still say the whole 'durably archived' system is a mess and needs reviewing. But I'd consider this a positive step. Mglovesfun (talk) 12:44, 23 December 2013 (UTC)

  • Usenet is the best written source of abundant evidence of how somewhat normal people actually use words now. There is at least as much reason to limit "literary" citations as to limit usenet citations, unless we are just interested in documenting a restricted slice of edited language. I understand that we don't really like actual, error-ful speech, either because it makes lexicography and linguistics harder, because of personal taste, or for other, even less defensible reasons, but I think we would be better off seeking out more usable sources of evidence for actual spoken language rather than restricting use of the proxy we have. DCDuring TALK 14:31, 23 December 2013 (UTC)
    • Usenet is also a source of abundant typos, erroneous usages, even outright fabrications. Printed publications, on the other hand, are far more likely to have gone through some editing process designed to weed out errors. bd2412 T 15:06, 23 December 2013 (UTC)
      It's a shame that it is so hard for us to do the job we claim to be doing. DCDuring TALK 15:50, 23 December 2013 (UTC)
      • Isn't part of the job we are doing to exclude proposed words that are not in actual use? bd2412 T 17:00, 23 December 2013 (UTC)
        Absolutely. That's why we shouldn't allow, for example, hapax legomena whose sole use is a single one in a "well-known work". There is at most one genuine use (if not an error), followed by mention in commentary. But we do. I think this betrays a kind of sublimated elitism or prescriptivism that is rampant but unacknowledged. DCDuring TALK 18:04, 23 December 2013 (UTC)
Basically yes I'm worried about being able to cite things that are 'errors' instead of 'words' or 'idioms'. Our slogan is all words in all languages, not all words and all mistakes in all languages. Mglovesfun (talk) 18:11, 23 December 2013 (UTC)
But how do you tell the difference between a mistake and a word? Look at Old English for example. Many West Saxon scribes wrote ciegan. So what if one eventually wrote cigan? Is that an error, or just a new spelling? Similarly, if an Old Dutch writer can't decide whether to write -ei- or -e- in a certain word, is that also an error? If there are "errors", there must also be a "correct" way, but in English there is no authority on spelling. People give dictionaries authority, but we are a dictionary too so that gives us the leeway to decide what to include. And we decided to be descriptive, so we should also describe uncommon spellings. —CodeCat 18:31, 23 December 2013 (UTC)
Exactly. Basically, we need to decide whether we are going to be a corpus or a dictionary (which is inherently prescriptivist). DTLHS (talk) 18:33, 23 December 2013 (UTC)
I totally agree (just in case it appeared that I don't). Mglovesfun (talk) 18:38, 23 December 2013 (UTC)
For historical languages, we should definitely be a corpus. For current languages I don't know. —CodeCat 18:40, 23 December 2013 (UTC)
Errors are real of course, calling something an error is not purely prescriptive, it can describe what speakers think about their own language. We should get rid of Category:English misspellings since they are real and therefore can be described. Mglovesfun (talk) 18:42, 23 December 2013 (UTC)
We can also just change the description so that it's clear that the category reflects what speakers commonly consider a misspelling, and does not imply any judgement on Wiktionary's part. —CodeCat 19:49, 23 December 2013 (UTC)
Usenet does not offer a high-prestige form of the language. It is within the realm of a descriptivist dictionary to point out the distinction between what's used on Usenet and what people expect from formal publications.--Prosfilaes (talk) 19:40, 23 December 2013 (UTC)
It requires some work to come up with a data-based, objective determination that something should be called an error. We could try to make it easier by having some criteria that divided the universe of, say, spellings into those that are clearly universally accepted, those that are accepted in some places or registers, those that are universally considered errors, those for which the criteria give an indeterminate result. The current evidence says this is beyond what we can achieve. After all, we can't even agree on such a simple thing as what "common" means in "common misspelling". DCDuring TALK 21:31, 23 December 2013 (UTC)
We can come up with criteria that can (hypothetically) be applied objectively, but the criteria themselves must be subjective. That's because they will be created by human beings, and there's no way round that. But beyond that, no I don't think we've ever even been close to deciding what 'common' means in the phrase 'common misspelling' in a Wiktionary context. Mglovesfun (talk) 00:21, 24 December 2013 (UTC)
Why not just count three Usenet citations as equal to one citation [corrected from "nine citations"]? That way any Usenet-only word would need nine Usenet citations. --WikiTiki89 04:36, 26 December 2013 (UTC)
Do you mean to count three Usenet citations as equal to one citation? bd2412 T 19:18, 26 December 2013 (UTC)
Yes, that's what I meant. --WikiTiki89 20:21, 27 December 2013 (UTC)
This solution is one I’d support. I prefer them being worth half of a regular citation, but one third is acceptable. — Ungoliant (falai) 17:56, 29 December 2013 (UTC)
An old, related proposal that may be worth revisiting: [[Wt:Votes/pl-2007-12/Attestation criteria]].​—msh210 (talk) 04:55, 27 December 2013 (UTC)
I think that proposal is too complicated to be necessary. --WikiTiki89 20:21, 27 December 2013 (UTC)
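The weighting idea discussed above is simple arithmetic, and a small sketch may make the two variants easier to compare. Everything here is an assumption for illustration: the function name is invented, and the threshold of three is inferred from the remark that a Usenet-only word would need nine Usenet citations at one third each.

```python
from fractions import Fraction

# Assumed threshold, inferred from "nine Usenet citations" at 1/3 each.
REQUIRED = 3


def is_attested(regular: int, usenet: int,
                usenet_weight: Fraction = Fraction(1, 3)) -> bool:
    """True if the weighted citation count reaches the assumed threshold.

    `regular` counts ordinary citations at full weight; `usenet` counts
    Usenet citations, each worth `usenet_weight` of a regular citation.
    """
    return regular + usenet * usenet_weight >= REQUIRED


# One third per Usenet cite: a Usenet-only word needs nine of them.
usenet_only_nine = is_attested(0, 9)
usenet_only_eight = is_attested(0, 8)

# One half per Usenet cite (the alternative mentioned above): six suffice.
half_weight_six = is_attested(0, 6, Fraction(1, 2))

# Mixed case: two regular citations plus three Usenet ones at one third.
mixed = is_attested(2, 3)
```

Fractions are used instead of floats so that nine thirds compare as exactly three.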

Why should definitions not be punctuated?

Especially for foreign terms, they sometimes bear no punctuation. While in most cases they may not be sentences, that is not a basis for depunctuating them. Surely we can find many non‐sentences in books that are still punctuated. Admittedly, I have found a few modern dictionaries that do this, but I am not certain how widespread this practice is today. (Certainly, it was not common centuries ago.) Is the consensus to imitate modern dictionaries in this regard, simply because they are now common?

We can:

1. Support punctuation for all definitions,

2. only punctuate sentences or

3. not care either way.

Are there justifications for leaving definitions unpunctuated? Or do we simply do this because it is (supposedly) common now? --Æ&Œ (talk) 03:18, 25 December 2013 (UTC)

Typographical conventions change over time. Requiring punctuation all the time, in the same way that century-old dictionaries did, will end up with the same effect: make Wiktionary look like a century-old dictionary. —CodeCat 03:31, 25 December 2013 (UTC)
Interesting explanation, but it sounds more like an esthetic issue than a practical one, as I was expecting. I am still not so sure how this is wrong, unless you mean that it’s incongruous, and therefore, ‘wrong,’ for an otherwise up‐to‐date dictionary. --Æ&Œ (talk) 03:54, 25 December 2013 (UTC)
If definitions were templatized we could have it both ways. DTLHS (talk) 04:25, 25 December 2013 (UTC)
I did suggest it once, but I don't think many people liked it. —CodeCat 13:52, 25 December 2013 (UTC)
Oh, please let’s not keep going down the “customized content” road. That is just an admission of failure, an unnecessary complication. It means that we don’t know what our dictionary says and it can’t be quoted reliably.
I would be happy with option 1 or 2. Let’s be: not incorrect, applying some simple rules consistently, and using post–19th-century style. If we do these things, then the definitions will not bug our readers, and no one will care that much. Michael Z. 2013-12-25 21:07 z
Our practice seems to be:
  1. English definitions are mostly with initial capitals and terminal periods. They tend to be actual definitions, not one-word glosses.
  2. FL definitions are mostly without initial caps or terminal periods. The "definitions" are often not definitions, just a single word, often polysemic, quite often obsolete, archaic, dated, or rare.
Rather than working on standardization of poor content, why don't we work on upgrading the content? If a standard of initial capitals and terminal periods would cause better FL definitions or at least more careful one-word glosses, then we should adopt that standard. If we don't care about having unambiguous definitions for FL terms, then the current practice is good enough. If we would like to have more templates that cannot be edited without getting the queue above a million, we should obviously go ahead and templatize everything. I'm sure that our crack squad of templatizers and Luacizers can eventually produce templates and modules that will write adequate definitions automagically. DCDuring TALK 21:38, 25 December 2013 (UTC)
My point of view: A definition is not a sentence and therefore does not need to start with a capital letter or end with a period. --WikiTiki89 04:34, 26 December 2013 (UTC)
But a sentence fragment may have a capital and a period. And if that helps it not look out of place or erroneous where we might also see a sentence, then maybe it should. Michael Z. 2013-12-27 21:45 z
It won't be out of place if none of our definitions at all have capitals or periods. --WikiTiki89 21:47, 27 December 2013 (UTC)
Quite true, assuming that no definitions have sentences, and no definitions have multiple sentence fragments that require periods and capitals to separate them. (Many are fine with semicolons; is every one?) Do you know if that’s true? Otherwise, it is simpler and more consistent just to use sentence caps and full stops throughout. Michael Z. 2013-12-27 22:04 z

Nom character

Some entries use the header "Nom character", such as ⿰米頗, where it was added by a very reliable contributor of Vietnamese. KassadBot currently tags the header as "nonstandard" because it isn't on the list of approved headers. Does anyone object to approving it or have suggestions for what header Nom characters should use instead? - -sche (discuss) 02:25, 26 December 2013 (UTC)

Suggestion: you could use the natively spelled form of it at chữ Nôm as a header instead, but I've a feeling it would still confuse more people like what "Nom character" is doing. I might base it off the header {{ja-kanji}} (literally "Han character"), but since kanji has been transliterated from Japanese into English for so long, people would get less confused if you mentioned "kanji" instead of "Han character", which I find interesting. Something to think about. TeleComNasSprVen (talk) 02:40, 26 December 2013 (UTC)
Why not use the standard header "Character"? —CodeCat 02:58, 26 December 2013 (UTC)
I don't oppose that, but I do think we should be consistent. We distinguish ===Han character===s as such, so we should either rename the Han characters "===Character===" (with the definition line clarifying, as it already does, whether the character is Han or Nom), or allow Nom characters their own distinct header. - -sche (discuss) 06:11, 26 December 2013 (UTC)
I think that would be good yes. —CodeCat 21:48, 27 December 2013 (UTC)
(edit conflict) There's at least one chữ Nôm entry () that uses the "Han character" header, and one that, strangely, has both (軿), though chữ Nôm characters often only look like Han characters. I don't have a problem, anyway, with legitimizing the "Nom character" header, if it seems to make more sense to those who actually edit in Vietnamese. Chuck Entz (talk) 03:06, 26 December 2013 (UTC)
Well, I was trying to illustrate my point through the relationship between using as header the words "Han character" (literal translation of "kanji") versus "Kanji", and using as header the words "Nom character" (literal translation of chữ Nôm) vs "Chữ Nôm". I was also under the impression that although chữ Nôm is based off the Han writing system, chữ nôm was Nom character and chữ Hán was Han character. I have a hard time distinguishing Nom and Han myself anyway... TeleComNasSprVen (talk) 05:24, 26 December 2013 (UTC)
Okay, let's start over. Why do we decide to use "Han character" and "Nom character" as headers instead of their native equivalents "chữ Hán" and "chữ Nôm" as headers? Why do we decide to not use "Han character" as a header instead of its native equivalent "Kanji" as a header? TeleComNasSprVen (talk) 05:28, 26 December 2013 (UTC)
As this is the English Wiktionary, it is preferable to use English designations. It is also preferable to use headers which contain only ASCII characters rather than those which contain diacritics many users would find impossible to type. Rarely, and AFAIK only in the case of non-natural languages, there is no English designation for a particular part of speech, and a foreign-language term must be used (e.g. Lojban gismu). In this case, however, there exist several possible English designations for Nom characters, (including "Nom character"). It is worth noting that 'kanji', while clearly derived from another language, is an English term; it is used unitalicized in English texts and is defined in the Random House dictionary, the Collins English dictionary, Merriam-Webster's dictionary, etc. In contrast, "chữ Nôm" is an untypable, unnecessary foreignism. - -sche (discuss) 06:11, 26 December 2013 (UTC)

I think it's best to get some input from User:Mxn himself; he can certainly deal with and explain the Vietnamese entries, so he might have more information about the header inconsistency. TeleComNasSprVen (talk) 06:21, 27 December 2013 (UTC)

Honestly, I don't think any of the Vietnamese character entries are laid out well. I just didn't want to mess with templates at the time. No analogy is perfect, but a character may have both Hán-Việt and Nôm readings, just as it may have different Japanese on and kun readings, respectively. (There are also tens of thousands of "Nôm characters" coined in Vietnam for Vietnamese, analogous to Kokuji.) In , for example, there's a Readings section, followed by the usual etymology-POS-definition tree for kun readings.

I'd be in favor of something similar for Vietnamese: a Readings section, followed by a POS-definition tree for Nôm readings. Virtually all the definitions would be of the form {{vi-Nom form of|phở|pho soup}}, pointing users to the Latin-script entries. Most entries would start out as just the Readings section, and editors like me can add the definition sections as we discover glosses or quotes.

This layout would avoid the term "Han character", which is ambiguous even in Vietnamese: chữ Hán can refer to the script shared between the CJKV languages, or it could refer to just the characters used in Chinese (as opposed to "Nôm characters").

Does this sound like a workable solution? Perhaps if I have time, I could convert existing entries using Tildebot, my robot assistant from the Vietnamese Wiktionary.

 – Minh Nguyễn (talk, contribs) 07:58, 27 December 2013 (UTC)

Perhaps you could convert one or two existing Han and Nom entries to your proposed format so we could see what it was like before deciding whether to implement it widely?
What do you think of the suggestion of using just ===Character=== as the L3 header for both Han and Nom characters? The "definition" info in our Han and Nom characters seems to already clarify whether the character is Han or Nom (and can be made to clarify that in any cases where it doesn't already do so), meaning no information would be lost even if the header were shortened. - -sche (discuss) 22:01, 27 December 2013 (UTC)

Sure. I've reformatted , , 軿, ⿰米頗, 𨋣, and , using "Character". Imagine that these entries originally had only the "Character"/"Readings" section (taken from the WinVNKey database, say). Then I came along and added the senses and quotations.

If we just shorten "Han character" to "Character" but do nothing else, most of the affected entries would have one "Character" section that fails to distinguish between Hán-Việt and Nôm readings. A few would have one very repetitive list of "Han tu form of" and "Nom form of". But it's not like the Hán-Việt readings need anything more than the short Readings list at the top. (Hán-Việt readings are for reading Chinese text, not Vietnamese.) The existing entries mostly seem to have only Hán-Việt readings, but I'd prefer to tag them all for review, because we don't know whether any have lumped Hán-Việt and Nôm together under a "Han character" section.

– Minh Nguyễn (talk, contribs) 10:06, 28 December 2013 (UTC)

The proposed layout is now detailed at Wiktionary:About Chinese characters#Vietnamese. – Minh Nguyễn (talk, contribs) 08:16, 30 December 2013 (UTC)
I've gone through and reformatted or rewritten all the entries that used {{vi-chunom}} or that used {{vi-hantu}} with chu=Nom. I'm no expert on Hán-Nôm, but it looks like some users with limited or no understanding of Vietnamese contributed a large number of these entries over the years.
In particular, I think we should tag any Vietnamese entry contributed by Cehihin as needing review. Their methodology appears to have yielded totally incorrect readings and definitions in many cases, and many of the entries are flagged for needing cleanup. Some examples:
  • 𡮈𨳒: They appear to have just searched for Nôm characters for nhỏ and mọn, took two results at random, and combined them for a transliteration of nhỏ mọn.
  • 𨳊: One of the readings, cu, appears to be a translation of the Cantonese sense, and the others appear to be readings of other characters that have cu as a reading.
  • 𨳒: The definitions are just the Cantonese definitions, translated into Vietnamese. The Readings section is just a list of synonyms of the Vietnamese "definitions".
I spent the last few days rewriting Cehihin's contributions, but there are scores more.
– Minh Nguyễn (talk, contribs) 01:52, 2 January 2014 (UTC)
You might want to start a topic at WT:RFC with a title along the lines of "Vietnamese entries by [[User:Cehihin]]" so it doesn't get lost or forgotten here. Chuck Entz (talk) 10:20, 2 January 2014 (UTC)

Inflected forms of inflected forms and alternative forms

Often inflected forms can themselves be inflected, e.g.

  • participles or gerunds can have declensions
  • declensions of adjective comparatives and superlatives are in some languages given in the respective comparative or superlative entry (which doesn't contain definitions, but soft-redirects back to the positive form), and in other languages their inflections are given in the entry for the positive (which contains definitions)
  • terms treated as predictably derived such as diminutives, as well as inflected forms of alternative forms of various kind, all have their own inflection, with no definition but soft-redirecting back to the base lemma
  • In some languages verbal stems can also have secondary forms, such as the causative ("to cause to do X"), which again can be separately lemmatized and have its own participles which decline, in the worst case giving three levels of redirection when a user inspects such inflected forms (e.g. A, genitive singular of B -> B, participle of C -> C, causative form of D -> D = "to <definition>").

This all translates as bad user experience IMHO, and unless we start adding definition lines, usexes etc. to inflected forms, all of the inflected forms in all levels of indirection should point back to the base lemma. E.g. in the last case, definition line for A should be extended to include all those levels of indirection and directly soft-redirect from A to D. Similarly, inflection table of the base lemma can contain unobtrusive, collapsible inflection tables within itself, for individual inflected forms that inflect. With respect to infrastructure, this would require an upgrade of {{form of}} template that could accept arbitrary levels of such inflectional redirection. --Ivan Štambuk (talk) 03:52, 26 December 2013 (UTC)
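As a rough illustration of the chain described above (A -> B -> C -> D), here is a minimal sketch of how a definition line could be built by following the soft redirects down to the base lemma. The `FORM_OF` table and both function names are hypothetical; this is not how {{form of}} works today, only a model of what an upgraded version would have to compute.

```python
# Hypothetical data: each non-lemma entry maps to (description, target).
FORM_OF = {
    "A": ("genitive singular", "B"),
    "B": ("participle", "C"),
    "C": ("causative form", "D"),
}


def resolve_chain(entry, form_of=FORM_OF):
    """Follow soft redirects from entry until an entry with no target is reached.

    Returns (steps, lemma), where steps is a list of (description, target).
    Raises ValueError if the chain loops back on itself.
    """
    steps = []
    seen = {entry}
    current = entry
    while current in form_of:
        description, target = form_of[current]
        if target in seen:
            raise ValueError(f"cyclic form-of chain at {target!r}")
        seen.add(target)
        steps.append((description, target))
        current = target
    return steps, current


def definition_line(entry, form_of=FORM_OF):
    """Render every level of indirection in a single definition line."""
    steps, _lemma = resolve_chain(entry, form_of)
    if not steps:
        return entry  # already a base lemma
    return ", ".join(f"{desc} of {target}" for desc, target in steps)
```

With the sample table, `definition_line("A")` yields "genitive singular of B, participle of C, causative form of D", so the reader of A sees the base lemma D without any intermediate clicks.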

Oppose. Those forms are separate lemmas with their own part of speech, and should be given inflection tables. We should not obscure that fact. —CodeCat 04:06, 26 December 2013 (UTC)
They are not lemma forms, and no dictionary lists those inflected forms as separate lemmas, unless they are irregular or special in some way. It's not about obscuring anything but usability. Too many mouse clicks. Inflected forms which differ in part of speech from their base lemma would still keep their own part of speech header. They are already treated as second-class citizens, and this would only confirm their inferior status. Why give them the privilege of having their own inflection? --Ivan Štambuk (talk) 04:50, 26 December 2013 (UTC)
I agree with CodeCat, but (to save argument) let's call them sub-lemmas. We are time-limited (not space-limited). To save too many back clicks, form-of templates could be more comprehensive, so:
    Nominative feminine singular, absolute superlative form of βέβαιος (vévaios)
could eventually become something more like:
    Nominative feminine singular form of βεβαιότατος (vevaiótatos), the absolute superlative form of βέβαιος (vévaios)
Even later this would include a gloss and a {{usex}}; this would be a long way off for rare forms, but could be very soon for common ones. Our comprehensivity is only limited in the short term. — Saltmarsh (talk) 12:19, 26 December 2013 (UTC)
Some of our templates already work that way, and I think it's much clearer. For example {{got-nom form of}}, which includes a parameter to say it's a participle form of something. But I don't know if there's an easy way to add this to templates like {{inflection of}} or {{form of}}. We should probably start by doing this with more language-specific templates on a case-by-case basis. —CodeCat 14:32, 26 December 2013 (UTC)
Fwiw this is how some Hebrew templates do it: "First-person singular future (prefix conjugation) of שמר(shamár), with a second-person masculine singular pronoun suffixed as direct object" (for the -with-object form of the future form of a verb) and "{{he-noun form of|רגל|n=d|pg=f|pn=s|pp=3|s=i|tr=régel}}" (for the -with-possessor form of the plural form of a noun), e.g.​—msh210 (talk) 05:06, 27 December 2013 (UTC)
But if you disagree with my suggestion, why don't βεβαιότερος and βεβαιότατος contain their own separate inflections, and instead they soft-redirect to positive form βέβαιος which contains them?
It's interesting to see how unification (e.g. all definitions in one place when lemmas overlap) or unnecessary redundancy (e.g. alt. forms yes for British/American spellings, but duplication of definitions is OK for habeo : habens, habendus etc.) arguments get stretched to fit a particular purpose. --Ivan Štambuk (talk) 16:50, 26 December 2013 (UTC)
Our mission statement is to list all words in all languages. So the feminine plural of French adjectives (as an example of all the others) should certainly be listed here. SemperBlotto (talk) 08:22, 26 December 2013 (UTC)
Oh I'm not against having inflected forms - I'm just in favor of 1) centralizing inflections in one place 2) having inflected forms of inflected forms (of inflected forms...) always redirect to the base lemma (through intermediate steps), i.e. the one which contains definitions. Both approaches are used for both points for various languages currently. --Ivan Štambuk (talk) 16:50, 26 December 2013 (UTC)
You don't literally mean redirect, do you? I trust you mean that inflected forms of inflected forms should link back to the base lemma. I agree with that. —Aɴɢʀ (talk) 10:21, 27 December 2013 (UTC)
Yes I meant linking back to the base lemma (soft-redirecting). --Ivan Štambuk (talk) 14:41, 1 January 2014 (UTC)
Maybe we could come up with a system of breadcrumbs like we have in some of the category templates. We would want the template to do some of the heavy lifting, if possible, so contributors don't have to enter all of the links in the chain. My main concern, though, is complication and/or clutter of the lemma inflection section. If you look at some of the Turkish entries, you can find examples of collapsible boxes practically filling the screen even unexpanded. If you have a language where verbs have multiple nominal forms, each declined for gender, number and case, it could approach that level of complexity. Do we have the ability to nest collapsible boxes? And are there instances where the derived-form inflections aren't completely predictable from the primary form and we would have to use different inflection-table templates or supply more information? Adding inflections in Ancient Greek is tricky enough, because you have to know which template covers the accent pattern and stem-ending pattern, not to mention things like declensions/conjugations, etc. Chuck Entz (talk) 21:54, 27 December 2013 (UTC)
That's why I advocated that these all be put in a single collapsible inflection table, and collapsible sub-tables can be nested within (it works, I tried). Basically I have no problems with having inflection listed both at the main lemma, and at "secondary lemmas" (let's call them that), i.e. at the non-definition entries for comparatives, participles and so on, but it's very annoying to have to traverse to their entries to see their inflected forms from the base lemma entry. Just as annoying as traveling from the inflected form of the inflected form back to the base lemma to see what the word means. The approach of listing one collapsible box after the other (also notably used for Ancient Greek) is a result of gradual expansion of inflection when editors at first didn't know or didn't know how to list those additional inflections (additional tenses, possessive forms and so on). What is needed is a base inflection template to serve as a skeleton, in which various inflections of a lemma will be nested as subinflections. It's too much pain to fix the thousands of existing entries, but it is something to keep in mind for the treatment of new languages. --Ivan Štambuk (talk) 14:41, 1 January 2014 (UTC)
This is a bit radical (and who'd do the work?) but I think, unlike print-encumbered dictionaries, we are in a position to get hold of a good Web developer (anyone free at MediaWiki?) and come up with a genuinely new, innovative, attractive user interface for a dictionary. For example, zooming with the scroll-wheel could be applied to levels of detail somehow, just as it is with Google Maps (working from satellite photos to physical street views); this might correspond to "show me this tiny inflected form within a dense paragraph of its roots", versus "zoom right in and show me only that one term, with translation tables and derivations". Just a thought. Equinox 00:05, 28 December 2013 (UTC)

Can we delete SAMPA from Appendix:English pronunciation?

I am pretty much a newbie to Wiktionary, though I've been messing around on Wikipedia for years. I noticed that Appendix:English pronunciation still lists (X-?)SAMPA representations, although November's News for editors says

27: SAMPA and X-SAMPA transcriptions are no longer included in pronunciation sections (vote).

So shouldn't they be taken off that page? Or, at the least, a prominent note added to it to that effect?

I posted this question on the talk page there. Then I saw that that was the first edit in almost a year, so I'm bringing it here.

If you want me in on this discussion, please ping me. ... Oops, that's not documented on wikt yet, though it's present. See Grease pit. --Thnidu (talk) 21:00, 26 December 2013 (UTC)

Yeah just do it. Mglovesfun (talk) 21:51, 26 December 2013 (UTC)
  Done -- Liliana 21:57, 26 December 2013 (UTC)

Default character set for CJK Han article title characters

Is there a reason why article titles in Wiktionary use the Japanese character set for most/all CJK Han characters? Shouldn't we use the Traditional Chinese character set for most of the characters? 00:33, 27 December 2013 (UTC)

I don't understand. Mglovesfun (talk) 23:55, 27 December 2013 (UTC)
Due to Han unification, we don't actually get to decide which "character set" to use; that's up to browsers (and other user agents). However, if desired, we can use something like {{DISPLAYTITLE:{{Hant|{{PAGENAME}}|lang=cmn}}}} to encourage browsers to select a rendering appropriate for Traditional Chinese. I don't know if we want to do that — and I don't know how well it would work if we did — but it's an option. —RuakhTALK 08:27, 29 December 2013 (UTC)

Plural of should not categorize

Proposition: the template {{plural of}} should not categorize.

Rationale: part of speech categorization is done in the headword (the bold word under the header) whereas in the inflection line, we do categorize for other things like archaic, obsolete, idiomatic, etc. The second reason is it creates awkward situations where Spanish doesn't have a Category:Spanish plurals but only Category:Spanish noun forms; {{plural of|word|lang=es}} categorizes automatically into Spanish plurals. There are a lot of languages where we don't use [[Category:<languages> plurals]] but we do use {{plural of}}. Rather than fiddling about with nocat=1, why not just move the categorization into {{head}}, either as {{head|en|plural}} (example for English) or {{head|es|noun form}} (example for Spanish)?

PS I believe CodeCat has already started implementing this by bot. A bit naughty but since the end outcome is the same I haven't bothered to object. Mglovesfun (talk) 00:02, 28 December 2013 (UTC)

{{head}} should not be used where it adds no value. DCDuring TALK 13:15, 28 December 2013 (UTC)
I would argue that improving categorization is adding value. Chuck Entz (talk) 13:32, 28 December 2013 (UTC)
It only adds value after the "plural of" categorization has been removed. Every time we add another instance of a template we increase the size of the queue when that template is changed. In this case we add {{head}} without even removing {{plural of}}, so the total number of transclusions increases. Such recreational use of templates is simply resource-wasteful. If it actually served some current, useful purpose for Wiktionary, I'd feel differently. Unless this is part of some secret plan. DCDuring TALK 14:31, 28 December 2013 (UTC)
I think the headword line should categorise into the relevant part-of-speech category. Different languages have different needs, and sometimes even the same language has different needs depending on what part of speech is being categorised. In the past, when creating entries for plural forms of e.g. adjectives, people have resorted to adding nocat=1 to the template, or even using {{form of}} with "plural" as the text. If people have to work around the limitations of our templates so often, then clearly something isn't right with our templates. In the past few weeks I've been trying to clean up the variety of form-of templates that we have, making them work and display more consistently, so that they all have the same parameters and things like that.
I've also removed categories when they weren't necessary, to avoid confusion over which form-of templates categorise and which don't. {{plural of}} categorises for example, while {{masculine plural of}} and {{inflection of}} do not, and that's obviously not very consistent and leads to errors when people end up adding the wrong category, or when they forget to add one altogether. There were countless adjectives in a variety of "(language) plurals" categories, for example, while that category is only for nouns (I suggested renaming those categories in RFM, but that's a separate issue). If we adopt the practice of never relying on the form-of template to provide a category, then I think it will reduce mistakes. —CodeCat 16:34, 28 December 2013 (UTC)
If there are no further objections I'd like to make this change to the template, then update all the entries. People who run bots or use other automatic means to create entries will need to fix things, though. How would we notify them? —CodeCat 15:06, 4 January 2014 (UTC)
I support this, go ahead. --WikiTiki89 15:49, 4 January 2014 (UTC)
Ok, it's done. I'll start removing the nocat=1 from the templates soon. We should also check Special:UncategorizedPages for a while to make sure there aren't any bots that forget to add a headword-line template with a category. —CodeCat 00:04, 8 January 2014 (UTC)
ACCEL currently still adds nocat= to entries (e.g. Reblochons), FYI. - -sche (discuss) 02:21, 9 January 2014 (UTC)
That's ok. It doesn't do any harm, and I have categories to keep track of which entries still have it. I'll remove it soon, but I'm waiting a bit so that I can check who is still adding entries without a category in the headword. I noticed Semper's bot was still creating entries like that for example. —CodeCat 02:28, 9 January 2014 (UTC)
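For anyone checking bot output after this change, the test is roughly "does the entry use {{plural of}} without a headword line that supplies a category?". Below is a minimal sketch of that check over raw wikitext. It is a simplification under stated assumptions: it only recognises {{head|lang|category}} and ignores language-specific headword templates, and the function name is hypothetical.

```python
import re

# Matches a use of the form-of template, e.g. {{plural of|cat|lang=en}}.
PLURAL_OF = re.compile(r"\{\{plural of\|")

# Matches {{head|<lang>|<category>...}}, i.e. a headword line that names a
# part-of-speech category. A bare {{head|en}} does not match.
HEAD_WITH_CATEGORY = re.compile(r"\{\{head\|[^|}]+\|[^|}]+")


def needs_headword_category(wikitext):
    """Return True if the entry relies on {{plural of}} alone for categorization."""
    uses_plural_of = PLURAL_OF.search(wikitext) is not None
    has_categorizing_head = HEAD_WITH_CATEGORY.search(wikitext) is not None
    return uses_plural_of and not has_categorizing_head
```

An entry whose headword is just bold text followed by {{plural of|cat|lang=en}} would be flagged, while one starting with {{head|en|plural}} would not.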

the language code 'hmn'

'hmn' is currently both a language code in Module:languages and a family code in Module:families. This is not ideal from a technical standpoint, as codes should be unique; it is also not sensible from a linguistic standpoint.

As indicated by the family code, the Hmong languages are a family of languages. Some are similar enough that they could be considered dialects (though they are almost all at least as distinct as Nynorsk and Bokmal, which we grant separate L2s). Others have tonal, phonemic, lexical and structural differences so great that they would be mutually unintelligible even if they were written in the same script, which they are not: some varieties use Pollard scripts, some use Phajhauj Hmoob interchangeably with one Latin standard, others use other Latin standards; other scripts may also be in use. In China, Robert Darrah Jenks noted that "students from different [Hmong] dialect groups addressed each other in Mandarin Chinese so that they would be understood" because their native languages were not mutually intelligible. There is, in other words, no unitary "Hmong language" we could use 'hmn' for.

In practice, the handful of words we have coded as 'hmn' are White Hmong words — but White Hmong has its own code, 'mww', which is already more widely used. (Template:mww was deleted following a RFD in which a total of one user participated; it was never deleted from Module:languages, and discussion on WT:T:AHMN subsequently determined that all Hmong varieties should be included pending any new merger discussions.)

For these reasons, I propose to remove 'hmn' from Module:languages, so that it is only a family code, no longer a language code. The few entries which use it can be switched to use 'mww', since that is the variety they belong to. - -sche (discuss) 08:04, 29 December 2013 (UTC)
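The uniqueness problem described above is easy to check mechanically. The following is a sketch only: the real tables live in the Lua modules Module:languages and Module:families, and the two small dictionaries here are hypothetical stand-ins containing just the codes mentioned in this discussion.

```python
# Hypothetical stand-ins for the language and family code tables.
LANGUAGES = {
    "hmn": "Hmong",
    "mww": "White Hmong",
    "ltc": "Middle Chinese",
}
FAMILIES = {
    "hmn": "Hmong languages",
}


def shared_codes(languages, families):
    """Return the codes that are defined both as a language and as a family."""
    return sorted(languages.keys() & families.keys())


def without_code(table, code):
    """Return a copy of the table with one code removed (the proposed change)."""
    return {key: name for key, name in table.items() if key != code}
```

With the sample data, `shared_codes(LANGUAGES, FAMILIES)` reports `hmn`, and the list becomes empty once `hmn` is removed from the language table, which is the state the proposal aims for.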

Category:Requests for unblock

Someone should clear out this category once in a while. (Or delete it if people no longer think it's necessary; some requests go back to 2012.) TeleComNasSprVen (talk) 21:14, 29 December 2013 (UTC)

Nevermind, cleared it out myself. TeleComNasSprVen (talk) 19:46, 30 December 2013 (UTC)


I'm a bit puzzled by this. Why is there a separate gender template, even though the headword template already supports genders? Why not {{head|de|noun form|g=m}}? And why is there even a gender at all in this case? Don't we normally avoid duplicating information that's already on the lemma? I'm not sure why entries like this are still being created. —CodeCat 00:17, 30 December 2013 (UTC)

Because Wiktionary:Requests_for_deletion/Others#Template:m? TeleComNasSprVen (talk) 00:26, 30 December 2013 (UTC)
In part, yes. There's really no reason to use {{m}} here, and I thought we had abandoned that practice by now. People argued to keep the template because it's used, but that's a chicken-and-egg problem. So is this (entry) what we want? I don't think so, it should be changed. —CodeCat 00:33, 30 December 2013 (UTC)

Namespace for gadgets?

Right now we don't really have a good way to manage gadgets and other Javascript-based tools. They're spread all over the place, and most of them aren't categorised, so it's left to whoever possesses some arcane knowledge to find them. Many of them like User:Conrad.Irwin/creation.js are even personal user pages. That last point in particular is bad; users should be free to modify or delete their userspace pages without risking bringing down parts of the site. So I think a proper place should be found for these scripts. I don't know where they could be placed, though. Should there be a Gadget: or JavaScript: namespace, or would something else work? —CodeCat 21:35, 30 December 2013 (UTC)

What is wrong with MediaWiki:Gadget-stuff.js? Keφr 21:44, 30 December 2013 (UTC)
The MediaWiki namespace can only be edited by administrators, so it's not suited for general use. It also restricts us to one page per gadget, which may not be reasonable. —CodeCat 21:57, 30 December 2013 (UTC)
Only administrators can edit User:Conrad.Irwin/creation.js as well; I can't touch that if it's his. And how does it restrict to one page per gadget when you can also use importURI on a MediaWiki gadget page as well? Like MediaWiki:Gadget-HotCat.js for example (mw.loader.load is nearly the same anyway), you can set up any separate MediaWiki:Stuff.js page for separate pages per gadget (doesn't have to have Gadget- prefix). Second reason is, I'm not sure letting others like anons muck around with JS would be such a good idea. TeleComNasSprVen (talk) 23:25, 30 December 2013 (UTC)
We don't have to let anons edit it. And certainly some of the gadgets will be admin-only, but not all of them need to be. --WikiTiki89 23:29, 30 December 2013 (UTC)
I disagree. Only trusted editors should be able to modify JavaScript that is run by others, since said JavaScript is run with the privileges of those others. —RuakhTALK 08:24, 31 December 2013 (UTC)
That's a good point. I hadn't considered the security aspect of JS. —CodeCat 23:12, 6 January 2014 (UTC)
I like that idea. --WikiTiki89 23:09, 30 December 2013 (UTC)
  • This is actually planned for ResourceLoader v2, and gadgets-edit will be a separate userright. --Yair rand (talk) 00:25, 31 December 2013 (UTC)
    Excellent! --WikiTiki89 00:30, 31 December 2013 (UTC)
I would like to move User:Conrad.Irwin/creation.js into the main MediaWiki:Gadget-WiktAccFormCreation.js page (nobody needs to edit it, generally). But I don't know where User:Conrad.Irwin/creationrules.js could be moved to. Editing that should be restricted still, but it's meant to be editable, so putting it in the MediaWiki space doesn't make as much sense (to me at least). —CodeCat 22:46, 6 January 2014 (UTC)
I agree with moving it out of user-page space to somewhere more "official". No real opinions on where it should go! Equinox 23:07, 6 January 2014 (UTC)
I still think they should have their own namespace. The MediaWiki NS doesn't seem right. --WikiTiki89 23:21, 6 January 2014 (UTC)
I agree, definitely. But Ruakh is right, we wouldn't want people to be able to edit it. So it would become a rather small namespace, and only admin-editable. Is that currently possible to do with what we have? —CodeCat 23:27, 6 January 2014 (UTC)
I thought that any page in the MediaWiki namespace suffixed by .js could run JavaScript? So you could move it to MediaWiki:Creationrules.js? Even if you move it to a gadget-prefixed page like MediaWiki:Gadget-creationrules.js, it is still hidden because it's not in MediaWiki:Gadgets-definition. Compare enwiki's MediaWiki:Common.js/IEFixes.js page, which is imported into MediaWiki:Common.js but runs as a separate page. TeleComNasSprVen (talk) 03:18, 7 January 2014 (UTC)
What about documentation pages? Those aren't currently supported, but definitely needed. It should work like how it does for modules. —CodeCat 12:52, 7 January 2014 (UTC)
I'm sure editing rights for a namespace are configurable. --WikiTiki89 18:56, 7 January 2014 (UTC)

Best of 2013

So, editors, what do you think were the best new WT bits in 2013? The most useful new gadgets/best news/favourite admins/best edits/sweetest pies etc. --ElisaVan (talk) 13:16, 31 December 2013 (UTC)

Best new bits for me: luafication of {{l}} and {{term}}, automatic transliteration, ability to search inside the content generated by templates. Favourite admin: User:ZxxZxxZ, with his knowledge of coding and ancient Iranian tongues. Best usage example, this one. --Vahag (talk) 18:13, 31 December 2013 (UTC)
The worst new thing was forcing us to type context everywhere. The best new thing was reducing it to cx. The sweetest pie is still the delicious stargazy pie from 2010. Editors are encouraged to try to add better pies. Equinox 23:02, 6 January 2014 (UTC)

Better headlines in Mandarin Pinyin entries

I suggest that we change the headlines in pinyin entries as I have done here [2] so that it is clear what the sections are about. I think that later the syllable section could be expanded to show which part of the syllable is the initial and which is the final, as you can read about here [3]. Kinamand (talk) 09:51, 1 January 2014 (UTC)

I prefer the way it is. It's supported by a vote and also unified with other Romanisation systems we allow - Japanese, Gothic and the proposed Romanisation for Cantonese. --Anatoli (обсудить/вклад) 21:08, 1 January 2014 (UTC)
Please give a link to the vote. Please give one entry where Japanese romanization has two sections with the headline Romanization. What makes the pinyin entries confusing, in my opinion, is that there can be up to 3 sections with the headline Romanization, and it is impossible for people to know what they mean or how to add content to them. Kinamand (talk) 08:53, 2 January 2014 (UTC)
It's because pinyin and pinyin syllable were both changed to romanization. Therefore entries with all three headers (romanization, pinyin, pinyin syllable) ended up with romanization three times. Mglovesfun (talk) 14:00, 5 January 2014 (UTC)
Yes, but that is also the problem. A user on Wiktionary cannot know what the three romanization sections mean. Here [4] you can read that the idea perhaps was to merge two of the sections. Merging pinyin and pinyin syllable would cause problems because they have different templates. As it is now, it is impossible for people to figure out what the 3 romanization sections mean, and that is very bad because it prevents people from improving the pinyin entries without risking breaking the formatting. The argument that we also have romanization sections in Japanese and Gothic is not valid, because those have only one section and therefore cause no confusion. Kinamand (talk) 18:42, 11 January 2014 (UTC)