Wiktionary:Beer parlour/2018/December

References for Vietnamese readings listed under Template:vi-readings

I would like to add superscript references for readings of Vietnamese Han characters using the following code as a suggestion:

{{vi-readings|rs=老04
| hanviet = giả - tdcn
| nom = giả - tdcn;gdhn, giã - tdcn, giở - tdcn, rả - tdcn, trả - gdhn;btcn, dã - gdhn
}}

The abbreviations used are: tdcn = {{vi-ref|Nguyen (2014).}} gdhn = {{vi-ref|Trần (2004).}} btcn = {{vi-ref|Hồ (1976).}}

The desired output using 者 as an example is as follows:

Han character

者: Hán Việt readings: giả^[1]
者: Nôm readings: giả,^[1]^[2] giã,^[1], giở^[1], rả^[1], trả^[2]^[3] dã^[2]

References

↑ ^1.0 ^1.1 ^1.2 ^1.3 ^1.4 Nguyen (2014).
↑ ^2.0 ^2.1 ^2.2 Trần (2004).
^ Hồ (1976).

Currently, this is also achievable using the bulkier code below:

{{vi-readings|rs=老04
| hanviet = [[giả#Vietnamese|giả]]<ref name="tdcn">{{vi-ref|Nguyen (2014).}}</ref>
| nom = [[giả#Vietnamese|giả]],<ref name="tdcn"/><ref name="gdhn">{{vi-ref|Trần (2004).}}</ref> [[giã#Vietnamese|giã]],<ref name="tdcn"/>, [[giở#Vietnamese|giở]]<ref name="tdcn"/>, [[rả#Vietnamese|rả]]<ref name="tdcn"/>, [[trả#Vietnamese|trả]]<ref name="gdhn"/><ref name="btcn">{{vi-ref|Hồ (1976).}}</ref> [[dã#Vietnamese|dã]]<ref name="gdhn"/>
}}

If possible, could someone edit Module:vi so that the suggested code in the first paragraph would give the desired output? KevinUp (talk) 15:49, 1 December 2018 (UTC)
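The expansion being requested can be sketched outside the module. This is a minimal illustration in Python, not the code of Module:vi: the function name `expand_readings`, the `sources` dictionary and the shared `seen` set are assumptions made for the example, and the placement of commas relative to the reference markers is simplified.

```python
def expand_readings(spec, sources, seen=None):
    """Expand 'reading - key;key, reading - key' into linked readings with <ref> tags.

    `seen` holds the reference keys already defined in full, so it can be
    shared between the hanviet and nom parameters (hypothetical design).
    """
    seen = set() if seen is None else seen
    parts = []
    for item in spec.split(","):
        reading, _, keys = item.strip().partition(" - ")
        refs = []
        for key in (k.strip() for k in keys.split(";")):
            if not key:
                continue  # reading given without any reference
            if key in seen:
                # later uses only need the named, self-closing form
                refs.append('<ref name="%s"/>' % key)
            else:
                seen.add(key)
                refs.append('<ref name="%s">{{vi-ref|%s}}</ref>' % (key, sources[key]))
        parts.append("[[%s#Vietnamese|%s]]%s" % (reading, reading, "".join(refs)))
    return ", ".join(parts)


sources = {"tdcn": "Nguyen (2014).", "gdhn": "Trần (2004).", "btcn": "Hồ (1976)."}
seen = set()
hanviet = expand_readings("giả - tdcn", sources, seen)
nom = expand_readings("giả - tdcn;gdhn, dã - gdhn", sources, seen)
print(hanviet)
print(nom)
```

The first use of each abbreviation carries the full citation and every later use becomes a named back-reference, which is what the bulkier hand-written code does.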

@Suzukaze-c Hi. If you have the time, would you mind comparing the desired output above with 者#Vietnamese? I can't figure out how to implement this within the module. KevinUp (talk) 06:50, 7 December 2018 (UTC)

Done, I think. —Suzukaze-c ◇◇ 07:01, 13 December 2018 (UTC)

Thank you very much! Also, I'd like to mention here that Template:vi-hantu is now officially deprecated and will be replaced by Template:vi-readings (The older template contains readings imported from the Unihan database, which fails to distinguish between Hán Việt and Nôm readings. Previous discussion can also be found here).

Also, I found that the Nom Foundation database contains some mistakes/unverified readings, such as hoả for 一 [1], which is why I wanted to list out readings based on what is found in the original reference source. All readings will eventually be given superscript references, but it will take some time for this to be done. KevinUp (talk) 14:34, 13 December 2018 (UTC)

@Mxn Hey, just want to update you on the latest developments. I am currently sorting through readings of Hán Nôm characters and adding reference links to Wiktionary:About Vietnamese/references for each reading. Readings with a higher number of references are arranged first. See 吟 for example, which has up to 16 readings.

This is a work in progress, and completed characters with verified readings can be found at Special:WhatLinksHere/Wiktionary:About_Vietnamese/references. KevinUp (talk) 18:49, 26 December 2018 (UTC)

unchanged plural

What exactly does "unchanged plural" mean in the usage note for craft? That's not the general terminology used in Wkt, is it? --Backinstadiums (talk) 16:38, 2 December 2018 (UTC)

I've changed it to be explicit: "The plural craft is used to refer to vehicles. All other senses use the plural crafts." Ultimateria (talk) 19:12, 2 December 2018 (UTC)

Inevitable discussion about reference works from non-Latin cultures

Since the discussion of |lang= in {{quote-web}} on this month's Grease pit page suggests opening up all reference templates, it is an opportune moment to make their content uniform. I have noticed that references to works published in a non-Latin script are displayed in several different fashions. The author's name and the title were of course written in a certain script, but to my great surprise, and contrary to Wiktionary’s usual laudable compliance with Unicode and internet standards, I found reference templates already created here that do not even include the original title but wrap it in {{xlit}}, so that only a transliteration remains. The same has been done with the names of their authors. As a result I could not recognize any of the books and almost failed to find the existing templates with Wiktionary’s search function, having already prepared, in vain, to create them myself.

So I reasoned that, since we are in late 2018 and our letter case is unlimited in what concerns languages that have in the Modern Age been used for pursuing science, the templates must all be uniformized so that the original title is displayed, opining also that transliterations are to be discarded for scripts that are unambiguous since they are no gain for anyone (if you don’t know the language you don’t know the transcription either, short of negligible cases when one is literate in Latin script only but not the actual script for a non-Latin-script-written language one knows) and “En.Wiktionary entries already have too much wasted space”, as @-sche acutely observed on the Grease pit page of this month and also has been voiced as a cause of displeasure.

There might be little experience in reference sections in any works containing non-Latin references, but naively and naturally and looking at how my computer does it, I always ordered references by the Latin names first and then the Cyrillic ones, and so I have come to the belief that the original-script author names can be had easily. People might however be more appealed to by Latin-transliterated names, but even then I am apprehensive of those being less iconic, but this is of limited importance for names. It can very well be grave in logographic writing systems some of which are still in use, for particularly people’s names can have the most arbitrary characters and it would be utterly impossible to reconstruct the original name without browsing the web again only to find a name which a Wiktionary editor has needlessly left out. Currently the Japanese reference templates have all formats.

So how does Wiktionary look upon all these factoids? What should references have, perhaps with distinctions by writing systems? I’d like to see the transcriptions of titles in alphabetic and syllabic scripts removed completely, because they have no non-theoretical uses, and I would sort references by Unicode order (I don’t actually know how Chinese sort their Chinese reference sections, and perhaps one feels that transliterated Japanese titles could somehow help, so I avoid talking about those scripts). Plus why have people even thought that |title= and the author parameters would be the correct place to put transliterations or transcriptions? These could easily be separate parameters, |tr-title=, |tr-author= and so on, that can be expanded for those who need them (whose existence I deny), and this would make reference templates use more expected parameters. Which of course entails as a minimum that we have original-script titles – come on, are readers supposed to reverse-transliterate titles? Author names perhaps in both, since one might not know the script but know the author from other publications in other, Latin-written languages? But this is not generally true, though there are often adapted author names around. Avicenna is quite iconic, no need for اِبْن سِينَا (ibn sīnā), but that’s more often the case for classics and applicable to quotation templates. What does iconicity tell us here? And I have not even mentioned how often title translations should be given, which have a parameter already. There is still the issue of quotation templates containing bare long titles, and there are a few “click to expand” solutions for these as I remember. Pinging some people I find interesting to hear or interested: @Sarri.greek, Eirikr, Sgconlaw, Dan Polansky. Fay Freak (talk) 00:29, 5 December 2018 (UTC)

I’m sorry, could you summarize all that? I’m having trouble understanding what your concerns are. — SGconlaw (talk) 01:51, 5 December 2018 (UTC)

@Sgconlaw I wanted to make references to books written in a non-Latin script a bit more uniform, raising the questions of whether the original script of a) the author name and b) the title should be shown, and whether transliterations of c) the author names and d) the titles should be shown. I was just setting out many pros and cons. My conclusion is to vehemently affirm b), deny d) (hardly valuable clutter) and lean towards a); I am rather open to c), but it would need to look good enough (the Chinese reference page KevinUp has linked looks great, but we need |tr-author= for this, I think). Fay Freak (talk) 19:35, 5 December 2018 (UTC)

I say show both the original and the transliteration; in the future we will be able to customize this to everyone's satisfaction with CSS magic. Crom daba (talk) 03:42, 5 December 2018 (UTC)

Yes, and at the least transliteration does not belong in |title=, otherwise there won’t be CSS magic. There need to be separate fields for original titles and author names and their transliterations; I don’t think I can be wrong here, @Sgconlaw. The arguments about display are given above. The decision about display should not be influenced by limited forms of saving the information. Fay Freak (talk) 19:35, 5 December 2018 (UTC)

Here are the formats used for Chinese references: Wiktionary:About Chinese/references, Korean references: Wiktionary:About Korean/references and Vietnamese references: Wiktionary:About Vietnamese/references. Also, all Chinese quotations and usage examples (whether cited from a book, song, video or the web) are provided using Template:zh-x. A list of abbreviations for well-known references used by this template can also be found at Module:zh-usex/data. KevinUp (talk) 04:08, 5 December 2018 (UTC)

@KevinUp The Chinese reference page is great, until the point where I find: “Starostin, Sergei (1989). Rekonstrukcija drevnekitajskoj fonologicheskoj sistemy (A Reconstruction of the Phonological System of Old Chinese)”. Why is the Russian title not given in Russian script, while the Chinese titles are given in Chinese script only (and not in Pinyin)? There is no logic in that.

There is also the issue of some titles being translated and some not, but that’s minor. Fay Freak (talk) 19:35, 5 December 2018 (UTC)

@Fay Freak: I'm not sure why the work by Sergei Starostin was not written in the Cyrillic script. I tried to trace the source of that work, and this is what I managed to find: [2]. Unfortunately I was unable to trace the original source. Perhaps someone else could help by looking up the bibliography of Sergei Starostin.

Phonological reconstructions for Early Zhou, Classical, and Middle Chinese are based on Sergei Starostin's version as originally published in: [Starostin, Sergei. Rekonstrukcija drevnekitajskoj fonologicheskoj sistemy [Reconstruction of the Phonological System of Old Chinese]. Moscow, 1989.] Particular reconstructions are transliterated into the UTS from S. Starostin's etymological database of Chinese characters (bigchina.dbf), available online at http://starling.rinet.ru.

As to why Chinese titles are given in Chinese script only and not in Pinyin, this may have been done to prevent a cluttered appearance of the reference works. Also, it seems that pinyin tone marks are omitted for Chinese reference works in Yale University Library's Quick Guide on Citation Style for Chinese, Japanese and Korean Sources: APA Examples. KevinUp (talk) 16:32, 6 December 2018 (UTC)

Transliteration of author and title can be useful. There are several scripts where I can read faster (and more accurately) names and titles in transliteration.

I am not sure if it was being suggested, but omitting transliteration in general would be a bad idea; Sanskrit look-up is already hampered by the ban on Sanskrit in the Roman script. --RichardW57 (talk) 23:01, 29 December 2018 (UTC)

Adding pinyin for numbers in Chinese (Mandarin?) example sentences

@Dokurrat, KevinUp, Justinrleung, Suzukaze-c, Tooironic, Wyang & co. (alphabetically organized) I added pinyin for the numbers in a Mandarin Chinese example sentence, and that pinyin was removed; see [3]. I think we should give the pinyin for the numbers (maybe?). I'm okay either way; in fact I don't think we need to do all sentences one way (no pinyin for numbers in example sentences) or all the other way (pinyin for all numbers in example sentences). But I'm not sure. idk. I'm just putting it out there for y'all to discuss. Any which way is fine with me. --Geographyinitiative (talk) 04:30, 5 December 2018 (UTC)

No, I don't think we should add pinyin for Arabic numerals. Dokurrat (talk) 04:41, 5 December 2018 (UTC)

I like the idea. I usually do it for Japanese. —Suzukaze-c ◇◇ 04:42, 5 December 2018 (UTC)

I'd like to see the numbers as pinyin, because they are read according to their Mandarin pronunciation. Also, depending on context, they can be read as cardinal numbers or standalone digits: 365天 ― sānbǎiliùshíwǔ tiān ― Three hundred and sixty-five days.

員工365失踪了。 [MSC, trad.]
员工365失踪了。 [MSC, simp.]

Yuángōng sānliùwǔ shīzōng le. [Pinyin]

Employee no. 365 is missing.

KevinUp (talk) 05:04, 5 December 2018 (UTC)

^ this. —Suzukaze-c ◇◇ 05:18, 5 December 2018 (UTC)

Agreed that we should add pinyin conversion for Arabic numerals. ---> Tooironic (talk) 06:09, 8 December 2018 (UTC)

It has to be added manually, of course, otherwise we are asking for possible future errors in conversion. Perhaps re-transliterated numbers need to be displayed differently, so that e.g. sānbǎiliùshíwǔ for "365" is known to stand for 三百六十五 (sānbǎiliùshíwǔ, “three hundred sixty five”) or 三六五 (sānliùwǔ, “three six five”). A different colour or underlined? Also, maybe a trick is needed to use a hidden "三百六十五"/"三六五" but display "365", so that a manual pinyin is not required? BTW, @KevinUp: I have suppressed the display of "365" in your example with @. --Anatoli T. (обсудить/вклад) 07:15, 8 December 2018 (UTC)

@Atitarev: Automatic pinyin transliteration of Arabic numerals can be done by adding pronunciation data of 0-9 to data.polysyllable_pron_correction in Module:zh-usex/data. However, this would render "365" as 三六五 (sānliùwǔ, “three six five”). Manual input would still be needed if "365" is intended to be read as 三百六十五 (sānbǎiliùshíwǔ, “three hundred sixty five”). KevinUp (talk) 14:45, 8 December 2018 (UTC)
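The difference between the two readings can be illustrated with a short Python sketch. It is not taken from Module:zh-usex; the names `DIGITS`, `UNITS`, `digit_reading` and `cardinal_reading` are made up for the example, and the cardinal rules are simplified (numbers up to 9999 only, no 二/两 distinction, no tone sandhi, no syllable-separating apostrophes).

```python
# Pinyin for the digits 0-9 and for the place values (units, tens, hundreds, thousands).
DIGITS = ["líng", "yī", "èr", "sān", "sì", "wǔ", "liù", "qī", "bā", "jiǔ"]
UNITS = ["", "shí", "bǎi", "qiān"]


def digit_reading(number):
    """Read a numeral digit by digit, as a per-digit lookup table would."""
    return "".join(DIGITS[int(ch)] for ch in number)


def cardinal_reading(number):
    """Read a numeral as a cardinal number (simplified, 0-9999 only)."""
    n = int(number)
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0-9999")
    if n == 0:
        return DIGITS[0]
    text = str(n)
    parts = []
    pending_zero = False
    for pos, ch in enumerate(text):
        d = int(ch)
        unit = UNITS[len(text) - pos - 1]
        if d == 0:
            pending_zero = True  # interior zeros collapse into a single "líng"
            continue
        if pending_zero and parts:
            parts.append(DIGITS[0])
        pending_zero = False
        if d == 1 and unit == "shí" and pos == 0:
            parts.append(unit)  # 10-19 are read "shí...", without a leading "yī"
        else:
            parts.append(DIGITS[d] + unit)
    return "".join(parts)


print(digit_reading("365"))     # sānliùwǔ
print(cardinal_reading("365"))  # sānbǎiliùshíwǔ
```

A lookup on single digits can only ever produce the first reading, which is why the cardinal reading needs either extra logic like the above or manual input.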

@KevinUp: I understand. As I said, what we need is a new method in the module that uses hidden characters (in this case "三百六十五", giving "sānbǎiliùshíwǔ") for transliteration purposes only, but displays an unlinked "365" in the Chinese text. --Anatoli T. (обсудить/вклад) 04:16, 9 December 2018 (UTC)
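One way to picture such a method is an inline annotation that carries the hidden characters next to the numeral. The brace markup `365{三六五}` below is purely hypothetical (it is not existing Template:zh-x syntax), as is the function name `split_display_and_reading`; the sketch only shows how one input could yield a display text and a separate text to transliterate.

```python
import re

# Hypothetical markup: a run of digits followed by the intended characters in braces.
HIDDEN = re.compile(r"(\d+)\{([^{}]+)\}")


def split_display_and_reading(text):
    """Return (text to display, text to feed to the transliterator)."""
    display = HIDDEN.sub(r"\1", text)  # keep the Arabic numerals
    reading = HIDDEN.sub(r"\2", text)  # keep the hidden characters
    return display, reading


display, reading = split_display_and_reading("员工365{三六五}失踪了。")
print(display)  # 员工365失踪了。
print(reading)  # 员工三六五失踪了。
```

Because the editor chooses the hidden characters, the same numeral can be annotated as 三六五 in one sentence and 三百六十五 in another.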

This seems to be slightly complex, so we may have to add this to Wiktionary:About Chinese/tasks. KevinUp (talk) 04:25, 9 December 2018 (UTC)

Wiktionary lemmas written in a nonnative script

As Wiktionary grows, I have noticed some unusual entries written in a nonnative script, such as 0.5#Chinese and の#Chinese, that qualify under Wiktionary:Criteria for inclusion and may also have passed Wiktionary:Requests_for_verification due to their widespread use in a particular language or region. However, I think that it might be better to list such entries (that have passed RFV) in an appendix or separate namespace, or to put a banner right below the language header, along with categorization, to inform our readers that the lemma is written in a nonnative script. KevinUp (talk) 15:14, 5 December 2018 (UTC)

Out of curiosity, do we have Arabic, Greek, Hebrew, Hindi, Russian lemmas that are written in the Latin script, for example? I've also found Category:Terms written in foreign scripts by language, but only Chinese, Japanese and Korean are listed in this category. KevinUp (talk) 15:24, 5 December 2018 (UTC)

Category:Chinese terms written in foreign scripts DTLHS (talk) 15:26, 5 December 2018 (UTC)

These entries are rather interesting: fighting#Chinese, friend#Chinese, part-time#Chinese. Yes, I've heard these terms used in real life, such as in TVB dramas, but I am surprised to see these entries included in Wiktionary. I would like to propose that such terms be listed in an appendix or separate namespace, because such entries are more likely to be found in an informal dictionary such as an A-Z pocket slang dictionary, rather than a formal dictionary. KevinUp (talk) 15:55, 5 December 2018 (UTC)

The issue has come up before, with marketing being used (in Latin script) in Greek texts: see Wiktionary:Beer parlour/2017/September § Modern Greek terms spelt with Latin characters. See also this revision history for a recent disagreement. I'm not comfortable at all with including that sort of thing. Per utramque cavernam 16:15, 5 December 2018 (UTC)

Foreign script is a strong argument for code-switching. Even when it is used constantly in Greek it can be the case that it never passes into Greek, and it is no loss not to add it either because the English entry suffices (you read a Greek text, look up a word here but find it as English, that’s enough, you don’t expect anyway that all that you read is in the dictionary as Greek). Fay Freak (talk) 19:39, 5 December 2018 (UTC)

Script is secondary to the actual spoken language, and usage of words should be analyzed for codeswitching, and for which language's lexicon a word belongs to. French has fr:American way of life#Français and fr:web design#Français, and Japanese has サード (sādo, “third”) and ホエールウォッチング (hoēru wotchingu, “whale watching”); are these "acceptable"? —Suzukaze-c ◇◇ 19:42, 5 December 2018 (UTC)

Maybe we need to find a way to represent code-switching? It would seem like a common pattern for a foreign word to have a code-switched variant (with foreign pronunciation, in a foreign script) and a nativized one (being closer to the native language's phonology, spelled in the language's native script) with the first one being extremely common and the second at the edge of attestability, but due to our policies we only include the second one and create a distorted picture of actual usage patterns.

I remember @Vahagn Petrosyan having something to say about this. Crom daba (talk) 20:06, 5 December 2018 (UTC)

I create a Usage note, as in վարագույր (varaguyr). --Vahag (talk) 12:18, 6 December 2018 (UTC)

Yes, I think that we need to find a way to represent code-switching. Rather than using foreign script as an argument for code-switching it might be better to decide based on the pronunciation of the entry.

I would like to suggest that entries such as (1) part-time#Chinese, (2) PK#Chinese, (3) SUS#Japanese, which have been nativized to become closer to the phonology of the language they were borrowed into (despite retaining their nonnative script), be accepted as legitimate entries, whereas entries such as (1) fighting#Chinese, (2) fr:American way of life#Français, (3) の#Chinese, which are found mostly in written form but rarely in spoken conversation, be put under some sort of banner to inform our readers that such entries are of unconventional usage and are mostly written for stylistic effect. KevinUp (talk) 16:32, 6 December 2018 (UTC)

Alternatively, we should set up some sort of guideline to decide whether or not an entry is considered code-switching. KevinUp (talk) 06:50, 7 December 2018 (UTC)

Yes, language-specific CFI are needed. --Anatoli T. (обсудить/вклад) 07:17, 8 December 2018 (UTC)

I think that the issue of the script is a bit of a red herring. Take the originally English word online, which has become commonplace in many languages, including Serbian. Now when Danas, a major newspaper, uses the word, they write for example “Srbi sve više kupuju online”. The Politika newspaper is also written in Serbian but uses Cyrillic script; when they use the word, they write for example “Политика Online”, as they in fact do on every page of their website. It would be strange to consider the use by Danas a loan word but the use by Politika a case of code switching, merely because one happens to use Roman script and the other Cyrillic for what is the same language. --Lambiam 17:30, 8 December 2018 (UTC)

In this particular case, the spelling is a strong indicator of code-switching, as Serbian orthography is phonemic and (unlike Croatian) strongly prefers transcribing foreign names and terms. You could consider onlajn (abundantly attested) a nativized variant, although arguably the choice between these spellings is a matter of personal style. Crom daba (talk) 18:00, 10 December 2018 (UTC)

For an example in English, Москва is citeable (Citations:Москва) but was deleted (Talk:Москва), and Citations:ἄρχων is also citeable (as are, I expect, Arabic-script forms of Allah and PBUH, etc.). (An older Chinese example is Talk:Thames河, deleted in 2011.) - -sche (discuss) 17:54, 8 December 2018 (UTC)

When I read, “With absolute confidence I can boast that my Frittelle di Fiori di Zucca are the best in the world”, I don’t think, “Oh, perhaps we should consider including an entry for the English term frittella di fiori di zucca.” No, I think this is an instance of code switching, and in this case one of a very common type. I think we should not have an English entry oliebol either. Although the term can be found in English texts, it is obviously a Dutch word. There is a need for a test or criterion for when the use of a foreign term is simply code switching, and when the term becomes part of the lexicon of a borrowing language. As I’ve tried to argue above, being written in a different script is not a litmus test. Being included in quotation marks is a strong indicator of not being seen as part of the lexicon, but not all authors will use these when code switching. When the imported term becomes subject to local inflection, or can serve as a component to form new compound words, this is a strong indicator of having become lexicalized, but as a test this does not work for analytic languages like Mandarin. --Lambiam 12:33, 9 December 2018 (UTC)

In my personal experience, code-switched fragments can very easily be inflected and are likely to be joined in compounds to attach them to native sentence structure. Also, lexicalized loans are likely to have defective inflection.

Pronunciation is also no good, since it is extremely speaker- and context-dependent, and lexicalized loans can themselves have a special phonology. Crom daba (talk) 18:07, 10 December 2018 (UTC)

I don't think we can have a coherent policy or test across different languages. Speakers of different languages will absolutely differ in their criteria for what counts as a native word. This is even more difficult with global languages like English where different communities are in contact with a huge variety of other languages from which to borrow. DTLHS (talk) 18:23, 10 December 2018 (UTC)

  Good words, @Crom daba. I want to point out how language is really written on the internet: In printed works or works inspired by print practices there are many things that don’t happen but are unproblematic elsewhere, in unrestrained speech where people can develop their own standards or own morals, unspooked by societal expectation, so to speak in Stirnerian language: Remarkably, nowadays in Russian chats, and I mean those where discussions take place and people try to write correctly, one just writes some foreign words in foreign script and then immediately joins Russian endings in Cyrillic script to them. It’s also the way I think and do it: Writing Russian in Germany, referring to things in Germany without having a notion of a Russian equivalent, I just write German words or English words in Latin script and decline them Russian and in Cyrillic script (without space I mean, you understand; most iconic, I think), and this does not make them Russian. I often can think “Is this word Russian already”? There are some obvious ones that do exist, like everyone uses the word терми́н (termín) in reference to appointments in Germany, a word that does not exist in Russia, and I long did not even know that it doesn’t, it seems so indispensable. This middle ground of dubiosa (is this English or Latin, huh? Not English because of lacking spread) is only left out by me and other dictionary editors often because these words have limited relevance to a greater world and one would look up these words in German dictionaries anyway (as I said earlier, an entry in one language suffices, a Greek entry marketing is otiose), plus they are CFI-problematic (best one can do is quote them from fora and commentaries under articles, perhaps with archive links, but that’s it, these Soviets here don’t produce a corpus that would help to quote Russian as spoken in Germany). 
Separating the words is even more difficult if you look at inter-Slavic conversations: Like is Russian менто́вка (mentóvka, “mint liquor as popular in Bulgaria”) Russian? It is used in Russian texts here and there, and obviously with Russian endings then, but is it perceived as Russian? (With a German legalese term with no equivalent in English, how does the Verkehrsanschauung or Verkehrsansicht see it?) I have also read quite a lot strange words from Russian expats in Serbia and things like that, you could make large lists of such words if you wanted to; theoretically this could lead to having words in Russian written with Cyrillic characters we thought do not exist in Russian – I make here the strange observation that Latin words with foreign diacritics pass easier into texts of other languages but the Cyrillic languages tend more to transcribe all, i. e. having a Russian text with ђ is way more weird than Vietnamese diacritics, Semitic transcriptions and what you can imagine in English texts. And that’s only in Europe, elsewhere things become crazier, which others can describe better.

For the phonetical point, see that legit French words contain pharyngeal fricatives, like hebs (“prison, can”), hnouch (“popo, bacon”). Here we have also an issue arising if we know that a word has passed into French, English, and you can attest it from songs (like they have been printed on CD or are buyable as downloads or else unlikely to vanish, so durable). The flip side of words written in a non-native script are words which have passed but cannot or only with uncertainty be written in the native script. English example: gwop (“moolah”).

Normal dictionaries to a large part avoid such problems because they leave out exotisms, i. e. words for things that do not exist in an area where there is a community of the language documented. With this I lean towards an exclusion ground that is that if a word in English is for a foreign thing and the Verkehrsanschauung does not see the word as English then it is not English. Confer mesdemet! This is “not really English”. What does apply for abstracta then, what is Greek marketing then? This criterion I have just stated becomes difficult for foreign “ways of life”. Maybe Greek marketing is not actually Greek because he who uses such a word ceases to think like a Greek, regardless of the script it is written in. There are many gross things written and said in Arabic or Hindi texts that I would for this reason see as not-Arabic and not-Hindi. And the same criterion can apply to determine if a word has passed from German into Russian.

The issue gets complicated however because there is not only code-switching for Wiktionary but there is also Translingual: You could make a case for “marketing” being Translingual and not only English. I have argued already (User talk:Fay Freak § Translingual) for grammatical terms like genitivus absolutus, status constructus and the like being Translingual in the first place. Maybe “marketing” is translingual because teachers of business and marketing have made it so ex cathedra, which is why it is used in Greek, never able to become Greek. Fay Freak (talk) 20:02, 10 December 2018 (UTC)

Agree. Crom daba (talk) 23:18, 10 December 2018 (UTC)

To me, the better treatment is to put them in the main name space, and pile the obloquy on them there. They need to be found. Not everything in another script is foreign - there was a time when manuscripts of the Pentateuch contained the divine tetragrammaton in what is now called the Phoenician script while the rest of the text was in what is now called the Hebrew script (as a boy I learned to call it the Aramaic script). And if Western Arabic digits are not part of the Thai script, the native Thai interjection 555 (555) is another example. Indeed, a subtler issue is whether the '555' in Thailand-focused English language forums dominated by Europeans is now an English word borrowed from Thai.

One thing that can easily go missing is the pronunciation. I have to say that the Thai pronunciation of words taken from English is frequently far from obvious. Translingual terms seem not to have pronunciations. --RichardW57 (talk) 22:53, 29 December 2018 (UTC)

Linking elements of a term in {{en-noun}}

At l'esprit de l'escalier, should the individual elements of the phrase in {{en-noun}} be linked to French words, like this: {{en-noun|head=[[l'#French|l']][[esprit#French|esprit]] [[de#French|de]] [[l'#French|l']][[escalier#French|escalier]]}}? (Pinging @Per utramque cavernam as we discussed this on the entry talk page.) — SGconlaw (talk) 17:11, 5 December 2018 (UTC)

No. They should be linked in the etymology. DTLHS (talk) 17:13, 5 December 2018 (UTC)

In the case at hand, the link is to an entire French term, esprit de l’escalier. Where should the individual elements be linked, or do we just not link them in this case? I was thinking that since the elements of a term in {{en-noun}} are usually linked by the template anyway, it makes sense to include the links to the French words manually. — SGconlaw (talk) 17:20, 5 December 2018 (UTC)

Those links are one click away. Theoretically it can be different if a French phrase exists only in English or another language, the French not being CFI-compliant as French. Fay Freak (talk) 19:43, 5 December 2018 (UTC)

I usually link to component multi-word terms of a term if they reflect the sense of that term, e.g. black sugar maple would link to black and sugar maple. And, as Fay Freak says, the individual words are just one more click away. It seems unhelpful to make a user guess at whether there are multiword components and which grouping leads to a possible entry. DCDuring (talk) 23:08, 5 December 2018 (UTC)

The {{en-noun}} template links to English terms, though. In this case, the terms are French, so it's not appropriate to link. —Rua (mew) 10:45, 6 December 2018 (UTC)

Generally, yes, but arguably not exclusively. For example, sometimes when an element is not present in the Wiktionary (for example, a person's name), I've seen a link to an English Wikipedia article. I see no reason why links can't be to other languages where appropriate. — SGconlaw (talk) 12:01, 6 December 2018 (UTC)

Because, again, {{en-noun}} creates English links. If you put a French word in there, it will still be an English link. A dead link, moreover. —Rua (mew) 18:43, 10 December 2018 (UTC)

No, it works fine. Try pasting {{en-noun|head=[[l'#French|l']][[esprit#French|esprit]] [[de#French|de]] [[l'#French|l']][[escalier#French|escalier]]}} at Wiktionary:Sandbox. — SGconlaw (talk) 07:07, 13 December 2018 (UTC)
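For anyone who wants to generate such a head parameter rather than type it, here is a small Python sketch. The helper name `link_french_head` is invented for illustration, and the rule for splitting elided forms (cut after a straight apostrophe) is an assumption that happens to fit this phrase; it says nothing about how {{en-noun}} itself behaves.

```python
def link_french_head(phrase):
    """Build an {{en-noun}} call whose head links each element to its French entry."""
    words = []
    for word in phrase.split(" "):
        # Split an elided form such as "l'esprit" into "l'" and "esprit".
        head, apostrophe, rest = word.partition("'")
        pieces = [head + apostrophe, rest] if apostrophe and rest else [word]
        words.append("".join("[[%s#French|%s]]" % (p, p) for p in pieces))
    return "{{en-noun|head=%s}}" % " ".join(words)


print(link_french_head("l'esprit de l'escalier"))
```

The printed wikitext is the same string quoted in the comment above, so it can be pasted into Wiktionary:Sandbox for the same test.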

New sinograph QIOU "poor and ugly"

How should this situation be dealt with in terms of lexicography?

--Backinstadiums (talk) 00:26, 6 December 2018 (UTC)

The same way we deal with any other word or sinograph — add it if it is attested in durably archived media, spanning over a year, etc. (It doesn't look like this is.) —Μετάknowledge^{discuss/deeds} 02:01, 6 December 2018 (UTC)[reply]

quadrumanus

quadrumanus appears in the Cambridge Grammar of the English Language, page 1663; is it a typo or a variant of quadrumanous? --Backinstadiums (talk) 15:58, 6 December 2018 (UTC)[reply]

(This sounds like a Wiktionary:Tea room question. — SGconlaw (talk) 16:02, 6 December 2018 (UTC))[reply]

It is a taxonomic designation (as in Chiropsalmus quadrumanus). Highly unlikely to be an English adjective because of the spelling. Equinox ◑ 16:40, 6 December 2018 (UTC)[reply]

The authors were probably looking for a word that began with quadru and was not formed in Latin, as they are talking about "marginal vowels" as English morphological elements, which in the case of 'quadr' can be i, a, or u. Why they didn't choose quadrumane or quadrumanous for the purpose is beyond me. We could ask them. Maybe it was a typo. DCDuring (talk) 18:16, 6 December 2018 (UTC)[reply]

New Wikimedia password policy and requirements

The Wikimedia Foundation security team is implementing a new password policy and requirements. You can learn more about the project on MediaWiki.org.

These new requirements will apply to new accounts and privileged accounts. New accounts will be required to create a password with a minimum length of 8 characters. Privileged accounts will be prompted to update their password to one that is at least 10 characters in length.

These changes are planned to be in effect on December 13th. If you think your work or tools will be affected by this change, please let us know on the talk page.

Thank you!

CKoerner (WMF) (talk) 20:02, 6 December 2018 (UTC)[reply]

Programming languages

Since Wiktionary includes all languages, does it also include programming languages? --2A01:112F:742:C00:14B9:E7A5:D1B3:F0B3 09:23, 8 December 2018 (UTC)[reply]

No, as they aren't human language (though a few words may rarely get borrowed into English grammar). Equinox ◑ 10:23, 8 December 2018 (UTC)[reply]

Is tlhIngan Hol a human language? --Lambiam 19:26, 8 December 2018 (UTC)[reply]

Eh, it's clearly a totally different kind of thing from a programming language. The only programming language I've ever seen that even inflects verbs is Inform 7. Equinox ◑ 19:34, 8 December 2018 (UTC)[reply]

Programming languages are determined by a language specification, not by usage. That falls under "documentation", not lexicography. DTLHS (talk) 17:31, 8 December 2018 (UTC)[reply]

But the reference manuals for a programming language use terms from that language as if they were English, French etc - so we really ought to have them somehow. SemperBlotto (talk) 14:23, 10 December 2018 (UTC)[reply]

We've had this discussion before. Early programming languages only had a few keywords, but now there are hundreds of frameworks with thousands of named classes (e.g. ExecutionEngineException, HttpMessageInvoker) and each class may have hundreds of named properties, methods and fields. These, too, are listed in manuals and guides. Equinox ◑ 14:30, 10 December 2018 (UTC)[reply]

See Wiktionary:Requests_for_verification/English#caddr as well. - TheDaveRoss 14:31, 10 December 2018 (UTC)[reply]

Take this sentence from a book on conversational French: “Bonjour is usually used until around six p.m., whereas bonsoir is used after six p.m.” In a book on French you can expect to find French words used as nouns in English sentences. Only, they are not used with their French meaning. They stand for themselves. So these sentences mention the word in the sense of the use–mention distinction. Likewise, the English sentence “esac is case spelled backward, rather like fi is if spelled backward” only mentions these keywords. To understand the sentence you don’t have to know the meaning of any of these words. On the other hand, grep, originally just another computer command, can be used as a verb (“I grep, he greps, we grepped”), so it clearly has become lexicalized and merits inclusion. --Lambiam 18:12, 10 December 2018 (UTC)[reply]

Appendix:Reference detail

According to the description: "This appendix provides detail to sources linked by Wiktionary. It is to be linked from reference templates." It contains three items, all created by User:Dan Polansky. Is this a new policy? The only reason I've noticed it is that Dan changed one of the Hungarian reference templates. I'd prefer to link directly from the template to its corresponding website and not to an appendix. Was there a Beer parlour discussion or vote on this? Panda10 (talk) 15:25, 9 December 2018 (UTC)[reply]

It is not a new policy, not anything mandatory and rigid. If you don't like my change in Template:R:TotfalusiEty 2005, please revert it. The point of the appendix is to provide more information than comfortably fits in the mainspace, e.g. English rendering of the title. Some reference templates link to Wikipedia, which is similar in that it does not lead to the main website of the reference. --Dan Polansky (talk) 15:31, 9 December 2018 (UTC)[reply]

Dan, thanks for your prompt reply. I do see your point, but for now, if you don't mind, I will revert the changes until it is decided by the community how to standardize reference templates. Panda10 (talk) 16:46, 9 December 2018 (UTC)[reply]

Thank you. I realized we could link to the appendix via "→Detail", without losing the immediate link to the dictionary website. I added the link as a proposal. --Dan Polansky (talk) 08:13, 15 December 2018 (UTC)[reply]

@Dan Polansky: I'm not sure. The only extra information the Appendix provides is the English translation of the book title. There have to be some other benefits of such an Appendix because it has to be maintained. Who will do it? I appreciate that you care about this but maybe in the future you could demonstrate the proposals using the Czech reference templates? I don't see any of them in the Appendix. :) Thanks! Panda10 (talk) 14:51, 15 December 2018 (UTC)[reply]

@Panda10: The appendix needs to be maintained no less than the templates themselves. Furthermore, once correct information is entered, I do not see much of a need for further updates. As for Czech reference templates, I now added {{R:PSJC}} and {{R:SSJC}} to the appendix, and I am glad all the detail I added is not in the template display for the mainspace. In the appendix, I have stated how many entries there are in the dictionaries. --Dan Polansky (talk) 09:25, 16 December 2018 (UTC)[reply]

@Dan Polansky: I'm still not convinced. But if you find this system useful, it's fine to add all the Czech reference templates to the Appendix. As for the Hungarian template, I will revert the change. Panda10 (talk) 18:01, 16 December 2018 (UTC)[reply]

I'm letting it be now, I guess, but let me note that I don't understand it. I think it pretty obvious that the reader was better off having a link to a page with more detail, including English rendering of the title and the number of entries in the dictionary. --Dan Polansky (talk) 18:05, 16 December 2018 (UTC)[reply]

@Dan Polansky: I see too what you want. Actually instead of listing references in Appendices a technical solution that I consider agreeable is to have the transliterations, transcriptions and translations present in the templates but not shown without clicking to collapse – no? @Panda10: On this page, section 3 is actually about standardizing the information given by reference templates but the community does nothing, you could weigh in too. Fay Freak (talk) 20:21, 10 December 2018 (UTC)[reply]

Interesting BBC articles

An interesting BBC article on an analysis of Twitter that traces the geographic rise and spread of neologisms in American English: Feeling litt? The five hotspots driving English forward (4 May 2018). -Stelio (talk) 08:26, 12 December 2018 (UTC)[reply]

Mentioned terms: amirite, baeless, baeritto, balayage, boolin', bruuh, candids, celfie, famo, faved, figgity, gainz, idgt, litt, litty, lituation, lordt, on fleek, rekt, scute, senpai, shordy, slayin, traphouse, waifu, wce, yaaaas (mostly missing from Wiktionary at the time of posting).

Another one on anti-languages: The secret “anti-languages” you’re not supposed to know (12 Feb 2016).

Gobbledygook: erectify, flackoblots, luxurimole
Thieves' Cant: bounge, bowsing ken, lower, Rome [4][5]
Elizabethan criminals: bawdy basket, counterfeit crank, dell, doxy, jarkman, prigger of prancers
Boobslang: cue ball, double yoker, goodnight kiss, under the thumb [6][7]
1980s American conmen: apple, egg, fink, Mr Bates [8]
Polari: bona, cottage, omi, upright, vada [9]
Gang culture: berry, chocolate cake, elroys, German chocolate cake, hubba, Penelope, slab
Prostitution: mongering, practicing the hobby, trolling [10]

- Stelio (talk) 13:28, 12 December 2018 (UTC)[reply]

Thanks for sharing. The inaugural lecture video has some more details, and the data and scripts are available as well. It would be interesting to apply this to other languages. – Jberkel 22:11, 14 December 2018 (UTC)[reply]

English words with contraction-'s, etc

A recent RFD got me thinking: by vote, we don't allow entries for words with possessive-'s, with only a few exceptions. Do we have any policy on which contractions are allowed? I've created some more interesting ones myself (double and triple contractions), but...it seems like contraction-'s can be added to as many words as possessive-'s. Just googling the first few words from various parts of speech that pop into my head, I can find citations of all of them: not just nouns like cat's but difficult's (see google books:"difficult's an", etc), write's (google books:"If any line I write's a nobbler", etc), wow's (google books:"wow's an"), see also google books:dogs're. I presume we don't want entries for all of these! (The small set of ones attached to pronouns (he's, y'all'd've, etc) are worth keeping, IMO.) - -sche (discuss) 18:00, 12 December 2018 (UTC)[reply]

Yes, not only -'s but also -'re and -'ve can probably be attested for many words: "The party's over, the drinks've run out, the people're going home". I think we can do without creating entries for all of these! Could we limit it just to personal pronouns? Mihia (talk) 17:29, 22 April 2019 (UTC)[reply]

Oh yeah!

Guess who has cracked 600,000 edits. [11] That works out at about 150 a day on average, though in practice I have some days when I don't come to Wiktionary at all and some days when I hammer away at it like a lunatic for eight hours. Equinox ◑ 00:06, 14 December 2018 (UTC)[reply]

I just love how the xtools edit counter has a big red banner at the top saying, "User has made too many edits!" Tsk, tsk. Andrew Sheedy (talk) 02:07, 14 December 2018 (UTC)[reply]

Impressive. And I haven't got to the half million mark yet. SemperBlotto (talk) 07:08, 14 December 2018 (UTC)[reply]

Impressive. That's over 1% of all the edits made on this site. - -sche (discuss) 22:35, 14 December 2018 (UTC)[reply]

Not so impressive. You wouldn't even get into the top 15 in Wikipedia. Also, perhaps you should get help for your wiki-addiction. What is impressive, however, is Wonderfool's hitting 300,000 despite around 130 blockings. --Mustliza (talk) 10:57, 15 December 2018 (UTC)[reply]

How about that top 15 on Wikipedia, though? bd2412 T 01:43, 16 December 2018 (UTC)[reply]

How on earth am I in sixth place? I feel like a lot of editors are way more active than I am. —Rua (mew) 23:24, 15 December 2018 (UTC)[reply]

You deserve some kind of medal, made of some kind of metal, for showing some kind of mettle. bd2412 T 01:44, 16 December 2018 (UTC)[reply]

Any use for a "rare character" index?

Hello! There was recently a discussion at Extension:CirrusSearch about creating a new search index for "rare" characters that are currently not indexed by the on-wiki search engine. The three examples of difficult-to-find characters given were ☥ (Ankh), 〃 (ditto mark), and 〆 (ideographic closing mark). (Note that you can currently do an insource regex search like insource:/☥/, but on large wikis this is guaranteed to time out and not give complete results, and it is extremely inefficient on the search cluster.)

We can't index everything—indexing every instance of e or . would be very expensive and less useful than ☥, for example. So, in English, we would ignore A-Z, a-z, 0-9, space, and most regular punctuation (exact list TBD) and index pretty much everything else.

The most plausibly efficient way to implement such an index would only track individual characters at the document level, so you could search for documents containing both ☥ and 〆, but you could not specify a phrase like "☥ 〆" or "〆 ☥", or a single "word" like ☥☥ or 〆☥.
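
A minimal sketch of a document-level character index as described above. It is not the CirrusSearch implementation: the ignore list is an assumption (ASCII letters, digits, whitespace and ASCII punctuation, since the exact list is still TBD), and the function names are hypothetical.

```python
import string
from collections import defaultdict

# Assumption: the "common" set that is skipped; the real list is undecided.
COMMON = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each rare character to the set of document titles containing it."""
    index = defaultdict(set)
    for title, text in docs.items():
        for ch in set(text) - COMMON:  # one posting per document, no positions
            index[ch].add(title)
    return index


def search_all(index, chars):
    """Documents that contain every one of the characters (AND)."""
    sets = [index.get(c, set()) for c in chars]
    return set.intersection(*sets) if sets else set()


def search_any(index, chars):
    """Documents that contain at least one of the characters (OR)."""
    return set().union(*(index.get(c, set()) for c in chars))


docs = {"A": "ankh ☥ and mark 〆", "B": "only ☥☥ here", "C": "plain text."}
index = build_index(docs)
print(search_all(index, "☥〆"))
print(search_any(index, "☥〆"))
```

Since only document membership is stored, the sketch cannot tell a single ☥ from ☥☥ or recover the order of ☥ and 〆, which mirrors the phrase limitation mentioned above.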

I've opened a Phabricator ticket T211824 to more carefully investigate such a rare character index, to get a sense of how big it would be and what resources it would take to support it. If you have any ideas about specific use cases and how this would or would not help with them, or any other thoughts, please reply here or on the Phab ticket. (Increased interest increases the likelihood of this moving forward, albeit slowly, over the next year.)

Thank you! TJones (WMF) (talk) 16:27, 14 December 2018 (UTC)[reply]

One thing that comes to mind immediately is searching for control characters, private use block characters and unusual whitespace characters. It would be even more useful if such characters could be grouped together in a single search. DTLHS (talk) 16:35, 14 December 2018 (UTC)[reply]

We haven't thought too much yet about how the keyword for this would work. Parsing the query carefully so you can search for whitespace characters is always tricky. So, suppose the keyword is char:, then searching for documents with both ☥ and 〆 could be char:☥ char:〆, while searching for either would be char:☥ OR char:〆. We could have a special syntax like char:☥〆, which is more efficient, but would that be an implicit AND or an implicit OR? Either could be confusing; for example, searching for char:Иван would only incidentally actually find the name Иван.

For control or whitespace characters, being able to specify them by number would probably be useful, so \u2002 or U+2002 for an 'en space'. For all three use cases, it sounds like you'd want OR, not AND as your combining operation, so you'd have to spell them all out, like char:\u2002 OR char:\u2003 OR char:\u2004 OR char:\u2005 ... for whitespace characters. I can see how something like char:\u2002-\u200D would be useful, but on the back end that would balloon into a fairly expensive search, and something like char:\uE000-\uF8FF for the whole Private Use Area or char:\uF0000-\uFFFFF for whole Supplementary Private Use Area-A would explode into ~6,400 or ~65,000 search terms on the back end, which we could not support. I could see maybe allowing specifying a range, but it would have to throw an error for more than some limit of characters in the range. (10? 20? 50?)
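
A minimal sketch of the range idea, assuming the hypothetical `char:` keyword and a hypothetical cap of 50 characters per range (the comment above leaves the limit open). It only builds the expanded OR query string and checks the term counts quoted above.

```python
MAX_RANGE = 50  # hypothetical limit; the thread suggests 10, 20 or 50


def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def expand_range(first: int, last: int, limit: int = MAX_RANGE) -> list[str]:
    """Expand a code point range into individual char: terms, or refuse."""
    size = range_size(first, last)
    if size <= 0:
        raise ValueError("empty range")
    if size > limit:
        raise ValueError(f"range has {size} characters, limit is {limit}")
    return [f"char:\\u{cp:04X}" for cp in range(first, last + 1)]


def or_query(first: int, last: int) -> str:
    """Join the expanded terms with OR, as in the hand-written example."""
    return " OR ".join(expand_range(first, last))


print(or_query(0x2002, 0x2005))
print(range_size(0xE000, 0xF8FF))    # Private Use Area
print(range_size(0xF0000, 0xFFFFF))  # Supplementary Private Use Area-A
```

The two private use ranges come to 6,400 and 65,536 code points, so both would be rejected by any of the limits floated above, while \u2002-\u200D (12 characters) would pass a limit of 20 or 50.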

Were you hoping to search for an entire private use area at once, or just a limited range of characters? Thanks for the interesting use cases! TJones (WMF) (talk) 18:40, 14 December 2018 (UTC)[reply]

Yes, the whole private use area. Maybe that's not such a good fit for this request since I'm more interested in the boolean value "does this page have a private use area character in it or not", and not specifically which character it is. DTLHS (talk) 18:50, 14 December 2018 (UTC)[reply]

It might be possible to also index by Unicode block, so if I dig into this, I'll try to get a sense of what that looks like, too. Though I wouldn't expect it to be in the first version if we get that far. TJones (WMF) (talk) 19:30, 14 December 2018 (UTC)[reply]

For our purposes, Wiktionary entry names and links to entry names are far more important in searching for special characters: it helps to know when they include zero-width non-joiners, left-to-right markers, punctuation/whitespace outside of the Basic Latin Block, combining diacritics, or anything else that might produce a visual duplicate with different encoding. A different issue is mixing of scripts: Latin-script English paca and Cyrillic-script Russian раса (rasa) are fine, but we want to know when there's something like ~~pаcа~~ that has both Latin and Cyrillic, for instance. You might think of it as a multilingual version of antispoofing. Chuck Entz (talk) 20:08, 14 December 2018 (UTC)[reply]

Happily left-to-right marks and some other bidirectional control characters (see these testcases) are automatically removed from titles (as mentioned in Manual:Title.php § Canonical forms), so they would only need to be indexed when they appear in article text. For instance, ab%E2%80%8Ec (with a percent-encoded left-to-right mark) links to abc. — Eru·tuon 22:40, 17 December 2018 (UTC)[reply]
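
A minimal sketch of what that canonicalization amounts to for the example above. The exact set of stripped characters is an assumption here (U+200E, U+200F and U+202A through U+202E); the authoritative list is whatever Manual:Title.php documents, and `canonical_title` is a hypothetical name.

```python
import re
from urllib.parse import unquote

# Assumption: the bidirectional controls removed from titles.
BIDI_CONTROLS = re.compile("[\u200e\u200f\u202a-\u202e]")


def canonical_title(encoded: str) -> str:
    """Percent-decode a title and drop bidirectional control characters."""
    return BIDI_CONTROLS.sub("", unquote(encoded))


print(canonical_title("ab%E2%80%8Ec"))  # the left-to-right mark disappears
```

The percent sequence `%E2%80%8E` is the UTF-8 encoding of U+200E, so the decoded string has four code points before stripping and three after.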

We already have that capability with the standardChars field in Module:languages. Searching inside entries for specific characters is more challenging. DTLHS (talk) 20:11, 14 December 2018 (UTC)[reply]

Latin/Cyrillic homoglyph detection and correction is a sometime hobby of mine on my volunteer account—so I know what a pain that can be. Did you know that intitle: now supports regex searches? This search finds titles (or redirects) that have a Cyrillic and Latin character adjacent to each other: intitle:/([Ѐ-ԯ][A-Za-zÀ-ɏɐ-ʯ]|[A-Za-zÀ-ɏɐ-ʯ][Ѐ-ԯ])/ (no link, because it's an expensive query, so you have to want it enough to copy-n-paste). There are some false positives with redirects that have been fixed, and with Kabardian and a few other languages that do seem to actually mix scripts, so къуэкIыпIэ is probably right, but ларпурлартизaм looks like the final a is Latin. Anyway, intitle: searches on regexes still time out (it's just too expensive to scan for everything), but they probably get closer to completion than insource: queries, which have more text to scan.
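
For checking a local list of titles (for example from a dump) rather than running the expensive on-wiki query, the same character classes can be written as a Python regex. This is a sketch of the pattern quoted above, with the ranges spelled out as code points; `has_adjacent_mix` is a hypothetical helper.

```python
import re

CYRILLIC = "\u0400-\u052F"                   # Ѐ-ԯ
LATIN = "A-Za-z\u00C0-\u024F\u0250-\u02AF"   # A-Za-zÀ-ɏɐ-ʯ

# A Cyrillic character directly followed by a Latin one, or the reverse.
MIXED = re.compile(f"[{CYRILLIC}][{LATIN}]|[{LATIN}][{CYRILLIC}]")


def has_adjacent_mix(title: str) -> bool:
    """True if a Latin and a Cyrillic character sit next to each other."""
    return MIXED.search(title) is not None


latin_paca = "paca"
cyrillic_rasa = "\u0440\u0430\u0441\u0430"   # раса
spoof = "p\u0430c\u0430"                     # Latin p, c with Cyrillic а
print(has_adjacent_mix(latin_paca), has_adjacent_mix(cyrillic_rasa), has_adjacent_mix(spoof))
```

Like the on-wiki pattern, this only catches scripts that touch; a title such as b-кварк, where a hyphen separates the scripts, is not flagged.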

Anyway, it sounds like a second rare-character index for titles would be helpful for finding zero-width joiners, LTR/RTL markers, etc. in titles. Finding them specifically in links would be harder. They do get stripped from search terms, which is what I usually pay attention to. TJones (WMF) (talk) 20:47, 14 December 2018 (UTC)[reply]

Actually, that Kabardian word should have a palochka (here on Wiktionary, the lowercase one, ӏ; elsewhere often the uppercase, Ӏ) instead of a capital Latin letter I (see Kabardian orthography on Wikipedia). But about mixed scripts, on Wikipedia someone posted some words from Halkomelem, which adds the Greek letter theta into an otherwise Latin alphabet. (I was surprised that there wasn't a Latin theta character, because theta is regularly used in the IPA.) — Eru·tuon 00:28, 15 December 2018 (UTC)[reply]

@Chuck Entz: Here is a list of titles with both Latin and Cyrillic characters from the December 1st dump. Looks like there are a few quark words (like b-кварк) for which this isn't an error. [Edit: See also User:Keith the Koala/Mixed character sets, though it is not up-to-date.] — Eru·tuon 01:16, 15 December 2018 (UTC)[reply]

@Erutuon: I would go through that list and fix them, but right now there are too many redirects and valid uses of the palochka. If you could exclude those, the list would have much fewer false positives. —Μετάknowledge^{discuss/deeds} 02:45, 15 December 2018 (UTC)[reply]

@Metaknowledge: The palochka belongs to the Cyrillic script, so anything in the list with a palochka lookalike (like the aforementioned къуэкIыпIэ) needs fixing. — Eru·tuon 04:27, 15 December 2018 (UTC)[reply]

I've removed all the redirects. — Eru·tuon 04:40, 15 December 2018 (UTC)[reply]

@Erutuon: Thanks. I honestly can't remember the outcome of the old discussions about what to do with different ways to encode the palochka in Caucasian languages. @Atitarev? —Μετάknowledge^{discuss/deeds} 05:43, 15 December 2018 (UTC)[reply]

@Metaknowledge, Erutuon: I don't remember the exact outcome either BUT when Roman letters, numbers or "|" substitute for palochka (upper or lower case), they are definitely wrong but could be used as redirects, since the use of palochka proper is still uncommon. The correct/normalised spelling for Kabardian къуэкӏыпӏэ (qʷɛkʼəpʼɛ) is къуэкӏыпӏэ (qʷɛkʼəpʼɛ), using the lower case palochka ӏ but some people think we should use the upper case palochka Ӏ: къуэкӏыпӏэ (qʷɛkʼəpʼɛ). It's the form used when palochka was first introduced and there was no upper case/lower case distinction. Both forms look alike and the lower case palochka was added much later by the Unicode. In my opinion, we should use upper case Ӏ and lower case palochka ӏ following the capitalisation rules of the corresponding languages as intended. Lookalikes: !, 1, |, I, l should be all replaced with Ӏ/ӏ. --Anatoli T. ^{(обсудить}/^вклад) 06:20, 15 December 2018 (UTC)[reply]
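
A minimal sketch of the lookalike replacement, under two stated assumptions: the target is always the lower-case palochka ӏ (U+04CF), and a lookalike counts as a palochka only when it touches a Cyrillic letter. Both are simplifications; capitalisation rules per language are not handled, and a genuine "!" or "1" next to Cyrillic text would be misread, so this suits entry titles rather than running text.

```python
import re

PALOCHKA_LOWER = "\u04CF"   # ӏ
LOOKALIKES = "!1|Il"        # the substitutes listed above
CYR = "\u0400-\u04FF"       # basic Cyrillic block

# A lookalike directly after or directly before a Cyrillic letter.
PATTERN = re.compile(f"(?<=[{CYR}])[{LOOKALIKES}]|[{LOOKALIKES}](?=[{CYR}])")


def normalize_palochka(word: str) -> str:
    """Replace palochka lookalikes that are adjacent to Cyrillic letters."""
    return PATTERN.sub(PALOCHKA_LOWER, word)


# Kabardian word typed with a capital Latin I in place of the palochka.
typed = "къуэк" + "I" + "ып" + "I" + "э"
fixed = normalize_palochka(typed)
print(fixed, [hex(ord(c)) for c in fixed if c == PALOCHKA_LOWER])
```

Purely Latin words are left alone because no lookalike in them has a Cyrillic neighbour.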

@Atitarev, Erutuon: I have now fixed everything on the list except for legitimate/unclear uses and palochkas. Anatoli, would you be willing to move the palochka entries as you see fit, leaving redirects behind? —Μετάknowledge^{discuss/deeds} 06:24, 15 December 2018 (UTC)[reply]

@Erutuon, Metaknowledge: I think it's mostly done. I'm not sure what to do about the Akhvakh term жиᵸво (žı̇̃vo, “cow”), which contains the letter ᴴ. It's a very poorly documented language. Letter "ᴴ", according to the Russian Wikipedia, is only used to display the pronunciation and in real life vowels are written without it. The information is based on the newspaper «Ахвахцы — Ашвадо». I don't know if there is a Cyrillic equivalent for this superscript letter. --Anatoli T. ^{(обсудить}/^вклад) 00:28, 18 December 2018 (UTC)[reply]

@Atitarev It looks like there is a "MODIFIER LETTER CYRILLIC EN" (U+1D78), but it also looks like font support for it is kind of weak (but not much worse than the Latin version). That would give "жиᵸво", for example. It has the benefit that it normalizes to regular Cyrillic н for search, too! TJones (WMF) (talk) 19:36, 18 December 2018 (UTC)[reply]

@TJones (WMF): Sorry for taking time to respond. I have moved the entry to живо (živo), since Cyrillic ᵸ or Roman ᴴ are considered non-mandatory diacritics. I used "жиᵸво" in the display (жиᵸво (žı̇̃vo)) and converted both жиᴴво and жиᵸво to hard redirects to живо (živo). --Anatoli T. ^{(обсудить}/^вклад) 01:37, 29 December 2018 (UTC)[reply]

@Atitarev: No worries! With the end of the year and the holidays everything has been a little slow to happen. That looks like a good compromise. Thanks for worrying about the details! TJones (WMF) (talk) 19:25, 2 January 2019 (UTC)[reply]

To the invisible characters Chuck has mentioned (namely ZWNJs and LTR and RTL marks) as being undesirable in pagenames, and thus desirable to find, I would add: soft hyphens. Currently, all of these are caught by periodic checks of database dumps, as mentioned in Wiktionary:Todo#Semi-regular_tasks; being able to find the characters in a way that didn't require downloading database dumps would make it easier for more people to check for them more often. (This gives me an idea about MediaWiki:Titleblacklist which I will raise in a new section!) - -sche (discuss) 06:35, 15 December 2018 (UTC)[reply]

ZWNJs are quite desirable in pagenames for certain languages, e.g. Persian. You'd have to sort by language just to filter out all the good examples of ZWNJs being used. —Μετάknowledge^{discuss/deeds} 06:37, 15 December 2018 (UTC)[reply]

Good point. I suppose one might do a search like char:[ZWNJ] insource:-Persian. - -sche (discuss) 06:47, 15 December 2018 (UTC)[reply]

We could probably make a regex that matches ZWNJ in a position where it actually has a visible effect, for instance between a left- or dual-joining Arabic character, zero or more characters transparent to joining, and a right- or dual-joining character. (I imagine it would be long.) But it would have to be applied in the following manner: if the title contains ZWNJ, forbid it unless it matches this regex. Not sure if that's possible. I did notice there is MediaWiki:Titlewhitelist though. Maybe ZWNJ can be unequivocally blacklisted in MediaWiki:Titleblacklist, but then whitelisted under limited circumstances in MediaWiki:Titlewhitelist. — Eru·tuon 07:16, 15 December 2018 (UTC)[reply]

I mean, we don't have to blacklist ZWNJs if it would be problematic/complicated, we could just keep making periodic database-dump checks for them (excluding Persian), and only blacklist things that are indeed always unwanted. - -sche (discuss) 16:38, 15 December 2018 (UTC)[reply]

The problem with any regex approach on English Wiktionary or other large wikis is that unless there is a clear trigram that the regex acceleration can latch onto, the search will have to do a text scan, which will time out, and you are going to get incomplete results, which is a bummer. A rare character title index would actually be great for regexes built around specific characters. char:[ZWNJ] insource:/<complex regex with ZWNJ>/ would actually be likely to finish because the pool of docs with a ZWNJ somewhere in them would be relatively small. TJones (WMF) (talk) 21:06, 17 December 2018 (UTC)[reply]

I can’t a priori exclude that bidirectional control characters might appear legitimately in pagenames, and I can only warn since they do have a purpose and invectives against them frequently lead to gold-plating. They are just unlikely to be needed, as multiple scripts are also unlikely to be needed in page names. I could imagine some mixed chat slang using Latin and Arabic or Hebrew script needing bidi characters, however far away the creation of pertinent pages might now be. Though definitely any bidirectional control sign should throw warnings. The direction-overriding U+202D and U+202E can be blacklisted though. Fay Freak (talk) 14:21, 15 December 2018 (UTC)[reply]

A bit off-topic, but I wonder if we might want to use such control characters more often in the {{DISPLAYTITLE}} magic word. I note that on my computer, the headword line at بند «پ» looks as it should, but the pagetitle does not. —Μετάknowledge^{discuss/deeds} 05:43, 17 December 2018 (UTC)[reply]

@Metaknowledge: My scriptTitles script fixes the top header of بند «پ» by adding a script class. Control characters would work, but I don't think they are the preferred method in HTML. — Eru·tuon 06:20, 17 December 2018 (UTC)[reply]

Soft hyphens are with the others on my list of commonly-encountered invisible characters. I monitor them when I make language analysis changes, and I've lobbied Elasticsearch and Lucene to strip them in their default language analysis chains. Anyway, a rare character index on titles would make finding them easier, so I'll add it to my list. Thanks! TJones (WMF) (talk) 21:06, 17 December 2018 (UTC)[reply]

Use MediaWiki:Titleblacklist to block titles with undesirable invisible characters

It occurs to me that we could use this to prevent pagenames from containing various undesirable invisible characters which persistently creep up, like soft-hyphens (a recurring problem when people copy-paste words from certain other sites), couldn't we? My understanding is that pages containing those characters would thereafter be impossible to create, but presumably our existing entries on the characters themselves would be unaffected(?)—or if not, we could move them to Unsupported_titles/. - -sche (discuss) 06:37, 15 December 2018 (UTC)[reply]
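
A minimal sketch of the check such a rule would perform, written as a standalone script rather than in Titleblacklist syntax. The character list is an assumption drawn from this discussion (soft hyphen, LTR/RTL marks, and the overrides U+202D and U+202E); ZWNJ is deliberately left out because it is legitimate in Persian titles. `find_unwanted` is a hypothetical name.

```python
import unicodedata

# Assumption: characters that are never wanted in an entry title.
UNWANTED = {"\u00AD", "\u200E", "\u200F", "\u202D", "\u202E"}


def find_unwanted(title: str) -> list[tuple[int, str]]:
    """Return (position, Unicode name) for each unwanted invisible character."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(title)
        if ch in UNWANTED
    ]


pasted = "dic\u00ADtionary"   # soft hyphen copied along with the word
print(find_unwanted(pasted))
print(find_unwanted("dictionary"))
```

A single-character title (the entry for the character itself) would also be reported, so a real rule would need an exception for those pages, as noted above.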

An abuse filter would also work. Which one is more user friendly? DTLHS (talk) 06:42, 15 December 2018 (UTC)[reply]

Good point, and an abuse filter could also warn against and block or tag these in article bodies. OTOH, we can only have abuse filters do so many things before they run out of resources. As for user-friendly: it seems to be possible to display a customized message to anyone adding a blacklisted title (like w:MediaWiki:Titleblacklist-custom-imagename), which might be more friendly(?) than the messages abuse filters theoretically display, since those usually don't display for me (I see only the "short descriptions" like "ref-no-references") and apparently other users, based on confused feedback we've gotten from users wondering why their edits were blocked. - -sche (discuss) 07:08, 15 December 2018 (UTC)[reply]

Christmas competition

Hey all. I made a new Christmas competition. You have until Nanakusa-no-sekku to submit an entry. --Mustliza (talk) 10:52, 15 December 2018 (UTC)[reply]

10-year-olds

Another important announcement...another entry has hit 10 years. The one in question is Dakasian, which has been sitting in WT for 10 whole years without being corrected. It was made by some prat called Jackofclubs (talk • contribs). I wonder what came of him... --Mustliza (talk) 11:07, 15 December 2018 (UTC)[reply]

I just touched the 10-year-old. Equinox ◑ 13:25, 15 December 2018 (UTC)[reply]

Why don't you have a seat right over there? Dixtosa (talk) 17:01, 15 December 2018 (UTC)[reply]

@Mustliza: Wonderfool, how do you know it is of English origin? Looks like an Anglicization of Armenian Դաքեսյան (Dakʻesyan). --Vahag (talk) 13:46, 15 December 2018 (UTC)[reply]

Are the Dakasians something I should be keeping up with? Equinox ◑ 15:08, 15 December 2018 (UTC)[reply]

IIRC, VP thinks that all words are of Armenian origin. He may be right, though - this website shows Dakasians with first names Hayz, Vahan, Hagop and Vesta. --Mustliza (talk) 20:23, 15 December 2018 (UTC)[reply]

I wish I could see the mugs of these people. I can identify an Armenian face with a 99% accuracy. --Vahag (talk) 12:15, 16 December 2018 (UTC)[reply]

On Wikipedia, we had a project a while back to identify the oldest and longest untouched pages, touched by the fewest editors. We had a bot assign points based on the age of the page, age of the last edit, and number of people who had edited it. We did come up with a lot of problematic pages that way. bd2412 T 01:47, 16 December 2018 (UTC)[reply]
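
A minimal sketch of that kind of scoring. The actual formula used by the Wikipedia bot is not given here, so the weighting below (page age plus idle time, divided by the number of editors) is purely an assumption for illustration, and `PageInfo` and `staleness_score` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageInfo:
    title: str
    created: date
    last_edited: date
    editor_count: int


def staleness_score(page: PageInfo, today: date) -> float:
    """Higher for old pages, long-untouched pages, and pages with few editors."""
    age_days = (today - page.created).days
    idle_days = (today - page.last_edited).days
    # Assumed weighting: the real bot's formula is unknown.
    return (age_days + idle_days) / max(page.editor_count, 1)


today = date(2018, 12, 16)
pages = [
    PageInfo("old untouched", date(2008, 12, 15), date(2008, 12, 15), 1),
    PageInfo("old but busy", date(2008, 12, 15), date(2018, 12, 1), 40),
    PageInfo("new", date(2018, 11, 1), date(2018, 11, 2), 1),
]
ranked = sorted(pages, key=lambda p: staleness_score(p, today), reverse=True)
print([p.title for p in ranked])
```

Sorting by the score puts the decade-old, single-editor page first, which is the kind of entry this thread is about.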

Who should we talk to about running that bot here? - -sche (discuss) 23:07, 16 December 2018 (UTC)[reply]

You don't need a bot, they're listed on Special:AncientPages. DTLHS (talk) 00:13, 17 December 2018 (UTC)[reply]

Another special page with false and annoying message: "Updates for this page are currently disabled. Data here will not presently be refreshed." DCDuring (talk) 03:33, 17 December 2018 (UTC)[reply]

So change the message! I think this is sth admins can do. --Mustliza (talk) 06:54, 17 December 2018 (UTC)[reply]

Maybe not in this case; there isn't a single editable page with "Updates for this page are currently disabled" in it, besides this one. It's probably built in. — Eru·tuon 07:01, 17 December 2018 (UTC)[reply]

So build it out! I think this is sth admins can do. --Mustliza (talk) 07:08, 17 December 2018 (UTC)[reply]

Tell me how to do it and I will do it. I also don't know who to ask or what to ask for. DCDuring (talk) 08:17, 17 December 2018 (UTC)[reply]

MediaWiki:Querypage-no-updates is the "no longer updated" message, and MediaWiki:Perfcachedts is the message about caching. Neither of those messages is specific to the page itself, so modifying them will possibly alter other pages unintentionally. MediaWiki:Ancientpages-summary can add text to the top of the page, but can't remove the existing text (it seems). We could suppress the message with javascript or CSS (class is mw-querypage-no-updates). - TheDaveRoss 19:51, 27 December 2018 (UTC)[reply]

@TheDaveRoss: Huh, so I was wrong when I thought that page didn't exist. How about just using a parser function, which I've just done? — Eru·tuon 20:38, 27 December 2018 (UTC)[reply]

Just to be clear, are we editing this text because the page is in fact regularly updated? - -sche (discuss) 20:48, 27 December 2018 (UTC)[reply]

It's been updated since I last visited it. The most recent update was on December 21, and I visited it before that time while the original discussion was still going on. Since that's about six days ago now, apparently it's only infrequently updated. — Eru·tuon 21:54, 27 December 2018 (UTC)[reply]

There are a few pages that are in fact updated twice a month or so that have the warning. There are many more that are updated on a similar schedule that don't have the notice. If it requires a rectal tonsilectomy to change the notice, I'm sorry that I asked. DCDuring (talk) 22:08, 27 December 2018 (UTC)[reply]

Also, some wiki-magic which may be of use to people in these situations, if you append ?uselang=qqx to a URL it will show the names of all messages in parenthesis, which makes them much easier to find. - TheDaveRoss 19:59, 27 December 2018 (UTC)[reply]
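
As a rough illustration of the ?uselang=qqx tip above, here is a minimal Python sketch that adds the parameter to a URL. The helper name with_qqx is hypothetical and only standard-library calls are used; it is a convenience sketch, not part of MediaWiki.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_qqx(url: str) -> str:
    """Return the URL with uselang=qqx set, keeping other query parameters.

    Hypothetical helper for illustration: any existing uselang value is
    replaced so that the interface shows message names instead of text.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except a previous uselang setting.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "uselang"
    ]
    query.append(("uselang", "qqx"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_qqx("https://en.wiktionary.org/wiki/Special:AncientPages"))
```

Opening the resulting URL in a browser should show the message names in parentheses, as described above.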

Selection of the Tremendous Wiktionary User Group representative to the Wikimedia Summit 2019

Dear all,

Sorry for posting this message in English and for the last-minute notification. The Tremendous Wiktionary User Group can send one representative to the Wikimedia Summit 2019 (formerly "Wikimedia Conference"). The Wikimedia Summit is a yearly conference of all organizations affiliated with the Wikimedia movement (including our Tremendous Wiktionary User Group). It is a great place to talk about Wiktionary's needs with the chapters and other user groups that make up the Wikimedia movement.

For context, there is a short report on what happened last year. The deadline is very close, about 24 hours away: the last date for registration is 17 December 2018. As a last-minute effort, a page has been created on Meta to decide who will represent the user group at the Wikimedia Summit.

Please feel free to ask any question on the wiktionary-l mailing list or on the talk page.

For the Tremendous Wiktionary User Group, -- Balajijagadesh 05:56, 16 December 2018 (UTC)[reply]

Who wants to go to Berlin? Does anyone know whether there is any money for travel? Otherwise, it will probably be dewikt that sends someone. DCDuring (talk) 22:15, 16 December 2018 (UTC)[reply]

@Psychoslave I missed the 2018 report when it was published. Lots of interesting points in there, but I doubt many people have read or noticed it; the talk page mentions a translation which never happened. – Jberkel 06:34, 17 December 2018 (UTC)[reply]

I'd love to go to Berlin! I'll be your representative. So send me there. I think this is sth admins can do. --Mustliza (talk) 06:52, 17 December 2018 (UTC)[reply]

For some reason I have been notified about this old message. Is there still a demand for translation here? Psychoslave (talk) 23:03, 1 July 2023 (UTC)[reply]

Hello everybody. I'm pleased to see there was a vote for sending someone to the summit; I hope this will pass the whole selection process. Sorry for the report not being available in any language other than French. I haven't found time to translate it so far, and I don't see any free time in my near-term schedule. Feel free to help translate it if you can. Also, apologies for the lack of efficient communication regarding its publication. Please feel free to ask me anything if there are specific points you would like information about, and to ping me for anything related to the TWUG.

This was an extremely unsatisfactory process. The selected representative is someone I've never heard of, but who seems to have been backed by a well-coordinated, well-informed group in what seems more like a coup than an election. DCDuring (talk) 16:23, 20 December 2018 (UTC)[reply]

Zulu vowel length marking

Zulu orthography does not mark vowel length, however Wiktionary (and Wikipedia) have taken up the convention of marking vowel length in Zulu with a macron. I have not seen this convention outside of Wiktionary and Wikipedia, and I think it's a bit problematic because it could be mistaken for a tone diacritic. Also, it looks messy when tone diacritics get stacked on top of the macron. I would like to use the character ː to mark length because it leaves room for tone diacritics and is more clear in its meaning. Another possibility is to double the vowel character to show length, but in my opinion that doesn't look as good. Smashhoof2 (talk) 02:25, 17 December 2018 (UTC)[reply]

@Rua, Metaknowledge Chuck Entz (talk) 03:59, 17 December 2018 (UTC)[reply]

Well, I've not studied Zulu, but my understanding is that vowel length is not a very important phenomenon in it. You've got it in most (but not all) penultimate syllables like most Bantu languages, and that's allophonic. Then you've got it in some ideophones, but I think those are generally written with double or even triple vowels (correct me if i'm wrong). And then there are the contractions, which I suppose are actually phonemic but don't come up too frequently. I would say that we don't really need to be marking this on the headword at all — just stick the length mark in the IPA, and leave it at that. —Μετάknowledge^{discuss/deeds} 05:36, 17 December 2018 (UTC)[reply]

@Rua, Metaknowledge Vowel length is very important in the realization of tone. Some verbal prefixes have long vowels, and the short form of the perfect is a long vowel. And Rua mentions that the noun class prefixes contain long vowels. Along with that, there is the allophonic penultimate vowel lengthening. However, this allophonic lengthening has large effects on the tonal realization. For example, inja is written in the headword as înjá. However, this is misleading as it has two tonal realizations. With penultimate lengthening you get îːnjá, and without it, you get ínja. (The underlying tone is /ínjá/.) I plan on adding tone to the Zulu verb inflection tables, but to do so I will need to have a short form and a long form for every verb form, as the tone is different between the two forms. I just think we should adopt a different marking of vowel length because I don't like stacking diacritics, and I haven't seen the macron used elsewhere. Authors on Zulu tone (James Khumalo for example) tend to use doubled vowels in autosegmental derivations, but use ː in surface forms. Smashhoof2 (talk) 19:30, 17 December 2018 (UTC)[reply]

It's not a tone diacritic because only ´ and ^ are indicators of tone. ^ already implies length, so ¯ is only needed where there is a ´ tone or no tone at all. I don't agree with removing length indications though, because they can be distinctive (class 5 ī is distinguished from class 9 i, Xhosa distinguishes these orthographically). In the case of a long final syllable, the stress also shifts to the final syllable. —Rua (mew) 11:30, 17 December 2018 (UTC)[reply]
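
To make the notational proposal in this thread concrete, here is a minimal Python sketch that rewrites a macron-marked vowel as the same vowel followed by ː while leaving any tone diacritic in place. The function name macron_to_length_mark is hypothetical, and the sketch assumes length is written only with the combining macron (U+0304); it does not touch the circumflex, which is said above to imply length already.

```python
import unicodedata

MACRON = "\u0304"       # combining macron, the current length marking
LENGTH_MARK = "\u02D0"  # ː, the proposed length marking


def macron_to_length_mark(text: str) -> str:
    """Replace each macron with a length mark placed after the vowel.

    Illustrative sketch only: tone diacritics (acute, circumflex) stay on
    the vowel, and the length mark follows the whole letter.
    """
    out = []
    pending = False  # True while the current letter carried a macron
    for ch in unicodedata.normalize("NFD", text):
        if ch == MACRON:
            pending = True
            continue
        # A new base letter starts: close the previous long vowel first.
        if pending and not unicodedata.combining(ch):
            out.append(LENGTH_MARK)
            pending = False
        out.append(ch)
    if pending:
        out.append(LENGTH_MARK)
    return unicodedata.normalize("NFC", "".join(out))


print(macron_to_length_mark("i\u0304nja"))  # macron-marked i, then "nja"
```

A vowel carrying both a macron and an acute comes out as the acute-marked vowel followed by ː, so the two marks no longer stack.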

clickbait

I've never seen as much clickbait on Wiktionary as on the page toe-tapper - does it really help the reader to have a link to see pictures of a particular bathroom? I'd like to delete the link, but figure that this is sth admins can do. --Mustliza (talk) 07:32, 17 December 2018 (UTC)[reply]

Removed (it did seem a bit over-the-top) SemperBlotto (talk) 07:45, 17 December 2018 (UTC)[reply]

Word of the year

Lots of the many online dictionaries on the web have got a supercute "Word of the Year" feature. There's nothing on Wiktionary:What Wiktionary is not that says we're not supercute either. Hence, as a result it is undeniably clear that we also are supercute too. QED. Maybe we could choose something from the bleeding edge of lexicography (this year's Word of the day), tweet about it and when hobnobbing with other lexicographers they will kiss our rings. I suppose it would have to be decided before the dernier day of the year - that would be tickety-boo. This is sth admins can do as well as regular wordsters. --Mustliza (talk) 08:04, 17 December 2018 (UTC)[reply]

Just as long as it's not bloody Brexit. SemperBlotto (talk) 08:05, 17 December 2018 (UTC)[reply]
That will probably be next year's. — SGconlaw (talk) 08:30, 17 December 2018 (UTC)[reply]
- Not a bad suggestion, but the question is who is going to draw up the shortlist and arrange for a vote. (Not me, please.) — SGconlaw (talk) 08:29, 17 December 2018 (UTC)[reply]
  - Shortlists are often driven by analysing search / request traffic, something we don't really do. I think we should also focus on non-English words, or maybe English borrowings from other languages, just to differentiate us a bit (we're multilingual, after all).– Jberkel 10:43, 17 December 2018 (UTC)[reply]
    @Jberkel, we do actually keep track of traffic: link. - TheDaveRoss 13:12, 17 December 2018 (UTC)[reply]
    @TheDaveRoss: We keep track, but don't really analyze in depth. And the top viewed pages don't look very interesting / feature-worthy. Do you know if there is a way to get hold of the search query data? – Jberkel 14:02, 17 December 2018 (UTC)[reply]
    @Jberkel: not that I know of, but I haven't looked into the stats API thoroughly. It doesn't seem to be on the hadoop cluster, and it certainly isn't available in the wiki database, so I am not sure where it would live if the data exist. - TheDaveRoss 14:37, 17 December 2018 (UTC)[reply]
    Yeah, looks like the top search term in 2018 was "BF". No idea why that would be the case. — SGconlaw (talk) 15:00, 17 December 2018 (UTC)[reply]
    I don't trust this data – if you enable "Show mobile percentages" you'll get some top entries in the 0.x% / 99.x% ranges, which can't be right. – Jberkel 16:00, 17 December 2018 (UTC)[reply]
    It has a lot to do with the fact that outside of the Western world the predominant means of accessing the internet is via mobile (e.g. India is 90% mobile views), and the vast majority of views on certain pages are from specific regions. - TheDaveRoss 17:58, 17 December 2018 (UTC)[reply]
    Granted, but 99.6% mobile access (for BF) seems improbably high. – Jberkel 20:48, 17 December 2018 (UTC)[reply]
    - Well, since we have both WOTD and FWOTD, we could have both a Word of the Year (English) and Foreign Word of the Year. Maybe one way to avoid having to compile shortlists would just be to ask editors to propose and vote on entries by a certain date. — SGconlaw (talk) 11:00, 17 December 2018 (UTC)[reply]

My proposal is to pick an entry entirely at random. DTLHS (talk) 16:06, 17 December 2018 (UTC)[reply]

When I just did a Random entry, I got broutâtes. Perhaps we should keep it at lemmas. --Lambiam 20:11, 17 December 2018 (UTC)[reply]

I think the idea was to pick a former (F)WOTD randomly. Or we could try to find the most zeitgeisty ones. My picks: Anthropocene / Dutch wereldbrand. Grim. – Jberkel 20:48, 17 December 2018 (UTC)[reply]

I see no point in designating an arbitrary random word as "word of the day/month/year" because we already have a random-entry search feature anyone can use. The point of these features is usually to illustrate the zeitgeist, i.e. words in the news. (This tends to be abused by using words that have been mentioned without much real usage, but I suppose some of them do survive.) Equinox ◑ 20:13, 17 December 2018 (UTC)[reply]

I rather like Jberkel's suggestions. By the way, not much time till the end of the year to do this, unless the Word of the Year is intended to be announced in the following year. Also, someone has to design a template that will fit on the Main Page. — SGconlaw (talk) 07:08, 20 December 2018 (UTC)[reply]

Should set-type categories also contain their namesake?

I noticed a lot of cases where a set category, such as Category:en:Hares or Category:en:Dogs, contains entries for synonyms of its namesake (hare and dog in this case). I tend to place these in the parent category instead, Category:en:Lagomorphs and Category:en:Canids, because a hare and a dog are a specific kind of lagomorph and canid respectively, and not a specific kind of hare or dog. Other people seem to have thought differently in the past, and there isn't a specific consensus about it. To avoid people moving entries back and forth to match the way they think, I would like to form a consensus and perhaps even a formal rule. Personally, I think a set category should only be used for kinds of its namesake, and not for the namesake itself. —Rua (mew) 13:32, 20 December 2018 (UTC)[reply]

I see where you are coming from, but think it would be quite hard to enforce such a rule as it is not obvious. Perhaps just put the word in both categories. — SGconlaw (talk) 17:04, 20 December 2018 (UTC)[reply]

Seems logical to me. It'd be a good idea to have a set of guidelines at least somewhere, maybe on a Wiktionary namespace page (perhaps a policy page, but I don't think that much is even necessary) explaining how the set/topic cat system works and what it's intended to do. Related to this is the point that the description of the categories don't unambiguously point to what you're suggesting; e.g. the description of Category:en:Dogs ("terms for dogs") doesn't necessarily suggest that it is only for hyponyms of "dog" and that its namesake should be placed elsewhere. — Mnemosientje (t · c) 14:17, 21 December 2018 (UTC)[reply]

Nobody knows or agrees on what it's intended to do. DTLHS (talk) 18:26, 21 December 2018 (UTC)[reply]

All the more reason to make sense of it in writing somewhere for future reference? — Mnemosientje (t · c) 12:08, 22 December 2018 (UTC)[reply]

Vietnamese xxx class nouns => Vietnamese nouns classified by xxx

I am going to rename all categories "Category:Vietnamese xxx class nouns" to "Category:Vietnamese nouns classified by xxx", in the same way as for Chinese, Thai, Lao, Lü, etc., and likewise "Category:Vietnamese nouns by class" to "Category:Vietnamese nouns by classifier". I am asking here whether you agree. --Octahedron80 (talk)

Seems fine to me. I’m all in favour of consistency. — SGconlaw (talk) 14:59, 21 December 2018 (UTC)[reply]

All renamed. However, I did not check whether each word has the correct classifier or whether each classifier exists. --Octahedron80 (talk) 03:31, 23 December 2018 (UTC)[reply]

Pinyin, Zhuyin Fuhao and Erhua

玩意兒

现代汉语词典7 p1348 "wányìr"; 现代汉语规范词典3 p1350 "wányìr"; http://dict.concised.moe.edu.tw/cgi-bin/jbdic/gsweb.cgi?o=djbdic&searchid=Z00000043688 "ㄨㄢˊ　ㄧˋㄦ　（ㄧㄜˋㄦ）"; http://dict.revised.moe.edu.tw/cgi-bin/cbdic/gsweb.cgi?o=dcbdic&searchid=Z00000161614 “ㄨㄢˊ　ㄧˋㄦ　（變）ㄨㄢˊ　ㄧㄜˋㄦ”

现代汉语词典7 凡例 p4 talks about the changes to pronunciation of the preceding syllable caused by erhua

I think that the time has come to include ㄨㄢˊ　ㄧㄜˋㄦ somewhere on the 玩意兒 page, but in what way? --Geographyinitiative (talk) 16:06, 21 December 2018 (UTC)[reply]

WT:VOTE after four years?

As in most democracies, could votes be repeated after four years if the community decides so? --Backinstadiums (talk) 19:20, 21 December 2018 (UTC)[reply]

No. In democracy the demos decides not that the demos has to vote (but a different organ) and here the community would decide that the community has to vote, and Wiktionary is not even a democracy, like shareholder meetings or student committees are unrelated to democracy. Fay Freak (talk) 19:48, 21 December 2018 (UTC)[reply]

You can make any vote you want. Nothing is going to automatically repeat itself. DTLHS (talk) 19:52, 21 December 2018 (UTC)[reply]

Yes, you can make a new vote (or RFD) if you think the situation has significantly changed since the last one. Making a new vote/RFD/etc shortly after an old one, when it isn't likely that anything has changed, would usually be disruptive, but revisiting an issue four years on is usually OK (after several years, the community itself will have changed a bit as users come and go). If this is about reversing the current lemmatization of Chinese entries, though, it would be best to start a discussion first and try to get major editors of Chinese onboard... - -sche (discuss) 22:28, 21 December 2018 (UTC)[reply]

Yes, you can repeat a vote. There is no specified number of years that need to elapse, but it would be a waste of everyone's time to repeat a vote very often. --Dan Polansky (talk) 08:32, 27 December 2018 (UTC)[reply]

Initials

(If I may make another w:WP:BEANS post,) I noticed that M. lists the two Latin names it's an initial of. This is probably sensible/tolerable for (inscriptional) Latin, where there are a limited number of regular abbreviations of names. In English and other languages, of course, any first, middle or last name that starts with M can be reduced to the initial M. (with or without a dot), and likewise for every other letter. Presumably we do not want thousands of {{abbreviation of}} senses to be added to M. and other letters for every abbreviated name. Do we have any guideline or policy that would prevent this, besides common sense, resorting to RFD, and blocking disruptive editors? I noticed this because I was trying to find out if "m." was an abbreviation of "mid" or "middle" in general or only in "m.Yks." for "mid-Yorkshire". - -sche (discuss) 22:41, 21 December 2018 (UTC)[reply]

Maintenance cost. Prognosis about how much it can get out of hand. How limited the set is. Cost benefit-ratio: If it is a waste of life time, may it be other editors’ or readers’ attention or may it seem like an unhealthy obsession (to which I count bad-faith trolling, like WF with the double surnames, like in Walden Two criminality was considered the result of an illness — ashaming that for all the speechlessness it took some days before Equinox set an example, even though the argument is simple, that one does not seek these entries, they are arbitrary, they violate limits of attention and abuse resources …) the line should be cut. For the readers side: One could have in mind more often how many people would look it up and find it useful. A list with all possible values for M. at M. has no value, one can use the forename category or whatever for the same purposes (Category:English given names, Latin has Category:Latin praenomina, Category:Latin nomina gentilia, Category:Latin cognomina). Also note the scope of the dictionary: There are uncommon abbreviations that can be cited enough but are to be found in the list of abbreviations of works. Like legal commentaries abbreviate all kinds of common words. If one does not recognize a word while reading such a work one looks into the list of abbreviations, then maybe one can look up the resolved word in a dictionary, this one or some other kind of reference works. If someone starts to add these abbreviations it needs to be stopped for his own sanity – there are enough places to do useless things, and too many things that should be done instead. In short, think about what consumers get and what editors get from having the entered content, this is a very universal principle. Fay Freak (talk) 19:33, 26 December 2018 (UTC)[reply]

I tend to agree. What are you referring to though, by "for all the speechlessness it took some days before Equinox set an example"? Per utramque cavernam 19:42, 26 December 2018 (UTC)[reply]

The fact that there was a long thread (it cringed me off) before Equinox just deleted them with the reasoning that Wonderfool is trolling (which is an umbrella term really that does not name the actual dangers), that it needed a Machtwort, like no admin was able to justify deletion expressis verbis, so he did it because he could do it without anyone able to cast doubts upon him. Of course there were good reasons, but admins shunned for perhaps knowing what is right but not being able to formulate what can be discerned without this being as such a formally voted rule. That’s how it appeared to me. Fay Freak (talk) 20:04, 26 December 2018 (UTC)[reply]

"Friendliness" versus Reality

I have just made a change to Wiktionary:Example sentences (see also: Wiktionary talk:Example sentences). If you have to revert it, just know that you are 'the man'; it also doesn't affect my willingness to edit Wiktionary. Let me know what you think, but I think I just struck a blow against the empire. --Geographyinitiative (talk) 13:25, 23 December 2018 (UTC)[reply]

I don't agree with this change: I think the old friendliness rule was a good one. Equinox ◑ 18:15, 27 December 2018 (UTC)[reply]

I agree with Equinox, I think the old phrasing allowed for non-friendly examples when necessary, but preferred friendly ones when reasonable. The new rule gives no preference, and my view is that, all things being equal, we should include neutral or positive examples over negative or offensive ones. - TheDaveRoss 18:31, 27 December 2018 (UTC)[reply]

Hmm. Yes, the new text seems to tilt too heavily in the other direction, of writing offensive usexes. If we went back to the old text except we dropped "some", i.e. "Although ~~some~~ offensive or explicit words will require a sentence that demonstrates those qualities", that would seem to resolve the issue of suggesting that some offensive words should have friendly usexes. - -sche (discuss) 18:36, 27 December 2018 (UTC)[reply]

I tried that, also qualifying "You should generally write it so that it is unlikely to offend or embarrass". Better? - -sche (discuss) 18:40, 27 December 2018 (UTC)[reply]

Presentatives part of speech

Which part of speech should be used for presentatives? I would like to add Zulu presentatives, but I'm not sure how to classify them. In the literature, Zulu presentatives are called copulative demonstratives. They are forms such as nansi "here it is (class 9)", nabo "there they are (class 2)", and nankaya "there they are over there (class 6)". They are also used with nouns to form expressions such as "Here is the dog", "There are the children", and "There are the girls over there." Other languages have presentatives as well, but I'm not familiar with any that do, so I'm not sure if/how they are labelled in Wiktionary. --Smashhoof2 (talk) 20:17, 26 December 2018 (UTC)[reply]

In at least some cases (English voilà, Latin ecce, Latvian lūk, Spanish vualá, Turkish işte) they have been classified as interjections. --Lambiam 21:25, 26 December 2018 (UTC)[reply]

Interjection is a junkyard category for standalone expressions in English. I would hope that we would use terms that grammarians of Zulu use, rather than the Latin parts of speech into which we cram English expressions. DCDuring (talk) 22:11, 26 December 2018 (UTC)[reply]

Yeah, interjection isn't a good fit because presentatives are used as part of larger phrases. The closest category is verb, since they can form predicates, but really it's not a good label because they definitely aren't verbs. It seems that Wiktionary doesn't have any fitting category for them. --Smashhoof2 (talk) 03:54, 27 December 2018 (UTC)[reply]

Most of the presentatives I listed above as currently categorized as interjections can also be used as part of larger phrases (ecce cor meum; lūk mana sirds; işte kalbim). The set of allowed POS headers is severely limited, so we have to resort to using some as catch-alls. --Lambiam 10:43, 27 December 2018 (UTC)[reply]

They're essentially verbal pronouns. I'd put them under the pronoun header, and instead of making independent entries, have the entries point to the independent pronouns with a link to an explanatory appendix. —Μετάknowledge^{discuss/deeds} 18:00, 27 December 2018 (UTC)[reply]

How are these named in popular books on Zulu grammar or, failing that, in a plurality of not-so-popular books on the subject? DCDuring (talk) 21:50, 27 December 2018 (UTC)[reply]

Vote: Lemming principle into CFI

FYI, I created Wiktionary:Votes/pl-2018-12/Lemming principle into CFI, based on Wiktionary:Beer parlour/2014/January#Proposal: Use Lemming principle to speed RfDs. --Dan Polansky (talk) 08:26, 27 December 2018 (UTC)[reply]

Thanks for creating this, I hadn't realized it had never been voted on. - TheDaveRoss 13:18, 27 December 2018 (UTC)[reply]

Vote: Phrasebook CFI

FYI, I created Wiktionary:Votes/pl-2018-12/Phrasebook CFI. From what I recall, some editors seemed to support similar criteria in RFD discussions. --Dan Polansky (talk) 09:06, 27 December 2018 (UTC)[reply]

Invitation from Wiki Loves Love 2019

Please help translate to your language

Love is an important subject for humanity, and it is expressed in different cultures and regions across the world through different gestures, ceremonies, and festivals. To document expressions of this rich and beautiful emotion, we need your help, so that we can share the depth of culture each region has and the best of how its people celebrate love.

Wiki Loves Love (WLL) is an international photography competition of Wikimedia Commons with the subject love testimonials happening in the month of February.

The primary goal of the competition is to document love testimonials through human cultural diversity such as monuments, ceremonies, snapshot of tender gesture, and miscellaneous objects used as symbol of love; to illustrate articles in the worldwide free encyclopedia Wikipedia, and other Wikimedia Foundation (WMF) projects.

The theme of 2019 iteration is Celebrations, Festivals, Ceremonies and rituals of love.

To know more about the contest, check out our Commons Page and FAQs

There are several prizes to grab. Hope to see you spreading love this February with Wiki Loves Love!

Kind regards,

Wiki Loves Love Team

Imagine... the sum of all love!

--MediaWiki message delivery (talk) 10:12, 27 December 2018 (UTC)[reply]

Reverting

Hello, I just saw this edit, which I wanted to undo because it didn't make sense - a word derived from itself in the same language. In cases like this, where I don't speak the language, what's the best thing to do? Ignore? Undo? Ask here? Tag the page? Speak to the editor? --Pious Eterino (talk) 17:02, 27 December 2018 (UTC)[reply]

Don't ask here. Obviously bad edits should be undone. If you don't know what to do, leave it to somebody else. —Μετάknowledge^{discuss/deeds} 17:57, 27 December 2018 (UTC)[reply]

You can bring it to people's attention in the WT:ES (if the change is to the etymology) or WT:TR (if the change is to definitions, etc) if it seems fishy but is not obviously bad. - -sche (discuss) 18:32, 27 December 2018 (UTC)[reply]

Ignoring is always an option, but with millions of entries that no one has time to visit, bad edits can go unnoticed for a very long time. There are a number of better options. In this case, the details of the language are irrelevant: you don't list the word itself as its own source in an etymology, so you would have been entirely justified in undoing the edits with no knowledge of the language whatsoever. It never hurts to explain your reason in the edit summary, because a lot of bad edits are made in good faith (this edit looks like a matter of being solely focused on adding the reference and getting the technical details right without thinking about whether it made sense). Or you can include the same kind of message in a post on the editor's talk page, though there's always the possibility the editor may not respond satisfactorily. You can also use {{attention|wo}}, which puts the entry in a category that people who know the language may check ... eventually. If it does hinge on matters you're not qualified to address yourself but you want to challenge the etymology, you can use {{rfv-etymology|wo}} and start a discussion at the Etymology scriptorium. You can also start a discussion at the Tea room if it's something about the entry other than the etymology. Chuck Entz (talk) 18:36, 27 December 2018 (UTC)[reply]

Great advice, Chuck Entz! --Pious Eterino (talk) 00:40, 28 December 2018 (UTC)[reply]

CFI and English editing guidelines

CFI was changed in diff to no longer point to Wiktionary:About English, pointing to Wiktionary:English editing guidelines (redlink) instead. I cannot see the point of the change; can someone undo the change to CFI? --Dan Polansky (talk) 19:00, 27 December 2018 (UTC)[reply]

Wiktionary talk:English entry guidelines § RFM discussion: November 2015–August 2018 Per utramque cavernam 19:05, 27 December 2018 (UTC)[reply]

The consensus for moving to English entry guidelines was pretty weak; if the page has to stay there, CFI needs to be updated accordingly ("entry" guidelines, not "editing" guidelines) or reverted back to Wiktionary:About English, which is still a redirect (and should be). --Dan Polansky (talk) 19:09, 27 December 2018 (UTC)[reply]

Why would only the English page be renamed anyway, when all of our infrastructure expects the page to be named the same in all languages? Is there a particular reason why English should have a different page name from other languages? —Rua (mew) 20:20, 27 December 2018 (UTC)[reply]

Defect in WT:THUB

@BD2412, SemperBlotto: On another page Dan Polansky pointed out a shortcoming in the definition of qualification in the section Translation hubs of our CFI. It can be illustrated with the Chinese translation 失身 (shīshēn) of English lose one's virginity. The first character, 失, means to lose. So far, so good. The second character, 身, can mean many things, from body to social status to moral character. It cannot mean virginity. There is no way to construct this idiomatic meaning of the whole from the meanings of its parts. In all reason, this should qualify as a translation that supports inclusion of the English term. However, it does not, because the rules, as formulated (which are said to be “tentative”), specifically exclude any translations in a language that does not use spaces to separate words.
Obviously, we do not want, say, Chinese 超文本系統 to qualify as a supporting translation of hypertext system; it is simply a compound that is a word-for-word translation of the English term: 超文本 (chāowénběn, “hypertext”) + 系統／系统 (xìtǒng, “system”). But this is already excluded in the present formulation by the same rule that excludes German Autoschlüssel as a qualifying translation of car key. The remedy is simple: just strike the clause excluding languages that do not separate words by spaces. While we are at it, I propose to combine the first two exclusion rules to a single one:

a closed compound or multi-word phrase that is a word-for-word translation of the English term: German Autoschlüssel does not qualify to support the English term “car key”; or

--Lambiam 08:12, 28 December 2018 (UTC)[reply]

My first impression is that the proposed removal of "a phrase in a language that does not use spaces to separate words" from WT:THUB would be fine, and the proposed merger of the two items would be fine as well. However, I feel I do not have enough energy to consider the impact more thoroughly. --Dan Polansky (talk) 19:31, 28 December 2018 (UTC)[reply]

soft redirection template for Japanese (revisited)

Hiragana (modern): まっとう
Hiragana (historical): まつたう
Kanji: 全う, 真っ当, 完う
Notes: 真っ当 – ateji, adjective only; 完う – literary

Hello everyone! I have redesigned {{ja-kanji spellings}} after Modèle:ja-trans on the French Wiktionary. Here are my ideas of a soft redirection system for Japanese.

The new version of the template can display kana spellings (modern and historical) in addition to kanji spellings. As such, the name is no longer accurate. As the template is intended for wide use, the name {{ja-spellings}} is a little long. I would like to rename it to {{ja-forms}}, but that name is already taken. So I hope someone with a bot can rename that template and update pages using it to make room. (I have made the request here.)
The original version of the template supported embedding of kanjitabs in the kanji spellings, but this function is now removed. The idea is that {{ja-kanjitab}} would be used independently of the {{ja-ks}}/{{ja-see}} system. Therefore, if for the word aikotoba, n. the kanji spelling 合い言葉 is chosen as the lemma entry, it would have both a {{ja-ks}} and a {{ja-kanjitab}}, but if the kana spelling あいことば is chosen as the lemma entry, it would have only {{ja-ks}}.
The soft redirection system requires that among the different spellings of a given word, the lemma entry uses {{ja-ks}} to link to the other spellings, and the other spellings use {{ja-see}} to link back. Now that {{ja-ks}} is created, {{ja-see}} remains to be discussed:
1. Do you think {{ja-see}} needs to copy definitions from the lemma entry? One difficulty is that Japanese definitions on the wiki tend to be elaborate. For example, the definition of 酔う is “to get drunk, become intoxicated or inebriated, fall under the influence of alcohol; to become drunk or intoxicated by something; etc.” This is fine for the lemma entry, but a more concise definition such as “to get drunk; to become intoxicated; to feel sick; etc.” would be better on the non-lemma spellings. Chinese editors solve the problem by writing definitions in the form # translation {{gloss|longer description}}, and {{zh-see}} would only copy the translation. Following this principle, the Japanese definition of 万葉集 should read “Man'yōshū (Japan's oldest anthology of poems, completed in 759)” so that {{ja-see}} would only copy the “Man'yōshū”. [UPDATE: the ja-see template now uses a different format, one which looks better when definitions are copied in full.]
2. What do you think about categories? I think the categories concerning the word (e.g. Category:Japanese nouns for aikotoba, n.) should be copied among all the spellings, while categories concerning spellings (e.g. Category:Japanese terms read with kun'yomi for 合い言葉) should remain spelling-specific. In addition, I think sort keys can be removed and each page can be sorted under its first character (for example, あいことば under あ, and 合い言葉 under 口, the radical of 合) for the following reasons: First, the current practice of sorting pages under their readings does not work well for pages with more than one reading. For example, 避ける must be sorted under both さ and よ, which is not possible in the MediaWiki software. Second, under my proposal, categories concerning words would hold all the spellings of a given word. Since Category:Japanese nouns would contain both あいことば and 合い言葉, there is no need to categorize both under あ. Categories concerning spellings could be sorted by spellings instead of readings.
3. What do you think about soft redirects within written forms that are independent of reading? For example, 體 is a variant of 体 whether it represents tai, n., affix or karada, n., and １日/1日/壱日 is a variant of 一日 whether it represents ichinichi, n. or tsuitachi, n. Therefore it might be a good idea to have two stages of soft redirects, one within written forms (１日/1日/壱日 → 一日) and one from written forms to words (一日 → ichinichi, tsuitachi). The first stage could be handled by a template like {{ja-seex}} which copies all the categories from the lemma written form, and the second stage could be handled by the regular {{ja-see}} which copies only the relevant categories from the lemma word. (For example, 眞 would copy categories of both shin and ma- from 真, but 真 would only copy ma- “right, true” and not for example ma “time, pause, space” from ま).
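
To make point 1 concrete, here is a minimal sketch in Python (not the actual Lua code behind {{zh-see}} or {{ja-see}}) of copying only the short translation from a definition line written as # translation {{gloss|longer description}}. The helper name short_definition and the sample line are hypothetical and only illustrate the idea.

```python
import re

# Matches one {{gloss|...}} template (without nested templates) plus the
# whitespace in front of it. This pattern is an assumption for illustration.
GLOSS = re.compile(r"\s*\{\{gloss\|[^{}]*\}\}")

def short_definition(line: str) -> str:
    """Return the definition line without its leading '#' and gloss template."""
    text = line.lstrip("#").strip()
    return GLOSS.sub("", text).strip()

# Hypothetical definition line for 万葉集, following the format in point 1.
example = "# [[Man'yōshū]] {{gloss|Japan's oldest anthology of poems, completed in 759}}"
print(short_definition(example))
```

A non-lemma spelling would then show only the short part, while the lemma entry keeps the full line.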

(Notifying Eirikr, Wyang, TAKASUGI Shinji, Nibiko, Atitarev, Suzukaze-c, Poketalker, Cnilep, Britannic124, Fumiko Take, Nardog, Marlin Setia1, AstroVulpes, Tsukuyone, Aogaeru4): --Dine2016 (talk) 15:36, 28 December 2018 (UTC)[reply]

@Dine2016: How could it be used to improve the chinese entries as well? --Backinstadiums (talk) 19:04, 28 December 2018 (UTC)[reply]

Re Q3.1: The “Chinese” approach for avoiding long definitions seems fine to me. re Q3.2: Spelling-independent categories should indeed apply to all regular spellings, while spelling-dependent categories should (obviously) only be applied to the spellings to which they are applicable. I have no opinion on the best approach to sorting. As long as we cannot sort on multiple keys, any approach will suck. How is sorting on the actual entry (e.g., 合い言葉 under the character 合 rather than its radical) worse than sorting on the radical? --Lambiam 22:34, 28 December 2018 (UTC)[reply]

Sorry for the late reply.

@Backinstadiums: There is already soft redirection for Chinese entries so we do not need another set of templates. The Japanese case is different from Chinese. In Chinese, there is usually a one-to-one correspondence between Traditional Chinese and Simplified Chinese. In Japanese, however, kana spellings and kanji spellings of a word are usually one-to-many, and one page (e.g. かえる) can correspond to several words. So we should exercise more caution in synchronizing glosses and categories.

@Lambiam: I agree that MediaWiki categories are weak compared with the complexity of the Japanese writing system. Sorting on the radical is an idea taken from Chinese categories. I'm open to the idea of sorting on the whole character rather than its radical, though. --Dine2016 (talk) 03:11, 31 December 2018 (UTC)[reply]

Sorting on the radical may be helpful to people who are sufficiently familiar with Kanji to know not only the radicals, but also which of potentially several candidates is the radical traditionally assigned to a given character. An advantage of sorting on the whole character sequence is its conceptual simplicity. The collation order would be Wikimedia’s system, based on the CJK Unicode values, which I think are already presorted on the radicals. But I never search for a specific term through the categories, so I cannot really speak for users who do. --Lambiam 07:27, 31 December 2018 (UTC)[reply]

Each tranche of CJK characters is sorted independently, so sorting by codepoint only really works for the initial batch of 20,000 or so. ICU includes better sorting data, and it might be possible to use an ICU sort instead. --RichardW57 (talk) 13:13, 1 January 2019 (UTC)[reply]
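
A small Python illustration of this point, assuming plain code point order as the collation: 𠀀 (U+20000, in Extension B) is, as far as I know, filed under radical 一, yet it sorts after 龥 (U+9FA5), which belongs to the last radical, 龠. The radical assignments in the comments are stated from memory of the Unihan data and should be checked before relying on them.

```python
# Python compares strings by code point, so sorted() shows raw code point order.
chars = [
    "𠀀",  # U+20000, Extension B, radical 一 (assumed from Unihan)
    "龥",  # U+9FA5, initial CJK block, radical 龠 (assumed from Unihan)
    "一",  # U+4E00, initial CJK block, radical 一
]

by_codepoint = sorted(chars)
print([f"U+{ord(c):04X}" for c in by_codepoint])
```

The two radical-一 characters end up at opposite ends of the list, which is the problem a radical-based sort key or an ICU collation would avoid.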

Since Japanese does not have an alphabetical order in the traditional sense—i.e. having an unequivocal ordering from the first written character to the last—Japanese kanji dictionaries tend to sort kanji using the radicals. If one tries ordering them individually after the Unicode consortium's table, the Japanese Industrial Standards (JIS) classification, or some other system, a problem quickly arises once we get into lesser known ones, such as the Hyōgaiji categories kokuji, extended shinjitai, or Asahi moji (characters), many of which might not even be included. Whatever system is chosen to sort them after, it needs to be applicable to any kanji that exists, no matter how obscure it is. --AstroVulpes (talk) 14:42, 1 January 2019 (UTC)[reply]

Interesting. How does Wiktionary handle unencoded characters? Ideographic description sequences? I had assumed that Wiktionary only allowed entries that could be written in non-PUA Unicode. The lag from Unicode standard to ICU is to be measured in weeks. The bigger issues are then the adoption of new issues of ICU, corrections to the Unihan database, and disagreements as to the radical or to the stroke count. --RichardW57 (talk) 05:05, 2 January 2019 (UTC)[reply]

As far as sortkeys go, ideographic description sequences are identified by a function in Module:zh-sortkey, and then looked up in Module:zh-sortkey/data/unsupported. If the module encounters an IDS without a sortkey, the page is tracked. At least some titles with IDSes are categorized in Category:Terms containing unencoded characters. — Eru·tuon 05:24, 2 January 2019 (UTC)[reply]
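
Roughly, the flow described above could be sketched as follows in Python. This is not the code of Module:zh-sortkey, which is written in Lua. The table contents, the function names, and the sort key format are all made up for illustration. The only fixed fact used is that the Ideographic Description Characters block spans U+2FF0 to U+2FFF.

```python
# Code point range of the Ideographic Description Characters block.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFF

# Hypothetical stand-in for a data table of titles with unencoded characters.
UNSUPPORTED_SORTKEYS = {
    "⿰亻革": "人09",  # made-up sort key for a made-up entry
}

def contains_ids(title: str) -> bool:
    """True if the title contains any ideographic description character."""
    return any(IDC_FIRST <= ord(ch) <= IDC_LAST for ch in title)

def ids_sortkey(title: str, tracked: list) -> "str | None":
    """Look up a sort key for an IDS title; record the title if none exists."""
    if not contains_ids(title):
        return None  # ordinary title, handled elsewhere
    key = UNSUPPORTED_SORTKEYS.get(title)
    if key is None:
        tracked.append(title)  # stands in for adding a tracking category
    return key

tracked_pages = []
print(ids_sortkey("⿰亻革", tracked_pages))
print(ids_sortkey("⿱艹石", tracked_pages), tracked_pages)
```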

How about radical + number of strokes + whole character? That way you have the customary cross-linguistic dictionary order, plus a tie-breaker. It may seem like a lot of work, but I believe we already have everything we need in the data modules, so it's just a matter of coding. Chuck Entz (talk) 01:16, 2 January 2019 (UTC)[reply]
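
A minimal Python sketch of such a key, assuming a lookup table from a character to its Kangxi radical number and residual stroke count. The small table below is illustrative only and is not taken from the data modules. Sorting characters missing from the table (such as kana) first is also just an assumption of this sketch.

```python
# Illustrative data: character -> (Kangxi radical number, residual strokes).
RADICAL_STROKES = {
    "一": (1, 0),
    "体": (9, 5),
    "合": (30, 3),
    "避": (162, 13),
}

def sort_key(title: str):
    """Key = radical, then stroke count, then the whole title as tie-breaker."""
    radical, strokes = RADICAL_STROKES.get(title[0], (0, 0))
    return (radical, strokes, title)

titles = ["避ける", "合い言葉", "一日", "体"]
print(sorted(titles, key=sort_key))
```

Because the whole title is the last element of the key, two entries that start with the same character still get a stable, predictable order.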

(Catching up after the holidays -- very behind in many different dimensions of life...)

Addressing just the sorting issue, I'd like to point out that, for dictionaries of whole words (i.e. not just single kanji listed as kanji, such as in a specific kanji-lookup dictionary), no other dictionary that I'm aware of sorts by radical or kanji. When every other dictionary sorts by reading, there's a strong rationale for doing so at Wiktionary too. I understand that the MediaWiki software backend is deficient, and entries like 避ける (sakeru, yokeru) can (currently) only be categorized under one sort index at a time. However, sorting under either reading is preferable to sorting under neither (such as by sorting on the kanji codepoint or the radical). ‑‑ Eiríkr Útlendi │^{Tala við mig} 00:11, 8 January 2019 (UTC)[reply]

@Eirikr: Thanks for your (long-awaited) reply. Actually, the format I proposed would sort words under both reading and kanji. For example, CAT:Japanese verbs would list さける under さ, よける under よ, and 避ける under 避 or 辶. The only deficiency is that spellings are sorted under kanji only. For example, CAT:Japanese terms spelled with 避 read as さ would have only 避ける under 避 or 辶, not さける under さ, because I considered さける and 避ける different spellings of the same word sakeru, v. As the former is not spelled with 避, it is not categorized under CAT:Japanese terms spelled with 避 read as さ. --Dine2016 (talk) 05:06, 8 January 2019 (UTC)[reply]

Should languages be grouped in translations?

The translation editor currently treats the various Sami languages as entirely separate languages, and sorts them accordingly. But there are also some entries like these, where there is a single heading "Sami" with all the languages listed under it. Before I go and "correct" all of these, I'd like to know what everyone thinks the format should be? Should they be treated entirely separately and scattered across the translation table, or grouped together? Also, shouldn't there be two ** instead of *:? I thought *: was reserved for different orthographies of the same language, like Serbo-Croatian. —Rua (mew) 12:36, 29 December 2018 (UTC)[reply]

No grouped translations use "**". There is no such distinction with "*:". DTLHS (talk) 16:56, 29 December 2018 (UTC)[reply]

The distinction is made in descendants lists, at least. —Rua (mew) 22:35, 29 December 2018 (UTC)[reply]

Well other than using *: and not **, translation language grouping is inconsistent and completely undocumented (and therefore left up to the whims of individual editors), so if you feel it's best to group Sami together I support it. DTLHS (talk) 22:49, 29 December 2018 (UTC)[reply]

Perhaps this is a good opportunity to codify something then? —Rua (mew) 23:42, 29 December 2018 (UTC)[reply]

That would be good. Personally, I think it makes sense for every L2-header-having language to be sorted under its own name (I think someone will look for languages which they know we call 'Bavarian' and 'Ancient Greek' under 'B' and 'A', respectively, not under 'G'), and to only nest things like 'Min Nan under Chinese' since we also do that on the L2 level. But things are indeed very inconsistent, because the translation-adder has been revised to start or stop nesting various things without existing entries being changed, and people do things manually. Maybe we could straw poll "should X be grouped?" for each set of L2-having languages that is currently grouped, with an option somewhere for "don't group/nest anything [that has its own L2 header]"? - -sche (discuss) 01:43, 30 December 2018 (UTC)[reply]

I would like the rules to be in Module:languages/data instead of the translation adder script, for one thing. DTLHS (talk) 03:23, 30 December 2018 (UTC)[reply]