Wiktionary:Beer parlour/2019/April

Including or excluding ethnic slurs under synonyms for ethnicity

Recently, User:Jimbo2020 removed the ethnic slurs/derogatory terms under the synonyms of Somali. On their user talk page, they argue that the precedent is to "not have ethnic slurs as synonyms unless they are historically significant". The examples for entries with synonyms including ethnic slurs were Chinese, German and African-American, while those that did not include them were Italian, Finn, and Oromo (the last two of which do not have any synonyms). I don't think there are any specific guidelines on this, so it would be a good idea to come up with at least something. — surjection?〉 18:42, 1 April 2019 (UTC)

I don't know what "historically significant" means. All words are "historically significant". DTLHS (talk) 18:44, 1 April 2019 (UTC)
Wiktionary is not censored. If they are or were at one time used as synonyms, they belong in the list of synonyms. —Rua (mew) 18:54, 1 April 2019 (UTC)
It's not about censoring. Kraut would be a "historically significant" entry under German. Listing a marginally used neologism like muzrat (which should be deleted btw) under Muslim would not be. Almost all ethnonym pages do not list any ethnic slurs unless the word has a storied history in the English language or is particularly relevant to English speakers. BTW, the American page currently has Ameritard listed as a hyponym; does that look right to you? Jimbo2020 (talk) 18:57, 1 April 2019 (UTC)
Based on the definition as "stupid or ignorant American", I would say yes, since it describes a subset of Americans. — surjection?〉 18:58, 1 April 2019 (UTC)
Again, are they synonyms? Then we list them as synonyms. Have you ever looked at some of our Thesaurus pages? They're full of offensive terms. —Rua (mew) 18:59, 1 April 2019 (UTC)
Once again, this is about a general style precedent. Check the pages for Arab, Pakistani and Italian: where are the slurs listed as synonyms? Jimbo2020 (talk) 19:01, 1 April 2019 (UTC)
You've pointed out that those entries are missing some synonyms, so someone will hopefully get around to adding the missing ones. —Rua (mew) 19:04, 1 April 2019 (UTC)
This would seem to undermine our mission as a descriptive reference work. It would not undermine it if we RfVed allegedly offensive terms, though the RfV process itself would advertise them. DCDuring (talk) 19:09, 1 April 2019 (UTC)
I don't understand. How does including all synonyms go counter to our mission? The opposite seems true to me. —Rua (mew) 19:42, 1 April 2019 (UTC)
Sorry about the misleading indentation and the ambiguous deixis of this. I was referring to the deletion of content on apparent grounds of offensiveness. DCDuring (talk) 22:34, 1 April 2019 (UTC)
Ah, ok. Thanks for clarifying. —Rua (mew) 22:46, 1 April 2019 (UTC)
Mehhh. I do sympathize with the desire to not (in effect) "promote" obscure derogatory terms by putting them as synonyms of common terms (like the example of muzrat on Muslim, above), but precedent certainly seems to be that they would be included, with appropriate tags of course (e.g. "derogatory, rare"), along with any alternative spellings (including e.g. rare and obsolete ones which someone might also complain about the oddness of "promoting"). A possible compromise would be to put them in a collapsed box like related and derived terms are put in, or to offload them to a Thesaurus page and have the synonyms section direct people to it. But what I would regard as the usual approach, of just listing them as synonyms with appropriate tags, seems OK. - -sche (discuss) 00:05, 2 April 2019 (UTC)
Terms that meet our CFI should be included, also if they are offensive – until we decide to change the CFI. At the same time we should be careful to mark offensive terms as offensive. Under German (the noun) the synonym “Kraut” is labelled offensive. I think “skinnie” is at least as offensive – and not only to the person being derogated with the slur, but to anyone with sensibility.  --Lambiam 20:31, 2 April 2019 (UTC)
Note that the change did not remove the entries for skinnie/Skinnie, just their appearance as synonyms for Somali. It's not entirely unreasonable to take the position that a slur is not an exact synonym for the corresponding more neutral demonyms. But a slur does have a semantic relationship to the corresponding demonym. Is it an antonym, a coordinate term? Should it appear under 'See also'? For consistency should we make sure that mutt and mongrel do not appear as synonyms of mixed-breed/mixed breed (Lemmings have it.) because they are pejorative?
The likelihood that anyone with pejorative intent will come to Wiktionary to find some good ones is negligible. It is much more likely that someone will come here looking to object to our inclusion of the pejoratives. So this seems to be a matter of w:virtue signalling rather than something likely to have a bad effect outside of the potential controversy. It is a question of our ascription of virtue to descriptivism vs. the proscription against any purported encouragement or even license of the use of ethnic slurs. DCDuring (talk) 21:30, 2 April 2019 (UTC)
If the issue is that words like "muzrat" are absurdly rare (which may be true), it seems this is a problem with listing synonyms of anything, not just of ethnicities. Equinox 21:39, 2 April 2019 (UTC)

Fun game again

Hi all. As we had an excellent time playing a multilingual board game last year, I'd like to repeat it this year. I set up Wiktionary:Random Competition 2019. We'll start sometime soon provided there's someone to play with me. --I learned some phrases (talk) 10:26, 2 April 2019 (UTC)

URL shortener for the Wikimedia projects will be available on April 11th

Hello all,

Having a service providing short links exclusively for the Wikimedia projects is a community request that came up regularly on Phabricator or in community discussions.

After joint work by developers from the Wikimedia Foundation and Wikimedia Germany, we are now able to provide such a feature. It will be enabled on April 11th on Meta.

What does the URL Shortener do?

The Wikimedia URL Shortener is a feature that allows you to create short URLs for any page on projects hosted by the Wikimedia Foundation, in order to reuse them elsewhere, for example on social networks or on wikis.

The feature can be accessed from Meta wiki on the special page m:Special:URLShortener (which will be enabled on April 11th). On this page, you will be able to enter any web address from a service hosted by the Wikimedia Foundation, generate a short URL, and copy it for reuse anywhere.

The format of the URL is w.wiki/ followed by a string of letters and numbers. You can already test an example: w.wiki/3 redirects to wikimedia.org.

What are the limitations and security measures?

In order to ensure the security of the links, and to avoid shortlinks pointing to external or dangerous websites, the URL shortener is restricted to services hosted by the Wikimedia Foundation. This includes, for example, all Wikimedia projects, Meta, MediaWiki, the Wikidata Query Service, and Phabricator (see the full list here).

In order to avoid abuse of the tool, there is a rate limit: logged-in users can create up to 50 links every 2 minutes, and the IPs are limited to 10 creations per 2 minutes.
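
As a rough illustration of the rate limit described above, here is a minimal Python sketch of a sliding-window limiter. The figures (50 and 10 creations per 2 minutes) come from the announcement; the class name, the `is_logged_in` flag and the sliding-window strategy itself are assumptions made for illustration, and nothing here is taken from the actual URL Shortener code.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 120   # "every 2 minutes", from the announcement
LIMIT_LOGGED_IN = 50   # logged-in users
LIMIT_IP = 10          # anonymous IPs


class SlidingWindowLimiter:
    """Hypothetical limiter; a sketch, not the real extension's logic."""

    def __init__(self):
        # actor key (user name or IP) -> timestamps of recent link creations
        self._events = defaultdict(deque)

    def allow(self, actor, is_logged_in, now):
        """Record a creation and return True if the actor is under the limit."""
        limit = LIMIT_LOGGED_IN if is_logged_in else LIMIT_IP
        events = self._events[actor]
        # Forget creations that have fallen out of the 2-minute window.
        while events and now - events[0] >= WINDOW_SECONDS:
            events.popleft()
        if len(events) >= limit:
            return False
        events.append(now)
        return True
```

With this sketch, an anonymous IP that creates one link per second is refused on its eleventh attempt and accepted again once the oldest creation is two minutes old. Passing `now` explicitly (in seconds) keeps the example deterministic; a real service would presumably read a clock instead.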

Where will this feature be available?

In order to enforce the rate limit described above, the page Special:URLShortener will only be enabled on Meta. You can of course create links or redirects to this page from your home wiki.

The next step we’re working on is to integrate the feature directly in the interface of the Wikidata Query Service, where bit.ly is currently used to generate short links for the results of the queries. For now, you will have to copy and paste the link of your query in the Meta page.

Documentation and requests

Thanks a lot to all the developers and volunteers who helped move this feature forward and make it available today for everyone in the Wikimedia projects! Lea Lacroix (WMDE) (talk) 11:57, 3 April 2019 (UTC)

The relationships between lemmas and forms

Why is colours the plural of colour and not of color? The obvious answer to this question would be that the spellings are different. But I ask you to look a little deeper at this question. All our definitions, etymology and translations are currently on the page color, so that is clearly the lemma. Yet, if you look up colours, then you don't get sent to the lemma, but instead to colour, which doesn't actually have any information and just redirects you a second time. A lot of our entries have this idea that there is some kind of "main" term, a lemma of sorts, which has inflections. But as you saw here, the lemma isn't always the actual lemma (the page that defines the term). Instead, we've created a kind of intermediate tier that is not a lemma, yet it has inflections as if it were a lemma. The result is this double indirection.

Having to hunt for links just to get to the definitions of a term is really bad for users. Someone who looks up colours is not interested at all in colour, which has no useful information. They are looking for color, where the definitions, etymology, translations and everything else useful are. And it begs the question: why is colours not defined as an alternative form of colors? It's equally valid, after all. Moreover, forcing this kind of "sublemma" structure gets really confusing in cases where it doesn't work so neatly. A single form could belong to multiple possible sublemmas (alternative forms). better is the comparative of good, but it is equally the comparative of the alternative goode. In highly inflected languages, you can have quite complicated situations, where there are multiple possible lemma forms, yet all the other inflections are shared. Inflections can sometimes have their own inflections; participles are well-known examples. All this increases the mental burden on the editor who somehow has to figure out how to translate the situation into Wiktionary's conventions, and also on the user who has to jump through multiple hoops to get to the real lemma.

I would like to re-examine the relationship we have between lemmas and forms. There is really only one true lemma here, because only one of the entries has a definition. It's the relationships between the different forms that is throwing us off, because we introduce concepts like "alternative forms": lemmas that aren't lemmas. The way I would analyse the situation above is that there is one lemma (which itself has no inherent written representation) with multiple possible representations of both the singular and plural. color and colour are singular forms of this lemma, and colors and colours are plural forms of this lemma. Each of the forms is used by some subset of English speakers, but they all belong to one lemma, not two. We are hamstrung by the need to place definitions, etymology and translations on the page of one of those forms, and by convention that is the singular, so we picked one of the possible singular forms and placed everything there. But it would be beneficial if we could let go of the idea that the singular is therefore "special", that it has its own inflections and cannot be an inflection itself. There is really no need for alternative forms, and the complications they bring, if we can accept that color is simply the lemma of four forms: color, colour, colors and colours. —Rua (mew) 20:36, 3 April 2019 (UTC)

I don't like your way of doing it because it suggests to me that someone took the plural colours and decided to respell the plural specifically. I think the real solution here is to come up with a system that can show the full entry regardless of which spelling is visited, with appropriate modifications (I realise this won't be easy due to accuracy of citations etc.). Equinox 20:45, 3 April 2019 (UTC)
That's only because we have to choose one of the possible spellings/forms to place the definition at. If we didn't have to do that, if the lemma could be entirely detached from the way it's spelled, then that would no longer be a problem. They'd simply all be lemmas of entry 19515, or something like that. Unfortunately, as I said, we're hamstrung by having that requirement. However, I don't think that should be an excuse to convolute the relationships on purpose, by introducing multiple "fake" lemmas as intermediates when there is really only one. —Rua (mew) 20:50, 3 April 2019 (UTC)
Also note, I'm not directly proposing anything to be supported or opposed. Rather, I'd like people to challenge the assumptions we've always made on Wiktionary, and consider other options. Some of what I said is inspired by Wikidata's data model. Wikidata strictly separates lexemes from forms, where lexemes contain one or more forms, but always at least one. Forms have grammatical properties such as "singular" or "plural", they have a written representation, and they have a pronunciation, all of which lexemes do not have. The representation of the lexeme (the lemma in their terminology) is not strictly tied to how it's written. The lexeme for our color is titled colour/color for example. It seems that most of the problems I described above arise from tying lexemes too closely to one particular written form. If we could treat the lemma form as simply the place where everything is gathered, and not as a word, then things might be easier for us. —Rua (mew) 21:01, 3 April 2019 (UTC)
I can supply a slightly more extreme example: Medises, an inflected form of Medise, a (variant capitalization of medise, which is itself a) variant spelling of medize. I wouldn't want to define Medises as being an alt form of medizes and link to that non-lemma, but I think we could simply pipe the link to the lemma, i.e. define Medises as: Third-person singular simple present indicative form of [[medize|Medise]]. "Colours" could likewise be: plural of [[color|colour]]. The only downside is that that might be what Wikipedia calls an "Easter egg", a link that doesn't go where a reader would necessarily expect, if they expect it to go to the display form and not the place where the content is. However, that doesn't seem much different from how e.g. Mēdōrum goes to Medorum, not Mēdōrum, and since medize mentions Medise as an alternative spelling, a reader should not be confused for long. Would that be a simple solution? (Is this what you were already thinking of, or...?) - -sche (discuss) 23:28, 3 April 2019 (UTC)
How about something like this: at colours have "plural of colour (see [[color]])" giving "plural of colour (see color)". That way they see both forms, but they're linked to the lemma. Chuck Entz (talk) 03:10, 4 April 2019 (UTC)
What should be done for cases where the inflections belong to multiple alternative forms of the same lemma? Or the extreme case where the lemma form is the only form that differs between them? —Rua (mew) 11:39, 4 April 2019 (UTC)
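
The lexeme/form split discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the idea that every written form points straight at one lexeme, with no intermediate "alternative form" tier; the class names, the identifier and the placeholder gloss are invented for the example and do not reproduce Wikidata's actual schema or any Wiktionary module.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str   # the written form, e.g. "colours"
    features: frozenset   # grammatical properties, e.g. {"plural"}


@dataclass
class Lexeme:
    lexeme_id: str        # opaque identifier, deliberately not a spelling
    gloss: str
    forms: list = field(default_factory=list)


def build_index(lexemes):
    """Map each written representation to the (lexeme, form) pairs owning it."""
    index = {}
    for lexeme in lexemes:
        for form in lexeme.forms:
            index.setdefault(form.representation, []).append((lexeme, form))
    return index


color = Lexeme(
    lexeme_id="L-example",           # made-up identifier
    gloss="(placeholder gloss)",     # stands in for the real definitions
    forms=[
        Form("color", frozenset({"singular"})),
        Form("colour", frozenset({"singular"})),
        Form("colors", frozenset({"plural"})),
        Form("colours", frozenset({"plural"})),
    ],
)
INDEX = build_index([color])
```

Looking up `INDEX["colours"]` yields the lexeme and the plural form in a single step, which is the "one lemma, four forms" picture. Because the index maps a spelling to a list, a form shared by several lexemes (the case raised in the last question above) would simply have more than one entry.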

Proposed change to zh-der

zh-der currently automatically provides the Mandarin pinyin for entries that have Mandarin pinyin in zh-pron. But for those entries which don't have Mandarin pinyin in zh-pron, no romanization is given. I propose including the non-pinyin romanizations like the Yueyu Pinyin and Min Nan POJ. It does not have to be well thought out or well planned at this stage, it just needs to happen and then be refined over time. --Geographyinitiative (talk) 22:51, 4 April 2019 (UTC)

It would be very confusing to mix up different romanisations. Also, I think this topic is for Chinese editors only, so it can be discussed at Wiktionary talk:About Chinese rather than here. --Anatoli T. (обсудить/вклад) 23:14, 4 April 2019 (UTC)
moved to Wiktionary talk:About Chinese per suggestion --Geographyinitiative (talk) 23:30, 4 April 2019 (UTC)

IPA-to-speech

Hi!

Are there any IPA-to-speech projects here?

I see there are a few FOS engines out there. How would/could they be incorporated?

Thanks. Saintrain (talk) 18:37, 6 April 2019 (UTC)

No such projects here, and also no plans. We’d rather have no audio representation than an inaccurate one. Even in narrow transcription IPA cannot reflect all nuances of human speech.  --Lambiam 07:06, 7 April 2019 (UTC)

Vote on excluding typos and scannos is live

A heads up: the vote on a proposed change to CFI that would exclude typos and scannos is now open. (See also the thread above titled CFI-amendment: excluding typos and scans.)  --Lambiam 07:13, 7 April 2019 (UTC)

Read-only mode for up to 30 minutes on 11 April

10:56, 8 April 2019 (UTC)

Fortunately for us, English Wiktionary isn't on the list at phab:T220080. — Eru·tuon 10:59, 8 April 2019 (UTC)

Category:English coordinated pairs

I came across this category and I'm trying to figure out what a coordinated pair is. We don't have a coordinated pair entry, and the description at the top of the category is not very helpful either. Could someone write a better description so me and future mes know what it's for? Thank you! —Rua (mew) 18:36, 8 April 2019 (UTC)

The membership in the category is an ostensive definition of the category. The meaning is SoP. I'll review the membership to see if any mes have erroneously included any terms. DCDuring (talk) 19:05, 8 April 2019 (UTC)
But what's "coordinated" about the pairs then? I really don't get it. It seems to be a category for just any pair of words that happen to appear together in an entry name. —Rua (mew) 19:39, 8 April 2019 (UTC)
I said I'd take a look and I have.
The easy cases are terms linked by coordinating conjunctions, principally and, or and their word-like equivalents ' n ', &, et. In these, each term in the pair is at the same grammatical and (usually) semantic level as the other. slowly but surely seems similar. The harder cases are the pairs linked by commas or hyphens/dashes. In ding-dong, willy-nilly (and others) the elements may or may not have distinct lexical existence and are, in any event, in Category:English reduplications. I'd be inclined to remove these from the category and refer to the reduplication category on the Category:English coordinated pairs page, either as a "see also" or by making it a subcategory. In another day, another dollar, finders, keepers, finders, keepers; losers, weepers, first come, first served and others, there is no coordinating conjunction. The semantic link seems to be not coordination but implication. I'd be inclined to remove these only if there is another plausible short category name that would describe them. I haven't thought of such a name.
I'd like other opinions. DCDuring (talk) 19:43, 8 April 2019 (UTC)
There is Coordination (linguistics) in Wikipedia. I don't pay much attention to the categories, but it would be nice if it had a description or a link to a Wikipedia article which would describe it. -Mike (talk) 21:36, 8 April 2019 (UTC)
If the description is updated, consider also updating Category:English coordinated triples. - -sche (discuss) 22:02, 8 April 2019 (UTC)

Japanese entry layout revisited

Hi. I'd like to propose the following long-term changes to the Japanese entry layout, and would like to have some of them incorporated into WT:AJA

  • A new citation format of Japanese terms: 日本 (にほん, Nihon, にっぽん, Nippon) or やまと (大和, , Yamato).
    • Currently many Japanese words are either cited with {{m|ja|...}} or {{ja-r}}. The disadvantage of the former is that there is no way to show both kanji and kana, or support multiple readings. The disadvantages of the latter are that (1) it takes up too much vertical space, discouraging editors from adding more synonyms, derived terms, etc., and (2) the font size of the kanji is too big compared to normal citations of Japanese terms, causing disharmony, and the size of the kana is too small on some computers, as Eirikr reports. I would like to employ the new format to cite all Japanese words and reduce the use of ruby to examples, and I think the best way is to modify {{ja-r}} to use the new format by default. This way we don't need to create new templates or mass-update mainspace entries. Please see User talk:Suzukaze-c#CSS for more.
    • I would also like to propose the new syntax {{ja-r|KANJI:KANA}} in addition to {{ja-r|KANJI|KANA}}. Editors can still use the second format, but other templates relying on {{ja-r}} can take advantage of the former. The reason is as follows: For most languages, one parameter is enough to enter a word (e.g. {{m|en|English}}), and the format of templates is pretty predictable (e.g. {{compound|en|place|holder}}). For Japanese, however, two parameters are often needed (e.g. {{ja-r|日本語|^にほんご}}), leaving different ways to place these parameters (e.g. {{ja-compound|日本|^にほん||}} versus {{ja-vp|終える|終わる|おえる|おわる}}). If we build the new syntax KANJI:KANA, then templates relying on it will have more consistent and more predictable syntaxes (e.g. {{ja-compound|日本:^にほん|語:ご}} and {{ja-vp|終える:おえる|終わる:おわる}}), which are also more interchangeable with kanji/kana only versions (e.g. {{ja-vp|終える|終わる}}).
    • What about automatic fetching of the reading from the mainspace entry? For example, {{ja-r|日本料理}} should produce 日本料理 (にほんりょうり, Nihon ryōri) while {{ja-r|日本}} could produce 日本 [Term?] because there are many readings possible.
  • Eliminate sortkeys. Once the use of soft-redirection ({{ja-see}}) is established, there will be no need to categorize the kanji terms under kana. This is because {{ja-see}} copies categories from the lemma spelling to the non-lemma spellings, so all spellings of the term will appear in the same category. If we eliminate sortkeys, the kana part and the kanji part of a category will contain the same set of vocabulary, once in kana and once in kanji, so there is nothing to lose. More importantly, editors are liberated from the constant need to watch for categorizing templates (such as {{lb|ja|...}}) and add sortkeys.
  • Is there consensus on whether to lemmatize the wago vocabulary at kana spellings? I prefer to lemmatize terms at the most common spelling as a general rule, but make the core wago vocabulary an exception to it. First, wago terms have a greater degree of independence from and variety in combination with kanji. The most common kanji spelling does not necessarily match the meaning in which the term is used (e.g. 帰る返り点), but kana is acceptable everywhere. Second, the etymology of non-transparent-compound wago terms is best illustrated by the kana form. In etymology sections, “くら (, kura) + (, wi)” looks better than “ (kura) + (wi)”. (By the way, when the focus is on the meaning, such as in synonym sections or entries from other languages, I think the kanji should still be put before kana.) On the other hand, I'm not sure about whether to do the same for transparent compounds like 繰り返す, which have less justification. This means that the border between “terms lemmatized at kana” and “terms lemmatized at the most common spelling (usually a kanji spelling)” can be very vague and arbitrary.
  • What about a custom reference template? {{ja-ref|DJR}} is much easier to type than <ref name="DJR">{{R:Daijirin}}</ref>. For common references, we can also make the template link to Wiktionary:About Japanese/references, rather than generating a <ref>, because ===References=== <references/> is also tedious to type :)
  • Simplify the interface of inflection templates. The current syntax is unnecessarily complex. I think only two formats are needed: {{ja-infl|type=1}} (for わらう) and {{ja-infl|つれて いく|type=iku}} (for 連れて行く; the space is merely for the purpose of romanization). Everything else, from slight irregularities (e.g. 行く, ある) to separating the stem and the ending (e.g. {{ja-go-u|わら}}) as well as detecting |sik= should be built into the module. This should make it easy for {{ja-see}} to copy inflection tables around. With the current templates, {{ja-see}} would need to recognize both Category:Japanese inflection-table templates and {{ja-conj-bungo}} as well as learn their quirks (such as remembering to add |sik= when copying from もうでく to 詣で来), which is too tiring and error-prone.

(To be continued.) (Notifying Eirikr, TAKASUGI Shinji, Nibiko, Atitarev, Suzukaze-c, Poketalker, Cnilep, Britannic124, Nardog, Marlin Setia1, AstroVulpes, Tsukuyone, Aogaeru4, Huhu9001, 荒巻モロゾフ, Mellohi!): --Dine2016 (talk) 10:04, 9 April 2019 (UTC)
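
To make the proposed KANJI:KANA syntax concrete, here is a small Python sketch of how a template argument might be split. It is only an assumption about one possible implementation: the function names are hypothetical, the split is on the first half-width colon, and the `^` marker is passed through untouched rather than interpreted.

```python
def parse_term(arg):
    """Split 'TERM:READING' into (term, reading); reading is None if absent."""
    term, sep, reading = arg.partition(":")
    return (term, reading if sep else None)


def parse_args(args):
    """Parse positional template arguments, one term per argument."""
    return [parse_term(arg) for arg in args]
```

Under this sketch, `parse_args(["日本:^にほん", "語:ご"])` gives two (term, reading) pairs, while `parse_args(["終える", "終わる"])` gives the same shape with `None` readings. That is the predictability argument made in the proposal: each positional argument is always exactly one term, whether or not a reading is attached.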

  • Generally in favor.
What is sik?
‑‑ Eiríkr Útlendi │Tala við mig 21:37, 9 April 2019 (UTC)
It is short for suffix_in_kanji, one of the parameters of {{ja-conj-bungo}}, used for example in the conjugation of .  --Lambiam 15:11, 10 April 2019 (UTC)
  • I have no real objections (though I don't think I understand all of the technical specifics). I do support the principle of using the most common form of a lemma rather than having a language-wide rule. In other words, treat e.g. 日本, 学校, する, and きれい each as 'main' entries, rather than having a general preference for kanji or kana. Cnilep (talk) 03:54, 10 April 2019 (UTC)
  • I think these are improvements that I expect to be uncontroversial. Some of these proposals are easy to implement, but I feel a plan is needed on how to roll out the more involved changes. As to lemmatization – apart from the fact that we need to strike a balance between what is most useful to the users and what is a reasonable effort to ask from the editors – which form to prefer is an issue for all languages offering alternatives that in the end needs to be addressed on a case-by-case basis, and if for any specific case two choices are more or less equally good (or bad), there is no point in losing sleep over which one to choose. It will be helpful to offer advice on such issues in Wiktionary:About Japanese.  --Lambiam 15:11, 10 April 2019 (UTC)
  • Support.
    • As for the new syntax, perhaps it can be implemented in the major link templates, so that we can use {{compound|ja|FOO:ほげ|...}} or {{compound|ko|방:房|...}}. (Or perhaps make it a new parameter in the style of {{{ts}}}, if there is larger objection to :.) Personally, I am worried about : taking on too much responsibility in the linking templates.
    • Sortkeys: no particular comment.
    • Wago: +1 for kana.
    • However, I don't really like {{zh-ref}}, TBH. (Well, I have =3+hr+ expand to ====== + References + <references/> in my IME; maybe that's why I'm not terribly bothered.)
    • inflection templates: Absolutely.
  • Suzukaze-c 19:58, 11 April 2019 (UTC)
    • Support everything but Oppose lemmatising wago terms on kana entries. Like before, we should lemmatise on the actual most frequent Japanese spelling, so  () (yomu) is the lemma, IMO, not よむ (yomu).
    • I don't understand what is going to happen with eliminating sort keys. Will 日本 (にほん) (Nihon) still be sorted by "に"? Also, how are terms with multiple readings going to be sorted?
    • Welcome to all new Japanese specific templates, they are overdue.
    • I think we also need to add categories for Sino-Japanese terms, similar to the Korean and Vietnamese but possibly split into smaller categories, considering the complexity of etymologies, reduce info in kyūjitai entries. Care should be taken when using Middle Chinese templates for sources but this should be encouraged. --Anatoli T. (обсудить/вклад) 01:22, 12 April 2019 (UTC)
  • Curious as to your opposition to lemmatizing yamato kotoba at the kana spelling? The kanji spellings are irrelevant to the etymologies of yamato kotoba, only being applied later when Chinese characters were borrowed, and lemmatizing at kanji spellings actively obscures cognacy and relationships.
Take the verb tsuku, for instance. By kanji, this could be spelled 付く・着く・就く・即く・憑く・突く・衝く・撞く・搗く・舂く・築く・吐く・漬く・浸く・尽く・歇く・竭く. Most of these 17 spellings are etymologically related, sometimes very closely indeed. Lemmatizing by kanji spelling hides this interrelationship and adds confusion, and necessitates a lot of data duplication across entries. ‑‑ Eiríkr Útlendi │Tala við mig 03:38, 12 April 2019 (UTC)
Agreed re: the failure of sortkeys. The current approach was based on the assumption that the back-end capability would eventually support multiple sortkeys for a given lemma string. We reported the MediaWiki shortcoming years ago, and received zero response from the devs -- 黙殺された. It's clear they don't give two shits, so we clearly need to change our approach if we want something workable. ‑‑ Eiríkr Útlendi │Tala við mig 04:40, 12 April 2019 (UTC)
@Dine2016, Eirikr: OK, agreed and Support on both points and sorry for doing this again to you. I completely forgot about the convincing つく-argument :) --Anatoli T. (обсудить/вклад) 05:24, 12 April 2019 (UTC)
@Eirikr, Atitarev: Honestly speaking, I'm not sure if making an exception for wago terms is really a good idea. One problem with kanji spellings is that the most common spelling does not necessarily cover all meanings of the term (while kana does). For example, the 帰る spelling of かえる does not cover the sense “to turn over”, so that the etymology of 裏返る has to be written as “ (ura, ) + 返る (kaeru, alternative spelling of 帰る in the sense ‘to turn over’)”. If we take a kana-centric approach to wago terms, then the etymology of うらがえる is simply “うら (ura, , …) + かえる (kaeru, 返る, 反る, ‘to turn over’)”. Another problem is that wago terms may appear as the reading/furigana to entirely irrelevant kanji, such as in people's names. However, such problems only concern a small percentage of the wago vocabulary, so I doubt whether it's really worthwhile to use the kana spelling for all wago terms, especially transparent compounds such as 追い払う(注). I think an alternative approach is to either (1) just lemmatize at the most common kanji spelling, but still list the whole range of kanji with {{ja-spellings}}, and sense division with {{ja-def}}, or (2) break the word into different sense groups (e.g. かえる(帰る・還る) and かえる(返る・反る)), and lemmatize each of them as if they were different words, but use soft redirection for the etymology and pronunciation sections to avoid data duplication (cf. Daijirin's treatment of 帰る as 〔「かえる(返)」と同源〕). This way every word is lemmatized at the most common spelling, and everyone is happy. --Dine2016 (talk) 06:09, 12 April 2019 (UTC)
Um, maybe we can justify the wago exception on the basis that the JA WT is also making it. Or this argument: “if the ‘lemmatize at the most common spelling rule’ were applied for Chinese, then each Chinese word would be lemmatized/mentioned in Simplified Chinese or Traditional Chinese based on whether it is used more frequently in {Mainland China and Singapore} or {Taiwan, Hong Kong and Macau}, which would be too absurd.” --Dine2016 (talk) 06:50, 12 April 2019 (UTC)
Using simplified over traditional as the main entry is a legitimate request, which has been discussed but discarded for very important reasons, etymological and technical. BTW, your link is not working: ja:Wiktionary:項目名の付け方. --Anatoli T. (обсудить/вклад) 07:12, 12 April 2019 (UTC)
  • I don't work on Japanese entries but wanted to make a general remark about something which came up while working on parsing code to find wanted entries (replacement for Template:redlink_category): if we can avoid specialized templates like {{ja-compound}} it will really help to make these sorts of automated tasks much simpler. Otherwise we need to have additional logic to cover the language specific linking templates. The general idea would be to push the responsibility into the core linking code (which could internally still delegate to other modules). This would keep the template "surface area" small. Another thing to avoid is nesting inside linking templates: I've seen some instances of {{bor|en|{{ja-r|....}}}} which is tricky to parse and produces invalid output. {{bor}} should be able to figure out what to do when used with Japanese entries. – Jberkel 00:16, 16 April 2019 (UTC)
    Support language-specific logic that is incorporated into the "main" templates. —Suzukaze-c 18:19, 18 April 2019 (UTC)
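To illustrate the nesting problem raised above, here is a minimal sketch of how a parser might flag one template nested inside another. The function name `find_nested_templates` is made up for this illustration and is not part of any existing Wiktionary tool. It is a simplification: it only tracks `{{` and `}}` pairs and does not handle triple-brace parameters, `<nowiki>` sections, or parser functions.

```python
def find_nested_templates(wikitext):
    """Return (outer_name, inner_name) pairs for templates nested in another template.

    Hypothetical helper for illustration only; it tracks "{{" and "}}" and
    ignores triple-brace parameters, <nowiki> and parser functions.
    """
    stack = []   # names of the templates that are currently open
    nested = []  # (outer, inner) pairs found so far
    i = 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            # The template name runs up to the first "|", "{" or "}".
            j = i + 2
            while j < len(wikitext) and wikitext[j] not in "|{}":
                j += 1
            name = wikitext[i + 2:j].strip()
            if stack:
                nested.append((stack[-1], name))
            stack.append(name)
            i = j
        elif wikitext.startswith("}}", i):
            if stack:
                stack.pop()
            i += 2
        else:
            i += 1
    return nested


print(find_nested_templates("{{bor|en|{{ja-r|寿司|すし}}}}"))
```

A flat sequence of templates yields an empty list, so a script could use the result to build a cleanup list of entries where a linking template wraps another one.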

Allographic variants

@Eirikr, Suzukaze-c, Lambiam, Atitarev While I was editing まま, it occurred to me that there are allographic variants among kanji forms which are fully exchangeable in writing without regard to reading. For example, and are essentially the same kanji whether read as まま or まんま, and 間々 and 間間 are essentially the same kanji form whether read as あいあい, あいだあいだ, ひまひま or まま. Therefore I would like to create a variant of {{ja-see}} with a recursion depth of two instead of one:

{{ja-see|儘|v}}
For pronunciation and definitions of – see the following entries.
まま【儘】 ⇒まま
[particle]as it is; remaining in a certain state; while; still
まんま【儘】 ⇒まんま
[particle](uncommon) Alternative form of まま (mama, as it is; remaining in a certain state; while still)
(This term, , is an alternative spelling of , which in turn is a kanji spelling of several terms.)

Under this approach, only "canonical kanji forms" will contain a list of readings (e.g. soft redirects to まま and まんま), while other kanji forms will simply redirect to the canonical kanji form (e.g. soft redirects to rather than duplicates its content) and have the template fix "double redirects".

For this we need to define what the "canonical kanji form" is. For example:

  • Should we allow extended shinjitai and lemmatize tōrō “lantern” at 灯篭, or should we stick to the official shinjitai list and lemmatize it at 灯籠? I think we need to have a standard if we want to build jitai conversion modules.
  • Should we allow the 踊り字 in canonical titles? I prefer to do so, because it's an essential part of modern orthography just as shinjitai and modern kana spelling are.

Also, how should we list the variants of a canonical kanji form, such as kyūjitai? It seems that there are two ways to present kyūjitai: either we limit ourselves to JIS X 0208/0213 to comply with Japanese computing, or we utilize Unicode as much as possible to adhere to the Kangxi dictionary printing forms. If the latter, we might want to list as the kyūjitai of , of , or even 𥳑 of , the last of which seems to lack font support.

(Well, sometimes the orthographic variants are not fully exchangeable. For example, needs to fetch a subset of content from , and so does けふきょう, which complicates matters.) --Dine2016 (talk) 15:47, 12 April 2019 (UTC)
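The two-level lookup proposed above can be sketched roughly as follows. This is only an illustration of the "recursion depth of two" idea, not how {{ja-see}} or its module actually works. The tables `VARIANT_OF` and `READINGS` and the function `resolve` are hypothetical, and treating 間々 as the canonical form is an assumption taken from the preference stated in the post.

```python
# Hypothetical table: allographic variant -> canonical kanji form (assumed).
VARIANT_OF = {"間間": "間々"}

# Hypothetical table: canonical kanji form -> kana lemmas it redirects to.
READINGS = {"間々": ["あいあい", "あいだあいだ", "ひまひま", "まま"]}


def resolve(title, max_depth=2):
    """Follow soft redirects from a kanji form to its kana lemmas.

    Returns (chain, readings): the chain of titles visited and the kana
    lemmas found, or an empty list if none is reached within max_depth steps.
    """
    chain = [title]
    current = title
    for _ in range(max_depth):
        if current in READINGS:
            return chain, READINGS[current]
        if current not in VARIANT_OF:
            break
        current = VARIANT_OF[current]
        chain.append(current)
    return chain, []


print(resolve("間間"))               # variant -> canonical form -> readings
print(resolve("間間", max_depth=1))  # one step is not enough for a variant
```

With a depth of one, the variant form reaches the canonical form but never its readings, which is the "double redirect" the proposal wants the template to fix.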

As to “canonical kanji”, inasmuch as lemmatization is automated (it may be prudent to allow overrides), following the 2010 jōyō kanji list has the advantage of being a clear standard and avoiding potentially endless debates over which character is to be preferred on a case-by-case basis – but a disadvantage is that this may (at least in some cases) not be what most users would expect. But, as I wrote above, there is no ideal solution to lemmatization, and occasionally having to follow a soft redirect is (IMO) not a big deal. Whatever is decided, the decisions should be encoded in tables used by the software modules, so that future revisions of the list can easily be incorporated. As to the internal representation of other kanji, I am somewhat partial to Unicode as being the more portable approach across platforms and probably the future also of Japanese Industrial Standards. Disclaimer: I have no experience whatsoever editing Japanese entries, so my opinions should not be assigned as much weight as those of experienced editors.  --Lambiam 16:53, 12 April 2019 (UTC)
I compiled a list of 398 official shinjitai at Template talk:ja-spellings#kyūjitai, of which 67 kyūjitai were found to be encoded using CJK Compatibility Ideographs. Since modern computing systems now have better font support for Japanese glyphs, I would prefer to comply with Japanese computing for better searchability. We can still list older forms such as and 𥳑 which are not in JIS X 0208/0213 as "historical kanji" rather than "kyūjitai" and nonstandard simplified forms such as as "extended shinjitai" rather than "shinjitai". KevinUp (talk) 02:51, 13 April 2019 (UTC)
It seems reasonable to me. Perhaps we should enforce the use of "official" Japanese kanji as main spellings, including 籠, for the sake of consistency. And I prefer 々. —Suzukaze-c 22:04, 13 April 2019 (UTC)
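On the point above about kyūjitai encoded as CJK Compatibility Ideographs: these code points have canonical decompositions, so Unicode normalization (NFC) replaces them with the corresponding unified ideograph. The sketch below, using only Python's standard `unicodedata` module, lists which characters in a string would be changed by NFC. The function name `normalization_report` is invented for this example, and whether a given wiki or search tool applies NFC to titles is an assumption to verify separately.

```python
import unicodedata


def normalization_report(text):
    """List (original, normalized) code points for characters that NFC changes."""
    report = []
    for ch in text:
        nfc = unicodedata.normalize("NFC", ch)
        if nfc != ch:
            original = f"U+{ord(ch):04X}"
            normalized = " ".join(f"U+{ord(c):04X}" for c in nfc)
            report.append((original, normalized))
    return report


# U+F900 is a CJK Compatibility Ideograph; NFC maps it to the unified U+8C48.
print(normalization_report("\uF900"))
# A unified ideograph such as 籠 is left untouched.
print(normalization_report("籠"))
```

A table of kyūjitai could be run through such a check to see which forms cannot survive as distinct page titles or search strings wherever NFC is applied.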

Proposal to look to Wikisource for citations.

I think that perhaps we should establish a practice of making Wikisource the first place that we look for citations for words, particularly older words. There are now thousands of books transcribed there. Cheers! bd2412 T 01:26, 10 April 2019 (UTC)

Why? —Μετάknowledgediscuss/deeds 01:29, 10 April 2019 (UTC)
It's hard to tell how accurate a cite is with just a sentence of context sometimes, and even if that editor can see the full context in Google Books, other editors, depending on their location and sometimes dumb luck, may not be able to. Wikisource will show the whole context to all users.--Prosfilaes (talk) 04:08, 10 April 2019 (UTC)
Wikisource is hardly the only site with full texts. If this is about providing more links that could be discussed. I see no reason to favor Wikisource over other sites such as archive.org. DTLHS (talk) 04:39, 10 April 2019 (UTC)
Archive.org doesn't generally provide full transcribed text, and the scans on Archive.org can often be quite slow to flip through. Wikisource offers both transcribed text and usually a link to the original scan.
Besides which, Wiktionary is hardly the only site with definitions. Should Wikisource work with Wiktionary, or should we link to other dictionary sites?--Prosfilaes (talk) 05:48, 10 April 2019 (UTC)
Archive.org is a rich but messy resource, some works have dozens of scans in varying quality, taking up precious editor time. Wikisource is definitely preferable here. There have been a few (community wishlist) proposals around to build tools to automatically extract and format quotations for the use in Wiktionaries but as far as I know nothing has materialized. – Jberkel 07:20, 10 April 2019 (UTC)
For what it's worth, {{Q}} (Module:Quotations) links to Wikisource quite a bit, for instance when you add a reference to the Iliad or Odyssey in an Ancient Greek entry: {{Q|grc|Il.|1|477|form=inline}} displays as "Homer, Iliad 1.477". — Eru·tuon 05:00, 10 April 2019 (UTC)
It might be particularly useful for all the requests for quotes from particular authors the templates for which some find annoying.
Any bias toward Wikisource is also a bias toward out-of-copyright sources and therefore old sources. I don't think we need that at all, even for terms that have been around for a while. DCDuring (talk) 12:21, 10 April 2019 (UTC)
A bias toward Wikisource over other similar collections of out-of-copyright sources doesn't change the overall issues. I'd like to have more quotes from the birth of our language. My problem is more about the dead period, from 1924 through ~1995 where we have the same problem basically anywhere we look. The works just aren't publicly available for copyright reasons anywhere.--Prosfilaes (talk) 03:28, 16 April 2019 (UTC)
  • I use Wikisource all the time for quotes. And there are the awesome lists on User:DTLHS/eswikisource. We should have User:DTLHS/enwikisource too, of course. I believe I asked D to make me one but the reply was something along the lines of that it was "full of crap" - yes, they were the exact words D used. --I learned some phrases (talk) 12:24, 11 April 2019 (UTC)

To expand on my original post:

  1. Wikisource is a sister project of ours, and as a Wiki any of us can edit there, meaning that we have some measure of control over what gets put there.
  2. Due to its joined status as a Wikimedia project, Wikisource is about as stable as Wiktionary. Other websites may disappear out from under our noses, but it is likely that Wikisource will exist as long as Wiktionary exists.
  3. To DCDuring's point, yes, Wikisource does have a lot of old sources but:
    1. We have a lot of old words, and there's nothing wrong with old citations if they define the word accurately.
    2. Wikisource actually does also have a lot of recent material, particularly public domain government documents including reports from various areas of specialization, and some case law; it can permissibly host much more of that.
    3. Didn't we just have this discussion last month about all these Webster's 1913 requests for quotes? Guess which Wikimedia project would be the one to host all the works from which those quotes could be found.
  4. Further to Jberkel's point, we could develop a tool to find and extract sentences containing sample words from Wikisource. It seems reasonable that somebody should be able to make a concordance of Wikisource, or of a particular subset of Wikisource texts.

Cheers! bd2412 T 22:11, 12 April 2019 (UTC)
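As a rough idea of what the extraction tool in point 4 might do, here is a minimal sketch that pulls out the sentences containing a given word from a text that has already been fetched. The function name `concordance` is made up for this example, the sentence splitting is deliberately naive (it breaks on abbreviations), and nothing here talks to Wikisource itself.

```python
import re


def concordance(text, word):
    """Return the sentences of `text` that contain `word` as a whole word.

    Naive sketch: sentences are split after ".", "!" or "?" followed by
    whitespace, and matching is case-insensitive.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


sample = "The cow was milked. Milking is done by machine! Nothing here."
print(concordance(sample, "milking"))
```

A real tool would also need to record the work, author, year and page for each hit so that the result can be formatted as a quotation.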

Does Wikisource have Congressional committee testimony, especially Q&A? That's linguistically valuable and sometimes fun. Bureaucratic reports, not so much fun. DCDuring (talk) 02:05, 13 April 2019 (UTC)
That certainly falls within the remit of Wikisource, although I don't know how much of it there actually is at this time. bd2412 T 15:40, 13 April 2019 (UTC)

6 million entries

According to Equinox, Finnish konelypsy (automatic milking) is our six millionth entry, created by User:Surjection. —Μετάknowledgediscuss/deeds 14:34, 10 April 2019 (UTC)

That sounds like enough, job done. Time for us to find some new, worthy project; I wonder if Wikipedia still needs help generating lists of Pokemon... - TheDaveRoss 14:43, 10 April 2019 (UTC)
There are still one or two words in Wiktionary:Wanted entries so we shouldn't give up just yet. SemperBlotto (talk) 14:45, 10 April 2019 (UTC)
Onward to the six million and two!  --Lambiam 15:15, 10 April 2019 (UTC)
Did McDonald's stop at 6 million burgers? I think not. -Mike (talk) 20:40, 10 April 2019 (UTC)
Surjection and his milk again. *sigh* --I learned some phrases (talk) 12:21, 11 April 2019 (UTC)
  • Next question: who is the most prolific entry creator? DonnanZ (talk) 23:37, 18 April 2019 (UTC)
Equinox, followed by SemperBlotto. If you count machines then SemperBlottoBot, then WingerBot, then Equinox, then NadandoBot, then SemperBlotto. - TheDaveRoss 00:04, 19 April 2019 (UTC)
Oh, and if you only count euphemisms by sockpuppets, Wonderfool. - TheDaveRoss 00:05, 19 April 2019 (UTC)
Thanks, a predictable answer, I suppose, but I didn't think of pages created by bots. Do you base your figures on pages created by each editor? I do this for my own paltry figure. DonnanZ (talk) 08:29, 19 April 2019 (UTC)
This stats site gives a rundown of the top 76. If they had just one account over the years instead of 200, Wonderfool would be in 4th place, actually. --I learned some phrases (talk) 12:34, 19 April 2019 (UTC)
OK, so that doesn't have page creations. Where do you get your results, Dave? --I learned some phrases (talk) 12:36, 19 April 2019 (UTC)
It does give creates, that is the right set of columns. It is also no longer being updated with 2019 data and beyond, it has been replaced by Wikistats 2 which is garbage for things like user stats. X's tools is still current, but doesn't show lists of users. Not sure if there is a better view of users by contribution count available currently. - TheDaveRoss 12:44, 19 April 2019 (UTC)
Also WF is including bot edits in his count, but not in anyone else's, so [Citation needed]. - TheDaveRoss 12:46, 19 April 2019 (UTC)
It looks as though I rank 12th for edits, and 4th for creates (which is quite astonishing). If I look on my watchlist at "pages watched not counting talk pages" that gives me the current figure (56,708) as all pages created are automatically watched (and I don't watch any other pages). The 53,106 figure for creates in those stats is out of date of course, but seems to be accurate. DonnanZ (talk) 17:53, 19 April 2019 (UTC)

How should gerunds be handled?

In English, gerunds seem to be entirely ignored, I guess because they are always identical to the present participle. However, that doesn't apply for other languages. There are a few specific cases where this is relevant.

The first is Dutch. Dutch has a gerund, but it's identical in form to the infinitive, which is also the lemma form. We usually don't make form-of entries for forms that are the same as the lemma, so we have no entries for Dutch gerunds at all. It is mentioned in the inflection table, though, see roepen. As shown in the table, the gerund has neuter gender. Should every Dutch verb have a separate entry for the gerund?

The second case concerns German and West Frisian. In both of these languages, the gerund is also neuter, but it's not identical in form to the lemma. In German, there is a difference in capitalization, which also shows that gerunds are treated as nouns. In West Frisian, it's identical to the long infinitive, which is something the other languages don't have (but Old English had it). There seem to be a bunch of entries created for German gerunds already, in Category:German gerunds, and they are given a Noun header with its own inflection table. West Frisian barely has any entries for verb forms yet, so there is no precedent to go by.

The implication I take from the German treatment is that we should really be treating the English, Dutch and West Frisian gerund as nouns in their own right too. After all, why would we have entries for German gerunds but not for English, Dutch and West Frisian ones? In German, the gerund is unique in its orthographic representation, so it can't just "piggyback" on another verb form, and must have its own entry. But gerunds aren't just verb forms in other respects. They can have genders, like nouns, and even case forms depending on the language. They can also take both definite and indefinite articles, as well as possessive and other determiners, in English too. We already treat participles specially in many languages, giving them their own Participle header to show that they aren't just verb forms, but are more like adjectives. The same could be argued for gerunds, but we don't currently have Gerund headers anywhere. Should we? Or should we call them Noun? The fact that gerunds have genders and case forms tells me that we shouldn't just be labelling them as Verb. A sticky point is that Dutch gerunds can have a direct object before them (Wiktionary bewerken is leuk!) and English ones can have it after them (Editing Wiktionary is fun!), which is something specific to gerunds and not shared with regular nouns. That speaks in favour of a separate Gerund header. —Rua (mew) 12:58, 11 April 2019 (UTC)

Why should we draw any implications whatsoever for English PoS from Dutch, German, and West Frisian inflection? Uniformitarianism is not the official religion of Wiktionary. DCDuring (talk) 17:55, 11 April 2019 (UTC)
I’m of 2.718 minds about this. On the one hand, it seems eminently reasonable. These gerunds are syntactically nouns, and therefore a heading “Verb” is misplaced. On the other hand, giving separate entries for all gerunds whose form is indistinguishable from a verb form will mean a lot of extra work. (In some cases the gerund has become a noun with a slightly different sense, like eten meaning “food”, not the act of eating, and such nouns definitely need a separate entry; here we consider the true gerunds whose meaning follows directly from the meaning of the underlying verb.) In Turkish, next to the infinitive (Sigara içmek yasaktır – Smoking is forbidden), also the third-person present simple and future can assume the role of a noun (çıkmaza girmiş – He has entered a dead end, literally a “does-not-exit”; gelecek bilinmezdir – the future is unknowable, literally the “will-come”); moreover, they can also serve as adjectives. (Normally these are called participles by grammarians, not gerunds, but I see no argument why the same reasoning would not apply here.)  --Lambiam 19:07, 11 April 2019 (UTC)
Considering that editing has a noun entry, are you just arguing that the header should be changed to "Gerund"? Could it not just be handled in an etymology section or as text at the beginning of the sense definition? -Mike (talk) 20:19, 11 April 2019 (UTC)
There is a noun entry for this particular verb, but every verb has a gerund. I'm saying that we should be making this a regular thing. —Rua (mew) 20:49, 11 April 2019 (UTC)
Is "editing" really a gerund in "Editing Wiktionary is fun"? Equinox 20:21, 11 April 2019 (UTC)
What else can it be? It's not a participle, unless you somehow read it as meaning that Wiktionary is doing the editing. —Rua (mew) 20:49, 11 April 2019 (UTC)

Wikimedia Foundation Medium-Term Plan feedback request

Please help translate to your language

The Wikimedia Foundation has published a Medium-Term Plan proposal covering the next 3–5 years. We want your feedback! Please leave all comments and questions, in any language, on the talk page, by April 20. Thank you! Quiddity (WMF) (talk) 17:35, 12 April 2019 (UTC)

Classical compounds in Category:English words by prefix and Category:English words by suffix

These categories are a complete mess right now, because we categorise all elements of Greek and Latin origin as affixes. As a result, the actual proper affixes of English are all but unfindable among all the noise. I think the problem here is our treatment of Greek/Latin elements. The combinations that are created when putting them together are called classical compounds, which makes their nature as compounds rather than affixed words very clear. While they are used productively in English and other languages, they follow their own rules, very different from true affixes:

  • They can be attached to each other, with no apparent root word, like anthropo- + -centric. You can't do this with real affixes: be- + -ness cannot make *beness.
  • They have a strong tendency to occur together. Often they can only be attached to each other, not to any other random word.
  • They originate in their parent language from root words, not affixes. Thus, combinations of them are not affixed words, but rather compounds. This is reflected in the English term for them, too.
  • One and the same term might be a prefix or suffix, with a difference in form. But what's really going on is that the shape depends on the position within the compound, final vs nonfinal. In informal use, words are adapted to this pattern by adding an o at the end of a nonfinal element.

Because of this, I don't think it does to call these "prefix" or "suffix", they're really their own kind of thing. I think in the interest of making the two above categories usable again, we should split the elements of classical compounds into their own kind of derivational category. There should at the very least be a Category:English classical compounds. We could have further subcategories based on the elements used, but I'm not sure if that's really fitting, given that these are compounds and we already tried and failed to categorise compounds by their elements before. I'm not sure about all the details of the solution yet, but I hope it's clear to everyone that something is wrong here. —Rua (mew) 20:55, 14 April 2019 (UTC)

Let’s define an English prefix or suffix to be something that is affixed (with possible morphological adjustments) to the stem of English words so as to form new English words, whose meanings for a given pre-/suffix are more or less derivable from the meanings of the words it is affixed to. Then indeed many of the entries currently advertized as English pre-/suffixes are miscategorized. The distinction with components with a classical pedigree is not always clear-cut, though, as seen in neologisms like user-centric ([3][4][5]) and Britain-centric ([6][7][8]). I think in these words -centric is a (productive) suffix. As another example, -ize is on the one hand a French suffix (-iser) that was lifted along with words like angliciser when they were anglicized – in these words it is not an English suffix but an anglicized French suffix; on the other hand, it is responsible for forming new words like dandyize, bowdlerize and mongrelize. While I agree with the drift of this gripe, I think “English classical compounds” is a misnomer. Whatever xeno- and -phobia are, they are not compounds, but components found in classical compounds (and sometimes used in making new compounds with a classy appearance). Perhaps Category:English classicistic components?  --Lambiam 17:27, 15 April 2019 (UTC)
I think you misunderstood a little. I'm not saying that the elements of the compounds should be called classical compounds, but rather the combinations formed from them. In other words, anthropocentric should not be categorised as Category:English words prefixed with anthropo-, nor as Category:English words suffixed with -centric, because it's neither. I do see your point about terms like user-centric, and in that case we might be able to consider them suffixes, but I'm not completely convinced if -centric is a suffix in that case either. And since it's not a classical compound, that's separate from the matter I'm describing here anyway. —Rua (mew) 17:34, 15 April 2019 (UTC)
Sorry, I indeed misunderstood. I agree we should remove anthropocentric from Category:English words prefixed with anthropo- and Category:English words suffixed with -centric; in fact, I just did by changing {{confix}} to {{compound}}. I am not convinced there is a need for a new category Category:English classical compounds. (If the need exists, we will presumably also want Category:French classical compounds and Category:German classical compounds; and what about Category:Ancient Greek compounds and Category:Latin compounds?)  --Lambiam 17:54, 15 April 2019 (UTC)
Distinguishing Greek and Latin won't really be practical, because some of these combine both, even if there are some purists out there that hate it. :) —Rua (mew) 10:59, 16 April 2019 (UTC)
Yes, the word television is an abomination that flies in the face of etymological decency. The horror! The horror! Perskyi pereat!  --Lambiam 21:42, 16 April 2019 (UTC)

Wiktionary:Random Competition 2019

Hello all, I decided it's time to kick-start the 2019 Wiktionary word game, which, for copyright reasons, is not like any other board game in the world. Ever. Any such resemblance is purely a fluke. User:Metaknowledge has won the last two years; let's try to knock them off the top. --I learned some phrases (talk) 00:27, 16 April 2019 (UTC)

Splitting Aramaic

It seems to me like we need to split the various stages of Aramaic into actual separate language codes, chief in my mind, Ancient Aramaic and Imperial Aramaic from Middle Aramaic, i.e. Jewish Babylonian Aramaic. I'm thinking [arc] should be reserved for the family code. @Fay Freak, Profes.I., Wikitiki89, -sche, Metaknowledge, thoughts? --{{victar|talk}} 03:57, 16 April 2019 (UTC)

No. I don’t know why Jewish Babylonian Aramaic would be Middle Aramaic while Galilean Aramaic would not. And Biblical Aramaic is still not so distinct from Jewish Babylonian Aramaic. And Imperial Aramaic is not that far off. And what would even be Babylonian? If some people wrote Aramaic in Spain I would not know if it is “Jewish Babylonian Aramaic”. And from which Aramaic would all the Arabic, Armenian and other terms that are said to be from Aramaic derive? Working on the premise that the Aramaic form is the same or similar enough, sometimes only a more modern form is given (as, for example, when one gives the now leading German form when there have been a lot of forms before but a language derives from earlier German, and it is not clear exactly which form); we have customarily given “Babylonian” forms from which other language terms are derived. These are all highly constructed and useless distinctions that do not resemble the actual language situations. Other dictionaries do not necessarily distinguish either, though some restrict a dictionary to a certain “dialect”. The various terms are more distinctions of genres of texts, classifications of corpora, that is, for literary studies, than useful for linguistics, or specifically lexicography. What would you gain except pain from splitting?
General rule: If the set of grammar is essentially the same, it is the same language. One should recognize that some languages move slower than others. So “Aramaic” spans two thousand years or more before deserving split language codes, and Arabic has also only one over one and a half thousand years and rightly though some dialects coexist with this Dachsprache, whereas over this time span French has four (Latin, Old French, Middle French, New French), but most other Romance languages only three (Latin, Old Spanish, Spanish), and even that being under the suspicion of being too much as the difference is not so great (“Old Italian” has hardly been used here).
Why would the situation for Aramaic be different from what is now seen in Arabic? They all wrote a Dachsprache even if dialectally the differences might have been greater, amounting to “different languages” (which isn’t a clear concept either with the modern Arabic dialects). Only after being conquered by Arabs did the unity dissolve. What you want to do is like removing Arabic as a language and treating it only as a group because one sees some “stages” and unintelligibility between the “actually spoken” languages – but there is continuity too. The situation with Greek seems even to be similar, for is it any different from splitting Ancient Greek into “Attic”, “Aeolic”, “Ionic” etc. and “Koine Greek” and “Byzantine Greek” (“Middle Aramaic”)? If something is only from a certain period one can state it in labels, but splitting the alleged stages is de trop. Fay Freak (talk) 13:05, 16 April 2019 (UTC)
@Fay Freak: Sooo... Scots and English are separate languages but not Jewish Babylonian Aramaic and Imperial Aramaic? Look, I'm a huge advocate for merging dialects and do so on the regular, but these are two distinct languages, with their own pronunciations, morphologies, written in two different scripts, and separated by hundreds of years. The delineation seems pretty clear to me, far more than, say, Old French and Middle French. --{{victar|talk}} 15:23, 16 April 2019 (UTC)
But also, what is this entry? It lacks any and all labels. What are its sources? Is it ancient, and if so, are these vowel points true to the attested word, or are they hypothetical? --{{victar|talk}} 16:57, 16 April 2019 (UTC)
See, you force people to write things that they don’t know.
What kind of question is that even, “is it ancient?”? People see words as Aramaic, they add them as Aramaic. Who would split all the Aramaic entries? You wouldn’t. Nobody is there who would. You are proposing a thing that is impossible to accomplish, going against existing desires: One could have separated the lects by labels already, but the desire to separate has not been there, and you won’t create it against editors who have hitherto been reluctant to separate.
And again, you ignore the principle of unity while claiming they have been “separated by hundreds of years”. Cicero is separated from us by more than two thousand years, and yet Stephanus Berard writes the same language. There is back-coupling and cross-coupling and it is as important as and sometimes more important than evolution. And yeah, there is no reason why there wouldn’t be Classical Nahuatl from 2019 if an author subscribes to the old rules. Years and scripts are not even an argument at all, and pronunciation only with caution. A merger of sounds whose distinction is not even expressed in script, as with begedkefet, is rather an argument against splitting for lexicographical purposes because the differences are not relevant on the token level graphically. The fact that we distinguish “Imperial Aramaic script” and “Hebrew script” is delusive: It is two scripts but it is also the same script, much like Cyrillic and Latin Serbo-Croatian but diachronically, or even closer. Morphology: Dubious, I stressed the differences must be essential: The fact that some or many Classical Arabic constructions and derivation types are now not used does not mean Modern Standard Arabic is not the same language. MSA is a subset of Classical Arabic, JBA is a subset of Imperial Aramaic. Distinguish decline from split. Romance was bad Latin before it became modern languages. Fay Freak (talk) 22:15, 16 April 2019 (UTC)
@Fay Freak: How is it unreasonable to expect the user to know which form of Aramaic it is? That's like asking how we can expect people to add πατήρ under the correct header when it just reads Greek. It should be the contributor's responsibility to understand the material, especially when it comes to ancient texts. In virtually all of my Aramaic sources, it either specifies the form of Aramaic or cites the work that does. So going back to my example of פתגמא‎, without sources and a proper label, how do we know the original text was even in Hebrew, let alone had vowel points? It seems to me that specifying the form of Aramaic is essential to the entry's quality, and the comparison to Serbo-Croatian is a comparison between unequals.
I'm trying to follow your comparison to Arabic and Latin. Yes, Imperial Aramaic was a standardized liturgical language, much like Classical Arabic and Middle Latin, but how does the Jewish Babylonian Aramaic of the Middle to Late Aramaic periods fit into that using your argument? Are you trying to say JBA is the liturgical successor to Imperial Aramaic, like Classical Arabic is to Modern Standard Arabic? --{{victar|talk}} 02:48, 17 April 2019 (UTC)
It isn’t some source or the provenance that should tell you whether it is in a language but the text itself. The comparison with Serbian, Bosnian, Montenegrin, Croatian lies here. If I see a text on the internet it often takes long to find out in which of these it is and I often do not know it at all at the end. (And this is not even since the internet but similarly with printed texts in Austria-Hungary.) Hence it is sane to treat all as Serbo-Croatian, because the difference is too minute. The fact that it is often treated separately is no indication that it shouldn’t be done otherwise. And the reason to restrict treatments of Aramaic to certain lects is similar to treating only a regional variant of Serbo-Croatian. A Serbo-Croatian historical dictionary is more work than a dictionary of standard Croatian of the 21st century. Similarly, there is no man who could compile a “Comprehensive Aramaic Lexicon” though there are many who wish they could. If people compile works restricted to periods and provenance it is because of the pile of material is large and scattered. Hence we also have Latin dictionaries restricted to Latin of antiquity because a work including Medieval or even Modern Age Latin would be yuge (so Karl Ernst Georges could enter all Latin from when Latin lived into a dictionary, but not more, and it took his life). You also see that for Ancient Greek the limits have been pushed more to the present apparently with media access improving. The recent The Brill Dictionary of Ancient Greek covers all up to the 6th century CE, other Ancient Greek dictionaries stop at 200 or wherever, just because one is a homo oeconomicus and has to end somewhere or publish somewhere. The definition of a language is pliable dependent on what one wants to accomplish. 
This is to show you that “what is a language” is an economic decision, and when one writes about “languages” one writes about the literary history of the grammars and dictionaries created: the picture will be slanted by this fact. (Encyclopedias, Wikipedia, often fall for this fallacy, because they can’t know all the material either.) Hence treatments of what might seem to be “languages” do not entail that there are indeed separate languages, or that they should be treated as such in a community dictionary. A lot of absurd distinctions have been entered this way into Wiktionary already, so we have “German Low German”, “Dutch Low German” (Dutch Low Saxon) and “Mennonite Low German” (Plautdietsch), which is obviously caused by researchers not accessing all three areas, though these lects form a unity. Hence editors who dealt with texts from the three languages somewhat extensively concluded that the separation was wrong (@Korn, as I remember). As a result the editors get disenchanted because of the arbitrary distinction and cease to treat the language on Wiktionary.
What you say “without sources and a proper label, how do we know the original text was” etc. is a general problem of Wiktionary and lexicography, but has little to do with language distinction. We would ideally have quotes to make all clear, which regions and periods used it and what the semantic range was or probably was, but splitting the language distinctions anew is no way to achieve editors to do it more than they already do, but I expect it will cause treatment of the language to die off. Fay Freak (talk) 19:01, 17 April 2019 (UTC)
Confirm, Low German would likely greatly benefit from eschewing categorisation based on non-linguistic tradition (orthography and political borders), but nobody wants to do the work of actually etching out a working solution for such a radical change or risk letting someone else do it alone. From this experience I warn that once a split is decided, some people will start implementing it, maybe botting it, potentially frustrating some editors. But if after five years everyone agrees it isn't an optimal state, it's likely that the decision will never again be undone, because that would require enough editors in Aramaic to band together, declare consensus, and then elbow-grease away five years of random edits to implement it, so you'll have a terrible mixed mess. Korn [kʰũːɘ̃n] (talk) 22:02, 17 April 2019 (UTC)
I'm wary of language splitting in general, but how hard would it really be to split it into Old and Middle or something equivalent?
Most of Wiktionary is written by people with a passionate interest in given languages, rather than passersby who enter a word or two, especially for ancient or obscure languages. I think that asking editors to ascertain the variety roughly enough is not placing too much burden. I certainly feel that using the actual script in which the form is attested should be obligatory.
That said, it comes down to convenience, could we hear some more people who work on Aramaic? Crom daba (talk) 23:17, 17 April 2019 (UTC)

I think I need to circle back on my original complaint. To call Jewish Babylonian Aramaic the same language as Imperial Aramaic is completely unprecedented according to current scholarly conventions, which is why we have one ISO language code for Old Aramaic [oar] and another for Jewish Babylonian Aramaic [tmr]. Equally troubling is calling Classical Syriac [syc] lemmas "Alternative forms" to JBA, as seen in this entry, which by no stretch of the imagination should be considered one and the same language. Lumping JBA into Old Aramaic also creates the problematic situation where we find ourselves labeling inherited and borrowed descendants of Imperial Aramaic as derivatives of JBA in the descendants section, which is grossly inaccurate.

Here are three possible options:

Option 1: Keep Old Aramaic and JBA merged
  1. Label all identified entries, ex. {{lb|arc|Imperial Aramaic}}
  2. Move all unidentified entries to Latin script entries

Option 2: Split Old Aramaic and JBA apart
  1. Split Aramaic [arc] into a) Old Aramaic [oar] (also with appropriate labels, i.e. Ancient, Imperial, Biblical, etc.), and b) Jewish Babylonian Aramaic [tmr] (beside other Middle Aramaic lects, i.e. Mandaic [myz], Samaritan [sam], etc.)
  2. Move all current Aramaic entries in Hebrew script (which is all of them) to Jewish Babylonian Aramaic

Option 3: Split into Old Aramaic and Judeo-Aramaic
  1. Split Aramaic [arc] into a) Old Aramaic [oar] (Ancient, Imperial), and b) Judeo-Aramaic [arc-jud] (Biblical, Jewish Babylonian, Jewish Palestinian)
  2. Move all current Aramaic entries in Hebrew script to Judeo-Aramaic

Incidentally, I found this very old (2012) discussion also advocating splitting Jewish Babylonian Aramaic away from Aramaic, but unfortunately nothing came of it. It should be noted that the 7th most recent Aramaic entry is from 2008/2010, so activity within Aramaic has been virtually dead for a long time. I wrote the transliteration module for Imperial Aramaic just yesterday. --{{victar|talk}} 05:16, 18 April 2019 (UTC)

In “this entry” the Syriac/CPA isn’t put as an “alternative form of JBA”, this is exactly what isn’t done because the entry is just “Aramaic”, and also the JBA form is rather the here alternative form אֲרִישָׁא‎, not what I made main form אֲרִיסָא‎. I didn’t make the JBA form “the” form. Nothing at all troubles me with the current language headings. Also as you can see on the CAL the form is way earlier, “Palestinian Targumic” and the like. On the other hand “Old Aramaic” allegedly gives way to “Middle Aramaic” by the 3rd century which is quite arbitrary. Haven’t dealt with the Imperial inscriptions much but at least Biblical Aramaic is not very different in grammar from so-called Jewish Babylonian Aramaic, from what I have seen in grammars and samples. The “options” show again how arbitrary the distinction is, apart from being variously complicated, given all the alleged dialects of back in the day that have a name (why would you even want to do the opposite from what CAL does? Their preference is also to lump and label, to avoid complicated structure at such a fundamental level). But if you can distinguish, you could already do it with labels, and you could auto-categorize Imperial-Aramaic script Aramaic as Imperial Aramaic.
Ironically, in “this very old (2012) discussion” @334a argued (nobody else argued on this topic) that Aramaic should be split “like Mandarin, Cantonese, Wu”, which is a split that has been reversed since because of having been experienced as tedious.
And again you ignore the principle of unity. You say “To call Jewish Babylonian Aramaic the same language as Imperial Aramaic is completely unprecedented according to current scholarly conventions, which is why we have one ISO language code for Old Aramaic [oar] and another for Jewish Babylonian Aramaic [tmr].” One can likewise say that they are all the same language, hence the language code [arc] and hence the name Aramaic – it shows that the Verkehrsauffassung prefers a unity, even more than with Serbo-Croatian. As I said: The definition of a language is pliable depending on what one wants to accomplish. The truth of what is a separate Aramaic language does not need to be pursued here howsoever. The fact of being entered in Imperial Aramaic script is alone a feature that allows every reasonable reader and editor separation. Adding “Imperial” in every L2 header of every such entry does not add any value, because nobody cares at that point, but splitting has the potential of disenchantment, as sufficiently outlined with Low German and what not. Fay Freak (talk) 16:52, 18 April 2019 (UTC)
@Fay Freak:
  1. Delineating Aramaic into Old and Middle is a well-founded principle of thought within Aramaic scholarship (see Fitzmyer and Siegal). To call a time period delineation "arbitrary" is to call all such delineations so. A line in the sand needs to be made somewhere and all Middle Aramaic languages have their own language codes on en.Wikt with the unjustified exception of JBA.
  2. No one is suggesting any radical segmentation of Aramaic -- we're just talking about splitting JBA away, as per common convention.
  3. [arc] is intended to exist as a family code, which is why we have codes for Old Aramaic [oar] and JBA [tmr]. Arguing that is like arguing "why do we even have [gem]?"
  4. Who has suggested adding "Imperial Aramaic" to any headers? Not I. I am calling for labels within Old Aramaic. I recommend you carefully read through my three proposals again.
--{{victar|talk}} 17:17, 18 April 2019 (UTC)
Does not make a difference. If you write “Jewish Babylonian” everywhere it does not make the entries any truer. The entries should be already true, and it is true if it stands that they are “Aramaic”, adding “Jewish Babylonian” in L2 does not add to it. Labelling “Jewish Babylonian” (with {{lb}} and {{tlb}}) is what one can already do – provided a form is really “only Jewish Babylonian Aramaic”. With “Aramaic” one is on the safe side. I dispute that a line in the sand must be drawn. Not delineating is also founded. What we must do only is to call entries by a name to give the reader the information they want about what it means and where it comes from, and “Aramaic” does the job no less than any split distinctions. Whether it belongs to a certain region or period is something readers expect in labels and not on the language distinction level. Hence Chinese is not even split because it is not necessary for giving the information. Fay Freak (talk) 17:29, 18 April 2019 (UTC)
@Fay Freak: And that viewpoint would best fit the first of the three options I put forth, but that should have as a prerequisite moving unidentified Aramaic entries to Latin script entries, because rendering them in Hebrew is inappropriate. You may disagree with that statement, citing your incomparable Serbo-Croatian comparison, but I think you'll find that most editors disagree with you on that point, and think that historical lemmas should be rendered either in their original script or in Latin. --{{victar|talk}} 17:41, 18 April 2019 (UTC)
Aramaic should never-ever be in Latin script. The so-called Hebrew script (which is really the Aramaic script at least no less than the Hebrew script) is the appropriate fall-back script, particularly when a script is not yet encoded (there have been a lot, and they flowed into each other that one often does not even know if it is “already an own script”). The scholarly convention has always been to give Aramaic in Hebrew script when it has been attested in a script that is not available for printing or easier to enter, for example Nabataean Aramaic has often been cited as Hebrew. Again you miss what is the actual cause of scholars doing a thing. They don’t use Latin script because it is somehow standard or conventional, but because they are bad at doing better. Like Nostraticists cite all in Latin script. So Iranists rub their hands as long they can excuse writing books about Middle Persian and making recensions of Middle Persian works in Latin script with its not being encoded or no keyboard layouts being available etc. and thereafter they revel in being able to continue being lazy by referring to an alleged convention consisting of practices that have however always been wrong, but they have missed to tell it just in case it becomes opportune to rest on the Latin script. But the philological science of a remote culture only starts when you leave the schemes of the Latin script: As long as something is in Latin script it is pseudo-science, an imitation, if it wasn’t inclined for the Latin script, a surrogate for science. Now I see how the cat jumps: This proposal is another plot of Rome to impose its script upon all peoples, to the detriment of science. Fay Freak (talk) 20:40, 18 April 2019 (UTC)
If I had a class of philologists at my disposal, I'd employ them to lookup random n sentences from the corpus of Old and Middle Aramaic and see word by word how many words would result in effectively doubled data (sans the script) when entered together with its respective cognate (if it exists) into Old and Middle Aramaic.
Barring that, I don't think this discussion will be fruitful. Crom daba (talk) 20:57, 18 April 2019 (UTC)
(edit conflict) @Fay Freak: I would say the vast majority of Aramaic terms in sources not in their original attested script are rendered in Latin. The only exceptions you might find to that are in the context of Jewish Aramaic or Hebrew research. Just as we render Tocharian and Book Pahlavi in Latin, instead of, say, Chinese or Arabic, the default on en.Wikt is to use Latin script. Regardless, that would not be my preferred option of the three I present.
I'd really rather hear more from others, because evidently you and I cannot alone come to any agreement. --{{victar|talk}} 21:07, 18 April 2019 (UTC)
Just chiming in here. I think Aramaic should definitely be split. In particular I find Fay's comparison of Aramaic to Latin to be ill-conceived. While it is the case that Latin is still produced in academic and ecclesiastic contexts, it has no native speaker population. That class of native speakers did stick around in the form of Romance language speakers. While modern Aramaic may have a lot of learned borrowings from earlier stages, this is nowhere near saying that the spectrum of modern Aramaic lects is equivalent to the very much dead Latin.
Furthermore, if a user has trouble adding an Aramaic term because they don't know its chronology or script, this is as it should be. We aren't a dumping ground for random words. Even a comparison to Ancient Greek shows the highly articulated system for handling Ancient Greek dialectology (if "Ancient Greek" existed in any real sense until the Hellenistic period). Users should be expected to know about the words they are adding. —*i̯óh₁n̥C[5] 00:10, 19 April 2019 (UTC)
Script, I always stood on the position that it should be in the original script as far as possible. Nor did I say anything about modern Aramaic dialects. But chronology? I emphasized that it must become clear from the language itself that it is a different language. If a user does not see a difference then there probably isn’t any. Like one finds texts and does not see whether they are Bosnian or Serbian, or even if one knows the difference it does not look enough to necessitate a separation. The date and region is not the language. It’s not about “trouble” but about artificially seeking language distinctions when it is unneeded. The idea that after a certain point we have “Jewish Babylonian Aramaic” because “the line must be drawn somewhere” is, however tempting, still arbitrary, because only the purpose determines whether something is a language. It still has not been shown that only if languages are separate in that scholarly sense that is determined howsoever they should be separated in this lexicon. We treat Ukrainian and Russian and all Slavic languages separately because the standards and tables are different and the words in details too often, though it might appear that they are only one language with umpteen dialects. It does not matter what “is a language”, no! Who inculcated you that when something is considered a separate language by scholars it must be split off on Wiktionary? Baseless presumption. Whether something “is a language” or whether it is “correctly seen so by scholars” does not matter if we still can handle all with the same tables. And so we can handle Biblical Aramaic like Jewish Babylonian Aramaic in the same sections with the same tables. After all we have even Chinese in one heading. According to most purposes German and English are separate languages but a split of Jewish Babylonian Aramaic serves according to my expectation no purpose. Fay Freak (talk) 00:52, 19 April 2019 (UTC)
This proposal is another plot of Rome to impose its script upon all peoples, to the detriment of science. Really? Aramaic is normally written in a script of 22 letters; one can replace them with an arbitrary set of any other 22 symbols and preserve all the information. Transcription into computerized or typeset Aramaic throws away a bunch of information that may or may not be relevant, and that transcription may be more or less accurate. Once that's done, an exact transliteration doesn't change anything, yet makes it easier for people familiar with Latin script to understand the text and makes it possible to compare across languages that use different scripts. Demanding that all the people working in Gothic and Egyptian hieroglyphics use the exact form of the original has nothing to do with science, it's just exclusionary.--Prosfilaes (talk) 04:25, 19 April 2019 (UTC)
So many dissonances: On one hand one shall not be “exclusionary” and one not demand the exact form of the original, on the other hand one shall and it “is as it should be”.
What about acknowledging that the current exclusionarity is exactly right? No proposal for change here is advantageous. It’s all about adding qualifiers like “Old”, “Judeo”, “Jewish Babylonian” or Latin script where it does not belong to. As they are, the entries are correct and miss only the details like most entries on Wiktionary (period, quotes). Pigeonholing further but base all on one’s own “Latin” standard is just the classical American hybris and doublethink and, in this case, casual anti-Semitism. “Just disregard all cultural ties and do it like we do. The American way is best! Our way is the basis and father of all things!” Fay Freak (talk) 13:45, 19 April 2019 (UTC)
My stance is thus: Ideally, they should be split, but it's a lot of work and maybe not worth doing. A note regarding Biblical Aramaic: It is very different from Jewish Babylonian Aramaic. For one thing, BA is a Western dialect, while JBA is an Eastern dialect. There are numerous grammatical, phonological, and morphological differences. --WikiTiki89 19:11, 19 April 2019 (UTC)
Indeed. There are also numerous grammatical, phonological, and morphological differences between the comedies of Plautus and the sermons of Augustinus. So much one could write about the unlike syntactical constructions, the sound changes meanwhile, and the different endings used. And yet what matters lexicographically would need to justify different language headers. And yes, I also oppose the concept of a language “Old Latin”. Its names aren’t even correct. Fay Freak (talk) 19:47, 19 April 2019 (UTC)

Misspelling Alternative

One of the biggest objections that seems to be raised around removing misspellings, or banning them, is that the entry for a misspelling points users who search for it to a correctly spelled entry. One possible solution which would enable search to consistently find misspellings we deem important enough to include would be to put them on the correctly spelled entry, but not display them. The search function finds them easily enough, so searches for misspellings still present the correctly spelled entry to the searcher, but without the intermediate step of landing on an incorrectly spelled page.

I created a demonstration template {{misspelling}} which provides the simplest possible version of this idea, and applied it to the recently deleted page urothelical (urothelial). One simply adds {{misspelling|urothelical}} to the end of the language section (or entry?), and then when a user searches for the misspelling they see the correct page (in this case as the first entry suggested). Additionally the template labels the term as a misspelling, so it is somewhat clear what is going on. You can see what it looks like on these example search results. This could obviously be fancied up with language and categorization and all kinds of things if desirable.

Ideally the Mediawiki search would be smart enough that the user always had the correct entry suggested at the top of the search page, but with the multilingual nature of the project that is an extremely difficult goal. This sort of structure may actually strengthen the search's ability to suggest pages, or to become more advanced down the road. Thoughts? - TheDaveRoss 18:37, 17 April 2019 (UTC)

Cool. For this example, isn’t “urothelical” a misconstruction though? Fay Freak (talk) 19:07, 17 April 2019 (UTC)
Probably, I just copied the intent of the original page. No reason this same mechanism couldn't be applied to misconstructions and typos. - TheDaveRoss 19:09, 17 April 2019 (UTC)
If we define a misconstruction to be a misunderstanding or misinterpretation resulting from the use of the wrong meaning of a word that has multiple meanings, then this is not one. Although uro- has two meanings, this is not the result of interpreting it incorrectly as meaning “tail”. The issue is solely with the spelling thelical, which has zero meanings. Note that we also list the miscreation epithelical.  --Lambiam 18:13, 18 April 2019 (UTC)
Support. It's a pretty elegant idea that lets these words be findable without having an entry. On the other hand, how do you distinguish a misspelling from an alternative spelling? —Rua (mew) 15:52, 18 April 2019 (UTC)
Alternative forms are when there are reasons why an informed person uses or used the forms (conscious spellings), while misspellings are when such reasons are absent. If a misspelling is also a legit spelling then one would of course use the gloss templates we are used to, {{misspelling of}}. Fay Freak (talk) 17:05, 18 April 2019 (UTC)
If "urothelical" happened to be a word in another language this wouldn't work at all. DTLHS (talk) 18:15, 18 April 2019 (UTC)
Not at all is a bit strong, but if there is no acceptable way to make that situation work we can leave the status quo for entries which would otherwise exist, and use this method for entries which would not. Other solutions include listing common misspellings in the "also" section at the top, perhaps distinctly in a "did you mean" format. - TheDaveRoss 18:31, 18 April 2019 (UTC)
Does anyone actually make use of our misspellings as data for some purpose?
Even if there were, the proposal, with modifications and limitations as suggested above sounds good. DCDuring (talk) 18:46, 18 April 2019 (UTC)
At this point I'd just like to kill all "misspellings" until they become acceptable spellings (not sure exactly how we make that judgement!). Giving them first-class status encourages all the fungus of categories, alt forms, etc. to grow on them and legitimises them beyond what they deserve. But this idea might be an improvement, sure. Equinox 19:28, 18 April 2019 (UTC)

Proposal to unify the size and style of CJKV text

I have a proposal for a number of changes regarding CJKV text:

  1. Unify font size: 120%.
  2. Set line-height to be 1, to prevent CJKV text from affecting the line-height of Latn text.
  3. Re-enable bold font weight for Japanese.
  4. Do not enlarge CJKV bold text.
  5. Do not use bold font weight for all Vietnamese Hani text.
  6. (Other cleanup.)

Rough preview of before and after.

Secondary to CSS:

  1. (Use Kore for all Korean text, instead of using Hani for hanja and Kore / Hang for hangul.)
  2. (Repair certain Japanese furigana templates to fix certain oddities regarding font size.)

If there are no objections, I will ask for implementation.

Suzukaze-c 19:56, 18 April 2019 (UTC)

If I remember the history, bold was avoided for Japanese due to concerns of legibility with some of the more complicated kanji.
Test cases, using a simple span and inline CSS to set the font size to 120% and the font weight to bold:
  • 警察、協議、宿題、頑張る、薔薇、麒麟、憂鬱、摩擦、魔羅、魅惑、輿論、興業
  • 警察、協議、宿題、頑張る、薔薇、麒麟、憂鬱、摩擦、魔羅、魅惑、輿論、興業
Subjectively, I'd say that bolding does cause a certain loss of visual distinctness. However, I'm not sure if it's enough to eschew bolding altogether. ‑‑ Eiríkr Útlendi │Tala við mig 19:55, 1 May 2019 (UTC)
Mm, but the Japanese Wikipedia doesn't have any problems with using bold formatting. No one else on the internet enlarges text like this. —Suzukaze-c 17:01, 8 May 2019 (UTC)
(@Eirikr, just in case —Suzukaze-c 19:21, 16 May 2019 (UTC))
(@Suzukaze-c, sorry, didn't realize any further input was needed? I have no strong objections: there's some blobbiness due to bolding, but I don't think it's a blocking issue -- especially when boosting the font size by 120%. ‑‑ Eiríkr Útlendi │Tala við mig 21:09, 16 May 2019 (UTC))
Just making sure. —Suzukaze-c 22:27, 16 May 2019 (UTC)

@Erutuon Do you think that adding this is alright now? —Suzukaze-c 02:22, 21 June 2019 (UTC)

@Suzukaze-c: I guess so, because there have been no objections, though I'm a little unsure because only one other person has responded. Perhaps implementing it is the only way to get people to express their opinions on it. — Eru·tuon 16:38, 21 June 2019 (UTC)
Hm, maybe. I will hold the responsibility. —Suzukaze-c 23:55, 21 June 2019 (UTC)
(@Erutuon. —Suzukaze-c 05:30, 1 July 2019 (UTC))
@Suzukaze-c: Oh, I added the CSS, but didn't notify you... sorry. — Eru·tuon 13:43, 1 July 2019 (UTC)
Ah, thank you! —Suzukaze-c 00:26, 2 July 2019 (UTC)
@Erutuon Oh, I think you forgot the addition of 'MoeSongUN' to Hant. (it's a font by the Taiwanese Ministry of Education, and seems pretty alright to me) —Suzukaze-c 00:28, 2 July 2019 (UTC)
@Suzukaze-c: Oops, done. — Eru·tuon 01:13, 2 July 2019 (UTC)

Semi-automatic correction of characters

Semi-automatic correction of Cyrillic text with Latin characters

As editors who watch Recent Changes probably have noticed, I've been correcting Cyrillic text that contains Latin characters. I created a list of links in {{m}}, {{l}}, {{t}}, and {{t+}} for languages that only have Cyrillic script listed in their data table that includes the entries that will be processed. Russian, Ukrainian, Belarusian, Bulgarian, and Macedonian have already been processed. I'm using the list of similar-looking Latin and Cyrillic letters at w:User:Trey314159/homoglyphHunter.js, with some additions. An example edit can be seen here. I review each edit and don't change some of the links because they are clearly in the Latin script. — Eru·tuon 20:33, 18 April 2019 (UTC)
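The check described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the look-alike table is a small assumed subset (the real list at w:User:Trey314159/homoglyphHunter.js is longer), and the function name `suggest_fix` is hypothetical. Anything that contains a Latin letter without a Cyrillic look-alike is left for manual review, mirroring the point that some links are clearly meant to be in Latin script.

```python
import unicodedata

# Assumed subset of Latin -> Cyrillic look-alikes, written as escapes so the
# two scripts cannot be confused when reading the source.
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
}


def is_latin(ch):
    """True if the character's Unicode name marks it as a Latin letter."""
    return unicodedata.name(ch, "").startswith("LATIN")


def is_cyrillic(ch):
    """True if the character's Unicode name marks it as a Cyrillic letter."""
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def suggest_fix(text):
    """Return a corrected string, or None if nothing should be changed
    automatically (no mixture of scripts, or a Latin letter that has no
    Cyrillic look-alike and therefore needs a human decision)."""
    latin = [ch for ch in text if is_latin(ch)]
    has_cyrillic = any(is_cyrillic(ch) for ch in text)
    if not latin or not has_cyrillic:
        return None
    if any(ch not in LATIN_TO_CYRILLIC for ch in latin):
        return None
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in text)
```

A suggestion produced this way would still be reviewed edit by edit before saving, as described above.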

Finished. There still remain other linking templates that might need cleanup. [Edit: Finished the most common etymology templates.] — Eru·tuon 21:28, 18 April 2019 (UTC)

@Erutuon Great work, thank you! You may want to do the same with Arabic (partial) homoglyphs with Arabic, Persian, Urdu, etc. If it's still required. --Anatoli T. (обсудить/вклад) 04:25, 19 April 2019 (UTC)
@Atitarev: Yes, I think I should do that. I'm familiar with Arabic, but not very familiar with Persian or Urdu. What characters should I be looking for in each language and replacing? In Persian, it looks like ك‎ (Arabic letter kaf) and ي‎ (Arabic letter yeh) would be incorrect, since Persian uses ک‎ (Arabic letter keheh) and ی‎ (Arabic letter Farsi yeh) instead. If that's right, I can look for the non-Persian character in Persian linking templates and replace it with the Persian one. (Here's the working list.) — Eru·tuon 05:01, 19 April 2019 (UTC)
(edit conflict) @Erutuon: ك‎ (Arabic letter kāf) and ي‎ (Arabic letter yāʾ), ى‎ (Arabic letter ʾalif maqṣūra), and ک‎ (Persian letter kâf) and ی‎ (Persian letter ye) are exactly the letters to look for, they are partial homoglyphs because they look identical only in certain positions, copypasta and wrong keyboards cause the common misspellings. Urdu uses the Persian ک‎ and ی‎. The Arabic ي‎ is also used in Pashto but Pashto uses the Persian ک‎. These are the most common errors, which can be checked without any deeper knowledge of these languages and the spelling rules. Things to look for is to check if letters specific to one language are used in another, e.g. the Arabic ة‎ (tāʾ marbūṭa) is normally not used in other languages or it would be an extremely rare case, like specific Persian letters can occasionally be used in standard Arabic or dialects. --Anatoli T. (обсудить/вклад) 06:02, 19 April 2019 (UTC)
Okay, thanks! I've added instances of alif maqsūra to the Persian list, and tāʾ marbūta as well, though the latter I will just let others deal with. — Eru·tuon 06:46, 19 April 2019 (UTC)
I've also inferred from w:Pashto alphabet that ی‎ (Arabic letter Farsi yeh) and ۍ‎ (Arabic letter yeh with tail) are only used word-finally. Updated list from all these languages at User:Erutuon/wrong script/Arabic. — Eru·tuon 10:09, 28 April 2019 (UTC)

Semi-automatic correction of Arabic characters

@Atitarev: I've written a script to semi-automatically replace Arabic kāf with Persian kâf and Arabic yāʾ and ʾalif maqṣūra with Persian ye in Persian and Urdu link templates, and Arabic kāf with Persian kâf in Pashto link templates. [Edit: Finished running through all the examples in User:Erutuon/wrong script/Arabic.] — Eru·tuon 01:25, 29 April 2019 (UTC)
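For readers following along, the replacement rules just described could look something like the sketch below. It is not the actual script: the table layout and the name `normalize_link` are hypothetical, and the language codes `fa`, `ur`, and `ps` are assumed to stand for Persian, Urdu, and Pashto. Arabic-language links are deliberately left untouched.

```python
ARABIC_KAF = "\u0643"    # ARABIC LETTER KAF
ARABIC_YEH = "\u064A"    # ARABIC LETTER YEH
ALEF_MAKSURA = "\u0649"  # ARABIC LETTER ALEF MAKSURA
KEHEH = "\u06A9"         # ARABIC LETTER KEHEH (Persian kaf)
FARSI_YEH = "\u06CC"     # ARABIC LETTER FARSI YEH (Persian ye)

# Per-language replacement tables, following the rules stated in the thread:
# Persian and Urdu get all three replacements, Pashto only the kaf one.
PERSIAN_STYLE = {ARABIC_KAF: KEHEH, ARABIC_YEH: FARSI_YEH, ALEF_MAKSURA: FARSI_YEH}
REPLACEMENTS = {
    "fa": PERSIAN_STYLE,
    "ur": PERSIAN_STYLE,
    "ps": {ARABIC_KAF: KEHEH},
}


def normalize_link(lang, text):
    """Return the link text with the language's replacements applied.
    Languages without a table (including Arabic) are returned unchanged."""
    table = REPLACEMENTS.get(lang)
    if table is None:
        return text
    return text.translate(str.maketrans(table))
```

Because the process is semi-automatic, the output of a function like this would be a proposed edit for a person to confirm, not something saved blindly.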

@Erutuon Great work, thank you. Are you going to run to replace it on all entries? Not sure how it's going to work, sorry. It would be a safe replacement for Arabic and Pashto in non-final positions. Arabic yāʾ and ʾalif maqṣūra can be problematic (one has to choose and it could be another word or alt. spelling) in the final position but Arabic should never use the Persian ye in any position. Pashto only uses the Arabic yāʾ but there are other similar letters, in any case, suspicious cases could be dumped into a problem bucket to be looked at by knowledgeable editors. --Anatoli T. (обсудить/вклад) 01:36, 29 April 2019 (UTC)
@Atitarev: Sorry, my wording was probably not very clear; I'm not doing anything in Arabic-language links. The script allows me to make two replacements in Persian and Urdu links (Arabic kāf to Persian kâf, Arabic yāʾ or ʾalif maqṣūra to Persian ye) and one replacement in Pashto links (Arabic kāf to Persian kâf). As to Pashto only using Arabic yāʾ, w:Pashto alphabet indicates that Persian ye is used word-finally. Is that wrong? — Eru·tuon 01:44, 29 April 2019 (UTC)
(edit conflict) @Erutuon: No, it's correct. Are you talking about the translation adder above? If yes, the Arabic could also use replacements from Persian kâf to Arabic kāf (safe). --Anatoli T. (обсудить/вклад) 01:51, 29 April 2019 (UTC)
@Atitarev: Oh, no, this is about correcting link templates, not the translation adder. — Eru·tuon 01:55, 29 April 2019 (UTC)
@Erutuon: So, this will correct the displayed characters, even if the actual character is wrong? I think it's great but we need to prevent these from happening in the first place as well. --Anatoli T. (обсудить/вклад) 01:59, 29 April 2019 (UTC)
@Atitarev: No, I've been making edits to replace the characters like this. I agree that putting the logic in the modules is not a good idea. — Eru·tuon 02:01, 29 April 2019 (UTC)
@Erutuon: I'd need some training to understand what you've been doing. :) --Anatoli T. (обсудить/вклад) 02:04, 29 April 2019 (UTC)
User:Erutuon/wrong script/Arabic is the list for knowledgeable editors to look at – though it won't be updated until the next dump is released, so some of it will already have been corrected. (Hmm, I should work on rules for the Arabic language. At the moment it only has Persian, Urdu, and Pashto.) — Eru·tuon 01:48, 29 April 2019 (UTC)
@Erutuon: We should invite more people - Persian, Arabic, Urdu editors or someone who can run a bot - too many errors but it may require a human eye. The number of errors is mind-boggling. Some measures to prevent using wrong letters would be great (a general comment to the community, not yourself). --Anatoli T. (обсудить/вклад) 01:55, 29 April 2019 (UTC)
@Atitarev: Looks like ہ (U+06C1 ARABIC LETTER HEH GOAL) and ھ (U+06BE ARABIC LETTER HEH DOACHASHMEE) in Urdu, ه (U+0647 ARABIC LETTER HEH) in Arabic, and ە (U+06D5 ARABIC LETTER AE) in Uyghur would also be worth tracking. — Eru·tuon 03:55, 29 April 2019 (UTC)
@Erutuon: I agree but if you know what you're doing :) I think Urdu also uses ه‎, less commonly. --Anatoli T. (обсудить/вклад) 04:05, 29 April 2019 (UTC)
@Atitarev: My impression from w:Urdu alphabet is that Urdu is ideally only supposed to use ہ‎ (U+06C1 ARABIC LETTER HEH GOAL) and ھ‎ (U+06BE ARABIC LETTER HEH DOACHASHMEE), not ه (U+0647 ARABIC LETTER HEH), but even that article is currently using ه (U+0647 ARABIC LETTER HEH) in places where ہ (U+06C1 ARABIC LETTER HEH GOAL) would be expected, so probably there is inconsistency. I've added examples of ه (U+0647 ARABIC LETTER HEH) in Urdu to User:Erutuon/wrong script/Arabic, but I won't make any replacements for now. — Eru·tuon 08:01, 29 April 2019 (UTC)
I surveyed Urdu entries in the dump, and 524 titles have ہ‎ (U+06C1 ARABIC LETTER HEH GOAL), while only 28 have ه (U+0647 ARABIC LETTER HEH). So there's a tendency to use the correct character in entry titles. — Eru·tuon 08:30, 29 April 2019 (UTC)
@Erutuon: I am not 100% sure about this. It's good that you saved the data. Someone may want to use it later. --Anatoli T. (обсудить/вклад) 10:35, 29 April 2019 (UTC)
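A title survey of the kind mentioned above could be sketched as below, for anyone who wants to repeat it on a later dump. This is an assumption-laden illustration: the function name `count_titles_by_heh` is hypothetical, and how the Urdu titles are extracted from the dump is left out entirely.

```python
from collections import Counter

# The three heh-like characters discussed in this thread.
HEH_VARIANTS = {
    "\u06C1": "HEH GOAL",
    "\u06BE": "HEH DOACHASHMEE",
    "\u0647": "HEH",
}


def count_titles_by_heh(titles):
    """Count how many titles contain each heh variant at least once.
    A title containing two different variants is counted under both."""
    counts = Counter()
    for title in titles:
        for char, name in HEH_VARIANTS.items():
            if char in title:
                counts[name] += 1
    return counts
```

Given a list of Urdu entry titles, `count_titles_by_heh(titles)` returns a `Counter` keyed by the character names above.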

Finding multiword terms when searching for one of the words?

I had an interesting situation just now, where a friend used the term Gish gallop. I had no idea what that meant. I tried looking up the unfamiliar word gish, but the definition there made no sense and didn't help me understand what was being said at all. Of course, I didn't realise that this was a multiword term, and doing what is most natural in the situation (looking up the one word I didn't know) gave me nothing. Eventually she explained it to me and then I realised that it's a combination of two words I needed to look up, which then led me to the right entry. But in itself, there was nothing to hint that this was an idiomatic combination and Wiktionary wasn't helpful in getting me where I needed to be. I'm guessing I'm not the only one to have this problem. Is there anything we can do to improve it? —Rua (mew) 23:53, 18 April 2019 (UTC)

Solution: you consider that the element "gish" might be capitalised, go to Gish, and see the link to the derived term. —Μετάknowledgediscuss/deeds 03:18, 19 April 2019 (UTC)
Because I frequently search for unlinked taxonomic names (including one-part names), I have the habit of searching for entries that merely 'contain' my search term. That kind of search yields Gish gallop as the third item on the search results. DCDuring (talk) 10:03, 19 April 2019 (UTC)

"What means X?"Edit

Hi Wiktionary! I know a German guy and when he doesn't understand a word in English he asks "what means X?". I was trying to explain to him that you have to say "what does X mean?", because "what means X?" sounds like you are asking for a word whose definition is X (although, admittedly, people will probably understand it because the other interpretation is too weird). I don't speak any German and I found it totally impossible to explain to him what the difference is, and why "what means X?" is wrong. (I believe the grammatical term for English is "do-support", but this isn't a guy who will go reading a lot of grammar.) Could anyone help me explain this to him? A few short sentences of German that explain the difference would be absolutely fantastic. Equinox 03:12, 19 April 2019 (UTC)

I'm not great at saying it in German, but What means X? has what as subject (like the nominative case) and X as direct object (like the accusative case), and What does X mean? is the other way around, with X as subject and what as direct object. — Eru·tuon 04:14, 19 April 2019 (UTC)
I understand the issue grammatically, but I would like to explain it to a German who doesn't know or care about grammar, but thinks "what means X?" and "what does X mean?" are identical. Maybe it would be good to have those two sentences translated very literally into German. Equinox 04:48, 19 April 2019 (UTC)
Hmm, well, the German word for what (was) doesn't have distinct nominative and accusative forms, but if you replace X with you and translate, you get Was bedeutest du? ("What do you mean?") for What does X mean? but Was bedeutet dich? ("What means you?") for What means X?, which seems just as weird as the English. — Eru·tuon 05:10, 19 April 2019 (UTC)
Tell the German speaker that this is a present simple question, for which the auxiliary verb do is required. The German translation, of course, would be "was meint X?" so it would be easy to translate that as "what means X?" --I learned some phrases (talk) 09:50, 19 April 2019 (UTC)
In German the phrase is "was bedeutet X?" (not "was meint X?"), in English you need to add "do" to interrogative sentences, unless the question is about the subject, e.g. "what makes this sound?" - "was macht diesen Laut?" or "who is speaking?"/"who speaks?" - "wer spricht?". --Anatoli T. (обсудить/вклад) 11:56, 19 April 2019 (UTC)
It sounds like he won't care why. You could just tell him that when using means (or meant) in a question the known thing should always be first and the unknown is last. Hence, "This means what?" Now how to put that in German, I have no idea. -Mike (talk) 07:47, 20 April 2019 (UTC)
This web page in German gives a very concise but readable summary (by way of examples) of the main rules of English grammar, including the word order in questions. The example that matches your friend’s problem the most closely is, “What does she watch everyday?”. The page only states what the rules are, not why they are as they are.  --Lambiam 10:41, 20 April 2019 (UTC)

Eye dialect (again)

I know that there have been discussions previously about the use of the label "eye dialect" within Wiktionary, and, especially, whether it is correct to use the term to refer to "pronunciation spellings" that are intended to mimic a nonstandard pronunciation. Since the last time I looked, I think, the following additional definition has been added at eye dialect:

2. (more broadly) Nonstandard spelling which indicates nonstandard pronunciation.

Is everyone happy with this definition, and happy that it should be applied within Wiktionary, such that, just to give one random example, geddit should be labelled "eye dialect"? Mihia (talk) 03:35, 22 April 2019 (UTC)

If /ˈɡɛɾɪt/ is (locally) the standard pronunciation of “get it”, it is not unreasonable to consider ‘geddit’ eye-dialect spelling (compare e.g. the uses of the spelling “compuder”); that does not require the contested additional sense. I for one am unhappy with such dilutions that make these terms less functional; given the etymology of eye dialect it also does not make sense.  --Lambiam 07:35, 22 April 2019 (UTC)
I'm not sure how far down we can or should appeal to "local" standard pronunciations. In some communities and/or registers people might routinely say /ˈɡɛɾɪt/, just as they might routinely say -in' for -ing, or fink for think, or whatever else. These may be "normal" for some people, yet I believe it would be misleading for us to refer to them as "standard pronunciations". Mihia (talk) 17:16, 22 April 2019 (UTC)
I thought realizing intervocalic /t/ or /d/ as an alveolar flap is fairly standard for most speakers of American English, which is a bit wider than “some communities”. Compare liddle.  --Lambiam 23:44, 22 April 2019 (UTC)

Bolze

Here's a link to an interesting BBC article on Bolze, a language formed at the intersection of French and German in the Swiss city of Freiburg. It's rare enough that the Wikipedia article was created in reaction to the BBC article. Something for us to add here as well? -Stelio (talk) 16:52, 23 April 2019 (UTC)

What references are available? DTLHS (talk) 16:54, 23 April 2019 (UTC)
Here is one: [9]. I have the impression from what I see that this is more like a pidgin, where people could choose to communicate in either German or in French, but instead code switch continually, both at the phrase level and the lexical level. Nothing I’ve seen suggests that Bolze has a grammar of its own.  --Lambiam 20:37, 23 April 2019 (UTC)
Indeed, many border dialects of German [de] ~ Alemannic German [gsw] have a great deal of Romance borrowing. To the degree that there is a true mixed lect, Wikipedia suggests a comparison to Portuñol, which we also do not count as a language. —Μετάknowledgediscuss/deeds 20:52, 23 April 2019 (UTC)

The Wikipedia "see also" link to Portuñol is not based on any clear analysis, so I don't give that any weight. Some sample phrases for comparison, in case it helps:

English | French | Swiss German | Bolze | Source
open your umbrella | ouvre ton parapluie | öffne deinen Regenschirm | tuuf dy Paraplüi | BBC
no, that's too much | non, c'est trop | nein, das ist zuviel | nei, dasch zvüu | SWI

-Stelio (talk) 07:08, 2 May 2019 (UTC)

Quoting bible translations

Is it permissible to use, as a source of indirect quotations, a translation of the Bible that is

  1. Published on dead trees (so durable)
  2. Lawfully available on-line (so currently available)
  3. Still in copyright (so not so easily used)

My idea is to create a 'reference' entry that will link to one (or occasionally more) locations in the lawful on-line copy. An ultimate source I have in mind has a CC-BY-ND licence, which as I understand it, will prevent us quoting single sentences. RichardW57m (talk) 12:37, 24 April 2019 (UTC)

We have generally held that single sentences from any work of any copyright status constitute fair use and are permissible.
Also, most such works often contain permission for using portions of the text up to a certain number of verses, with some caveats. E.g., here is the NASB language: "The text of the New American Standard Bible® may be quoted and/or reprinted up to and inclusive of one thousand (1,000) verses without express written permission of The Lockman Foundation, providing the verses do not amount to a complete book of the Bible nor do the verses quoted account for more than 50% of the total work in which they are quoted." (link) Depending on which version you are hoping to cite there may be a similar permission. - TheDaveRoss 13:00, 24 April 2019 (UTC)
I can't find any such verse-limited permission. Also, I'm not sure how we would warn someone against creating a derivative work if we had many verses. I can certainly envisage myself using at least 40 verses this year. RichardW57 (talk) 21:15, 24 April 2019 (UTC)

Looking at long-term fixes for a and other high-content pages

This page often shows up with module errors, as it does again now. The reason is pretty obvious: there are so many languages and so much content on the page that we overrun our Lua budget. The way we have solved this so far is to de-luafy some portions of the page, and that seems to get it to work again for a while. But this is only a stopgap measure, and the problem always seems to come back one way or another. Eventually, if this process continues, having any Lua on the page will break it. Moreover, the page already takes 10 seconds to load, and I bet that that's not just a matter of Lua. Template logic is generally slower than the equivalent Lua code. The reason Lua tends to break is that it is limited in ways templates aren't, and that we have large datasets in data modules like Module:languages/data2 that need to be in memory in order to be used. When we stored that kind of information in templates, it was pretty awful to manage, rather slow, and required us to enter a lot more stuff manually, such as script codes and transliterations.

Now, keep in mind that this is a standard Latin alphabet letter, most likely used by every single language written in the Latin alphabet. And that, if Category:Latin script languages is any indication, there are over 3000 of them. Given that our goal is "all words in all languages", all of those languages will eventually have an entry on a. Can you imagine having a single page with 3000 language sections on it? Such a page would be too huge to load, Lua or not. That clearly means that our stopgap measures aren't going to help in the long run, we need a proper solution that can work with such a huge number of entries. I see a few obvious solutions, but of course there may be others.

First solution: Characters are translingual only. Any language-specific information about characters is placed in an appendix page, of which there is one per language. Chinese and other CJKV languages would need special treatment because they have the opposite problem: there are too many characters to list in an appendix, and there is an overabundance of one-character words, meaning they'd need proper entries anyway. This solution works without giant overhauls of everything on Wiktionary, but it will not fix every problem we have with too-large pages. Already, the two-character sequence do is so common cross-linguistically that it's overloading the page, and it will only get worse in the future. The number of entries on do is never going to go down, it's only going to go up. So this solution fixes a subset of problems, but doesn't entirely prevent pages from having so many entries on them that they break.

Second solution: Finally bite the bullet and move towards a one-language-per-page solution. This would eliminate all memory errors, timeouts, page load issues and anything else Wiktionary could throw at us as a result of pages being too big or having too big a dataset. So this is really the Swiss army knife of solutions, able to fix all of it in one go. It's also by far the most intrusive solution, as it will require reorganising everything and will break all external links to Wiktionary. However, it may prove to be the only satisfactory solution in the long term. Even if we eliminate all the one-character entries, not all problematic entries are a single character. —Rua (mew) 15:35, 24 April 2019 (UTC)

Neither of your solutions considers that we use Lua for lots of things which are not necessary, if we did that less we would see errors less often. Presumably, as we have more and more content (e.g. translations into every language on every English term) the number of pages which will have Lua errors will trend towards every page with English terms. Things like the "redlinks categories" and dynamically adding language names, dynamically generating links, etc. are all better served by static solutions (the mapping of en to English is extremely unlikely to change often, if ever). We should stop using Lua so often for non-dynamic content before we consider completely restructuring the project to accommodate the errors we have created. - TheDaveRoss 15:40, 24 April 2019 (UTC)
Whether the mapping will ever change or not, we still need to store the mapping somewhere. Moreover, besides being unambiguous, language codes are useful in that they are short, and most uses of a language code require both the language and the code (name for categories and section link, code for language tagging). So if you were to opt for bypassing the lookup altogether, every {{t|fr}} would have to become {{t|fr|French}} in order to supply the information needed. I don't see anyone agreeing to that anytime soon either. —Rua (mew) 19:40, 24 April 2019 (UTC)
Well, in the case of {{t}} we already write the language name, so instead of * French: {{t|fr|fille}} it could be * {{t|French|fr|fille}}, or any number of other solutions. Transliteration is probably worse than language name lookup, that should also be static text. Even if the initial contributor didn't want to add it, we could have bots come along afterward and apply transliterations. Having the server repeat the same work over and over needlessly makes no sense. - TheDaveRoss 20:02, 24 April 2019 (UTC)
That I can agree with, at least. But we can have the best of both worlds. Automatic transliteration, but when the bot comes and provides one it bypasses the automatic one. If someone wants to run a bot to do that I won't oppose. —Rua (mew) 20:44, 24 April 2019 (UTC)
Do you really think that's how Mediawiki is implemented? Nothing is calculated "over and over", it's cached. If we need a better caching layer, as you seem to be suggesting in a roundabout way, it has to be provided by the software and not by ad-hoc "bots" with unknown sources that are liable to disappear at any time. DTLHS (talk) 21:21, 24 April 2019 (UTC)
@DTLHS Are you sure? It seems that loading a page like a takes forever whether I load it the first time or the second time around. The software must be repeating a substantial amount of work for it to take that long each time. —Rua (mew) 23:33, 24 April 2019 (UTC)
a loads for me in about half a second. Internet speed should be taken into account since the page has to be downloaded regardless of how it eventually gets rendered. DTLHS (talk) 23:39, 24 April 2019 (UTC)
@DTLHS That's weird, it takes around 10 seconds for me. Internet speed shouldn't be such a huge factor, because the entire thing is only 121 kB, not nearly enough to flood a modern internet connection. I wonder if browser is a factor? I'm using Firefox. —Rua (mew) 23:45, 24 April 2019 (UTC)
Well we're conflating several things here. Internet speed may not be relevant, but clearly page load time is not constrained by the server since some clients can have dramatically faster load times. This is irrelevant in terms of Lua module errors however since that runs entirely on the server and gets cached, errors and all. Both problems are solvable by splitting pages up. DTLHS (talk) 00:14, 25 April 2019 (UTC)
I tried loading a a few times consecutively and it took 7 seconds the first time, 2 or 3 the second time, and about 1.5 the third time (the latter two times being no worse than any other page for me). Andrew Sheedy (talk) 01:05, 25 April 2019 (UTC)
I agree that our memory problems need to be addressed, but the arbitrary memory limit is something which can be changed from above, and would be a game-changer. So I choose to approach the two suggestions as a matter of policy independent of memory issues: by that standard, I support solution 1, which seems like a less duplicative way of documenting glyphs that are not "words in a language" (like those in alphabets, but not sinograms). I strongly oppose solution 2, which would require users and editors to click through more pages to reach desired content, at least in the way that you have described implementing it in the past. —Μετάknowledgediscuss/deeds 19:51, 24 April 2019 (UTC)
How I described it isn't exactly relevant. Did you see the page I linked? —Rua (mew) 20:02, 24 April 2019 (UTC)
It is extremely relevant, because who would be implementing it? I did see that page, and you can read my comments on its talk page from back when it was current. —Μετάknowledgediscuss/deeds 20:04, 24 April 2019 (UTC)
One structural fix: stop treating all languages and scripts equally in the modules. English should have its own module, so that language code en is handled by a small, economical, quasi-hard-wired module without loading in modules like Module:languages/data2 where 99% of the data won't be used. Likewise the Latin script should have its own module. The idea is to bypass the normal, resource-hungry process for a few heavily-used language and script codes. If every language on the a page uses the same script, why should we be loading data for all the other scripts just to decide we're not going to use it?
Also, we should streamline the first step in language-code processing by having a bare-bones module that just lists the valid codes and where to get their data/lua code- why should I have the sort key for Greek in memory in order to display an English term or link to its entry? And data modules should be smaller and more numerous. Having everything in one place makes it easier to manage, but it massively wastes resources. Ideally, we should only have code and data in memory that we're actually using. Chuck Entz (talk) 01:27, 25 April 2019 (UTC)
That's a bunch of nonsense that you haven't actually tested. How do you know that loading "only what we need" is going to be better if we have to grab it from a bunch of different pages? Please stop speculating and actually implement it if you want to show it's actually worth doing. Scribunto is implemented in nonintuitive ways. DTLHS (talk) 01:48, 25 April 2019 (UTC)
I'm sorry to have given the impression that I was doing anything other than brainstorming. I fully expected to have someone tell me I was wrong if I was mistaken, but not to be, in effect, told to shut up and go away. My goal was to put forth ideas and stimulate discussion, not to show how brilliant I am or tell everyone else how wrong they are. Is there absolutely no part of what I said that might not have some relevance? Chuck Entz (talk) 02:05, 25 April 2019 (UTC)
@Chuck Entz: For nonsense it is pretty sound. I tested it, albeit on a limited basis, and it seems to work just fine. If you put the conditions in the template (which we already do, see {{t}}), and only invoke the module when needed (or invoke an alternate, smaller module) it does seem to only call the modules it actually needs. Certainly worth further exploration. - TheDaveRoss 12:26, 25 April 2019 (UTC)
I understand where DTLHS is coming from, because I haven't completely blocked out the memory of a proposal I made based on my misunderstanding of the way abuse filters relate to the rest of the system (I figured it was better to let it die quietly than to draw more attention to it by retracting it).
I still think there's merit to splitting things up, though it definitely has its limits. In a case like a, data for every language code is going to be needed, but not every script code, so it might help, some- especially if some of the data for languages without Latin scripts isn't loaded. In cases like water, though, my ideas probably wouldn't do much because, as I understand it, we have very little control over when and how modules are loaded into and unloaded from memory- in fact, they would likely do more harm than good by increasing the overhead that comes from loading and unloading modules. If you're hauling x number of items, a crate is lighter than the same number of items in individual boxes. We would have to work from the assumption that loading of data is cumulative- once loaded, we're stuck with it. My ideas would only be helpful where they would reduce the total amount of data loaded on the page as a whole. The key would be how smart we are about how we split things and how the initial module decides what to load- and that isn't something I can help with. Chuck Entz (talk) 14:02, 25 April 2019 (UTC)
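As a rough illustration of the trade-off described above, here is a sketch in Python rather than Lua, with a made-up data layout: a small index says which shard holds each language code, and shards are loaded on first use and never unloaded. It only models the "loading is cumulative" assumption and says nothing about how Scribunto actually manages memory.

```python
# Hypothetical layout: per-script shards plus a small code -> shard index.
SHARDS = {
    "Latn": {"en": {"name": "English"}, "nl": {"name": "Dutch"}},
    "Grek": {"el": {"name": "Greek"}},
}
INDEX = {"en": "Latn", "nl": "Latn", "el": "Grek"}


class LanguageData:
    """Loads a shard the first time one of its codes is requested."""

    def __init__(self, index, shards):
        self._index = index
        self._shards = shards  # stands in for separate data modules
        self.loaded = {}       # shard name -> data; never unloaded

    def get(self, code):
        shard_name = self._index[code]
        if shard_name not in self.loaded:
            self.loaded[shard_name] = self._shards[shard_name]
        return self.loaded[shard_name][code]


data = LanguageData(INDEX, SHARDS)
data.get("en")
data.get("nl")
print(sorted(data.loaded))  # only the shard that was actually needed
```

In this model, splitting helps only on pages where some shards are never requested. A page that touches every shard ends up holding all the data anyway, plus the overhead of the index.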
I do think there is something to gain by splitting the data based on usage. A common source of errors is large translation tables, so we may have a look at what kind of data is necessary for translations. As of right now, we need:
  • Language code, for tagging.
  • Language name, for section linking and perhaps categorising.
  • Script, for detection and again for tagging. Scripts, in turn, result in the importing of Module:scripts/data as well.
  • Transliteration, for languages where that is needed. That imports a transliteration module in turn.
  • Information about replacements in display forms to create the right page name (display diacritics and such).
  • Information about language type, to block reconstructed languages from translation tables.
  • Interwiki information. This is unique to translation tables and not used anywhere else. I'm not sure it's necessary to have it in translation tables anyway, so we may just get rid of it.
All in all, this is quite a lot of information, and we're going to need this information for potentially every language which can appear in a translation table, which is almost all of them in the long term ("all words in all languages" implies also "translations of all English words into all languages"). A lot of the information lookups can be bypassed, however. The language name would not be needed if we adopted a page-per-language scheme in which the code is used in the page name. [[vertaling#Dutch|vertaling]] would become [[nl/vertaling|vertaling]], and the need to find "Dutch" would go away. Alternatively, we could have a link anchor called nl at the top of the Dutch language section on that page, which the link could then point to. The script detection can be bypassed by specifying sc=, but that would still need Module:scripts/data. Transliterations can be bypassed by providing them manually too with tr=. Display form replacements can be bypassed by including both the page name and alt form in the translation. Language type is used to make sure people don't do things we don't want, but if we have a bot checking that, the module doesn't have to. Interwiki information isn't all that useful anyway so we may get rid of it regardless. —Rua (mew) 11:58, 25 April 2019 (UTC)
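The two link schemes and the tr= bypass mentioned above can be sketched as follows. This is Python for illustration only; the function names and the tiny name table are hypothetical stand-ins, not the actual module code.

```python
# Stand-in for a large data module (contents hypothetical).
LANGUAGE_NAMES = {"nl": "Dutch", "fr": "French"}
name_lookups = []  # records every lookup so the cost is visible


def language_name(code):
    name_lookups.append(code)
    return LANGUAGE_NAMES[code]


def link_by_section(code, term):
    """Current scheme: the section anchor needs the language name."""
    return f"[[{term}#{language_name(code)}|{term}]]"


def link_by_code(code, term):
    """Page-per-language scheme: the code alone is enough."""
    return f"[[{code}/{term}|{term}]]"


def transliteration(term, tr=None, automatic=None):
    """Prefer a manual tr= value; fall back to an automatic function if given."""
    if tr is not None:
        return tr
    return automatic(term) if automatic is not None else None


print(link_by_section("nl", "vertaling"))
print(link_by_code("nl", "vertaling"))
print(name_lookups)
```

Only the section-based link touches the name table. The code-based link and a manually supplied transliteration need no lookup at all.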
Perhaps a dumb question, but how important is it to have a transliteration in the translations section? How often is that necessary, and is it too onerous to require that users who need it click through to the target page? I have no idea what percent of total resources consumed by {{t}} are in each function, but it seems the module could do a lot less without sacrificing much utility at all. - TheDaveRoss 15:06, 25 April 2019 (UTC)
From time to time I have found it useful to scan multiple transliterations, eg, when trying to track down possible etymologies from languages with scripts I can't read. But more often I just want one transliteration. If we have a proper entry, then the transliteration is just one click away. For translations without entries, would it be possible to get a transliteration on demand, say, using a JS gadget? The idea would be to click on the word and then or thereby trigger the gadget. DCDuring (talk) 17:52, 25 April 2019 (UTC)
@DCDuring: It wouldn't be too hard. The gadget could expand a template or call a module function to generate the transliteration and display it somewhere or other. I don't know how to design it well, so that you can still follow the link. (And I guess in this scenario folks without JavaScript or an internet connection wouldn't be able to see the transliteration.) — Eru·tuon 08:38, 29 April 2019 (UTC)
@Erutuon: Now we have entries that load very slowly and may not finish properly loading at all. Transliteration of redlinked translations (transliteration of blue-linked terms is but a click away) is not our most important content, as useful as it may be. BTW, couldn't we have gadgets for IPA on demand, audio on demand for any selected term, at least for any one for which we had an entry with such information, etc. DCDuring (talk) 12:05, 29 April 2019 (UTC)
Manual transliterations require that the transliteration scheme be unchanging, as well as needing a mechanism for checking new ones. However, transliterations matter in languages where homographs have different transliterations - Japanese, Thai and Arabic spring to mind. Picking the right homograph after clicking through will require effort. A policy of transliterating words with ambiguous spellings but only them would be confusing. Just possibly one could automate a rule that only certain languages need transliterations in translations, but that is yet more data to store or unpack. RichardW57 (talk) 17:36, 25 April 2019 (UTC)
Homographs should always have the same transliteration. That's what transliteration means, after all. —Rua (mew) 11:13, 28 April 2019 (UTC)
(a) That's not what is being done with {{t|tr=...}} for those languages.
(b) Should homographs that are spelt out differently be transliterated the same? The way of spelling out in Thai that I learnt distinguished monosyllabic เพลา (plao, axle) from disyllabic เพลา (pee-laa, time).
(c) Should homographs that are encoded inequivalently necessarily be transliterated the same?
-- RichardW57 (talk) 20:26, 28 April 2019 (UTC)
Re: "Homographs should always have the same transliteration." No. Identical graphical transliteration (letter by letter) may only create a meaningless mess nobody can use. เพลา (pee-laa) and เพลา (plao) share the same "e b l ā" spelling, which is only of interest to someone who is trying to understand the script. --Anatoli T. (обсудить/вклад) 01:33, 22 June 2019 (UTC)

What if we combine the one-page-per-language solution with the all-languages-on-one-page solution by using iframes? We move all language sections to separate pages by bot and then embed those x language pages on pages like do. Loading multiple HTML resources would be a drawback and would also strain the server, so each language section (at least on pages with enough languages) would require an additional click from the user: “to load this language” or “to load all languages” (“this page contains 146 languages. Do you really want to load all frames?” – there are various ways to design this). The result would also save resources by comparison, since currently a user just loads everything. And it seems technically possible: Mediawiki even supports iframes for external webpages. Fay Freak (talk) 17:29, 25 April 2019 (UTC)
Or this was a crude technical idea. Maybe it is about XHR. And together with a “click to load” one needs a “click to open” for all to work without Javascript. Fay Freak (talk) 17:49, 25 April 2019 (UTC)

That is something I thought about. But at the same time, I question the need to have one page with all homographs across all languages. As mentioned on Wiktionary:Per-language pages proposal, the primary use case is that someone has one particular language that they want to find words in, the other languages are totally irrelevant noise. I know of no other dictionary that does it this way. —Rua (mew) 10:13, 1 May 2019 (UTC)

Misencoding v. Misspelling

There are several cases where different sequences of Unicode characters may render the same, and thus represent the same word and, to a Unicode-unaware reader, have the same spelling. Several cases are irrelevant:

  • Different scripts
  • Insertion of layout controls
  • Canonical equivalents

A simple example is KHMER CONSONANT SIGN COENG DA v. KHMER CONSONANT SIGN COENG TA. The Unicode standard recommends that the choice depend upon the pronunciation. Now, if there is significant variation in usage in a word, but without a difference in pronunciation, and only one is deemed right by Khmer standards, is it appropriate to call the other a 'misspelling'? The two are indistinguishable on the printed page.

Other possibilities include the use of <LETTER A, SIGN AA> v. <LETTER AA>. In such cases, one is normally deprecated.

Another issue is Tai Tham encodings. Sometimes a permutation of the characters will have the same rendering, but an entirely different pronunciation. Usually only one of these actually occurs in a language. A common cause of incorrect encodings is a simplistic conversion from the Thai script to the Tai Tham script - many of the encodings in the recently published New Testament are like this.

For such cases, would it be appropriate to have a template {{misencoding of}} for use instead of {{misspelling of}}? RichardW57 (talk) 20:46, 25 April 2019 (UTC)

The number of possible combinations seems excessive. I would say, no. {{also}} can be used if there are 2 valid entries with similar characters. Otherwise the search functionality should (hopefully) take care of it. DTLHS (talk) 20:49, 25 April 2019 (UTC)
{{also}} only helps with visual homographs that are different words. How does one fix the search functionality to recognise <MA, AE, SAKOT, WA> (pukka entry ᨾᩯ᩠ᩅ) as a suitable match for a search string ᨾ᩠ᩅᩯ <MA, SAKOT, WA, AE>? That search string finds no matches. (In Thai script, they would both be แมว.) How will the user learn that the latter encoding is wrong? It's only wrong because there is no such word.
Are you saying that misencodings meriting entries should be tagged as misspellings? RichardW57 (talk) 01:00, 26 April 2019 (UTC)
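To make the Tai Tham example concrete, the sketch below checks that the two sequences are not canonically equivalent, so normalization alone cannot match them. It then shows one hypothetical way a search could flag candidates: treat a title as a possible match if it uses the same code points in a different order. Since a permutation may also be a genuinely different word, this only proposes candidates for a human to check. The function names are assumptions for illustration.

```python
import unicodedata

entry = "\u1A3E\u1A6F\u1A60\u1A45"  # <MA, AE, SAKOT, WA>, the existing entry
query = "\u1A3E\u1A60\u1A45\u1A6F"  # <MA, SAKOT, WA, AE>, the search string


def canonically_equivalent(a, b):
    """True if the two strings have the same NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def same_code_points(a, b):
    """True if b uses exactly the code points of a, in any order."""
    return sorted(a) == sorted(b)


def misencoding_candidates(search, titles):
    """Titles that differ from the search string only in code point order."""
    return [t for t in titles if t != search and same_code_points(t, search)]


print(canonically_equivalent(entry, query))
print(misencoding_candidates(query, [entry]) == [entry])
```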
I think misencodings should be caught automatically and added to special tracking categories, either to be fixed or accepted as an exception. Cyrillic and Arabic script based languages are riddled with such errors, some are even intentional (lazy, lack of proper keyboards, bad practice becoming common). --Anatoli T. (обсудить/вклад) 23:21, 25 April 2019 (UTC)
That deals with errors in entries, which is not what I am asking about. RichardW57 (talk) 01:00, 26 April 2019 (UTC)
I am talking about cases like چکى(čeki) (wrong) in this revision, which is an incorrect homograph of چکی(čeki) (correct). --Anatoli T. (обсудить/вклад) 01:55, 26 April 2019 (UTC)
If the incorrect چکى is common enough in Persian, then it merits an entry just as 'seperate' does. How does one advise the reader that it is wrong? Just say 'misspelling'? RichardW57 (talk) 03:04, 26 April 2019 (UTC)
Nope, just using the wrong keyboard, Arabic instead of Persian or copypasta. There are some similar cases where it's an alt form but in this case, it's simply wrong. Such entries should have tracking cats. --Anatoli T. (обсудить/вклад) 03:10, 26 April 2019 (UTC)
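A rough sketch of what such an automatic check might look like for Persian. The substitution table is an assumption covering only the common Arabic-keyboard letters, and the function names are hypothetical. As noted above, some occurrences may be legitimate alternative forms, so the output is meant for a tracking list rather than for blind replacement.

```python
# Assumed rule set: Arabic-keyboard letters and their usual Persian counterparts.
PERSIAN_SUBSTITUTIONS = {
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def suspicious_chars(term):
    """List the characters in a Persian term that look like Arabic-keyboard input."""
    return [ch for ch in term if ch in PERSIAN_SUBSTITUTIONS]


def suggested_fix(term):
    """Return the term with each suspicious character replaced."""
    return "".join(PERSIAN_SUBSTITUTIONS.get(ch, ch) for ch in term)


wrong = "\u0686\u06A9\u0649"  # ends in ALEF MAKSURA
right = "\u0686\u06A9\u06CC"  # ends in FARSI YEH
print(suspicious_chars(wrong) == ["\u0649"])
print(suggested_fix(wrong) == right)
```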

You're probably using the wrong dictionary

http://jsomers.net/blog/dictionary Justin (koavf)TCM 07:31, 27 April 2019 (UTC)

I agree. Dictionaries should be used by no one- they should be locked away after they are created and possibly unearthed several centuries later for use in historical linguistics. DTLHS (talk) 07:43, 27 April 2019 (UTC)
Nice find. Because the article is nearly 5 years old, link rot has set in. If you want to get to Webster 1913 you need to try Webster dictionary, free dictionary or use OneLook. DCDuring (talk) 09:06, 27 April 2019 (UTC)
The article could be read as saying that opinionated prescriptivist definitions are more fun than neutral descriptivist ones. It also suggests that synonym lists (a la Wikidata and Wikisaurus) aren't as useful as they might be because they lack explanations of subtle differences. DCDuring (talk) 09:13, 27 April 2019 (UTC)
I think that is a proper reading of it. I like how at voy:en:, we use a more engaging tone. In a dictionary, we definitely have a mission to be neutral and factual but there are definitely options for spicing up the reading experience with more engaging quotes, usage notes, useful appendices, fun categories, and interesting graphics for our visual dictionary. Sometimes it's easy to forget that we are making not only something factual but something that human beings may want to read. It's not read narratively like w:en or s:en or b:en, but it's not pure data or media like d:, c:, or species:. Just something to bear in mind about our copy and how much more engaging things around it can be. —Justin (koavf)TCM 16:30, 27 April 2019 (UTC)
Decent essay. 1. It does bother me that sometimes we had good (if faintly colourful) definitions before, which captured some nuances of a particular word, and then someone with good intentions has tried to modernise them and at the same time erased the subtle distinctions between these words and others. 2. A nice feature of many thesauri (and certain dictionaries) is a separate paragraph comparing the various words and giving examples. I have a recent-ish Chambers CD-ROM where these come under the heading "Synonym nuances": e.g.
"Dislike is a fairly mild term for something simply being displeasing, whilst despise is far stronger and implies an element of contempt. Both detest and loathe would similarly refer to something deeply felt, suggesting extreme hatred, while not stand is also suggestive of an inability to bear: she could not stand the sight of him, so she left the room. [...]"
Perhaps something we could work into our dedicated thesaurus-namespace pages. Equinox 02:10, 28 April 2019 (UTC)
I like and anti-despise that idea. —Justin (koavf)TCM 02:35, 28 April 2019 (UTC)
The templates in Thesaurus space suppress the display of any definitions or comments provided. Only about 40 Thesaurus entries have notes of any kind, most of which lack the kind of assessment that Chambers has as "Synonym nuances". DCDuring (talk) 03:03, 28 April 2019 (UTC)

Use a bot to orphan Template:etyl

I'm not sure why we haven't used a bot to orphan {{etyl}} instead of doing it all by hand. A bot will be able to do it a lot faster if it just replaces it with {{der}} everywhere. —Rua (mew) 11:11, 28 April 2019 (UTC)

If we (or a bot by proxy) do this replacement, we may lose the information that possibly – in fact quite often – a more specific template ({{bor}}, {{inh}}) is in order.  --Lambiam
As a user who isn't very well-versed in etymologies but corrects formatting often, I've been changing {{etyl}} to {{der}} by hand under the assumption that once we eliminate {{etyl}}, entries with only {{der}} will be put in a category for review. I can't find the discussion where that was unofficially decided on, but it seems like the logical progression. Ultimateria (talk) 21:03, 28 April 2019 (UTC)
That's not the impression I got, especially after Wiktionary:Beer parlour/2019/March#Making etymological derivations more specific, retiring {{der}}. It seems that people want to keep {{der}} around. Sadly, there is no category for unspecific derivations, instead the "terms derived from" categories lump everything together indiscriminately, from {{inh}} and {{bor}} too. I find this completely pointless and would much rather have separate categories. A category for terms not inherited from an ancestral language is valuable information, and Category:English terms derived from Middle English could serve that purpose, but currently doesn't because of all the noise added by {{inh}}, {{etyl}} and erroneous uses of {{der}}.
Moreover, there is no way to track down which entries have {{der}} added when they should have something more specific. User:Donnanz has been changing everything to {{der}} indiscriminately, but when I complained that he wasn't actually cleaning anything up but just making the future cleanup others had to do harder, he claimed that he didn't "believe in" {{inh}} and {{bor}}, which apparently justifies putting {{der}} everywhere. :/ —Rua (mew) 22:20, 28 April 2019 (UTC)
I think you are correct (opening paragraph), a bot can't be used if a choice has to be made between {{der}}, {{inh}} and {{bor}}. There is a further complication with Category:etyl cleanup no target/language, where {{etyl}} can often be replaced with {{cog}}, but it depends on circumstances. DonnanZ (talk) 22:50, 28 April 2019 (UTC)
Another complication, which can only be found by manual editing, is when the etymology is done incorrectly in the first place; e.g. {{etyl|fr|en}} {{m|en|[[term]]}} (when it should have been {{m|fr|[[term]]}}); using a bot would probably create a twice-borrowed term. DonnanZ (talk) 23:05, 28 April 2019 (UTC)
A bot can catch such cases by comparing the language codes in the two templates and only replacing if they match. —Rua (mew) 23:09, 28 April 2019 (UTC)
And what would happen if they don't match? No cleanup? DonnanZ (talk) 23:22, 28 April 2019 (UTC)
Yes, that's how I usually write a bot. It only handles the clear-cut cases, leaving the problematic ones to humans. The goal is not to do everything, but to make the humans' work easier by eliminating all the easy cases. —Rua (mew) 23:30, 28 April 2019 (UTC)
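The "clear-cut cases only" approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of any actual bot: it only looks at an {{etyl}} immediately followed by an {{m}}, assumes neither contains nested templates, and the function name is made up.

```python
import re

# {{etyl|SOURCE|DEST}} immediately followed by {{m|LANG|...}}.
# Assumption: no nested templates inside either one.
ETYL_PLUS_MENTION = re.compile(
    r"\{\{etyl\|([^|{}]+)\|([^|{}]+)\}\}\s*\{\{m\|([^|{}]+)\|([^{}]*)\}\}"
)


def replace_clear_cut(wikitext: str) -> str:
    """Merge etyl + m into der only when the language codes agree."""

    def swap(match):
        source, dest, mention_lang, rest = match.groups()
        if mention_lang != source:
            # Mismatched codes: leave the text untouched for a human.
            return match.group(0)
        return "{{der|" + dest + "|" + source + "|" + rest + "}}"

    return ETYL_PLUS_MENTION.sub(swap, wikitext)
```

Under these assumptions, `{{etyl|fr|en}} {{m|fr|mot}}` becomes `{{der|en|fr|mot}}`, while the mismatched `{{etyl|fr|en}} {{m|en|[[term]]}}` from the comment above is returned unchanged. The sketch does nothing about choosing between {{der}}, {{inh}} and {{bor}}.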
It is valuable information, but likely it is also a wrong focus, considering that 1. exactitude cannot be achieved anyway because of incertitude 2. the distinction is arbitrary anyway, in so far as for example one can sometimes not even see the difference between a borrowing, a phono-semantic matching and a semantic loan: So it depends on your judgment if it is a “new word” or “an extension in meaning by influence of a foreign word” or “the pattern has been used intentionally to be like a foreign term” (for instance with خَرِيطَة(ḵarīṭa) you see how fragile the whole system is: what is even a doublet? We have absolutized catchwords encountered in the literary genre of linguistic works) 3. the opportunity costs of cleaning up are high 4. the more distinctions you add to the templates, the more straining it is for the editor, for he needs to stay conscious of all the templates around, and while he thinks about or even researches such a detail he does not focus on the more utilitarian parts of the dictionary, and this is a critical step towards “too many”. We can live with not every etymology being worked through by date and full chain: That mattamore derives from Arabic مَطْمُورَة(maṭmūra) is the key information, and the direct source and earliest uses might be interesting but such research cannot be done in general, for our millions of words, unless the AI takes over, but nobody needs categorization of whether an English word is directly from Arabic or mediated through French or whatever particularly because of the known imperfectness of editor knowledge, and whether the German Kebse is borrowed from Middle High German kebse or inherited or reinforced or whatever is almost not even interesting because of the identity of the two lects: Categorizing so finely we only win a match in metaphors, like whether a word has been resurrected or only awaked from rarity. 
You see how arbitrary it is by recollecting that there isn’t even a rule when a word dies out: Is a word used once in three years by three writers in turn dead and reborrowed from an earlier stage every time or are three years of being in memory of writers enough, or is it the nine that matters? An internet meme, when used after having died out, is reborrowed after which timespan of being unused? And which places matter for the judgment of what has died out, is a meme resurrected on 4chan a borrowing if it has survived in Facebook corners? Silly. Can Spanish actually borrow from Latin or is it just inheritance which the cultural technique of writing has made possible, to cache words on paper to later pick them up, oneself or one’s successors? These pigeonholes do not cause accrescence of knowledge, only when for each word it is stated what has happened. I am sorry to deconstruct your fun.
We can keep {{bor}}, {{inh}}, {{obor}}, {{lbor}}, {{psm}}, {{sl}}, and so on, I suggest so, if only lest already done work is not thrown away, but I agree with the proposal to bot away {{etyl}}, because even if editors could categorize better manually it doesn’t mean they should, but we should get rid of {{etyl}} for the technical reasons voted upon and so it does not get used any more and is not on our minds. For there are many other things that should be done even more.
Is “a category for terms not inherited from an ancestral language” your actual only desideratum? Maybe we add |noinh= to {{der}} to solve this, then {{der}} without this parameter will be, used for ancestors, for unspecific derivations e contrario because of lacking specific categorization and with |noinh=1 you categorize – invent you the name of the category, I spare my phantasy here.
Apart from this it can be that creoles might need a lexifier system aside from the ancestor system, but I am not capable to appraise those. Fay Freak (talk) 23:53, 28 April 2019 (UTC)
A minor comment. I think the interesting situation with خَرِيطَة(ḵarīṭa) can be adequately addressed by providing two etymologies for the term, one for its original senses, in which it is just a native Semitic word, and one for the map sense, in which it is a phono-semantic repurposing of an existing word.  --Lambiam 10:39, 29 April 2019 (UTC)
Doesn’t it have two etymologies already, but in one sentence? It seems slily formulated. Fay Freak (talk) 11:14, 29 April 2019 (UTC)

I've made some cases of "en" categorise further by source language in Category:etyl cleanup/en. This may make it easier to fix certain cases automatically. Is it ok to assume that all cases in Category:etyl cleanup/en/enm and Category:etyl cleanup/en/ang are safe to convert to {{inh}} if they are the first term in the etymology? We do have Category:English terms borrowed from Middle English and Category:English terms borrowed from Old English, but there are so few terms in there that we can probably assume the vast majority of English terms from these languages are inherited. —Rua (mew) 12:19, 9 May 2019 (UTC)

Safe to assume? I thought you were advocating for the distinction between inherited and borrowed terms so that we could be as precise as possible. To that end, wouldn't etymologies from parent languages need to be individually researched? Ultimateria (talk) 15:57, 16 May 2019 (UTC)
Sure, if you want to go through all tens of thousands of them yourself... —Rua (mew) 16:30, 16 May 2019 (UTC)
It may be 'safe', whatever that means, but it won't be correct. For example, mathom is a borrowing from Old English, currently marked with {{der}}. On the other hand, dreng, which I learnt from a discussion of *Modern* English phonology as an exceptional form, isn't marked as modern. It's supposed to be used by antiquarians, but I've only got a 'mention' that I could cite. RichardW57 (talk) 23:52, 16 May 2019 (UTC)

German Pluralia Tanta

Can we somehow cross Pluralia tanta out of the German category terms with incomplete gender? It's pretty much part of their definition not to have one and Duden just says "plural word" where the gender should be. Popular words like Ostern and Eltern are defined as being either neuter or both neuter and masculine, but that could then be reversed. Anatol Rath (talk) 20:09, 28 April 2019 (UTC)

I agree. German has no gender in the plural, only in the singular, so the gender of plural nouns simply isn't defined. Looking at the code of {{de-plural noun}}, there is a parameter to specify the gender and a list of valid genders, but since gender doesn't make sense for these nouns, this parameter should probably not even be on the template. I've removed the gender logic and made the template add pages to Category:de-plural noun with 1 whenever the gender is present, so we can find and fix those cases. —Rua (mew) 22:26, 28 April 2019 (UTC)
In the end, I just deleted {{de-plural noun}} altogether, and made {{de-noun}} able to handle them instead. It's cleaner that way. —Rua (mew) 23:04, 28 April 2019 (UTC)
We do the same for Yiddish (e.g. הייוון(heyvn)). —Μετάknowledgediscuss/deeds 03:53, 29 April 2019 (UTC)
Some previous discussions of German plurals' genders: WT:RFC#Category:German_unknown_gender_nouns and Talk:Antibabypillen. In the former discussion, entries were being changed from "p" to "n-p" etc to get them out of Category:German terms with incomplete gender; now they seem to be going back into that category as a result of the specific plural genders being unrecognized. Possibly the template should accept "n-p" etc and just silently treat it like "p". It does seem easier to treat "plural" as a gender / as not having a gender. - -sche (discuss) 06:04, 30 April 2019 (UTC)
I've added temporary tracking to {{de-noun}} to track the invalid genders, see Special:WhatLinksHere/Template:tracking/de-headword/genders. —Rua (mew) 20:22, 30 April 2019 (UTC)
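The suggestion of accepting "n-p" and silently treating it like "p", combined with tracking of anything else, could look roughly like this. It is only a sketch in Python; the real logic lives in the template and its module, and the names and the set of accepted values here are assumptions made for the example.

```python
# Assumed set of values the headword template accepts as-is.
VALID_GENDERS = {"m", "f", "n", "p"}


def normalize_gender(spec: str) -> tuple:
    """Return (gender, needs_tracking) for a gender parameter value."""
    if spec in VALID_GENDERS:
        return spec, False
    base, sep, number = spec.partition("-")
    if sep and number == "p" and base in {"m", "f", "n"}:
        # "m-p", "f-p", "n-p": plural has no gender, so collapse to "p".
        return "p", False
    # Anything else is passed through but flagged for the tracking page.
    return spec, True
```

So `normalize_gender("n-p")` gives `("p", False)`, and an unrecognized value comes back flagged for tracking rather than being dropped.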

Blocked on the Russian Wikipedia

I am blocked on the Russian Wikipedia on the grounds that my IP address belongs to a hosting company and IS an open proxy. Not exactly surprised, considering the political crackdown currently in place but I wasn't exactly political in Wikiprojects. Can't think of it as anything other than political. I have requested an unblock. --Anatoli T. (обсудить/вклад) 03:51, 29 April 2019 (UTC)

Sorry. Is there anything any of us can do to help? DCDuring (talk) 01:14, 30 April 2019 (UTC)
UPDATE: Thank you. Turned out to be more trivial. They say the VPN that my company uses is an open proxy and it's no good. It hasn't caused issues anywhere else. Can still edit from my home PC. --Anatoli T. (обсудить/вклад) 01:30, 30 April 2019 (UTC)
You may already be aware, but if not: there's a user right called "IP block exempt" which is described in English at w:Wikipedia:IP block exemption and in Russian at w:ru:Википедия:Исключение из IP-блокировок, which you could request a ru.WP or meta: steward grant you. - -sche (discuss) 05:52, 30 April 2019 (UTC)

LangCom proposed modification to policy

Cross-post from wiktionary-l, meta:Language_proposal_policy/4-2019_proposed_revision. I didn't know of the existence of the "language committee" (unhelpfully abbreviated to LangCom). – Jberkel 09:59, 30 April 2019 (UTC)

On digitizing specialized dictionaries

http://languagelog.ldc.upenn.edu/nll/?p=42607 Justin (koavf)TCM 17:26, 30 April 2019 (UTC)

Has anyone here experience with OCR? I'm just trying to convert an old Italian-English dictionary and it seems to work rather well. Tesseract 4.0 (Oct 2018) added a neural net-based engine and you can specify a list of languages to recognize, so it's great for dictionaries or other multi-lingual sources. – Jberkel 00:17, 7 May 2019 (UTC)
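For anyone wanting to try the multi-language option, a minimal sketch of calling the command-line tool from Python is below. It assumes the `tesseract` binary and the `ita` and `eng` language data are installed; the file names and helper functions are hypothetical. Languages are combined with `+` in the `-l` option.

```python
import subprocess


def build_tesseract_command(image_path: str, output_base: str, languages: list) -> list:
    """Build the argument list for a multi-language tesseract run."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path: str, output_base: str, languages: list) -> None:
    """Run tesseract; the recognized text is written to output_base + '.txt'."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,  # raise if tesseract exits with an error
    )
```

For an Italian-English page that would be `run_ocr("page.png", "page", ["ita", "eng"])`, where `page.png` stands in for a scanned page.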

Splitting parts of speech into different etymologies

As you may know, Rua likes to treat different parts of speech as different words, assigning, for example, a noun and a verb with common origins to different etymologies. I think it's valuable to include the etymologies of both, since they do technically have slightly different origins, but when it comes down to it, they're really the same word. Rua's approach places different parts of speech on the same level of separation as completely unrelated homonyms, which seems odd.

My solution was to put the etym for different parts of speech under the same heading, but mention both, which I did at trace, overweening (note that in this case, the etyms were at least externally identical), and pall (two homonyms, four separate etyms). Rua reverted/undid all of my changes (note that she does this not only with entries she creates, but with any entry she sees fit to modify, meaning that she enforces her preferred layout, and anyone who disagrees is powerless to do anything without starting an edit war).

When it comes down to it, this affects the fundamental structure of our entries, so it seems like something we should approach with some sort of uniformity. I think Rua is doing something very useful by distinguishing the origins of different parts of speech, but I think we should stop short of treating them as entirely unrelated words by placing them in different etymology sections. I propose the mergers I made in my edits as examples of how we should treat the issue. I also think we should codify our practice here, so that we don't create confusion by being so inconsistent in the way we present our entries.

What think you all? Andrew Sheedy (talk) 19:58, 30 April 2019 (UTC)

I think we need better headings than "Etymology X", if we're going to do a lot more splitting. DTLHS (talk) 19:59, 30 April 2019 (UTC)
It is simply a matter of "every term has an etymology, thus every term gets an etymology". It has nothing to do with the relatedness of words, that's what "Related terms" is for. Things like "the verb is from" and "the noun is from" hark back to {{sense}}, where the user is sent all over the place to connect information together. By using the structure of the entries themselves to group etymologies with the terms they belong to, it's made easier for the reader because they can immediately see that grouping. This is why I believe that everything that belongs with a particular term should be at level 4, nested under the main POS header for the term. The only level 3 headers should be POS headers. I also believe in placing information under the thing to which it is associated in general. So not only should etymologies and pronunciations be placed under their term/POS header, but also alternative forms. Semantic relations such as synonyms, and term-specific categories like {{topic}}, should likewise be placed under the sense to which they belong. The fact that the new placement of "Alternative forms" was approved in a vote, and so was the placement of semantic relations, shows that there is some support for following this principle. I only ask that we be more consistent with it. —Rua (mew) 20:13, 30 April 2019 (UTC)
But your approach doesn't work with the entry layout we have. As a user, I find it somewhat confusing when one entry has two closely related parts of speech like the noun and verb forms of pall separated from each other just as much as words that have entirely different origins (again, like pall). It's not intuitive to most people that a noun and a verb with almost identical origins would be separate like that.
It's also not what other dictionaries do. Most print dictionaries I've used have different headwords for each homonym, distinguished by superscript numbers. They do not have different headwords for each part of speech. What you're doing is equivalent to conflating the two levels of division, which one would never see in a print dictionary, and which is nothing but confusing. Andrew Sheedy (talk) 20:29, 30 April 2019 (UTC)
What I'm doing is making the treatment of words independent of homography. If two terms would have different etymologies if they were written differently, they also get different etymologies if they are written the same. Terms don't suddenly stop being different terms with different etymologies just because their spelling happens to be the same. Moreover, again, I'll bring back up the lemma argument. The lemma form that is chosen for a given word is arbitrary and generally a result of historical considerations and traditions. It is through accident that the lemma of English verbs, English nouns and English adjectives is identical, but this is not necessarily so for other languages, nor does it have to be the same for English either. I'm simply doing the same for English as I would do for any other language where lemmas are not spelled the same, and ignoring the historical accident that made the lemma forms the same in this particular case in favour of the general case. —Rua (mew) 20:37, 30 April 2019 (UTC)
That much I can agree with. I just can't agree with doing it under the current entry layout, with etymology being first. The first change that needs to be made is the order of headings in an entry, not how many etymology headers we have in an entry. Andrew Sheedy (talk) 20:47, 30 April 2019 (UTC)
I would much prefer placing etymology grouped with the term/POS whose etymology it describes. But the current entry layout is not a roadblock to an etymology-per-term principle, so I'm not sure why you think it is. It could be achieved if we simply agree to have at most one term per etymology section, which is what I've been effecting. —Rua (mew) 21:14, 30 April 2019 (UTC)
I don't think it's a roadblock, I just don't like it. It makes grouping everything under etymologies pointless, and it's confusing. Beyond that, though, you seem to be the only one doing it, which creates inconsistencies that drive me crazy. Andrew Sheedy (talk) 21:30, 30 April 2019 (UTC)
I found the practice of putting etymologies at level 3 and POS at level 4 pointless from the start. It never made any sense to me and still doesn't. You can't expect me to apply something I fundamentally cannot wrap my head around. I literally have no idea how to give each term its etymology yet at the same time pretend that their etymology is the same. It's a total contradiction. So instead, what I have done since I started editing Wiktionary is a kludge, fitting the information structure I have in my head within the twisted confines that WT:EL imposes, while hoping that someday there will be a consensus to fix it. It's not that I haven't tried to get consensus either; I just carry on with my kludge for as long as it's necessary while poking people about it from time to time. —Rua (mew) 21:40, 30 April 2019 (UTC)
But the etymologies aren't all that distinct. Most people would consider them the same word. They aren't true homonyms, so it seems odd to put them under entirely different etymology sections. Andrew Sheedy (talk) 22:21, 30 April 2019 (UTC)
And that's where it doesn't make sense to me. I look at Dutch verbs and nouns and they clearly have different etymologies, because they don't have the same lemma form. So I do that for English too, even where they do have the same lemma form. I don't see a way to pretend that they have the same etymology when they don't. —Rua (mew) 10:10, 1 May 2019 (UTC)
Part of the confusion/conflict/(pick your noun) may come from Wiktionary using "Etymology" in the titles for each new homograph section. A print dictionary just uses superscript numbers to indicate groupings of the parts of speech for each homograph, but instead of using "Noun 1", "Noun 2", etc., Wiktionary uses its current structure. So it would have to be determined how to indicate separate homographs if "Etymology" were to only be used strictly for etymological text. As for my opinion on the current structure, as a user it doesn't matter to me if the various etymological info is all appearing before the various parts of speech and explaining how those parts of speech evolved. And in cases where the adjectives are derived from the nouns and the nouns from the verbs, they really are the same term being used in different ways (since parts of speech could be thought of as arbitrary groupings of forms of usage). With words like overweening, the term comes from a similar form in Middle English but is also defined as a participle and gerund of the verb overween. I assume that overweninge was also the participle of overwenen in Middle English so that overweninge is the ancestor of the participle overweening (but I'm not an expert on older forms of English). I think that trying to dissect it as much as has been done does the reader a disservice, and it would be better if the etymology of the three parts were combined in a single section. -Mike (talk) 17:37, 1 May 2019 (UTC)
But then you end up with three etymologies in a single etymology section, which makes no sense. Every term should have its own etymology. And no, I do not consider a verb and a noun to be the same term. We only do that for Chinese currently, where the distinctions between the lexical categories are much more fluid. In English, not every verb is a noun or vice versa, they are clearly separated, and thus have separate etymologies. And we should not try to hide those etymologies from the user, it's ridiculous to think otherwise. Nobody in their right mind would consider koop and kopen to be the same term with the same etymology, yet somehow English has less right-minded people. —Rua (mew) 17:45, 1 May 2019 (UTC)
"I do not consider a verb and a noun to be the same term." Verily, yet I can noun your verb or verb your noun. Where such distinctions of origin are really profound, perhaps it makes sense to split. Where the distinctions are less extreme / more minimal, it may make more sense to group -- cf. pall. It also bears stating that English is not Dutch.
The proposed approach of organizing entry structure by POS first and etym second would wreak absolute havoc on our Japanese entries. See as one extreme example. ‑‑ Eiríkr Útlendi │Tala við mig 19:31, 1 May 2019 (UTC)
Then why don't we allow both structures, side by side, and allow editors to choose what is the most fitting? We already did this with the new semantic relations templates and the new placement of alternative forms. —Rua (mew) 19:50, 1 May 2019 (UTC)
I would not support this proposal. --{{victar|talk}} 19:04, 1 May 2019 (UTC)
Having worked on pall, I would suggest that (at least in English entries) if the different parts of speech ultimately have a common etymology somewhere down the line, it is OK to group them under the same etymology heading (see [10]). The etymology should set out the derivation of each part of speech. — SGconlaw (talk) 08:30, 9 May 2019 (UTC)