Wiktionary:Beer parlour/2020/September

Is it non-controversial to run bot-tasks to apply the conventions at WT:NORM?

As of the last XML dump, there are 88,299 entries that violate WT:NORM in ways that Special:AbuseFilter/103 detects. (Of these, 74.6% violate the "One blank line before all headings, including between two headings, except for before the first language heading" rule, and 44.7% violate rules besides that one. There's overlap, obviously.)
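For illustration only, the blank-line rule quoted above could be checked and repaired roughly as follows. This is a minimal sketch, not the logic of Special:AbuseFilter/103 or of any existing bot: the function names are hypothetical, and it assumes that the first heading on the page is the first language heading.

```python
import re

# A heading line such as ==English== or ===Noun=== (levels 2 to 6).
HEADING = re.compile(r"^(={2,6})\s*[^=\s].*?\s*\1\s*$")


def fix_blank_lines_before_headings(wikitext: str) -> str:
    """Ensure exactly one blank line before every heading except the first.

    Assumption: the first heading on the page is the first language
    heading, which the rule exempts, so whatever precedes it is left alone.
    """
    out = []
    first_heading_done = False
    for line in wikitext.split("\n"):
        if HEADING.match(line):
            if first_heading_done:
                # Drop any run of blank lines, then add exactly one back.
                while out and out[-1].strip() == "":
                    out.pop()
                out.append("")
            first_heading_done = True
        out.append(line)
    return "\n".join(out)


def violates_blank_line_rule(wikitext: str) -> bool:
    """True if the fixer would change the page."""
    return fix_blank_lines_before_headings(wikitext) != wikitext
```

A page is flagged whenever the fixer would change it, so the same function serves both for scanning a dump and for producing the edit.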

There are also probably many entries that violate WT:NORM in ways that Special:AbuseFilter/103 does not detect; I haven't checked.

Is it non-controversial to run bot-tasks that address violations of WT:NORM? Or do we need individual discussions for different violations and how to bot-address them?

Are there any best practices I should follow for such tasks, or pitfalls I should know about?

RuakhTALK 06:14, 1 September 2020 (UTC)[reply]

To answer your first question: non-controversial, go for it. —Justin (koavf)TCM 06:45, 1 September 2020 (UTC)[reply]
Agreed! We could really use more people like you volunteering for boring bot jobs! (There are some funky Chinese and Japanese entries using {{zh-see}} and {{ja-see}}, but I don't think they technically break any NORMs; they're just worth being aware of.) —Μετάknowledgediscuss/deeds 06:49, 1 September 2020 (UTC)[reply]
OK, sounds good; thanks! —RuakhTALK 21:50, 1 September 2020 (UTC)[reply]
@Ruakh: User_talk:Erutuon#ToilBot_"Normalizing"_Vandalism (@Erutuon) —Suzukaze-c (talk) 03:24, 3 September 2020 (UTC)[reply]
Thanks for the heads-up! It sounds like Erutuon's bot was specifically targeting recently-edited pages, hence that problem, and that he fixed it by changing it to instead target pages that received edits between one and thirty days ago. (Please correct me if I'm wrong.) If so, then my bot already wouldn't cause that problem, because I use the twice-monthly XML dumps to find the pages to edit, so there's a delay of much more than a day between when the NORM-violating entry was captured in the XML dump and when the bot retrieves and edits it. (Of course, it could still happen by random chance that it edits a page that was recently vandalized, but then, the same is true of Erutuon's updated bot as I understand it. And for that matter, the same is true of any other bot; my {{t}}/{{t+}} updater could similarly edit a recently-vandalized page. So I'm not too worried about this. But it shouldn't be too much work to change the bot to skip pages with recent last-edited timestamps, so, sure.) —RuakhTALK 06:15, 3 September 2020 (UTC)[reply]
For what it's worth, the current version of my bot script is here. It won't edit pages where the latest revision is more recent than one day ago. This feature is provided by the Recent Changes API (see rctoponly). That won't be useful for your bot, though, since it's pulling from the dump. — Eru·tuon 06:44, 3 September 2020 (UTC)[reply]
Thanks for the link. My bot takes a different approach, obviously; it just retrieves the page, and if it sees that it was edited less than 24 hours ago, it skips it without editing. —RuakhTALK 08:40, 7 September 2020 (UTC)[reply]
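The skip-if-recent check described above might look something like this. It is a sketch under assumptions: the function name and the way the timestamp is obtained are hypothetical, and the real bot's code is not shown in this discussion.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Pages edited more recently than this are left alone.
MIN_AGE = timedelta(hours=24)


def should_skip(last_edited: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the page's latest revision is less than 24 hours old.

    `last_edited` is assumed to be a timezone-aware UTC timestamp taken
    from the page's latest revision at the moment the bot retrieves it.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_edited < MIN_AGE
```

Because the check runs when the page is retrieved rather than when the dump was made, it also covers pages that were fine in the dump but were edited shortly before the bot reached them.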
  • The "WT-NORM" alert is a perpetual annoyance, all the more so because the message does not actually say what the problem is, so it is impossible for anyone with normal patience to fix it, when nothing visibly appears wrong. Any automated process to eliminate this useless irritation would be welcome. Mihia (talk) 22:03, 11 September 2020 (UTC)[reply]
Sorry ... I think I do now remember someone saying that "WT-NORM" was useful to identify crap random edits. If so then I stand corrected, but for me personally it is just a stupid irritation because it does not actually tell me what I have done wrong. Mihia (talk) 22:26, 11 September 2020 (UTC)[reply]
"WT-NORM" should be broken down to different tags detailing the actual problem. 恨国党非蠢即坏 (talk) 06:21, 15 September 2020 (UTC)[reply]
I agree, though according to a previous explanation, I seem to remember also that many WT:NORM "problems" are totally anal from the user perspective, such as might be silently auto-corrected, if for some reason they have a system importance. Mihia (talk) 22:04, 18 September 2020 (UTC)[reply]
I'd say they're pretty much all anal. Personally, as the creator of the filter, I've been in favor of leaving the filter but removing the tag, but for some reason haven't done it yet. That way would be totally invisible to most users but users who know how to could find edits that matched the filter. But perhaps the filter should be gotten rid of and we should only be looking at the dump to identify WT:NORM violations. — Eru·tuon 23:39, 18 September 2020 (UTC)[reply]
I would definitely support that -- that is, make "WT:NORM" invisible to ordinary users but accessible to editors who care. Mihia (talk) 10:39, 19 September 2020 (UTC)[reply]

Format for thesaurus pages

On Thesaurus pages, lists of synonyms are currently wrapped in {{ws beginlist}} and {{ws endlist}}, with items given with {{ws}}. {{ws}} links to the WS page for the argument, if such a page exists. However, this was clearly designed with a monolingual thesaurus in mind. On Thesaurus:da:nonsense, you can see that it links to a Polish page. I think it would be better to have a single template {{ws list}} similar to {{col3}} that takes a language code, and then as many terms as needed -- of course, the current format for auto-linking only works if Thesaurus entries are entered under a native synonym like Thesaurus:da:fuld or Thesaurus:god (with or without the language code). Knowing the language might also allow us to do some other things, although I can't currently think of any.
Additionally, most Thesaurus pages are not currently in a subcat of Category:Thesaurus entries by language. Most of these are English, but far from all. I added a lang parameter to {{ws header}} some time ago that categorizes. Would someone get a bot to do this?
@Dan Polansky I assume you probably have opinions about this.__Gamren (talk) 23:37, 2 September 2020 (UTC)[reply]

Using User:AutoSkull for automated surname edits

Having had a decent handful of experience with Python coding at this point, I just started messing with pywikibot, with which I am building a potential Wiktionary bot that automates edits to surname entries, and also would automate their creation. I've already been using it on my main account (see some of my recent contributions) for slower semiautomated edits, and just today I had the idea to move the testing and operations of this code to my new AutoSkull account. There are definitely still some tweaks and problems I'm working out, but in the state it's currently in, it could deal with most surname entries pretty well...but obviously most isn't quite good enough.

The tasks it will be able to perform when it is finished are currently listed on the bot account's user page. Basically, though, it will pull from lists of verified surnames and search them on Wiktionary to see if they have English entries here yet. If there is no entry, the bot will just create it. If there is an entry, the bot will decide what to do from there.

It's worth noting that among its many surname-related tasks my bot will be editing currently existing surname pages to make them a bit more complete. It will be adding plural forms to the template {{en-proper noun}} according to the consensus on how surnames should be inflected in English, with a few exceptions (see the last bullet point on User:AutoSkull#English surnames). Entries for plural inflections of surnames will also be added in large numbers. It will also add relevant Wikipedia disambiguation page links for all our surname entries when such a page exists.

I won't share my code yet as it's not in a finished state, but when it is I will. I will also at that time share a large series of edits made perhaps in AutoSkull's userspace subpages, that emulate various different wild circumstances the bot may encounter when unsupervised, to prove it won't just be wreaking havoc here. But even so, I wanted to go ahead and let the community know about the fact that I'm coding and testing with this, as I suppose that's a predecessor to a bot status vote, which I'll start in the near future. I'm really hoping with this project I can help get Wiktionary's coverage of surnames to be pretty lengthy. Let me know of any suggestions or comments. PseudoSkull (talk) 03:16, 3 September 2020 (UTC)[reply]

@PseudoSkull This sounds fine to me. If you need specific help, let me know ... I've written over 400 scripts by now to do all sorts of things on Wiktionary. These all use pywikibot and (usually) mwparserfromhell, which has proven to be a great combination. For example, one of my most productivity-enhancing scripts has turned out to be a script I wrote called find_regex.py, which outputs a text file consisting of subsets of pages (either the entire page or one language section) matching a given regex, based off of a category, references to a given page, a fixed list of pages, or a Wiktionary dump. I can then edit the text file, either by hand or using a purpose-written script, and push the resulting changes back to Wiktionary using another script push_find_regex_changes.py. This makes it possible to quickly do all sorts of manual and semi-automated changes. Benwing2 (talk) 06:47, 13 September 2020 (UTC)[reply]
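To make the idea concrete, here is a rough sketch of the "find pages or language sections matching a regex" step. It is not the find_regex.py script mentioned above, whose code is not shown here; the function names are hypothetical, and an in-memory dict of page texts stands in for a category, a list of pages, or a dump.

```python
import re

# Level-2 headings such as ==English== mark language sections.
LANG_HEADING = re.compile(r"^==([^=\n]+)==\s*$", re.MULTILINE)


def language_sections(wikitext):
    """Map each language name to the text of its section."""
    matches = list(LANG_HEADING.finditer(wikitext))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[m.group(1).strip()] = wikitext[m.start():end]
    return sections


def find_regex(pages, pattern, lang=None):
    """Yield (title, text) for every page or language section matching pattern.

    `pages` is a dict of title -> wikitext (a stand-in for a real page source).
    If `lang` is given, only that language's section is searched and returned.
    """
    regex = re.compile(pattern)
    for title, text in pages.items():
        if lang is not None:
            text = language_sections(text).get(lang)
            if text is None:
                continue
        if regex.search(text):
            yield title, text
```

The yielded pairs could then be written to a text file for hand-editing and pushed back in a separate step, which is the workflow the comment above describes.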

Old Korean lemmas with direct attestation are in the reconstruction namespace

The two egregious examples are the genitive 叱 and the topic-marking , both of which are omnipresent in the surviving Old Korean corpus. In the case of 叱, for example, the interpretive gugyeol data makes it undeniable that 叱 (or abbreviated forms) acts as a genitive:

  • 天人供 is used to gloss a Chinese phrase in the Avatamsaka Sutra that means "provisions of the heavenly ones"
  • 國土 is used to gloss a Chinese phrase in the Humane King Sutra meaning "territory of the Buddha's country"

And so forth. These forms are thus attested, there being universal scholarly consensus about their semantic value, and do not belong in the reconstruction namespace per WT:RECONS. What is reconstructed about them is their phonetic value, but this can be marked with an asterisk while the terms themselves (in the hanzi-based orthography) are moved to the normal entry namespace.--Karaeng Matoaya (talk) 08:28, 4 September 2020 (UTC)[reply]

  Support, agree with all points. —Suzukaze-c (talk) 03:24, 5 September 2020 (UTC)[reply]
@Quadmix77, who created these entries. —Μετάknowledgediscuss/deeds 05:22, 5 September 2020 (UTC)[reply]
  Support per above. -- 11:26, 5 September 2020 (UTC)[reply]
In the absence of further input, I'm making mainspace entries for 叱 and other attested OK grammatical particles.--Karaeng Matoaya (talk) 00:47, 7 September 2020 (UTC)[reply]
@Karaeng Matoaya: Please mark the duplicate entries with {{d}} and an explanation (or just a link to this discussion) once you've made the entries and fixed all incoming links. —Μετάknowledgediscuss/deeds 02:04, 7 September 2020 (UTC)[reply]
@Metaknowledge: Done.--Karaeng Matoaya (talk) 13:04, 7 September 2020 (UTC)[reply]

Draft proposal for pre-c. 1910 Korean forms (Old, Middle, Early Modern)

Hi everyone,

After some talks with @Suzukaze-c, I've drafted a brief sketch of how to deal with pre-contemporary Korean forms at User:Karaeng Matoaya/Draft.

This will probably be moved to Wiktionary:About Korean/Historical forms if people don't hate it too much. The main features include:

  • The use of the new periodization for Old Korean, in which texts up to c. 1300 are considered examples of OK. This is the growing consensus in South Korean academia and has a number of advantages compared to the traditional periodization still used in many Western sources, which wasn't really evidence-based in the first place.
  • Only forms attested in actual Old Korean texts are considered valid entries, which won't affect anything except 波珍, which should be deleted as a proper noun-based reconstruction. Also added some preliminary standards for disputed OK entries.
  • The use of the three-way periodization of Korean given by ISO 639-3: OKO for Old Korean, OKM for Middle Korean, and KOR for Early Modern and Modern Korean. This means that Korean forms attested between 1600 and 1900 share the KO language code together with contemporary forms, and are modified with obsoleteness templates instead. (Previously the very few EMK entries that existed seemed to be grouped together with Middle Korean forms, but this is problematic given academic consensus that MK ends in c. 1600; if we want to separate EMK from Contemporary Korean, the best way to do that is to create a new language code specifically for EMK.) Some examples of new EMK entries are at 뉴#Etymology 3 and ᄯᅡᆼ.

Thoughts?--Karaeng Matoaya (talk) 13:22, 7 September 2020 (UTC)[reply]

Looking over your draft, I have a few questions / comments.
  • In the Chinese wordlists section, you state, "references to these wordlists are strongly recommended in the Phonology sections of Old Korean entries, and in the Etymology sections of Middle and Modern Korean entries." I'm not quite clear on how you mean this. Presumably this recommendation is only for those terms that have alternative forms that appear in the Chinese word lists?
  • In the Proper noun reconstructions section, you state, "references to such reconstructions are strongly recommended in the Phonology sections of Old Korean entries, and in the Etymology sections of Middle and Modern Korean entries." Similar to above.
Albeit from something of an outsider's perspective -- my Korean ability is quite basic -- your proposal looks good to me.
Really appreciating the deeper dive you're giving for Korean entries. Thank you. ‑‑ Eiríkr Útlendi │Tala við mig 18:57, 9 September 2020 (UTC)[reply]
@Eirikr Thanks for the comments, and also for the encouragement—they mean a lot. I've fixed both to "strongly recommended in the Phonology sections of otherwise attested Old Korean entries, and in the Etymology sections of likely Middle and Modern Korean reflexes" and also added three examples of how Chinese or proper noun data can be integrated within attested entries: 有叱 (*Is-), 無叱 (*EPs-), and 거칠다 (geochilda).--Karaeng Matoaya (talk) 12:30, 10 September 2020 (UTC)[reply]
The changes look good to me. Thank you again for taking this on! ‑‑ Eiríkr Útlendi │Tala við mig 18:25, 14 September 2020 (UTC)[reply]

"Pronunciation spelling" label

Is everyone happy that the usage of the "pronunciation spelling" label has by implication been determined by the outcome of the recent "eye dialect" vote? That vote established that the "eye dialect" label is to be applied only to words such as sed for said or lissen for listen that represent standard pronunciations but imply that the speaker generally uses a nonstandard dialect. It has been said that "eye dialect" is a subset of "pronunciation spelling", on which basis such words could in theory be labelled both "eye dialect" and "pronunciation spelling", but I imagine that this would be viewed as unnecessary.

This leaves words such as borrowin' for borrowing and fink for think, that represent non-standard pronunciations, as well as simplified phonetic spellings such as lite, as eligible for the "pronunciation spelling" label. Is it uncontentious that all these should be labelled "pronunciation spelling"? Are there any other types of words that are "pronunciation spelling" candidates? Mihia (talk) 08:29, 10 September 2020 (UTC)[reply]

I wouldn't use the "pronunciation spelling" label for borrowin’ and fink; I'd simply call those nonstandard forms. Things like lite, tonite, and donut, on the other hand, are definitely pronunciation spellings that are not (I think) eye dialect (at least not usually). —Mahāgaja · talk 11:44, 10 September 2020 (UTC)[reply]
I second that. I suspect that some common misspellings arose as pronunciation spellings, or, as in the case of artic and nitch, even as mispronunciation spellings. I’d apply the term only, though, to intentional nonstandard spellings that do not imply the use of nonstandard speech but merely aim to convey how kool and with it the author is. — This unsigned comment was added by Lambiam (talk • contribs) at 13:55, 10 September 2020 (UTC).
In the case of artic, I wouldn't say that it's a mispronunciation spelling; rather, I'd say that /ˈɑɹktɪk/ is a spelling pronunciation, since 300 or so years ago artic was the normal spelling and /ˈɑɹtɪk/ was the normal pronunciation. —Mahāgaja · talk 15:52, 10 September 2020 (UTC)[reply]

Invitation to participate in the conversation

  • Yay, rules! We'd better start crafting templates to issue various degrees of admonishment, warning, and scolding before escalating to interaction bans and topic bans. We could use help from a graphic artist to produce good icons. Vox Sciurorum (talk) 18:13, 11 September 2020 (UTC)[reply]
I wanted to add "right-wingers are humans too" but I got banned instantly, lol. Equinox 22:24, 11 September 2020 (UTC)[reply]
You should have read the FAQ: "UCoC may not fit into all cultural contexts." Vox Sciurorum (talk) 22:47, 11 September 2020 (UTC)[reply]
... and the footnote at the bottom: "not actually universal"... On a more serious note (but still highly sarcastic), I'm loving the name "Trust and Safety Team" - it fills me with calm and respect, and can be made into a nice acronym too, which has been a must for any initiative since the Patriot Act. --Java Beauty (talk) 23:14, 13 September 2020 (UTC)[reply]
Careful there! One of the proposed rules is to ban sarcasm. (I'm not kidding, go look at the draft.) —Μετάknowledgediscuss/deeds 05:36, 14 September 2020 (UTC)[reply]
pics or it didn't happen
And it sure would be great if more right-wingers recognized others as human too :^) —Suzukaze-c (talk) 05:14, 14 September 2020 (UTC)[reply]

Archaic forms and spellings should not be lemmas

Most English archaic forms and spellings are lemmas. However, archaic forms are just like declined/conjugated/inflected forms, in that they don't add any information on meaning of the root word. --Numberguy6 (talk) 20:56, 12 September 2020 (UTC)[reply]

Archaic terms may have been the predominant form at times in the past. We are attempting to be a historical dictionary among other things. DCDuring (talk) 21:39, 12 September 2020 (UTC)[reply]
I disagree: whom is just as much a lemma now as it has ever been, and so is thee. —Justin (koavf)TCM 02:01, 13 September 2020 (UTC)[reply]
I also strongly disagree with the proposition that "Archaic forms and spellings should not be lemmas". Mihia (talk) 22:35, 13 September 2020 (UTC)[reply]

Phrase ellipsis, three regular dots or two ellipsis characters (six dots)?

Hi all,

First, sorry for cross-posting. I was advised that I'd be better served posting here. Here are my original questions:

Concern A: I came across how do you say...in English and I'm ... year(s) old. The former has been moved to how do you say …… in English. After reading the page history, there seemed to be a rational explanation as to why two ellipsis characters (six dots) were used. Given that Wiktionary:Phrasebook provides an example with three regular dots (three separate characters), I'm confused about what the naming convention should be. Please advise.

Concern B: Most people cannot type the ellipsis character (…) without copying and pasting from somewhere else. Doesn't this limit the usefulness of Wiktionary as a tool for looking up words? What if a phrase starts with the ellipsis characters and the user wanted to look that up? It would likely only be found with great difficulty.

-- Dentonius (talk) 00:11, 13 September 2020 (UTC)[reply]

I thought redirects worked and still work in Wiktionary as usual, don't they? In Wiktionary, there are many languages and in them lots of characters that are difficult to type for outsiders, however, redirects (and the {{also}} template) do an excellent job. I don't see why we should make an exception at this particular point when we don't do otherwise. The succession of three dots is just a clumsy substitute for an ellipsis character. There are several terms even in English that would be hard to type (e.g. 1,450 terms with æ or 1,213 terms with é) if it weren't for the convenient lookup and redirect features that we have here. Adam78 (talk) 01:06, 13 September 2020 (UTC)[reply]

Here is what I wrote at WT:GP:
This is maybe more of a beer parlo(u)r issue, and you might get more traction posting it there. However, I agree with you that six dots seems a bit strange. The explanation "and two of them to mark the width of an average word, separated by spaces as usual" by User:Adam78 makes a certain amount of sense but was clearly a unilateral decision. The issue with an ellipsis character vs. three dots seems less of an issue than you might think; at least for me, if I type "I'm ..." with three dots, it autocompletes to the variant with an ellipsis character. Same thing happens if you start typing "..."; it autocompletes to the ellipsis character entry. Even using a single ellipsis character isn't completely standard; for example, there's what does XX mean and Appendix:X is a beautiful language. In addition, all the entries under Appendix:Snowclones use X, Y, Z, N, etc. For snowclones maybe this makes sense as it makes possible things like Appendix:Snowclones/I'm here to X A and Y B, and I'm all out of A. I think at least all the non-snowclone entries should use a single ellipsis character.
Benwing2 (talk) 05:18, 13 September 2020 (UTC)[reply]
I disagree with ever using 6 dots to create space. Normally, when I'm just trying to create space I will use two m-dashes (——) or any number of underlines (___). But for what was being attempted on this site I don't know if any of that would be preferred. I would assume a single ellipsis would be sufficient. -Mike (talk) 22:34, 13 September 2020 (UTC)[reply]

A single ellipsis looks to me like a great compromise. I'm sorry for the one-sided change. Adam78 (talk) 15:45, 15 September 2020 (UTC)[reply]

Thanks, guys. I appreciate it. ;-) - Dentonius (talk) 17:13, 15 September 2020 (UTC)[reply]

Canadian English

Hello all, I raised a question at Category talk:Canadian English upon which I'd like to hear your input. -Montrealais (talk) 15:43, 13 September 2020 (UTC)[reply]

As far as the purpose of having a category is concerned, yes, I agree with what you say at that talk page. "Canadian English" should be for words used only (or primarily) in Canada, else what is the point. The actual name of the category could be open to discussion, though. Would a person expect a category called "Canadian English" to contain every word used in Canadian English? That is, including all North American or even "universal" English words too? I'm not sure. Mihia (talk) 22:24, 13 September 2020 (UTC)[reply]

As you imply, it doesn't make much more sense to put all North American words under "Canadian English" than it would to put universal English words under "Canadian English" on the grounds that they're used in Canada. I feel that if the category is to be useful, it should be for words that are, or at least mostly are, peculiar to Canada. There's a difference between a dictionary of English used in Canada (e.g. the Canadian Oxford Dictionary) and a list of Canadian words, which I believe most people would expect the category to be. - Montrealais (talk) 23:16, 13 September 2020 (UTC)[reply]

I think that one's perception may vary depending on whether the region in question is one's own or not. For example, as a BrE speaker, I would probably expect a list of "Canadian English words" to include words that are used only (or primarily) in Canada, whereas I might expect a list of "British English words" to include all words that are used in BrE. Opinions may vary. Despite this, we might adopt the convention that "X English" words include words used only (or primarily) in region X, and expect/require people to understand this. Otherwise the labelling may get clumsy. Mihia (talk) 00:33, 14 September 2020 (UTC)[reply]
@Mihia I definitely don't think having "British English words" consist of all the words used in British English (as opposed to the ones specific to this variety) would be workable. The category would be enormous and wouldn't be of much value, since over 99% of English words are common to all varieties. Benwing2 (talk) 03:05, 15 September 2020 (UTC)[reply]
No, I absolutely agree. I think perhaps I did not explain my point clearly enough. I was talking about what a person might expect a category named "British English words" to contain, if he or she did not already know how Wiktionary defined this. I was speculating that a person might think that it would contain all words used in British English, and therefore musing whether the category name should somehow indicate that it didn't (e.g. "Words specific to British English"). However, in conclusion I mentioned that this may be too clumsy. Mihia (talk) 09:25, 15 September 2020 (UTC)[reply]
@Mihia I see, makes sense. Benwing2 (talk) 03:31, 16 September 2020 (UTC)[reply]
I agree with the above, that I would expect this category to contain words that are specific to Canada, and I think the category should reflect that. Andrew Sheedy (talk) 22:51, 16 September 2020 (UTC)[reply]

We seem to have a consensus here, more or less - how should we proceed? -Montrealais (talk) 05:01, 18 September 2020 (UTC)[reply]

Lexicography films

Apparently there's only one film ever made about dictionaries, called The Professor and the Madman. Anyone seen it? --Java Beauty (talk) 21:12, 13 September 2020 (UTC)[reply]

(Malmoi is apparently another one. —Suzukaze-c (talk) 22:40, 13 September 2020 (UTC))[reply]
Oof, Korean film set in 1940, no thanks... --Java Beauty (talk) 23:18, 13 September 2020 (UTC)[reply]
I saw The Professor and the Madman. Watchable, though it takes various liberties with the facts in the name of drama. Out of curiosity I Googled and found these other ones:
SGconlaw (talk) 12:51, 14 September 2020 (UTC)[reply]

Do we really need Classical & Ecclesiastical pronunciations for unattested Vulgar Latin terms? I think, we ne need. If our community unanimously decide in favour of showing only Vulgar Latin pronunciations, then we can begin using |classical= & |ecclesiastical= (the latter does not work tho' & I know not why) to hide those twain. Or better, some user might even want to make some adjustment with Latin templates to this effect. inqilābī [ inqilāb zindabād ] 12:44, 14 September 2020 (UTC)[reply]

@Erutuon, what d'you think? inqilābī [ inqilāb zindabād ] 18:43, 15 September 2020 (UTC)[reply]

Wiktionary sitelinks dashboard: URL update

Hello all, and sorry for writing in English. Feel free to translate this message below.

The Wiktionary Cognate Dashboard presents interesting data about the extension powering your sitelinks. I just wanted to let you know that the URL of this tool changed: it is now accessible at https://wiktionary-analytics.wmcloud.org/Wiktionary_CognateDashboard/ . The former URLs, https://wmdeanalytics.wmflabs.org/Wiktionary_CognateDashboard/ and https://wdcm.wmflabs.org/Wiktionary_CognateDashboard/ , will be disabled on September 25th. Don't forget to update your documentation pages accordingly.

If you have questions about the tool or the URL switch, feel free to ping me. Cheers, Lea Lacroix (WMDE) 11:46, 14 September 2020 (UTC)[reply]

If anyone wants an Indian English translation, let me know 🙃 —AryamanA (मुझसे बात करेंयोगदान) 22:33, 14 September 2020 (UTC)[reply]
I'm... kind of curious to see what that would entail. —Μετάknowledgediscuss/deeds 22:42, 14 September 2020 (UTC)[reply]
Category:Indian_English? Weird but gotta be respected. Equinox 00:05, 16 September 2020 (UTC)[reply]
We could all have a go at translating into Scots... XD - -sche (discuss) 20:53, 17 September 2020 (UTC)[reply]

Propose making Template:en-noun pluralization algorithm smarter

@Equinox, DCDuring Pinging a couple of random people who I think work on English lemmas a lot. I propose to make the {{en-noun}} pluralization algorithm smarter. Not sure if this has been discussed before. Basically, I want to implement the following default rules (which are mostly already implemented in the pluralize() function in Module:string utilities):

  1. If the noun ends in -s, -x, -z, -sh or -ch, add '-es'.
  2. If the noun ends in consonant + y, and does not begin with a capital letter, change '-y' to '-ies'. Hence cherry -> cherries, but Kennedy -> Kennedys (begins with a capital letter; cf. Rolling Stones "who killed the Kennedys?"), boy -> boys (ends in vowel + y).
  3. Otherwise, add '-s'.
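As a rough illustration, the three default rules above can be written out as follows. This is a Python sketch, not the Lua code of pluralize() in Module:string utilities; the function name default_plural is hypothetical, and "consonant" is taken to mean any lowercase letter other than a, e, i, o, u.

```python
import re


def default_plural(noun: str) -> str:
    """Apply the three proposed default rules to a single noun."""
    # Rule 1: -s, -x, -z, -sh, -ch take '-es'.
    if re.search(r"(s|x|z|sh|ch)$", noun):
        return noun + "es"
    # Rule 2: consonant + y becomes '-ies', unless the noun is capitalized.
    if re.search(r"[b-df-hj-np-tv-z]y$", noun) and not noun[0].isupper():
        return noun[:-1] + "ies"
    # Rule 3: everything else takes '-s'.
    return noun + "s"
```

Under these rules cherry gives cherries, Kennedy gives Kennedys, and boy gives boys, while a noun like nudibranch would default to nudibranches and so needs its plural forced explicitly.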

The values s and es would force an '-s' or '-es' plural, as before. The special symbols -, ~, ! and ? work as before. A new symbol + means "produce the default plural"; this is used e.g. on the page accessibility in {{en-noun|-|+}}, which currently has to be written {{en-noun|-|accessibilities}} (the - in conjunction with a plural means "usually uncountable"; without a plural specified, it means simply "uncountable"). This would be implemented as follows:

  1. Implement the new behavior, but only if |new=1 is given in the template.
  2. Use a bot to find the places where arguments would change between the old and new behavior; change the arguments to the new behavior and add |new=1.
  3. As soon as all such places are changed, make the new behavior the default and remove the dead code supporting the old behavior.
  4. Go through and remove the |new=1 parameter.
  5. Remove the dead code supporting the |new= parameter.
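Step 2 of the plan above could be sketched like this. It is only an illustration under assumptions: it looks at a single plural argument, all names are hypothetical, and the optional simplify flag (dropping explicit plurals that the new default would produce anyway) is not part of the five steps as written.

```python
import re

SPECIAL = {"-", "~", "!", "?"}  # symbols whose meaning does not change


def default_plural(noun):
    """The proposed new default (same three rules as in the proposal)."""
    if re.search(r"(s|x|z|sh|ch)$", noun):
        return noun + "es"
    if re.search(r"[b-df-hj-np-tv-z]y$", noun) and not noun[0].isupper():
        return noun[:-1] + "ies"
    return noun + "s"


def old_plural(noun, arg):
    """Plural under the old behaviour: a bare template always adds '-s'."""
    if arg is None or arg == "s":
        return noun + "s"
    if arg == "es":
        return noun + "es"
    return arg  # explicit plural


def migrate(noun, arg, simplify=False):
    """Return the plural argument to use with |new=1 (None means omit it)."""
    if arg is None:
        # A bare template relied on '-s'; keep that plural if the new default differs.
        return None if default_plural(noun) == noun + "s" else "s"
    if simplify and arg not in SPECIAL and old_plural(noun, arg) == default_plural(noun):
        return None  # the new default already yields this plural
    return arg
```

With simplify left off, the only entries whose arguments change are bare templates on nouns where the new default differs from a plain '-s', such as nudibranch or driveby.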

That way, there would be no disruption while making the change. The only possible issue is someone changing the plural of an existing noun or adding a new noun while step 2 is in progress. I may be able to work around this by checking esp. for new entries in the Category:English nouns category, as I think adding a new noun would be a lot more likely than changing an existing noun. Thoughts? Benwing2 (talk) 08:34, 15 September 2020 (UTC)[reply]

Sounds promising. I appreciate your concern about the transition process. I will try to think on possible problems etc. How long do you think step 2 would take? Could an input filter be used to prevent changes to the plurals where "new=1" was present? DCDuring (talk) 14:52, 15 September 2020 (UTC)[reply]
Currently the noun ally has |1=allies. Under the proposed smarter rules this could be omitted, but I see no step that would perform this simplification. Did I miss something? I am not sure what the old "dumb" rules are. What is an example of a case that would be flagged in Step 2?  --Lambiam
Oppose unless you can articulate what benefit this would have. I have not encountered mistakes in English plurals in any significant amount. DTLHS (talk) 22:26, 15 September 2020 (UTC)[reply]
“Smarter” means it takes off the work of thinking about the code – as of typing it, since with the suggested changes one would have less to specify manually –, which means creating English entries would be faster, and less erroneous in every respect because of more human attention left – unless syntax changes requiring accustomization work towards the opposite, but there aren’t “changes” in that sense, only things becoming unnecessary. Fay Freak (talk) 22:49, 15 September 2020 (UTC)[reply]
Hi. Interesting idea. Thanks for pinging my pimply arse. You know what, I mostly see templates as something that gets in my way, that I have to work around, and I know that's really sad and wrong, because a lot of templates do very useful things. See current discussion on my talk page about why I find it hard to use the proper citations template with lots and lots of parameters (year, author, etc.). I also very much appreciate the fact that you are proposing a new=1 parameter and a phase-in rather than sort of just throwing it in there and hoping it works (ahem...). Could you please tell me: (i) what is the basis of your proposed rules (did they come from a certain grammar book, or a corpus study, etc.?) -- not just your head, right?; (ii) I think I just didn't totally follow your explanation, but suppose we have got a "perfect exception" that works with old en-noun but not with yours, such as drivebys: is there any risk of breaking these while implementing your new proposal? Thanks. Equinox 00:14, 16 September 2020 (UTC)[reply]
@Equinox In response to your questions: (i) These are the rules I was taught as a kid. Can't cite a specific grammar book but I bet any standard English grammar contains these rules. (ii) Exceptional cases like "drivebys" and "nudibranchs" would just need to be specified as {{en-noun|s}} instead of the current {{en-noun}}. My bot will change them automatically. Benwing2 (talk) 03:11, 16 September 2020 (UTC)[reply]
OK. Then probably in favour of this. We always have the Preview screen, after all. (You might also enjoy seeing the hot mess of User:Equinox/code/FindMissingNounPlurals.) Equinox 03:19, 16 September 2020 (UTC)[reply]
@Equinox Your code isn't so bad :) ... and it's liberally commented, which is something near and dear to my heart. Benwing2 (talk) 02:41, 18 September 2020 (UTC)[reply]
@Lambiam, DTLHS User:Fay Freak articulated the reason well, in my view. What the current module does by default is to always add -s to the noun, regardless of the form of the noun. So, head -> heads, house -> houses, boy -> boys, but also batch -> batchs, cherry -> cherrys, box -> boxs, etc. What I'm proposing to do is make the default rule smarter, so that there are fewer exceptional cases, and so that the cases that do require the plural to be enumerated explicitly correspond with English speakers' intuitions of what are exceptional. For example, currently nudibranch is specified as just {{en-noun}}, relying on the default -s plural, hence nudibranchs. This happens to be correct for this noun because the final -ch is pronounced as /k/, but any native speaker will tell you this is an exception, and that the "normal" plural would be nudibranches. Someone who doesn't know this word and comes across it in Wiktionary might think the bare template call {{en-noun}} is a mistake by some other editor who forgot to specify the explicit plural (which is required for 99% of nouns ending in -ch), and try to "correct" it to nudibranches. When the template is changed as I propose, so its default rules accord with normal English plural rules, all the exceptional cases will be specifically indicated as such and this problem won't occur. Benwing2 (talk) 03:23, 16 September 2020 (UTC)[reply]
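To make the contrast concrete, here is a minimal Python sketch of the two defaults. The old rule (always add -s) is as described above. The "smarter" rule shown here is only an assumption based on common textbook plural rules and on the examples in this thread; it is not taken from Module:en-headword, and the function names are hypothetical.

```python
import re


def old_default_plural(noun: str) -> str:
    """Old default as described in the thread: always append -s."""
    return noun + "s"


def new_default_plural(noun: str) -> str:
    """Assumed 'smarter' default (textbook rules, not the actual module code)."""
    # Sibilant endings (-s, -x, -z, -ch, -sh) take -es.
    if re.search(r"(?:[sxz]|[cs]h)$", noun):
        return noun + "es"
    # Consonant + y becomes -ies.
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    # Everything else takes a plain -s.
    return noun + "s"


for word in ["head", "boy", "batch", "cherry", "box", "nudibranch"]:
    print(word, old_default_plural(word), new_default_plural(word))
```

Under these assumed rules, nudibranch comes out as nudibranches, which is why its real plural nudibranchs would have to be marked explicitly as an exception.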
@Benwing2 — You did not answer my first question, about the bot applying possible simplifications, like for example for ally replacing {{en-noun|allies}} by {{en-noun}}.  --Lambiam 04:00, 16 September 2020 (UTC)[reply]
@Lambiam Apologies. Yes, the bot would apply all possible simplifications in step 2. That would include e.g.:
The way I would probably implement it is to first replace all arguments with + where possible (which means "use the default algorithm"), then eliminate + where possible. (Specifically, {{en-noun|+}} -> {{en-noun}}, and {{en-noun|~|+}} -> {{en-noun|~}}.) Benwing2 (talk) 05:26, 16 September 2020 (UTC)[reply]
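The two-pass approach described above (first replace arguments with + where they match the default, then drop + where it is redundant) could be sketched roughly as follows in Python. This is an illustration only: the default-plural rules are assumed textbook rules, the function names are hypothetical, and only the argument shapes mentioned in this thread (explicit plurals, s, es, ~ and +) are handled.

```python
import re


def default_plural(noun: str) -> str:
    """Assumed new default (textbook rules, not the actual module code)."""
    if re.search(r"(?:[sxz]|[cs]h)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    return noun + "s"


def simplify_args(noun: str, args: list[str]) -> list[str]:
    """Return simplified en-noun arguments for a noun (sketch only)."""
    new_default = default_plural(noun)

    # No explicit plural given: the entry relied on the old "add -s" default.
    # If the new default differs, the -s plural must now be stated explicitly.
    if all(arg == "~" for arg in args):
        return list(args) if new_default == noun + "s" else list(args) + ["s"]

    def expand(arg: str) -> str:
        # "s" and "es" are shorthand for noun + suffix; other values are literal.
        return noun + arg if arg in ("s", "es") else arg

    # Pass 1: replace every argument equal to the new default with "+".
    replaced = ["+" if expand(arg) == new_default else arg for arg in args]

    # Pass 2: eliminate "+" where it is redundant.
    if replaced == ["+"]:
        return []
    if replaced == ["~", "+"]:
        return ["~"]
    return replaced


print(simplify_args("ally", ["allies"]))   # {{en-noun|allies}} becomes {{en-noun}}
print(simplify_args("nudibranch", []))     # {{en-noun}} becomes {{en-noun|s}}
```

Doing the replacement in two passes keeps the second pass purely mechanical: it only has to recognise the two patterns named above.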
@DCDuring I'm not exactly sure how long step 2 would take, but I can write a script to find out. It usually takes 1-2 seconds to save a page, meaning the bot can do maybe 3000 pages an hour. I think an edit filter would work and be a pretty simple solution; it could even be made to allow changes that add |new=1, but not otherwise, and display a message indicating that this needs to be temporarily done. Benwing2 (talk) 03:30, 16 September 2020 (UTC)[reply]
So, if everything worked right and it ran 24 hours a day with you faithfully overseeing it, at least on standby, it would take at least 4 days. If it was overseen 40 hours a week and was only run when overseen, it would take at least 2 1/2 weeks. I don't know that there is anyone besides you who could properly oversee it and lead a prompt recovery from any unforeseen problem. Thus the time step 2 would take would be totally subject to your availability for oversight, at least on a standby basis. DCDuring (talk) 11:41, 16 September 2020 (UTC)[reply]
@DCDuring This isn't actually the case. I think you're basing your calculations on the total number of English nouns (about 348,000), but the time would be determined by only those that need to be changed. I think that would be maybe 10,000-30,000, or about 3-10 hours, but I don't know for sure. This could be sped up by about 5x by running multiple processes at once. I also realized that I can use the tracking mechanism (Template:tracking) to track any pages needing updating that get changed during the primary updating process, so there's no need for an edit filter. Benwing2 (talk) 00:52, 17 September 2020 (UTC)[reply]
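For reference, the two estimates can be reproduced with simple arithmetic. The sketch below assumes the figure of roughly 3000 page saves per hour quoted earlier; the function name and the `parallel` parameter are hypothetical and only illustrate the "about 5x" speed-up from running several processes.

```python
def hours_needed(pages: int, pages_per_hour: int = 3000, parallel: int = 1) -> float:
    """Rough wall-clock hours to save `pages` pages at the assumed rate."""
    return pages / (pages_per_hour * parallel)


# Estimate based on all English nouns (about 348,000 pages).
all_nouns = hours_needed(348_000)
print(all_nouns, all_nouns / 24, all_nouns / 40)  # hours, days at 24 h/day, weeks at 40 h/week

# Estimate based only on pages that need changing (10,000 to 30,000).
print(hours_needed(10_000), hours_needed(30_000))

# Same upper bound with five processes running at once.
print(hours_needed(30_000, parallel=5))
```

The first estimate works out to 116 hours (a little under 5 days of continuous running), the second to roughly 3 to 10 hours, in line with the figures given above.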
Indeed. I was basing my estimate on all occurrences of {{en-noun}}. Can you identify all of the English nouns that need to be changed before beginning the changes? DCDuring (talk) 01:11, 17 September 2020 (UTC)[reply]
@DCDuring Yes, I can write a script to do this. Benwing2 (talk) 02:33, 17 September 2020 (UTC)[reply]
Isn't that necessary to reduce the time you would need to supervise the process, from my estimate to yours? Or am I missing something blindingly obvious? DCDuring (talk) 03:13, 17 September 2020 (UTC)[reply]

──────────────────────────────────────────────────────────────────────────────────────────────────── @DCDuring No, you're not missing anything, I do have to write the script eventually. I went ahead and wrote it; here are the stats:

| Stat | Count |
| Total pages with {{en-noun}} | 340,887 |
| Number of pages touched if all possible changes/simplifications made and + used everywhere | 32,739 |
| Number of pages touched if all possible changes/simplifications made and + used everywhere except when replacing s | 23,752 |
| Number of pages touched if all possible changes/simplifications made and + used everywhere except when replacing s or es | 22,024 |
| Number of pages that will differ between old and new algorithm if no changes made | 352 |

There are three numbers I give above, depending on how the changes are made. I recommend one of the latter two (where I leave alone s, or maybe both s and es, instead of replacing them with + where it's appropriate to do so). Both solutions would take around 7-8 hours to implement in step 2. Note that there are only 352 pages that actually *have* to be changed with the new algorithm, because they will have different results. These are pages like nudibranch that rely on the default -s ending without specifying it explicitly, and where the new algorithm would add something else (e.g. -es in this case). This means it's unlikely there will be many, if any, pages of this sort that will be added in the time it takes to run step 2 above. However, I will set up tracking so that any changes made in those 7-8 hours get reviewed and fixed up as needed.

Please note, the list of those 352 entries is here: User:Benwing2/convert-en-noun.warnings. Some of these are in fact wrong and need to have an explicit plural (e.g. windgrass, film library). I made a list of all those that may be wrong, here: User:Benwing2/convert-en-noun.warnings.likely-wrong. This list has 110 entries. Some are clearly wrong, some may or may not be wrong. Could you take a look and fix the ones that are wrong? Thanks! Benwing2 (talk) 01:16, 18 September 2020 (UTC)[reply]

Also, note that the above stats are derived from the Sep 1 dump, so there may be a few pages not included. Benwing2 (talk) 01:18, 18 September 2020 (UTC)[reply]
I just did 15, marked with {{done}}, with an explanation. Tedious. There are many common nouns that are derived from proper nouns which require looking at cites. There are a whole bunch of nouns ending in "oxy", which could probably be resolved most quickly by SB. I'll take another look when I can. DCDuring (talk) 03:47, 18 September 2020 (UTC)[reply]
@DCDuring Thanks. If you can just list what needs to be done for the others (if anything), I can make the changes fairly quickly. I flagged these because they need further investigation, e.g. all the -oxy ones seemed wrong to me but I don't know for sure. Calling User:SemperBlotto, who created many of them. Benwing2 (talk) 03:56, 18 September 2020 (UTC)[reply]
There are also some that should use {{en-proper noun}}, not {{en-noun}}, and others that are uncountable or, IIRC, plural-invariant. In some cases where there are both common- and proper-noun L2s, I wonder why we have both. DCDuring (talk) 11:52, 18 September 2020 (UTC)[reply]
@DCDuring Thanks. I implemented the necessary changes in Module:en-headword and I'm ready to proceed using the Sep 20 dump, when it comes out. Are you OK with this? Benwing2 (talk) 19:11, 19 September 2020 (UTC)[reply]
I fixed up some of the remaining cases in User:Benwing2/convert-en-noun.warnings.likely-wrong and removed the ones you or others had already done. There are about 55 left; almost all end in consonant + -y. Benwing2 (talk) 20:42, 19 September 2020 (UTC)[reply]
What I'm hoping will come out of this is a greater willingness to review the inflection line for English nouns so as to address various questions about plurals, including countability, the various departures from the basic rules, etc. Not having to type in so many of the plural forms might reduce the tedium and wear-and-tear on keyboards and thumbs of such reviews. DCDuring (talk) 00:26, 20 September 2020 (UTC)[reply]
This is done. Benwing2 (talk) 03:40, 22 September 2020 (UTC)[reply]

As many of you know, the Malagasy Wiktionary is the second-largest by article count as a result of its very low-quality bot-created entries. I have made a full report on Meta, and I'm hoping that the Wiktionary community can chime in on the talk page, and add pressure at Meta so this actually gets dealt with. —Μετάknowledgediscuss/deeds 02:17, 16 September 2020 (UTC)[reply]

  Support, this bot is highly irritating. PUC 20:15, 16 September 2020 (UTC)[reply]
Don't comment here, go comment at Meta! —Μετάknowledgediscuss/deeds 20:24, 16 September 2020 (UTC)[reply]

Northwestern Indo-Aryan

I'm trying to improve our organization of Indo-Aryan languages (it's very loose as it stands), and I have an issue that could use some discussion. The Indo-Aryan lects of the Northwestern zone (Sindhi sd, Punjabi pa, etc.) are currently classified as descendants of Sauraseni Prakrit psu. The (literary) language most closely associated with these lects is Paisaci Prakrit, which we now have a code inc-psc for.

Certainly, Sauraseni doesn't give us the appropriate intermediary forms between Sanskrit and these languages: e.g. Kholosi taɽgo (a Sindhic language) preserves the r in the consonant cluster in Sanskrit दीर्घ (dīrgha), but Sauraseni has lost it as 𑀤𑀺𑀕𑁆𑀖 (diggha). Similarly, Punjabi ਭਰਾ (bharā)'s currently given etymology does not make much sense again due to preservation of r. The conservativeness of Northwest IA is well-known, e.g. {{R:inc:Masica:1993}} discusses it. Sauraseni, on the other hand, is distinctively a Central IA language that is obsessed with cluster assimilation.

However, Paisaci is very very scantily attested and so I'm uncertain whether it actually is a good candidate for intermediary language between Northwest IA and Sanskrit. Glottolog does include it in Northwest IA, but Glottolog also very stupidly puts Dardic languages in there. What's the best way to organize Northwest IA? Create a family for it and group it under Sauraseni? Or put it under Paisaci which would require removing all of the current etymologies involving Sauraseni? Pinging @Bhagadatta, Kutchkutch, Victar, not sure who else might be knowledgeable enough to help but any input is welcome. —AryamanA (मुझसे बात करेंयोगदान) 16:19, 16 September 2020 (UTC)[reply]

@AryamanA: I am not knowledgeable about Northwestern IA lects but I would support classifying them as descendants of Paisaci Prakrit. Paisaci Prakrit's meagre attestation should be no cause for not showing it as the intermediate between OIA and Northwest IA, because we should be more bothered about representing the IA family tree as accurately as current linguistic data allow. So we can obviously go ahead with cleaning up the etymologies and the descendants to be affected by this change. inqilābī [ inqilāb zindabād ] 19:40, 16 September 2020 (UTC)[reply]
@AryamanA: I don't mind having Paisaci as the ancestor of NWIA but there are a couple of issues. Paisaci Prakrit has sometimes de-voiced Old Indo-Aryan voiced stops like 𑀢𑁂𑀯𑀭 (tevara), for which Punjabi has ਦਿਉਰ (diura). There are more examples like 𑀧𑀸𑀴𑀓 (pāḷaka), 𑀓𑀓𑀦 (kakana) etc for which I don't know the Punjabi equivalent.
Also, as for Kholosi taɽgo, the initial dental is de-voiced in a Paisaci-like manner but the r is also preserved, which does not seem to be something Paisaci would do: Skt. घर्म (gharmá) --> Paisaci Prakrit 𑀔𑀫𑁆𑀫 (khamma); the r has been lost.
How should we handle cases like this? -- Bhagadatta (talk) 02:08, 17 September 2020 (UTC)[reply]
@Bhagadatta: Hmm, so it seems we won't be finding any perfect Prakrit match to Northwest IA, as I had suspected previously. (If anything, Shahbazgarhi/Mansehra Ashokan Prakrit/Gandhari Prakrit seem to be closer.) I suppose we can make a Northwest IA family and put it under Sauraseni to maintain the status quo. We could have Paisaci as a separate branch with no descendants. —AryamanA (मुझसे बात करेंयोगदान) 02:24, 17 September 2020 (UTC)[reply]
@AryamanA: Well, I love the idea of Proto NWIA and Proto Central IA etc. But can we really continue calling Punjabi and Kholosi descendants of Sauraseni? As you pointed out, these languages preserve features and (remnants of) clusters that Sauraseni lost. But then again, there are a lot of features in Punjabi that appear to be from Sauraseni; one feature I can think of is geminated stops. Classifying IA languages is a really challenging task. -- Bhagadatta (talk) 03:16, 17 September 2020 (UTC)[reply]
@AryamanA, Bhagadatta, Inqilābī: It’s really unfortunate that the Indo-Aryan language family is not understood as well as it should be, and this particular issue is certainly one that needs to be discussed.
It is the overwhelming consensus in Indo-Aryan scholarship that Punjabi and Sindhi constitute a Northwest Zone distinct from the Central Zone. This modern Northwest Zone has similarities with the Ashokan Northwest Zone. However, there are several challenges in the definition of such a Northwest Zone.
First, the dividing line between the Northwest and Central Zones in modern South Asia is not well defined. Although the Western boundary with Iranian and the Northern boundary with Dardic are somewhat clear, the Eastern boundary is blurred due to contact with Hindi-Urdu and the Southern boundary is blurred due to Rajasthani and Gujarati.
Second, the academic understanding of the dialects of Punjabi and Sindhi is sparse.
For Punjabi, academic literature usually only refers to the Majhi lect or Standard Punjabi (MSP) of Amritsar and Lahore. This predominance of MSP in the academic literature distorts any general understanding of the Punjabi linguistic area as a whole. Other than the MSP lect, Saraiki is the next best understood Punjabi lect. Saraiki has the advantage of being both the variety most consistently divergent from MSP and the one with the best local claim to separate recognition. Since Saraiki is spoken in southern Punjab close to the border with Sindh, there are numerous similarities between Saraiki and Sindhi. For example, both have implosives that are otherwise absent in Indo-Aryan. Pahari-Pothwari is perhaps another important Punjabi lect due to its being the native lect of Rawalpindi, Islamabad and Mirpur.
Since most Sindhi speakers are in Pakistan and less than 1% of Indians speak Sindhi, understanding the Sindhi linguistic area from an Indian perspective is very challenging. Fortunately, the Kachchi lect is spoken natively in Kutch, which is accessible to Indians. Although Kachchi is a Sindhi lect, many Kutchi people choose not to identify as Sindhi, as can be seen in their choice to use the Gujarati script. What is needed for a general understanding of the Northwest Zone as a whole is enough data for one or two Module:zh-dial-syns for Punjabi and Sindhi. Appendix:Sindhi Swadesh lists is a step in that direction.
Third, as is the case for all of Indo-Aryan, the documentation of various earlier stages does not represent a logical successive relationship from one stage to the next. There is no information regarding the transition from MIA to Early NIA for the Northwest Zone. The earliest data for Northwest NIA are two short fragments of the Adi Granth termed ‘Old Punjabi’ that have been analysed by Christopher Shackle. Although the attestation is fragmentary, comparing Old Punjabi with Modern Punjabi and Sindhi helps with diachronic analysis.
Despite nearly a thousand years of Perso-Arabic influence, Punjabi and Sindhi still show many features of Prakrit to an extent greater than Marathi, Hindi and Bengali. Markandeya claimed that Vracada Apabhramsa was spoken in Sindh and is the ancestor of modern Sindhi (sindhudeśedbhavo vrācaḍopabhraṃśaḥ). Pischel and Grierson have both supported this claim by Markandeya. Very little is known about the Vracada itself, except nine peculiarities noted by Markandeya. Here are some of those features of Vracada:
  1. Retroflexion of MIA dental stops. For example, <> → /ʈ/ and <> → /ɖ/
  2. An epenthetic <> before <> in Vracada may be the source of the Sindhi and Saraiki implosive /ʄ/
  3. Sibilant merger: ṣ, s → ś
  4. The च-series are pure palatals. For example, <> → /c/, <> → /ɟ/
Fourth, despite Bhagadatta’s valid analysis, multiple sources say Paisachi represents the Northwest Zone. Page 24 of {{R:inc:CGMIA}} says that regardless of whether the Northwest Zone is the home of Paisachi, it was also spoken in the Central Zone. Pages 30-31 say that there are at least four lects of Paisachi: Kaikeya, Saurasena, Pancala and Culika. The Paisachi lects in Pischel are perhaps Kaikeya and Culika with Culika being marked separately. I see no harm in using reconstructed Paisachi like reconstructed Ashokan Prakrit as a solution to this issue. Since Proto-Prakrit as a separate entity was rejected, attempting to create Proto-NW I-A as a separate entity is likely to be rejected on the same grounds. Like Ashokan Prakrit, the attested and reconstructed terms would represent different entities. See अक्खइ for a Paisachi quotation.
Fifth,
  • The area encompassed by Sauraseni Prakrit is too large. The area between Mathura and Karachi is at least twice the size of the areas encompassed by both Maharastri Prakrit and Magadhi Prakrit.
  • There is no Sindhi-speaking editing community (other than the inactive user User:Aursani) to obtain information from.
  • Old Punjabi pa-old is an etymology-only language.
  • What is the purpose of Category:Western Panjabi language pnb if it is intended to be merged with pa? Kutchkutch (talk) 13:15, 17 September 2020 (UTC)[reply]
@Kutchkutch: I am not an expert, but I would like to make a general remark that the biggest problem faced while classifying the IA family is the effect of dialect levelling and/or dialect mixing that happened historically, which can disrupt the otherwise regular nature of sound laws, and thus lead to common innovations in divergent lects. For example, Punjabi shares with Dardic the feature of losing voiced aspiration, and metathesis of the rhotic consonant. inqilābī [ inqilāb zindabād ] 17:10, 17 September 2020 (UTC)[reply]
@Kutchkutch: Very well articulated. I agree that Proto-Northwestern Indo-Aryan has little to no chance of being approved as a full-fledged language on Wiktionary, complete with lemmas in its name. The best solution seems to be to use reconstructed Paisaci for this purpose. -- Bhagadatta (talk) 01:34, 18 September 2020 (UTC)[reply]
@Inqilābī: The points that you have mentioned are worth considering. Anything I say about the Northwestern Zone is from an outsider's perspective (so if something is incorrect then please feel free to correct it). User:AryamanA and the other Punjabi editors are probably in a better position to make internal judgements about the Northwestern Zone. Despite having an outsider's perspective and numerous limitations, learning about the other Zones of Indo-Aryan is still a worthwhile pursuit (if User:DerekWinters was still around, I'm sure that he would agree). Since there is an international border and a variety of religions in the Northwest Zone, discussing it in detail might involve several sensitive issues such as politics or religion.
Perhaps the effects of ʻdialect levelling and/or dialect mixingʼ, Areal features, Sprachbund#Indian_subcontinent and Dialect_continuum#Indo-Aryan_languages are among the reasons why ʻthe documentation of various earlier stages does not represent a logical successive relationship from one stage to the nextʼ. These are some examples of the shortcomings of the Tree model, especially for the Indo-Aryan family. The Wave model tries to fix some of those shortcomings, but understanding and applying it appears to be a challenging task. Although Module:zh-dial-syn seems to be a possible approach to addressing the shortcomings of the Tree model, data is either hard to find or non-existent.
@Bhagadatta: This discussion about the Northwest Zone raises interesting parallels with the other Zones of Indo-Aryan. The work on Maharastri Prakrit, Old Marathi, Marathi and Konkani has certainly been advancing our understanding of the Southern Zone in the public eye. It would be nice to see the same kind of collaboration (if it doesn’t exist already) on the modern and historical languages of the other Zones among native speakers and learners.
Interestingly, User:AryamanA created codes for Proto-Central Indo-Aryan inc-cen-pro, Proto-Northern Indo-Aryan inc-nor-pro and Proto-Northwestern Indo-Aryan inc-nwe-pro without anything more than:
I had not correctly categorized some of the subfamilies, so the pages themselves (except for 1 Ahirani lemma) are okay. This reorganization will take a bit.
So I'm not sure whether that means we can start reconstructing these proto-languages (finding citations for such reconstructions would be difficult). Perhaps he'll tell us more about what is happening after the reorganisation is complete. According to Wiktionary:Families, non-genetic groups of languages can also be called a ʻfamilyʼ such as CAT:Prakrit languages and now CAT:Central Indo-Aryan languages and CAT:Eastern Indo-Aryan languages. The ancestor of Ahirani is now Sauraseni Prakrit instead of Maharastri Prakrit (perhaps the similarity of से (se) with છે (che) was the reason for the change) with the result now visible on आऊत (āūt) (औत (aut) is more common than अऊत (aūt) for Marathi but mr.wikt uses अऊत (aūt)). Khandeshi continues to have Maharastri Prakrit as its ancestor. Although {{R:ahr:RSS}} exists, not all the pages of that dictionary are available.
It says on Wikipedia:
Sanskrit refers to the whole range of mutually intelligible Old Indo-Aryan dialects spoken in North-western India at the time of the composition of the Vedas.
the original speakers of what became Sanskrit arrived in the Indian subcontinent from the north-west sometime during the early second millennium BCE
So perhaps that means that attested Vedic Sanskrit is OIA in the Northwest Zone, and *पुरिष (puriṣa), *दिन्न (dinna), Reconstruction:Sanskrit/झापयति and the terms in CAT:Sanskrit reconstructed terms could either be alternate forms of OIA in the Northwest Zone or OIA in the other Zones. Kutchkutch (talk) 09:22, 18 September 2020 (UTC)[reply]

──────────────────────────────────────────────────────────────────────────────────────────────────── @Kutchkutch: The part about Skt. being a catch all term for all OIA dialects in North Western India was added by me when I was trying to re-word Woolner's statement. The original statement was something to the effect of, "if Sanskrit is taken to mean Vedic and all Old Indic dialects then the Prakrits are derived from Sanskrit". I think I ought to update it when I have more time on my hands as OIA was also spoken in the east. The part about Sanskrit speakers arriving in the northwest of India was already there. As for Vedic being North Western OIA, it's beautifully illustrated here that Vedic too was a mixture of several dialects. The hymns and verses that would later be included in the Rigveda were composed near the confluence of the Sutlej and Beas rivers in the Punjab region but the Rigveda was redacted in modern day Haryana where the dialect was slightly different so the original Rigvedic dialect was "filtered" through the phonetics of the Kurukshetran dialect, also called "Western Vedic" which was spoken where the redaction of the text took place. Finally Panini's Classical Sanskrit comes from the Northwestern dialect of OIA spoken in Gandhara. -- Bhagadatta (talk) 10:43, 18 September 2020 (UTC)[reply]

@Bhagadatta: Thanks for the explanation and sharing that link. I've read that before but wrongly assumed that it was an imagined story. When you have time, it would be very helpful if you could update Wikipedia with your understanding of this information. Interestingly, Chitrapur Math's Sanskrit lessons use a funny story about the difference between two Konkani dialects to explain what Panini did for Classical Sanskrit here. Apparently there is a बड्गी dialect in North Kanara and a तेङ्की dialect in South Kanara (this is reminiscent of the first discussion at Talk:हांव). The female author of the Sanskrit lessons, who speaks बड्गी, marries a तेङ्की speaker and encounters a few difficulties.
Good ol' Panini, God bless his soul, being extremely sensitive to people's feelings, so no group would feel left out, and wanting to see everybody live happily ever after together, decided to act Pappamma, and brought all of them together under the aegis of "Sanskrit." He toured all over Bharatvarsha, noted every word used and put it all down on paper. Then he classified the words. AND HOW!!! ( To his credit ...he studied all existing grammar works and in his own work, has very religiously and faithfully accounted other grammarians' thoughts on the subject under discussion.) Once Panini's work became known to the people, the Sanskrit Badgis and the Tenkis of the days gone by became familiar with each other's vocabulary and very soon a mixture of the two became a single , common medium of communication. Much like my kids speak today!
The connection between Dardic and Punjabi probably comes from the article Dardic languages. There also may be some Dardic influence on Konkani_language#Pre-history_and_early_development
Dardic may in turn also have left a discernible imprint on non-Dardic Indo-Aryan languages, such as Punjabi and allegedly even far beyond.
Konkani shows a good deal of Dardic ( Paisachi ) influence. Even Magadhi has got a good deal of "Dardic" influence. The other languages on which Paisachi exerted influence are Sindhi, Punjabi, Kashmiri and Nepali in the north.
The influence of Paisachi over Konkani can be proved in the findings of Dr. Taraporewala, who in his book Elements of Science of Languages (Calcutta University) ascertained that Konkani showed many Dardic features that are found in present-day Kashmiri. Thus, the archaic form of old Konkani is referred to as Paishachi by some linguists. This progenitor of Konkani (or Paishachi Apabhramsha) has preserved an older form of phonetic and grammatical development, showing a great variety of verbal forms found in Sanskrit and a large number of grammatical forms that are not found in Marathi.
The link that you provided also demonstrates that some work on the Northwest Zone has been done. Perhaps a summary is in order.
For Paisachi:
The most iconic feature of Paisachi is the apparent devoicing of intervocalic stops (Compare Sanskrit bhagavatī with Paiśācī phakkavatī). The grammatical rules at work in Paiśācī could simply be the reverse application of the voicing rules applied to produce the other Dramatic Prakrits. For von Hinüber (1981), however, the supposed devoicing in Paiśācī is actually a fiction of orthography. According to his theory, at some point in the development of Middle Indic, the character <g> no longer represents voiced velar stop /ɡ/ but rather voiced velar fricative /ɣ/ due to lenition. After this shift, the character <k> is repurposed to mark /ɡ/. For von Hinüber, the odd appearance of Paiśācī is due to the distorting lens of this orthographical shift.
For Sindhi:
The five major dialects of Sindhi are Vicholi, Lari, CAT:Lasi language, Thari and Kachchi. CAT:Memoni language, CAT:Kholosi language and CAT:Luwati language are often included as well. Thari is perhaps another name for Dhatki mki, in which case it may actually be a Rajasthani language. Some sources say that there is a Saraiki dialect of Sindhi and another Saraiki dialect of Punjabi.
Vicholi is the standard dialect in central Sindh. Lari is the dialect of southern Sindh. Lasi is spoken on the western frontier of Sindh and in Balochistan. Thari is the dialect of the Jaisalmer district of Rajasthan. There is a dialect map for Sindhi at [1]. Implosives are explained as the outcomes of geminated voiced stops in MIA (MIA /bb/ → /ɓ/). This is slightly different from the analysis of Vracada Apabhramsa above. The number of voiced implosives differs from dialect to dialect (similar to how the number of tones differs from dialect to dialect for Punjabi). All Sindhi dialects have at least one implosive, and curiously none have a dental. *[tr] → /ʈ/ in ٽي (three).
For Punjabi:
Fariduddin Ganjshakar (1173 - 1266) was one of the first Punjabi writers. Fariduddin Ganjshakar wrote in the Shahmukhi script. Perhaps the name Shahmukhi is only used to distinguish it from Gurmukhi. Fariduddin's literature was included in the Adi Granth. The excerpt interestingly claims that there was some contact between the writers of Old Punjabi and Old Marathi.
The dialects of Punjabi are divided into Western Punjabi and Eastern Punjabi. Here are some maps: 1 2 3
The Eastern dialects are Majhi, Doabi, Malwai and Puadhi. Doabi is spoken between the Beas and the Sutlej (perhaps the same region in which the Rigveda was composed). Malwai and Puadhi are spoken south of the Sutlej along the boundary of the Haryanvi language area. Arya Samaj's promotion of Hindi in the Punjab is often cited as the reason for the blur between Hindi and Eastern Punjabi. Lahnda is an exonym for Western Punjabi coined by William St. Clair Tisdall and this term was also used by Grierson. Two major groups within Western Punjabi are Saraiki and Hindko. Hindko is spoken in discontinuous areas of Khyber Pakhtunkhwa and is in frequent contact with Pashto. Kutchkutch (talk) 10:18, 19 September 2020 (UTC)[reply]
@Kutchkutch, Bhagadatta, Inqilābī: Thank you all for your input!! I have been a bit busy, but I have read and tried to do my own research as well.
  • I want to clarify, Proto-Central IA, Proto-Northwestern IA, etc. were only a temporary measure for organization in our modules. I don't expect that we will be reconstructing (most of) these and we should strive to replace them with attested languages if possible. Hence, why I brought up Paisaci as a possible substitute for Proto-NW IA. (Also, most sources I found treated Ahirani as a Central IA lect, hence the reclassification.)
  • The status of Paisaci seems to be difficult to ascertain. Upadhye (1939-40) in their review of Paisaci literature classifies it as an Old MIA language, on the order of Ashokan and Pali. This would seem to explain the rarity of it in written texts; probably, since no religious group promoted it (as Buddhists and Jains did with other MIA lects), there was little incentive for its recording, and so all we are left with are the statements of grammarians.
  • But it should also be noted that early Buddhist texts claim that Paisaci was used by one of the schools of the Vaibhāṣika Sthavira (sub-sect) of Buddhism, based in Kashmir. And going through Pischel again, I find that all of the evidence does point to Paisaci being a Northwestern IA language. The only argument put forth for it being anywhere else (namely the Vindhyas) is the presence of retroflex ḷ (suggested by Rudolf Hoernlé), but pretty much everyone that followed has found this to be insufficient evidence. It should also be noted that (some lects of) Punjabi have developed ḷ natively too, and I think some of the Dardic languages too.
  • Then the problem follows, what is a descendant of Paisaci? I strongly doubt that Konkani is involved, although there certainly hasn't been enough historical linguistic work on it; I would think a lot of Konkani's archaisms are rather due to re-Sanskritization, perhaps earlier than occurred for other IA languages. The Dardic languages are often tied to Paisaci, but Paisaci only preserves one sibilant while the Dardic languages have all 3 generally. Maybe we ought to keep it as only the ancestor of Sindhi and Punjabi? I also think a code for Vracada (as the ancestor of Sindhi etc. as Kutchkutch investigated) is necessary and uncontroversial. The devoicing really reminds me of Punjabi, which has done that to its voiced aspirated series and has resulted in tones, but if it's purely an orthographic difference then that is not the same process. Upadhye gives a very interesting idea about Culika dialect as being a form spoken by Sogdians who came to India, suggesting the speakers of Paisaci migrated inland later on, which would result in dialect admixture with Sauraseni. But I think it's very difficult to draw satisfactory conclusions.
  • Of course, there are probably discussions to be had about Dardic's placement as well. Probably, Shabazgarhi/Mansehra Ashokan Prakrit are a good proxy for the ancestor of Dardic, but it is confusing where Gandhari comes into play. We will not put it under Paisaci at this time.
  • Finally, I'm afraid the tree model is too simplistic for IA in general, as Inqilābī and Kutchkutch pointed out. On a purely lexical basis, as examined from the view of lexicostatistics, Kogan (2016) finds Punjabi to be closest to Hindi, grouping Sindhi closer to Gujarati, Rajasthani, Lahnda, etc. Our grouping is quite different. Hindi itself, as we know, is a highly mixed language that developed in Delhi from contact between many languages in a political centre, as reflected in its early forms such as Sadhukkadi, in which case even it is not a purely Central IA language and probably is highly influenced by other Central (Braj, Haryanvi), Northwest (Punjabi), Western (Rajasthani), Eastern (Bihari lects) and the Ardhamagadhi lects of IA.
  • Overall I am not opposed to the placement of Punjabi and Sindhi (and related languages) under Paisaci, and Sindhi and its relatives under Vracada Apabhramsa. Reconstructing Paisaci seems difficult however, although I am curious whether late MIA can be reconstructed at all so it may be an interesting experiment to undertake. However, I wonder if we should only rest on geographical groupings as the best we can have, since the Prakrits seem to be difficult to tie directly to NIA languages.
AryamanA (मुझसे बात करेंयोगदान) 21:50, 19 September 2020 (UTC)[reply]
@Bhagadatta, Kutchkutch, Inqilābī: Also check out [2] from Suniti Kumar Chatterji's work on Bengali. I think this is the best tree-based chart we'll be getting! —AryamanA (मुझसे बात करेंयोगदान) 02:38, 20 September 2020 (UTC)[reply]
@AryamanA: Thanks for the update and the research behind the update. Although we made some progress in furthering our understanding, this is certainly a complex set of issues that are unlikely to be resolved anytime soon. The tangible result is that there is now a code for Vracada. However, the usefulness of that code for Vracada remains to be seen. There are probably shortcomings of the tree model in every Zone.
Vracada Apabhramsha's role in forming Sindhi is another reminder that Late MIA (Apabhramsha) is an important stage of I-A. Apabhramsha is often treated as a single entity rather than as discrete regional entities because there was a dialect continuum like Ashokan Prakrit. In the bazaar scene of Uddyotana’s Kuvalayamālā c. 779, the narrator quotes small bits of eighteen different languages, some of which sound remarkably similar to the spoken languages of today rather than the Prakrits. Thus, the original names given to the various Apabhramshas (Nagara, Upanagara, Vracada) are like the dots in MOD:inc-ash-dial-map. Most of the Apabhramshas like Vracada are known by name only with little information about them. Kutchkutch (talk) 09:38, 20 September 2020 (UTC)
@Kutchkutch: Thanks for sharing the link. It's really interesting and refreshing to see that the idea of Sanskrit having different dialects is being alluded to by an Indian guide on Sanskrit. Because I've seen otherwise, most people usually believe that Sanskrit, right from the Rigveda to the elementary lessons on a Class 8 Sanskrit textbook, is of a single form. Konkani being related/influenced by Paisaci/Dardic is news to me but I can already see why someone would think so. It was the language(s) brought in by the GSB people that would eventually become Konkani and the GSBs are said to have arrived from the banks of the Saraswati river. But then, Konkani is after all descended from pmh so Paisaci influence if any would be comparable to the influence exerted by a substrate which probably has been lost by now.
@AryamanA: That chart definitely explores the relations between IA languages better than how it's presently done on this site. I'd say the Romani languages are perhaps descended from Paisaci if it weren't for that chart which lists Romani as being influenced by it and Dardic. -- Bhagadatta (talk) 12:21, 20 September 2020 (UTC)[reply]
@Bhagadatta, Kutchkutch: Another non-controversial change we can make is replace Proto-Northern Indo-Aryan with Khasa Prakrit, removing the Pahari languages from the Sauraseni fold. This would need a bot run to replace current instances of inheritance from Sauraseni in those languages.
I'm still a bit uncertain about Paisaci reconstructions, but it seems we have consensus to put it as the common ancestor of Sindhi and Punjabi and Lahnda lects. I will do that. We still need an Apabhramsa for Punjabi, but it seems there were several in the Punjab area if we go by Grierson and Chatterji. The primary one of these is Ṭakka Apabhramsa, and we can classify the other ones (Kekaya, Madra, Upanāgara, etc.) as varieties of this. Like Vracada, these aren't well attested. —AryamanA (मुझसे बात करेंयोगदान) 20:24, 20 September 2020 (UTC)[reply]
@AryamanA: The new codes and reorganisation will help with descendants trees. The most important idea conveyed by the recent changes is that MIA doesn't become NIA in one fell swoop. Replacing Proto-Northern Indo-Aryan with Khasa Prakrit would allow a replacement for the ambiguous term Pahari in descendants trees. If Takka is the same as Pischel's Dhakki, then Takka Apabhramsa could have attested entries.
The only other discussion about Apabhramsa appears to be User_talk:DerekWinters#Apabhramsa. Some interesting points from that discussion are:
This shows that there must have been a language spoken in Kalinga (historical region) that led to Oriya being separate from Bengali-Assamese.
Principle Apabhramshas are Takka Apabhramsha in Central Punjab and Vrachada Apabhramsha in Southern Punjab. By 1200 AD these Apabhramshas had few inflectional morphemes left. During Middle Ages Takka Apabhramsha developed into Lahori dialect and Vrachada Apabhramsha developed into Multani dialect. Arab and Persian travellers, specifically Al-Biruni in his book تَحْقِيق مَا لِالْهِنْد (taḥqīq mā li-l-hind), had declared that even before the advent of Islam in Sindh (711 A.D.), Vrachada was prevalent in the region.
(@Bhagadatta:) It's still unclear whether there should be reconstructions for Vrachada, Takka and Paisachi. Comparisons could be made with reconstructed Ashokan Prakrit (Proto-Prakrit) and Proto-Dardic. However, the difference is that reconstructed Proto-Prakrit and Proto-Dardic appear in the literature, and Paisaci and the Apabhramsas don't appear to have a tradition of reconstruction unless en.wikt wants to start that tradition. The sound laws that affect Paisaci and the Apabhramsas would need to be established somewhere (such as User:AryamanA/Prakrit). Without reconstruction, there would be a lot of categories of the type CAT:Vracada Apabhramsa term requests. Kutchkutch (talk) 10:26, 21 September 2020 (UTC)

@AryamanA, Itsmeyash31, Atitarev, Bhagadatta I am planning to remove this category unless someone comes up with a good reason to keep it. It has only 33 entries in it, is badly named, and duplicates Category:Hindi terms inherited from Sanskrit. Benwing2 (talk) 04:50, 17 September 2020 (UTC)[reply]

@Benwing2: Oh man, consensus to delete this category had been reached years ago, I can't believe it's still there... I thought it was deleted. -- Bhagadatta (talk) 05:16, 17 September 2020 (UTC)[reply]
@Benwing2: Since we do not have Category:Hindi Tatsama and Category:Hindi Ardhatatsama, then how did this one linger thus long? Obviously, delete it. inqilābī [ inqilāb zindabād ] 13:18, 17 September 2020 (UTC)[reply]

Turkish noun inflection

According to a comment in Requested entries (Turkish) (Special:Diff/60412898) the Turkish noun inflection template is incomplete. Some possessive forms can end in -in or -ini and only the first is generated. Could somebody who knows Turkish explain what needs to change? Vox Sciurorum (talk) 13:46, 17 September 2020 (UTC)[reply]

It is not a simple matter. Turkish nouns can have an optional number suffix (marking a plural), an optional possessive suffix, and an optional case suffix, in that order. For example, kitap-lar-ım-da = “book-plural-mine-in” = “in my books”. So the generic form is noun stem + number + possessive + case. Counting the absence of a marking as the presence of a null segment denoted, for the purpose of exposition, by ∅, some possibilities are:
  • kitab-∅-ım-da = “in my book”
  • kitap-lar-∅-da = “in (the) books”
  • kitap-lar-ım-∅ = “my books”
Including the null segment, there are two number suffixes, seven possessive suffixes, and six case suffixes, giving 2 × 7 × 6 = 84 combinations. Some forms are shared, but are analytically and semantically distinguishable. The inflection tables are in comparison rather simplified. They are essentially two tables: a main one for noun stem + number + ∅ + case, and an optional one for noun stem + number + possessive + ∅, reducing the number of combinations to 2 × 1 × 6 + 2 × 6 × 1 = 24. The requested entry bisikletini (which is the definite accusative of both bisikletin (“your (singular) bicycle”) and bisikleti (“his/her/its bicycle”), and therefore a shared form) contains both a possessive suffix and a case suffix, so it is not included in the 24 forms provided at bisiklet#Declension. All this is peanuts compared to Turkish verb conjugations, where you’ll have a real combinatorial explosion if you try to include all possible forms. I think this is more a grammatical issue than a lexical issue, but ultimately it is a policy issue; does the maxim “all words” really mean all 10,080 possible, completely regular and predictable inflections of some base form in some agglutinative language?  --Lambiam 11:25, 18 September 2020 (UTC)
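The arithmetic in the comment above can be checked with a short sketch. It only counts slot combinations and does not generate real Turkish forms (no vowel harmony or consonant changes). The slot labels, and in particular the choice of which six cases to list, are assumptions made for illustration; "sg", "none" and "nom" stand for the null segment ∅.

```python
from itertools import product

# Illustrative slot labels (assumed, not taken from the inflection template).
NUMBERS = ["sg", "pl"]
POSSESSIVES = ["none", "1sg", "2sg", "3sg", "1pl", "2pl", "3pl"]
CASES = ["nom", "acc", "dat", "loc", "abl", "gen"]

# Every stem + number + possessive + case combination: 2 x 7 x 6.
all_forms = set(product(NUMBERS, POSSESSIVES, CASES))

# Main table: number and case vary, possessive slot is empty.
main_table = {(n, "none", c) for n in NUMBERS for c in CASES}

# Possessive table: number and possessive vary, case slot is empty.
possessive_table = {
    (n, p, "nom") for n in NUMBERS for p in POSSESSIVES if p != "none"
}

covered = main_table | possessive_table
missing = all_forms - covered

print(len(all_forms), len(covered), len(missing))
# bisikletini combines a possessive (2sg or 3sg) with the accusative,
# so both of its analyses fall outside the two tables.
print(("sg", "2sg", "acc") in missing, ("sg", "3sg", "acc") in missing)
```

Under these assumptions the two tables cover 24 of the 84 slot combinations, and any form that carries both a possessive and a case suffix is among those left out.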

The Azerbaijani inflection table has the same problem, yet it shows the complete inflection, so why can't we do the same with Turkish? Some Wiktionaries already have the complete one. If we add the inflection for bisikletini, we can just list it as the second-person possessive and the third-person possessive in the accusative. Compare these cases: Bisikletin nerde? = Where's your bicycle? (nominative, second-person possessive). Bisikletini istiyorum = I want your bicycle (accusative, second-person possessive). Lagrium (talk) 14:01, 18 September 2020 (UTC)

Perhaps Kyiv should be promoted from an “alternate spelling,” in the wake of the renaming of w:Kyiv. Michael Z. 2020-09-17 16:49 z

The city hasn't been renamed AFAICT; as long as more English-language sources call it Kiev than Kyiv, I think {{alt sp}} is still right. Kyïv, Kyjiv, and Kyyiv may also be attestable, but none of them have an entry yet. —Mahāgaja · talk 17:27, 17 September 2020 (UTC)[reply]
Kyiv and Kiev are Romanizations of different names, in different languages, for the city. So they are, IMO, “alternative forms” rather than “alternate spellings”.  --Lambiam 11:40, 18 September 2020 (UTC)[reply]
The difference in their written form is only their spelling—not capitalization, not diacritics, or anything else (not that I care much about the label). Michael Z. 2020-09-19 16:37 z
Alternative spellings have to share an etymology. These two forms don't. Ultimateria (talk) 17:37, 19 September 2020 (UTC)[reply]
The same could be said about "castle" and "chateau". Would you say that Myanmar and Burma are alternative spellings? How about Beijing and Peking?
Interesting comments. (Who says alternative spellings have to share an etymology?) They sort of do share an etymology, because the English name was strongly influenced by both Russian and Ukrainian at a time when written Ukrainian was suppressed in the Russian empire, when Ukrainian was often referred to as “Russian” or “Little Russian,” especially by the literate class in the Russian empire, and when the old orthography could write it exactly the same in two languages: Киѣвъ (the letter yat was pronounced differently and led to different sounds and letters in modern Russian and Ukrainian). There’s a lot of historical Ukrainian influence on English that still carries that legacy.
The spoken name \key-ev\ is not two different words. It can be transcribed with either of two spellings, depending on whether you use a current style manual or still use your grade four textbook.
Peking and Beijing came about similarly, through transcription of Cantonese and Mandarin languages, respectively. Pronounced differently but written similarly, like Киѣвъ > Кіев/Київ. But English Kyiv/Kiev are spoken the same. Michael Z. 2020-09-19 19:08 z
I came here looking for English pronunciation and the entry currently lacks one. I find your remark about it being the same as Kiev very surprising. Why would anyone bother to change from Russian-derived spelling to Ukrainian-derived spelling while retaining Russian-derived pronunciation detached from the spelling? I would expect the situation to be exactly the same as with Peking and Beijing from the POV of English language (which have different pronunciation despite coming from exactly the same source, unlike Kiev and Kyiv). But in any case, we are missing the pronunciation and I don't know what is actually used. For that matter AHD gives[3] "Ki·ev (kēʹĕf, -ĕv) or Ky·yiv (kēʹēo͝oʹ)" (I suppose Kyyiv is a simple alt sp of Kyiv) while Oxford's Lexico gives[4] /ˈkɪjɪf/ for Kyiv (vs /ˈkiːɛv/, /ˈkiːɛf/ for Kiev). —mwgamera (talk) 22:49, 23 August 2021 (UTC)
That's a good point. {{alt form}} is better than {{alt sp}} for this. —Mahāgaja · talk 16:28, 18 September 2020 (UTC)[reply]
Indeed the city hasn’t been renamed in Ukrainian or in Russian, but it is in the process of being renamed (re-spelled) in English, much like Peking→Beijing. More current sources are now writing Kyiv, and according to current style manuals Kiev deserves the label “dated.” Wikipedia has changed its practice as a follower, not a leader, and this only after the delay of a six-month moratorium (gag order) and three months of debate on the associated talk page. Michael Z. 2020-09-19 16:43 z
The changes in style guides are motivated by politics. Kiev is still more common. --Vahag (talk) 18:10, 19 September 2020 (UTC)[reply]
Our dictionary shouldn’t share your political prejudices. It is based on usage, which we ascertain through references. And it shouldn’t give our readers writing advice that makes them look ignorant of current standards. Michael Z. 2020-09-19 19:09 z
Google Ngram is too coarse. The last data point is all of 2019 averaged out. There was a tipping point in usage during October–November. If you’re interested in politics, the very conservative BBC and New York Times use Kyiv, Breitbart uses Kiev. Michael Z. 2020-09-19 19:20 z
@Mzajac: "Peking" is not Cantonese (it would be pronounced like the English word "Bugging"), and "Beijing" is not Mandarin. The first one is a Chinese postal romanization, the second one is pinyin, based on standard Chinese, which again is based on the Beijing reading of Chinese characters. You seem to have the general misconceptions about Chinese. There is no such thing as the dichotomy between "Cantonese and Mandarin languages". You are comparing a city speech to a whole continuum. Cantonese is a prestige dialect representing the Yue linguistic area. It uses the Guangzhou/Canton reading of Chinese characters. Mandarin, on the other hand, is a linguistic area covering a vast area of China, with many cities having their characteristic dialects that can't be understood by outsiders. Thus, using mutual intelligibility as a criterion to define "languages" in China is naive thinking. Two neighboring cities can have problems communicating with each other despite speaking variants of the same dialect group (Mandarin, Jin, Gan, Wu, Xiang, Yue, ...).--89.246.121.221 16:01, 19 April 2021 (UTC)
Thanks for the patient explanation. It is still my understanding that the Latin transcriptions in Postal Romanization and Pinyin were based on different regional and historical pronunciations of the same written Chinese, regardless of what one defines as a language or dialect (and perhaps based on different applications of Latin). Isn’t that correct? Similarly, Kyiv has had many English spellings over five centuries, influenced directly and indirectly by Ukrainian, Polish, Latin, German, Serbo-Croatian, and Russian.
Anyway, some of this is a distraction. Our guidelines state that wt:Alternative forms are all equal, but the choice of main entry should take into account the requirement for usage labelling. The nut is that the spelling recommended by every up-to-date English style guide should be the main-entry form, not one that is dated in formal English. Michael Z. 2021-04-19 16:35 z 16:35, 19 April 2021 (UTC)[reply]

Appendix for strings in unidentified or uncertain languages?

Over in the User_talk:Karaeng_Matoaya#The_enigmatic_poem_of_Nukata_no_Ōkimi thread, a few of us were discussing a particular poem in the Man'yōshū anthology of Old Japanese poetry, completed around 759 CE. Poem number 9 has frustrated readers for centuries, as the first two stanzas may be written in a different language entirely.

That discussion gave rise to a question about whether Wiktionary might have space for collecting snippets of text like this, where the underlying language might not be known. Clearly, a mainspace entry would be inappropriate. But what about an Appendix page?

Do we perhaps already have such an Appendix area set up? ‑‑ Eiríkr Útlendi │Tala við mig 19:05, 17 September 2020 (UTC)[reply]

Why is mainspace inappropriate? It's attested, and it's what we already do: see Category:Undetermined lemmas. —Μετάknowledgediscuss/deeds 19:14, 17 September 2020 (UTC)[reply]
The Buyla inscription has word breaks so you know what to make entries for. But these verses (there are other examples especially in ancient Chinese sources, as discussed there) are entire sentences (sometimes entire songs) of words in an unknown language where even academics can’t agree on where to split the word boundaries, and it doesn’t seem quite right to treat sentences or songs as mainspace entries.--Karaeng Matoaya (talk) 22:36, 17 September 2020 (UTC)
In that case, we could create entries for each character (as for the Phaistos Disc signs)... although I don't object to creating an appendix, which would I presume present the text and various scholarly ideas of where to break it up and what it might mean? - -sche (discuss) 18:43, 19 September 2020 (UTC)[reply]
@-sche: Unlike the case of the Phaistos Disc where the characters are literally undeciphered, the main languages that brought about this discussion are all transcribed in conventional Chinese characters that we already have CJKV entries for, so it's only a matter of reconstructing the pronunciations and orthographic practices at the time and place of the transcription and comparing the resulting sequence to known languages from those parts of Eurasia. Scholars have gone quite a long way toward deciphering many of these cases, but of course each interpretation attempt only makes sense as a whole; if a certain passage is determined to be Proto-Turkic from the beginning, you're going to have entirely different results from if you decided that it was Para-Mongolic. So having separate-character entries would not be particularly productive, while discussing the passage as a whole would. This is why I think an appendix entry like Appendix:Song of the Yue Boatman would work better.--Karaeng Matoaya (talk) 01:06, 20 September 2020 (UTC)
OK, so, Appendix:Undetermined language? With sections or subpages for each text? - -sche (discuss) 03:35, 30 September 2020 (UTC)[reply]

Quote adder redux

In a recent discussion about templatising citations it became clear that some editors don't use templates for citations because it's more difficult (parameters must be remembered, brackets matched etc.).

Other communities have already addressed this problem and created the Citoid extension. It extracts metadata (author, date, title etc) from a URL/ISBN/DOI and generates template code which can be directly inserted into the page. This has been suggested before, but not much has happened since. I'd like to get the ball rolling, if there's interest we'd need to set up a vote to get the extension installed, and create mapping files to work with our templates. – Jberkel 13:58, 18 September 2020 (UTC)[reply]

Calling other users involved in that discussion (@Equinox, DTLHS, DCDuring) for their support. inqilābī [ inqilāb zindabād ] 18:11, 19 September 2020 (UTC)[reply]
It looks like it takes a fair amount of work to set it up. I'd like to see how a version configured for Wiktionary does on older cites using URLs from Google Books and no ISBN (ie, pre-1970). How does it do on translated works? On edited works with many authors of individual works? And quotations taken not from the speaker/author's own work, but from some secondary source? What about digging out when the work was actually written rather than when the specific print edition was published (though that publication date might be of supplementary interest)? I expect that a lot of manual work will be required for these situations in which the relative ease of Citoid will lead folks to fail to perform the whole job. But the current setup does a pretty good job of that already. DCDuring (talk) 00:48, 20 September 2020 (UTC)[reply]
BTW, is anyone working on tools like those developed by ELEXIS? Does anyone have access to and opinions about these tools? Also, has anyone worked with Sketch Engine? DCDuring (talk) 01:19, 20 September 2020 (UTC)[reply]
@DCDuring: I don't think the set up will take too much time, it's a matter of adapting it to our template conventions. It's true that it probably won't work well for some of the use cases you mentioned, but it seems to be doing a good job for getting references from contemporary sources. Yes, it's biased towards English and scientific sources, but the general metadata extraction part is generic and can be extended. I think it would especially help editors who routinely add quotations from newspapers / news websites. These often have already good metadata (author, publishing date, title) embedded in the HTML which can be extracted without any custom code. – Jberkel 07:47, 24 September 2020 (UTC)[reply]
What I would ideally like (and I know it's a "big ask") is for templates in general to be able to auto-complete or auto-predict their contents, rather like Microsoft Visual Studio does when you write a function call in some programming source code. So I would type the opening braces {{ and it would immediately pop up an alphabetical list of all available templates (without impeding my typing: this would just be an optional pop-up menu that I could navigate with the arrow keys), and perhaps I would choose "en-noun", and then type the pipe | and it would pop up a useful hint, like "en-noun parameter 2: Language code, e.g. en for English". I realise this would take a lot of work both in implementing the popup stuff and in actually filling in the documentation, but I think this would be amazing, and would convince a lot more people to use the templates. TLDR: I'm not specifically focused on citations but on templates in general; I do realise however (see my talk page) that I am a major editor and some people would like me to format my large numbers of book quotations in a proper manner, which I haven't forced myself to learn yet. Equinox 08:51, 24 September 2020 (UTC)[reply]
I'm not sure that typing in an ISBN or whatever reference number would help me at all because I'm usually copying and dropping in some text out of Google Books. I don't want to look stuff up, and I usually don't have any idea what the ISBN is. I just want to put the text in there as proof that a word exists. Equinox 08:53, 24 September 2020 (UTC)[reply]
@Equinox: When you cite from Google Books, you still need to perform a few manual copy&paste operations (select author, select title, etc.) If you pop this Google Books URL into Wikipedia's Cite tool, it creates the following snippet for you:
{{Cite book|date=2005-03-23|first=James|isbn=978-0-596-00847-5|language=en|last=Avery|publisher="O'Reilly Media, Inc."|title=Visual Studio Hacks: Tips & Tools for Turbocharging the IDE|url=https://books.google.es/books?id=ux3-AgXEenoC&pg=PA137&dq=visual+studio&hl=en&sa=X&ved=2ahUKEwin4byBxYHsAhULyxoKHXD6CaUQ6AEwAXoECAMQAg#v=onepage&q=visual%20studio&f=false}}
All you have to do is add the text passage, the boring details are pre-filled. Obviously, in this example the template and parameters are Wikipedia-specific, but that's part of the mapping we'll have to customize. – Jberkel 10:21, 24 September 2020 (UTC)[reply]
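The mapping step mentioned above could look roughly like the following sketch. It takes the fields from the Wikipedia snippet and rearranges them into a quotation template call. The function name is hypothetical, the target parameter names (year, author, title, publisher, isbn, url, passage) are an assumption about how our quotation templates would be filled, and the URL is shortened here for readability. This is not how Citoid itself is configured, only an illustration of the kind of field mapping involved.

```python
def to_quote_book(meta, lang="en", passage=""):
    """Rearrange Cite-book style metadata into a quote-book style call.

    Hypothetical helper: the output parameter names are assumed.
    """
    # Wikipedia splits the author into first/last; join them into one field.
    author = " ".join(p for p in (meta.get("first"), meta.get("last")) if p)
    fields = [
        ("year", meta.get("date", "")[:4]),  # keep only the year of the date
        ("author", author),
        ("title", meta.get("title", "")),
        ("publisher", meta.get("publisher", "").strip('"')),
        ("isbn", meta.get("isbn", "")),
        ("url", meta.get("url", "")),
        ("passage", passage),
    ]
    # Skip empty fields so the template call stays short.
    params = [f"{name}={value}" for name, value in fields if value]
    return "{{" + "|".join(["quote-book", lang] + params) + "}}"


# Field values copied from the snippet above (URL shortened).
cite_book = {
    "date": "2005-03-23",
    "first": "James",
    "last": "Avery",
    "isbn": "978-0-596-00847-5",
    "publisher": '"O\'Reilly Media, Inc."',
    "title": "Visual Studio Hacks: Tips & Tools for Turbocharging the IDE",
    "url": "https://books.google.es/books?id=ux3-AgXEenoC&pg=PA137",
}

print(to_quote_book(cite_book))
```

The editor would then only need to supply the passage argument, which matches the workflow described: the bibliographic details are pre-filled and the quoted text is added by hand.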
Ah, I see; I thought I would have to type an ISBN into a text box or something. Yeah, sounds good. Please HIT ME UP (as the kids say, or did a decade ago) if you want to do any pre-release testing of this. Equinox 10:28, 24 September 2020 (UTC)[reply]
I'd be happier if the ISBN didn't display, just as I would like to see publisher's addresses and much other material in many of our RQ templates fail to display or, better, appear only in a bibliographic entry outside of principal namespace, possibly in yet another namespace. — This unsigned comment was added by DCDuring (talkcontribs) at 14:21, 24 September 2020 (UTC).[reply]
I've drafted Votes/2020-09/Install Citoid. Before it starts I'll double-check that we meet all requirements. If I understand it correctly, the extension is already installed and just needs to be enabled. – Jberkel 09:36, 26 September 2020 (UTC)[reply]
After a successful vote this is now enabled on enwikt. – Jberkel 20:30, 12 November 2020 (UTC)[reply]

CFI for misspellings

Hi, I've created a vote at Wiktionary:Votes/2020-09/CFI for misspellings to begin next week. I appreciate any feedback. Please let me know on the talk page of any relevant discussions I've missed. Ultimateria (talk) 19:30, 18 September 2020 (UTC)[reply]

I've renamed the vote to reflect its scope of rewriting the Spellings section of CFI: Wiktionary:Votes/2020-09/Misspellings and alternative spellings. Ultimateria (talk) 21:25, 18 September 2020 (UTC)[reply]

FWOTDs

For the past year, I have been selecting the Foreign Words of the Day (FWOTDs) here on Wiktionary. However, I'm at a point where I won't be able to do them any longer and would like to wash my hands of them, so I'm seeing whether anyone would be interested in handling them. (To set them, choose words from the nominations and add them to the subpages here using {{template:FWOTD}}.)

Even if you're not willing to take on the job of setting FWOTDs on a permanent basis, setting a few would be immensely helpful. There's currently a shortage of FWOTD nominees. Nominating more here would also be great; remember to follow the guidelines on that page. I'm open to the idea of potentially changing how Wiktionary does FWOTDs, as long as I'm not involved. As I said, I'd like to wash my hands of them.

If you have any questions, look at previous FWOTDs (e.g. September) or ask me. Hazarasp (parlement · werkis) 05:39, 20 September 2020 (UTC)

If my comment provokes a lot of discussion, I'll move it to its own thread so as not to hijack yours, but: I note that both WOTD and FWOTD seem to burn out their maintainers over time (I speak from experience!). I wonder if we should take a page from e.g. de.Wikt and do a "word of the week" instead. - -sche (discuss) 07:30, 20 September 2020 (UTC)
That is something I'm considering - that's why I mentioned "potentially changing how Wiktionary does FWOTDs". @Metaknowledge also suggested the idea of recycling previous FWOTDs; I think that's what I'll do (at least for now) if this doesn't attract much interest. Hazarasp (parlement · werkis) 08:32, 20 September 2020 (UTC)[reply]
How is island (FWOTD 2020-09-20) a Swedish word? Are there any more concrete requirements than that the term be "interesting"? Looking at the entries for this month, it is not apparent in what sense they are more interesting than any other random word, such as stjórnborð, တီကောင်, débarrassâmes, revisión por pares, or limitarere, generated by pressing "Random entry".  --Lambiam 19:53, 20 September 2020 (UTC)[reply]
Look more closely: the word is ö. As for interestingness, that was a major concern for me, and my pursuit of that is what burned me out, but I can't speak for Hazarasp. —Μετάknowledgediscuss/deeds 19:55, 20 September 2020 (UTC)[reply]
Interestingness is subjective, and if you're not familiar with a particular language it might not be obvious. What I find often lacking is context, or some connection between words. Focus or theme weeks might provide that, or what about simply merging FWOTD with the WOTD? Call it "Words of the Day": perhaps a translation or equivalent word in another language, or something that is connected to the WOTD, a cognate, false friend etc. – Jberkel 21:13, 22 September 2020 (UTC)[reply]
Another thing which just occurred to me: it would also help to work on the entries which have been nominated, but which are currently stuck in "not ready". Often it's just a missing citation or translation. – Jberkel 11:27, 24 September 2020 (UTC)[reply]

@Hazarasp I'm willing to look after FWOTDs, just tell me when and with what FWOTD date I can start. I expect to end up the same as the other people who've selected FWOTDs, but ¯\_(ツ)_/¯. ←₰-→ Lingo Bingo Dingo (talk) 11:33, 24 September 2020 (UTC)[reply]

The first unassigned FWOTD is October 1, though October 2 and 3 are both assigned. After that, things are clear (with a few sporadic exceptions). Hazarasp (parlement · werkis) 06:10, 25 September 2020 (UTC)[reply]

Where to lemmatize Literary Chinese terms attested only in Korean sources?

There are some Literary Chinese terms which were used chiefly or only by Koreans, back when Chinese was the country's language of writing. These words were not used in some heavily Sinicized form of what is nevertheless Korean, but in entirely Chinese compositions that would otherwise be perfectly understandable to an educated Chinese reader versed in Literary Chinese. In many cases they actually have no corresponding Korean-language term at all.

Should these be lemmatized as Chinese?--Karaeng Matoaya (talk) 08:47, 21 September 2020 (UTC)[reply]

I think that is appropriate. {{lb|ko|Korea|literary|historical}}? Or maybe {{lb|zh|Korean literary}}, with an explanation link. —Suzukaze-c (talk) 09:17, 21 September 2020 (UTC)[reply]
Or {{lb|Korean Classical Chinese}}? -- 11:59, 21 September 2020 (UTC)[reply]
@Karaeng Matoaya, could you give us some examples? --Frigoris (talk) 12:05, 21 September 2020 (UTC)[reply]
@Frigoris, Some examples as I understand it are:
  • 白文 (báiwén) in the sense of "unapproved document"
  • 次知 in the sense of "someone in charge"
  • 啓聞 / 启闻 (qǐwén) in the sense of "to report to the central government"
  • 原情 (yuánqíng) in the sense of "to sue, to petition"
  • 發明 / 发明 (fāmíng) in the sense of "a criminal's excuse or alibi"
But I'm not very well-read in Chinese; have you happened to encounter any of these words in these senses in non-Korean works?--Karaeng Matoaya (talk) 12:29, 21 September 2020 (UTC)
That sounds a lot like Medieval Latin, with non-native speakers communicating in the default international language of the day. If it's distinct enough, I suppose we could come up with an etymology-only code, but given the multilingual nature of Chinese even in China, it's probably better to treat it as just one more form of Classical Chinese, with nothing more than a regional label. Chuck Entz (talk) 14:19, 21 September 2020 (UTC)[reply]
@Karaeng Matoaya, thanks! Those words were indeed rarely used in those senses in Chinese. For example, 原情 (yuánqíng) since the Han dynasty could mean "to pursue/seek the truth" > "to investigate a legal/criminal case"; however, in "Chinese Chinese" it was a very loose compound. The word indeed chiefly appeared in legal contexts, but the meaning was not identical to the one you listed (though related). The phrase now survives in fossil form in the Chinese chengyu 情有可原.
I presume that you do have the textual material in which these words appeared; if that material was clearly in a historical form of Chinese (as opposed to Korean), I think it's appropriate to add them under Chinese with a suitable label. --Frigoris (talk) 15:15, 21 September 2020 (UTC)

Multiple homographs for Ojibwe finals

I'm looking for advice/rules on how to edit for multiple meanings of the same form when i have no information on the etymology. The example is the Ojibwe final -i (a "final" is a type of morpheme specific to Algonquian languages). It has four different applications, which i have listed with separate POS headings. I'm wondering if that is the most appropriate way to show this phenomenon. I don't think gloss numbers 1, 2, 3, 4 work either, because the most important information - derived terms - doesn't fit under glosses. The English parallels use Etymology 1, Etymology 2, etc, but as i have no information on the different etymologies, listing that way seems inelegant. SteveGat (talk) 15:46, 21 September 2020 (UTC)

To have multiple etymology sections, you don't have to know what the etymologies are, you just have to know that the etymologies are different from each other. That said, I have an uneasy feeling that we may be dealing with something like a Swiss Army knife, that can be used to slice carrots, pull a cork, file your nails, or saw off a branch without having to swap it for something else. In my experience American Indian languages tend to twist categorization systems based on European languages into macramé. It may be better to use a morphologically-based POS like "suffix" or "particle" combined with senses and subsenses for the different functions. Pinging @-sche, who has experience with this. Chuck Entz (talk) 03:19, 22 September 2020 (UTC)[reply]
The POS isn't the concern here. Finals are a morphological category, according to the literature (see a quick explanation here). As for the Swiss army knife, i think it is possible that the multiple-etymology analysis may be flawed (or at least forced), but it is the analysis adopted by the only authority that deals with the language systematically and is accessible to non-academic readers, the Ojibwe People's Dictionary. But if i read you correctly, the best approach (if we accept that each of the finals has a different "meaning") is to enter each one under a different etymology (Etymology 1, 2, 3...) without actually providing an etymology. Is that right? SteveGat (talk) 20:53, 22 September 2020 (UTC)
@SteveGat In general, putting different meanings under different etymologies should only be done if the different meanings have etymologically diverse origins. However, if you feel this is correct, there is no particular requirement to provide an etymology for each etymology section. For example, when I generate non-lemma forms of a word and there's already a lemma on the same page for the same language, I use different etymology sections, with no etymology given for the non-lemma form. See Russian пар (par) for an example. Otherwise, if you think they are all etymologically the same, doing what you've currently done is fine, using separate POS headings. Or alternatively, put them as different definitions under the same POS heading, and if you want to separate out the derived terms, you can put the terms derived from different meanings under different boldfaced headers in the same "==Derived terms==" section (e.g. put a semicolon at the beginning of a line to boldface it). Benwing2 (talk) 02:15, 23 September 2020 (UTC)
(e/c) If we know, or reasonably suspect, that they have different etymologies (even if we don't know what those etymologies are), then yes, they can have blank Etymology 1, 2, etc headers. (If it seemed more likely that the same root diversified into several functions, then just different POS headers or different sense lines would suffice.)
I question some of the things listed as "derived" from the final final (the one defined as "occurs in adverbs, numbers, and other uninflected words"). In niswi, for example, the ending (w)i seems to have been present since Proto-Algonquian (*neʔθwi) if not earlier, and in niizhwaaswi and ingodwaaswi it seems that the ending may have been changed to the ending wi (not a final or suffix -i) at some early date when the words (which ended in ika in PA) were adapted to have the same form as other PA numerals which ended in wi. (In several other languages, the ika was simply dropped.) If -i were a meaningful morpheme marking numerals in Ojibwe (which is questionable, since other numerals lack it), it might make sense to say the words superficially have -i, but when the -i in question is just defined as a meaning-less string that "occurs in adverbs, numbers, and other uninflected words", I question whether such an analysis is useful. - -sche (discuss) 02:25, 23 September 2020 (UTC)[reply]
Thanks for all this help, @- -sche and @Benwing2. I think the morphology offered by the OPD is weak at best. The various finals -i are the result, in my view, of OPD's attempt to create initials that can be dissociated from the finals, when the more likely linguistic phenomenon is that the -i is simply suppressed when various other finals or suffixes are added. Compounding the problem is that -i in Ojibwe plays roughly the same role as a schwa: it's just vowel filler between consonants, so trying to force out a meaning seems illusory. Unfortunately, the OPD is the most authoritative resource on this issue, and it treats them all as separate finals. Given that the current layout is "ok", i'll leave it unless and until i find something clearer on what is going on. Thanks again. SteveGat (talk) 13:55, 23 September 2020 (UTC)

Gheg vowels

Hello,

I have to bring up here the issue of Gheg vowels according to the article given. I have never heard the vowel [ɒ], designated with "ä" in the vowels grid. My overall impression was that nowadays Gheg speakers can change with rather less restraint between [ɛ] and [e]. All terms in Tosk or Standard Albanian with the vowel [e] or [o] due to a schwa at the end have got the same vowel in Gheg as well. I wondered whether the denotation of the vowels [o] and [ɔ] extends any further in Gheg than if one would orthographically distinguish the Standard Albanian or Tosk pronunciation difference on account of the word-final schwa of nouns. Due to Tosk influence or whichever preference of the speaker, it almost does not make sense to separate [a] and [ɑ]. Both occur in Gheg words that would have their own entries so, as a basic approach, two Gheg entries would be needed for a single Gheg variant with this vowel. For example the term mas can surely have mâs too despite the overview given in the table. Presumably, combinations can occur freely in one word. Lastly, nobody uses all of these circumflexes in Gheg writing, especially not all together like a thoroughly adopted orthography, excluding also large parts of Albanian linguistics along with Vladimir Orel. They were more frequent at the end of the 19th century and probably the beginning of the 20th century. Perhaps it would be better to abolish one or more of the circumflex vowels and denote different pronunciations in the IPA table. HeliosX (talk) 22:11, 28 September 2020 (UTC)

Now I gathered examples for my doubts about the complete accuracy by transcribing the first two stanzas of the Gheg song "Gabim" by Dhurata Ahmetaj into the current Wiktionary orthography with all these circumflexes and I had to listen very closely, sometimes to each word. I have added all word-final schwas of Tosk, unfollowed by consonants, but they do not change the pronunciation in the distinctiveness of the orthography.
Gabim:
Sa ndrrovê
S’jê i njejti mo, s’jê i njejti mo, vetën ê harrovê
Çka menovê?
Sê ê jotja kôm m’u kônë
Dêri n’fund u mashtrovê
Ti prêmtovê qaq shumë rrêna saqê êdhê vêtes i besovê
Ê tregovê ftyrën tônê n’atë môment kur me tjera m’krahasovê
Êj, nuk um mêtë sên mâ
Tash nuk kena ça me bâllë
Sê prij mejê, zêmër, s’ditê ça pô dôn
Ti pô dôn mu mê m’pasë n’kôntrôll
Shumë pô dôn, amâ nuk ka mê ndôdhë, jô
Bêjbi, bêjbi, pak kôntrôll
S’ka mê ndôdhë, môs u lôdh mâ
The term "bêjbi" can also be written as "baby" if the speaker prefers that spelling. The preposition "prij" is usually "prej" also in Gheg. The transcription shows that Gheg may have most of the vowels in the table of the article linked but it also becomes apparent that they are interchanged easily as was my overall impression beforehand. The interchangeability occurs in the infinitive particles "me" and "mê", the adverbs "mo" and "mâ" and the reflexive pronouns "vetën" and "vêtes". In another Gheg song of the same year by another artist, there were multiple times "pâsë" and "mênu" instead of "pasë" and "menu" as one would think based on this. HeliosX (talk) 18:00, 23 September 2020 (UTC)[reply]

Entries and contributions being rejected -- CFI, idiomaticity and sum of parts (SoP) woes

Hi all,

First, let me say that I'm happy that you all use your valuable time to help grow and maintain this wonderful tool called Wiktionary. You've helped me learn so much over the years. Atitarev recommended that I start a topic here. He's right. It will probably lead to meaningful dialogue about an issue which probably affects all of us here. The issue, in the form of a question, really is:

As a criterion for inclusion, are some of us being too rigid about idiomaticity?

Here's the problem in question using contrived examples for simplicity.

SITUATION 1 (collocations, compound words, lexical chunks): Imagine that Wiktionary lacked the following word in some foreign language: central processing unit. To my mind, this is clearly a valuable entry for the dictionary regardless of idiomaticity. For me, it's simple: it can be found in published works and it deserves to be here. However, not everyone shares my opinion. Some might argue that it has "low linguistic value" and, perhaps, we should only include CPU. They would say that we should define one of the words central, or processing, or unit and that we should:

  • include examples for central processing unit in the central, processing, and unit pages.
  • include a translation on some other page like CPU. The translation must, however, be presented as separate words to avoid a red link (non-existent entry), presumably.

Through a combination of searching and encountering examples and translations, the Wiktionary user will know what a central processing unit is without ever having a dedicated page for it. Meanwhile, at someothersite.com, we can just type central processing unit and voilà. This leads to someothersite.com appearing above us in the search engine results by virtue of having a page with the correct title and heading which the user is interested in. (I'm assuming that being a popular dictionary matters to us too and that we want to lead the pack).

SITUATION 2 (prepositional phrases): We have the word regard in some foreign language. A Wiktionarian would now like to contribute the following:

The contributions are shot down and the Wiktionarian is told that sum of parts contributions are not welcome here. They are told: we already have a regard page. You can simply put all those entries as examples on that page. Problems:

  • The page for regard becomes a huge mess as it now has to explain all its derived terms.
  • The next random editor can remove or change the example. (A dedicated page for those entries would not be subject to that arbitrariness).
  • The derived terms listed above probably have different rules about their usage depending on the language. So, for example, one term might require the genitive case, another might require the instrumental; another might require only the singular number, etc.

But why are we doing this to ourselves? Why are we rejecting entries which are clearly helpful? What are we afraid of? This is not a paper dictionary. We will neither saturate the server nor lose the ability to find anything. The search function seems pretty awesome and I haven't heard about any Wiki projects complaining about hard disk space.

  • We have unlimited storage space which we should use to make this dictionary comprehensive.
  • We are not trying to document infinite word combinations; we are documenting frequently used lexical chunks.
  • By having specific entries, we are doing ourselves a favour in respect of search engine optimisation.
  • We can reduce clutter and increase the utility of pages by having specific entries for derived and related terms.
  • We can include more details about the usage of terms and expressions when they have their own pages and aren't just a footnote or example somewhere else.
  • We can link more translations to actual pages when those pages actually exist so that translation hubs aren't a blur of red.
  • We make our dictionary more attractive when users immediately see what they are looking for in the autocomplete.

Can we agree that the policy should be: if a term or expression appears in a published dictionary, it is good enough for inclusion in Wiktionary?

All the best. -- Dentonius (talk) 04:41, 24 September 2020 (UTC)[reply]

This has been discussed a billion times, but in response to your concerns about storage, I will just note that NOTPAPER in itself is a poor argument because that could also support having entries for "green leaf" and "large green leaf" and "cute furry kitten". Equinox 04:43, 24 September 2020 (UTC)[reply]
I also don't like your "published dictionary" rule because that basically makes us beholden to others instead of potentially leading the way. Also, other dictionaries can sometimes be worse than us, perhaps not often, but it can happen. (If you want to look up policy, we usually refer to copying other dictionaries' habits as following the LEMMINGS. See WT:LEMMING.) Equinox 04:45, 24 September 2020 (UTC)[reply]
if a term or expression appears in a published dictionary, it is good enough for inclusion in Wiktionary Absolutely not. Different dictionaries serve different purposes. They have different criteria for what to include. This diversity is a good thing. We are not trying to be all dictionaries, we are trying to be a particular dictionary with particular editorial practices. DTLHS (talk) 04:47, 24 September 2020 (UTC)[reply]
@Equinox, it's been discussed a billion times but maybe this time will be better. :-) I've never come across green leaf, large green leaf, or cute furry kitten in any published dictionary. Yes, some published dictionaries can be worse than us. But the reverse is true, we can be worse than them too. Their lexicographers are pretty smart people. Who are we to say that we know better about their art? @DTLHS, and what purpose are we trying to serve? I didn't sign up for your editorial practices, I signed up because I like this tool which enabled me to learn several languages. For me, it's about the utility. Why boast about an online universal dictionary which falls short of a paper dictionary? - Dentonius (talk) 05:03, 24 September 2020 (UTC)[reply]
We are lexicographers because we practice lexicography. The OED does not have a magic inaccessible spell which we lack that allows them to write dictionaries. I would much rather debate on the merits of our particular editorial policies than throw them out all together because we decided the editors of other dictionaries know better than us. DTLHS (talk) 05:12, 24 September 2020 (UTC)[reply]
Well, we have hundreds of entries that aren't in any professionally published dictionary; indeed I've seen some fairly compelling evidence that some of them (including the OED) occasionally refer to us to find new words. I am glad you signed up but maybe it's a good idea to look around and get a feel for the place, and its policies (developed by votes by hundreds of people over more than a decade) before immediately taking a sledgehammer to it all. Equinox 07:11, 24 September 2020 (UTC)[reply]
I've been using Wiktionary for years Equinox, long before I signed up. I love the concept but I always had other dictionaries I used because I knew that multi-word terms are usually a problem for Wiktionary. For example, for Spanish, I'd look up expressions and wouldn't find them here. Thankfully, there's spanishdict.com. For Russian, I'd look up certain expressions. Same problem. Thankfully, there's openrussian.org. For a bunch of other languages, I just go to wordreference.com which is pretty solid. For German, nothing even comes close to dict.cc. That's a project we should try to emulate. But, don't get me wrong. I *love* Wiktionary. There's nothing out there like it, but it still has a lot of growing to do. Now, as regards that sledgehammer, I haven't come to take away or destroy. I have come to give. I'm saying that we should stop thinking that we're better than professional lexicographers and let entries which are in published dictionaries exist here too. DTLHS, we aren't lexicographers (not unless that's your actual profession in real life). I would never try to make their profession out to be something that any lay person can do. In fact, I would sooner trust a published dictionary than anything I see here. For academic purposes, it is only a published dictionary which I can cite. For the amount of time that Wiktionary has been around, it is a crying shame that we haven't overtaken all the other dictionaries out there. - Dentonius (talk) 07:31, 24 September 2020 (UTC)[reply]
I don't think you understand. We aren't saying we are "better" than professional lexicographers. We don't necessarily compare ourselves to them at all. Forget that they exist. We are trying to build a dictionary. In my experience it is not efficient use of volunteer time or resources to create lots of "sum-of-parts" terms: look at the current ongoing discussion around "flu jab", where it's been pointed out that you can have "ANYTHING jab" for any disease that gets a vaccine (e.g. "tetanus jab", "rubella jab"). Now you may be approaching this from the point of view of a TRANSLATOR (do you work as a translator?) where it's important to translate entire phrases into other entire idiomatic phrases. Currently we aren't really a translator's dictionary; that is a different thing from a general-purpose dictionary. Equinox 08:57, 24 September 2020 (UTC)[reply]
That's a sound argument, @Equinox. I like it. You're right: I do approach it more from the point-of-view of a translator. Aside from the fact that it wouldn't scare me to add all those terms for all named diseases, I can appreciate what you're saying. Your answer was really helpful :-) -- Dentonius (talk) 09:02, 24 September 2020 (UTC)[reply]
Sometimes it takes a good example, which starts to make sense, doesn't it? We don't want to create all types of entries for all types of jabs and we don't want to make entries for all types of workshops either, like automobile repair shop, refrigerator repair shops, computer repair shops, etc. --Anatoli T. (обсудить/вклад) 09:15, 24 September 2020 (UTC)[reply]
I see what you're saying Anatoli T.. I'm just trying to be funny now: but surely there aren't any "wooden chair repair shops", "pet turtle repair shops", etc. I would still assume that the entity in question must correspond to something in real life? But, yes, it makes sense. The line has to be drawn somewhere. Thanks again for your time and patience. ;-) -- Dentonius (talk) 09:20, 24 September 2020 (UTC)[reply]
I would like somebody to explain this to me: How does it take away from our project or diminish our standards if we allow all entries to be created which can be found in real world dictionaries? We're not being lemmings. The published dictionaries are the standard! - Dentonius (talk) 07:31, 24 September 2020 (UTC)[reply]
You say they are the standard. But as I stated above, sometimes other dictionaries even copy stuff from us. So sometimes we are the standard, sometimes we "lead the pack". If we cower in the background, always waiting for a "REAL" dictionary to do something before we can do it ourselves, then we will never amount to shit. Equinox 08:59, 24 September 2020 (UTC)[reply]
I agree with you 100% here, @Equinox. We shouldn't wait for "REAL" dictionaries to do stuff. I also absolutely believe you when you say that others have copied from us too. But, I didn't mean we should limit ourselves to their entries. I just thought it would be helpful to have a CFI which explicitly states, the other criteria for inclusion don't apply to published dictionary entries; i.e. we should have no reservations about adding terms from paper dictionaries. However, I saw what you said about us not being a translator's dictionary. It makes sense, especially with what you said about volunteer time. However, it's still a little sad that so many useful terms in published dictionaries will just go to waste or be relegated to a footnote just because. I'll get over it :-) -- Dentonius (talk) 09:10, 24 September 2020 (UTC)[reply]
No, it wouldn’t be helpful. Editors let themselves already be helped by them even if they aren’t mentioned there. We always should have reservations. There are bare wrong things in published dictionaries or other sources. But luckily we can be “better than professional lexicographers”. And for the inclusion matter it is rather that there are unfitting things in the published dictionaries: you ask “how does it take away from our project or diminish our standards if we allow all entries to be created.” The answer is that “an entry” in a paper dictionary is not like an entry at this website. Here one accesses words in a different fashion, not by alphabet browsing but by typing them into the search or another search engine, and on the other hand one entry bears the danger of the creation of similar entries on its example. Whereas for a paper dictionary you cannot even make out a distinction in whether something “is a lemma” or just a usage example and on the other hand the public cannot fit new words into it. When under مِثْقَال (miṯqāl) Wehr’s dictionary ”includes” مِثْقَال ذَرَّة (miṯqāl ḏarra, the weight of an atom, often negated to mean a trifling amount) this 1985 dictionary does not say anything about whether this word combination should be included as its own web page. And an English-Russian dictionary does not put brackets onto its translations to signify whether a Russian term is SOP (as they do not hyperlink). The “entries” here and there are incommensurable. That’s why we cannot parallel them formally, why one should “forget that they exist”, because the notion that they argue for what should be included here is a fallacy – they don’t. I never argue by other dictionaries for whether and how a term should be included – although surely I let myself be helped and guided by them. Fay Freak (talk) 12:10, 24 September 2020 (UTC)
I certainly agree that Wiktionary is lacking in many of the respects you say. However, I don't think the solution is just to create mainspace entries for everything. We need some sort of system to allow for collocations, which I have been begging for since I joined, but which doesn't have enough support to be implemented. Wiktionary sucks as a translation dictionary. I love it for everything else, but when I do translation work, it's next to useless. I really wish we did something about this, since it's well within our scope, but sadly, there doesn't seem to be sufficient interest. Andrew Sheedy (talk) 02:35, 27 September 2020 (UTC)[reply]

Where to lemmatize terms in these two reconstructed languages? -- 10:42, 25 September 2020 (UTC)[reply]

Global ban RFC for Nrcprm2026/James Salsman

Nrcprm2026, better known as James Salsman, has an active discussion regarding a possible global ban.--GZWDer (talk) 07:56, 26 September 2020 (UTC)[reply]

Create a category for past participles with an active sense

For example, depart reads

The past participle, departed, unlike that of the majority of English verbs, has an active, rather than a passive sense when used adjectivally: not even a legible inscription to record its departed greatness 

--Backinstadiums (talk) 08:22, 26 September 2020 (UTC)[reply]

Lemmatization of German superlatives

(Notifying Matthias Buchmeier, Kolmiel, -sche, Atitarev, Jberkel, Mahagaja): Currently we lemmatize German superlatives at forms ending in -sten. However, some dictionaries, e.g. dict.cc, appear to lemmatize at -ste, or in some cases at -st, e.g. kleinst. Are we sure our way is correct? There are certain words even in Wiktionary where the -ste form is given as the lemma, e.g. schwerste, aggressivste, and certain others in Wiktionary where the -st form is given as the lemma, e.g. kleinst, höchst, mindest. Are these last three properly adjectives at all? If so, what form are they and what do they mean? Benwing2 (talk) 22:56, 26 September 2020 (UTC)[reply]

@Benwing2: IMO, we should lemmatise by the -ste form, following the terms with superlative meanings like letzte and ordinal numbers erste, zweite, etc. It's not too intuitive and Duden online just lists forms erste/erster/erstes (and others) together in one article, so it's a matter of choice for dictionary makers. --Anatoli T. (обсудить/вклад) 01:55, 27 September 2020 (UTC)[reply]
I find -ste the least intuitive. -st makes sense because then the superlative lemma has as many endings as the lemma form of the positive. Fay Freak (talk) 02:05, 27 September 2020 (UTC)
@Fay Freak, Benwing2: I have no objection if they are lemmatised on -st. I've just checked - that's what Langenscheidt does. We need to move existing superlatives, ordinal numbers and words like letzte for consistency then. BTW, Oxford German uses -ster. --Anatoli T. (обсудить/вклад) 02:57, 27 September 2020 (UTC)[reply]
The -st forms aren't generally used in isolation. These words often appear after a definite article: "der/die/das kleinste". Why would we use non-words as a lemma form? I personally would prefer -ste (e.g. "kleinste"). But, looking at it again, the most consistent and self-explanatory form of them would be the "am ...-sten" form (e.g., "am kleinsten"). Your call, guys. Whatever you want is fine by me. -- Dentonius (talk) 07:11, 27 September 2020 (UTC)
Isolated? Well at least they virtually always exist because this is the adverb form, although predicative use is also possible – like with positive forms; they are just less likely as superlatives in general. Fay Freak (talk) 11:14, 27 September 2020 (UTC)[reply]

Hang on a second guys, aren't these superlatives all non-lemma forms? "kleinst", "kleinste", "am kleinsten", (whatever), would all be non-lemma forms. The only lemma form here is "klein". What discussion are we really having? Are you all trying to decide which adjectival forms should not have their own dictionary entries? -- Dentonius (talk) 07:19, 27 September 2020 (UTC)[reply]

We have already had some discussions on superlative lemmatization, I think for German and I think also for Latin. They resulted in Wiktionary:Votes/2018-07/Restructure comparative and superlative categories, which I think Benwing wants to execute now. I agree it would be easier to have all these forms under {{inflection of}}. I probably voted pro on that vote because the state before was even more chaotic. Fay Freak (talk) 11:14, 27 September 2020 (UTC)
@Dentonius Comparatives, superlatives, participles etc. are weird in that they're in some sense non-lemma forms, but also have non-lemma forms of their own. When we say "lemmatize" we mean the entry under which to put the {{superlative of}} template invocation, and hence the entry that goes under Category:German superlative adjectives. The remaining non-lemma forms go under Category:German superlative adjective forms. Currently, using -sten as the "lemma", a page like neusten goes both under Category:German superlative adjectives and Category:German superlative adjective forms, because it's both the "lemma" form of the superlative and several non-lemma forms (see the entry for the full list). We don't currently (but easily could) put a declension table under the "lemma" form of the comparative and superlative. This is currently done for Latin superlatives such as optimus (best), as well as for Russian participles such as де́лающий (délajuščij, doing) and де́ланный (délannyj, done). IMO putting the superlative lemma under -sten is strange because it's actually the weak dative singular (coming from the expression am neusten), rather than something more expected like the strong nominative singular (in -ster), weak nominative singular (in -ste) or the bare stem (in -st). Benwing2 (talk) 17:52, 27 September 2020 (UTC)[reply]
BTW IMO there is precedent for putting the lemma under -st even if it's not an adjective form; it's the bare stem, and we do the same for nouns and adjectives in Sanskrit. Benwing2 (talk) 17:53, 27 September 2020 (UTC)[reply]
@Benwing2, re: stem (-st) for superlative lemma. Further up, it was said that most even exist as words. The inflections in all the cases share those stems. Seems like you know what to do, then. Your plan sounds great. We should use "-st". Does anybody object? -- Dentonius (talk) 18:30, 27 September 2020 (UTC)[reply]
That's OK for me, as well as the other options, as long as it's done consistently. After all, other Wiktionaries (like the German) don't even lemmatize them. Matthias Buchmeier (talk) 05:41, 28 September 2020 (UTC)
-st seems to be a sane option. – Jberkel 09:11, 28 September 2020 (UTC)[reply]
It took me a minute but I tracked down the discussions this reminded me of, Wiktionary:Tea room/2019/May#erste_vs._erster, User talk:-sche/Archive/2015#German_ordinal_numbers. One could envy the Duden being able to sidestep this question by putting all the forms on one page. If we were to put these on kleinst shouldn't we reconsider our move to make dritte rather than dritt the lemma? Bah. I may try and ponder what form might have the fewest downsides, and see what other authorities suggest, later. - -sche (discuss) 11:09, 28 September 2020 (UTC)[reply]
It may appear that unlike with kleinst, derbst etc., the ordinal superlatives need to be put at erstens, zweitens – also anderns –, drittens, letztens, and also bestens, because these are the predicative and adverbial forms. Then links and rechts. And oben, unten should have adjective entries for what is at oberer, unterer? This is however not at all intuitive. vorkantscher is one of the well-known cases of syncope of schwa that can only appear in the inflected forms (like andrer, eigner etc.) and should have vorkantisch as its main form, similar to papieren from -en#German Etymology 3 – the form papiern (without endings), in spite of its usage (search "papiern" klingt, or geklungen etc.), is strange and artificial, but correctly now placed as “alternative form of papieren”. Fay Freak (talk) 14:43, 28 September 2020 (UTC)

Please undo page move

Can someone undo this page move?  --Lambiam 23:40, 26 September 2020 (UTC)[reply]

  Done Equinox 16:24, 27 September 2020 (UTC)[reply]

Sundanese romanizations

I discovered this problem several months ago. The category Sundanese romanizations contains around 439 Sundanese entries labeled as a "romanization" of the equivalent term in Sundanese script. The thing is, Sundanese script is not widely used by the Sundanese themselves. They (or I would rather say: we) mainly use the Latin script to write Sundanese in everyday life. It is used in Sundanese education from primary school through university, in magazines, on public signs and announcements, and even on Sundanese Wikimedia projects (w:su:, su:). Although some schools teach Sundanese script in their curriculum, we don't actually use it often. We use the script mainly for decorative purposes, like on road signs or in the Sundanese Wiktionary logo, but not for actual text. I found that the entries were the work of one user, who last edited in February last year. The terms they edited perhaps had the correct definition in the Latin-script entries before it was moved to the entry in Sundanese script. Consider an example: manusa before and after. They seem to have also made similar "romanization" edits for Balinese and Javanese, both of which are also written chiefly in Latin script in Indonesia. They also discussed using the script for the Sundanese Wiktionary in 2015, to which another user (he also happens to be an administrator on the Sundanese Wikipedia that I know) explained that the script wasn't widely used. Would it be possible for the Sundanese definitions in the Latin-script entries to be restored, with an "Alternative form" explanation added for the Sundanese script, similar to Malay Latin v. Jawi? I have made one for manusa and ᮙᮔᮥᮞ. RXerself (talk) 11:57, 27 September 2020 (UTC)[reply]

Hello, I request that my name be added to the list. The task I'm looking at right now is to do some mass nominations of pages at Special:BrokenRedirects, but there may be some other edits I could do with the tool in the future. I have AWB rights at commonswiki, dewiki (although there every patroller may add himself to the list), enwiki, metawiki, and ruwiki.

And while you're at it, you could replace "character references are generally discourage" by "character references are generally discouraged" in the page content. 1234qwer1234qwer4 (talk) 14:26, 27 September 2020 (UTC)[reply]

What's the point? You're not doing anything with the entries, you're just moving them from one page to another. An admin is still going to have to check them and delete them- something they could do while the entries are still at Special:BrokenRedirects.
The main difference is that an important maintenance category is suddenly going to be filled with hundreds of entries that no admin has considered important enough to spend their time clearing out. The speedy deletion category should be reserved mostly for things that need immediate attention, especially vandalism and spam. Chuck Entz (talk) 16:18, 27 September 2020 (UTC)[reply]
@Chuck Entz Well, I thought I would go through the list and retarget what could be retargeted, while nominating what could not. Maybe it's not the best way though, and I should just leave everything to be deleted there. 1234qwer1234qwer4 (talk) 16:45, 27 September 2020 (UTC)[reply]

Kurdish

I notice we have over 2000 "Kurdish" lemmas. Unfortunately Kurdish is not a single language but a set of related languages. Does anyone have enough knowledge of these languages to be able to determine whether these are predominantly or exclusively in one language (e.g. Kurmanji)? If so maybe they can be split or moved in a semi-automated fashion. Benwing2 (talk) 04:38, 28 September 2020 (UTC)[reply]

I expect them to be (almost) exclusively Kurmanji terms. The Kurdi Wiktionary seems also to be a Kurmanji Wiktionary: Kategorî:Kurdî is about Kurmanji, while Kategorî:Soranî is about Sorani lemmas. A bot might be able to check the L2 language over there. I guess Pehlewani terms are written in the Persian alphabet. The Kurdi Wiktionary has three members in Kategorî:Pehlewî, which are written in the Latin alphabet, but two of which are represented by their consonants only.  --Lambiam 13:50, 28 September 2020 (UTC)[reply]

@Calak, Vahagn Petrosyan. 212.224.227.44 14:48, 28 September 2020 (UTC)[reply]

They cannot be split in an automatic fashion. Only a native speaker like User:Calak can split them by going over the lemmas one by one. And even in that case there will remain some lemmas found only in {{R:ku:Justi}}, which does not distinguish between Kurdish dialects. --Vahag (talk) 18:16, 28 September 2020 (UTC)[reply]
Can we trust the language categorization of the Kurdish Wiktionary?  --Lambiam 18:35, 29 September 2020 (UTC)[reply]
I don't know if Kurdish Wiktionary is reliable. --Vahag (talk) 05:23, 30 September 2020 (UTC)[reply]
Hi @Vahagn Petrosyan, Lambiam. I have been meaning to talk about this here. Thank you for starting this discussion. I'm an admin on Kurdish Wiktionary. I've been working on ku.wikt for almost a year now, and what I can tell you is that on ku.wikt we separate Kurmanji and Soranî words. But all Kurmanji words are categorized as ku:Category:Kurdî (Kurdish) while all Sorani words go to ku:Kategory:Soranî. If a word is used in both Kurmanji and Sorani, in that case we use kmr as the language separator just out of respect for Sorani. Do you have specific policies for Kurdish? If not, then what I can propose is that it'd be better if Category:Kurdish language did not have any lemmas in it but just links to Northern Kurdish (Kurmanji - kmr), Central Kurdish (soranî - ckb) and Southern Kurdish (kurdiya başûrî - sdh), like an umbrella.--Balyozxane (talk) 05:24, 3 October 2020 (UTC)[reply]
@Balyozxane Hi. Thanks very much for your comments. I completely agree that all lemmas in Category:Kurdish language should be moved to the appropriate language variety they actually belong to. I might be able to do this by bot provided there's a reliable way of identifying which variety a given lemma belongs to. I looked at about 10 nouns randomly chosen and all appear to be Kurmanji. One way to do this is if you could go through all the Kurdish lemmas here at enwikt and identify the ones that are *not* Kurmanji, since that should presumably be a small minority. Perhaps it's also possible to look up the words on the Kurdish Wiktionary and see how they're categorized. Benwing2 (talk) 03:53, 4 October 2020 (UTC)[reply]
@Benwing2 Hi. I got a list of all Kurmanji pages from ku.wikt and compared it with the en.wikt list using the AWB list comparer; there are 343 pages unique to en.wikt, which are listed here: ku:Bikarhêner:Balyozxane/list. I guess if I only check this list, it would be safe to assume all the others are Kurmanji words. Should I check this list only or do you want an actual thorough check? --Balyozxane (talk) 07:54, 4 October 2020 (UTC)[reply]
@Balyozxane I think just checking those 343 entries should be enough. One thing you might want to do if you can is look at all the Sorani pages from ku.wikt and see how many overlap with the pages here on en.wikt; any pages that are listed as both Kurmanji and Sorani lemmas probably need postprocessing by hand after I move them to be Kurmanji lemmas. Benwing2 (talk) 19:58, 4 October 2020 (UTC)[reply]
@Benwing2 I checked that list and there are some where I'm not sure if they are actually Kurmanji or some other Kurdish language. Also, I am not sure about the Arabic script entries. You can move all the Latin script ones to Northern Kurdish except for these: ren, spar, tab, Tiwana, varsin, xwigî. I can't find any info on them.--Balyozxane (talk) 12:14, 5 October 2020 (UTC)[reply]
@Balyozxane, Calak, Şêr Just FYI, I am cleaning up the Kurdish templates in preparation for splitting the Kurdish entries into Northern Kurdish and Central Kurdish (in practice this mostly means renaming Kurdish -> Northern Kurdish). User:Calak has already done this for several lemmas; examples are Northern Kurdish file and Central Kurdish مازوو (mazû). Hence we have 86 Northern Kurdish lemmas, 334 Central Kurdish lemmas and 48 Southern Kurdish lemmas, as well as around 2000 "Kurdish" lemmas, almost all of which are actually Northern Kurdish (not to mention Laki, Zakaki, and Gorani, which for some reason we call "Gurani"). (Strangely, of the 86 Northern Kurdish lemmas, 7 are in Cyrillic, 7 in Arabic script and one in Armenian script. Meanwhile, of the 334 Central Kurdish lemmas, 12 are in the Latin script, and of the Southern Kurdish lemmas, 22 are in Latin script. I suspect we need to move the Northern Kurdish Arabic script terms to Latin script, and vice-versa for the Central and Southern Kurdish Latin lemmas.) BTW I think the approach followed by User:Calak of using {{ku-regional}} to point to the corresponding lemmas in the other languages is the right approach. Benwing2 (talk) 00:26, 7 October 2020 (UTC)[reply]
I will normalize the Cyrillic and Armenian script lemmas to Latin. Can't do that for the Arabic. --Vahag (talk) 08:00, 7 October 2020 (UTC)[reply]
@Vahagn Petrosyan: Thanks! Benwing2 (talk) 13:23, 7 October 2020 (UTC)[reply]
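
For anyone wanting to repeat the list comparison described above without AWB: it amounts to a set difference (en.wikt "Kurdish" titles with no Kurmanji page on ku.wikt) plus a set intersection (titles that also appear in the ku.wikt Sorani list and so may need hand postprocessing). The following is only a minimal sketch in Python, assuming both lists have already been exported as plain page titles. The function names and the sample titles are hypothetical placeholders, not real data from either wiki.

```python
def titles_needing_review(enwikt_kurdish, kuwikt_kurmanji):
    """Return en.wikt titles that are absent from the ku.wikt Kurmanji list."""
    # Set difference: everything here that has no Kurmanji counterpart there.
    return sorted(set(enwikt_kurdish) - set(kuwikt_kurmanji))


def titles_in_both(enwikt_kurdish, kuwikt_sorani):
    """Return en.wikt titles that also appear in the ku.wikt Sorani list."""
    # Set intersection: candidates for manual checking after a bot move.
    return sorted(set(enwikt_kurdish) & set(kuwikt_sorani))


if __name__ == "__main__":
    # Placeholder titles only; real lists would come from dumps or exports.
    enwikt = ["alpha", "beta", "gamma", "delta"]
    kurmanji = ["alpha", "beta", "epsilon"]
    sorani = ["beta", "zeta"]
    print(titles_needing_review(enwikt, kurmanji))  # titles to check by hand
    print(titles_in_both(enwikt, sorani))           # titles listed as both
```

Sorting the results just makes the output stable enough to paste into a user subpage; it has no bearing on which titles are selected.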

Wiki of functions naming contest

21:16, 29 September 2020 (UTC)

I would like to suggest WhatTheFunc. Equinox 21:23, 29 September 2020 (UTC)[reply]
Da Func. – Jberkel 22:02, 29 September 2020 (UTC)[reply]
Wiki.Exe. – Jberkel 22:05, 29 September 2020 (UTC)[reply]
The winner is gonna be Wikifunction(s) anyway. --Daleusher (talk) 23:44, 29 September 2020 (UTC)[reply]
I said that before looking at the vote. And it abbvs to WF, WIAAGT. --Daleusher (talk) 23:45, 29 September 2020 (UTC)[reply]
Or Wikilambda. – Jberkel 08:14, 30 September 2020 (UTC)[reply]

Page layout: separation between languages

Is anyone especially attached to the present layout whereby language sections terminate with a horizontal rule? I would get rid of this. For example, as regards the separation between "English" and "Dutch" sections, I prefer this to this. I don't see the point of the extra horizontal rule: to my eye it looks almost as if something has gone wrong. Mihia (talk) 21:59, 29 September 2020 (UTC)[reply]

Yes, I think it's probably time to do this- it wouldn't be very difficult, just a lot of pages to edit. DTLHS (talk) 22:03, 29 September 2020 (UTC)[reply]
Could it be automated? Mihia (talk) 22:03, 29 September 2020 (UTC)[reply]
Yes, easily. DTLHS (talk) 22:26, 29 September 2020 (UTC)[reply]
This was discussed before somewhere just recently, and someone pointed out (and I agree) that having the ---- is good for making the wikitext itself more legible (for humans). A suggestion was made to simply make the ---- not display, either by default for all users or via preferences; I believe this is easy to do via css? - -sche (discuss) 00:14, 30 September 2020 (UTC)[reply]
OK, right, I didn't see that other discussion. I wonder, if we made it not display, would we then get too much vertical whitespace? Also, would it suppress horizontal rules everywhere, including other unrelated uses that we might want or need to preserve? If we were to make it not display, I think it should be the default, otherwise probably only 0.0001% of Wiktionary users would ever be aware. Another option would be to use an inline HTML comment <!-------------------->. Simply replacing the existing "------" with an inline comment does also create too much vertical whitespace, but if one newline is also deleted then it seems to look OK. Mihia (talk) 11:23, 30 September 2020 (UTC)[reply]
@-sche: Do you remember the location of the recent discussion that you refer to? I found this, this and this, but none are what you could call recent. Mihia (talk) 09:59, 6 October 2020 (UTC)[reply]
Wiktionary:Grease_pit/2020/September#Line_gone_missing, where DCDuring points out the use in making the wikitext more legible. If, as I infer from that thread, the line can be made visible or invisible by css, then I think we could make this a pref / provide css people could use to opt in to one display or the other. - -sche (discuss) 03:01, 7 October 2020 (UTC)[reply]
I used my browser's element inspector to examine both the line between sections and the one under the L2 header. The line between is a simple <hr> tag, while the other one is a bottom border for the <h2></h2> surrounding the header text, added by css. If the <hr> tag can be hidden via css, that's one option. The other option would be to have the css add a top border to the <h2></h2> to match the bottom border, and get rid of the <hr> tag altogether. I've never worked with css or js, but from what I've read, I'm pretty sure it would be fairly simple for js to change the css from specifying a top border to not specifying it, or vice versa, and I'm sure the distance between the top border and the text could be adjusted to make the top border look exactly the same as the current no top border with a preceding <hr> tag. The big difference is that getting rid of the <hr> tag created by the "----" would mean having a bot edit hundreds of thousands, if not millions of entries- a major project. Chuck Entz (talk) 04:20, 7 October 2020 (UTC)[reply]
Why would it be a major project? It seems to me, indeed as I think DTLHS says above, that it should be simple to write a program to automatically remove the "-----" lines from the appropriate places. It might take a long time to run, yes, but presumably it could be left chugging away in the background, not needing any human effort or intervention. Mihia (talk) 14:07, 7 October 2020 (UTC)[reply]
I personally like the current style way more, so it seems to be a matter of opinion. Does anyone know the original reason of adding ----? Thadh (talk) 20:08, 1 October 2020 (UTC)[reply]
According to someone here, "it was added for the visual effect". Mihia (talk) 13:43, 6 October 2020 (UTC)[reply]
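
For what it's worth, the automated removal discussed above could look roughly like the following. This is only a minimal sketch in Python that operates on a wikitext string. The function name `strip_language_rules` is made up for illustration, and it assumes the separator is a line consisting solely of four or more hyphens whose next non-blank line is a level-2 (language) heading. Fetching and saving pages by bot is deliberately left out.

```python
import re

# A line consisting only of four or more hyphens (the wikitext horizontal rule).
RULE = re.compile(r"^-{4,}\s*$")
# A level-2 (language) heading such as "==Dutch==", but not "===Noun===".
L2_HEADING = re.compile(r"^==[^=](?:.*[^=])?==\s*$")


def strip_language_rules(wikitext: str) -> str:
    """Remove '----' lines that directly precede a level-2 heading."""
    lines = wikitext.split("\n")
    kept = []
    for i, line in enumerate(lines):
        if RULE.match(line):
            # Look ahead past blank lines to the next non-blank line.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines) and L2_HEADING.match(lines[j]):
                # If blank lines sit on both sides of the rule, drop the one
                # before it so the blank lines do not pile up.
                if kept and not kept[-1].strip() and j > i + 1:
                    kept.pop()
                continue  # skip the rule itself
        kept.append(line)
    return "\n".join(kept)
```

Because the rule is only dropped when a language heading follows, horizontal rules used for other purposes inside an entry are left alone, which touches on the concern about unrelated uses raised above.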

ck vs. kk in archaic German

In an entomological paper published in 1803 I find "Die Fühlhörner vorgestrekkt" (meaning, approximately, "antennae elongated"). I assume this is the past participle of the verb now spelled vorstrecken. Is the change from kk to ck the result of one of the German spelling reforms? Is it a general rule that ck used to be spelled kk, or only true for some words? Vox Sciurorum (talk) 23:15, 29 September 2020 (UTC)[reply]

One can see vorgestreckt in papers from at least the 1600s through the present day, and one can find ck in other words, e.g. Zucker, going all the way back to Old High German. (In other cases, something like modern blicken goes back to MHG blicken goes back to OHG blicchen with cch.) So, it's not that ck used to be spelled kk by everyone (either in that word or in general), but rather, spelling used to be more variable and some authors preferred to use kk where other authors use ck. (Other obsolete tendencies include the use of ck in places where we now only accept k, like dunckel and türckisch.) I'm not sure whether some spelling reform specifically deprecated things like vorgestrekkt and dunckel or if they just fell out of use. Interestingly, although ck has long been the "normal" or "normalized" form, kk was in some respects treated as the "underlying" form for a long time inasmuch as ck used to be broken across line breaks as k-k, e.g. Zucker became Zuk-ker and blicken became blik-ken, until the recent reform to use Zu-cker and bli-cken. - -sche (discuss) 01:06, 30 September 2020 (UTC)[reply]
Sometimes it goes the other way, too. In 1909, Rudolf Thurneysen's Handbuch des Altirischen uses the spelling Ackusativ for what most people (both then and now) spell Akkusativ. —Mahāgaja · talk 06:53, 30 September 2020 (UTC)[reply]

Black speech and writing that's not AAVE

User:Lambiam and I have discussed this before, e.g. last May, and regarding Caucasity, but: Wiktionary has various words that are used in, or represent, "Black speech or writing" (sometimes, but not always, African American), which don't belong to the specific lect of AAVE, being used equally by e.g. black professors who speak standard English, and/or black speakers outside the US. Examples include Caucasity and bye, Felicia (used by black Americans who don't use AAVE, and judging by twitter searches for Caucasity + various Commonwealth spellings like colour, also by black people outside the US, as well as non-black people) and Afrikan (a spelling to connote black power, by speakers of otherwise-standard English and even outside the US). (There are also things like ebery and for#Particle which seem intended to represent black speech, not necessarily just in the US or just AAVE, and which may or may not be used by those speakers; compare various -ee words used to represent Chinese English.) What should we label these things? "African American English" for the AA-but-not-AAVE things? But what about when the terms are used equally by black Britons, like Caucasity seems to be? - -sche (discuss) 00:10, 30 September 2020 (UTC)[reply]

I’d like to know that, too, since Multicultural London English is needlessly restricted to London. Features of the phonology and lexicon of this sociolect, not to say the majority of them, exist well in many cities of England like Nottingham and even Liverpool – one talks about Scouse and does not mention or name the modern developments, as if one portrayed the indigenous London accent merely as the now rarified Cockney. And Dublin. There is clearly a thing like Multicultural Dublin English, though as of now the term exists only in my private language. It’s also the so-called New Dublin English of Hiberno-English, but it seems even newer and also has that multiculturality. Pinging for this discussion @AdamBMorgan who enriched us with many MLE entries. Fay Freak (talk) 00:43, 30 September 2020 (UTC)[reply]
You shouldn't assume that "Multicultural London English" is only for Londoners, any more than "English" is only spoken in England. Of course English is spoken outside England and MLE perhaps outside London. Equinox 03:02, 30 September 2020 (UTC)[reply]
Oh, I don’t, I know it is copied, but my supreme knowledge does not hinder readers to make such assumptions and to fall prey to wrong associations. So far the “London” in “Multicultural London English” does mean London more than “English” means England. And it’s probably not that true either that this English, or English altogether, has spread circularly from London, like the inflammation from a tick bite, as you make it appear. This a maiore ad minus simile is bare wet. Fay Freak (talk) 04:30, 30 September 2020 (UTC)[reply]
I have seen some expressions credited to "black Twitter", another case where the underlying dialect is more or less standard English but the vocabulary is specialized. The one that comes to mind is snacc. Vox Sciurorum (talk) 14:59, 1 October 2020 (UTC)[reply]
As an aside, I suspect snacc should be made a mere alt form of snack, a spelling this sense also occurs in (more commonly and not just from black tweeters, in my anecdotal experience—I also see it used by white British and American women—though I'd try to assess relative frequency more thoroughly before assuming my experience was representative), like how thicc is but an alt form of thick. Obviously, the word could still be labelled. - -sche (discuss) 23:24, 1 October 2020 (UTC)[reply]
The one common assumption here seems to be that the words are used mostly by Black people when they are speaking English. So why not call it "Black English" (or African American English if a term is limited to the US) and, as appropriate, indicate that the origin is AAVE? SteveGat (talk) 15:38, 1 October 2020 (UTC)[reply]
I think that'd be a fine label ("Black English"), and it does seem to be used in literature. "African American English" (which could be used for US-specific words) is also used in literature, although sometimes as a synonym of AAVE, it seems, which would mean we weren't really solving the problem of AAVE being the wrong label (for some words); we could go with "Black American English" instead. Anyone have objections or other suggestions? - -sche (discuss) 23:24, 1 October 2020 (UTC)[reply]
I hereby record my objection to the labels "Black English" and "Black American English": categorising varieties of English by geography makes sense. Categorising varieties of standard English by ethnicity and race reveals an implicit bias towards "White English" as the standard. Words coined and used by Blacks speaking standard English are still standard English and need no special classification. -- Dentonius (talk) 22:30, 2 October 2020 (UTC)[reply]
When a word has limited use we try to mark it as such. It could be limited by register (e.g. informal), region (Yorkshire or America), or even sexual orientation (Polari). You could use a Yorkshire word without also obeying the Northern Subject Rule, and you could use one of the words we're discussing here without the grammar changes of AAVE relative to standard American English. Vox Sciurorum (talk) 19:21, 3 October 2020 (UTC)[reply]
As a follow-up, I added an "African-American" label separate from AAVE, finding that some entries, like ahun, already used it (even before my edit). - -sche (discuss) 21:06, 22 November 2020 (UTC)[reply]

Why are there so many N-word entries? Can't we consolidate them under the N-word page?

The N-word is just about the most offensive word in the English language. There are quite a few people here who get off on creating racist, ethnic slur entries. Many of these are clearly sum of parts, so they look for obscure references from bygone times to justify their presence here. I don't see the reason why niggerball, niggerfaggot, nigger killer (talk), nigger toe, nigger rich, etc. should all have their own page. Put them all on the nigger page. If racists are trying to document their own despicable past, they don't need so many pages to do so. We, people of colour, already know why racists are the way they are and that they love to come up with new words all the time to help them feel better about themselves and to put down people who don't look like them. If your objective is to piss off Wiktionarians of colour, then I'd say that's very antagonistic. I'm sick of seeing all these useless terms in this dictionary while so many good expressions are rejected as "sum of parts." They say this is not a translator's dictionary yet you want to waste volunteer effort on this garbage? What I really want to know is: can we consolidate racist terms under their racist roots? It really isn't necessary to have all these useless, archaic entries which you people damn well know you can't even say in real life outside your circle for fear of losing your jobs, among other things. Your help would be appreciated. -- Dentonius (talk) 07:53, 30 September 2020 (UTC)[reply]

niggerball (a large black sweet; basketball), nigger toe (brazil nut), nigger killer (scorpion; potato; rum) don't seem like sum of parts. —Suzukaze-c (talk) 08:02, 30 September 2020 (UTC)[reply]
Do you people even see yourselves? -- Dentonius (talk) 08:04, 30 September 2020 (UTC)[reply]
It's an unsavory past that exists.
"all these useless, archaic entries" — "useless" is subjective, and "archaic" is not a rationale for exclusion. —Suzukaze-c (talk) 08:07, 30 September 2020 (UTC)[reply]
Without discussing whether these terms fulfil the CFI, could we as a community decide that we don't want so many ethnic slur derivatives and just consolidate them under their roots? How could we go about doing that? Is this something which could be voted on, for example? -- Dentonius (talk) 08:13, 30 September 2020 (UTC)[reply]
@Dentonius: I really can't imagine how "consolidation under roots" would be done. I can only imagine it being awkward, like trying to shoehorn firefighter under fire. These compounds have a unique meaning only when their components come together, and except for the offensiveness of the roots, it is just a normal compound otherwise. —Suzukaze-c (talk) 05:54, 1 October 2020 (UTC)[reply]
@Suzukaze-c. Normal for you. I see. It can be done. But I understand your point of view and respect your right to have it. -- Dentonius (talk) 06:09, 1 October 2020 (UTC)[reply]
@Dentonius: Nice assumptions. You don't know me. I want to know more about your game plan. —Suzukaze-c (talk) 06:10, 1 October 2020 (UTC)[reply]
@Suzukaze-c:, my game plan? Wow. I'll be here for a while. We'll have plenty of opportunities to get to know each other. -- Dentonius (talk) 06:12, 1 October 2020 (UTC)[reply]

Just to explain what I mean by consolidating racist derivative terms under their roots: If there are Wiktionarians who strongly feel that their N-word a, N-word b, N-word c, should be documented, why don't they just provide them as examples under the N-word page? Is this something which could be voted on (and not even just for the N-word but for ethnic slur derived terms, in general)? There probably aren't many Wiktionarians of colour but it disgusts us to see how some people here talk about and rationalise all of this. For them, it's fun and games. For us, it's not. -- Dentonius (talk) 08:25, 30 September 2020 (UTC)[reply]

I like the idea of consolidating these entries, they often provide very little lexical information, and are frequent targets of vandalism. As a side-effect, this would also neatly address concerns regarding search suggestions (phab:T263818). Where would that information go, though? In an appendix, or a thesaurus page? Transcluded in the main entry? – Jberkel 10:06, 30 September 2020 (UTC)[reply]
@Jberkel, thanks for the feedback. Your comment was a breath of fresh air. - Dentonius (talk) 04:25, 1 October 2020 (UTC)[reply]
Our policy is to split derived forms onto separate pages based on spelling. Having a separate rule for terms derived from certain offensive words would cause difficulty. Some relevant reading: Hauptfleisch, D. C. 1993. "Racist language in society and in dictionaries: a pragmatic perspective." Lexikos 3, DOI 10.5788/3-1-1102. His conclusion: "Racist items should be included in the larger dictionaries but excluded from the smaller ones, such as school dictionaries." The paper is full of examples of bad words, both directly bad like kaffir and tainted by etymology like kafferkraai. When I first read the paper I thought, "jackpot!" (So many words to add!) I ended up not adding many of them because finding supporting evidence of meaning and use was too much work. We don't (or shouldn't) add words just because they appear on a list. Vox Sciurorum (talk) 13:42, 30 September 2020 (UTC)[reply]
@Vox Sciurorum, I agree especially with the last sentence.

Whatever we do, offensiveness alone absolutely must not be made a reason for deletion. Mihia (talk) 11:31, 30 September 2020 (UTC)[reply]

Hi, @Mihia. I'm all for documenting the way the enslavers and some of their descendants spoke and, to this day, still speak. I don't think we need to hide that. (As an aside, why do they have so many vulgar words to describe people of colour? I can only think of a few slurs which work the other way around and they're all very tame by comparison. One of them would even go great with tea.) Our grandchildren and their grandchildren need to know what kind of people we shared space on this planet with. But if we care anything about other Wiktionarians (and our readers) and how it impacts them, we would document these terms while minimising their negative reach and impact. One way to do that is to just limit those words and their derived terms to their root pages. Then, the racists can create all the spin-off terms they want on the root pages of those words. And let's be realistic, you can add the N-word to just about anything: next we'll have pages for N-word beer, N-word car, N-word house, N-word food, ... Now, for the "clever" racist reading this who wants to reply that we evaluate terms on a case by case basis and that we can't apply any slippery slope argumentation, save your breath. -- Dentonius (talk) 14:03, 30 September 2020 (UTC)[reply]
To answer the original question: we have so many words derived from nigger because there is attestable use of them.
Consolidating all the derived terms at nigger is not consistent with our practice with respect to terms derived from other words. It would require us to have some new set of practices with regard to RfVing derived terms that did not have their own entries.
I don't think any of the ethnic slur and other offensive terms are much fun to work on for most of us. But nigger and its derived terms are all part of history and appear in many writings about or set in the past. Therefore they are part of our language.
Wiktionary attempts to be, among other things, a historical dictionary. It is part of our identity to record all aspects of language so that all kinds of writings remain intelligible to modern and future readers. Just as WMF has not subordinated its core purpose to environmental advocacy, despite the assertions that 'if the planet doesn't survive, nothing of WMF will matter', so we maintain our devotion to our core purpose. DCDuring (talk) 13:58, 30 September 2020 (UTC)[reply]
@DCDuring, re: "It would require us to have some new set of practices [...]" Maybe we do need some new practices. Does this change scare you? Seriously, we all get how racist some people can be. You really don't need to remind us. Restrict those pages to their root words. You all speak about your high-minded ideals and "editorial practices" because it doesn't affect you personally because, surprise surprise, the ones defending it are white. If you were somehow able to put yourselves in the shoes of the people of colour reading this, you wouldn't argue that way. - Dentonius (talk) 14:11, 30 September 2020 (UTC)[reply]
Do you have some suggestions that are consistent with us being a historical dictionary and with lexicography?
Eliminating derived terms would eliminate many, many terms. They ARE part of the language. For us to have a platform for doing RfV for those terms, we would need to make the entry for each "root" term much larger. Our larger pages already load slowly. Perhaps you could help by solving the technical problems. DCDuring (talk) 17:41, 30 September 2020 (UTC)[reply]
@DCDuring, now give me one reason why I would care how quickly the N-word page loads? Add all these N-word derived terms there and put them as examples on the N-word page. They don't deserve more than that. I'm surprised that the racists haven't added nigger cock (talk). That's the one they're afraid of. -- Dentonius (talk) 19:23, 30 September 2020 (UTC)[reply]
Let me help you all to beat the CFI on this one. Since nigger cock (talk) is an SOP, if you want to get around it go dig up some comments by some hick in some book, probably a British book, which you can use to prove that it's a black rooster. -- Dentonius (talk) 06:41, 1 October 2020 (UTC)[reply]
While there would be some poetic justice to forcing all the racist terms off into some ghetto entry, it wouldn't accomplish much, and it would tie the dictionary up in knots. The unfortunate truth is that human society is and always has been full of racism, and human language reflects that. As a Wikimedia project, Wiktionary is not censored, and Wiktionary adopts a neutral point of view. Specific to Wiktionary is that we are a descriptive dictionary: we describe the language as it is and was, not how it should be. I'll have more to say about this when I have time, but for now: we no doubt do have a core of white regulars, but it's impossible to say precisely the race of many of our contributors, because it simply hasn't come up- nor should it. I know we have Jewish editors who calmly discuss some of the vilest antisemitic slurs ever recorded, and I don't doubt we have Muslims as well. We have at least one transgender person, we have a number of people from various south and east Asian ethnicities. I would say that plenty of our regulars have really awful ethnic slurs in our dictionary that are applicable to them. I understand why you're angry, though, as a Protestant of western European ancestry, I can never really know what it's like to be in your shoes. You certainly have a lot to be angry about, so I won't try to patronize you. The problem is that the N-word is just the tip of the iceberg. Anything in English that references Jews or Muslims or people of color that's more than a century or two old is likely to have some racist overtones, and many are full-out racist. Then you get into conflicts like at two-spirit where one minority is pitted against another, and queer, where a term is simultaneously a slur against a group and the preferred term for that group among some of them. Your solution would be a feel-good bandaid that would leave huge areas of offensiveness unaddressed. Chuck Entz (talk) 15:30, 30 September 2020 (UTC)[reply]
@Chuck Entz, thanks for the feedback. I (sincerely) like your way with words. The language was smooth and you know how to paint a picture for the reader. If you wrote a book, I would probably enjoy reading it. I got the impression that in real life you're a liberal (in my view, a good thing) as opposed to the others here who seem to be rather conservative. Alas, you're still on their side (which is okay. You have your position.) There is a major difference, however. You present with empathy whereas several others here seem incapable of that. Bravo. I respect that. -- Dentonius (talk) 04:37, 1 October 2020 (UTC)[reply]
I vehemently oppose throwing our goals and principles down the drain to avoid some potential offense. This hiding of information behind a more difficult search procedure (read: censorship) goes against the base goal of Wiktionary which is to include terms “that someone would run across it and want to know what it means.” I see no reason to hide away real terms that meet all of our inclusion criteria.
Concerning some of your points:
  • Wiktionarians creating these entries are racists themselves: this is a very serious accusation. Do you have any evidence? That they took the time to add lexicographic content to a dictionary proves nothing.
  • These terms are obsolete and useless: then they are in the good company of tens of thousands of other terms that are obsolete or useless and we include regardless.
  • These terms are garbage: yes, but they are terms. As a dictionary we list terms, garbage or otherwise.
Ultimately, I don’t think anyone is harmed by the existence of these pages that didn’t set out to be harmed: if they didn’t know that the term is offensive, they will benefit from learning that it is; if they knew it was offensive and looked it up anyway to complain about being offended, they will not stop at hiding the information on another page.
Wiktionarian emeritus Martin Gardner once wrote a simple sentence that stuck with me, and forever changed how I approach inclusion. In the original context, it was about prescriptivism, but I think it also applies to censorship: “If you don't want to know what a word means, simply do not look it up.”
Oh, and I know that the same accusation you’ve been levelling against other disagreers is coming my way. To that I reply: no I’m not, and I will not support the censorship of any other class of offensive terms — including those that target me. — Ungoliant (falai) 15:42, 30 September 2020 (UTC)[reply]
Eventually we're going to have to change the name from Wiktionary to Niggtionary when the racists get done creating duplicate versions of all the entries here prefixed with the N-word. And, believe me, these words will be attested. Their foulmouthed ancestors made sure to put all their nasty ways of expressing themselves down in writing. How silly of me to expect you people to empathise with us and our problems. Just for once, I'd like to see white people defending the right things. You all make it seem as if we're talking about changing the constitution. This is a malleable electronic platform. We don't have to accept any of this. And if enough of you actually cared, we could change it. Too many of you are comfortable with racism because you've never experienced it in your lives (and never will) and it shows. -- Dentonius (talk) 16:06, 30 September 2020 (UTC)[reply]
"You people"? C'mon, Dentonius, you don't even know if Ungoliant is white — all we know is that he's Brazilian. (By the way, I have created entries offensive to my people or used in the context of my people's genocide, and I assure you I did not "get off" on creating them like you claimed.) —Μετάknowledgediscuss/deeds 17:54, 30 September 2020 (UTC)[reply]
My apologies, @Metaknowledge. I normally ping the people I'm directly addressing, but the indentation might have given the impression that I was speaking directly to Ungoliant. Those who I was referring to as "you people" know themselves. I don't need to elaborate on that. Now, to get down to the meat of the matter and my original question: I know that there are hundreds of people here who do not sympathise with my concern. I know that Wiktionary's policies, as they stand, reflect the heart and character of its "core members." I know that many here feel that they're serving some higher cause by increasing our workload to maintain several pages which speak to how their ancestors were and how some of them still are. But am I, a newcomer here to editing, allowed to start a vote on this matter? Even if I expect it to fail spectacularly, could I -- a Wiktionarian of colour -- get confirmation of that through the voting process? Could somebody help me to do this? What's required and how do I go about it? -- Dentonius (talk) 19:00, 30 September 2020 (UTC)[reply]
Sure, you can create a vote. See WT:VOTE. I think it would be a waste of everybody's time, most of all yours, because you already know it will "fail spectacularly". I don't think it'll make you feel any better to get confirmation that Wiktionary stands by the motto "all words in all languages". But you are certainly entitled to create it anyway. —Μετάknowledgediscuss/deeds 19:27, 30 September 2020 (UTC)[reply]
@Metaknowledge, if it's "all words in all languages", then why do we reject useful terms in printed dictionaries? It seems those people are playing favourites, and man, do those people love their N-words. -- Dentonius (talk) 19:34, 30 September 2020 (UTC)[reply]
What useful terms in printed dictionaries do we reject? I don't know of any. —Μετάknowledgediscuss/deeds 19:38, 30 September 2020 (UTC)[reply]
@Metaknowledge, I won't provide specific examples but I'll say anything which has been rejected as "sum of parts" yet has appeared in a printed dictionary. This has happened to me and others. You, I suspect, also know that this is true. -- Dentonius (talk) 19:45, 30 September 2020 (UTC)[reply]
I don't think phrases that are the sum of their parts are useful, because you can look up each part separately. If you won't give an example, I'm not sure we can have a useful conversation about it. But I also don't know what that has to do with the topic at hand here; you could make a vote about SOP terms too, and depending on your proposed policy change, it could potentially pass (cf. WT:COALMINE, which did just that). —Μετάknowledgediscuss/deeds 19:52, 30 September 2020 (UTC)[reply]
I will always be defending truth and the discovery and preservation of truth. Some of the great problems of some of our civilization and of nations and societies derive from the exclusion of truth and the use of weaselly words to conceal the truth. The US Constitution, whatever its merits, used a lot of weasel words to obscure the nature of the compromises made that preserved slavery for 75 years after it was written. There are plenty of weasel words that help obscure the underlying intent of speakers and writers today. Eg, "We hired her because she is diverse." "We need to appeal to a more urban audience." It is our duty to address these 'evolving meanings' and we do attempt to perform that duty. DCDuring (talk) 17:41, 30 September 2020 (UTC)[reply]
This is a tricky subject, and I appreciate that everyone is approaching it with the appropriate respect for one another. From a new editor's perspective, here are my two cents.
(1) pretending to be "neutral" is folly that eventually leads to defending a racist status quo. It's as true in lexicography as anywhere else.
(2) altering the editing norms for a limited number of entries - ie having different CFI for offensive words - is logically circular and would get us tied in knots.
(3) is it possible to add a label (much) higher up than in the POS that shows that we, collectively, understand the terms in question to be offensive and harmful? Doesn't this kind of term deserve more attention than, say, a regionalism? SteveGat (talk) 18:48, 30 September 2020 (UTC)[reply]
@SteveGat, do you mean like tagging the word as offensive/racist in the search field and page title? You probably don't mean that, but that would be great. -- Dentonius (talk) 04:44, 1 October 2020 (UTC)[reply]
@Dentonius I hadn't thought of putting it in the search field, but if it is technically doable, why not? But at least a clear statement - like a large-font banner on top of the entry - that makes it clear that Wiktionary does not endorse the harm that is done to the reader who opens an entry. But the current situation, where the bland offensive label comes so far down the entry, is imo not nearly good enough. SteveGat (talk) 13:14, 1 October 2020 (UTC)[reply]
@SteveGat, I feel ya. Technically, if Wiktionary or Wikimedia wanted to, they could even implement a SafeSearch feature like Google. The words are already tagged. Users could technically opt in to see vulgar content. I wouldn't hold my breath to see that kind of change here, though. -- Dentonius (talk) 18:33, 1 October 2020 (UTC)[reply]
@Dentonius: I am answering you now for two hours for how profoundly wrong you are. You are only debasing yourself. These proposals do not defend the right things. They ground on political naïveté. I must tell you that you have fallen prey to the fraudulent scheme of the communists. The communist scheme works like this: A seeks redress from B because X has wrought damage to Y. I know not what whites have done to you but here the whites do not, they are trying to do lexicography by all legitimate motivations. What happens with this scheme is of course that even more wrong is caused. Because of some imagined compensation neurosis – I am not trying to pathologize mass psychology, I habitually apply law-derived concepts – formerly oppressed classes flock together and some trained Marxists, as they call themselves, incite the mob to attack and plunder a political target when the mob passes by the correct one. But they attack him who worked, whether white or black, and destroy riches from which other riches could be created – few have understood the broken window fallacy. It never was about blacks against whites, it was the productive against the fraudsters. Not a coincidence that Mr. Mandela was of the communist party, and since him South Africa goes down the toilet, and that increasingly the more policies against whites and whiteness are invented and measures executed, which are just boogiemen for systems that work that communists are envious of. After all these expropriations in favour of incompetent random people and farm raids they can’t even avoid famine. And what did white people do? You couldn’t sit on every bench, but at least you could sit on some benches without being robbed. Because the whites stood for the right and order. That which is expected to work in the foreseeable future. 
They do not remedy everything in an instant, because they are wary of restrictions in the nature of things, but they are also those who abolished slavery, in spite of constantly being depicted as its main perpetrators. Apart from innumerable other benefits, which you cannot and should not avoid. So stop ascribing sinister motives to a race or its language or its manner of documenting it, beware of swindlers.
The counter examples are communists. The National Socialists in turn were like communists and ailed from envy, so they wrung the Jews to get their stuff and deported them, apart from the other reason or fallacy that they thought they were collectively representative of their international socialist competitors. Look at the bottom line and not the catchwords. It’s bikeshedding and the actual lack of empathy, apart from being a first world problem, to complain about nasty words if on the other end destitution grows because it is not addressed or even cannot be addressed because of political correctness and an ever tumbling Overton window, and people have to flee because of the very same. Fay Freak (talk) 20:06, 30 September 2020 (UTC)[reply]
Fay Freak, this was embarrassing to read. Please take your misunderstanding of communism to a political forum rather than Wiktionary. As you said to Dentonius, you are only debasing yourself. —Μετάknowledgediscuss/deeds 20:43, 30 September 2020 (UTC)[reply]
@Metaknowledge. That rant was awesome! It made my day. @Fay Freak, don't let them tell you otherwise: Ты молодец! Bleib genau so, wie du bist! -- Dentonius (talk) 04:09, 1 October 2020 (UTC)[reply]
  • You are suggesting consolidating the N-word derivates into a single entry, does this mean you do not want the derivatives to be searchable? If yes then this is a big no-no because this will make English Wiktionary less useful. Don't assume every single person on this planet knows every ethnic slur in English. If they are going to be still searchable then I do not understand why would you care if they have separate pages. Dixtosa (talk) 22:01, 30 September 2020 (UTC)[reply]
@Dixtosa. Thanks for replying. They'd still be searchable. The search algorithm is really good. The big difference would be that ethnic slurs wouldn't have such a large footprint. So instead of there being potentially hundreds of N-word pages (or K-word, or J-word, or F-word, or ...), there would only be one page for each. The definitions of the myriad of racist terms would still be searchable but each restricted to one page each -- their root. All x-word derivatives would be present as examples on the x-word page. -- Dentonius (talk) 04:21, 1 October 2020 (UTC)[reply]
  • 2p from left field, as it were (my main focus is Japanese).
From a lexicographic standpoint, I'd like to point out that non-native English readers may use this site as a resource. If a word is offensive, we should definitely clarify that. During my time in Japan, I spent a while teaching English, and was horrified at some of the resources my students brought -- such as a Hello Kitty dictionary that blandly defined some of these offensive N-word synonyms simply as 黒人 (kokujin, literally black person). Were a second-language-learner to use this vocabulary in the wrong context, they could be setting themselves up for some potentially dangerous, or at any rate negative, interactions.
From a technical standpoint, I can imagine this working best if the N-word derivatives subject to migration to within the body of the main lemma form, were particularly those terms that start with the lemma form itself. Anything spelled differently should presumably go on a different page, no? I'm not sure how it would work otherwise. ‑‑ Eiríkr Útlendi │Tala við mig 04:46, 1 October 2020 (UTC)[reply]
@Eirikr. Thanks for sharing your thoughts. That's precisely what I had in mind. -- Dentonius (talk) 05:05, 1 October 2020 (UTC)[reply]

I'd still be happy to see some feedback on this topic from time to time. It was interesting to see everybody's viewpoints here even though I disagree with a lot of them. However, this isn't my Wiktionary, it's our Wiktionary. I would be lying if I said I wasn't saddened by the general lack of support for this idea. Jberkel also mentioned a few benefits of having such a policy as it relates to vandalism and a violation of Wikimedia's community standards which was raised at Phabricator. There's no doubt in my mind that the ethnic and racial make-up of the active users plays a role. Were there a majority of Wiktionarians of colour, the proliferation of useless racist pages would be restricted in size and scope. In my country, we say "He who feels it, knows it." The concept of racism for many of you is as abstract as an as yet undiscovered exotic particle in physics. Our Wiktionary here, in many ways, is like a microcosm of the larger society. The concerns and plights of minorities are ignored because those in the majority are uninterested in them and are incapable of relating to them in any way whatsoever. This is lamentable. -- Dentonius (talk) 05:05, 1 October 2020 (UTC)[reply]

I can sympathize with your concern (if not fully empathize), but I could not disagree more strongly with the proposition to make subjective determinations about what can and can't be an entry. Keep in mind that you are reading a modern-day perspective into a lot of these terms. The N-word etymologically just means "black", and thus not all words derived from it are or were offensive. For a long time, niggerhead, for instance, was a normal word with no offensive connotations. We would be doing people a disservice to not include these words as entries, not to mention that we would be essentially obscuring the racist history of English-speaking countries, since the words wouldn't even be visible in categories and such. There is no rational basis for sanitizing the dictionary, only an emotional one. Idiomaticity and attestability should be our criteria, not the subjective sensibilities of certain users (note that there are many black people who would disagree with you quite strongly, so how on earth would we determine who to listen to if we started making exceptions to the rules?). Andrew Sheedy (talk) 12:01, 1 October 2020 (UTC)[reply]
There is discussion above about more clearly labelling words as racist. That's something I'd be willing to support, provided it doesn't obscure the actual meaning of language (like, for instance, saying that niggerhead is a racist term, period, without any indication that this wasn't always the case, and that once upon a time, authors who used the word weren't necessarily racist). Andrew Sheedy (talk) 14:22, 1 October 2020 (UTC)[reply]
@Andrew Sheedy, your heart seems to be in the right place. -- Dentonius (talk) 18:31, 1 October 2020 (UTC)[reply]
I appreciate the fresh perspective, which reconsiders some basic assumptions about how things work (e.g., that every word gets its own entry). I sometimes think there would be benefits (but also drawbacks!) to adopting a system more like some print dictionaries where many "derived" entries are handled under "main" entries. Looking more broadly, far beyond just offensive terms, this might make it easier to find the relevant definition if someone sees e.g. "got off" used in a text (currently, some senses are at get [+off], some at get off, but if we housed both entries at "get"...). (Entries could still be searchable/findable via soft redirects of one form or another, such as the {{no entry}} ones we sometimes use to direct people to appendices.) But, there are also benefits to giving each word its own entry, which is the prevailing practice with only a few exceptions (e.g. sometimes the Foobar and Foobar will both be on the page "Foobar"). If we consolidated all the terms that start with nigger onto nigger, the page would be very large, unwieldy, and might even run into memory issues that would result in content not displaying, as happens on some other big pages (CAT:E). There would also be a lot of issues with deciding what to merge vs not merge (merging Niggerville under nigger but not Greenville under Green, or niggeresque but not Romanesque, would not be intuitive, IMO). Ultimately, I would not support consolidating only offensive terms (and I would be wary of consolidating derived terms generally, despite some appealing features). - -sche (discuss) 00:10, 2 October 2020 (UTC)[reply]
If how we treat derived terms changes, that discussion can be had, but notions of offensiveness should have no part in it at all.
Offensiveness is not a neutral criterion, what is "too offensive for inclusion" is determined only by who shouts loudest (and that is fickle - it will come back to bite you in the ass when political winds change). Such a consideration has no place in a dictionary that strives to serve humanity in a neutral, scientific way. I would lose a lot of respect for this project if we were to start handling terms that are offensive to some (or even most) people with kid gloves. — Mnemosientje (t · c) 16:00, 2 October 2020 (UTC)[reply]
At the moment there are 385 nigger pages +or reference in en.wikt. Is this normal? ‑‑Sarri.greek  | 16:39, 2 October 2020 (UTC)[reply]
Well actually, they are the 385 pages which contain the word "nigger". Comparing that to 38 with "honky", 40 with "gringo", 256 with "Paki", 14 with "wigger", there's probably some conclusion to be made. --Daleusher (talk) 16:50, 2 October 2020 (UTC)[reply]
Back in 2014, someone came to el.wiktionary and added only plurals of such words. Also, an anonymous IP added sex-related lemmata of that sort. They took advantage of the lack of patrolling in a small wiki with 2-5 active editors. I found them in 2020 and deleted some, but did not check them all. I will do so, now. I have added at el:nigger that it is a, errr... I do not know the English equivalent, ... προτακτικός (protaktikós) / el:προτακτικό, a word that precedes another creating loose compounds. If the compounds are understood, I shall delete them. Alas, European languages have indeed words that are vomiting. For several centuries, it will be very difficult to look at such lemmata. But one cannot delete crimes and sins. ‑‑Sarri.greek  | 20:33, 2 October 2020 (UTC)[reply]
Besides all these technical and lexicographical points, I think there's another discussion worth having: how do we create a welcoming environment for new editors with different backgrounds? I can totally understand if some get discouraged when they see the proliferation of n-word entries (or how they are handled), or inappropriate, borderline misogynistic usage examples. And without a diverse community of editors we suffer from the same bias typical of Wikimedia projects : the clichéd middle-aged white men documenting their interests. – Jberkel 14:53, 3 October 2020 (UTC)[reply]
These clichéd middle-aged white men are just that, clichéd. Dentonius is attacking a straw man, as many others have pointed out here by mentioning that there are plenty of people from "different backgrounds" here, I am not even convinced the majority here is a cishet WASP as people are pretending right now (nor am I convinced it would matter if that were the case). I personally check a couple of boxes on the diversity checklist but you don't see me grandstanding with my 'identity' as if it's an argument, despite the fact there are plenty of words on this project that are (depending on context!) offensive to me and others like me. — Mnemosientje (t · c) 08:40, 4 October 2020 (UTC)[reply]

Censorship is the last thing you would ever want in a dictionary. Besides that, if you "just consolidate them under their roots" (like moving niggerhead to nigger/niggerhead), all those links, their templates like {{m}} and {{l}} and even the search bar need to be reworked to lead to the correct pages. With this amount of effort someone may just as well build another wiki site to realise anything he wants. -- Huhu9001 (talk) 17:43, 6 October 2020 (UTC)[reply]

Capitalisation of Korean transliterations

(Notifying TAKASUGI Shinji, HappyMidnight, LoutK, Karaeng Matoaya, B2V22BHARAT, Quadmix77): Hi all. What do you think of completely abandoning capitalisation of Korean transliterations? It's not just about whether it's useful or our preference. I think it's not supported by any existing standard. Korean romaja is among the three currently supporting capitalisation at the English Wiktionary - the other two are Japanese rōmaji and Mandarin Chinese (hanyu) pinyin. The capitalisation of these two is standardised. --Anatoli T. (обсудить/вклад) 08:34, 30 September 2020 (UTC)[reply]

It’s clearly written that proper nouns should be capitalized. 국어의 로마자 표기법:
  • 제3항 고유 명사는 첫 글자를 대문자로 적는다. (Article 3: Proper nouns are written with the first letter capitalized.)
TAKASUGI Shinji (talk) 09:48, 30 September 2020 (UTC)[reply]
@TAKASUGI Shinji: Thank you. That settles it then.
There are still occasional issues with a few templates. As you know, we use the symbol "^" to capitalise romaja, e.g. 라트비아 (ko) (Rateubia) works fine but not 라트비아의 (ko) (Rateubiaui) - currently displaying "^라트비아의". --Anatoli T. (обсудить/вклад) 09:58, 30 September 2020 (UTC)[reply]

Categorization by "Template:unsigned"

Is there any particular benefit to {{unsigned}} categorizing pages into "Category:unsigned with nonstandard timestamp" and "Category:unsigned with standard timestamp", particularly since there is no indication in the template documentation as to what constitutes a standard timestamp? — SGconlaw (talk) 16:20, 30 September 2020 (UTC)[reply]

I certainly hope not. Whenever I add {{unsigned}} (or more likely {{unsigned2}}) to someone else's comment, I always subst: it in, which somehow removes the categorization. So if anyone's trying to keep track of what pages contain unsigned comments, the category isn't even exhaustive. —Mahāgaja · talk 16:29, 30 September 2020 (UTC)[reply]
Maybe I should do that too. I've been trying to avoid categorization into "Category:unsigned with nonstandard timestamp" by trying to follow the format shown in the documentation exactly, but it doesn't work. For some reason, the template calls another template, {{timestamp}}, but for what purpose I know not. If the categories are not needed, maybe we should just replace {{unsigned}} with {{unsigned2}}. — SGconlaw (talk) 17:07, 30 September 2020 (UTC)[reply]
Pinging @Dcljr, Kc kennylau, Kephir, Ruakh who previously edited the template and appear to be still active. — SGconlaw (talk) 17:16, 30 September 2020 (UTC)[reply]
I don't see the benefit, personally. {{unsigned}} isn't read by any software so far as I know; it's just there for human readers, so we can see when we've reached the end of a comment (and maybe to nudge the non-signer so they remember to sign their posts in the future). It's not as if anyone were going to go back and clean up old {{unsigned}}'s to date them properly. —RuakhTALK 20:02, 30 September 2020 (UTC)[reply]
Presumably Kephir would be the one to explain the benefit, since that's who added the relevant code and created the categories in the first place. But the existence of the categories now can be used by a bot to "subst:" all the remaining non-substituted instances, if that is desired. (Note that there are several redirects to {{unsigned}} that would also need to be taken into account.) - dcljr (talk) 23:12, 30 September 2020 (UTC)[reply]
It looks like @Kephir is not very active these days, so if there are no objections I will remove the categorization. — SGconlaw (talk) 08:37, 22 October 2020 (UTC)[reply]