Archive

Catalan inflections

Hi Ben, any chance we could have automatic Catalan inflections? There's User:DTLHS/catalan bot requests, but it doesn't seem to be running very often, and it's tedious to add manually to a list. Jberkel 18:12, 11 December 2023 (UTC)Reply

@Jberkel Yeah I have looked into this. The thing is that I'd probably have to rewrite Module:ca-verb to work like Module:es-verb or Module:pt-verb. The Spanish, Portuguese and Galician modules were all written mostly by me and implement JSON fetching of the inflections as well as {{es-verb form of}} and similar to automatically fetch the correct inflections for a given verb form. The former wouldn't be too hard to add to the existing module but the latter would be painful, and it would probably be better to rewrite the module instead. I have looked into doing this but I don't have that good a handle on Catalan verbs, esp. those in -er/-re. Do you have any good references that explain how Catalan verbs work, especially focusing on the -er/-re verbs, which is where the irregularities seem to be? The current module seems to push a lot of the complexity down into the template call, e.g. veure's invocation looks like this, which is a mess:
{{ca-conj-ure-ia2|v|e<!--
-->|past_part=vist<!--
-->|past_part_mpl=vistos<!--

-->|pres_ind_1_sg=veig<!--

-->|pret_ind_stem=vei<!--
-->|pres_sub_stem=veg<!--
-->|impf_sub_stem=vei<!--

-->|pret_ind_1_sg=viu<!--
-->|pret_ind_2_sg2=veres<!--
-->|pret_ind_3_sg2=véu<!--
-->|pret_ind_1_pl2=vérem<!--
-->|pret_ind_2_pl2=véreu<!--
-->|pret_ind_3_pl2=veren<!--

-->|impr_2_sg=veges|impr_2_sg2=ves<!--
-->|impr_2_pl2=veieu<!--
-->}}

I'd want to have this stuff all in the module itself, similarly to what's being done for Spanish, Portuguese, Italian, French, etc. Benwing2 (talk) 23:15, 11 December 2023 (UTC)Reply

Ok, thanks for looking into it, I sent you some reference material via email. Jberkel 09:01, 12 December 2023 (UTC)Reply
@Jberkel Thanks, I received it. Benwing2 (talk) 21:12, 12 December 2023 (UTC)Reply
@Jberkel I have a question, not sure if you know the answer. In -ar verbs whose root vowel is e or o, is that vowel pronounced è or é (or ò or ó for roots in o) in root-stressed forms (e.g. first-singular present indicative), or does it vary from verb to verb? In Proto-Romance it varied from verb to verb, and this is still the case in modern Italian. Spanish has a reflex of that in verbs that unexpectedly have ie or ue in root-stressed forms, but Portuguese has regularized the vowel quality (for example, using low-mid vowels in -ar verbs). I think in conservative varieties of Occitan at least, it varies from verb to verb, and this is reflected in the spelling. Benwing2 (talk) 08:41, 13 December 2023 (UTC)Reply
Pinging @Vriullop from ca.wikt. Ultimateria (talk) 23:28, 13 December 2023 (UTC)Reply
@Vriullop @Ultimateria It appears that it varies from verb to verb in Catalan, at least based on the two verbs pegar, which ca.wikt says has /ɛ/ in Central Catalan (consistent with its origin from Latin short ĭ), and membrar, which ca.wikt says has /e/ in Central Catalan (again, consistent with its origin from Latin short ĕ). But the situation is complicated by the dialects, where many dialects have /e/ for both verbs. I'm interested in finding a dictionary that indicates these vowel qualities so that maybe we can include them in the conjugation table, similarly to how the French and Italian conjugation tables give pronunciation; this would only be for Central Catalan for now (maybe forever), since the dialects are complicated. Benwing2 (talk) 00:19, 14 December 2023 (UTC)Reply
BTW if what I've said is correct, where can I find in Catalan dictionaries the indication of how the stressed vowel is pronounced for a given verb? Benwing2 (talk) 05:32, 14 December 2023 (UTC)Reply
For variation in dialects see the notation used with {{ca-IPA}}: ê for /ɛ/ in Central, /e/ in Valencian and /ə/ in Balearic. Similarly with ô, while è, é, ò, ó have no variation. This is fairly consistent, with few exceptions.
It is etymological, ê from Latin ĭ or ē, but with some exceptions.
The only dictionary that indicates the rhizotonic stress is the DNV, for example membrar says é, but it is only for Valencian and it could be either ê or é. It is only helpful for è and ò. I have not found any other source indicating systematically the rhizotonic stress; even the dictionary of pronunciation I have on my bookshelf only includes some paradigmatic verbs. Frankly, there are some verbs whose pronunciation I don't know, apart from my personal perception, not a good sample. The only clue is a noun related to the verb, and the etymology of inherited ones. On ca.wikt I include a rhizotonic parameter verb by verb with ca-IPA notation. Vriullop (talk) 09:25, 14 December 2023 (UTC)Reply
@Vriullop Thank you! I wonder why Catalan dictionaries are so bad at including the rhizotonic vowel quality patterns. Pretty much all monolingual Italian dictionaries list the rhizotonic quality (and position) for all verbs. What about the pronunciation of other forms, such as verbs with pres 3s in -ou or -eu? Are there any dictionaries indicating the vowel quality of these and other endings? Thanks for any help you can give. Benwing2 (talk) 09:57, 14 December 2023 (UTC)Reply
I'm not sure what you mean, 'mou' from 'moure' and 'veu' from 'veure' have the same stress as the infinitive.
Endings that may be ambiguous, without any graphic accent:
  • -em, -eu, as in cantem, canteu, cantarem, cantareu: ê
  • -essis, -essin, as in cantessis, cantessin: é
  • -eres, -eren, as in temeres, temeren: é
  • infix -eix- (-eixo, -eixes, -eix, -eixen, -eixi, -eixis, -eixin): ê, but not used in Valencian, which changes it to -ix-
This is a summary from different sources, coherent with the etymology. Vriullop (talk) 12:38, 14 December 2023 (UTC)Reply
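The list of ambiguous endings above could be captured as a simple suffix lookup. The following is only a minimal Python sketch, not code from Module:ca-IPA or Module:ca-verb; the names `ENDING_HINTS` and `ending_hint` are hypothetical, and the sketch assumes the caller already knows that the form is a verb form stressed on the ending (it does not check where the stress falls).

```python
# Hypothetical table: ending -> ca-IPA vowel hint, transcribed from the list above.
# Assumption: the form passed in is a verb form stressed on this ending.
ENDING_HINTS = [
    ("em", "ê"), ("eu", "ê"),          # cantem, canteu, cantarem, cantareu
    ("essis", "é"), ("essin", "é"),    # cantessis, cantessin
    ("eres", "é"), ("eren", "é"),      # temeres, temeren
    # infix -eix- (not used in Valencian, which has -ix-)
    ("eixo", "ê"), ("eixes", "ê"), ("eix", "ê"), ("eixen", "ê"),
    ("eixi", "ê"), ("eixis", "ê"), ("eixin", "ê"),
]


def ending_hint(form):
    """Return the vowel hint for the longest matching ending, or None."""
    # Try longer endings first so that e.g. "-eixes" wins over shorter ones.
    for suffix, hint in sorted(ENDING_HINTS, key=lambda pair: -len(pair[0])):
        if form.endswith(suffix):
            return hint
    return None


if __name__ == "__main__":
    for form in ("cantem", "cantessis", "temeren", "serveix", "canto"):
        print(form, ending_hint(form))
```

Forms whose ending is not in the table (such as canto) simply return None, which leaves the decision to an explicit per-verb hint.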
@Vriullop OK thanks, I suppose that the DCVB dictionary gives the infinitive pronunciation of words like moure. This is very helpful; if I have other questions I'll let you know. Benwing2 (talk) 19:55, 14 December 2023 (UTC)Reply
DCVB is fine for pronunciation, but in some cases it is incomplete or confusing. If necessary, you can compare it with the GDLC in the link "francès" that includes translation ca-fr and also pronunciation in Central Catalan, and the DNV for Valencian. Vriullop (talk) 20:57, 14 December 2023 (UTC)Reply
@Vriullop Thanks! Benwing2 (talk) 21:41, 14 December 2023 (UTC)Reply
@Jberkel I wrote a preliminary Catalan conjugation module; see User:Benwing2/test-ca-conj for examples. It has a few bugs in it that I'm working out, but it's close. Benwing2 (talk) 22:13, 17 December 2023 (UTC)Reply
Already looking good, thanks for working on this! Jberkel 22:26, 17 December 2023 (UTC)Reply

Pronunciation of feu is correct, 2nd pl. regular with -eu, and the irregular past was spelled féu in pre-2016 orthography which is more helpful.

The pattern /e/ in Central and /ɛ/ in Valencian is possible, but rare. It can appear for different reasons:

  • Pronunciation of stressed e is not as uniform in Central Catalan as in other dialects. For example, a word can be /e/ in Barcelona and /ɛ/ in Girona or vice versa. In general, one of the two is considered formal and the other local or dialectal. The formal one is usually the expected one or the same as in Valencian and Balearic.
  • Recent loanwords may have hesitations in their adaptation. They are usually adapted with è, but with é for the Spanish ones.

The DCVB indicates these local details. In this case I trust the GDLC more. The DCVB comes from fieldwork in the 1920s. Some of the pronunciations have not been registered in other late 20th c. fieldwork. The GDLC compiles the pronunciation of the main reference work used for radio and TV speakers in Central formal speech. In short, this pattern is rare in formal pronunciation. As far as I can remember, it doesn't happen with verb forms, and it can be treated like other irregular cases that do not follow an expected pattern. --Vriullop (talk) 18:00, 19 December 2023 (UTC)Reply

Although the /e/-/ɛ/ pattern above is rare, the other way is more common: /ɛ/ in Central and Balearic, /e/ in Valencian. This is noted on cawikt as ë (double e), a variant of ê (triple e). Stressed schwa in Balearic is used in inherited words and inflections. In cultisms or loanwords (e.g. cafè), or just words perceived as literary (e.g. mestre), instead of schwa it is /ɛ/ as in Central. There are indeed verb forms with rhizotonic vowel ë. There is no equivalent with stressed o, but for consistency it could be noted ö (double o) instead of ô. Vriullop (talk) 08:02, 21 December 2023 (UTC)Reply
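For reference, the stressed-e notations described so far in this thread (è, é, ê and ë) can be summarised as a small lookup table. This is only an illustrative Python sketch based on the descriptions above; the names `E_NOTATION` and `stressed_e` are hypothetical and this is not the actual data structure of Module:ca-IPA.

```python
# Hypothetical summary table: notation -> phoneme per dialect, as described
# in this thread (not taken from the Module:ca-IPA source).
E_NOTATION = {
    "è": {"central": "ɛ", "balearic": "ɛ", "valencian": "ɛ"},  # no variation
    "é": {"central": "e", "balearic": "e", "valencian": "e"},  # no variation
    "ê": {"central": "ɛ", "balearic": "ə", "valencian": "e"},  # "triple e"
    "ë": {"central": "ɛ", "balearic": "ɛ", "valencian": "e"},  # "double e"
}


def stressed_e(notation, dialect):
    """Return the stressed vowel phoneme for a notation in a given dialect."""
    return E_NOTATION[notation][dialect]


if __name__ == "__main__":
    for notation in E_NOTATION:
        row = ", ".join(f"{d}: /{v}/" for d, v in E_NOTATION[notation].items())
        print(notation, "->", row)
```

The only difference between ê and ë in this table is the Balearic value (stressed schwa versus /ɛ/).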
@Vriullop Thanks for all your help. I have implemented ë in Module:ca-IPA. Can you help me by fixing the default rules in the module that currently default to ê to instead default to ë when it's correct? For example, cens defaults to cêns when it should be cëns. This is in the mid_vowel_e() function of Module:ca-IPA. I don't know Catalan well enough to fix it myself, and the corresponding cawikt module in ca:Module:ca-pron/AFI seems to have the same rules we currently have. Benwing2 (talk) 20:49, 21 December 2023 (UTC)Reply
As stressed schwa depends on inherited v. cultism, there is too much variation with -ens, -ena, -enes endings to be able to redefine the rule. I have added tracking and checked where it was being applied by default. After adding hint ê or ë, I think it is safer to remove this rule: Special:WhatLinksHere/Template:tracking/ca-IPA/ens-ena-enes. Later, I'll look at other rules with default ê. Vriullop (talk) 09:19, 22 December 2023 (UTC)Reply
@Vriullop Thank you. I agree about removing the rule. In general I'm not much in favor of rules like this that are wrong a significant fraction of the time, and prefer to be explicit except when it's nearly completely predictable. Benwing2 (talk) 11:06, 22 December 2023 (UTC)Reply
@Vriullop I just discovered that cerndre is irregularly missing the first r in pronunciation. Does this carry through to inflected forms like cerno, cerns or are they pronounced regularly with /r/? Benwing2 (talk) 03:05, 24 December 2023 (UTC)Reply
BTW there is a bug in cawikt's handling of Balearic pronunciation with ê; hard /k/ shows up as /c/ in the first of two alternants. See ca:cerca for an example. Benwing2 (talk) 03:08, 24 December 2023 (UTC)Reply
@Vriullop OK, I have several more questions. I'll try to list them all here and avoid pinging you individually.
  1. cors "privateering campaign" and cors "Corsican" are given without the /r/ in Eastern Catalan pronunciation both here and in cawikt. However, GDLC says /kórs/ for the former and /kɔ́rs/ for the latter. Which is correct, and if the /r/ is correct, do we need to update Module:ca-IPA?
  2. I am going through mid-vowel verbs trying to update the inflected forms to have the correct vowels. I am probably going to implement something soon in {{ca-conj}} and/or {{ca-verb}} to let you specify the mid-vowel quality and display it, similar to what cawikt does. I cannot determine the vowel quality of the following verbs so far: cessar, conrar, copar, copsar, crepar, dopar, drenar, gestar. Can you help?
  3. I am going to update Module:ca-IPA so you can individually specify the pronunciation of different dialects, as I have found some need for this. Apropos of this, I notice that the cawikt version of {{ca-pron}} supports ẽ; do you think we should support this, or just use the per-dialect support I am going to add?
  4. Also, I'm more and more convinced that we should have few default rules for mid-vowel quality, and require it to be given explicitly in all cases that don't involve a well-known affix.
  5. fossa "pit, grave, etc.": does it have /o/ [per GDLC] or /ɔ/ [per DNV, DCVB and cawikt]?
  6. llei "law": does it have /e/ or /ɛ/ in Eastern Catalan, or some complex mixture? cawikt says /ɛ/, GDLC says /e/, DCVB says a complex mixture.
Thanks for your help, Benwing2 (talk) 06:41, 24 December 2023 (UTC)Reply
Lot of stuff here, but I'm happy to help.
  • 'Cerndre' loses its first r when it is followed by the sequence -ndr-, that is, in the infinitive, future and conditional. All other forms have regular pronunciation. This also happens with prendre and its derived verbs. See ca:Categoria:Rimes en català -ɛndɾe, which includes 14 verbs ending in -prendre. The sequence -rndr- only occurs in 'cerndre', and no term other than these 14 verbs has the sequence -rendr-.
  • /c/ in Mallorcan is an allophone of /k/, i.e. local pronunciation [məˈʎɔ̞ɾ.ca̟]. You're right, this is phonetic and not phonemic. Catalan works often include some phonetic symbols in phonemic representations for dialectal contrast, but this is not the case for [c], which has restricted use. I plan to remove it as misleading.
  • 'Cors' fixed on cawikt. This r is really retained, respelled 'corrs'. The module should not assume the loss of -r(s) in final coda for monosyllables. While most polysyllables lose it, most monosyllables don't. The problem is how to manage that.
  • My guess on rhizotonic vowels:
    • cessar: é; inherited from Latin ě not followed by an opening context, and DNV é.
    • conrar: ó; from unstressed o, reduction of conrear, DNV ó.
    • copar: ó; from French /u/ and analogous to noun copa, DNV ó.
    • copsar: ó; inherited from Latin ǔ, DNV ó.
    • crepar etym 1: ë; as noun crep from the same French root, neologism not attested in Balearic, DNV é.
    • crepar etym 2: é; from Latin ě, only used in Balearic.
    • dopar: ó; neologism as in Spanish, close to the English original, DNV ó.
    • drenar: é; idem.
    • gestar: é; from Latin ě, as the noun gesta from the same root, DNV é.
  • Notation ẽ is hardly used. It is better to fix that with parameters per-dialect: ca:Special:Diff/2245937. I'll remove it on cawikt.
  • Some rules for mid-vowels are theoretically justified. I have this pending to review the unwanted side effects. I agree that it shouldn't lead to erroneous results.
  • Fossa should be ò from Latin ǒ, but there have been some modern changes during the 20th c. that I am still unable to explain. The DCVB shows the situation in the first third of the 20th c., in accordance with etymology. Central is probably hesitant today. In this case, I would say ó in Central and ò in Balearic and Valencian, two more conservative dialects.
  • Llei fixed on cawikt. From Latin ē it should be ê, but the diphthong has changed it: é in most Central, retained è in northern Central, /ə/ in Balearic, é in Valencian.
Vriullop (talk) 18:27, 26 December 2023 (UTC)Reply
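The cerndre/prendre point in the reply above (the first r is silent only when the sequence -ndr- follows) can be illustrated as a respelling step. This is a minimal Python sketch under that description; the function name `drop_r_before_ndr` is hypothetical, it operates on spellings rather than IPA, and it is not how Module:ca-IPA necessarily handles the case.

```python
import re


def drop_r_before_ndr(spelling):
    """Drop an r that is directly followed by "ndr" or "endr".

    Covers "rndr" (cerndre) and "rendr" (prendre and its derivatives);
    forms without the -ndr- sequence are returned unchanged.
    """
    return re.sub(r"r(e?ndr)", r"\1", spelling)


if __name__ == "__main__":
    for form in ("cerndre", "prendre", "aprendre", "cerno", "cerns", "prenc"):
        print(form, "->", drop_r_before_ndr(form))
```

Because, per the reply above, no other terms contain -rndr- or -rendr-, a pattern this broad should not affect unrelated words, though that has not been checked here against a full word list.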
@Vriullop Thank you! I have applied the changes offline to the specific verbs and other words mentioned above, and I will push them soon. Still working on Module:ca-IPA. A few more questions:
  1. More verbs where I'm not sure of the rhizotonic vowel quality: menar "to lead" (is this ê?), menjar "to eat" (apparently it uses now-deprecated ẽ?), mentir "to lie" (?), molar "to mock" (from Spanish; ó?).
  2. mesa "altar, mense, table": cawikt says /e/ for both East and West, which agrees with DCVB, but GDLC says /ɛ/. Mistake?
  3. messes "harvest time": again, cawikt says /e/ for both East and West, which agrees with DCVB, but GDLC says /ɛ/.
Benwing2 (talk) 05:46, 27 December 2023 (UTC)Reply
  • 'menar': ê.
  • 'menjar': é but Balearic ə. I'll modify the rizo parameter to accept an explicit /e/, /ə/, only used here.
  • 'mentir': é in forms without -eix-.
  • 'molar', to rock, from Spanish: ó.
  • 'mesa' as a noun has two etyms with different pronunciations, but GDLC only shows one in translations. Here DCVB is correct.
  • 'messes', I would say é but irregular è in Central.
Vriullop (talk) 09:47, 27 December 2023 (UTC)Reply
@Vriullop Thanks for your quick response! I have made the offline updates. Some more questions (for N and O) ...
  1. noble: I already pinged you about this. DNV says /o/ for Valencian but DCVB says /ɔ/.
  2. nombre: cawikt and DCVB say /o/ for Eastern Catalan but GDLC says /ɔ/. /o/ is etymologically expected.
  3. odre: Same. cawikt and DCVB say /o/ for Eastern Catalan but GDLC says /ɔ/. /o/ is etymologically expected.
  4. ofi "office": Vowel quality? Maybe /o/ since the o is unstressed in oficina?
  5. oi: DCVB splits the interjection into /ɔj/ "yes" from Latin hoc and /oj/ (expression of pain or surprise). GDLC and DNV group these two meanings and say the pronun for both is /ɔj/. Who is right?
  6. orla "border, fringe": DCVB and cawikt say /ɔ/ for Valencian. DNV says /o/.
  7. oro "suit in a Spanish deck of cards": Same as previous: DCVB and cawikt say /ɔ/ for Valencian. DNV says /o/. (Not in GDLC.)
Benwing2 (talk) 01:03, 28 December 2023 (UTC)Reply
For P:
  1. peli "film" (clipping of pel·lícula): cawikt says pel·li has ê, so I assume this is the same, but it seems strange to have ê for a recent coinage.
  2. perca "perch (fish)": cawikt says /ɛ/ for Valencian but DNV says /e/. DCVB doesn't give a pronunciation.
  3. pesta "plague": cawikt and DCVB say /ɛ/ for Central but GDLC says /e/ (mistake?).
  4. pleca "vertical bar": Balearic vowel? Is it ê?
  5. poblar "to populate": DNV says stressed vowel is /o/ despite poble having /ɔ/. Mistake?
  6. porro "leek; spliff": cawikt and DCVB say /ɔ/ but both GDLC and DNV say /o/.
  7. posa "pose" (not in cawikt): GDLC says /o/ despite this being derived from posar, which has /ɔ/. (Are there two different pronuns/etyms here?)
  8. postres "dessert": cawikt and DCVB say /ɔ/ for Valencian but DNV says /o/.
  9. pregar "to pray": Presumably /e/ (same as prec)?
Benwing2 (talk) 05:23, 28 December 2023 (UTC)Reply
For P:
  • 'peli' is an informal spelling of 'pel·li'. The latter is used in the press and has been consolidated, unlike other clippings. I spontaneously pronounce it è just like any word beginning with consonant + stressed e + l, including inherited ones from Latin both ě and ē. Being of general use and not exclusively colloquial, I would say ê, fully adapted in Central and the same value as unstressed in Balearic and Valencian.
  • 'perca': ë. Expected é but è per context C+ě+r, not fully changed in learned borrowings.
  • 'pesta' is weird, expected é but with some irregular è not sufficiently explained in context C+ě+s. From the sources, è but irregular é in Central, although the irregularity is the other way around.
  • 'pleca': ë, as a technical word, schwa is improbable in Balearic.
  • 'poblar': I can't find any explanation for the difference between 'poble' and 'pobla'. Without any confirmation, for now I would say ò.
  • 'porro': ó. Expected ò but usually changes to ó before -rr-.
  • 'posa': noun ó and verb ò. Expected ò both from 'pausa' and 'pausare', but most current senses of the noun are calques of French or Spanish, both ó.
  • 'pregar': é.
Vriullop (talk) 13:30, 29 December 2023 (UTC)Reply
On cawikt the pronunciation was first added according to DCVB. Revision with GDLC is partial, not completed. Inclusion of pronunciation on DNV is recent, not yet checked. Your guesses are usually correct.
For N and O:
  • 'noble': ô. Expected ó, in the first syllable changed to ò per consonant context, except in areas with Mozarabic influence as in Valencian.
  • 'nombre': ô. The same case, but I trust DCVB for Balearic with irregular ó.
  • 'odre': ô, but Balearic ó.
  • 'ofi', I've never heard it in Catalan. My guess is ó either from an unstressed vowel or from Spanish.
  • 'oi' both ò and ó. I trust DCVB with three groups, the last one used especially in Balearic. The two authors of the DCVB were Balearic, and both 'oi las' (surprise) and 'ois' (moans) sound familiar to me, heard from Balearic people. Probably outside the Balearic Islands people don't care about the difference with barely used senses.
  • 'orla': ô. Again, an expected ó changed to ò except in Valencian, confirmed in descriptive works.
  • 'oro': ô, hesitant by analogy with inherited 'or'.
Vriullop (talk) 15:32, 28 December 2023 (UTC)Reply

────────────────────────────────────────────────────────────────────────────────────────────────────
For R:

  1. reble "rubble": cawikt and DCVB say ê, but GDLC says /e/ for Central Catalan.
  2. recar "to regret": DNV says /e/; DCVB suggests /e/ everywhere, is that right?
  3. regar "to water": Etymologically should be ê, is that right? (OTOH reg has /e/ everywhere per GDLC and DNV)
  4. regna, regne, regnar: These seem to have [ŋn]. Do all words in -gn- have this? If so we should fix Module:ca-IPA to do this automatically. (Is this Eastern Catalan only? Valencian seems to have [gn].)
  5. reptar: In the meaning "to reprimand; to challenge" it seems to have rhizotonic /e/. In the meaning "to crawl" I am not sure.
  6. resar "to pray": Since this is a Spanish borrowing, does it have /e/? res "prayer" seems to have /e/.
  7. retre "to give back, to return": cawikt and DCVB say /e/ in Eastern Catalan but GDLC says /ɛ/.
  8. rosca "screw thread": cawikt and DCVB say /ɔ/ for Valencian but DNV says /o/.
  9. rosta "fried bacon, fried bread": cawikt says /ɔ/ for both Eastern and Western; DNV says /ɔ/ for Valencian but GDLC says /o/. DCVB has /ɔ/ and /o/ dialectally.
  10. rosta (feminine of rost "steep"): Same. cawikt says /ɔ/ for all, DNV says /ɔ/ but GDLC says /o/. Here, DCVB has only /ɔ/.
  11. rotar: Two etyms: (1) "to belch": Does it have /o/ like rot "belch"? (2) "to rotate": Does it have /o/ because it's borrowed from Spanish?
  12. rotllo: "roll; annoyance": DNV says it has /o/ but rotlle has /ɔ/. Mistake? cawikt and DCVB say forms have /ɔ/ everywhere, and GDLC agrees that both forms have /ɔ/ in Central Catalan. Note also rotlo, where again DNV has /o/; here again, DCVB says /ɔ/ everywhere but in this case cawikt uses ô to get /o/ in Valencian.

Benwing2 (talk) 08:37, 28 December 2023 (UTC)Reply

@Vriullop Thanks again for your detailed responses, I really appreciate the work you're putting into the responses. Issues I found involving terms with S:
  1. seca "mint": GDLC says /ɛ/, DNV says /e/ and cawikt says ê, which are all compatible, but DCVB says /ɛ/ everywhere. In this case I wonder if DCVB is actually correct while both DNV and cawikt are mistaken.
  2. sedar "to sedate": DNV says /e/ for root vowel but unknown in Central Catalan.
  3. sense "without": cawikt and DCVB say ê, DNV says /e/ but GDLC says /e/ rather than expected #/ɛ/.
  4. sentir "to feel": DNV says /e/ root vowel. No dictionary attests the Central Catalan root quality, although /e/ is expected.
  5. serva "serviceberry": cawikt and DCVB say ê, DNV says /e/ but GDLC says /e/ rather than expected #/ɛ/.
  6. setge "siege; figwort": cawikt and DCVB say é, DNV says /e/ but GDLC says /ɛ/.
  7. soga "rope": DNV and GDLC both say /ɔ/ but DCVB says variously /o/ or /ɔ/ for a bunch of obscure places that I'm not familiar with but seem mostly Northwest Catalonian. I assume Balearic must have /ɔ/ but not sure.
  8. sonso "clumsy, gauche": cawikt and DCVB say /o/ for both East and West; DNV agrees with /o/, but GDLC says /ɔ/ for Central Catalan. Maybe this is a case of changing over the last century?
  9. sorna "sarcasm": cawikt says ô, but both DNV and GDLC say /o/. DCVB doesn't give pronun.
  10. sosa "saltwort, soda ash": cawikt and DCVB say ô, but both DNV and GDLC say /o/.
  11. sostre "ceiling": cawikt says ó and DNV says /o/, but GDLC says /ɔ/. DCVB maybe has the real story: /ɔ/ in Barcelona, /o/ elsewhere. I'm going with the idea that Western Catalan (Northwestern and Valencian) have /o/, while Central has /ɔ/ and Balearic has /o/. Correct?
  12. sotjar "to spy on": DNV says /o/ root vowel. No dictionary attests the Central Catalan root quality, but I am guessing /o/ based on the proposed etymologies. Correct?
Note that I'm now 87% through the set of 2,722 terms that I identified for auditing the mid-vowel quality, and have finished with S. T represents about 7% of the total, V represents 4-5%, and the remaining letters around 1%. So I'm quite close to finishing, with lots of help from you :) ... Benwing2 (talk) 09:20, 30 December 2023 (UTC)Reply
For S:
  • seca: I think the correct one is ë, although I'm not sure about its evolution from Arabic.
  • sedar: expected ê from Latin sēdō.
  • sense and sens: expected ê, but such words often used as proclitics tend to become closed. So é but schwa in Balearic.
  • sentir: é as expected.
  • serva: ê is correct. As in other similar cases, the GDLC does not distinguish properly different pronunciations from different etyms.
  • setge: expected é, but è in Central per context subject to openness.
  • soga: ò in general. It was identified by Coromines among a handful of about 40 words that have changed an etymological ó to ò except in some specific areas. It is known as the Coromines law, and it is still unknown why it includes certain words and not others.
  • sonso: ó but ò in Central, for a reason unknown to me.
  • sorna: ó in general.
  • sosa: ó in general.
  • sostre: it is one of the Coromines law words, expected ó changed to ò. This law may have various degrees of extension. Probably the most conservative areas, Balearic and Valencian, maintain the old ó, while most of Central has changed to ò. Usually Northwestern also changes by Central attraction, to be confirmed.
  • sotjar: not sure, but ó is the best guess.
Vriullop (talk) 08:31, 5 January 2024 (UTC)Reply
For R:
  • reble: expected é. The DCVB with ê seems by analogy with other words. I would say é but with an irregular ə in Balearic.
  • recar: é as expected from an earlier 'a'.
  • regar: ê as expected. Nouns 'rec' and 'reg' are interrelated and are not a good indicator for the verb.
  • All -gn- between vowels are pronounced [ŋn]. Also -n- followed by /k/ or /ɡ/, but this one was reverted as not phonemic.
  • reptar: é from Latin rěp(u)tō and ê from rēptō.
  • resar: é as noun 'res'.
  • retre: I really don't know which process applies here. For now I'd say ë, pending confirmation.
  • rosca: ô.
  • rosta, as a slice of bacon usually fried with bread, is a typical dish of the Pyrenees. Although it is the feminine form of 'rost', from the old sense "roasted", in the Pyrenees this ò usually changes to ó. In the DCVB, I read that the northernmost localities say ó, and ò is found quite far from the Pyrenees. In short, as a noun ó in Central, ò in Valencian and Balearic. As an adjective form: ò, although the GDLC does not separate it properly.
  • rotar: ó for both etyms.
  • rotllo, what a mess! It is not attested in Valencian until recent times, probably from Spanish rollo. This ó is archaic, not accepted in other areas where it is used from Old Catalan. 'Rotlle' is the inherited form, hardly used in Valencian, where the spelling 'rotle' is preferred, both ò. 'Rotlo' is only used in Balearic; for me it is anecdotal how outsiders try to pronounce it, with a range of alternative spellings.
Vriullop (talk) 11:29, 4 January 2024 (UTC)Reply
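The -gn- point in the reply above (intervocalic -gn- pronounced [ŋn]) is the kind of rule that could be applied automatically. The following is a minimal Python sketch of such a rule acting on a respelling string; the names `VOWELS` and `velarize_gn` are hypothetical, the vowel inventory is an assumption, and this is not the actual Module:ca-IPA code.

```python
import re

# Assumed inventory of Catalan vowel letters, used to detect "between vowels".
VOWELS = "aeiouàèéíòóúïü"


def velarize_gn(word):
    """Replace intervocalic "gn" with "ŋn"; other "gn" sequences are kept."""
    pattern = rf"(?<=[{VOWELS}])gn(?=[{VOWELS}])"
    return re.sub(pattern, "ŋn", word)


if __name__ == "__main__":
    for word in ("regna", "regne", "regnar", "gnom"):
        print(word, "->", velarize_gn(word))
```

Word-initial gn (as in gnom) is left alone because it is not between vowels. The parallel rule for n before /k/ or /ɡ/ is deliberately not included, since the reply above notes it was reverted as not phonemic.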
@Vriullop Thank you again! BTW I have gone through and added (offline) stressed root vowels to all enwikt Catalan verbs with e or o where I could determine it, using some combination of cawikt, DNV, GDLC and DCVB. (It looks like I was able to figure out the vowel for 1,174 verbs in -ar, 33 verbs in -ir and all relevant verbs in -re and -er, and only couldn't figure out the vowel for 72 verbs in -ar and 2 verbs in -ir.) I am mostly done coding the changes I want to make to Module:ca-IPA and I'll use the new code to support displaying the root vowel info. I'll post the list of undetermined verbs soon. Benwing2 (talk) 19:55, 4 January 2024 (UTC)Reply
BTW I have finished the changes to Module:ca-IPA and Module:ca-headword and pushed all the root vowel additions. You can see them in action e.g. in flirtejar, besar, adreçar, annexar and several others. Benwing2 (talk) 07:45, 5 January 2024 (UTC)Reply
Also, I added tracking for all terms with defaulted mid vowel quality, with the plan of removing some of the defaults. The first word I looked at, for example, is amulet, a recent borrowing that claims to have ê, which seems unlikely. Benwing2 (talk) 08:07, 5 January 2024 (UTC)Reply
Here is the list of now 68 -ar verbs where I couldn't identify the Central Catalan root vowel (sometimes only in one etymology out of several): afogar, agregar, al·legar, alterar, amonestar, ancorar, atemptar, celebrar, col·laborar, commemorar, compensar, condensar, confessar, congregar, conrear, contemplar, crebar, delegar, denegar, depredar, desagregar, desintegrar, deteriorar, devorar, discrepar, dreçar, dropar, edulcorar, elaborar, elevar, encetar, engegar, enllumenar, ennuegar, ensopegar, entaforar, entollar, entrenar, esborrar, esbotzar, esmicolar, espitregar, esverar, evaporar, exacerbar, expectorar, explorar, gofrar, impetrar, increpar, integrar, interpretar, isolar, laborar, negar, perforar, prolongar, rememorar, retolar, rosegar, secretar, segregar, somorgollar, temptar, tomar, trafegar, trepar, trepollar. Benwing2 (talk) 08:12, 5 January 2024 (UTC)Reply
In some cases I can't be completely sure, these are my best guesses: afogar ó, agregar é, al·legar ê, alterar é, amonestar é, ancorar ó, atemptar é, celebrar é, col·laborar ó, commemorar ô, compensar ê, condensar ê, confessar é, congregar é, conrear ë, contemplar é, crebar é, delegar é, denegar é, depredar é, desagregar é, desintegrar é, deteriorar ó, devorar ô, discrepar é, dreçar ë, dropar ó, edulcorar ô, elaborar ó, elevar é, encetar é, engegar é, enllumenar ê, ennuegar ë, ensopegar ê, entaforar ó, entollar ò (both), entrenar é, esborrar ó, esbotzar ó, esmicolar ô, espitregar ë, esverar é, evaporar ó, exacerbar é, expectorar ó, explorar ó, gofrar ó, impetrar é, increpar é, integrar é, interpretar é, isolar ô, laborar ó, negar é (both), perforar ó, prolongar ó, rememorar ó, retolar ó, rosegar ê, secretar ë, segregar é, somorgollar ó, temptar é, tomar ó, trafegar ê, trepar é, trepollar ó. Vriullop (talk) 08:23, 10 January 2024 (UTC)Reply
Reviewing mid-vowel defaults tracked:
  • e/u: doesn't make any sense, probably it was intended for a diphthong -eu-.
  • o/u: also nonsense.
  • e/ct-cts-cts-ctes: too many variations è with cases of é only in Central.
  • e/dre-dres: mostly ë instead of é.
  • e/final-l: it is stable but needs to exclude -ell(s).
  • e/l-ls-ll: it's ok, I haven't found any problem.
  • e/ma-mes: too many variations
  • e/ens-ena-enes: too many variations ê/ë
  • e/nse-nses: it isn't worth it for a few words
  • e/nt-nts: mostly é with few exceptions, widely used
  • e/r-rs-ra-res: too many variations é/ê
  • e/rC: it's ok
  • e/sos-sa-ses: it's ok
  • e/t-ts-ta-tes: too many variations
  • è/s-blank: FIXME only in last syllable stressed, currently includes tèbia, època, ...
  • o/r-rs-ra-res: too many variations
Vriullop (talk) 09:20, 8 January 2024 (UTC)Reply

────────────────────────────────────────────────────────────────────────────────────────────────────
@Vriullop I have finished everything up through T and pushed the offline changes to Wiktionary. Issues I found with T:

  1. teca three etyms: (1) "food"; (2) "teak"; (3) "theca". All three have /e/ per DNV, and (1) and (3) have /ɛ/ per GDLC. (1) has /ɛ/ per DCVB, otherwise not indicated. I am guessing then that (1) and (3) have ë, and (2) must have either é or ë.
  2. temprar: Exactly parallel to emprar. cawikt says ê but DNV says /e/ for tempre. Is /e/ more recent for Central?
  3. temptar "to try": /e/ per DNV, I'm guessing é per etymology.
  4. tesla "tesla": /e/ per DNV, I'm guessing é.
  5. testar "to witness": /e/ per DNV, I'm guessing é per etymology.
  6. teu "your": /ɛ/ per GDLC for Central Catalan but /e/ per cawikt. GDLC says /e/ for meu "my" so I wonder if this isn't a mistake in GDLC.
  7. text "text": /tekst/ per GDLC, /tɛkst/ per DNV. Correct? DCVB says /test/ for everywhere, which may be antiquated.
  8. tomar (1) "to catch"; (2) "to knock down". Root vowel?
  9. tondre "to shear": /o/ for Central in cawikt and DCVB, but /ɔ/ in GDLC (DNV says /o/). However, note that tosa has /o/ in GDLC. What's going on here?
  10. tora "aconite": GDLC and DNV both say /o/ but DCVB says /ɔ/ for both Western and Eastern. Is /ɔ/ antiquated?
  11. torbar "to disturb", torba "disturbance" and torba "peat": GDLC and DNV both say /o/ but cawikt says /ɔ/ for Central Catalan (/o/ for Valencian). Is /ɔ/ wrong or antiquated?
  12. tors "torso": cawikt says /o/ (dialect not indicated), but GDLC says /ɔ/ for Central (and DNV says /o/ for Valencian). I am assuming GDLC is correct.
  13. trempa, trempar: cawikt says ê everywhere, in agreement with DCVB for tremp and trempa, but GDLC gives /e/ for both tremp and trempa; maybe /e/ is more modern as DCVB's fieldwork is ~ 100 years old.
  14. trenca "duffel coat": A borrowing from Spanish trenca. The other meaning of the noun "breakage; lesser grey shrike" has ê but this seems unlikely for a Spanish borrowing. I'm guessing ë.
  15. trepa "trimming; stencil" also "mob, riffraff, rabble" also a form of trepar "to drill, to perforate". DNV says /e/ for all three etyms; GDLC says /e/ for the first two, but DCVB says /ɛ/ for the meaning "mob, rabble". I am not sure whether all three are etymologically related.
  16. tropa "troop; crowd": cawikt says /ɔ/ everywhere (and DNV says /ɔ/) but GDLC says /o/. DCVB says /ɔ/ for Eastern but /o/ for Girona; maybe /o/ for Central is more recent.
  17. trotllo "medusafish": cawikt says /ɔ/ everywhere but DNV says /o/, so I'm assuming ô.

Also a few other issues:

  1. alliberar: cawikt says /ɛ/ everywhere but DNV says /e/.
  2. beca and derived becar "to give a scholarship to": cawikt says ë but DNV says è.
  3. clon: cawikt says ò but DNV says /o/. I am guessing ô then.
  4. emprar: cawikt says ê but GDLC says /e/ for empre. Is /e/ more recent for Central?
  5. perseverar: cawikt says ê, are we sure? sever has é.

Benwing2 (talk) 01:53, 6 January 2024 (UTC)

One more question (sorry for the barrage of questions): Currently the module section for Central Catalan unilaterally removes final single -r, whether absolutely word-final or followed by an -s. I'm thinking of making this less absolute, as follows:
  1. Don't remove final -r(s) in monosyllables.
  2. In non-monosyllables, remove final -r(s) in -ar, -er, -ir and in -[dtsç]or, but not otherwise. This is based on the fact that most words in -[dtsç]or are agent nouns and seem to fairly consistently remove the -r, while the remaining words in -or often (but not always) preserve the -r per GDLC. Here is a long list of such words: amor, humor, anterior, vapor, rumor, labor, major, tenor, tumor, terror, inferior, superior, clamor, posterior, furor, ulterior, tricolor, temor, rigor, vigor, menor, decor, olor, llavor, suor, licor, rubor, petricor, negror, remor, millor, albor, cremor, claror, grogor, blavor, maror, pitjor, frescor, senyor, finor, incolor, rojor, vermellor, blancor, lletjor, amargor, primor, favor, picor, escalfor, tremolor, esgarrifor, llacor, raor, xafogor. The idea is that to force the preservation of -r, write 'rr', and to force the non-preservation, write '-ó' (although if all these words preserve the -r in Valencian, we'd want some other signal, e.g. '-(r)'). Thoughts? Benwing2 (talk) 09:50, 6 January 2024 (UTC)Reply
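As a rough illustration of the proposed default (not the module's actual code, which is written in Lua), the two rules can be sketched in Python. The function name `drops_final_r` and the crude vowel-group syllable count are assumptions made for this example, and stress is ignored entirely.

```python
import re

# Crude syllable estimate: one syllable per run of vowel letters (an assumption;
# real Catalan syllabification is more involved).
VOWEL_GROUP = re.compile(r"[aeiouàèéíïòóúü]+")
# Endings whose final -r would be dropped by default under the proposal.
DROPPING_ENDING = re.compile(r"(?:ar|er|ir|[dtsç]or)$")


def drops_final_r(word: str) -> bool:
    """Return True if the proposed default would drop the final single -r(s)."""
    stem = word.lower()
    if stem.endswith("rs"):
        stem = stem[:-1]  # treat -rs like -r
    if not stem.endswith("r") or stem.endswith("rr"):
        return False  # no final single -r to drop
    if len(VOWEL_GROUP.findall(stem)) < 2:
        return False  # rule 1: monosyllables keep the -r
    return bool(DROPPING_ENDING.search(stem))  # rule 2: only listed endings drop it
```

Under this sketch cantar and pescador lose the -r while mar, amor and senyor keep it, which matches the intent of the list above; any word that behaves differently would still need an explicit respelling.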
This plan sounds fine, assuming:
  • The non-preservation happens when the final syllable is stressed. When unstressed, it only affects some words, like créixer, càntir.
  • In Valencian it is always preserved. To force the non-preservation in Central and Balearic, writing '-(r)' or '-(r)s' is intuitive. In fact, this is similar to the rhymes, i.e. Rhymes:Catalan/a(ɾ).
  • In Balearic there are more losses of final -r than in Central Catalan. See amor: although the result is correct, it is not consistent when the preservation is forced by writing 'rr', and it is not reasonable to assume that in Balearic no final -r is ever pronounced. Maybe it should be fixed with a per-dialect parameter.
There are many pending things above that require more time. Vriullop (talk) 11:57, 6 January 2024 (UTC)
@Vriullop Thanks for your comments. I'm thinking that writing rr should force the pronunciation of final -r everywhere, while writing something like rh should cause it to be pronounced in Central Catalan but not Balearic. This is based on looking through the DCVB with a sample of the above nouns, some of which appear to have pronounced -r in the Balearics, some not, and for some it depends on where in the Balearics. More complex scenarios can be handled using dialect-specific params (which are now implemented; see llei for an example). Benwing2 (talk) 21:23, 6 January 2024 (UTC)Reply
Another possibility in place of rh for "pronounced everywhere but Balearics" is (rr). This sets up a hierarchy of pronunciation: rr > (rr) > (r) > nothing. Benwing2 (talk) 01:13, 7 January 2024 (UTC)Reply
BTW I am planning on making it required to specify the way final -r is pronounced, using one of rr, (r), (rr) [or maybe rh if we decide on that] or omitting it, except in the circumstances where it defaults to (r), which are multisyllabic words ending in stressed -ar, -er, -ir or -[dtsç]or. In all other circumstances, the pronunciation seems far too irregular to provide a default.
Note that I have already removed the majority of defaults for mid vowel o and added the vowel explicitly, and I'm planning on doing the same for mid vowel e. For the defaults I removed, either there were few places that made use of the defaults or there were many but with lots of errors, e.g. o and e in the penultimate syllable with -i or -is in the last syllable were defaulting to ò and è respectively, which makes sense for adjectives of this form but doesn't work for subjunctive verb forms, and there were lots of places where this default was being used for subjunctives, producing incorrect results.
One other thing: the pronunciation given in GDLC for meteor is [məteɔ́ɾ], with unstressed [e]. Is this correct? If so I'll need to add a special symbol to allow for unstressed unreduced vowels. However, maybe it's a mistake; I found a pronunciation on Forvo here [1], which sounds more like [mətəɔ́ɾ] (BTW cawikt says [mətəóɾ] with /o/, which may be wrong as well for Central Catalan). Benwing2 (talk) 05:08, 7 January 2024 (UTC)Reply
I forgot to add, I'm implementing a shortcut notation to make it easier to specify things like the pronunciation of final -r without having to repeat the entire word. If you write [FROM:TO] where FROM is part of the spelling and TO is the corresponding respelling, it will make that substitution in the respelling as long as it's unambiguous. So you can write [or:ôrr] for meteor. To make it even shorter, in cases where the spelling and respelling are similar enough, you can just write the respelling, hence [ôrr], and the code knows that ô should match either o or ô in the original spelling and rr should match either r or rr. Another common example is [ks], which is equivalent to [x:ks] and can be used to respell x as ks in words like boxejador. This will all be documented in {{ca-IPA}} as soon as I push the code. Benwing2 (talk) 05:16, 7 January 2024 (UTC)Reply
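To make the explicit [FROM:TO] form concrete, here is a minimal Python sketch. It is only an illustration: the name `apply_substitution` is hypothetical, the real implementation lives in the Lua module, and the shorter form (e.g. [ôrr] or [ks]), where FROM is inferred from TO, is not covered.

```python
import re

# Matches the explicit form "[FROM:TO]"; both parts must be non-empty.
SUBSTITUTION = re.compile(r"^\[([^:\[\]]+):([^:\[\]]+)\]$")


def apply_substitution(spelling: str, spec: str) -> str:
    """Apply one [FROM:TO] substitution, refusing ambiguous or missing matches."""
    match = SUBSTITUTION.match(spec)
    if match is None:
        raise ValueError(f"not a [FROM:TO] substitution: {spec!r}")
    source, target = match.groups()
    occurrences = spelling.count(source)
    if occurrences != 1:
        # Zero matches or several matches: the substitution is not unambiguous.
        raise ValueError(f"{source!r} occurs {occurrences} times in {spelling!r}")
    return spelling.replace(source, target)
```

With this sketch, `apply_substitution("meteor", "[or:ôrr]")` gives the respelling meteôrr and `apply_substitution("boxejador", "[x:ks]")` gives boksejador, while a FROM that occurs twice in the spelling is rejected.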
Great.
  • For final -r, I like the hierarchy rr > (rr) > (r)
  • 'meteor' with unstressed [e] is correct. No need to do anything in the module, function reduction_ae does not apply any reduction in groups 'eà' and 'eò'.
  • A shortcut for respelling is useful.
Vriullop (talk) 10:27, 8 January 2024 (UTC)
@Vriullop I have implemented everything described above and fixed up all terms in final -r(s) appropriately. The use of the respellings for -r is documented in {{ca-IPA}}. The substitution notation like [ó(r)] is still being documented. Benwing2 (talk) 02:29, 10 January 2024 (UTC)Reply
@Vriullop Thanks for your comments! I have added the root vowels you specified and am going through the defaulted mid vowel conditions and fixing them up. One thing I notice is that written bl pronounced /b.bl/ and similarly written gl pronounced /g.gl/ aren't correctly handled. For bl at least it seems not all occurrences of bl result in this doubling, e.g. doblar does but sublim doesn't, yet they have the same structure in terms of number of syllables, word shape, position of the accent, etc. What do you recommend? I tried manually adding written g.gl to segle, writing it as seg.gle, but then Valencian also gets the doubling, which is wrong. I see two approaches: (1) Manually require all doubled bl and gl to be written as bbl and ggl except maybe in certain suffixes (e.g. -able(s), -ible(s)), and have the Valencian-specific code remove the doubling and convert it back to single stops; (2) Double bl and gl by default. This would mean we'd need some method of indicating the non-doubled occurrences, maybe by writing sub.lim or something (although this might be problematic when we start providing phonetic output with fricative [βɣð], which I'd like to do soon; not actually sure though if there will be an issue). Thoughts? Benwing2 (talk) 07:02, 12 January 2024 (UTC)
The groups -bl- and -gl- are geminate in Central and Balearic in post-stressed position: poble /ˈpɔb.blə/, regla /ˈreɡ.ɡlə/, including endings -able, -ible. That can be coded in the module. It doesn't happen in Valencian, nor in pre-stressed position, as in sublim. But all its derivatives are also geminate even if in pre-stressed position: poblar, població, reglar, reglament, ... That needs to be respelled pobblar, pobblació, regglar, regglament, and then undone in Valencian. Vriullop (talk) 08:22, 12 January 2024 (UTC)Reply
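The rule just described can be sketched as follows, in Python rather than the module's Lua. This is an assumption-laden illustration: the function name `adjust_bl_gl`, the lowercase dialect strings, and the idea that the stressed vowel always carries an accent mark in the respelling are all invented for the example, and lexical exceptions are not handled.

```python
import re

# Vowels carrying an explicit stress/quality mark in the respelling (assumed set).
STRESSED = "àèéêëíòóôú"
# A single b or g directly after a stressed vowel and before l.
POST_STRESS_CLUSTER = re.compile(rf"([{STRESSED}])([bg])l")


def adjust_bl_gl(respelling: str, dialect: str) -> str:
    """Geminate post-stress bl/gl outside Valencian; undo bbl/ggl in Valencian."""
    if dialect == "valencian":
        # Valencian never geminates, so manual bbl/ggl respellings are undone.
        return respelling.replace("bbl", "bl").replace("ggl", "gl")
    # Central and Balearic: double the stop after a stressed vowel.
    return POST_STRESS_CLUSTER.sub(r"\1\g<2>\g<2>l", respelling)
```

Pre-stressed derivatives such as poblar are untouched by the automatic rule, so they rely on the manual respelling pobblar, which the Valencian branch then converts back to poblar.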
@Vriullop Got it, thanks. I'll implement this. What do you think of just providing phonetic output and changing the /.../ to [...]? This seems consistent with what the various dictionaries do; or at least, they explicitly show the fricative allophones [βɣð]. This would mean, for example, that the issue of whether to display [ŋ] goes away: we just display it whenever it's pronounced as such. Benwing2 (talk) 20:04, 12 January 2024 (UTC)Reply
I have implemented what you said for -bl- and -gl-. I am currently working on auto-adding secondary stress to adverbs in -ment. (In the process I'm adding a quick shorthand to indicate a part of speech for a given term, e.g. n/RESPELLING or just n/ for a noun, a/RESPELLING or just a/ for an adjective, etc. The idea here is that terms in -ment default to adverbs, which means they get secondary stress by default, but you can override this by specifying n/ for a noun like desembarcament or a/ for an adjective like vehement. Some terms need both a part of speech and respelling, e.g. desdoblament needs n/[bbl] to indicate that it's a noun and the -bl- is pronounced /bbl/.) I have a question though about this. Adverbs in the DNV are indicated with *primary* stress on the preceding component and no stress on -ment, e.g. see [2] for feliçment. This seems rather strange to me and it's contrary to what the Wikipedia article on Catalan phonology says. Is this really true or is it just something weird in the DNV? Benwing2 (talk) 23:40, 12 January 2024 (UTC)Reply
BTW I found an exception to the rule that post-stressed -bl- is geminate: bíblic (and Bíblia). Are there others? If so and given how many exceptions there are in the other direction, I wonder if we shouldn't just make all -bl- and -gl- geminate by default in Central Catalan and Balearic, and require that all cases where this doesn't happen get rewritten using [b.l] or [g.l]. Benwing2 (talk) 04:01, 13 January 2024 (UTC)Reply
I implemented the auto-adding of secondary stress to adverbs in -ment, along with the part of speech hints described above, and fixed up all nouns and adverbs in -ment appropriately. (I actually added pronunciations to all or almost all nouns and adverbs in -ment that were missing them; this took several hours for adverbs because there are around 800 of them in -ment, and many of them have secondarily stressed e or o, which needed looking up.) The mid vowel hint now applies to the part preceding the adverbial -ment, not to the -ment itself (which is always pronounced /men(t)/ with /e/). Note also that in the future, these part of speech hints can also help with things like terms in -ar, where adjectives in Central Catalan pronounce the final -r but nouns and verbs generally don't. Benwing2 (talk) 07:33, 13 January 2024 (UTC)
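For readers unfamiliar with the part of speech hints, the parsing can be pictured with a small Python sketch. The function name `parse_pos_hint` and the returned tuple are hypothetical, only the n/ and a/ codes mentioned in this thread are included, and the actual Lua implementation may differ.

```python
# Part-of-speech codes mentioned in the discussion; others may exist.
POS_CODES = {"n": "noun", "a": "adjective"}


def parse_pos_hint(spec: str, spelling: str):
    """Split an optional 'n/' or 'a/' hint from a respelling spec.

    Returns (part_of_speech, respelling). With no hint, terms in -ment
    default to adverbs; an empty respelling falls back to the spelling.
    """
    pos = None
    respelling = spec
    prefix, separator, rest = spec.partition("/")
    if separator and prefix in POS_CODES:
        pos = POS_CODES[prefix]
        respelling = rest
    if pos is None and spelling.endswith("ment"):
        pos = "adverb"  # the default that triggers secondary stress
    return pos, respelling or spelling
```

So `parse_pos_hint("n/[bbl]", "desdoblament")` yields a noun with the substitution [bbl], while an empty spec for feliçment yields an adverb, which is the case that would receive secondary stress.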
OK, from the GDLC it looks like there are actually three ways that -bl- can be pronounced: obligar has [βl], doblar has [bbl], and obliterar has [bl]. Is that correct? If so I'll need to come up with some notation to distinguish these three. Maybe we should write o-bliterar to get [bl]; this is consistent with words like hipoglucèmia, which have hard single [gl] following a prefix with secondary stress [ìpuglusɛ́miə]. This would suggest a respelling hípo-glusèmia. Then if we need post-stressed [βl], we write e.g. Bíb.lia, and if we need post-stressed [bl] for some reason we'd write e.g. Bí-blia or something, and to get post-stressed [bbl] we'd write e.g. Bíbblia (or rely on the default). Make sense? Sorry to dump so much text on you. Benwing2 (talk) 09:25, 13 January 2024 (UTC)Reply
Great work here.
  • The inclusion of allophones βɣðŋɱ does not require changing the transcription to brackets [...]. In fact, /β/ is not a w:voiced bilabial fricative but a simplification without diacritic of an approximant [β̞]. Catalan works follow a convention of "broad transcription" with the inclusion of what is considered relevant and without any claim about phonemic values. A purely phonemic transcription is a theoretical discussion. According to different authors, between 25 and 31 phonemes can be considered in Catalan. For example, the schwa is a predictable dialectal allophone, but it is relevant in contrast with other Romance languages. If it were necessary to mark that it is not strictly phonemic, frwikt uses \backslashes\. They are also used by Merriam-Webster as a notation for its own IPA transcription. The criteria followed in enwikt do not seem consistent enough to me.
  • The DNV does not show primary and secondary stress, nor does it do so in compound words. It is more noticeable in Eastern dialects without schwa in secondary stress. The stress shown in adverbs with -ment is misleading.
  • 'Bíblic' and 'Bíblia' are the only exceptions to geminate bl.
  • I have not found any explanation for 'obliterar' and 'hipoglucèmia'. See https://giec.iec.cat/textgramatica/codi/4.4.3.3. Maybe as a cultism in very formal speech, but I don't think it is worth making exceptions here. On the contrary, note that /β/ does not happen in Balearic and formal Valencian after a vowel, that is, in dialects that distinguish /b/-/v/.
Vriullop (talk) 09:17, 15 January 2024 (UTC)
@Vriullop Thanks for your response, this is very helpful. I am currently working on fixing up terms with written x (there are a lot of mistakes) but I'm almost done with the offline portion and I think next I'll focus on adding the fricative allophones and correctly handling multiple words. For handling multiple words I need to know the following:
  1. What are the unstressed words? I assume they are all the proclitic object pronouns em, et, es, el, la, els, les, li, ens, us, ho, hi, en; plus the enclitic ones -me, -te, -se, -lo, -la, -los, -les, -li, -nos, -vos/-us, -ho, -hi, -ne (which might already be handled correctly); the contracted ones with apostrophe (which may already be handled correctly); maybe the unstressed possessives mon, ma, mos, mes, ton, ta, tos, tes, son, sa, sos, ses; the prepositions a, de, per, amb (and obsolete ab?), en (what about cap, des?); the prepositional contractions al, als, del, dels, pel, pels; articles el, la, els, les (already handled as proclitic pronouns), personal articles en, na (what about indefinite articles un, u, uns?); maybe salat articles es/ets, sa, ses, so, sos; the conjunctions i, o (what about si?). Any others?
  2. Which assimilation rules apply across words? The Wikipedia article Catalan phonology says that final -s voices before a vowel, which seems to cause a preceding consonant to voice as well, hence tots els has /dz/ in the middle. I assume that lenition of written b d g occurs across word boundaries as well. What about final omitted -r? Does it reappear before a vowel in the next word, e.g. in a phrase like vaig amar una dona? (And for that matter, does the -ig in vaig become voiced in this phrase?) Do you have any references on this?
Thanks again. Benwing2 (talk) 09:57, 15 January 2024 (UTC)
The list is correct: proclitic and enclitic pronouns, unstressed possessives, prepositions but not 'cap', 'des', contractions, articles including personal ones and salats, indefinite articles but not 'u', conjunctions including 'si' and 'ni', and also que as a pronoun and conjunction.
In general, contact between words triggers the same processes of assimilation, voicing, or devoicing as inside words. A typical example is els avis /əlz/, els savis /əls/, and tots els is really /ˈtodz.əls/, and vaig amar /ˌbad͡ʒ.əˈma/. The final -t reappears when followed by a vowel (sant Antoni /ˌsan.tənˈtɔ.ni/). The final -r of infinitives only reappears when followed by a pronoun (anar-hi /əˈna.ɾi/). From chapter 4.4 onwards of the IEC grammar you can find a lot of examples. Vriullop (talk) 12:37, 15 January 2024 (UTC)
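One piece of this, the voicing of word-final /s/ before a vowel or voiced consonant, is simple enough to sketch in Python. This is a deliberately narrow illustration: `voice_final_s` and the `VOICED_ONSETS` inventory are assumptions for the example, and the voicing of a preceding stop (as in tots els) is not modelled.

```python
# Segments treated as voiced onsets: vowels plus voiced consonants (assumed inventory).
VOICED_ONSETS = set("aeiouəɛɔbdɡgvzʒmnɲŋlʎrɾjwβðɣ")


def voice_final_s(word: str, next_word: str) -> str:
    """Voice a word-final /s/ to /z/ when the next word starts with a voiced segment."""
    onset = next_word.lstrip("ˈˌ")[:1]  # skip leading stress marks
    if word.endswith("s") and onset in VOICED_ONSETS:
        return word[:-1] + "z"
    return word
```

Applied to the examples above, /əls/ before avis comes out as /əlz/ and before savis stays /əls/.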
@Vriullop Thanks again for your help. I finally finished most of the work on multiword support. Still to go is approximant allophones of b/d/g, correct handling of apostrophes (represented with ‿), and ‿ as an indicator of liaison in respelling for cases like Sant Antoni respelled Sànt‿Antòni (which should produce /ˌsan.tənˈtɔ.ni/). I (more or less) read chapter 4.4 in the IEC grammar and I notice it also talks about certain cases of total assimilation where maybe cap de is pronounced /kad də/ or something, but I'm not sure we should implement that. I have some questions though:
  1. Brunsvic (as in e.g. Nova Brunsvic) given as [bɾunzvík] in GDLC, is the v correct?
  2. For drets humans, the module currently generates /ˈdɾɛdz uˈmans/, is that correct?
  3. fer cas, fer acte de presència: Is the <r> pronounced in Central Catalan?
  4. Sant Llorenç de la Salanca: the module currently generates /ˈsaɲ ʎuˈɾɛnz də lə səˈlaŋ.kə/ for Central and /ˈsand ʎoˈɾɛnz de la saˈlaŋ.ka/ for Valencia; correct? In general, does final -ç voice when the next word begins with a vowel?
  5. The IEC grammar is equivocal about whether b/d/g become fricatives after /r/, /ɾ/ and /z/, what should we do in this case?
  6. It appears double schwa /əə/ is often compressed to single schwa /ə/ in Central and maybe Balearic, but not in Valencian. This is indicated in GDLC and seems to operate fairly consistently if the second schwa is in a closed syllable (sobreescalfament, contraescarpa), but only sometimes in an open syllable (centreafricà, contraatac). Can you comment here? Likewise, /i/ or /u/ followed by schwa seems to elide the schwa in aeroespacial, autoescola, antiespasmòdic, but only sometimes if the schwa is in an open syllable (hence not in autoerotisme, antiemètic but yes in fotoelèctric, fotoelectricitat, macroeconomia). Likewise /uu/ seems to compress to /u/ if the second /u/ is in a closed syllable (microorganisme), but only sometimes in an open syllable. How do you think we should handle these cases?
  7. I am trying to figure out what to do for written <tn>, <tm>, <tl>, <tll>. It seems that these tend to be pronounced as geminates in native words (e.g. cotna, setmana) but with [d] in cultisms/learned words. I'm thinking maybe we should make the cultism behavior the default and require respelling for the remainder, at least for <tm> where there are more terms like ritme, aritmètic, atmosfera than terms like setmana. But maybe this should differ depending on the different spellings, e.g. <tl> even in a cultism like atlàntic seems to have a geminate in it in Central Catalan but not in Valencian. Can you comment on what you think should be done?
Benwing2 (talk) 22:45, 26 January 2024 (UTC)
Note, I also revamped the testcases, see Module:ca-IPA/testcases (which demonstrate there's still a lot to fix). Benwing2 (talk) 23:26, 26 January 2024 (UTC)Reply
  • Brunsvic is strange. It is supposed the GDLC includes pronunciation from the Diccionari ortogràfic i de pronúncia (DOP), but it turns out that the DOP does not include proper names. For non-Catalan place names I check ésAdir, a website for radio and tv journalists, and it shows /'bɾunz.βik/ as I expected.
  • 'Drets humans' is correct.
  • 'Fer cas', 'fer acte', are correct. The r of infinitives only reappears when followed by pronouns: fer-se /ˈfer.sə/, fer-hi /ˈfe.ɾi/, fer-t'ho /ˈfer.tu/...
  • 'Sant Llorenç de la Salanca' is correct. Final /s/ of Llorenç is voiced /z/ followed by a voiced consonant or by a vowel.
  • The IEC grammar is too descriptive about approximants, when they may or may not appear. Considering that /β/ is rare in dialects with contrast /v/-/b/, that is Balearic and Valencian, and trying to be consistent with GDLC and DNV:
    • No approximants r/s + b/d/g in Central.
    • No approximants r/s + b in Balearic and Valencian.
    • Approximants r/s + d/g in Balearic and Valencian.
  • In general, the concurrence of two identical vowels /əə/ (or /aa/, /ee/), /uu/ (or /oo/) is reduced to a single vowel. Variations may depend on formal v. informal, or common use v. cultism, or emphasis of some prefixes. It is hard to define any exception.
  • Written <tm> and <tn> are geminated in a handful of inherited words: cotna, reguitnar, setmana and its derivatives. But 'setmana' has a single /m/ in Valencian. 'Vietnamita' and 'sotmetre' are hesitant. Others like 'ritme', 'ètnic', 'algoritme' are cultisms with /dm/.
  • Written <tl> is always /ll/ in Central and Balearic. In Valencian it is /ll/ in inherited words and /dl/ otherwise. Valencian inherited words include those with alternative spelling <tll>: ametla > ametlla, butla > butlla...
  • Written <tll> as alternative spelling of inherited <tl> is pronounced /ʎʎ/ in Central and /ll/ in Balearic and Valencian. Although the DNV includes 'ametlla', 'butlla'... it is not really used, and if written it is still pronounced as <tl>. As a cultism, like 'ratlla', 'bitllet' or 'butlletí', it is pronounced /ʎʎ/ in Central and /ʎ/ in Balearic and Valencian.
Vriullop (talk) 10:54, 29 January 2024 (UTC)
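The three approximant rules for b/d/g after r or s given in the list above amount to a small per-dialect lookup. A Python sketch, with the hypothetical name `lenites_after_r_or_s` and lowercase dialect strings chosen for the example (the module itself is in Lua and may organize this differently):

```python
def lenites_after_r_or_s(stop: str, dialect: str) -> bool:
    """Return True if b/d/g is shown as an approximant after r or s."""
    if stop not in ("b", "d", "g"):
        raise ValueError(f"expected b, d or g, got {stop!r}")
    if dialect == "central":
        return False  # no approximants after r/s in Central
    if dialect in ("balearic", "valencian"):
        return stop in ("d", "g")  # b stays a stop; d and g lenite
    raise ValueError(f"unknown dialect: {dialect!r}")
```

Keeping the decision in one function like this makes it easy to revise if GDLC or DNV turn out to disagree with the summary.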
@Vriullop Thanks. I have (already) implemented most of the above things. I haven't yet implemented reduction of adjacent unstressed vowels or redone the implementation of <tl> and <tll>. As for Sant Llorenç de la Salanca, the module formerly generated [ˈsand ʎoˈɾɛnz ðe la saˈlaŋ.ka] for Valencia (note the [d] in /sand/) but I am guessing this is wrong, so I changed it so it now generates [ˈsaɲ ʎoˈɾɛnz ðe la saˈlaŋ.ka]. Basically I am guessing that elision of stops after nasals happens in Valencia before a consonant but not a vowel or utterance-finally. Is this correct? Benwing2 (talk) 01:53, 30 January 2024 (UTC)Reply
I didn't notice 'sant'. It is correct, elision of t and assimilation of the nasal before a consonant, not before a vowel or isolated.--Vriullop (talk) 08:00, 30 January 2024 (UTC)Reply

Your bot is removing valid categories

e.g. {{C|de|Western Sahara}} at Westsahara. —Justin (koavf)TCM 00:55, 1 January 2024 (UTC)Reply

@Koavf This is unavoidable. When you add a page to a category, sometimes it takes a little while for the category to register having the page in it, and in the meantime it shows up in CAT:Empty categories, which is what I use periodically to delete empty categories. I check that category before deleting the empty categories referenced, but I can't notice everything. Any non-empty categories so deleted will get re-created in a few days in any case. Benwing2 (talk) 01:06, 1 January 2024 (UTC)Reply
What are you talking about? That category was on that page for 5.5 years and your bot removed it for no reason. How is that unavoidable? Are you telling me that your bot is going to re-add all of these categories and undelete them as well? —Justin (koavf)TCM 01:09, 1 January 2024 (UTC)Reply
Dude, fuck off. Seriously. Yelling at me is not going to get me to help you any quicker than writing nicely.
As for my response, I thought you were referring to my recent deletion of empty categories (as of a few hours ago) rather than a bot change from a month and a half ago. In the future I'd recommend you link to the specific diff. My removal of the category at that time was a by-hand change, not a script change, even though the bot pushed the change; that's what "manually assisted" means (and I have a strong feeling I've already explained this to you). The reason for the removal is that Module:place normally auto-adds categories of this nature, and I thought it would in this case; the reason it didn't is apparently because Western Sahara is listed in Module:place/shared-data as a country, but its definition identifies it (correctly) as a territory rather than a country. I'll fix this so it gets correctly auto-added. Benwing2 (talk) 01:30, 1 January 2024 (UTC)Reply
I was much nicer than you were just now and was in no sense "yelling". There was no reason for that language. I didn't realize that what I wrote was ambiguous and I thought that referring you to the entry would be sufficiently clear where you can see what your bot (or script or by-hand you) did. Thanks for agreeing to fix this and undelete all of these categories. When will this happen? —Justin (koavf)TCM 22:18, 1 January 2024 (UTC)Reply
When will you or your bot undo these category removals? —Justin (koavf)TCM 22:42, 15 January 2024 (UTC)Reply
@Koavf Which removals are you referring to? Specifically to do with Western Sahara, or are there any others? Benwing2 (talk) 22:44, 15 January 2024 (UTC)Reply
The only ones I am aware of are removals of the sort {{C|CODE|Western Sahara}} which emptied several categories that were then deleted. I'm not familiar with any others. —Justin (koavf)TCM 22:46, 15 January 2024 (UTC)Reply
When will you or your bot undo these category removals? —Justin (koavf)TCM 01:37, 21 January 2024 (UTC)Reply
@Koavf Did you not get my ping? I did this days ago. Benwing2 (talk) 02:37, 21 January 2024 (UTC)Reply
I see that it did and no, I didn't. For some weird reason, I also did not get updates for this thread even after subscribing. :/ Thanks a lot. —Justin (koavf)TCM 10:16, 21 January 2024 (UTC)Reply

Twice-borrowed terms

I looked up παλάβρα, which is from παραβολή after passing through Ladino, and found out that, after moving all the "twice-borrowed terms" categories to "terms borrowed back into", there are still lots more Greek twice-borrowed terms than Greek terms borrowed back into Greek. This may also be true of other languages. Can you look into it? PierreAbbat (talk) 16:43, 1 January 2024 (UTC)Reply

@PierreAbbat It’s because they were added manually due to the origin being Ancient Greek, which is a misuse of the category imo. Theknightwho (talk) 19:17, 1 January 2024 (UTC)Reply
Yeah @Pierre, if I may expand on what Theknightwho said, it is indeed because of Ancient Greek being considered a separate language, and this is discussed at Wiktionary:Beer_parlour/2023/November#Does_'terms_borrowed_back_into_LANG'_include_cases_where_the_borrowing_was_from_an_ancestor? (and actually quite a few other places over the years, e.g. Wiktionary:Etymology_scriptorium/2016/June#Twice-borrowed_term_or_term_derived_from_an_older_stage_of_the_same_language?, Wiktionary:Beer_parlour/2011/October#Twice-borrowed_terms), and ... it's tricky. Because ... while I'm sympathetic to the potential complaint that it's somewhat arbitrary that a word used in the modern form of Hebrew or Latin (or Chinese) and derived from the variety spoken two thousand years ago can be automatically categorized as "borrowed back" while a word in modern Greek or English can't be, just because we decided it was most convenient to handle the changes those languages underwent as still being ==Hebrew==, ==Latin== (or ==Chinese==) but decided to split the changes Greek underwent between two languages ... we do have to draw a line somewhere or else we get into absurdities (e.g. a term from Proto-Indo-European, which went into French, and was borrowed into English, is twice-borrowed/borrowed-back?), and if we draw the line anywhere other than "whatever we've decided to consider a separate full language", it gets fuzzy and messy fast. But please comment in the November BP discussion linked above if you have suggestions. - -sche (discuss) 19:45, 1 January 2024 (UTC)Reply

New :toBcp47Code() method

If I interpret this recent change to Scribunto correctly, it provides a way to convert from MediaWiki langcodes to proper langcodes directly. Might be worth incorporating, as I imagine it’ll simplify some of our code, and I think you’re more familiar with that side of things than me. Theknightwho (talk) 15:50, 2 January 2024 (UTC)Reply

@Theknightwho Unfortunately I'm not sure this is useful for our purposes. Wiktionary language codes aren't always the same as MediaWiki language codes and I don't think we ever need to convert MediaWiki -> BCP47; instead if anything we'd need to convert MediaWiki <-> Wiktionary and Wiktionary -> BCP47. Benwing2 (talk) 22:47, 15 January 2024 (UTC)Reply

Addition to quotation-template documentation

I just fixed a module error caused by WF converting a quote to {{quote-book}} without checking what goes where. The template documentation is thoroughly organized, voluminous, and useless for figuring out how to fix parameter values in the wrong slots. I was going to add a little index of positional parameters, but that would have required reverse-engineering your documentation module. Instead, I'm just going to dump a mockup here, and let you deal with it:

Positional parameters
Position  Description       Equivalents          See group
1         Language code(s)  -                    Quoted text
2         Year              |year=               Date
3         Author            |author=             Author
4         Title             |title=              Title
5         URL               |url=                Title
6         Page              |page=, |pages=      Page and line
7         Quote             |text=, |passage=    Text
8         Translation       |t=, |translation=   Text

An alphabetical index of parameter names might also be nice.

And, no, I don't want fries with that...

Thanks! Chuck Entz (talk) 06:14, 5 January 2024 (UTC)Reply

@Chuck Entz Yeah there are so many params that organizing them properly is a very challenging task. For this reason I tried to do away entirely with positional params but some people squawked loudly enough that they are kept for {{quote-book}} and {{quote-journal}}, and disallowed for the rest. I think your mockup is a good idea. Benwing2 (talk) 06:17, 5 January 2024 (UTC)Reply

Using the Old French conjugation table as an inspiration

I was trying to create a more complex conjugation table for the Old Spanish language. Then I started viewing other templates and learned that the one used for the Old French language is perfect. I might be able to make some basic edits to adapt it to the Old Spanish conjugation system. However, I couldn't extract a sample of that template to edit, as so many pieces are linked together. So would you please share with me a simple, editable sample of the template of the Old French language so I can apply it to this page: Cantar? Besides, it would help standardize Wiktionary. Thalyson2019 (talk) 05:42, 6 January 2024 (UTC)

@Thalyson2019 The Old French conjugation tables aren't implemented using templates but rather using a module: Module:fro-verb. I agree that it's a good base to start with when designing a conjugation system for a language that wasn't really standardized. I'm not sure if you are comfortable working in Lua, because the module is written in Lua and it's not really possible to do what it does just using template syntax. Benwing2 (talk) 05:57, 6 January 2024 (UTC)Reply
Is there any solution for that? I already have the verbs and their positions in mind. I'm not familiar with Lua, even though I create basic templates. Thalyson2019 (talk) 06:08, 6 January 2024 (UTC)Reply
@Thalyson2019 You'd have to get someone to create the Lua module for you. I can't commit to something like this right now as I have already committed to several other projects. However if you create some mockups and link them here, then if/when I or someone else is able to contribute, the mockups can be a good starting point. Benwing2 (talk) 06:10, 6 January 2024 (UTC)Reply
Should such mockups be in the form of code or pictures? Thalyson2019 (talk) 06:14, 6 January 2024 (UTC)Reply
@Thalyson2019 Maybe some sample template calls for some simple verbs like cantar and some complicated verbs as well (tener? ir?). I or anyone working on this would in addition need some good resources on Old Spanish verb conjugation. Benwing2 (talk) 06:18, 6 January 2024 (UTC)Reply

Finnish inflections

Hey Benwing, I know that WingerBot is used to mass-create the inflection pages for Romance verbs. Is there any way that it could do similar work with Finnish noun forms? According to Jberkel's last data dump there are literally millions of Finnish redlinks, most of which appear to be nouns, so bot help is probably necessary to make a real dent. Thanks for your time! Vergencescattered (talk) 20:01, 6 January 2024 (UTC)Reply

@Vergencescattered: have you talked to @Surjection about this? As a native speaker with a bot, they would be a more logical choice, and more likely to be aware of potential problems. Chuck Entz (talk) 20:35, 6 January 2024 (UTC)Reply
@Vergencescattered I agree with Chuck. Also pinging @Hekaheka. E.g. there may be a reason these forms aren't created (too many of them?). Benwing2 (talk) 21:20, 6 January 2024 (UTC)Reply
There are probably somewhere around 200,000 nouns in Finnish and each has 30 inflected forms (15 cases in singular and plural) without taking into account any suffixes. This is the rough number found in Nykysuomen sanakirja. Adding dialects and slang one gets roughly to half a million or more. That would give 6 to 15 million entries. If we add the six possessive suffixes (the third-person possessive suffixes are the same for plural and singular, but to compensate for this potential simplification there are two of them), the number of potential entries increases to 40 to 100 million. Some of the forms might be unattestable, as the abessive, comitative and instructive are quite rarely used, but that does not cut more than 20% of the total. On top of this, each verb has close to one hundred inflected forms if we take into account the possessive forms of some infinitives and participles.
This leads me to think that we might need a new approach to inflected forms in general. Perhaps they should have an entry of their own only in such rare cases in which the inflected form has a meaning or meanings that cannot be readily derived from the lemma form. In most cases the system would work so that a search for an inflected form would redirect to the article of the lemma form. --Hekaheka (talk) 23:33, 8 January 2024 (UTC)Reply
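The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the noun counts and the 30 case forms come from the comment, while the multiplier of seven (each form either bare or with one of the six possessive suffixes) is an assumed reading of how the 40 to 100 million figure was reached.

```python
CASE_FORMS = 30            # 15 cases x singular and plural, as stated above
POSSESSIVE_SUFFIXES = 6    # count given in the comment above


def form_count(nouns, with_possessives=False):
    """Rough number of inflected noun forms for a given number of nouns."""
    forms = nouns * CASE_FORMS
    if with_possessives:
        # Assumption: every case form occurs bare or with one of six suffixes.
        forms *= 1 + POSSESSIVE_SUFFIXES
    return forms


for nouns in (200_000, 500_000):
    print(nouns, form_count(nouns), form_count(nouns, with_possessives=True))
```

Under these assumptions the totals come to 6 and 15 million without possessive suffixes and 42 and 105 million with them, in line with the rounded figures quoted.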
@Hekaheka It would be great if MediaWiki could autogenerate the text of an inflected form, but in its current state it can't do either that or redirect from an inflected form to a lemma form. IMO the most useful thing about having inflected forms entered as such is when you have homophones or homographs between different inflected forms. This occurs fairly often in the Romance languages, for example, between noun and verb forms or between adjective and verb forms. It also occurs fairly often in Russian between noun and verb forms but rarely for adjectives except for short forms of adjectives; for this reason I have never done a bot run to create Russian adjective forms (besides the fact that there are a lot of them). If Finnish grammar is largely regular and doesn't have a lot of homonyms, I would think it's not useful to have inflected forms generated. I suppose for the moment we need to use our judgment as to whether it's worth it to create such forms. Benwing2 (talk) 23:38, 8 January 2024 (UTC)Reply
I would definitely appreciate their input! I didn't know about Surjection or their bot before you mentioned them, so I apologize for bothering you about it. Thank you! Vergencescattered (talk) 23:27, 6 January 2024 (UTC)Reply

Request to deploy {{szy-pron}}

I've created a Sakizaya pronunciation template, and I need help deploying it to all Sakizaya language entries on Wiktionary. Could you assist with this using your bot account? --TongcyDai (talk) 17:29, 7 January 2024 (UTC)Reply

@TongcyDai What needs to be done here? Are there any cases where manual respelling or other help for the template is needed? Benwing2 (talk) 22:54, 7 January 2024 (UTC)Reply
When adding the template, simply insert {{szy-pron}} into each Sakizaya entry, no parameters and respelling are needed. TongcyDai (talk) 10:16, 8 January 2024 (UTC)Reply
Please let me know if there's anything else you need from me to deploy the template. --TongcyDai (talk) 18:38, 1 March 2024 (UTC)Reply

Relational -> demonym

Could you clean up Spanish demonyms like diff? It makes more sense than categorizing 900+ demonyms as relational adjectives just because they don't have a one-word translation in English. Ultimateria (talk) 19:23, 7 January 2024 (UTC)Reply

@Ultimateria Hi, I actually wrote a script awhile ago to do exactly this but never ran it. I don't remember why; maybe it needed a few fixes. I'll go ahead and finish this. Benwing2 (talk) 22:52, 7 January 2024 (UTC)Reply

Revert adding acceleration forms to {{pl-conj-ai}}

Hi @Benwing2. You just reverted the changes to the template {{pl-conj-ai}}. Could you please elaborate on what was broken? So I could see how it could be fixed while preserving the benefit of the acceleration forms? Incidentally, similar changes have been made to other templates, so the same error could arise for other verbs. You are referring to active adverbial participles, for which only one single form was used before, even though those adverbs have different forms depending on plural/singular and gender. Maybe the breaking tool needs to be updated to cater for those other forms. @Vininn126 JuChelou (talk) 14:04, 25 January 2024 (UTC)Reply

@JuChelou For one thing, the specific value of 'active adjectival participle' (along with various other specific values) is processed specially in Module:accel/pl and causes the inflection to be set to 'actv|adj|part'. By changing this you broke this support, and caused it to use an invalid inflection tag set 'm|s|active adjectival participle'. The other inflections of the participle were similar. The correct thing to do is to leave the masc sing participial forms unchanged and if you want to add acceleration to the other forms, they should cause the form to be created as e.g. {{feminine singular of|pl|PARTICIPLE}} rather than as an inflection of the verb. You can see an example of how to handle this correctly by looking at the lines starting at Module:accel/pt#L-21. Benwing2 (talk) 22:50, 25 January 2024 (UTC)Reply
Thank you @Benwing2 for your reply. @Vininn126
I tried something in Module:accel/pl and {{pl-conj-ai}} to add proper accel form support for the adjectival participles.
However, I am not fully satisfied with the result because:
1/ on the masculine singular form, it could add 2 forms, for example for wyrzucający wyrzucać
2/ the result would not be similar if the new wiki page is triggered from the conjugation chart or from the adjective declension chart (which I also added recently). For example, for wyrzucające, the new wiki page triggered from the verb link would "miss" the fact that it is also the form for accusative neuter and accusative non virile.
Any advice? Or should I just ditch the extra accel forms for the participle and contributors would use the new accel links from the adjective declension module? JuChelou (talk) 16:18, 26 January 2024 (UTC)Reply
In theory you could generate wyrzucający and from there generate the others, but it's less than ideal. Vininn126 (talk) 16:40, 26 January 2024 (UTC)Reply
@JuChelou Hmmm, I'm not quite sure how to handle #2; either you'd have to add all the non-nominative forms of the participles to the verb table so that the accelerator code knows about them automatically, or you'd have to hack the code in MOD:accel/pl somehow to add the remaining inflections in. (This latter thing is possible, as I think I added a hook that you can define in the accelerator module that operates at the end after all the inflections have been combined.) As for #1, the general principle I've followed is not to include definitions for non-lemma forms that are identical in spelling to the lemma. I followed this principle, for example, when I created a bot script to add Russian noun inflections. This also happens in Portuguese verbs (where the 1st and 3rd singular future subjunctive usually looks the same as the infinitive), and for Latin feminine nouns (where the ablative singular is spelled the same as the nominative singular, although the pronunciation is different as the ablative ends in long -ā while the nominative ends in short -a). I actually removed the cases where Portuguese verbs were defined normally but had an additional definition as the 1st/3rd singular future subjunctive, but I may have left alone the Latin ablative cases because of the different pronunciation. In the Polish case, the pronunciation is the same and so you could fix this by just not having an accelerator defined on the forms that look like the lemma.
In general, I would actually argue that instead of including only the nominative case forms, it's best not to include anything but the masculine nominative singular of the various participles in the verb table, and require that the remaining forms be defined using accelerators on the participle table, even though User:Vininn126 thinks this is non-ideal. This is how we handle participles in Russian, for example, which is similar in many ways to Polish. I think the main benefit to having non-lemma participle forms defined in the verb table is if there are irregularities in their formation, but I don't think this is the case in Polish. Benwing2 (talk) 23:20, 26 January 2024 (UTC)Reply
An additional thought is maybe we shouldn't be defining non-lemma forms of participles at all, since AFAIK they're quite regular and there are a lot of them. See the discussion above about #Finnish inflections. This is the policy we follow for Russian, for example. Benwing2 (talk) 23:22, 26 January 2024 (UTC)Reply
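The principle described for #1 (no separate definition for a non-lemma form spelled like the lemma) amounts to a simple filter. The following is a minimal sketch, not code from MOD:accel/pl or any bot script; the function name and the slot names are hypothetical, and the Portuguese verb is used because the comment above cites the future subjunctive case.

```python
def forms_needing_entries(lemma, forms):
    """Keep only the slots whose form differs in spelling from the lemma.

    `forms` maps a (hypothetical) slot name to the inflected form.
    """
    return {slot: form for slot, form in forms.items() if form != lemma}


# Portuguese "falar": the 1st/3rd singular future subjunctive is spelled
# like the infinitive, so those slots are dropped.
falar_forms = {
    "pres_ind_1s": "falo",
    "pres_ind_3s": "fala",
    "fut_sub_1s": "falar",
    "fut_sub_3s": "falar",
}
print(forms_needing_entries("falar", falar_forms))
```

A filter on spelling alone would also drop the Latin ablative singular, so a language where pronunciation differs would need an explicit exception.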
Where do we define non-lemma participles? Vininn126 (talk) 10:17, 27 January 2024 (UTC)Reply
@Vininn126 Sorry, can you clarify what you mean? Benwing2 (talk) 10:37, 27 January 2024 (UTC)Reply
I simply didn't understand your last message Vininn126 (talk) 10:59, 27 January 2024 (UTC)Reply
Thank you @Benwing2 for your very detailed answer.
Basically, regarding your recommendations for #1, it would be easy to remove the accel form for the version identical to the lemma form.
For #2, however, that would be trickier, as it would require duplicating the generation of all the forms, leaving room for discrepancies between the pl-adj module and the Polish accel module.
If I understand correctly, your overall recommendation is to remove all the other forms of the participles in conjugation templates. Basically, we would just have "active adjectival participle: masculine singular nominative form".
It would be similar to what is done for the verbal noun, where there is only the masculine singular nominative form, even though other forms exist.
@Vininn126 what would be your opinion on removing the additional forms of the adjectival participles from the conjugation templates? JuChelou (talk) 17:02, 28 January 2024 (UTC)Reply
Sounds fine to me; it's not typical to have them. Vininn126 (talk) 18:03, 28 January 2024 (UTC)Reply

On the {{quote-book}} template

Hi,
I was wondering what exactly the combined use of the parameters |start_year= and |year= is supposed to communicate.
It's supposed to mean a range of dates, but—with an example 1390–1400—is range meant:

in the sense of "the composition of this work started in 1390, and ended in 1400"?
or in the sense of "this work was probably completed (or brought to its current state, if unfinished) somewhere between 1390 and 1400"?

Thanks in advance for any clarification. I've recently discovered these parameters, and I'm not sure I've been using them properly. —— GianWiki (talk) 15:24, 25 January 2024 (UTC)Reply

@GianWiki These parameters were there before I started to clean the template up, so you might ask User:Sgconlaw, but I'm thinking it's used for works that took several years to create. Benwing2 (talk) 23:45, 25 January 2024 (UTC)Reply
I see, I hadn't noticed that. I'll try asking them just to make sure.
Thank you for your time. —— GianWiki (talk) 08:18, 26 January 2024 (UTC)Reply
@GianWiki: I don't think the parameters were clearly defined at the time when I first tidied up the {{quote-}} templates. Personally, I use them to mean a range of publication dates (for example, if a novel is originally published in parts in a magazine over many months), and if I intend a range of dates to mean anything else I add a qualification in parentheses for clarity like this: |year=c. '''1597–1600''' (date written). — Sgconlaw (talk) 10:54, 26 January 2024 (UTC)Reply

WingerBot and Welsh animal genders

Hi, your bot edited garan ("crane") and petris ("partridge") so they would be “m or f by sense”, which isn’t correct. I've corrected them, but can you amend the bot so it doesn’t edit other animals like this please?

Garan is usually a masculine noun, that can be feminine due to dialect, rather than the sex of the animal (e.g. in Iolo Williams’s Llyfr Adar and the Geiriadur yr Academi) and petris is feminine.

I’ve consulted a bit with other Welsh speakers and the only source I can see for petris ever being masculine is the Geiriadur Prifysgol Cymru, which could easily be due to one or two examples from centuries ago. “A small cock partridge” would be ceiliog petris bach – where bach modifies ceiliog, not petris.

Cheers, Arafsymudwr (talk) 15:54, 30 January 2024 (UTC)Reply

@Arafsymudwr This was a one-off run where I manually made the changes in question in a text editor and only used the bot to push the changes (that's what "manually assisted" means in the changelog message). So there's no script to amend but I'll make sure not to change the genders of animal terms in Welsh (or generally in any language, I think) in the future. Benwing2 (talk) 06:11, 31 January 2024 (UTC)Reply

Links to English possessives in inflection-line templates

I wish I had included this in my request about links to components of hyphenated terms in English inflection templates. (How's that coming, BTW?) Many vernacular names of organisms are like Gundlach's hawk (See Gundlach's hawk). It would be better, especially for me, if the link were to Gundlach rather than the possessive. I can't think of any instances for which the possessive would be a better link target and believe that any such instances are relatively rare exceptions. DCDuring (talk) 16:29, 31 January 2024 (UTC)Reply

@DCDuring Yes, in fact my concerns over how to handle apostrophes are why this hasn't already gotten done. I'm thinking that we should split any term with a trailing 's except for one's and someone's (with exceptions also maybe for he's, she's, it's), but not split other terms with apostrophes (e.g. I'm, don't, haven't). BTW I notice that we've split apostrophe-s into two terms, 's for the contraction and -'s for the possessive. Personally I think this is confusing and probably they should be merged into 's (without the hyphen). It also makes auto-linking more difficult; probably we should link all occurrences of 's into -'s since this is the more common case. Benwing2 (talk) 22:07, 31 January 2024 (UTC)Reply
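The splitting rule proposed above can be sketched as follows. This is only an illustration of the proposal as worded in the comment, not the behavior of any existing module: the function name is hypothetical, the exception list includes the "maybe" cases (he's, she's, it's), and only the straight apostrophe is handled.

```python
# Words ending in 's that the proposal would leave unsplit.
NO_SPLIT = {"one's", "someone's", "he's", "she's", "it's"}


def link_word(word):
    """Link a single word, splitting off a trailing 's as a link to -'s."""
    if len(word) > 2 and word.endswith("'s") and word.lower() not in NO_SPLIT:
        base = word[:-2]
        return "[[%s]][[-'s|'s]]" % base
    # Other apostrophes (I'm, don't, haven't) are not split.
    return "[[%s]]" % word


def link_expression(expression):
    """Link each space-separated word of a multiword term."""
    return " ".join(link_word(w) for w in expression.split(" "))


print(link_expression("Gundlach's hawk"))
print(link_expression("don't"))
```

With this rule, Gundlach's hawk links to Gundlach rather than to the possessive, which is the outcome requested at the top of this thread.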
This 's/-'s distinction gets to how to indicate the distinction between an inflectional ending and a contraction, doesn't it? On one level, one needs a linguistics or philosophy degree to be qualified and/or motivated to argue this, but I don't hold the right degrees. On another level, how to help users, it would seem both should be on the same page, almost certainly 's. It probably should go to BP, but you may be able to go ahead with what is convenient to implement and rely on links between 's and -'s to help users in the meanwhile. DCDuring (talk) 22:22, 31 January 2024 (UTC)Reply
@DCDuring Please see User:Benwing2/test-en-multiword for some examples of the new headword link handling system that I'm testing. It includes the ability to change the link of one (or several) of the words of a multiword expression without having to write out the entire expression; see the examples that specify |head=~.... (This functionality was already implemented for Italian and later extended to other Romance languages.) Note that if there are both hyphens and spaces, the default behavior is to link the space-separated components but not break up hyphen-separated components, although this can be changed using |splithyph=1. Possibly the default should be reversed and hyphen-separated components broken up by default unless |nosplithyph=1 is given; what do you think? Benwing2 (talk) 00:01, 2 February 2024 (UTC)Reply
I will look at it in about 16 hours. DCDuring (talk) 00:04, 2 February 2024 (UTC)Reply
@DCDuring: OK, thanks. BTW I'm thinking we should indeed change the default when there are both hyphens and spaces, and maybe make an argument to convert hyphenated terms to space-separated terms, e.g. for cases like civil-rights movement and claw-hammer coat that should be linked as [[civil rights|civil-rights]] [[movement]] and [[claw hammer|claw-hammer]] [[coat]] (likewise closed-circuit television, clock-face timetable, coffee-table book, etc.), although there are also examples like close-up lens, coin-operated laundry, context-free grammar, co-occurrence network, etc. where we do want to link the hyphenated component as such. Benwing2 (talk) 00:58, 2 February 2024 (UTC)Reply
I really like the more hyphenated forms because they reduce certain kinds of possible misreading of MWEs, but contemporary relative frequency may indicate that hyphenated forms are already much less frequent. For three-part English vernacular names of organisms, I often find that the hyphen is in the wrong place or is not useful. But black billed amazon is not a good substitute for black-billed amazon. DCDuring (talk) 01:10, 2 February 2024 (UTC)Reply
@DCDuring I have redone the handling of terms with both hyphens and spaces so that it now looks up the hyphenated term to see whether it exists in order to determine how to link it. Specifically:
  1. If the term exists as a space-separated compound, link to that. (We prefer space-separated compounds because the hyphen-separated form often exists as a soft redirect.)
  2. Otherwise, if the term exists as a hyphen-separated compound, link to that.
  3. Otherwise, link the hyphenated terms separately.
This handles most cases properly, although there are occasional situations where it fails; for example, close up and close-up both exist and are different, and by default close-up lens links (wrongly) to the former. For this reason I've provided params to override the default handling: |hyphspace=1 forces case (1) above, |nosplithyph=1 forces case (2) above, and |splithyph=1 forces case (3) above.
Benwing2 (talk) 05:27, 2 February 2024 (UTC)Reply
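The three-step lookup and its overrides can be sketched roughly as below. This is not the code of Module:en-headword; the function names, the page_exists callback, and the way case (3) renders the split parts are assumptions made for illustration. Only the order of the checks and the parameter names |hyphspace=, |nosplithyph= and |splithyph= come from the description above.

```python
def link_hyphenated(term, page_exists, override=None):
    """Link one hyphenated component following the three cases above."""
    spaced = term.replace("-", " ")
    mode = override
    if mode is None:
        if page_exists(spaced):        # case 1: space-separated page exists
            mode = "hyphspace"
        elif page_exists(term):        # case 2: hyphenated page exists
            mode = "nosplithyph"
        else:                          # case 3: link the parts separately
            mode = "splithyph"
    if mode == "hyphspace":
        return "[[%s|%s]]" % (spaced, term)
    if mode == "nosplithyph":
        return "[[%s]]" % term
    if mode == "splithyph":
        return "-".join("[[%s]]" % part for part in term.split("-"))
    raise ValueError("unknown override: %r" % (override,))


def link_multiword(expression, page_exists, override=None):
    """Link a term containing both spaces and hyphens."""
    return " ".join(
        link_hyphenated(w, page_exists, override) if "-" in w else "[[%s]]" % w
        for w in expression.split(" ")
    )


# Hypothetical set of existing pages, standing in for a real lookup.
EXISTING_PAGES = {"civil rights", "close up", "close-up", "context-free"}
exists = EXISTING_PAGES.__contains__

print(link_multiword("civil-rights movement", exists))
print(link_multiword("close-up lens", exists))
print(link_multiword("close-up lens", exists, override="nosplithyph"))
```

The second call reproduces the close-up lens problem noted above (the default picks the space-separated page), and the third shows the override correcting it.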
I hope we will never have entries for terms like scaly-headed. So I'll have to use nosplithyph=1 for a vast number of vernacular names. I may as well not have asked for this favor. I suppose I could create a new template to wrap {{en-noun}} or {{head}}, specifying the parameter, to save keystrokes for these vernacular name entries. DCDuring (talk) 13:41, 2 February 2024 (UTC)Reply
@DCDuring If you need to use |nosplithyph=1 for a large number of vernacular names, that is defeating the purpose of things. Can you explain why you think you need to use this for so many? Things like scaly-headed are SOP so should be split, IMO. Benwing2 (talk) 20:37, 2 February 2024 (UTC)Reply
I misread in haste, I think. DCDuring (talk) 22:43, 2 February 2024 (UTC)Reply
@DCDuring I have implemented the various changes to the linking behavior of Module:en-headword. They are documented on the module documentation page Module:en-headword/documentation (although the section on link modifiers is still to be written). There is text in the documentation of {{en-noun}}, {{en-verb}} and {{en-adj}} pointing to the module documentation page for the specifics about multiword linking and suffix handling. Let me know if there's anything else needed documentation-wise. Benwing2 (talk) 00:10, 7 February 2024 (UTC)Reply
The section on link modifications (renamed from link modifiers for clarity) is written. Benwing2 (talk) 00:46, 7 February 2024 (UTC)Reply

devil's own

I reverted WingerBot's edit to this entry not just because of the module error (I think you added |def= to the noun and proper noun code, but not to the adjective), but because it looks to me like the syntax is more along the lines of "[the devil's] own" rather than "the [devil's own]". Not that I would get into an edit war over this; I just wanted to make sure you were aware of that dimension before deciding how to fix things. Chuck Entz (talk) 04:23, 4 February 2024 (UTC)Reply

@Chuck Entz Thanks. Yeah I forgot about handling adjectives with the in them. As for the syntax issue, all that |def=1 does is add the before the head; it doesn't assert any particular way of parsing the constituents. I suppose it could be interpreted as asserting an analysis like the [devil's own] but that wasn't my intention (and I'm not quite sure how we'd indicate such an analysis in the head). Benwing2 (talk) 04:47, 4 February 2024 (UTC)Reply
But adjectives don't have the in them. We should review the entries that so claim and determine whether there is good reason to ever have the inside the headword template for adjectives. DCDuring (talk) 14:14, 4 February 2024 (UTC)Reply
Never mind. I was thinking of leading the. We have numerous entries of purported adjectives with the embedded. Some of them seem like attributive use of a noun, but not all. DCDuring (talk) 14:23, 4 February 2024 (UTC)Reply

Category:LANG nouns with other-gender equivalents

Hello Benwing. I hope that this does not take too much of your time. How should CAT:Telugu nouns with gendered forms be added to MOD:te-headword? I tried looking at MOD:hi-pa-headword, but could not figure out what and where to add the equivalent of:

table.insert(data.categories, data.langname .. " " .. plpos .. " with other-gender equivalents")

to MOD:te-headword. I noticed that this feature was missing for Telugu when I saw

Synonym: (female) రచయిత్రి

at the entry for రచయిత (racayita). The Lua-fication of {{te-noun}} means adding features such as this is not as easy as adding

{{#if:{{{m|}}}{{{f|}}}{{{n|}}}|{{cln|te|nouns with other-gender equivalents}}}}

to {{te-noun}}. Kutchkutch (talk) 00:46, 5 February 2024 (UTC)Reply

Adding
{{#if:{{{m|}}}{{{f|}}}{{{n|}}}|{{cln|te|nouns with other-gender equivalents}}}}
at the end of {{te-noun}} seems to work for categorisation but not for the headword line. Kutchkutch (talk) 00:59, 5 February 2024 (UTC)Reply
@Kutchkutch Glad you figured it out. IMO Module:te-headword needs to be rewritten; it wasn't written by me and doesn't really follow the standard structure for such modules, which is probably why you had difficulty figuring out how to add the appropriate code. Benwing2 (talk) 22:33, 8 February 2024 (UTC)Reply

Email

Btw, idk if you have notifications turned on for emails, but I sent you one. Vininn126 (talk) 22:24, 8 February 2024 (UTC)Reply

Thanks, I responded. For some reason I didn't get an email notification here on Wiktionary even though I do have email notifications turned on. Benwing2 (talk) 22:32, 8 February 2024 (UTC)Reply

bùzháodiào

Hello. Could you help me fix the Traditional Chinese conversion here? Thanks. ---> Tooironic (talk) 00:31, 11 February 2024 (UTC)Reply

@Tooironic What's the exact issue? BTW in general I am not too familiar with how the Trad <-> Simp conversion works; User:Theknightwho knows more. Benwing2 (talk) 00:32, 11 February 2024 (UTC)Reply
Thank you User:Theknightwho! ---> Tooironic (talk) 00:39, 11 February 2024 (UTC)Reply

hmm

How much longer is it going to take you to finally finish making this new pronunciation module for Polish? You've been doing it for several months now, hurry up, or someone might think you're getting a little lazyyy :) Gugugagasraniewbanie (talk) 08:30, 13 February 2024 (UTC)Reply

@Gugugagasraniewbanie Yeah it will happen soon. Benwing2 (talk) 08:32, 13 February 2024 (UTC)Reply
OK, then you have my forgiveness Gugugagasraniewbanie (talk) 08:35, 13 February 2024 (UTC)Reply

Mon-Burmese script

I changed some letters defined for specific languages (e.g. "X is a letter of the Shan alphabet") to that language (i.e. Shan), then added a request for definition to the translingual entry. If this is somehow considered vandalism, I'll revert myself, but I'm assuming obvious fixes like this are acceptable, and it parallels other entries that only have definitions for specific languages. (A definition might be as simple as stating that it's a letter of the Mon-Burmese script corresponding to a certain letter in Sanskrit, but I didn't do that myself as I thought I might be accused of vandalism.)

I also removed a couple pronunciations that were for the wrong entry. kwami (talk) 04:25, 14 February 2024 (UTC)Reply

@Kwamikagami "Vandalism" doesn't seem like the right word for changes that are in good faith. As to whether they are wrong or counterproductive I don't know but they seem generally fine to me. User:RichardW57 do you have any comments? Benwing2 (talk) 04:45, 14 February 2024 (UTC)Reply
Okay, "blockable offense" then. kwami (talk) 04:47, 14 February 2024 (UTC)Reply
Yeah I understand. BTW I think blocking is only likely if you edit-war or keep making changes of a specific nature after people have objected to them. (Also editors who don't know what they're doing but think they do; editors of this nature can do a lot of damage.) Wikipedia seems generally more tolerant of edit-warring, maybe because of the number of editors relative to how many articles there are. Benwing2 (talk) 04:57, 14 February 2024 (UTC)Reply
@Benwing2: Which Shan alphabet? There are several Shan languages, which often makes the letters translingual because shared by several Shan languages! The change seems backwards - I would have said that the thing to do was to waste space by adding the Shan entry. As Burmese-script words easily consist of a single letter, cloning letters to each language using them makes Wiktionary more difficult to find by eye, in accordance with the apparent aim of difficulty of use. --RichardW57 (talk) 08:49, 14 February 2024 (UTC)Reply
If there are other Shan languages besides [shn], and they use the same letter, then they should be listed. But as it was, they were not listed -- only [shn] was.
And yes, I know you want to lump all languages together, but that's not the consensus for Wikt. kwami (talk) 18:49, 14 February 2024 (UTC)Reply
We have Shan (shn), Khamti Shan (kht), Aiton (aio), Phake (phk) and Tai Laing (tle) that use the Burmese script. The Tai Nuea (tdd) (= Tai Le /Tai Dehong / Chinese Shan) (not to be confused with Northern Tai or Northern Thai) and Tai Khuen (kkh) (though their speech is more akin to Northern Thai, but they identify as Shan) use different scripts. There's also Khamyang (ksu or nrr). Tai Ahom should arguably be included, but again it has its own script. --RichardW57 (talk) 23:32, 14 February 2024 (UTC)Reply
And when we say a letter is used by [shn], do we necessarily know that it's also used by the others? E.g. in Lik-Tai for Khamti? The label "Shan" may cover multiple languages in some usage, but when Wikt has an entry for Shan [shn], we mean specifically that language. When we mean Khamti, we say Khamti. Etc. But sure -- if we can demonstrate that a letter is used by multiple languages, we can say that it's used for multiple languages. Though when giving the pronunciation and orthographic rules, we need to be careful not to present [shn] as representative if it isn't. kwami (talk) 01:23, 15 February 2024 (UTC)Reply

Seeking template help

Hi, we find your Hindi language templates very helpful. Could you assist us with essential Sylheti templates (language code: syl) on English Wiktionary? We could contribute with translations, although we are still familiarizing ourselves with Wiktionary policies. -- ꠢꠣꠍꠘ ꠞꠣꠎꠣ (talk) 07:52, 16 February 2024 (UTC)Reply

@ꠢꠣꠍꠘ ꠞꠣꠎꠣ Hi, I'm up to my ears in requests so I won't be able to get to this soon, although if someone else wants to work on it using the Hindi modules as a starting point, I can provide guidance. Benwing2 (talk) 09:55, 16 February 2024 (UTC)Reply

Category:Romance terms inherited from Latin nominatives

Hi. Sorry, I think I was a bit too 'bristly' with how I responded earlier. I really do support removing these categories and sticking the relevant content into 'Appendix: Romance terms plausibly inherited from Latin nominatives'. Nicodene (talk) 17:21, 18 February 2024 (UTC)Reply

@Nicodene This sounds good to me and "plausibly" sounds like a good term to use, and I apologize if I also was a bit in-your-face. If you can write the appendix and put the terms there in a list, I can remove the categories from the terms by bot. Benwing2 (talk) 19:56, 18 February 2024 (UTC)Reply
Done. This should actually make it easier for me to reorganise/restructure it all, which I've been meaning to do. Nicodene (talk) 20:39, 18 February 2024 (UTC)Reply
@Nicodene Thanks! Benwing2 (talk) 00:45, 19 February 2024 (UTC)Reply
@Nicodene I am going to remove the pages listed in the appendix from the '... inherited from Latin nominatives' categories. Just checking that this is OK with you. Benwing2 (talk) 04:56, 19 February 2024 (UTC)Reply
Yes, go for it please. Nicodene (talk) 05:01, 19 February 2024 (UTC)Reply
@Nicodene OK it's done. BTW the appendix is looking good and I'm glad you have included detailed notes. Benwing2 (talk) 05:26, 19 February 2024 (UTC)Reply

Macrolanguages

Hi - do you have any ideas for how we could handle macrolanguages in the data (Chinese being the most obvious example, given how we handle Chinese L2s). I’m not keen to create a whole new type of object, since this situation comes up in loads of places, as we don’t have a coherent distinction between “is a type of” and “is a descendant of”, leading to the issues I mentioned in WT:RFM#Converting Min Nan into a family, where Teochew and Leizhou Min are “descended from” Min Nan, whereas they’re actually types of Min Nan.

I suspect you’ve noticed similar things with how Persian and Latin are handled. One common situation which stands out are language periods: we list Old Latin as ancestral to Latin, but as it’s an etym-only language of Latin that technically means we’re saying it’s ancestral to itself. Same for Early Modern English and English, and so on. We get round it by adding an explicit check to Module:languages to prevent a language being ancestral to itself, but that’s a kludge which is symptomatic of our poorly defined language model.

Also see the Japonic family tree at Category:Proto-Japonic language, where the periodisation of Japanese is all messed up because they’re all treated as etym-only languages part of Japanese, even though Early/Late Middle Japanese have Middle Japanese as their immediate parent. (They currently display in the wrong order, since Middle Japanese should not be listed before Early Middle Japanese if we were to follow the same system as Latin; the data is correct but Module:family tree is bugged.) A much bigger issue is that we imply Middle Japanese is split into three periods, and that the central period is somehow representative. This is confusing at best, and outright misleading to anyone who isn’t familiar with the nuances of our data modules. Theknightwho (talk) 18:29, 18 February 2024 (UTC)Reply

@Theknightwho Since you have merged etym-only and full languages to the point that both are more or less just types of Language objects, can we not just have a "type" field identifying something as a macrolanguage? That way it will still work as a language for most purposes. IMO we do need to properly distinguish is-a-X and is-a-descendant-of-X, and it seems you've provided a way with the ancestors field. As for the issue of Old Latin vs. Latin, we do have a "Classical Latin" etym language and ultimately we need to push more in this direction, although it will require some thinking. These are just the thoughts off the top of my head. Benwing2 (talk) 19:54, 18 February 2024 (UTC)Reply
@Benwing2 Thanks - that's helpful to think about.
I'd rather not have a specific macrolanguage field, since it's superfluous to whether or not something is set as being a "type of" that language. I think the handling of Chinese, Latin, Persian, English and (one I missed above) Norwegian should probably all be done in the same way. At the most extreme end, the Sinitic family and Chinese are in fact the same thing, so I'm more inclined towards having a way to set one language as a type of another (as we do with etym-only languages), fully merging etym-only languages into languages, and then having a flag which sets whether it should be treated as a full language. That way, we also get rid of the weird half-and-half situation going on with Classical Persian and the arbitrary distribution of Chinese lects between language and etym-only language, while making it more straightforward to switch something from one to the other (e.g. the Prakrits). It may also be worth doing the same with families, since (as Chinese shows) macrolanguages and families are basically the same thing in most situations.
I think we probably need some kind of periodisation mechanism. In the case of Latin, if we're treating Old Latin as a "type of" Latin, then strictly speaking Latin's ancestor should be Proto-Italic. However, within that we could have the various periods, including Classical Latin, and there should be a way to set a default period for situations when only the generic language code is provided. For most languages that would be the standard language; in the case of Latin, it would be Classical. This would also potentially address the issue of cross-overs between regional lects and periods (e.g. Northern Early Modern English), and should also help avoid the silly Japanese situation, since periods should be possible to nest inside each other. Theknightwho (talk) 20:10, 18 February 2024 (UTC)Reply
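To make the distinction in the comments above concrete, here is a minimal Python sketch of a data model that keeps "is a type of" separate from "is a descendant of" and supports a default period. It is not how Module:languages works; the field names (`variety_of`, `full_language`, `default_period`) and the helper functions are hypothetical, and the language codes are used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Lect:
    code: str
    name: str
    # "is a type of": hypothetical field linking a variety/period to its parent lect
    variety_of: Optional[str] = None
    # "is a descendant of": genuine ancestors only
    ancestors: List[str] = field(default_factory=list)
    # hypothetical flag: whether the lect is treated as a full language
    full_language: bool = True
    # hypothetical: period assumed when only the generic code is given
    default_period: Optional[str] = None


# Illustrative data only; not copied from the real data modules.
LECTS = {lect.code: lect for lect in [
    Lect("itc-pro", "Proto-Italic"),
    Lect("la", "Latin", ancestors=["itc-pro"], default_period="la-cla"),
    Lect("itc-ola", "Old Latin", variety_of="la", full_language=False),
    Lect("la-cla", "Classical Latin", variety_of="la", full_language=False),
]}


def resolve(code: str) -> Lect:
    """Return the lect meant by a bare code, following the default period if set."""
    lect = LECTS[code]
    return LECTS[lect.default_period] if lect.default_period else lect


def ancestors_of(code: str) -> List[str]:
    """Ancestors of a lect; a variety without its own ancestors inherits its parent's."""
    lect = LECTS[code]
    while not lect.ancestors and lect.variety_of:
        lect = LECTS[lect.variety_of]
    return lect.ancestors
```

Under this model Old Latin is a type of Latin rather than its ancestor, so no special check against a language being ancestral to itself is needed: `ancestors_of("itc-ola")` and `ancestors_of("la")` both give Proto-Italic.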
@Theknightwho All this sounds good to me in general although it would be helpful if you could write out your proposals in more detail as it's sometimes a bit hard for me to work out what your thoughts are when presented abstractly. Benwing2 (talk) 20:31, 18 February 2024 (UTC)Reply
@Benwing2 Will do. I’ll also have a think about how we should handle this in the family tree display, since a lot of the confusion stems from that displaying descendants and variants/types in exactly the same way. Theknightwho (talk) 20:52, 18 February 2024 (UTC)Reply
One problem that needs to be addressed is that language change doesn't always follow a tidy tree model. Macrolanguages are messy. A macrolanguage always has a standard lect that the other lects identify with- but there can be more than one, and which lect is the standard can change over time. Even some of the more complex ordinary languages have similar phenomena. This can end up being reflected in the history of languages both within and deriving from the (macro)language.
With English, you have the same language changing its prestige/standard dialect several times in Old English due to the rise and fall from prominence of specific kingdoms: Anglia, Mercia, Northumbria, and finally Wessex (this is off the top of my head- I'm sure I missed something). With the transition to Middle English it all moved to London. Middle English borrowed heavily from Old Northern French, but since then the source has been Parisian French. Scots split off from the northern dialects that descended primarily from Northumbrian. I'm sure there were changes in the Old Norse dialects that Old English and Middle English borrowed from, and then there's the matter of Brythonic Pictish and Goidelic Gaelic in Scotland and their influence on Scots and northern English.
China had several changes in which were the prestige lects, and these are reflected in the various named yomi in Japanese, as well as the borrowings into other neighboring languages. Then there's Mycenaean Greek, which is different from whatever became Ancient Greek, and the fact that older Latin borrowings didn't come from the Attic dialect that became modern Greek, and Tsakonian that came from Doric, etc.
If you look at a regional lect, you can find things descended directly from the same region in the ancestral language, and things that came in from the standard lects of the different historical stages, and other things that were borrowed from various external languages. Sometimes separate languages split off from these regional lects, so they have more in common with the regional varieties of the main language than with the standard lects of any historical period.
To stretch the tree analogy a bit: sometimes a limb that's touching the ground sets root and becomes a tree in its own right, and other times branches or roots from separate trees graft together after prolonged contact.
I seem to have written a book here, but I hope you can see what I'm getting at. It would be a good idea to think about some way of representing the internal structure of macrolanguages and even regular languages, and the way that different descendants can come from different parts of the same language. There's a complex interchange between region and historical period, so the Wessex dialect of today has a completely different status from the Wessex dialect of a thousand years ago, and the geographical identification of what's mainstream and what's dialectal changes over time. It's all secondary to the main concept of parent and daughter language, but it might help us with some exceptional cases like Chinese. Chuck Entz (talk) 23:15, 18 February 2024 (UTC)Reply
Agreed. Even Anglo-Norman, the main vehicle of 'Gallicisms' in Middle English, began as a chaotic hodge-podge of Old French dialects, certainly in many respects 'northern-flavoured', but not only, and increasingly slanting towards (but never quite attaining) Central French norms as the centuries went by. In this case as well there is no question of a precise dialectal ancestry. Nicodene (talk) 14:34, 19 February 2024 (UTC)Reply

Italicising synonyms for taxonomic names

Hi Benwing. Could you edit Module:form of, Module:form of/templates, and/or T:synonym of to add the ability to italicise the linked-to term in transclusions of {{synonym of}} (preferably by calling |i=), please? Such functionality is needed for taxonomic synonyms. ATM, work-arounds like those seen in Asclepias filiformis var. buchenaviana, Bulbophyllum buchenavianum, Gomphocarpus filiformis var. buchenavianus, Megaclinium buchenavianum, and Tropaeolum buchenavianum are necessary. 0DF (talk) 00:38, 19 February 2024 (UTC)Reply

@DCDuring who would know how this is handled in other taxonomic entries. Chuck Entz (talk) 01:08, 19 February 2024 (UTC)Reply
Now, {{syn of}} (and {{alt form of}}, possibly others) suppresses the italics formatting that {{taxlink}} provides, as well as direct or piped wikitext formatting. All we would need is for templates like {{syn of}} and {{alt form of}} to handle embedded wikitext for italics, as is now possible in other templates that incorporate links. Alternatively, something like {{syn of}}, say {{taxsyn}} (also {{taxalt}}), could have all the formatting capabilities of {{taxlink}}, which include not italicizing terms like "var.", "section" ("sect.", "subsect"), "subg.", and "subsp." in taxonomic names. This would probably not involve too much renaming of templates at this point. DCDuring (talk) 13:58, 19 February 2024 (UTC)Reply
And it would be nice to allow † to appear without requiring pipes. DCDuring (talk) 14:37, 19 February 2024 (UTC)Reply
@DCDuring: I assume it would be possible to include the non-italicising functionality of {{taxlink}} in {{synonym of}} by making it contingent upon both |1=mul and |i=1 being true. I can't imagine a case in which one would want to define a term as a synonym of something translingual that contains any of the strings sect., subg., subsect., subsp., or var.; italicise it; and for that term not to be a taxonomic name. 0DF (talk) 14:38, 19 February 2024 (UTC)Reply
The italicization rules of the various taxonomic bodies include that all taxonomic names (ie, any rank) of viruses, bacteria, and archaebacteria be italicized. It is probably simpler to use passed-through wikitext italics than to duplicate {{taxlink}} functionality. DCDuring (talk) 14:47, 19 February 2024 (UTC)Reply
@DCDuring: I only meant {{taxlink}}'s functionality of automatically de-italicising those few abbreviations. Italicising dependent on parsing the taxon (as a species, genus, phylum, or whatever) seems superfluous and unnecessarily complicated for {{synonym of}}; |i=1 should be all that's necessary. 0DF (talk) 14:59, 19 February 2024 (UTC)Reply
It seems too complicated to me too, but I've often been surprised with what our techno-mavens are willing to do, for reasons that remain mysterious to me. Simply passing through wikiformatting (and, possibly, "†") would be fine with me. It would be easy enough to find the relatively few instances we would have of improper handling of those not-to-be-italicized terms in {{syn of}}, {{alt of}}, and the various etymology templates, too. DCDuring (talk) 19:06, 19 February 2024 (UTC)Reply
@DCDuring: How would you want the obelus to be treated? 0DF (talk) 22:42, 19 February 2024 (UTC)Reply
Directly in front of taxon, ignored for linking, but displayed without being italicized. DCDuring (talk) 12:42, 20 February 2024 (UTC)Reply
@DCDuring: De-italicising would be handled in the same way as it's handled for sect., subg., subsect., subsp., and var., I expect. Stripping it from the link text would be easy (handled in the same way Latin ā, ē, ī, ō, ū, ȳ link to Latin a, e, i, o, u, y), but it may end up being enacted in undesirable circumstances. Do we need a new (mul-tax?) language code for taxonomic names, perhaps? 0DF (talk) 18:06, 20 February 2024 (UTC)Reply
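A rough Python sketch of the behaviour discussed so far: italicise a taxonomic name, leave the rank abbreviations upright, and keep a leading obelus out of the link target while displaying it unitalicised. The function name and the exact output format are assumptions for illustration; this is not the logic of {{taxlink}} or Module:form of.

```python
# Rank abbreviations that stay upright, as listed in the discussion above.
UNITALICISED = {"sect.", "subg.", "subsect.", "subsp.", "var."}
OBELUS = "\u2020"  # the dagger sign


def format_taxon(name: str) -> str:
    """Return a piped wikilink whose display text is italicised except for rank terms."""
    has_obelus = name.startswith(OBELUS)
    target = name.lstrip(OBELUS).strip()  # the obelus is ignored for linking

    parts = []
    run = []  # consecutive words that belong in one italic span
    for word in target.split():
        if word in UNITALICISED:
            if run:
                parts.append("''" + " ".join(run) + "''")
                run = []
            parts.append(word)
        else:
            run.append(word)
    if run:
        parts.append("''" + " ".join(run) + "''")

    display = " ".join(parts)
    if has_obelus:
        display = OBELUS + display  # displayed, but outside the italics
    return "[[" + target + "|" + display + "]]"
```

For example, `format_taxon("Asclepias filiformis var. buchenaviana")` yields a link to the full name whose display text italicises "Asclepias filiformis" and "buchenaviana" but not "var.".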
I'd prefer a shorter one, of course, like 'mult' or 'mul-t'. DCDuring (talk) 18:27, 20 February 2024 (UTC)Reply
@Mahagaja: How much freedom do we have in devising language codes? 0DF (talk) 18:30, 20 February 2024 (UTC)Reply
@0DF: You'd have to get consensus at WT:RFM for it. I wouldn't hold my breath. —Mahāgaja · talk 18:48, 20 February 2024 (UTC)Reply
@Mahagaja: Thanks for the response. I mean, rather, what restrictions are there on the form that language codes take? I know we use ISO 639-3 codes where they're available, but what about custom, in-house codes? 0DF (talk) 20:17, 20 February 2024 (UTC)Reply
@0DF @Mahagaja @DCDuring We actually already have mul-tax as a variant of Translingual (no idea when it got added, but see Module:etymology languages/data). I don't think it's used for anything at the moment, but it would make sense to use it for this. Theknightwho (talk) 20:25, 20 February 2024 (UTC)Reply
@Theknightwho: Thank you.
@DCDuring: How 'bout it?
0DF (talk) 20:29, 20 February 2024 (UTC)Reply
I always fear that the cure will turn out worse than the disease. Can it all be done automagically or will there be a few hundred exceptions? It is true that mul in Latin script is hard to confuse with mul in CJKV. DCDuring (talk) 20:37, 20 February 2024 (UTC)Reply
Daniel Carrero added Tax. "for test purposes" back in November 2016; -sche then standardized it to mul-tax. I don't know what he was testing, but the code is there for anyone who wants to use it. —Mahāgaja · talk 20:38, 20 February 2024 (UTC)Reply
@DCDuring: I looked at the histories of Module:form of, Module:form of/templates, and {{synonym of}}. They showed me that Benwing had done a lot of editing on all three, so I figured he/she would be sufficiently familiar with those pages to make the changes I requested. There's nothing suspicious about that and I hardly see how I can be said to have "power" here. 0DF (talk) 00:33, 21 February 2024 (UTC)Reply
It's a habit of exclusion, not an intent of exclusion. Specific folks can always be pinged. DCDuring (talk) 14:31, 21 February 2024 (UTC)Reply
@DCDuring: I guess so. Not that I intended the request to turn into a prolonged discussion. 0DF (talk) 15:12, 21 February 2024 (UTC)Reply

Error handling with Module:parameters and Module:languages

Hiya - just a heads up (and you've probably noticed already), but I've recently updated Module:parameters to allow languages, scripts, families (etc) as data types, as well as a few other things. This means that the argument table which is returned contains the relevant object(s), and invalid codes will throw an error (which automatically highlights the incorrect parameter). This avoids having to manually handle invalid codes, since the only way to do proper error-handling previously was to pass the ready-baked parameter into Module:languages using getByCode's paramForError parameter, which was tricky when dealing with lists etc. Having converted a number of template modules, it's also cut down on code length by quite a bit, too.

Ideally, we should be able to remove error handling from Module:languages and Module:scripts altogether at some point, since it doesn't really belong there, and it's annoying having to work around it when requesting etymology langs and families, too. Theknightwho (talk) 15:21, 27 February 2024 (UTC)Reply

@Theknightwho Yup I did notice it, thanks. I haven't had a chance to use the new functionality but it sounds good to me. BTW if you haven't already done this you might consider adding support for comma-separated lists of lang codes and for a term with a preceding language code (see parse_term_with_lang in Module:parse utilities, which implements this latter functionality currently). Benwing2 (talk) 20:01, 27 February 2024 (UTC)Reply
@Benwing2 I've already done the comma-separated list actually, but haven't updated the documentation since I want to make sure the implementation is stable/won't need further expansion. The solution I opted for was sublist=, where sublist=true splits the list using %s*,%s*, but using a string value allows for other splits. The other thing which isn't yet documented is set=, which is for parameters that take an (ideally small) closed set of values, where inputs with other values would be nonoperative anyway.
I'll have a think about how to handle preceding langcodes. Theknightwho (talk) 20:07, 27 February 2024 (UTC)Reply
@Theknightwho The |set= support is definitely useful. Note that the corresponding flag in Python's argparse module is called |choice=, which might possibly be a clearer name (although I can see the argument for using set as well). Benwing2 (talk) 20:16, 27 February 2024 (UTC)Reply
@Benwing2 That makes sense. The reason I opted for set= is because it uses the {a = true, b = true, c = true} format, since that makes lookup much faster/simpler. Theknightwho (talk) 20:26, 27 February 2024 (UTC)Reply
@Theknightwho Hmm, I wonder if that isn't false economy since it requires more typing, and I imagine a lot of people will call listToSet on a list to handle this format. Benwing2 (talk) 20:28, 27 February 2024 (UTC)Reply
@Benwing2 That's a good point, but checking a list is the same amount of work as doing listToSet, so changing Module:parameters to accept a list would simply guarantee the worst-case scenario, instead of leaving it up to the calling module. Theknightwho (talk) 20:34, 27 February 2024 (UTC)Reply
@Theknightwho I suppose but the actual difference in memory and speed is completely negligible, so IMO you might as well make it easier for the callers. Benwing2 (talk) 20:54, 27 February 2024 (UTC)Reply
And also you don't have the overhead of loading a new module. Benwing2 (talk) 20:54, 27 February 2024 (UTC)Reply
@Benwing2 If I have time, I might do some profiling on Module:parameters, since I have a feeling it's contributing a significant chunk to page loading time. e.g. a loads about a second faster since I made the changes, and there are still quite a few other optimisations that could be made. Theknightwho (talk) 21:02, 27 February 2024 (UTC)Reply
@Theknightwho OK but I still think requiring the use of a set rather than (also) allowing a list is a micro-optimization since the number of items should be small. Benwing2 (talk) 21:10, 27 February 2024 (UTC)Reply
@Benwing2 Alright - I can change it. Theknightwho (talk) 21:16, 27 February 2024 (UTC)Reply
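For readers following the sublist=/set= exchange, here is a small Python analogue of the two ideas: splitting a value on commas with optional surrounding whitespace (the Lua pattern %s*,%s* corresponds to the regex `\s*,\s*`), and validating a value against a closed set that the caller may supply either as a list or as a set-like table. The function names are hypothetical and this is only a sketch of the behaviour described, not the Module:parameters implementation.

```python
import re


def split_sublist(value, sublist=True):
    """Split a parameter value; True means the default comma split, a string is a custom pattern."""
    pattern = r"\s*,\s*" if sublist is True else sublist
    return re.split(pattern, value.strip())


def check_choice(param, value, allowed):
    """Accept `allowed` as a list/tuple or as a set-like mapping such as {"a": True}."""
    if isinstance(allowed, (set, frozenset, dict)):
        allowed_set = allowed  # already supports fast membership tests
    else:
        allowed_set = set(allowed)  # one-off conversion, cheap for small lists
    if value not in allowed_set:
        raise ValueError(
            "Parameter %r: %r is not one of %s" % (param, value, sorted(allowed_set))
        )
    return value
```

Accepting both shapes keeps lookup fast for callers that already have a set while sparing other callers the conversion step, which matches the compromise reached above.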
@Theknightwho & Ben: pardon the partial threadjacking, but I've been wanting to ask you two about the practicality of adding parameter checking to existing, non-Lua templates, and this seems like an opportune moment while you're both already thinking about Module:parameters. I'm envisioning something like an unobtrusive template {{allowparams|1,2,3,foo,bar,baz}} that could be added to existing templates to generate errors/warnings when the template is invoked with any params besides those listed. On the backend, it could just call Module:parameters.process() with the list of supplied params and then do nothing with the result. Ignoring the difficulty of identifying the valid parameters and cleaning up all the existing calls with invalid parameters, would adding param checking to every template add an unacceptable overhead to page processing? JeffDoozan (talk) 01:45, 28 February 2024 (UTC)Reply
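The check that the proposed {{allowparams}} would perform can be sketched in a few lines of Python. The function names are made up for illustration, and the real thing would live in Lua; the point is only that the comparison itself is trivial, so the cost question is about the Lua invocation rather than the check.

```python
def parse_allowed(spec):
    """Turn an allow-list such as '1,2,3,foo,bar,baz' into a set of parameter names."""
    return {name.strip() for name in spec.split(",") if name.strip()}


def unexpected_params(allowed_spec, supplied):
    """Return the sorted names of supplied parameters that are not in the allow-list."""
    allowed = parse_allowed(allowed_spec)
    # Positional parameters may arrive as integers, so compare names as strings.
    return sorted(str(key) for key in supplied if str(key) not in allowed)
```

A call with `{"1": "x", "foo": "y", "qux": "z"}` against the allow-list in the example above would report only `qux`.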
@JeffDoozan I think User:Theknightwho can best answer the question about efficiency as he's done a lot more investigations of this sort. Benwing2 (talk) 01:48, 28 February 2024 (UTC)Reply
@JeffDoozan That's certainly doable, but it would add an extra Lua burden to those templates, and in many cases it would be more straightforward to do the whole thing in Lua anyway.
The reason why it concerns me is that a lot of these mixed templates already make multiple calls into Lua to retrieve things like language names, and there is an inherent cost every time a module is invoked; this is the reason why {{multitrans}} is so effective, because it removes that inherent cost from each template. Aside from memory costs, each invocation is quite time-consuming (relatively speaking), since a ton of things are done by the back-end to create each new Lua environment. Theknightwho (talk) 01:48, 28 February 2024 (UTC)Reply
@Theknightwho: Thank you for the explanation. I had naively assumed that if a page calls Lua once, then subsequent calls would be relatively cheap. I'm still assuming that most pages include few enough templates that the benefit of having parameter checking outweighs the cost of invoking the checks, but as pages get bigger and closer to memory/speed limits, the calculus may change. Do you have any guess where that tipping point might be? (100 additional calls? 1,000? 10,000?) For pages that exceed that threshold, maybe {{allowparams}} could check the pagename against a fixed "denylist" of problematic pages before invoking Lua. I'm assuming the denylist would be < 100 pages and could be programmatically generated from an XML dump by counting the number of templates that would call {{allowparams}}. What do you think? JeffDoozan (talk) 17:39, 28 February 2024 (UTC)Reply
@JeffDoozan So conventional wikicode would probably preclude that being workable, because there's the post-expand include size limit of 2MB, which is calculated by adding up the size of every page accessed, multiplied by the number of times it's accessed, and on top of that, parser functions like {{#if:}} actually apply a multiplier to anything that goes through them (which compounds, though I think it's capped at something like x12). This was a big problem we ran into with the lite templates, where the bottom 10% of a simply wasn't loading templates anymore. Even now, it's using about 1.8MB of the limit. Obviously I'm being really pessimistic when I say these things, but the irony of it is that adding these kinds of checks to aid large pages can end up having the opposite of the intended effect!
The things that help are:
  • Reducing the number of calls into Lua. If it can be done in one invoke that's ideal, but really it should be no more than 5. This includes uses of any templates which themselves are Lua based (like {{l}}), since they each result in independent calls into Lua. The Coptic conjugation templates are a great example of why this matters, since they're way slower than water/translations despite having nowhere near as many links.
  • Not creating complex wikicode logic with the parser functions (like we do with the citation templates, for example). They're really slow, a pain in the neck to maintain, and inevitably result in lots of separate Lua invocations for basic information like language names.
In terms of the parameter checking, let me know if there are any templates which are on your priority list, because it may be that we can score some quick-wins by converting some of them into pure-Lua, whereas with others the manual parameter checking may be workable. Theknightwho (talk) 17:51, 28 February 2024 (UTC)Reply
@Theknightwho That kind of deep information is exactly why I wanted to run this by you. Since I'm hoping to do this programmatically and en masse, it would be limited to templates where I can parse the code to find all of the parameters used, which eliminates anything already calling #invoke since the invoked module can make its own use of the parameters and I'm not sure how practical it is to try to determine the parameters used by a Module. I think this means that every modified template would mean 1 additional call to Lua for every use and also that there's likely little or no benefit to converting them to Lua. How many total Lua calls on a page is too many?
I would probably start with the templates that don't already have calls with bad parameters, which probably means the lesser used templates that might not even be included on our biggest pages. I can check which templates are used on pages with more than X template calls and exclude those templates from the mass conversion, to ensure we're not adding additional stress to our biggest pages. I understand that not all template calls are equal, but is there some reasonable number of template calls I could use for detecting "big" pages? 100? 500? 1000? JeffDoozan (talk) 20:34, 28 February 2024 (UTC)Reply
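One way to generate such a list of "big" pages from dump text is sketched below in Python. The regex is deliberately naive (it skips parser functions such as {{#if:}} and triple-brace parameters, and does not expand nested transclusions), the function names are hypothetical, and the threshold is a placeholder, since no number was agreed in this thread.

```python
import re
from collections import Counter

# Naive matcher for template names: "{{name|" or "{{name}}", excluding
# parser functions ("{{#if:") and triple-brace parameters ("{{{1}}}").
TEMPLATE_NAME = re.compile(r"(?<!\{)\{\{(?!\{)\s*([^#{}|\n][^{}|\n]*?)\s*(?=\||\}\})")


def count_template_calls(wikitext):
    """Count how often each template name is called in one page's wikitext."""
    return Counter(match.group(1) for match in TEMPLATE_NAME.finditer(wikitext))


def big_pages(pages, tracked, threshold):
    """Return titles whose calls to the tracked templates reach the threshold.

    `pages` maps title to wikitext; `tracked` is the set of templates that
    would gain a parameter check; `threshold` is a placeholder value.
    """
    result = []
    for title, text in pages.items():
        counts = count_template_calls(text)
        if sum(counts[name] for name in tracked) >= threshold:
            result.append(title)
    return sorted(result)
```

The resulting titles could serve either as the denylist mentioned earlier or as the basis for excluding heavily used templates from a mass conversion.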

"terms spelled with"

Hi, I would like to bring your attention to categories such as Category:Hindi terms spelled with ॉ. We seem to have decided that ◌ (U+25CC) should not be used for the Hindi combining characters, but Translingual doesn't seem to know about that, which is why Category:Translingual terms spelled with ◌ॉ exists. What should we do about that? --kc_kennylau (talk) 16:47, 28 February 2024 (UTC)Reply

@Kc kennylau Can you explain further about U+25CC? What is its replacement? As for the "terms spelled with" categories, AFAIK these categories are suppressed for one-character entries but this entry seems to involve two Unicode chars. Maybe User:Theknightwho can comment more as he reworked the code to generate these categories. Benwing2 (talk) 02:40, 29 February 2024 (UTC)Reply
U+25CC is usually used with combining characters (see Category:Translingual terms spelled with ◌̺, which is U+25CC followed by U+033A) in order to display the character. However, due to some unknown reasons, at least in my browser the Hindi combining characters in "isolation" already come with a dotted circle when they are rendered, so using U+25CC would create two dotted circles when displayed. I tried to look at The Unicode Standard, but so far it seems to me that this is not really specified one way or another, at least not specifically for Devanagari. This is why I don't really know if we should include U+25CC or not. --kc_kennylau (talk) 02:48, 29 February 2024 (UTC)Reply
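Whichever display convention is chosen, the two category names could at least be mapped to the same character programmatically. The Python sketch below uses only the standard `unicodedata` module; the helper names are hypothetical, and the choice to treat any character in a Mark category (Mn, Mc, Me) as "combining" is an assumption for illustration.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"


def is_combining(ch):
    """True for characters in a Unicode Mark category (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def strip_dotted_circle(label):
    """Normalise a category label by dropping any U+25CC carrier."""
    return label.replace(DOTTED_CIRCLE, "")


def with_dotted_circle(ch):
    """Prefix U+25CC to a combining character; leave other characters alone."""
    return DOTTED_CIRCLE + ch if is_combining(ch) else ch
```

With this, "◌ॉ" and "ॉ" normalise to the same key, so a single per-language switch could decide whether the dotted circle is shown in the category title.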

Latin macronization change: veho, vē̆xī, vectum

Hello, I was just looking into the vowel length of Latin vē̆xī (perfect of vehō) and it looks like most recent sources think there's a good chance that it had a long vowel like Sanskrit ávākṣam (although there is some uncertainty). I edited the entry for vehō with notes on this and to mark the vowel in the perfect stem as ē̆, but of course, that doesn't affect all the inflected forms and derived compounds (e.g. advehō, convehō, invehō, prōvehō, subvehō, trānsvehō, ēvehō). Could you have Wingerbot update those? (The long vowel seems to only be reconstructed for the perfect stem vē̆x-, not the supine stem vect-). I hope it's not too much trouble. I have also been wondering how I might set up a bot account of my own to make changes like this after editing the length of a vowel in Latin entries; if that's feasible for me to do, any tips would be welcome! Urszag (talk) 20:46, 1 March 2024 (UTC)Reply

@Urszag Hi. I'll go ahead and fix these. As for setting up a bot account, in order to do that (a) you need to be able to write Python scripts, (b) you do some small test runs using your own account and verify that everything works, (c) you set up a vote to create an account for your bot using the link in WT:Votes. I recommend using a combination of pywikibot to interface to Wiktionary and mwparserfromhell to parse the template invocations on a given page. Note that there's also AutoWikiBrowser which lets you make semi-automated changes based on regular expressions and takes less work to set up than a bot account; I used this several years ago before I set up a bot account. (It is only supported on Windows but it seems to work OK through Wine on MacOS, and there's also a JavaScript browser variant called JWB.)
BTW are there any other macron changes you need done? I think there's an outstanding request somewhere in my archives that I never got to, possibly it was from you. Benwing2 (talk) 01:49, 2 March 2024 (UTC)Reply
Done. Benwing2 (talk) 05:18, 2 March 2024 (UTC)Reply
OK, I found the previous request. It was from you in April 2023: User talk:Benwing2/2023#More Latin vowel length changes. You mentioned hirtus, hirsutus, luxus, luctor. The relevant part of the input to my script has this:
###
### hīrtus
### 
a1 hīrtus
pn2 Hīrtius
a1 hīrsūtus
a1 hīrtellus
a3 hīrtipēs hīrtiped
###
### lūctor
###
v1+ lūctor
n1 lūcta
n3 lūctātiō
n3 lūctātor
v1+ adlūctor
v1+ allūctor
v1+ collūctor
n3 collūctātiō
v1+ conlūctor
n3 conlūctātiō
v1+ ēlūctor
a3 ēlūctābilis
a3 inēlūctābilis
v1+ relūctor
n3 relūctātiō
###
### lūxus "dislocated"
###
a1 lūxus
n4 lūxus
v1+ lūxō
Do all these need to change to ī̆ ū̆? Are there any words missed here? Also can you give me the appropriate changelog comment(s) to have the bot add when making the changes? The default is "if before two cons, per Bennett corrected by Allen and Michelson" but that's obviously wrong for these cases. Benwing2 (talk) 05:30, 2 March 2024 (UTC)Reply
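For anyone wanting to script a similar change, the list above can be processed roughly as follows. This Python sketch is not the actual bot script: it assumes that lines beginning with "#" are comments and that each remaining line is a code followed by one or more forms, treats the codes (a1, v1+, n3 and so on) as opaque since their meaning is not spelled out here, and uses hypothetical function names.

```python
import unicodedata

MACRON = "\u0304"
BREVE = "\u0306"


def parse_spec(text):
    """Return (code, forms) pairs, skipping blank lines and '#' comment lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, *forms = line.split()
        entries.append((code, forms))
    return entries


def add_breve(stem):
    """Insert a combining breve after the first macron-bearing vowel of the stem."""
    out = []
    done = False
    for ch in unicodedata.normalize("NFC", stem):
        out.append(ch)
        if not done and MACRON in unicodedata.normalize("NFD", ch):
            out.append(BREVE)
            done = True
    return "".join(out)


def mark_uncertain(form, stems):
    """Rewrite the first listed stem found in the form so its long vowel is marked uncertain."""
    form = unicodedata.normalize("NFC", form)
    for stem in stems:
        stem = unicodedata.normalize("NFC", stem)
        if stem in form:
            return form.replace(stem, add_breve(stem), 1)
    return form
```

Matching on the stem (for instance "lūct") rather than on the first long vowel of the word keeps the ē of ēlūctor untouched, and a form that already carries the breve is left unchanged because the plain stem no longer occurs in it.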
Thanks! Those all look correct with ī̆ ū̆. I would add lūxuria, lūxuriō, lūxuriōsus, lūxuriēs, obluctor.
In addition, it looks like I missed some inflected forms of derivatives of nūbō, nūpsī, nū̆ptum when I made that change (e.g. nūptum, nūptiāle). Specifically, there's innūbō, inflected forms of innū̆ptus, nū̆ptia, nū̆ptiae, nū̆ptiālis, nū̆ptus (It seems I just edited the main entry for these), and connūbium and its inflected forms.
I just made a new change to the perfects of alliciō, allē̆xī (formerly marked as just long) and illicio, illexī and pellicio, pellexī (formerly marked as just short) to mark them as uncertain (it seems likely all three had the same quality, probably short). These just need the inflected verb forms updated.
The references I'm basing these on are cited at the pages for hī̆rtus, lū̆xus, lū̆ctor, alliciō, nūbō, cōnū̆bium, so I think one option is to add notes of the format "Vowel length marked as uncertain based on references cited at hī̆rtus", and so on. Or the specific references could be listed as follows. Hirt- and lux-: uncertain based on Bennett (long) vs. De Vaan (short). Luct-: uncertain based on Bennett (long) vs. De Vaan, Wartburg, Buchi and Schweickard (short, with complications). Allex-: uncertain based on Bennett, Buck and Allen. Nupt-: uncertain based on Lewis and Bennett (long) vs. De Vaan, Ernout and Meillet, Wartburg and Bienvenu (short). -nubium: uncertain per Kennedy. -licio, -lē̆xī: uncertain per Bennett and Buck, "probably short" per Allen.--Urszag (talk) 15:13, 2 March 2024 (UTC)Reply
@Urszag Done. Note that there also exists conubialis, which is currently indicated with long ū. Not sure if this needs ū̆. Benwing2 (talk) 06:18, 4 March 2024 (UTC)Reply
Thank you! Yes, conubialis seems to be like conubium.--Urszag (talk) 06:36, 4 March 2024 (UTC)Reply

Category:Hijazi Arabic terms with IPA pronunciation - Alphabet order

how can you change the alphabet order of the Hijazi Arabic letters from

آ أ إ ا ب پ ت ث ج ح خ د ذ ر ز س ش ص
ض ط ظ ع غ ف ڤ ق ك ل م ن ه و ي

to

آ أ إ ا ب ت ث ج ح خ د ذ ر ز س ش ص
ض ط ظ ع غ ف ق ك ل م ن ه و ي . پ ڤ

since پ and ڤ are additional letters and not part of the Alphabetical order عربي-٣١ (talk) 12:39, 2 March 2024 (UTC)Reply

@عربي-٣١ Are you referring to the sort order as it appears on category pages? The thing is, those additional letters are letters even if they aren't part of the standard Hijazi alphabet, and they need to be sorted *somewhere*. The "to chart" you gave doesn't include them anywhere. Benwing2 (talk) 22:47, 3 March 2024 (UTC)Reply
Oh NVM, you want them placed at the end. Benwing2 (talk) 22:48, 3 March 2024 (UTC)Reply
@Theknightwho @Fenakhay What do you think about this? It looks to me like there is no explicit sort key currently specified for Hijazi Arabic (nor for Egyptian and Gulf Arabic). Standard Arabic has a sort key but only for one Judeo-Arabic character, and Moroccan Arabic has a sort key of some sort that has no comments so I'm not sure what it's doing. IMO we should strive to treat all varieties of Arabic the same as much as possible, e.g. in using the same sort order everywhere as much as feasible; the additional letters correspond to /p/ and /v/, which are marginal phonemes in most varieties of Arabic (with the possible exception of /p/ in Iraqi varieties?). (Also per Wikipedia's Varieties of Arabic article, there are two different ways of writing /v/ in the Arabic script, corresponding to an East-West split.) Benwing2 (talk) 03:06, 4 March 2024 (UTC)Reply
Well they are additional variants of letters (foreign letters) and should be included at the end of this list, since they are already included as the last when you check the pages in any of the Arabic dialects sorting pages, also the Arabic sorting key should be from right to left as with the rest of Arabic dialects (not from left to right as it is now in Category:Arabic terms) عربي-٣١ (talk) 16:01, 13 March 2024 (UTC)Reply
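The requested ordering can be expressed as a sort key. The Python sketch below only illustrates the idea (it is not how sort keys are defined in the language data modules): the letter order is taken from the "to" chart above, vowel diacritics are ignored, and any character outside the chart is placed after all listed letters, which is an assumption.

```python
import unicodedata

# Order from the request above; the two additional letters come last.
ORDER = "آ أ إ ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي پ ڤ".split()
RANK = {letter: index for index, letter in enumerate(ORDER)}


def sort_key(term):
    """Return a list of ranks usable as a key for sorted()."""
    key = []
    for ch in term:
        if unicodedata.category(ch) == "Mn":
            continue  # skip vowel diacritics and other nonspacing marks
        # Characters not in the chart sort after every listed letter.
        key.append(RANK.get(ch, len(ORDER) + ord(ch)))
    return key
```

Sorting a list of terms with `sorted(terms, key=sort_key)` then places words beginning with پ or ڤ after those beginning with ي.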

Replacement of quotation templates

Hi, when you have time could you please do the following quotation template replacements?

Thank you! — Sgconlaw (talk) 13:45, 3 March 2024 (UTC)Reply

@Sgconlaw Done. Benwing2 (talk) 22:47, 3 March 2024 (UTC)Reply
Thanks! — Sgconlaw (talk) 11:28, 4 March 2024 (UTC)Reply
By the way, was the {{RQ:Milton Eikonoklastes}} replacement also done? I couldn’t tell; maybe none of the entries it’s used in are on my watchlist. If so I’m changing the template to swap around the |1= and |2= parameters so that the template is in line with other templates. — Sgconlaw (talk) 11:37, 4 March 2024 (UTC)Reply
@Sgconlaw Yes. There were only a few pages using those params though. Benwing2 (talk) 21:08, 4 March 2024 (UTC)Reply
OK, great. — Sgconlaw (talk) 22:04, 4 March 2024 (UTC)Reply

Bugs in ar-conj/module:ar-verb

Hi. I want to inform you about a couple of problems in ar-conj/module:ar-verb. I already informed Fenakhay about them, but I'll also inform you just in case; perhaps you can sort it out. I'm sorry in advance for my post being this long:


When I was looking for entries on حَيَّ/حَيِيَ (root ح ي و), I saw that only the long present tense (يَحْيَا) is generated for the short form; the short one (يَحَيُّ), which exists per Lisan al-Arab [3], isn't generated. This needs to be fixed so that the short present tense is generated.


A related problem concerns عَيَّ/عَيِيَ (root ع ي ي): the conjugation table for the long form عَيِيَ is generated with the specified paradigm i/a and the long present يَعْيَا, but unlike with حَيَّ, the conjugation table for عَيَّ isn't generated at all. Btw, it also has a short version of the present: يَعَيُّ [4]

Also notice how participles aren't generated at all for حَيَّ/حَيِيَ (should be short and long versions: حَيّ and حَيِيّ). Fixmaster (talk) 20:45, 5 March 2024 (UTC)Reply

Bugs in ar-conj/module:ar-verb (part 2)

Also notice how participles aren't generated at all in conjugation tables for حَيَّ/حَيِيَ (should be short and long versions of active participles: حَيّ and حَيِيّ). Same goes for عَيَّ/عَيِيَ (should be عَيّ/عَيِيّ per dictionaries).

And if you generate the conjugation table with عَيِيَ (don't forget, the table for عيَّ won't generate at all), there will be participles, but with the wrong forms: عَايٍ for active and مَعْيُوّ for passive.

Btw, speaking of passive participles, what should they be? In the almaany online dictionary, I found مَحْيىّ and مَعِيّ respectively. Notice how the patterns don't match? In any case, they could probably be ignored; those passives are mostly theoretical and impersonal anyway. I just thought it was worth mentioning.

What matters is the ability to generate the conjugation table at all for the short-form verb عَيَّ like we have for حَيَّ, the short present tense for these two (يَحَيُّ and يَعَيُّ), which currently isn't generated, and generation of the short/long active participles (حَيّ/حَيِيّ and عَيّ/عَيِيّ).

Just as a side note: maybe there should be parameters in the template to forcefully override active/passive participles (like we have the parameter for verbal nouns)? Just an idea. Fixmaster (talk) 20:41, 5 March 2024 (UTC)Reply

About categories

Feedback on categories from a not-so-clever reader, if you allow me. I find the categories at en.wikt very complex and unpatrolled (many were started by someone and then left untouched). Some of them are broken into such specialised subcategories that one cannot find a wanted word, e.g. dog in Cat:en:Animals. Is there an index=1 kind of Category-Index (all members a...z)? We have done this at @el.wikt.Animals, plants, medicine with a different colour. Just 3 or 4 Cats. The little ««« links to the overall Cat for all languages. Also! The code indicator for topics makes alphabetisation and comprehension impossible: why should a reader know the codes? If a first word is to be avoided, why not the style: Cat:Animals (English)? Thank you for listening. ‑‑Sarri.greek  I 03:31, 6 March 2024 (UTC)

@Sarri.greek You've brought up several points and this is a big topic. Can you bring this up in the Beer Parlour? Most of the basic decisions concerning category structure predate me and we'd need consensus to institute any significant changes. Benwing2 (talk) 03:35, 6 March 2024 (UTC)Reply
@Benwing2, Here, I am not an admin, so it is not my place to bring such things up for discussion (my understanding of en.wikt structure and modules is not adequate). Sir, I have been thinking about this at el.wikt (where my admin colleagues, mostly wikipedians, demanded that I stop, for being too autocratic... True: I cannot stand sloppiness, lack of refs, loose CFI etc. :) But the same is perhaps valid for all wiktionaries: 20 years have passed. Basics (plus details too) are covered. What now? I think, a general workpage for a. Feedback on the current state. b. The future plans for formation of crews on each subject. Cleanup, reviewing, and unifying: cats, params, templates. Leadership: vote plans by Xadmin, by Zadmin, people responsible to do the plan and supervise the crews. If you organise a room /wikt.Future or something... and subpages for Cats, for Temps etc... we could all bring ideas? Plus: a very important thing. en.wikt is now the leader of all wiktionaries, which every little wikt. copies from. IF you had to design a wiktionary from scratch, how would you go about it? Because now, it is a patchwork procedure: adding, correcting in a maze of things... Hhhhh I talk too much too! Sorry ‑‑Sarri.greek  I 04:01, 6 March 2024 (UTC)
@Sarri.greek I think in a wiki it's impossible to do everything top-down. It has to be done through consensus. Also I don't think we need a separate wikt.Future discussion forum or anything; that's what the Beer Parlour is for. There's no need to be an admin to initiate a discussion for change, just go ahead. Benwing2 (talk) 04:32, 6 March 2024 (UTC)Reply

Adding a category with multiple subcategories

Hi, I'd like to add categories to track calls to templates with bad parameters, but I haven't touched categories before, so I wanted to double-check that this is a reasonable idea and that I'm going about it the right way. I think I need to create a parent category and then use a handler for the per-template categories. Since these would be maintenance categories, I would edit Module:category tree/poscatboiler/data/wiktionary maintenance and insert:

-- add the variable handlers at the top of the page (the file doesn't currently use any handlers)
local handlers = {}

--- snip ---

raw_categories["Pages using bad params when calling a template"] = {
	description = "Pages that use unrecognized parameters when calling a template.",
	additional = "These template calls should be reviewed and corrected or removed",
	breadcrumb = "Bad template params",
	parents = {"Wiktionary maintenance"},
	can_be_empty = true,
	umbrella = false,
	hidden = true,
}

table.insert(handlers, function(data)
	local template = data.label:match("^Pages using bad params when calling (.+)$")
	if template then
		return {
			description = "Pages that use unrecognized parameters when calling " .. template .. ".",
			additional = "These template calls should be reviewed and corrected or removed",
			breadcrumb = template,
			umbrella = false,
			parents = {{
				name = "Pages using bad params when calling a template",
				sort = template,
			}},
		}
	end
end)

-- add HANDLERS to the existing return table
return {RAW_CATEGORIES = raw_categories, HANDLERS = handlers}

I know I can do something similar using template tracking, but I'm trying to make this a little more "user friendly" with the hope that it won't just be me cleaning up these categories. Is there an overhead cost to using categories like this or anything else I should take into consideration? Thanks! JeffDoozan (talk) 21:04, 8 March 2024 (UTC)Reply

@JeffDoozan Yup, this approach will work, although you need a few changes: (1) use a raw handler instead of a regular handler (because the category in question doesn't begin with a language name), and the first line of the handler should use `data.category` instead of `data.label`; (2) you don't need the `umbrella` settings because raw categories don't have corresponding umbrella categories. Other than that everything looks good. Benwing2 (talk) 21:18, 8 March 2024 (UTC)Reply
After adding categorization to ~300 templates that are used less than 5 times and called at least once with invalid parameters, I think it would be easier for cleanup if the templates were categorized into "language" templates and "general use" templates, like this:
To do that, I came up with the following code:
raw_categories["Pages using bad params when calling a template"] = {
	description = "Pages that use unrecognized parameters when calling a template.",
	breadcrumb = "Bad template params",
	parents = {"Wiktionary maintenance"},
	can_be_empty = true,
	hidden = true,
}

table.insert(raw_handlers, function(data)
	local template_type = data.category:match("^Pages using bad params when calling (.+) templates$")
	if template_type then
		return {
			description = "Pages that use unrecognized parameters when calling " .. template_type .. " templates.",
			breadcrumb = template_type,
			parents = {{
				name = "Pages using bad params when calling a template",
			}},
			hidden = true,
		}
	end
end)

table.insert(raw_handlers, function(data)
	local template = data.category:match("^Pages using bad params when calling (.+)$")

	if template then
		-- gsub returns two values; the parentheses keep only the resulting string
		local template_name_without_namespace = (template:gsub("^Template:", ""))

		-- Check if the template name starts with a hyphenated language code
		local lang
		local possible_language_code = template_name_without_namespace:match("^([a-z][a-z][a-z]?-[a-z][a-z][a-z])-")
		if possible_language_code ~= nil then
			lang = require("Module:languages").getByCode(possible_language_code)
		end

		-- Check if the template name starts with a two or three character language code
		if lang == nil then
			possible_language_code = template_name_without_namespace:match("^([a-z][a-z][a-z]?)-")
			-- Only look the code up if the pattern actually matched
			if possible_language_code ~= nil then
				lang = require("Module:languages").getByCode(possible_language_code)
			end
		end

		local template_type
		if lang == nil then
			template_type = "general use"
		else
			template_type = lang:getCanonicalName()
		end

		return {
			description = "Pages that use unrecognized parameters when calling " .. template .. ".",
			additional = "These template calls should be reviewed and the bad parameter should be corrected or removed.",
			breadcrumb = template,
			parents = {{
				name = "Pages using bad params when calling " .. template_type .. " templates",
				sort = template_name_without_namespace,
			}},
			hidden = true,
		}
	end
end)
Am I just re-inventing umbrella categories? Is there a better way to do this? Would this add unnecessary overhead to the categorization system? JeffDoozan (talk) 22:28, 15 March 2024 (UTC)Reply
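For anyone who wants to try the prefix-matching logic off-wiki, here is a rough Python sketch of the same classification. It is only an approximation of the handler above: `KNOWN_LANGUAGES` is a hypothetical stand-in for the lookup that `Module:languages` does on the wiki, and `template_type` is a name made up for this illustration.

```python
import re

# Hypothetical stand-in for Module:languages; the real lookup happens on-wiki.
KNOWN_LANGUAGES = {
    "ar": "Arabic",
    "ca": "Catalan",
    "roa-opt": "Old Galician-Portuguese",
}

def template_type(category):
    """Return the bucket for a per-template category, or None if it doesn't match."""
    m = re.match(r"^Pages using bad params when calling (.+)$", category)
    if not m:
        return None
    name = re.sub(r"^Template:", "", m.group(1))
    # Try a hyphenated code such as "roa-opt" first, then a plain 2-3 letter code.
    for pattern in (r"^([a-z]{2,3}-[a-z]{3})-", r"^([a-z]{2,3})-"):
        code = re.match(pattern, name)
        if code and code.group(1) in KNOWN_LANGUAGES:
            return KNOWN_LANGUAGES[code.group(1)]
    return "general use"
```

Template names with no recognised code prefix (for example ones starting with `quote-`) fall through to "general use", which mirrors the `lang == nil` branch in the Lua.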

A couple of code replacements

Hi, as part of the Min Nan split, would it please be possible for you to bot replace a couple of the codes which are being deprecated? The only places these are now used should be links, which should make the switch straightforward.

  1. Hokkien: nan-hok → nan-hbl (etym-only to full language conversion)
  2. Teochew: zhx-teo → nan-tws (code standardisation within the nan family)

Thanks. Theknightwho (talk) 01:12, 11 March 2024 (UTC)Reply
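As a rough illustration of what such a replacement could look like, here is a minimal Python sketch. It is not the bot that was actually run; it assumes the old codes only ever appear as a whole positional template parameter (between `|` and the next `|` or `}}`), and the names `CODE_REPLACEMENTS` and `replace_codes` are made up for this example.

```python
import re

# Old code -> new code, as listed in the request above.
CODE_REPLACEMENTS = {"nan-hok": "nan-hbl", "zhx-teo": "nan-tws"}

# Match an old code only when it fills a whole positional parameter,
# i.e. it sits between "|" and the next "|" or the closing "}}".
_CODE_RE = re.compile(
    r"(?<=\|)(" + "|".join(map(re.escape, CODE_REPLACEMENTS)) + r")(?=\||\}\})"
)

def replace_codes(wikitext):
    """Swap deprecated language codes inside template calls."""
    return _CODE_RE.sub(lambda m: CODE_REPLACEMENTS[m.group(1)], wikitext)
```

Anchoring on the parameter boundaries keeps the sketch from touching longer strings that merely begin with one of the old codes, or plain wikilinks.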

@Theknightwho Sure, will do. Benwing2 (talk) 01:30, 11 March 2024 (UTC)
Thanks. Theknightwho (talk) 01:34, 11 March 2024 (UTC)
@Theknightwho Does the code zhx-teo still exist? I can't find any references to it in the language data. Benwing2 (talk) 01:34, 11 March 2024 (UTC)
@Benwing2 It's currently set up as an alias, but that's just a temporary thing. I recently changed the way aliases are handled so that they're no longer directly integrated into the data, because (a) that added overhead we don't need most of the time, (b) it makes keeping track of aliases easier by collating them all in one place, (c) it means we can use them for situations like this, where a code is being changed for whatever reason, and (d) we can now use them for full languages without having to complicate the language data (see point c). They're now stored in Module:languages/data at the bottom. Theknightwho (talk) 01:37, 11 March 2024 (UTC)Reply
@Theknightwho Ahh, thanks. Benwing2 (talk) 01:41, 11 March 2024 (UTC)Reply
@Benwing2 Btw, it does mean the integration isn't quite as smooth as before, since you now can't use aliases for anything that accesses the language data directly as the alias is only looked up during the creation of a language object. In practical terms, that just means they can't be used anywhere in the language data itself (e.g. the ancestors field). That was semi-intentional, though, since we don't really want aliases in the first place. Theknightwho (talk) 01:45, 11 March 2024 (UTC)Reply
@Theknightwho Yeah that is fine. I agree we should eliminate aliases as much as possible, and in fact I did that previously with a bunch of random etym-only aliases. Benwing2 (talk) 01:47, 11 March 2024 (UTC)Reply
@Benwing2 I've just added a check to Module:data consistency check for alias codes, which covers the data for languages, etym-only languages, families and scripts: all it does is check that none of the subtables has multiple keys (e.g. due to someone adding m["abc"] = m["xyz"], which is the old way aliases were handled).
The only ones it's found at the moment are for various Arabic script codes, where I consolidated all the ones that had identical tables a while back. Working out what to do with them will need a proper discussion, though. Theknightwho (talk) 02:43, 11 March 2024 (UTC)Reply
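The idea behind that check (one table reachable under more than one key) can be sketched in a few lines of Python. This is only an illustration of the technique, not the code in Module:data consistency check; `find_aliased_keys` is a hypothetical name, and the sketch assumes every value is a dict standing in for a Lua subtable.

```python
from collections import defaultdict

def find_aliased_keys(data):
    """Return groups of keys whose values are the very same object."""
    keys_by_identity = defaultdict(list)
    for key, value in data.items():
        # id() distinguishes "same table" from "equal contents".
        keys_by_identity[id(value)].append(key)
    return [sorted(keys) for keys in keys_by_identity.values() if len(keys) > 1]
```

Comparing by identity rather than by content matters here: two codes with identical but separate tables are not aliases in the old `m["abc"] = m["xyz"]` sense, so they are not reported.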
@Theknightwho Yeah I've never been very happy with having a bunch of language-specific script codes for Arabic and certain other scripts. However, I'm not sure whether it's possible to eliminate them (or some of them) using things like language selectors in CSS. Maybe User:This, that and the other and/or User:Erutuon can comment more. Benwing2 (talk) 02:48, 11 March 2024 (UTC)Reply
@Theknightwho I did a replacement run for both codes but as the tracking categories were only added yesterday, it will take longer to flush out all the old usages (indeed I now see 8 new pages in the nan-hok category and 3 in the zhx-teo category). Benwing2 (talk) 03:22, 11 March 2024 (UTC)Reply
Thanks. Theknightwho (talk) 05:38, 11 March 2024 (UTC)
@Theknightwho I'll do another run tomorrow. Benwing2 (talk) 05:41, 11 March 2024 (UTC)
@Theknightwho Did another run. Going to bed now but will do another one tomorrow evening; hopefully that will catch any stragglers. Benwing2 (talk) 08:42, 11 March 2024 (UTC)
Sounds good - thanks. Theknightwho (talk) 08:43, 11 March 2024 (UTC)
@Theknightwho I did two runs, one just now and one about 10 hours ago, and already more have appeared, so it may be a few days before everything catches up and there are no more additions to the tracking categories. Benwing2 (talk) 07:50, 12 March 2024 (UTC)Reply
@Theknightwho I went through CAT:Terms derived from Hokkien and CAT:Terms derived from Teochew recursively and changed all the terms in them as well as remaining tracked terms (including uses in {{rfp}} and {{cog}} and such). I *THINK* this is done now; probably close enough that you can delete the old codes and handle any remaining errors as they occur. Benwing2 (talk) 22:40, 12 March 2024 (UTC)Reply
@Benwing2 Thanks - I caught one, but that looks to be it. Theknightwho (talk) 18:49, 13 March 2024 (UTC)Reply
I have also wondered why we use those special lang+script codes for the Arab and Beng scripts. Perhaps they date from a time when no other solution was well-supported enough to deliver different fonts for different languages. I note that Syrc and Xsux specify different fonts for different languages with CSS alone, so it is clearly possible to do it that way. (Not too sure what is going on with Mong...) This, that and the other (talk) 03:50, 11 March 2024 (UTC)Reply
@-sche, Surjection Maybe either of you could comment. If we can replace things like fa-Arab with just the appropriate language selectors in MediaWiki:Gadget-LanguagesAndScripts.css I would rather do it that way and not expose what is essentially an implementation detail into the wikicode. Benwing2 (talk) 03:57, 11 March 2024 (UTC)Reply
@This, that and the other, Benwing2 In the case of Mong, it's been split because the code actually covers four closely related scripts: Mongolian (proper), [Oirat] Clear Script, Manchu and Xibe. It's a situation where the split exists to get more accurate language data, rather than because we need different CSS classes (though that may be something we want in the future; Manchu and Oirat-specific fonts exist, and I suspect Xibe as well). In each case, the character ranges only cover the characters used by those scripts; there's some overlap, but most are only used in a subset of the four. See [5] for a breakdown (note: Todo = Clear Script; Sibe = Xibe). (Edit: this distinction does matter in some cases, e.g. Sanskrit, which has Mong, mnc-Mong and xwo-Mong.) Theknightwho (talk) 05:38, 11 March 2024 (UTC)Reply
@Theknightwho could you update the Chinese entry at WT:LT, such as it is? This, that and the other (talk) 03:51, 11 March 2024 (UTC)
Done. Theknightwho (talk) 05:38, 11 March 2024 (UTC)

Module editing tutorials

Hi, would you be able to point me to some places where I can learn more about module creation and editing?

I'm self-taught in HTML which has served me fine for entries and templates, but there are quite a lot of things I would like to see done at the module level in Welsh (ways of presenting collective-singulative nouns, accounting for literary and colloquial forms in adjectives, a template for phrasal verbs, a template for generating IPA transcriptions) that at the moment are well beyond my abilities.

I'd also prefer not to bother other users by constantly asking them to do tasks for me when I could just learn to do it myself. Cheers, Arafsymudwr (talk) 16:45, 13 March 2024 (UTC)Reply

Min translations

Hi - following the renaming of various Min lects, could you please do the following name replacements in translation sections?

  1. Min Bei → Northern Min (mnp)
  2. Min Dong → Eastern Min (cdo)
  3. Min Zhong → Central Min (czo)
  4. Puxian → Puxian Min (cpx)

They should all be nested under Chinese.

I'm not including Min Nan, since all the translations have to be converted manually due to the split anyway, so changing them to Southern Min would just create confusion. Thanks. Theknightwho (talk) 21:26, 13 March 2024 (UTC)Reply
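The renames listed above could be sketched roughly as follows in Python. This assumes the lect names appear as nested translation lines of the form `*: Name:` and does not attempt the re-sorting that a real run would also need; `RENAMES` and `rename_lects` are names invented for this illustration.

```python
import re

# Old lect name -> new lect name, from the list above.
RENAMES = {
    "Min Bei": "Northern Min",
    "Min Dong": "Eastern Min",
    "Min Zhong": "Central Min",
    "Puxian": "Puxian Min",
}

# Only match the name at the start of a nested translation line, right before
# the colon, so an already-renamed "Puxian Min:" line is left untouched.
_LINE_RE = re.compile(
    r"^(\*:\s*)(" + "|".join(map(re.escape, RENAMES)) + r")(:)", re.MULTILINE
)

def rename_lects(wikitext):
    """Rename Min lect headers in nested translation lines."""
    return _LINE_RE.sub(
        lambda m: m.group(1) + RENAMES[m.group(2)] + m.group(3), wikitext
    )
```

Min Nan is deliberately absent from the mapping, in line with the request to leave those lines for separate handling.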

@Theknightwho OK, I have an existing script to sort translations that I was able to modify to handle this. I will run it shortly. As for Min Nan in translation sections, I checked and there are 2,637 pages with Min Nan translations in them, so it will take a while to do this entirely by hand. I had hoped they would have a qualifier by them indicating the particular Min Nan lect, but usually that doesn't seem to be the case. The first two examples, from dictionary and rain cats and dogs, are typical:
*: Min Nan: {{tt+|nan|字典|tr=lī-tián / jī-tián}}, {{tt|nan|詞典|tr=sû-tián}}, {{tt|nan|辭林|tr=sû-lîm}}
*: Min Nan: {{t+|nan|㴙㴙落|tr=tsa̍p-tsa̍p-lo̍h}}
I know little about Min Nan but from what I've heard, I suspect the vast majority of them are Hokkien. It may be possible in any case to speed this up by looking up the terms in question to see whether the lect can be identified. For example, the four terms given above all have Pronunciation sections indicating that the transliterations in question are Hokkien (and some of them also have Taiwanese Hokkien qualifiers). Some translations don't have transliterations given, but in that case as long as there is a Hokkien pronunciation given, I think it's fine to tag it as Hokkien. (Also I looked for Teochew translations and several of them are tagged as nan or even mn, presumably because someone thought mn stood for Min Nan.) Benwing2 (talk) 23:53, 13 March 2024 (UTC)Reply
@Benwing2 Thanks - I've spent a couple of hours going over them so far, and I've already dealt with all the ones that were marked Teochew (including the one labelled mn, yeah). Out of the ones simply marked "Min Nan", I've only found one which was definitely Teochew, with the others all being Hokkien.
In terms of automating it, the safest thing to do would be to convert any which don't have numbered tones to Hokkien, leaving the rest for manual review (which will probably be <20).
There could plausibly be a handful which are in fact Teochew but have POJ-style (i.e. Hokkien-style) transliterations, but I don't think it's feasible to determine those, since it would be way too time-consuming to convert it to the correct romanisation and check against the entry for every single translation.
Theknightwho (talk) 00:02, 14 March 2024 (UTC)Reply
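The heuristic described above (no numbered tones means Hokkien, anything else goes to manual review) might look something like this in Python. It is a sketch under several assumptions: translations use the `t`, `t+`, `tt` or `tt+` templates with code `nan`, a "numbered tone" is approximated as any letter directly followed by a digit inside those templates, and the new header and code are `Hokkien` and `nan-hbl`. The function name `convert_min_nan_line` is hypothetical.

```python
import re
import unicodedata

# t / t+ / tt / tt+ translation templates with the old "nan" code.
_T_TEMPLATE_RE = re.compile(r"\{\{(tt?\+?)\|nan\|([^{}]*)\}\}")
# A letter immediately followed by a digit, e.g. "si3" or "lu2-pi7".
_NUMBERED_TONE_RE = re.compile(r"[^\W\d_]\d")

def convert_min_nan_line(line):
    """Return (new_line, needs_review) for one translation line."""
    prefix = "*: Min Nan:"
    if not line.startswith(prefix):
        return line, False
    params = [m.group(2) for m in _T_TEMPLATE_RE.finditer(line)]
    # Normalise before testing so precomposable accents count as letters.
    if any(_NUMBERED_TONE_RE.search(unicodedata.normalize("NFC", p)) for p in params):
        return line, True  # leave untouched for manual review
    new_line = "*: Hokkien:" + line[len(prefix):]
    new_line = _T_TEMPLATE_RE.sub(r"{{\1|nan-hbl|\2}}", new_line)
    return new_line, False
```

Only the template parameters are inspected, so digits in a trailing qualifier do not trigger a review; on the other hand a term such as "K2 Hong" is flagged, which errs on the side of a human looking at it.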
@Theknightwho: Sounds good. For reference here is the complete list of Min Nan translations as of the Mar 1 dump that have numbered tones in them:
Page 872 four: Found match for regex: *: Min Nan: {{qualifier|Xiamen}} {{tt+|nan|四|tr=sì, sù}}, {{qualifier|Teochew}} {{tt+|nan|四|tr=si3}}
Page 873 five: Found match for regex: *: Min Nan: {{qualifier|Xiamen}} {{tt+|nan|五|tr=go, ngò}}, {{qualifier|Teochew}} {{tt+|nan|五|tr=ngou6}}
Page 1054 eight: Found match for regex: *: Min Nan: {{qualifier|Xiamen}} {{tt+|nan|八|tr=peh, poeh, pat}}, {{qualifier|Teochew}} {{tt+|nan|八|tr=boih4}}
Page 2107 percent: Found match for regex: *: Min Nan: {{t|nan|百分之|tr=pah-hun-chi...|alt=百分之……}} {{qualifier|the number follows it, e.g. 30%: 百分之三十 pah-hun-chi saⁿ-cha̍p}}
Page 2462 cousin: Found match for regex: *: Min Nan: {{t|nan|叔伯兄|tr=chek-peh-hiaⁿ}} {{qualifier|{{tooltip|older, father’s brother’s son|[[oFBS]]|und=1}}}}, {{t|nan|叔伯阿兄|tr=chek-peh-a-hiaⁿ}} {{qualifier|{{tooltip|older, father’s brother’s son|[[oFBS]]|und=1}}}}, {{t|nan|叔伯小弟|tr=chek-peh-sió-tī}} {{qualifier|{{tooltip|younger, father’s brother’s son|[[yFBS]]|und=1}}}}, {{t|nan|叔伯阿姊|tr=chek-peh-a-chí}} {{qualifier|{{tooltip|older, father’s brother’s daughter|[[oFBD]]|und=1}}}}, {{t|nan|叔伯小妹|tr=chek-peh-sió-mōe, chek-peh-sió-bē}} {{qualifier|{{tooltip|younger, father’s brother’s daughter|[[yFBD]]|und=1}}}}, {{t|nan|表兄|tr=piáu-hiaⁿ}} {{qualifier|{{tooltip|older, mother’s sibling’s or father’s sister’s son|o[[MSiS]] or [[oFZS]]|und=1}}}}, {{t|nan|表小弟|tr=piáu-sió-tī}} {{qualifier|{{tooltip|younger, mother’s sibling’s or father’s sister’s son|y[[MSiS]] or [[yFZS]]|und=1}}}}, {{t|nan|表姊|tr=piáu-ché, piáu-chí}} {{qualifier|{{tooltip|older, mother’s sibling’s or father’s sister’s daughter|o[[MSiD]] or [[oFZD]]|und=1}}}}, {{t|nan|表小妹|tr=piáu-sió-mōe, piáu-sió-bē}} {{qualifier|{{tooltip|younger, mother’s sibling’s or father’s sister’s daughter|y[[MSiD]] or [[yFZD]]|und=1}}}}
Page 2809 handmaid: Found match for regex: *: Min Nan: {{t|nan|女婢|tr=lu2-pi7}}, {{t|nan|tsa1-boo2-kan2}}
Page 4233 eyelash: Found match for regex: *: Min Nan: {{t|nan|目睭毛//目珠毛|tr=ba̍k-chiu-mn̂g / ba̍k-chiu-mô͘}}, {{t+|nan|目睫毛|tr=ba̍k-chiah-mn̂g / ba̍k-cheeh-mô͘ / ba̍k-chiah-mô͘ / ba̍k-chia̍p-mn̂g / ba̍k-chiap-mn̂g}}, {{t|nan|目毛|tr=ba̍k-mn̂g / ba̍k-mô͘}}; {{t|nan|目眥毛|tr=mag8 ci3 mo5}} {{q|Teochew}}
Page 4352 flesh: Found match for regex: *: Min Nan: {{t+|nan|肉|tr=bah4}}
Page 5089 stiff: Found match for regex: *: Min Nan: {{t|nan|liau1}}
Page 16449 aircraft: Found match for regex: *: Min Nan: {{t|nan|飞行器|tr=hui1-hing5-khi3}}
Page 30166 gnash: Found match for regex: *: Min Nan: {{t|nan|咬牙切齒|tr=ga6 ghê5 ciag4 ki2}}, {{t|nan|咬牙|tr=kā-gê}}, {{t|nan|切齒|tr=chhiat-khí / chhiat-chhí}}
Page 31973 farmer: Found match for regex: *: Min Nan: {{t+|nan|農民|tr=lông-bîn}}, {{t+|nan|作穡人|tr=chò-sit-lâng}}, {{t+|nan|作田人|tr=chó-chhân-lâng}}, {{t|nan|农夫|tr=long5-hu1}}
Page 35994 cabbage: Found match for regex: *: Min Nan: {{t|nan|植物人|tr=sêg4 muêh8 nang5}}
Page 38088 arsehole: Found match for regex: *: Min Nan: {{t|nan|lan7-tsiau2-bin7}}, {{t|nan|臭面人|tr=tshau2-bin7-lang5}}
Page 43201 glove: Found match for regex: *: Min Nan: {{t+|nan|手囊|tr=tshiu2-long5}}, {{t+|nan|手套|tr=tshiu2-tho3}}
Page 45493 reunion: Found match for regex: *: Min Nan: {{t|nan|ui5-loo5 围炉}}
Page 45800 dung beetle: Found match for regex: *: Min Nan: {{t|nan|蜣螂|tr=khiong-lông}}, {{qualifier|Quanzhou Hokkien}} {{t|nan|屎龜|tr=sái-ku}}, {{t+|nan|牛屎龜|tr=gû-sái-ku}}, {{qualifier|Teochew}} {{t|nan|牛屎核|tr=ghu5 sai2 hug8}}
Page 48510 loess: Found match for regex: *: Min Nan: {{t+|nan|黃色}}, {{t|nan|黄砂|tr=hong2 sê1}}
Page 50690 troublesome: Found match for regex: *: Min Nan: {{t|nan|lo1so1}}, {{t|nan|lui1-lui1-tui1-tui1}}, {{t|nan|啰嗦|tr=lo1-so1}}
Page 50799 feud: Found match for regex: *: Min Nan: {{t|nan|se3-siu5}}
Page 54507 sashimi: Found match for regex: *: Min Nan: {{t|nan|刺身|tr=chhiah-sin; sa33 si55 mih3}}
Page 64034 shove: Found match for regex: *: Min Nan: {{t|nan|long1}}, {{t|nan|lang1}}, {{t|nan|nng1}}
Page 67068 vulva: Found match for regex: *: Min Nan: {{t|nan|陰門|tr=im-mn̂g}}, {{t|nan|外阴|tr=gue7-im1}}
Page 76097 shirk: Found match for regex: *: Min Nan: {{t|nan|liu1-kiang1}}
Page 104634 halfway: Found match for regex: *: Min Nan: {{t|nan|半路|tr=puann3-loo7}}
Page 106660 thimble: Found match for regex: *: Min Nan: {{t|nan|鍼黹}}, {{t+|nan|針黹|tr=cham-chí, chiam-chí}} {{qualifier|Mainland China}}, {{t|nan|指套|tr=chí-thò}} {{qualifier|Quanzhou and Xiamen}}, {{t|nan|頂針|tr=dêng2 zam1}}, {{t|nan|銅指|tr=tâng-cháiⁿ}}
Page 106811 spacious: Found match for regex: *: Min Nan: {{t|nan|阔|tr=khuah4}}, {{t|nan|khuann3-long1-long1}}
Page 125580 K2: Found match for regex: *: Min Nan: {{t|nan|K2 Hong}}
Page 179602 disadvantageous: Found match for regex: *: Min Nan: {{t|nan|put4-li7}}
Page 335793 Wiktionary:Beer parlour/2007/April: Found match for regex: :::*:Min Nan: (''Amoy'') [[囡仔]] ([[gín-á]]); (''Teochew'') [[孥囝]] ([[nou5gian2]])
Page 1357199 Wiktionary:Beer parlour/2009/May: Found match for regex: :: That's '''1''' then. The [[child]] has 3 levels. Is it really necessary? Can we keep to 2 levels? For example, ** Min Nan: 囡仔 (gín-á), 孥囝 (nou5gian2) (''Teochew'')? [[User:Atitarev|Anatoli]] 22:39, 10 May 2009 (UTC)

Benwing2 (talk) 00:58, 14 March 2024 (UTC)Reply

Thanks. Theknightwho (talk) 00:59, 14 March 2024 (UTC)Reply
@Theknightwho I am running my script now to change Min Dong -> Eastern Min and Min Bei -> Northern Min and re-sort appropriately (there were no translations involving Min Zhong or Puxian). A couple of questions:
  1. Are you finished fixing up the pages with numbered tones in them that I mentioned above? If so once the script finishes I'll do a run to change Min Nan -> Hokkien in translations along with nan -> nan-hbl, and re-sort.
  2. What about occurrences of Min Dong etc. in {{lb}}, {{tlb}}, {{zh-forms}}, {{q}} (occurring mostly in Synonyms sections), etc.? Do these need to be renamed? On rough count, there are 1,318 occurrences of Min Dong in {{lb}}, 279 in {{q}}, 48 in {{zh-forms}} and 20 in {{tlb}}. Counts for Min Bei are roughly similar, while there are only a few instances of Min Zhong and Puxian (without "Min").
Benwing2 (talk) 02:50, 14 March 2024 (UTC)Reply
@Benwing2 Thanks.
  1. Yes.
  2. Yes. For things like labels etc., "Min Nan" should be changed to "Southern Min".
Theknightwho (talk) 04:12, 14 March 2024 (UTC)Reply
@Theknightwho OK sounds good. #1 is running now. Benwing2 (talk) 04:14, 14 March 2024 (UTC)Reply
@Theknightwho What about things like "[Cc]oastal Min" as occurs in {{zh-forms}} in 唐人 and in {{lb}} in 牛母? (I guess these need manual editing, as it appears Coastal Min can be any of Eastern, Southern or Puxian.) Benwing2 (talk) 04:19, 14 March 2024 (UTC)Reply
See also coastal|_|Min in . Benwing2 (talk) 04:20, 14 March 2024 (UTC)Reply
@Theknightwho: Not sure if this is useful but there are 203 occurrences of from=Min in the Mar 1 dump, which generally occur in {{surname}}:
Page 27803 Cu: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin
Page 31307 Lao: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 54700 Dee: Found match for regex: # {{surname|tl|Chinese Filipino|from=Min Nan}}, most notably borne by:
Page 68861 Kong: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 71443 Juan: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 80226 Chan: Found match for regex: # {{surname|tl|from=Min Nan}} (Hokkien) of Chinese origin
Page 80245 Chi: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 80288 Co: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 80532 Du: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin, mostly around [[Cebu]]
Page 80539 Dy: Found match for regex: # {{surname|tl|Chinese Filipino|from=Min Nan}}
Page 80915 Go: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 81022 Haw: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 81061 Ho: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}} of Chinese origin, most notably borne by:
Page 81318 King: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 81334 Ko: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 81420 Lee: Found match for regex: # {{surname|tl|Chinese Filipino|from=Min Nan}}
Page 81515 Lu: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 81516 Lua: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin
Page 81890 Ng: Found match for regex: # {{surname|tl|{{w|Chinese Filipino}}|from=Min Nan}}
Page 82214 Po: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82353 Que: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82618 See: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82665 Shaw: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82674 Sia: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82690 Sin: Found match for regex: # {{surname|tl|from=Min Nan}}, most associated with former Archbishop of Manila, {{w|Jaime Sin}}
Page 82735 So: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin, most notably borne by:
Page 82750 Son: Found match for regex: # {{surname|tl|from=Min Nan|Filipino-Chinese}} of Hokkien origin
Page 82890 Sy: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82930 Tan: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82931 Tang: Found match for regex: # {{surname|en|Chinese|from=Min Nan}}.
Page 82949 Te: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 82956 Tee: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 83037 To: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 83141 Ty: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 83141 Ty: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 83396 Yap: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 83409 Young: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 83409 Young: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 121853 Tiu: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}, most notably borne by:
Page 196098 Samson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 766971 Lew: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 825754 Anson: Found match for regex: # {{given name|tl|male|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 825754 Anson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 1066196 Chu: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1178407 Yao: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265062 Lim: Found match for regex: # {{surname|ilo|from=Min Nan}}
Page 1265062 Lim: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265654 Cheng: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265730 Ang: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265732 Ong: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265733 Suan: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1265734 Cua: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1266900 Pua: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1266901 Uy: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1266918 Chua: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1266924 Khoo: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin
Page 1266970 Ching: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin or {{surname|tl|from=Cantonese}} of Cantonese Chinese origin, notably borne by:
Page 1277675 Gan: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1284142 Koa: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 1443955 Nga: Found match for regex: # {{surname|en|from=Min Dong}}.
Page 1579807 Kang: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 2178085 Deang: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 2625641 Wee: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 2700666 Tin: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 2845428 Henson: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 3305014 Yang: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 3750292 Lo: Found match for regex: # {{surname|tl|from=Min Nan|Filipino-Chinese}} of Hokkien origin
Page 4170429 Chung: Found match for regex: # {{surname|tl|from=Cantonese}}, or {{surname|tl|from=Min Nan}} (Hokkien) of Chinese origin.
Page 4713793 Coo: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin, or {{surname|tl|from=Cantonese}} of Cantonese Chinese origin.
Page 5112069 Sanson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5152613 Kho: Found match for regex: # {{surname|tl|Malaysia, Singapore, Indonesia, Philippines, Thailand, Vietnam-Chinese|from=Min Nan}}, most notably borne by:
Page 5152613 Kho: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}, most notably borne by:
Page 5159150 Kua: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5171997 Yee: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 5375208 Yu: Found match for regex: # {{surname|ceb|Filipino-Chinese|from=Min Nan}}, the 26th most common in the Philippines
Page 5375208 Yu: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}, the 26th most common in the Philippines
Page 5404772 Ngo: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5406204 Chong: Found match for regex: # {{surname|tl|from=Cantonese}} of Cantonese Chinese origin, or {{surname|tl|from=Min Nan}} of Hokkien Chinese origin.
Page 5406528 Tong: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5406530 Chiu: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}, most notably borne by:
Page 5410833 Leong: Found match for regex: # {{surname|tl|from=Cantonese}} of Cantonese Chinese origin or {{surname|tl|from=Min Nan}} of Hokkien Chinese origin.
Page 5411779 Pang: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5413143 Ison: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5415076 Dizon: Found match for regex: # {{surname|pam|from=Min Nan}} of Chinese origin, notably borne by:
Page 5415076 Dizon: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin, notably borne by:
Page 5435565 Yung: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5437599 Shao: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5437924 Loo: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5438022 Sison: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 5438022 Sison: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry, notably borne by:
Page 5438104 Hau: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5438288 Tian: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5439278 Teng: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}} of [[Hokkien]] origin
Page 5442404 Ting: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5453194 Tien: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5512124 Tuazon: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 5512124 Tuazon: Found match for regex: # {{surname|pam|from=Min Nan}}
Page 5512124 Tuazon: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5514761 Goh: Found match for regex: # {{cln|en|surnames from Chinese}} {{surname|en|Chinese|from=Min Nan}}.
Page 5514761 Goh: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5538352 Niu: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 5538352 Niu: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5543775 Quiambao: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 5543775 Quiambao: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin, most notably borne by:
Page 5558677 Lacson: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 5558677 Lacson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry, most notably borne by:
Page 5582383 Tecson: Found match for regex: # {{surname|en|from=Min Nan}} ''[[Hokkien]] Chinese'', common among Filipinos of Chinese descent.
Page 5582383 Tecson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin, most notably descendants of ‘Tek Sun’ brothers from Guangzhou (Canton), China
Page 5584134 Layson: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 5586737 Cinco: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5586737 Cinco: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5614689 Soon: Found match for regex: # {{surname|tl|from=Min Nan|Filipino-Chinese}} of Hokkien origin
Page 5618852 Singson: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 5636472 Gozon: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5646811 Gotamco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5652715 Cayco: Found match for regex: # {{surname|tl|from=Min Nan|Filipino-Chinese}}
Page 5652718 Syson: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}
Page 5652722 Layco: Found match for regex: # {{surname|tl|Tagalog|from=Min Nan}}
Page 5653661 Tengco: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}} of Hokkien origin
Page 5655949 Yuzon: Found match for regex: # {{surname|ceb|Filipino-Chinese|from=Min Nan}}
Page 5655949 Yuzon: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}
Page 5656631 Tiongson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 5656647 Cojuangco: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}, borne by a known political and business clan in the Philippines
Page 5671242 Jocson: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}
Page 5673469 Tiangco: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}
Page 5674047 Quisumbing: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5674054 Lichauco: Found match for regex: # {{surname|tl|Filipino-Chinese|from=Min Nan}}
Page 5676213 Locsin: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin, most notably borne by:
Page 5677430 Quizon: Found match for regex: # {{surname|pam|from=Min Nan}}
Page 5677430 Quizon: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin, most associated with [[w:Dolphy|Dolphy]], which bears the real name of Rudolf Quizon
Page 5677431 Quimpo: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 5677431 Quimpo: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5678951 Tangco: Found match for regex: # {{surname|tl|from=Min Nan}} or Hokkien origin
Page 5678980 Tiongco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5678984 Guanzon: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 5678991 Hizon: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry, most notably descendants of migrants from [[Macau]] to {{w|Parián}}, {{w|Mexico, Pampanga|Mexico}}, {{w|Pampanga}}
Page 5684485 Tiamson: Found match for regex: # {{surname|tl|from=Min Nan}} or Hokkien origin
Page 5686268 Tuason: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin, {{alt form|tl|Tuazon|nocap=1}}
Page 5686671 Tio: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5687329 Ganzon: Found match for regex: # {{surname|tl|from=Min Nan}} or Hokkien origin
Page 5689830 Pecson: Found match for regex: # {{surname|pam|from=Min Nan}}
Page 5689830 Pecson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5690622 Siason: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5690623 Tiozon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5691453 Unson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin common among Filipinos of Chinese ancestry
Page 5692143 Cuizon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5692145 Suico: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5693840 Quimson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5694341 Tancinco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5696938 Ongkiko: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5696941 Sioson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700562 Bauzon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700580 Yatco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700589 Gancayco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700604 Limjoco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700656 Coquia: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700659 Dijamco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700712 Ticzon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5700939 Cosico: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5701342 Yuvienco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5701354 Sangco: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5738755 Ayson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 5740882 Songco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5764989 Leyson: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 5769732 Kiamzon: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5769773 Sayson: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 5773490 Sanciangko: Found match for regex: # {{surname|ceb|Filipino-Chinese|from=Min Nan}}
Page 5773649 Guico: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5773673 Tanchoco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5773685 Siongco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5788737 Tayson: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5788738 Limcaoco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5885208 Joson: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5889986 Tanseco: Found match for regex: # {{surname|tl|from=Min Nan}}
Page 5906982 Siao: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 5906982 Siao: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 5983082 Yongco: Found match for regex: # {{surname|ceb|from=Min Nan}} of Chinese origin
Page 5983082 Yongco: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6060762 Pacquiao: Found match for regex: # {{surname|ceb|from=Min Nan|xlit=Pacquiao}}
Page 6601914 Caw: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6601919 Pueson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 6601923 Causon: Found match for regex: # {{surname|tl|from=Min Nan}} common with Filipinos with Chinese ancestry
Page 6601938 Quitson: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 6601988 Auyong: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6601989 Awyoung: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6603830 Syaw: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6603831 Shau: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6603884 Hwan: Found match for regex: # {{surname|tl|from=Min Nan}} of Chinese origin
Page 6603960 Liong: Found match for regex: # {{surname|tl|from=Cantonese}} of Cantonese Chinese origin, or {{surname|tl|from=Min Nan}} of Hokkien Chinese origin.
Page 6603976 Mapua: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin, notably borne by:
Page 6638858 Banzon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien Chinese origin
Page 7439359 Teh: Found match for regex: # {{surname|en|from=Min Nan}}.
Page 7782052 Sitchon: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 7782063 Itchon: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 7849686 Tiong: Found match for regex: # {{surname|en|from=Min Dong}}.
Page 7849688 Diong: Found match for regex: # {{surname|en|from=Min Dong}}.
Page 7924413 Ngeh: Found match for regex: # {{surname|en|from=Min Dong}}.
Page 8003694 Canoy: Found match for regex: # {{surname|tl|from=Min Nan}} common among Filipinos of Chinese ancestry
Page 8060607 Gueco: Found match for regex: # {{surname|pam|from=Min Nan 慧哥}}
Page 8343774 Siocson: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 8343781 Bengzon: Found match for regex: # {{surname|tl|from=Min Nan}} of Hokkien origin
Page 9058034 Quiason: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
Page 9058035 Quiazon: Found match for regex: # {{surname|tl|from=Min Nan}} common on Filipinos of Chinese ancestry
As can be seen, these are almost all Min Nan, almost all Tagalog and some of them explicitly say "of Hokkien origin". Are these all Hokkien? If so I'll change them accordingly. Benwing2 (talk) 04:29, 14 March 2024 (UTC)Reply
@Benwing2 Thanks for this re the surnames. The whole "of X origin" thing is totally superfluous imo, so should be deleted. If it explicitly says Hokkien somewhere then change it to that; it might also be possible to infer it from the etymology section, too. Any remaining ones should be left to manual review. Theknightwho (talk) 04:33, 14 March 2024 (UTC)Reply
@Theknightwho All right, I'll do this. BTW some of them are already fixed; I randomly picked Siocson and User:Mlgc1998 fixed it 3 days ago. Benwing2 (talk) 04:36, 14 March 2024 (UTC)Reply
@Benwing2 It's probably fine to keep Coastal Min in {{zh-forms}}. We should probably have proper categories set up for it, which categories like Category:Southern Min Chinese would be part of.
There's a whole issue with labels in Chinese entries causing a ton of duplication between the label categories and the lemma categories, but we've not come up with a satisfactory solution to it yet. Theknightwho (talk) 04:30, 14 March 2024 (UTC)Reply
@Theknightwho Yeah, IMO things like Category:Hokkien Chinese should go away in favor of Category:Hokkien lemmas now that we have the latter. {{lb}} could be made to generate the latter category in place of the former but it doesn't seem like such a good idea as it wouldn't categorize correctly into the other categories. Benwing2 (talk) 04:34, 14 March 2024 (UTC)Reply
Also IMO all label categories that refer to specific lects should have corresponding lang codes, either full or etym-only, and probably the etym-only categories added by the Pronunciation section instead of the {{lb}}. Note also that User:-sche proposed a while ago renaming "etym-only language" to something else, which IMO is a good idea; they have gone far beyond being used only for etymologies. Benwing2 (talk) 04:39, 14 March 2024 (UTC)Reply
Yeah, agreed. It's probably worth starting a thread on the BP about renaming etym-only languages, as the current name is really misleading. Theknightwho (talk) 04:50, 14 March 2024 (UTC)Reply
Done. BTW it looks like "Min Nan" was already removed from all Tagalog etc. surnames; the only remaining instances of "from=Min" occurred in a few English surnames of Min Dong origin. I cleaned them up and removed the text "of Chinese origin" etc. following various {{surname}} invocations. The script to implement #2 above (correct "Min Dong", "Min Bei" etc. in labels/qualifiers/etc.) is running. Benwing2 (talk) 07:05, 14 March 2024 (UTC)Reply
Task #2 is close to done; going to sleep now. There are still 6,406 occurrences of "Min Nan" in qualifiers, which my script didn't touch. The occurrences can be found here: User:Benwing2/qualifier-min-nan-1 and User:Benwing2/qualifier-min-nan-2 (split over two files because otherwise the files supposedly exceed the 2MB size; in fact the total file size is 1.2MB but there's that stupid doubler effect). Some of the qualifiers occur in Reference sections but the vast majority seem to occur in Synonyms and Antonyms sections. I am guessing again that the majority are Hokkien but I'm not sure, and generally the transliterations aren't attached. Here we might have to fall back on looking up the terms in question to see which lects they are listed as occurring in (which should be bottable, if you provide appropriate instructions). Benwing2 (talk) 08:31, 14 March 2024 (UTC)Reply
@Theknightwho Let me know if you need help with any other renaming tasks that can be done or sped up by bot. I notice you're going through and renaming instances of "Min *" in comments, {{rfp}} params and other random places but there may be too many to do by hand. There were 17,750 pages satisfying the regex (Min Bei|Min Dong|Min Zhong|Puxian|Min Nan) as of the Mar 1 dump, and 12,222 remaining when I re-downloaded the same pages last night before running task #2. Task #2 changed 6,245 pages, meaning there might be on the order of 6,000 pages left, although I can check for sure by re-downloading the same pages. As I mentioned above, most of the occurrences are probably Min Nan occurring in qualifiers because my script didn't change them. Benwing2 (talk) 22:51, 14 March 2024 (UTC)Reply
@Benwing2 Thanks. Yeah, I was just going through and renaming the various "Min Bei" and "Min Dong" labels, but noticed that "Min Nan" is used on thousands of pages. It's annoying, as it's the one where "Hokkien" is sometimes a more appropriate label. That being said, it's not wrong to put "Southern Min", so it would probably be helpful to change those automatically. Theknightwho (talk) 23:04, 14 March 2024 (UTC)Reply
@Theknightwho See my comment above from last night. It's probably possible to figure out how to change Min Nan automatically to the right label by looking up the page in question to see what lects are listed on the page. If you want me to work on that I can although I'd need some instructions as to what lects to look out for. Benwing2 (talk) 23:10, 14 March 2024 (UTC)Reply
@Benwing2 Yes please - @Justinrleung might be able to give better pointers than me. Theknightwho (talk) 23:12, 14 March 2024 (UTC)Reply
@Theknightwho OK, I re-downloaded the relevant pages. There are 7,396 pages remaining satisfying the regex (Min Bei|Min Dong|Min Zhong|Puxian|Min Nan). Of these, 7,128 mention Min Nan; 39 mention Min Bei; 59 mention Min Dong; 22 mention Min Zhong; and 211 mention Puxian but only 15 of those mention Puxian using the regex Puxian($|.$|[^ ]| [^M]), which excludes "Puxian Min". There are 8,195 total lines mentioning Min Nan (since some pages mention Min Nan more than once). Of these lines, 6,761 contain a qualifier and 6,593 specifically satisfy the regex {q.*{zh-l, i.e. a qualifier followed by a Chinese-style link. Of the 1,605 lines not satisfying {q.*{zh-l, 45 match {q.*{l (a qualifier with a generic link); 111 contain {{thcwd}} or a variant ({{thcwda}}, {{thcwdq}}), almost all preceded by a Min Nan qualifier; 227 contain Min Nan inside of {{zh-forms}}; 21 contain Min Nan inside of {{zh-see}}; 105 contain Min Nan inside of {{zh-der}}, {{col3}} or a variant; and 24 contain an occurrence of {{desc}}. Excluding all of these leaves 1,063 occurrences over 412 pages, of which 260 are outside of mainspace. So I think it should be possible to create a script to handle the {q.*{zh-l occurrences, and handle the remainder type-by-type in a semi-manual fashion. Benwing2 (talk) 00:02, 15 March 2024 (UTC)Reply
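For illustration only, here is a minimal Python sketch of how the Puxian filter quoted above behaves. It is not the script actually used; the sample lines are made up for the example, and only the pattern itself comes from the comment.

```python
import re

# Pattern as quoted in the comment: "Puxian" at end of line, followed by a
# single final character, followed by a non-space, or followed by a space
# and anything other than "M" (so "Puxian Min" is excluded).
BARE_PUXIAN = re.compile(r"Puxian($|.$|[^ ]| [^M])")

# Hypothetical sample lines, invented for this sketch.
samples = [
    "{{lb|zh|Puxian Min}}",   # already uses the new name
    "{{q|Puxian}}",           # bare "Puxian" followed by a non-space
    "from Puxian dialects",   # bare "Puxian" followed by space + non-M
    "Puxian",                 # bare "Puxian" at end of line
]

flags = [bool(BARE_PUXIAN.search(line)) for line in samples]
for line, flagged in zip(samples, flags):
    print(f"{'needs review' if flagged else 'ok':12} {line}")
```

Only the first sample is skipped; the other three are flagged as still using the bare name.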
@Benwing2 Sounds like a good plan. Thanks for doing this. Theknightwho (talk) 00:03, 15 March 2024 (UTC)Reply
@Theknightwho FYI I also did a download run of those same pages checking for those now containing "Southern Min". There are 5,119 lines over 4,377 pages mentioning Southern Min, mostly in labels (as expected) but occasionally in other places that could stand to be reviewed. Benwing2 (talk) 00:05, 15 March 2024 (UTC)Reply
@Theknightwho OK. Can you help me sketch out a general idea of what the qualifiers should be transformed into? For example, I randomly picked page 4445 天涯海角, which contains a synonym 天邊海角 labeled "Min Nan". This latter page has a label Hokkien and it also has {{zh-pron|mn=ml,jj,tw:thiⁿ-piⁿ-hái-kak|cat=cy}}. According to the documentation of {{zh-pron}}, mn means Hokkien and the codes inside mean ml="Mainland China (Xiamen, Quanzhou, Zhangzhou)", jj="Jinjiang", tw="mainstream Taiwan", for which a pronunciation is given. How much info do we want in the qualifiers? Is just "Hokkien" enough in this situation? In general, what lects should be specified in the qualifiers? Maybe just Hokkien, Teochew, Leizhou? Possibly also Quanzhou and/or Zhangzhou dialect if pronun is given for these dialects? This is where I need a bit of guidance from someone like you who knows the languages in question. Benwing2 (talk) 00:24, 15 March 2024 (UTC)Reply
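As a rough illustration of the lookup described above, the following Python sketch pulls the location codes and the pronunciation out of the mn= parameter of the {{zh-pron}} call quoted in the comment. The function name and the naive pipe-splitting are assumptions made for this sketch; real mn= values may be more complex than this single-pronunciation case, and nested templates are not handled.

```python
# Location codes and their meanings, as given in the comment above.
LOCATION_NAMES = {
    "ml": "Mainland China (Xiamen, Quanzhou, Zhangzhou)",
    "jj": "Jinjiang",
    "tw": "mainstream Taiwan",
}


def parse_mn_param(template_text):
    """Return (location_codes, pronunciation) from a zh-pron call.

    Hypothetical helper: splits on "|" naively and assumes the value has
    the simple shape "code,code,...:pronunciation".
    """
    inner = template_text.strip()[2:-2]      # drop the outer {{ and }}
    for param in inner.split("|")[1:]:       # skip the template name
        key, sep, value = param.partition("=")
        if sep and key.strip() == "mn":
            codes, _, pron = value.partition(":")
            return [c.strip() for c in codes.split(",")], pron
    return [], ""


codes, pron = parse_mn_param("{{zh-pron|mn=ml,jj,tw:thiⁿ-piⁿ-hái-kak|cat=cy}}")
for code in codes:
    print(code, "=", LOCATION_NAMES.get(code, "unknown"))
```

A non-empty result for mn= is the signal that the target page has a Hokkien pronunciation at all, which is the minimum needed to pick "Hokkien" as the qualifier.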
@Benwing2 I'd wait for Justin to comment, as I think you're really overestimating my knowledge. I've got a very broad understanding of what needs to be done, but my understanding of Module:nan-pron is relatively low, so I won't be much help in interpreting the input. Theknightwho (talk) 00:30, 15 March 2024 (UTC)Reply
@Theknightwho OK. I had assumed you know the languages because you seem able to correctly split the lects; maybe you're just a fast learner ;) ... Benwing2 (talk) 00:40, 15 March 2024 (UTC)Reply
@Benwing2: I think for qualifiers of synonyms, etc., it can just be "Hokkien" when there's only a Hokkien pronunciation, "Teochew" when there's only a Teochew pronunciation, etc., and we don't need to worry about the finer distinctions, which we will get with {{lb}} at the entry. If it's more than one Southern Min variety, we could either use the Southern Min label or list all the relevant Southern Min languages; I don't have a strong feeling about either way. — justin(r)leung (t...) | c=› } 01:38, 15 March 2024 (UTC)Reply
@Justinrleung All right. What is the complete list of Southern Min varieties? Benwing2 (talk) 01:39, 15 March 2024 (UTC)Reply
The currently supported varieties in {{zh-pron}} are Hokkien, Teochew and Leizhou Min. Other than these, there's Hainanese as well as other varieties that haven't been dealt with (WT:RFM#Additional Southern Min languages). — justin(r)leung (t...) | c=› } 01:46, 15 March 2024 (UTC)Reply
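The rule suggested above for synonym qualifiers could be sketched as follows. This is a minimal Python illustration, not the bot's code: the function name, the list_all switch (covering the two options left open for the multi-variety case) and the None return for "leave to manual review" are all assumptions of the sketch.

```python
# The three varieties named above as currently supported in {{zh-pron}}.
SUPPORTED = ("Hokkien", "Teochew", "Leizhou Min")


def qualifier_for(lects_with_pron, list_all=True):
    """Pick a qualifier from the set of lects that have a pronunciation.

    One lect: use its name. Several lects: either list them all or fall
    back to "Southern Min". None: return None (needs manual review).
    """
    present = [lect for lect in SUPPORTED if lect in lects_with_pron]
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    return ", ".join(present) if list_all else "Southern Min"


print(qualifier_for({"Hokkien"}))
print(qualifier_for({"Teochew", "Hokkien"}))
print(qualifier_for({"Teochew", "Hokkien"}, list_all=False))
```

Iterating over SUPPORTED rather than over the input keeps the output order stable regardless of how the lects were collected.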

──────────────────────────────────────────────────────────────────────────────────────────────────── @Justinrleung, Theknightwho: I finished the script to convert Min Nan and Southern Min in qualifiers in Synonym/Antonym sections (and the like; whenever followed by a {{zh-l}} link). Out of 6,283 pages where it tried to do something, it was able to process 5,938, which is a pretty good record (94.5%). The breakdown of lects generated is as follows:

5418 Hokkien
 485 Hokkien|Teochew
  16 Hokkien|Teochew|Leizhou
  10 Hokkien|Leizhou
   9 Teochew

The script issued 663 warnings. They are here: User:Benwing2/min-nan-qualifier-conversion-warnings. One of you two might want to go through them. Note that 268 "may be ignorable" (meaning that the script was able to continue on and ultimately do something, despite the warning). Of the remaining 395, 276 are due to the link referring to a nonexistent page; you'd need domain knowledge to know which lect(s) are appropriate. This leaves 119, of which 50 are "Couldn't parse" errors (the line wasn't formatted in a standard fashion); 35 are "Couldn't find 'Min Nan' or 'Southern Min' qualifier" errors (the qualifier template says something like {{q|literary or Min Nan, Hakka}} or {{q|Cantonese, Min Nan}} rather than just "Min Nan"); 22 are "Saw multiple Etymology/Pronunciation sections" (in such a case, the code tries hard to figure out the correct lects, including using the gloss in the {{zh-l}} link and making sure there is more than one Etymology/Pronunciation section that refers to Min Nan and that the two sections have different lects in them); 5 are "Can't find Chinese section"; and 7 are some random misc stuff. I am going to run the script in save mode either tonight or tomorrow. Benwing2 (talk) 07:52, 15 March 2024 (UTC)Reply
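As a quick cross-check of the figures quoted above, the short Python tally below recomputes the totals from the individual counts. The dictionary keys are shortened labels chosen for this sketch; all numbers are taken from the comment.

```python
# Lect breakdown from the table above.
lect_counts = {
    "Hokkien": 5418,
    "Hokkien|Teochew": 485,
    "Hokkien|Teochew|Leizhou": 16,
    "Hokkien|Leizhou": 10,
    "Teochew": 9,
}
attempted = 6283
processed = sum(lect_counts.values())
success_rate = round(100 * processed / attempted, 1)

# Warning breakdown from the paragraph above.
warnings_total = 663
ignorable = 268
missing_target_page = 276
non_ignorable = warnings_total - ignorable
needs_review = non_ignorable - missing_target_page
review_breakdown = {
    "couldn't parse": 50,
    "no plain Min Nan/Southern Min qualifier": 35,
    "multiple Etymology/Pronunciation sections": 22,
    "no Chinese section": 5,
    "miscellaneous": 7,
}

print(processed, success_rate, non_ignorable, needs_review)
```

The per-type warning counts add up to the 119 left after removing the ignorable and missing-page warnings, so the breakdown is internally consistent.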

@Theknightwho This is running; maybe 1 to 1.5 hours and it will finish. Benwing2 (talk) 20:43, 15 March 2024 (UTC)Reply
Cool - thanks. Theknightwho (talk) 20:47, 15 March 2024 (UTC)Reply
BTW can {{zh-l}} be replaced by {{l|zh}}? I'm not sure any more what the Chinese-specific behavior in {{zh-l}} is. Maybe it's just automatic handling of traditional vs. simplified forms? Benwing2 (talk) 20:47, 15 March 2024 (UTC)Reply
@Theknightwho Also maybe we can have the lect be specified using a lang code prefix instead of having it a separate qualifier. Benwing2 (talk) 20:48, 15 March 2024 (UTC)Reply
@Benwing2 On that point, would it be possible to do a similar analysis for all uses of the nan code used in the Thesaurus namespace? There are 483 uses at the moment, but conversion is slow as it requires a bunch of manual analysis. Some of them also have "Min Nan" in qualifiers, which will need revising as well. Theknightwho (talk) 20:52, 15 March 2024 (UTC)Reply
@Theknightwho OK, I'll take a look. Benwing2 (talk) 20:54, 15 March 2024 (UTC)Reply
@Theknightwho @Justinrleung For this purpose I think we (a) need to add the missing etym-only codes for Min Nan lects, and (b) we should include the specific lect and not just "Hokkien" in the lang prefix or qualifier. For example, I took a look at Thesaurus:打耳光 meaning "to slap someone in the face"; there are three synonyms labeled nan as well as two more explicitly labeled Zhangzhou Hokkien and Tainan Hokkien respectively. Of the three labeled nan, one is a red link, one is labeled Xiamen Hokkien and one is labeled "Quanzhou, Zhangzhou and Taiwanese Hokkien". Labeling the latter two just "Hokkien" would seem incomplete. Benwing2 (talk) 21:09, 15 March 2024 (UTC)Reply
@Benwing2 The principle I've followed so far has been to use the most specific label which adequately covers everything at the target, where that's possible. So anything that's labelled (e.g.) "Xiamen Hokkien" would get the langcode nan-xmn, but something labelled "Quanzhou, Zhangzhou and Taiwanese Hokkien" would just get nan-hbl. I agree with Justin that the labels for links aren't as important as those on the entries themselves, so incompleteness isn't the end of the world. When multiple lects are mentioned (e.g. Hokkien and Teochew), I've ditched the langcode altogether and put (e.g.) "Southern Min" as a qualifier. Theknightwho (talk) 21:12, 15 March 2024 (UTC)Reply
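The "most specific label that covers everything" principle described above might look roughly like this in Python. This is a sketch under stated assumptions: labels are represented as (lect, variety) pairs, the function name is hypothetical, and only the two codes named in the comment (nan-xmn and nan-hbl) are included. Codes for other lects are not given in this thread, so a single non-Hokkien lect is simply returned as a qualifier.

```python
HOKKIEN_CODE = "nan-hbl"
# Only the variety code named in the comment; others would be added here.
VARIETY_CODES = {"Xiamen": "nan-xmn"}


def pick_label(labels):
    """Return (langcode, qualifier) for the labels found at the target.

    labels is a list of (lect, variety) pairs, with variety None when the
    label is just the lect name.
    """
    if not labels:
        return None, None
    lects = {lect for lect, _ in labels}
    if len(lects) > 1:
        # Several lects: drop the langcode and use a broad qualifier.
        return None, "Southern Min"
    (lect,) = lects
    if lect != "Hokkien":
        return None, lect  # placeholder: no code assumed for other lects
    varieties = {variety for _, variety in labels}
    if len(varieties) == 1:
        (variety,) = varieties
        if variety in VARIETY_CODES:
            return VARIETY_CODES[variety], None
    # Mixed or uncoded varieties: fall back to generic Hokkien.
    return HOKKIEN_CODE, None


print(pick_label([("Hokkien", "Xiamen")]))
print(pick_label([("Hokkien", "Quanzhou"), ("Hokkien", "Zhangzhou"),
                  ("Hokkien", "Taiwanese")]))
print(pick_label([("Hokkien", None), ("Teochew", None)]))
```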
Also, as an aside, we don't currently have an etym-only langcode for Taiwanese Hokkien, because it's not a well-defined lect in the way varieties like Xiamen, Zhangzhou and Quanzhou are; all three are spoken on Taiwan, but (for historical reasons) the Hokkien-speaking communities on Taiwan have undergone a lot more influence from Japanese and English than their equivalents on the mainland, so it makes sense to use that label sometimes. In those cases, just labelling them "Hokkien" isn't really a problem if it's just in the thesaurus entry. Theknightwho (talk) 21:20, 15 March 2024 (UTC)Reply
@Theknightwho All right, let me look at a few more examples. While we're at it, what do you think of replacing the etym-only codes for the Hokkien varieties with ones conforming to the principles I laid out in WT:RFM? Since these codes are newly added I suspect they're barely used. This would mean nan-jnj -> nan-jin (Jinjiang Hokkien), nan-qzh -> nan-qua (Quanzhou Hokkien), nan-xmn -> nan-xia (Xiamen Hokkien), nan-zzh -> nan-zha (Zhangzhou Hokkien), nan-plp -> nan-qua-PH (probably) or nan-PH (possibly) or nan-phi (perhaps) (Philippine Hokkien). Benwing2 (talk) 21:44, 15 March 2024 (UTC)Reply
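If the renames proposed above were applied by bot, the substitution step could be sketched like this in Python. The function name and the sample string are hypothetical; nan-plp is deliberately left out of the table because the comment leaves its replacement undecided.

```python
import re

# Old code -> proposed new code, as listed in the comment above.
RENAMES = {
    "nan-jnj": "nan-jin",  # Jinjiang Hokkien
    "nan-qzh": "nan-qua",  # Quanzhou Hokkien
    "nan-xmn": "nan-xia",  # Xiamen Hokkien
    "nan-zzh": "nan-zha",  # Zhangzhou Hokkien
}

# Match a code only when it is not part of a longer hyphenated identifier.
CODE_RE = re.compile(
    r"(?<![\w-])(" + "|".join(map(re.escape, RENAMES)) + r")(?![\w-])"
)


def rename_codes(text):
    """Replace each old code in text with its proposed replacement."""
    return CODE_RE.sub(lambda m: RENAMES[m.group(1)], text)


sample = "nan-xmn:食飯, nan-plp:x, nan-qzh-foo"
print(rename_codes(sample))
```

The lookarounds keep the substitution from touching longer identifiers that merely start with one of the old codes.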
@Benwing2 I don't mind too much. I have a small preference for doing it syllabically rather than by the first letters of the name, but I don't mind if you want to use a standardised format for them.
There are sometimes instances where we won't be able to follow it, though (e.g. Category:South Dravidian I languages and Category:South Dravidian II languages, where I opted for dra-sdo and dra-sdt, respectively). Theknightwho (talk) 21:48, 15 March 2024 (UTC)Reply
@Theknightwho Yes, understood. BTW I wouldn't have an issue with something more syllabic than using the first three letters, it's just that it's not so easy to guess automatically what the right set of letters to use is in that case. (Actually the principle you followed for South Dravidian I/II *is* consistent with the principles I laid out, which call for using the initials of the lect when using the first three letters isn't practical.) Benwing2 (talk) 21:53, 15 March 2024 (UTC)Reply
@Theknightwho I changed the language codes. I used nan-hbl-PH for Philippine Hokkien. I think we can go ahead and use nan-hbl-TW for Taiwanese Hokkien, and create subvariety codes for the specific dialects that are derived respectively from Xiamen, Zhangzhou and Quanzhou (e.g. nan-xia-TW etc.). I also modified Module:columns so that it can take a comma-separated list of prefixed lang codes, e.g. nan-hbl,hak:[[毋]][[知]] and handle them appropriately (i.e. using the first one to create the term link but displaying all of them as qualifiers). I'm going to work on fixing up the Thesaurus entries now. Benwing2 (talk) 23:29, 16 March 2024 (UTC)Reply
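The comma-separated prefix handling described above is implemented in Lua in Module:columns; the Python sketch below only illustrates the idea and does not mirror that code. The function name, the default_lang parameter and the exact shape allowed for a lang code are assumptions of the sketch.

```python
import re

# Assumed shape of a lang code: lowercase base plus optional hyphenated
# parts, e.g. "hak", "nan-hbl", "nan-hbl-PH".
CODE = r"[a-z]+(?:-[A-Za-z0-9]+)*"
PREFIX_RE = re.compile(rf"^({CODE}(?:,{CODE})*):(.+)$", re.S)


def split_lang_prefix(item, default_lang="zh"):
    """Split "code1,code2:term" into (link_lang, qualifier_codes, term).

    The first code is used to build the link; all codes are kept so they
    can be displayed as qualifiers. Items without a prefix are returned
    unchanged with the default language.
    """
    match = PREFIX_RE.match(item)
    if not match:
        return default_lang, [], item
    codes = match.group(1).split(",")
    return codes[0], codes, match.group(2)


print(split_lang_prefix("nan-hbl,hak:[[毋]][[知]]"))
print(split_lang_prefix("[[毋]][[知]]"))
```

One limitation of this sketch: any item that begins with lowercase letters followed by a colon (a URL, for example) would be misread as carrying a lang prefix, so a real implementation would need to validate the codes against the known language list.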
@Benwing2: I think in most cases, specific dialects of Taiwanese Hokkien should not be tied back to the source varieties of Quanzhou and Zhangzhou (and maybe Xiamen, which is itself generally thought of as a Quanzhou-Zhangzhou mixed variety). These kinds of labels are generally not helpful lexicographically; they are only well-defined phonologically and have little bearing on vocabulary, where much more convergence has occurred in Taiwan due to dialect levelling. The locales in Taiwan (e.g., Lukang, Yilan, etc.) for subdialects of Taiwanese that are less mixed may be more helpful in cases where we want to highlight them. — justin(r)leung (t...) | c=› } 02:55, 17 March 2024 (UTC)
@Justinrleung OK, this is fine and it jibes well with the nan-hbl-TW label. I was just responding to User:Theknightwho's assertion that Taiwanese Hokkien isn't a well-defined lect. Benwing2 (talk) 03:21, 17 March 2024 (UTC)
@Theknightwho Code is written to process Thesaurus entries and convert nan as appropriate. I will finish the analysis tomorrow and run it. Benwing2 (talk) 09:32, 17 March 2024 (UTC)
@Theknightwho I expanded the script I wrote so it also attempts to convert lects mentioned in <qq:...> qualifiers into lect code prefixes. (This is the origin of that "part 1" section in WT:RFM.) These should not change the qualifier output much (possibly in some cases rearranging the order, that's it) but will help with transliteration and such. Some stats on what I have so far:
  1. I ran it on the 2,013 pages in CAT:Chinese thesaurus entries. It would change 620 pages.
  2. It issues 328 warnings. Of these:
    1. 255 of these are due to unrecognized lects in qualifiers. All of these are already discussed in the "part 1" WT:RFM section.
    2. Of the remaining 73, 40 are due to looking up a page tagged as nan: and finding it doesn't exist.
    3. Of the remaining 33, 14 are "informational" warnings that can be ignored.
    4. Of the remaining 19, 15 are due to finding multiple etymologies with different sets of Southern Min varieties in the different etymologies.
Benwing2 (talk) 05:04, 18 March 2024 (UTC)
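As a rough illustration of the qualifier conversion mentioned above, here is a minimal Python sketch. It is not the actual script: the `LECT_TO_CODE` table, the `convert_item` function and the warning text are hypothetical, and it assumes the `<qq:...>` qualifier sits at the end of the item and lists lects separated by commas.

```python
import re

# Illustrative mapping only; the real script's lect table is not shown
# in this thread.
LECT_TO_CODE = {
    "Hokkien": "nan-hbl",
    "Hakka": "hak",
}

# Matches "<term><qq:lect1, lect2>" with the qualifier at the very end.
QQ_RE = re.compile(r"^(.*?)<qq:([^<>]*)>$")


def convert_item(item):
    """Return (new_item, warnings) for a single item.

    If every lect in the qualifier is recognized, the qualifier is
    replaced by a comma-separated lang-code prefix. Otherwise the item
    is left unchanged and one warning per unrecognized lect is returned.
    """
    m = QQ_RE.match(item)
    if not m:
        return item, []
    term, qualifier = m.groups()
    lects = [q.strip() for q in qualifier.split(",")]
    unknown = [lect for lect in lects if lect not in LECT_TO_CODE]
    if unknown:
        return item, ["unrecognized lect: %s" % lect for lect in unknown]
    codes = ",".join(LECT_TO_CODE[lect] for lect in lects)
    return "%s:%s" % (codes, term), []


print(convert_item("[[毋]][[知]]<qq:Hokkien, Hakka>"))
print(convert_item("[[foo]]<qq:Martian>"))
```

Leaving the item untouched when any lect is unrecognized is a design choice of this sketch: it keeps the conversion conservative, so unknown qualifiers only produce warnings for manual review.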
@Theknightwho Scratch the above stats. My script needs some changes to not overgenerate in the presence of multiple definitions (it already handles multiple etymology/pronunciation sections but needs to be extended for multiple definitions, because sometimes specific labels apply only to specific definitions). Benwing2 (talk) 05:24, 18 March 2024 (UTC)
OK, I rewrote the script to take into account the presence of multiple definitions and try to use the glosses present in Thesaurus pages to whittle down the set of possible definitions to use. The first pass doing that increased warnings from 328 to 1,344 (!) and reduced the number of pages changed from 620 to 490, but I think I can do a whole lot better than that. Stay tuned. Benwing2 (talk) 07:07, 18 March 2024 (UTC)
Generally, {{zh-l}} should be replaced (especially if it's giving a Hokkien pronunciation), but that's probably something to do en masse at another time, as there are tens of thousands of uses so we'll probably want to hash out a proper conversion method. Theknightwho (talk) 20:55, 15 March 2024 (UTC)
@Theknightwho Yes, agreed; just something to keep in mind. Benwing2 (talk) 20:56, 15 March 2024 (UTC)

Module:columns and Module:sa-verb, Module:sa-verb/data

There are 3 Sanskrit entries in CAT:E because of an error in {{sa-conj}}, and I checked the entire transclusion list for ततान: your edit to Module:columns is the only recent change to executable code for anything in the list. Indeed, there are comments in Module:sa-verb saying that code was copied from Module:columns and would need to be updated if that were changed. Chuck Entz (talk) 00:19, 17 March 2024 (UTC)

@Chuck Entz Thank you, I'll fix. I looked for modules using Module:columns but I forgot about the display_from entry point. Benwing2 (talk) 00:21, 17 March 2024 (UTC)
@Chuck Entz I don't think my change to Module:columns has anything to do with this error. User:Exarchus is actively working on Module:sa-verb/data and made the last change only an hour ago. User:Exarchus, can you take a look at these errors? They are due to a buggy Lua pattern. Benwing2 (talk) 00:35, 17 March 2024 (UTC)
I somehow read the dates wrong on those edits. I could have sworn they were from the same date as the ones to Module:sa-verb. You're no doubt right. Sorry! Chuck Entz (talk) 00:47, 17 March 2024 (UTC)