Bolding of clippings etc.

Did you go ahead and make this change, after discussion on Discord? I just noticed it at the "clipping" sense at mongo. I don't like the new formatting. The entire line should be in italics (as it used to be), to indicate this is a "non-gloss" or whatever we call it, and it's not a noun that means "a clipping of something". Should have been a vote... Equinox 15:19, 9 November 2022 (UTC)

I haven’t touched it! Agree that anything like that should be put to a vote. Theknightwho (talk) 16:04, 9 November 2022 (UTC)
@Equinox I've just spotted that it's because it was using {{clipping}} (an etymology template), not {{clipping of}}. I've corrected it. Theknightwho (talk) 16:51, 9 November 2022 (UTC)

Speedy deletion

Hi, please don't blank pages that you nominate for deletion. If they're misspellings, include a link to the right spelling in the rationale and they may be useful as (nonexistent) redirects. Ultimateria (talk) 03:37, 10 November 2022 (UTC)

I don't usually, but these were bad orthographies that we don't want (even as redirects), because they contain mistakes that we don't want to propagate. Noted re including the right spelling in the note, though. Theknightwho (talk) 03:41, 10 November 2022 (UTC)

Serbo-Croatian entry name normalization

Hello. One of your module edits has resulted in the unlinkability (via Module:links) of Serbo-Croatian pages whose titles contain the character ć.

The diacritic on ć is a standard part of Serbo-Croatian orthography that shouldn't be stripped away from page names, unlike e.g. the diacritic on ȉ. The only reason I even noticed this was the module error on tisuća. Hopefully the problem is confined to this one character in one language, but I haven't tested extensively. Cheers, 09:21, 24 November 2022 (UTC)

Thanks - I thought I'd caught all of these! I've implemented a remove_exceptions parameter for entry_name, which excludes specific characters from having their diacritics removed. Currently, it will only work for precomposed characters (as I wanted to get it working ASAP), but I'll fix up a general solution shortly. Theknightwho (talk) 09:59, 24 November 2022 (UTC)
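The exception mechanism described in this reply can be sketched roughly as follows. This is a Python illustration only: the real code is a Lua module, and the function name make_entry_name, the set SH_REMOVE and the set SH_EXCEPTIONS are hypothetical names chosen for the example, not the module's actual identifiers or data. Like the first version described above, it only protects precomposed characters.

```python
import unicodedata

def make_entry_name(text, remove_diacritics, remove_exceptions=frozenset()):
    """Strip the listed combining marks, leaving protected characters untouched."""
    out = []
    # Work on the composed form so protected letters appear as single characters.
    for ch in unicodedata.normalize("NFC", text):
        if ch in remove_exceptions:
            out.append(ch)  # protected precomposed character: keep as written
            continue
        # Decompose, then drop only the combining marks we were asked to remove.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if c not in remove_diacritics))
    # Recompose whatever is left so the result matches a normal page title.
    return unicodedata.normalize("NFC", "".join(out))

# Assumed settings for illustration: strip accent marks, but keep c-acute.
SH_REMOVE = {"\u0300", "\u0301", "\u0304", "\u030F", "\u0311"}
SH_EXCEPTIONS = {"\u0107", "\u0106"}  # lowercase and uppercase c with acute

print(make_entry_name("tisu\u0107a", SH_REMOVE, SH_EXCEPTIONS))  # keeps the acute
print(make_entry_name("tisu\u0107a", SH_REMOVE))                 # loses the acute
```

Without the exception set, the acute on ć is removed along with the accent marks, because ć decomposes to c plus the same combining acute (U+0301) that marks accents.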
(As a side point, I was very confused as to why your links didn't fix once I made the change, and thought I'd made a mistake! Just realised that you've hard-coded the links haha.) Theknightwho (talk) 10:01, 24 November 2022 (UTC)

ийс

After doing hundreds of null edits to clear the results of an error you made yesterday in a massively transcluded module, I found an error at ийс that superficially defies explanation: "The current page name 'ийс' does not match any of the numbers listed in Module:number list/data/inh for 9. Check the data module or the spelling of the page." When I look at Module:number list/data/inh, the relevant part has:

numbers[9] = { cardinal = "ийс", }

Without examining the character codes, I can't see any difference. Indeed, wikilinking the entry name, ийс and the string in the module, ийс, gives links to the same page stating that there's no match.

Going through the transclusion list at ийс, I see that the only recent edits were all by you, and they seem to have revolved around dealing with diacritics. Not so coincidentally, this is the only item in Module:number list/data/inh that has a diacritic.

I have neither the time nor the expertise to figure out what the exact problem is, so you're going to have to fix this. Thank you. Chuck Entz (talk) 19:31, 25 November 2022 (UTC)

@Chuck Entz Caught the issue. It was to do with the fact that I was decomposing all precomposed characters (which includes й) in order to strip the appropriate diacritics - which avoids having to keep massive lists of precomposed characters for the diacritics you want to strip (these were contributing to memory usage, and are also a general PITA to maintain). Certain languages use dedicated modules for entry names (in this case MOD:inh-entryname). In those particular cases, what I'd omitted to do was recompose the characters afterwards. That usually doesn't matter, as the wiki software accounts for it automatically. However, as Lua doesn't natively support UTF-8, it was comparing й with и + ◌̆ and declaring them to be different. Hence, the links worked, but they weren't being recognised as the same by the module. Theknightwho (talk) 20:27, 25 November 2022 (UTC)
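The mismatch described here is easy to reproduce outside the wiki. The following is a minimal Python sketch, not the module code (which is Lua and compares byte strings); the helper same_title is a hypothetical name used only for this illustration.

```python
import unicodedata

page_name = "\u0438\u0439\u0441"  # the title with a precomposed short i (U+0439)
decomposed = unicodedata.normalize("NFD", page_name)  # U+0439 becomes U+0438 + U+0306

def same_title(a, b):
    """Compare two titles only after recomposing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(page_name == decomposed)            # False
print(len(page_name), len(decomposed))    # 3 4
print(same_title(page_name, decomposed))  # True
```

The two strings render identically, which is why the links looked fine, but a plain equality test sees three code points on one side and four on the other.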

sortkey changes

Hi, can you explain all your sortkey changes to Module:languages/data2? They have led to a bunch of errors in CAT:E related to Module:collation. Benwing2 (talk) 21:04, 27 November 2022 (UTC)

@Benwing2 It was the latest change to Module:languages that caused the error - I'll have to work out what the issue is, as it seems to be cropping up on a small percentage of pages using the column templates. I've reverted it for now. Theknightwho (talk) 21:06, 27 November 2022 (UTC)
What I meant is, what is the overarching purpose of these changes? Is it to save memory? If so are you sure it actually saves memory? Adding new modules tends to increase memory. Benwing2 (talk) 21:08, 27 November 2022 (UTC)
@Benwing2 Yes. It ensures that the sortkeys are only loaded if that language is actually used on the page. Theknightwho (talk) 21:09, 27 November 2022 (UTC)
OK. Please monitor CAT:E for memory-related issues once you finish making your changes, as their occurrence often doesn't follow obvious logic. Benwing2 (talk) 21:11, 27 November 2022 (UTC)
Yep, absolutely. Theknightwho (talk) 21:14, 27 November 2022 (UTC)
I think the problem is that you made it so that some conditional branches resulted in the function returning a nil value, which cannot be compared. 21:09, 27 November 2022 (UTC)
Yep - I see the issue. Silly mistake. Theknightwho (talk) 21:09, 27 November 2022 (UTC)
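The kind of bug diagnosed in the two comments above (a branch that falls through without returning, so the caller ends up comparing an empty value) can be sketched as follows. This is a Python analogy, with None standing in for Lua's nil; the functions sort_key_buggy and sort_key_fixed and the per-language rules are invented for the example and are not taken from Module:languages or Module:collation.

```python
def sort_key_buggy(text, lang):
    """Hypothetical sortkey function with a branch that falls through."""
    if lang == "en":
        return text.lower()
    elif lang == "de":
        return text.lower().replace("\u00df", "ss")
    # No return for other languages: the caller gets None (nil in Lua).

def sort_key_fixed(text, lang):
    """Same logic, with an explicit fallback so a string is always returned."""
    if lang == "en":
        return text.lower()
    elif lang == "de":
        return text.lower().replace("\u00df", "ss")
    return text

words = ["b", "A"]
print(sorted(words, key=lambda w: sort_key_fixed(w, "fr")))  # ['A', 'b']
try:
    sorted(words, key=lambda w: sort_key_buggy(w, "fr"))
except TypeError as err:
    print("comparison failed:", err)
```

The failure only appears for inputs that reach the missing branch, which fits the observation that the error showed up on a small percentage of pages.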
I see 16 pages in CAT:E with memory errors; these could be related to your changes if you pushed them live. Benwing2 (talk) 00:51, 29 November 2022 (UTC)
I'm looking into it. I'm seeing a reduction in memory usage on most pages, but these are odd outliers. Theknightwho (talk) 00:52, 29 November 2022 (UTC)
Probably related to how many languages occur on a given page; with your changes, lots of little modules are loaded, with one being loaded every time a sort key for those languages needs to be created (since they contain functions, meaning loadData can't be used), and module loads appear to have significant memory overhead. Benwing2 (talk) 01:12, 29 November 2022 (UTC)
I suspect you're right. I also suspect there are some horrors lurking in some of the language-specific modules, but it's such a massive task to start hunting for them.
I'm trying to find what the function loadData actually looks like, to see how they manage to share memory between different #invoke calls. Might give some insight into what approach we could take. Theknightwho (talk) 01:26, 29 November 2022 (UTC)
@Benwing2 I've done a nasty hack, by taking advantage of the package.loaded logic. It hasn't got rid of all the errors, but it has made them go away on an. The way that package.loaded works is that if a module has already been loaded via a previous invoke, then a key/val pair will exist in the package.loaded table - presumably allowing the module to be run again with a smaller footprint. What I've done is pre-load the module, and then set the val to the output (i.e. the sortkey). Any time the module is subsequently "run", it will just output the string (bypassing the module logic).
This only works due to the fact that a page will only ever have one sortkey for a given language. As a result, this fudge won't work for entry name functions. Theknightwho (talk) 02:02, 29 November 2022 (UTC)
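As a rough analogy for the hack described above, the sketch below caches the output of a sortkey module under the module's name, so later "runs" return the stored string. It is a Python illustration under stated assumptions: the dictionary loaded merely stands in for Lua's package.loaded, and make_sortkey, sortkey_for and the module name "xx-sortkey" are hypothetical. It does not reproduce how Scribunto actually handles package.loaded.

```python
loaded = {}             # stands in for the package.loaded table
calls = {"count": 0}    # counts how often the real sortkey logic runs

def make_sortkey(page_title):
    """Placeholder for the module's real (expensive) sortkey logic."""
    calls["count"] += 1
    return page_title.lower()

def sortkey_for(module_name, page_title):
    """Run the module once, then keep returning the stored output."""
    if module_name not in loaded:
        loaded[module_name] = make_sortkey(page_title)
    return loaded[module_name]

print(sortkey_for("xx-sortkey", "Angel"))  # angel (logic runs)
print(sortkey_for("xx-sortkey", "Angel"))  # angel (served from the cache)
print(sortkey_for("xx-sortkey", "Other"))  # angel (stale: input is ignored)
print(calls["count"])                      # 1
```

The third call shows the limitation mentioned in the reply: because the cached value ignores its input, the trick is only safe when each language needs a single sortkey per page, and it breaks for anything called with many different strings, such as entry name functions.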
Did you ever get your "hack" working properly? If not I think you should at least consider undoing all the sortkey changes as a failed experiment -- from what I've seen, they increase rather than decrease memory on the most memory-intensive pages, which are the only pages that matter for these purposes, because they result in lots of small modules getting repeatedly loaded on pages that use lots of languages. The reason those pages no longer appear in CAT:E is because IP 98.* has been diligently converting all the pages to use the *-lite templates, which is a big hack that we should avoid if possible. Benwing2 (talk) 03:20, 5 December 2022 (UTC)
I'm working on it! I'd rather not just roll these back, as quite a few were amended in the process for various reasons, so would need to be changed back manually. I'll have a look in more detail again tomorrow. Theknightwho (talk) 03:22, 5 December 2022 (UTC)
There are still scads of Latin entries popping up in CAT:E that have to be due to one of your module edits (but which one- who knows?). They go away after a null edit, but they make it hard to see real module errors. There's also the matter of added work for the servers while they propagate all this tinkering to the entries.
Rolling out something like this on such a massive scale without waiting to see what kinks need to be worked out is a very bad idea- there are literally millions of entries that could be affected by some of the edits you've done. I've blocked people for less. Chuck Entz (talk) 04:11, 5 December 2022 (UTC)
I've noticed the Latin issue too. The first time I saw it I went through and manually null-edited everything, which took a while due to hitting a ratelimit after every 10 or so entries. But that was probably about five days ago, and this is still going on. Is there a bot we can get to automatically null-edit these entries? 16:38, 5 December 2022 (UTC)
@Benwing2 I've done quite a bit of experimentation with this, and have implemented some memory savings. As I mentioned to 98 in the thread below, the issues with Lua 5.1's garbage collection make memory savings unpredictable, which means that we can't say a change is a failure just because it causes some pages to start throwing errors. What we aren't seeing are the load of other memory-critical pages which haven't started throwing errors, as they're now using 48MB instead of 49.5MB. It seems like there will always be casualties with any changes that we make, unfortunately. e.g. towards the bottom of this Phabricator thread, Surjection mentions a particularly ridiculous example, where the creation of the extra data modules for languages actually increased memory usage on some pages. I've also noticed that completely trivial changes to modules (e.g. swapping the order of two minor functions, with no change in result) will cause massive increases/decreases in memory usage on certain pages for no obvious reason. Theknightwho (talk) 00:15, 9 December 2022 (UTC)
Sorry to be a pain but do you have evidence that there are a lot of pages that have decreased from 49.5MB to 48MB? In this case quite a lot of pages increased their memory as a result of this change, and the number of pages using the 'lite' templates has significantly increased. I think just saying "it's counteracted by several pages that decreased their memory" is not a good response. I have in general refrained from sweeping changes that try to optimize memory for precisely this reason. Benwing2 (talk) 04:10, 9 December 2022 (UTC)
It's going to be quite difficult to prove, but I can try to put something together. Theknightwho (talk) 04:11, 9 December 2022 (UTC)
Maybe it would be good to have a daily (or more frequently updated) log of how much memory all the critical pages use, so we can track how that changes. 05:46, 9 December 2022 (UTC)
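The suggested log could feed a simple before/after comparison along the following lines. This is a sketch only: the function compare_snapshots, the page labels and every number in the example are made up for illustration, and the limit is passed in as a parameter rather than assumed. How the per-page memory figures would be collected is left open.

```python
def compare_snapshots(before, after, limit_mb):
    """Summarise how per-page Lua memory changed between two log snapshots."""
    report = {"increased": [], "decreased": [], "over_limit": []}
    # Only compare pages that were measured in both snapshots.
    for page in sorted(before.keys() & after.keys()):
        delta = after[page] - before[page]
        if delta > 0:
            report["increased"].append((page, round(delta, 2)))
        elif delta < 0:
            report["decreased"].append((page, round(delta, 2)))
        if after[page] >= limit_mb:
            report["over_limit"].append(page)
    return report

# Invented figures in MB, purely to show the shape of the output.
before = {"page A": 49.5, "page B": 47.0, "page C": 49.0}
after = {"page A": 48.0, "page B": 50.0, "page C": 49.0}
print(compare_snapshots(before, after, limit_mb=50.0))
```

A report like this would show both the pages that got worse and the near-limit pages that improved, which is the evidence being asked for in this exchange and which CAT:E alone cannot provide.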

Please excuse the delay - I’ve been dealing with some real life stuff most of today, which has meant no time to look at this. Theknightwho (talk) 22:48, 5 December 2022 (UTC)

Module errors

Please add a return statement to the end of this function. Thanks. 21:05, 27 November 2022 (UTC)

{{lb}} broken

As of writing, {{lb}} doesn't generate any links or categorize entries as it should. You recently made some edits to Module:labels and its data submodules, which seem likely to be the cause. As soon as this change begins propagating, topical and regional categories will begin to depopulate, so I suggest reverting or otherwise fixing the issue ASAP.

Might I suggest that in the future you test your changes to widely-used modules in a sandbox first? 18:28, 7 December 2022 (UTC)

It was certainly working for several pages which I checked, but evidently not others (and I can already see why). Frustrating. I'll do more extensive testing in future. Theknightwho (talk) 18:37, 7 December 2022 (UTC)
FYI, prior to the revert I also noticed a bunch of pages in CAT:E with memory issues, including mi. Now the category is practically empty. I'm not sure whether your change is the culprit, as it seems it should have had the opposite effect, but I don't know how else to explain this. 18:49, 7 December 2022 (UTC)
The problem that keeps recurring is that a change might improve 20 memory-critical pages, but might make 10 others worse at the same time. I tried to find some sort of pattern with Erutuon, but we couldn't find one. Theknightwho (talk) 18:53, 7 December 2022 (UTC)
As of writing, CAT:E again has 3 mainspace entries in it (including mi), up from zero mainspace entries as of my previous comment, but down from the number I saw earlier today. I guess I'll change mi and na to use lite templates, and angel probably needs a translation subpage. 19:18, 7 December 2022 (UTC)
You're right. Frankly, what we need is an increase to the memory limit, but in the absence of that I'm going to keep looking for ways to reduce it. Theknightwho (talk) 19:39, 7 December 2022 (UTC)

more memory errors

We now have 20 pages once again in CAT:E with memory errors, probably resulting from your change to Module:scripts. I really don't see why you keep making changes like this; I would strongly recommend holding off on any more changes to core modules for at least several weeks. Benwing2 (talk) 05:13, 9 December 2022 (UTC)

@Benwing2 It's the beginning of the deprecation of {{zh-l}} et al, which is sorely needed. I have spoken to some of the Chinese editors about this. Theknightwho (talk) 05:16, 9 December 2022 (UTC)
Have you even reversed the sortkey changes yet? I have zero interest in fixing the existing memory errors, because it seems there's always more every single day due to some change. — SURJECTION / T / C / L / 07:19, 9 December 2022 (UTC)
Not at this stage, and doing so en masse would introduce yet more unpredictability. Theknightwho (talk) 07:21, 9 December 2022 (UTC)
The only unpredictable thing is all the new changes. Before the sortkey changes, there were zero module errors. After them, there were dozens. It's like you're not taking this memory issue seriously at all. — SURJECTION / T / C / L / 07:26, 9 December 2022 (UTC)
I completely agree. I feel now we should back out all the sortkey changes, forcibly if needed. And once again, please defer all further changes to core modules, including Chinese ones. Benwing2 (talk) 07:29, 9 December 2022 (UTC)
Obviously I am taking it seriously, which is why I am working to solve the issue. If we roll back all of the changes, that will also undo a large amount of work which did more than re-implement what was already there in a different format. Theknightwho (talk) 07:40, 9 December 2022 (UTC)
The practice should be that memory issues take priority above anything else. If there's so much as a single page in CAT:E that fails due to a memory error, there should be no changes whatsoever to core modules to add functionality. — SURJECTION / T / C / L / 07:57, 9 December 2022 (UTC)
The memory issues are solvable. We should not be left in a position where it is practically impossible to add functionality. Theknightwho (talk) 08:02, 9 December 2022 (UTC)
Having memory issues on entries is unacceptable. Forcing some editors to use workarounds is not. — SURJECTION / T / C / L / 08:03, 9 December 2022 (UTC)
Which is why I am working to solve the issue. Ultimately, it is not great that we are forced to use lite modules, but it is the situation that we are in. Theknightwho (talk) 08:11, 9 December 2022 (UTC)
You say that, but do a massive change that turns out to be a failure that manages to only increase memory usage (the whole sortkey thing), and then instead of working to reverse it, start working to integrate even more functionality into the core modules, which only exacerbates the problem. — SURJECTION / T / C / L / 08:19, 9 December 2022 (UTC)
We have no way of knowing that it increased memory across the board without doing a more systematic check. As you and I both know, changes that should by all rights reduce memory usage do not always (ever?) do that (e.g. fan), and the tests that I did do were showing reduced usage. By its very nature, CAT:E only shows the problems, not the successes. Theknightwho (talk) 08:22, 9 December 2022 (UTC)
Even if it decreased memory usage on pages that were close to but under the memory limit*, that doesn't really matter much IMO. The status quo ante was that CAT:E was usually empty or close to it, and occasionally a page or two would go over the limit and then someone would apply lite templates to that one entry. That was manageable. The result of your changes was to cause dozens of pages to overflow the limit in a matter of weeks. Just look at the history of Template:m-lite. Over half of the edits were in the past month alone, even though the template was created a year ago.
As for whether to revert these module changes or not, I don't know. I can certainly see a case for it, but if your edits weren't just restructuring, but also made substantive improvements that would affect output, I obviously wouldn't want to get rid of that useful work. I also fear that switching back will end up causing issues. As you pointed out, it seems any change will reduce memory usage on some pages and increase it on others, implying there's a cost inherent to transitioning. But I would at least advise against making any more sweeping Module edits without public discussion of them, and testing, first.
* I wish I had extracted memory usage data for the critical pages right before you made the edits so we could check this claim more rigorously. I'm at least skeptical, based on what I've personally observed. 11:34, 9 December 2022 (UTC)
That is fair. I am not going to make any further changes for the time being, because there seems to be no obvious way to proceed. Theknightwho (talk) 11:40, 9 December 2022 (UTC)

There are currently no (relevant) errors in CAT:E. I don't anticipate any more should appear, though I will deal with any if they do. Theknightwho (talk) 17:27, 9 December 2022 (UTC)

I don't think it should be the case that even a single extra memory error should block changes to core functionality, because then nothing can ever get done, but it should be the case that changes to core modules should be done very carefully, and if several new memory errors arise, the changes should generally be undone. I've been able to make lots of changes to core modules without causing CAT:E to fill up with memory errors, and sometimes I've had to restructure my changes when they did cause such errors to happen. I've also tried to reiterate several times that splitting modules into small pieces is *NOT* the way to decrease memory, but that's exactly what you did in the sortkey changes. I'm not sure what all the changes you made recently were, but I would still advise backing out the sortkey changes. It shouldn't be too hard to do so; can't you just restore the sort keys to the language data modules? As for further changes, I would advise thinking of this like they do in companies that do software engineering; before you make a big change like obsoleting {{zh-l}}, create a design document outlining exactly what you plan to do and have it reviewed by people who are familiar with the core modules (e.g. me, User:Surjection, User:Erutuon, and IP 98.*). That way people can suggest better ways of doing things that won't increase memory, and design errors are more likely to be caught. Benwing2 (talk) 05:34, 10 December 2022 (UTC)
On top of everything else, (at least) the Russian sortkey is messed up because it fails to remove diacritics like it should. I suspect several others are similarly messed up. Benwing2 (talk) 06:28, 10 December 2022 (UTC)
@Benwing2 Which diacritics are the sort key failing to remove compared to the version that was there previously? Before I made any changes, the sort key made a single change (ё to е + a private use character [edit addendum: which was in fact broken, as Lua does not handle sorting characters outside of the BMP properly, so it was sorting ё before е (except for е itself)]). The sort key does not and should not deal with diacritics, except that. Theknightwho (talk) 18:29, 10 December 2022 (UTC)
And to deal with the rest - no, it isn’t straightforward to roll back all of the changes, because they were integrated with a large number of nontrivial changes. It will be a large job to attempt, and not necessarily straightforward. I also don’t see the benefit, unless we’re planning to roll back all of the high memory-use pages, which would be of little benefit anyway. Theknightwho (talk) 18:34, 10 December 2022 (UTC)
Apologies for the terseness above - I was on mobile before. I appreciate everything you've said about clearing changes like these with others before making them, and I completely agree and will do so. I will look into getting some kind of Lua dev environment set up, or at least a set of Wikt core modules set up in my userspace, because with large-scale changes (such as integrating {{zh-l}}), it's very often difficult to work out exactly how those changes should be done without making changes incrementally. Certainly with {{zh-l}} in particular, it's going to be quite a difficult job, given that the Chinese modules essentially have their own module ecosystem. Theknightwho (talk) 19:39, 10 December 2022 (UTC)
Fuck it to hell, I'm getting frustrated. Numerous people have indicated that the sortkey changes should be rolled back, yet you seem very resistant. Do we need a poll to convince you of this? Can you enumerate what sort of nontrivial changes you have made in the process of all the sortkey changes? Benwing2 (talk) 05:25, 11 December 2022 (UTC)
@Benwing2 I'm resistant because you're not presenting a case for rolling back 50+ hours of work now that things are stable, especially when it would require re-implementing all of the changes that I made along the way. This affects about 150 languages - it's seriously not worth it.
And no, I didn't make these changes blindly - I also cross-checked that they were actually correct as I went, which in many cases they weren't. Or at the very least, they were lacking. Rolling everything back would reintroduce a hell of a lot of junk. Theknightwho (talk) 05:34, 11 December 2022 (UTC)
My point is that you almost certainly increased the memory of a lot of pages, which are now hovering right at the edge of the memory limit due to all the *-lite templates added to bring them down to that limit. That is what everyone else is saying as well. This is going to hurt us down the road every time a change is made to any of those pages. Things may be stable now but I honestly see no gain to all the sortkey changes and a lot of downside. I think it's pretty clearly a failed experiment, and the right thing to do is to back it out. Again you haven't enumerated what other changes you made along the way to 150+ languages; was it simply creating all the sortkey modules (which can be deleted or left alone once the changes to the data modules have been backed out) or a bunch of other random fixes? If so, what are those? Once again, can you please make a list of the "nontrivial changes" you keep referring to? Benwing2 (talk) 06:18, 11 December 2022 (UTC)
@Benwing2 It was a very large number of fixes as well, which quite honestly took up the majority of the time because they involved manual checking of everything. That is one reason that it took so long. In addition to all that, some required bespoke logic (e.g. Module:za-sortkey, which is particularly complex, and to a lesser extent those such as Module:kaa-sortkey); some involve unwieldy numbers of substitutions (e.g. Module:aqc-sortkey and those for many other Caucasian languages); and others are consolidated sortkeys for multiple languages (e.g. Module:Grek-sortkey and Module:fr-sortkey, but there are others). See CAT:Sortkey-generating modules. That's without mentioning the many fixes that I did along the way, too, which involved everything from pre-1918 letters in Russian to stuff as simple as supporting diacritics on capital letters with languages where someone had only entered lowercase letters into the sortkey. Actually, that was for entry names, but you get the idea - lots of small fixes.
Remember that many of the sortkeys we had were either very old, or had been added by people not very familiar with coding (who often copied the older stuff, too). Very few (maybe none, I can't remember) were even using the remove_diacritics option. As a result, they were littered with issues. What I implemented was intended to clean that up, and to allow us to have a robust, standardised format which could easily be copied (and in fact, it already has been). This was not done mindlessly, and I would appreciate if you'd have a look at what I've actually done. Theknightwho (talk) 06:40, 11 December 2022 (UTC)
I still maintain that having separate modules is the wrong thing to do in the majority of cases because it increases memory usage. This should not be controversial. I have observed it repeatedly and for this reason I avoid splitting modules in two unless it's possible to bypass one of the modules entirely (Surjection's changes are generally of this sort). If you made a bunch of fixes to the sortkeys, the correct thing to do now is to port those back into the language data modules themselves whenever possible. For example the Russian sortkey module is quite simple and could easily be put back into Module:languages/data2. Yes this may be a significant amount of work but IMO you brought it on yourself by deciding to do a big overhaul of the sortkey system without understanding the added memory pressure it would bring. Benwing2 (talk) 06:50, 11 December 2022 (UTC)
I should add, "whenever possible" means if you have a really complex sortkey module, it should stay as-is but otherwise ported back. Things like shared sortkey modules can be ported back by having a single variable in the data module that is used in several places. Benwing2 (talk) 06:53, 11 December 2022 (UTC)
@Benwing2 The cases where it makes sense to port back are where:
  1. There is no bespoke logic. With normal sortkeys, is there a predictable order to the substitutions? If so, it should be possible to port any with double substitutions.
  2. They're not shared by languages in different data modules. I know that doesn't apply to Module:Grek-sortkey, for example, so it'll need to be kept.
Theknightwho (talk) 07:02, 11 December 2022 (UTC)
Yes the substitutions are applied left to right. Why do we need to keep modules just because they are shared across languages? I'm not sure I see the need for this. Benwing2 (talk) 07:05, 11 December 2022 (UTC)
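Why the order matters for "double substitutions" can be shown with a small sketch. This is a Python illustration with invented rules; apply_substitutions is a hypothetical helper and the two-rule lists do not correspond to any real language's sortkey data.

```python
def apply_substitutions(text, rules):
    """Apply (old, new) replacement pairs strictly in the order given."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# Made-up rules: a digraph rule and a single-letter rule that overlap.
digraph_first = [("ch", "X"), ("c", "k")]
letter_first = [("c", "k"), ("ch", "X")]

print(apply_substitutions("chaco", digraph_first))  # Xako
print(apply_substitutions("chaco", letter_first))   # khako
```

With the single-letter rule first, the digraph rule never fires, because every "c" has already been rewritten. A data-table sortkey can therefore only replace a bespoke module if the table guarantees the order in which its rules are applied.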
@Benwing2 Because you can't have a single variable in the data module for a sortkey used by el, grc and so on, as they're in different data modules. Theknightwho (talk) 07:08, 11 December 2022 (UTC)
Sure but if they are simple enough it's worth the duplication in the most-used languages to avoid extra module loads. In particular for Greek I would put in-data-module versions of Module:Grek-sortkey for at least 'el' and 'grc'; similarly for 'fr' and 'wa' (in the same module), 'frm' and 'fro' (in the same module) and potentially also in 'nrf'. Put comments indicating where the same code is duplicated so it can be kept in sync. Benwing2 (talk) 07:52, 11 December 2022 (UTC)
@Benwing2 That was precisely what I was trying to avoid, because with Greek in particular they weren't in sync (despite having that note). Given this affects a relatively small number of languages, I don't think performance concerns are justified. Theknightwho (talk) 07:58, 11 December 2022 (UTC)
French and Greek (Ancient and Modern) are highly used languages. Keep in mind with your setup, the little modules are loaded repeatedly on every page. Please try it both ways and see what the memory difference is; this will indicate whether it's justified or not. Benwing2 (talk) 08:06, 11 December 2022 (UTC)
@Benwing2 I can, but given the randomness of the changes, it won't be enormously helpful. After this, I suggest we try to put together what 98 suggested and have a basket of (say) 100-200 pages that we can test changes on and have the stats reported back in some way. That would allow us to see what's going on a bit better, because at the moment we're all working off hunches. Theknightwho (talk) 08:13, 11 December 2022 (UTC)
As a practical matter, though, there are only a few languages that use the Greek alphabet, so memory errors are extremely rare. There are only two such terms in the {{redlink category}} exclusion list, and they date to before Surjection's work on the modules. The real problem is the Latin-script sortkeys. Chuck Entz (talk) 08:20, 11 December 2022 (UTC)
@Chuck Entz Remember that sortkeys are also used on anything in column templates (and probably in various other modules as well), so irrespective of the language they can and will affect large pages unexpectedly. Theknightwho (talk) 08:22, 11 December 2022 (UTC)
My point is that there aren't any Greek-alphabet entries that are big enough to have problems, with only two language sections and no translation tables. I suppose there might be some Latin-script entries with lots of language sections having Ancient Greek in their etymologies, but I can't think of any that have ended up in CAT:E. Chuck Entz (talk) 08:31, 11 December 2022 (UTC)
@Chuck Entz True, though it will affect any with lists of Greek (though I can't think of any likely to have that, off the top of my head). Conversely, putting things in separate modules as I've done should (in theory) be saving memory on pages with large numbers of translations, because none of them will be pointlessly loading the sortkey table into the language object. Instead, they're just loading the string with the name of the sortkey module (which doesn't get invoked). It is an impossible balancing act. Theknightwho (talk) 08:39, 11 December 2022 (UTC)
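The trade-off argued over in this thread can be pictured with a small sketch: each language object stores only the name of its sortkey module and loads it when a sort key is actually requested. This is a Python illustration under stated assumptions. The class Language, the registry SORTKEY_MODULES, the helper load_module and the module name "xx-sortkey" are all hypothetical, and the load counter merely stands in for the per-load overhead that cannot be measured here.

```python
# Stand-in registry of sortkey modules, keyed by module name.
SORTKEY_MODULES = {"xx-sortkey": lambda text: text.lower()}
load_count = {"n": 0}

def load_module(name):
    """Pretend to load a module, counting each load as a unit of overhead."""
    load_count["n"] += 1
    return SORTKEY_MODULES[name]

class Language:
    def __init__(self, code, sortkey_module=None):
        self.code = code
        self.sortkey_module = sortkey_module  # just a string, no table loaded

    def make_sort_key(self, text):
        if self.sortkey_module is None:
            return text
        # The module is only loaded at the point a sort key is needed.
        return load_module(self.sortkey_module)(text)

langs = [Language("xx", "xx-sortkey") for _ in range(100)]
print(load_count["n"])                # 0: creating objects loads nothing
print(langs[0].make_sort_key("Abc"))  # abc
print(langs[1].make_sort_key("Abd"))  # abd
print(load_count["n"])                # 2: one load per sort key requested
```

Both positions show up in the counter: pages that never sort pay nothing for the sortkey data, while pages that request many sort keys pay for a load each time unless the result is cached somewhere.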

C

This page is protected for some reason. Please make the following substitutions:

  • {{l-lite|mul|L|gloss=50}} => {{l-lite|mul|L|t=50}} (It would be possible to change {{l-lite}} to accept this parameter name, but it's deprecated anyway.)
  • {{l-lite|mul|D|gloss=500}} => {{l-lite|mul|D|t=500}}
  • {{l-lite|en|C#}} => {{l-lite|en|Unsupported titles/C sharp|C#}}
  • {{der-lite|nb|ett|𐌂}} => {{der-lite|nb|ett|𐌂|sc=Ital|tr=c}}
  • {{der-lite|nb|grc|Γ|t=gamma}} => {{der-lite|nb|grc|Γ|sc=polytonic|t=gamma|tr=G}}
  • {{der-lite|nb|phn|𐤂|t=gimel}} => {{der-lite|nb|phn|𐤂|sc=Phnx|t=gimel|tr=g}}

18:55, 9 December 2022 (UTC)

Done. Theknightwho (talk) 03:40, 10 December 2022 (UTC)

Merry Christmas

I wish you and all users of Wiktionary a Merry Christmas and a Happy New Year. Leonardo José Raimundo (talk) 22:14, 25 December 2022 (UTC)

@Leonardo José Raimundo Thank you - and to you! Theknightwho (talk) 22:27, 25 December 2022 (UTC)

?

Why did you revert my edits? The words are written in different scripts (Latin, Cyrillic, and one Inuit). At least be consistent and add each of them on every single page. And as for "aza", I only removed duplicates. What you just did makes no sense at all. Shumkichi (talk) 08:46, 2 January 2023 (UTC)Reply[reply]

@Shumkichi Because that is irrelevant. We put things that look similar there, regardless of whether they’re the same script. I then reverted all the ones where you removed anything. Theknightwho (talk) 12:00, 2 January 2023 (UTC)Reply[reply]

Macedonian Ќ ќ, Ѓ ѓ etc.

Hello Theknightwho! Many words in the Macedonian entries (in headword lines, declension tables, derived and related terms, etc.) are messed up: they link to wrong or nonexistent words. I assume this is related to the problem pointed out above, about the Serbo-Croatian "ć".
In Macedonian, Ќ ќ /c/ and Ѓ ѓ /ɟ/ are separate letters, different from К к /k/ and Г г /g/, and the diacritic shouldn't be stripped away. Nor should it be stripped from Ѐ ѐ (сѐ, нѐ vs. се, не) and Ѝ ѝ (ѝ vs. и; see ◌̀#Macedonian). The diacritic should be stripped only from the accented letters А́ а́ Е́ е́ И́ и́ О́ о́ У́ у́ Л́ л́ Р́ р́; see ◌́#Macedonian. Gorec (talk) 11:55, 6 January 2023 (UTC)

@Theknightwho It seems the problem is in Module:mk-sortkey!? --Gorec (talk) 14:36, 6 January 2023 (UTC)
@Горец This is fixed! Theknightwho (talk) 19:56, 6 January 2023 (UTC)
👍 Thanks. Gorec (talk) 14:19, 7 January 2023 (UTC)
Hello @Theknightwho. I want to report another problem, which I think is related to the above. When we add Macedonian translations that contain accented letters (А́ а́ Е́ е́ И́ и́ О́ о́ У́ у́ Л́ л́ Р́ р́), the autogenerated edit descriptions that appear in Revision history or in Contributions history always show redlinks, even if those entries already exist. For instance, there is an entry забележителен, but in the autogenerated edit description "забележи́телен" is shown as a redlink (t+mk:забележи́телен (Assisted)). Can this be fixed somehow? Thank you. --Gorec (talk) 20:49, 23 January 2023 (UTC)
@Горец Thanks for this. It's definitely a related issue, but I think it's a bug in the translation plugin (which is something I don't know anything about). I would put something on the WT:Grease Pit. Theknightwho (talk) 20:52, 23 January 2023 (UTC)
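
For illustration, the Macedonian rule described in this section can be sketched as follows. This is a Python approximation, not the code in the actual modules, and `mk_entry_name` is a hypothetical name. It assumes the only mark to remove is the combining acute (U+0301), and that an acute directly after К/к or Г/г belongs to the letters Ќ/ќ and Ѓ/ѓ and must stay. The grave in Ѐ/ѐ and Ѝ/ѝ is a different character (U+0300), so it is never touched.

```python
import unicodedata

ACUTE = "\u0301"             # combining acute accent (stress mark)
KEEP_ACUTE_AFTER = "КкГг"    # acute here forms the letters Ќ ќ Ѓ ѓ


def mk_entry_name(text):
    """Strip stress accents from Macedonian text, keeping Ќ, Ѓ, Ѐ and Ѝ intact."""
    # Decompose first so every accent is a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch == ACUTE and not (out and out[-1] in KEEP_ACUTE_AFTER):
            continue  # a stress mark: drop it
        out.append(ch)
    # Recompose so the result matches page titles, which are stored in NFC.
    return unicodedata.normalize("NFC", "".join(out))
```

With this rule, the accented form забележи́телен maps to the existing entry забележителен, while ќе and сѐ come back unchanged.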

Number list errors in Korean

Could you please take a look at 일곱째 and the other similar pages in CAT:E? The only other person who has recently edited a module invoked on these entries is Benwing2, but their edits were to a function related to Korean usage examples, whereas yours were to entry name normalization code, which seems much more likely to be relevant. Maybe the code is trying to compare NFC-normalized Korean to NFD-normalized Korean, or similar. 18:58, 6 January 2023 (UTC)Reply[reply]

Actually, it's not even limited to Korean. See Arabic أَرْبَعُونَ (ʔarbaʕūna), South Levantine Arabic مية ألف, Serbo-Croatian tisuća (which also isn't linkable), etc. 19:01, 6 January 2023 (UTC)
Correct. After a false start, I realised it's because the NFD form is being pushed to custom entry name templates (which is desirable), but that means they need to convert it back to NFC again at the end. I've updated Module:Kore-entryname and Module:ar-entryname to do just that. Theknightwho (talk) 19:37, 6 January 2023 (UTC)Reply[reply]
And what about the Serbo-Croatian term being unlinkable? 19:38, 6 January 2023 (UTC)
I'm getting there. Theknightwho (talk) 19:39, 6 January 2023 (UTC)
This is fixed. I will do a review for any others. Theknightwho (talk) 19:55, 6 January 2023 (UTC)
Thank you! 19:56, 6 January 2023 (UTC)
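
The NFD/NFC round trip discussed in this thread can be sketched in Python (the real code is Lua; `strip_marks` and `make_entry_name` are hypothetical names used only for illustration). The custom step receives decomposed text, which makes removing marks easy, and the result is recomposed at the end so it matches the stored page title.

```python
import unicodedata


def strip_marks(decomposed_text, marks):
    """Custom entry-name step: expects NFD input and removes the given marks."""
    return "".join(ch for ch in decomposed_text if ch not in marks)


def make_entry_name(text, marks):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = strip_marks(decomposed, marks)
    # Without this final step, the NFD string is compared against NFC titles
    # and never matches, even though the two look identical on screen.
    return unicodedata.normalize("NFC", stripped)


# Arabic short-vowel marks removed in this sketch: fatha, damma, sukun.
ARABIC_MARKS = {"\u064e", "\u064f", "\u0652"}
```

For Korean, nothing is stripped at all, yet NFD still splits each Hangul syllable into jamo, so 일곱째 only survives if it is recomposed. For Arabic, أَرْبَعُونَ loses its vowel marks, and the final NFC step also puts the alif and hamza back together into أ.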

tt outside of multitrans

As I saw you have recently been editing the expensive translation pages, I decided to run the script I wrote to check for a very specific kind of error: the use of {{tt}} or {{tt+}} outside of {{multitrans}}, thus leaving ⦃⦃unreadable code like this¦¦¦¦¦¦¦¦¦⦄⦄ in the visible page output. The script detected problems on the following pages:

Not a big deal, just thought you should be aware of this failure mode. 23:16, 12 January 2023 (UTC)Reply[reply]
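
The script mentioned above isn't shown here, so the following is only a rough Python sketch of how such a check could work, under simplifying assumptions: it matches braces naively, and it ignores nowiki tags and HTML comments. The function names are hypothetical.

```python
import re

TT_RE = re.compile(r"\{\{\s*tt\+?\s*\|")
MULTITRANS_RE = re.compile(r"\{\{\s*multitrans\s*\|")


def template_end(text, start):
    """Return the index just past the `}}` closing the template opened at `start`."""
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return len(text)  # unclosed template: treat the rest of the page as inside it


def tt_outside_multitrans(text):
    """Return the offsets of {{tt}} / {{tt+}} calls not wrapped in {{multitrans}}."""
    covered = [(m.start(), template_end(text, m.start()))
               for m in MULTITRANS_RE.finditer(text)]
    return [m.start() for m in TT_RE.finditer(text)
            if not any(a <= m.start() < b for a, b in covered)]
```

An empty result means every `{{tt}}` on the page sits inside some `{{multitrans}}` call; any offsets returned point at the calls that would leak unreadable markup into the visible output.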

Good catch! Not sure what happened with go, but love was down to me forgetting to include the first checktrans section. The savings with these are immense - just look at water/translations. Theoretically, we could put one invoke at the top of the page (taking no arguments), put the whole rest of the page inside <nowiki>/</nowiki>, and then process the whole page from a single call. There would be downsides to that (e.g. one error breaking everything), but it might be something to consider for very large pages. The current lite templates are a huge faff for tiny gain. Theknightwho (talk) 23:34, 12 January 2023 (UTC)Reply[reply]
Would that break section edit links? 23:43, 12 January 2023 (UTC)Reply[reply]
I'm not sure. Given {{head}} is probably the major culprit (meaning there's little use in doing this for each language separately), it would be a problem for sure. Theknightwho (talk) 23:47, 12 January 2023 (UTC)Reply[reply]
In this minimal example, it seems like wrapping ==Headers== in a template that just spits out its input makes the [edit] links go away. I think that's probably a particularly bad thing on pages that are really large, because those are the ones where the [edit] links are most useful, vs. having to deal with the full page's wikicode when you just want to make a minor correction. It could still be worth it overall as a last resort if the alternative is having visible output break. 23:59, 12 January 2023 (UTC)Reply[reply]
Yes, you're right. I've realised it's possible to use Erutuon's {{multitrans-nowiki}} this way, and have changed the sandbox to a (somewhat mangled) version of . It's removed all the edit links there as well. The way around this might be a JavaScript gadget, though. Theknightwho (talk) 00:03, 13 January 2023 (UTC)
Such a gadget would have to not only add the links when action=view, but also figure out when action=edit which section is supposed to be the one being edited (using some ad-hoc URL parameter from the links), remove everything before and after that section from the edit window, and then upon saving add everything before and after back to the saved page content. And deal with "show changes" / "preview" too. It would be doable but it seems likely to be inelegant/brittle. 00:13, 13 January 2023 (UTC)Reply[reply]
One way to do it might be to use the Labeled Section Transclusion extension (which we should have installed, as it's the way Wikisource works). The model would be:
  1. Put the sourcecode for the very large page on a subpage. Everything must be nowiki-fied, except for the <section> tags and the headings we want to have edit buttons.
  2. Use <section> tags to divide up the subpage as appropriate. A section for each language, and one for each etymology (if more than one) is a reasonable compromise.
  3. A single invoke on the main page processes all of the transcluded sections in one go.
I don't know how the extension would react to its output being shoved through Lua, though. Theknightwho (talk) 00:33, 13 January 2023 (UTC)Reply[reply]
Theknightwho (talk) 00:29, 13 January 2023 (UTC)Reply[reply]
That's creative. It might solve the [edit] link problem, but from how you described it to me, a pretty technical user, I anticipate a significant cost of extra confusion for average editors, possibly to the extent that just losing the [edit] links would be better (and there may be other complications I can't think of, as I'm not intricately familiar with that extension). IDK. 00:39, 13 January 2023 (UTC)Reply[reply]
Yes, I agree. It's inelegant at best. I did try forcing the parser to process a template before it processed the argument (i.e. wikitext) being passed to that template, and a small part of me still thinks it might be possible with some very careful manoeuvring. What I tried was:
  1. Passing the wikitext we want to preprocess as a sole parameter to a template (as with {{multitrans}}).
  2. Setting the main invoke as the sole parameter name of the template (i.e. {{template|{{#invoke:}}=...}}). This results in the parameter name being the desired output.
  3. Have a second invoke within the template itself which outputs the only key in frame:getParent().args.
The reason for doing it this convoluted way was to see if the parser would process parameter names before parameters themselves, which seemed like it had an outside chance of letting me dynamically deploy strip markers before the argument itself got processed, but I couldn't seem to get it to work. Ideally, this would be a way to let us use nowiki tags without placing them on the page (thereby reducing confusion), but would of course be less efficient than doing the whole page all in one go. Theknightwho (talk) 01:00, 13 January 2023 (UTC)

Hiding vandalistic user names

For future reference: it doesn't do much good to hide a vandal's user name if it's still there in the rollback message. You can't hide the edit if it's the current one, but you only have to hide the edit summary. I prefer to hide as much as possible of that kind of vandalism so they have nothing to show for their efforts; it's like painting over graffiti. Chuck Entz (talk) 05:49, 16 January 2023 (UTC)

Good point - I'll remember that for the future. Thanks. Theknightwho (talk) 05:52, 16 January 2023 (UTC)

Preferred style for transliteration modules?

Hi there - I was taking a look at this edit. The way it's currently written is definitely shorter but is really hard for me to read. Using uppercase, underscore-delimited names for constants, based on their proper Unicode names, is a widely used naming convention. Also, naming the range of characters which are Syriac diacritics as Syriac diacritics improves the readability, since it's not clear what that range means otherwise.

I partially piggybacked off @Erutuon's style; here's a simple example:

I know you characterized my preferred style as "pointless" but I thought I would offer an explanation for the reasoning behind it without protesting the changes you made. I'm open to better understanding how I can adhere to a style that's consistent with the expectations of the Wiktionary community. Just let me know.

ColumbaBush (talk) 20:36, 18 January 2023 (UTC)

Gratuitously reverting my edits on per cent

You're a right foul git. Why are you reverting my edits without explanation? I gave a quotation from the Oxford Dictionary, whilst you gave no reason or evidence. Qiu Ennan (talk) 10:52, 25 January 2023 (UTC)Reply[reply]

Because per cent is also dated in the UK, and as a native speaker of British English who lives in the UK, I feel like I'm in a reasonably good position to say that. I also cannot find that quote in the OED. Are you sure that's where you got it? Theknightwho (talk) 10:59, 25 January 2023 (UTC)Reply[reply]
Per cent is not dated in the UK: it's used in the BBC amongst many other sources. Despite being a native British English speaker (purportedly), you are also prone to mistakes and so you should still provide a source instead of using anecdote, especially when removing other people's contributions. Also, look at the dictionary: per cent is the first form whilst percent is labelled US.
Also, my quote is from the New Oxford American Dictionary – Qiu Ennan (talk) 02:56, 26 January 2023 (UTC)Reply[reply]
I have looked at “the dictionary”. Have you seen this one? Note that it’s also a British dictionary. And no - your quote is from StackExchange, as the link on that page doesn’t actually work.
At the end of the day, you can be as rude as you like, but it won’t change anything. Theknightwho (talk) 12:58, 26 January 2023 (UTC)Reply[reply]
(Historical note: I blocked "Qiu Ennan" because he seems to speak some made-up Anglish and has fought previously against real everyday native English speakers like myself and Theknightwho, and against the truth of how the language is spoken. This harms the project and is just really fricking annoying.) Equinox 02:34, 28 January 2023 (UTC)Reply[reply]

I think you wrongly reverted my edit on muff diver.

{{tlb}} is supposed to be placed on the same line as the head. ―Biolongvistul (talk) 18:31, 27 January 2023 (UTC)

@Biolongvistul It's clearer the way it is. Theknightwho (talk) 18:37, 27 January 2023 (UTC)
If you say so… ―Biolongvistul (talk) 18:38, 27 January 2023 (UTC)
In this case I agree with User:Biolongvistul, I've never seen {{tlb}} labels placed on their own line and it looks weird to me that way. Benwing2 (talk) 19:50, 1 February 2023 (UTC)

I think you wrongly reverted my edit on mugu

The South African and West African usages have the same meaning. Igbo/Nigerian scammers operate around the world, including in South Africa; however, Mugu is an Igbo word. Explain at once why you have separated obviously identical definitions without providing an edit summary. Beaneater00 (talk) 22:09, 27 January 2023 (UTC)

Because in South Africa, it's an alternative form of moegoe, which comes from Afrikaans (and before that, the etymology is uncertain). It's not at all clear that we can simply merge them together, especially given that moegoe doesn't only mean "fool", as the definition is a little more complex than that. There's no harm in keeping them as they are, as it shows additional nuance. Theknightwho (talk) 22:15, 27 January 2023 (UTC)Reply[reply]
Who says that moegoe comes from Afrikaans? The other language listed is an urban cant. The Afrikaans usage could just as well reflect Cape Coloured or urban Black African use. This 'moegoe' does not exist on the af.wiktionary. I searched long and hard for an Afrikaans dictionary online. Do you know anyone who speaks Afrikaans? The only things I found on the web for this 'moegoe' were verbatim copies of your Wiktionary entry. w:André Brink, whom you cite, was a linguistic 'reformer' and anti-Apartheid writer. Perhaps he injected Bantu vocabulary into Afrikaans and it's not an indigenous Afrikaans word that evolved separately, with a separate range of meaning. The definitions which you have cited are apparently garnered from his book quote. Why do you have so many comments on your page from today and yesterday about your contentious edits? It is not apparent to me that you have done anything more than read the entry that already exists and take it as divine word. Beaneater00 (talk) 23:20, 27 January 2023 (UTC)
If you're disputing the etymology, then please take it to the Etymology Scriptorium, which is where we discuss these things. Theknightwho (talk) 23:29, 27 January 2023 (UTC)Reply[reply]
I'm going to restore the existing version unless you can provide someone to dispute it with. Beaneater00 (talk) 04:21, 28 January 2023 (UTC)Reply[reply]
I've already told you where to go if you want to dispute the etymology. If you do that without some kind of consensus, I'll just revert you and stop you from editing the page, because you'll be edit warring. Theknightwho (talk) 04:28, 28 January 2023 (UTC)Reply[reply]

you gotta use sandboxes

Hi. I see yet another error, 'Lua error in Module:data_consistency_check at line 807: attempt to index local 'frame' (a nil value)'. I've asked you before to use sandbox modules, you really need to use them even though they take more work than modifying the production modules directly. Benwing2 (talk) 19:44, 1 February 2023 (UTC)Reply[reply]

@Benwing2 I naively assumed the data consistency check could only be called in one way. In any event, we don't want the new check to show on most pages, as it's really verbose. Theknightwho (talk) 19:56, 1 February 2023 (UTC)Reply[reply]
OK but you should still be using sandbox modules, which you seem resistant to doing. Essentially you need to copy the module itself to a userspace module, along with any calling modules, then test in the userspace module, and then push all relevant modules to production at the same time. User:Erutuon has a different way of doing this that may be more clever. Benwing2 (talk) 20:20, 1 February 2023 (UTC)Reply[reply]
@Benwing2 What I'm struggling with is predicting where errors will crop up. I spent about an hour writing that new function, and it worked on the page I expected it to (and which I wrongly assumed used the only way of calling it). In other cases, I've previewed changes on several pages where I expected any errors would manifest before pushing something, before realising the error happens on (say) 1% of pages. It's frustrating, because I'm not sure that a sandbox would solve that as I don't just push things blindly, but I'm not sure of the best way to test things like this short of having a functioning mirror, as often it's just not possible to know where the errors will happen. Theknightwho (talk) 20:28, 1 February 2023 (UTC)Reply[reply]
I see. Some suggestions: (1) Program defensively if possible, e.g. in this case, it's possible to fetch the current frame using mw.getCurrentFrame() so you don't need to pass it in. (2) You can often work out all the places that call a module using Special:WhatLinksHere and being selective with the 'Namespace' dropdown. (3) If all else fails, you can always search through the dump file; I've done that several times when I need to change a function I know is called from various places. The dump file is about 1 GB compressed and it takes about 13 minutes to search through it entirely using a Python script on my (rather old) MacBook Pro (if you're interested in my scripts, let me know; they are in [1] but this repository is huge and needs cleaning up, which I can do). You can also extract just the Module-space code, which is a lot smaller, and search through that. Benwing2 (talk) 20:44, 1 February 2023 (UTC)
@Benwing2 Thanks. I had the same idea re the frame, as all I really needed was to make sure preprocessing happened. That's a very good idea re the dump file - I'll check that out.
I've considered setting up a testpage which would ideally throw errors if any other page could throw one. That might be quite tricky to do, but it shouldn't be too hard to set up something that covers all the usual bases in one place. Theknightwho (talk) 20:57, 1 February 2023 (UTC)Reply[reply]
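
As a rough illustration of suggestion (3), here is a minimal Python sketch of the search step only. It is not Benwing2's script, and `find_callers` is a hypothetical name. It assumes the dump has already been read into (title, text) pairs, and it only catches direct `{{#invoke:}}` calls and literal `require` / `mw.loadData` strings, so any module name built dynamically would be missed.

```python
import re


def find_callers(pages, module_name):
    """Return titles of pages whose text invokes or requires the given module.

    `pages` is any iterable of (title, text) pairs, e.g. produced by reading
    an XML dump; that reading step is outside the scope of this sketch.
    """
    # Spaces and underscores are interchangeable in page titles.
    name = "[ _]".join(re.escape(part) for part in re.split(r"[ _]", module_name))
    invoke = r"\{\{\s*#invoke:\s*" + name + r"\s*[|}]"
    require = r"(?:require|mw\.loadData)\s*\(?\s*['\"]Module:" + name + r"['\"]"
    pattern = re.compile(invoke + "|" + require)
    return [title for title, text in pages if pattern.search(text)]
```

Running it over the Module and Template namespaces first would give a list of call sites to preview before pushing a change to a widely used function.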

Block of Dan Polansky

Hi. I am not active on the English Wiktionary, I am active mostly on the German Wiktionary and very little on the Czech Wiktionary, where I recently had a short discussion with Dan Polansky. I'm not his friend, and I don't want to be his advocate or stand up for him. First of all, I would like to clarify which side is right and whether the blocks are correct and reasonable and Dan Polansky deserves them or whether they are exaggerated and unfair.

As the reason for the last block you stated: "Continued engaging in personal attacks, despite numerous prior warnings and blocks." This sounds rather general, so I would like to politely ask you if you could provide diffs that show his personal attacks leading to the current block and also at least some of the numerous prior warnings he was given. Thank you very much. Amsavatar (talk) 22:34, 1 February 2023 (UTC)

@Amsavatar Hello. Just to give you an idea of the problem we’ve had with Dan, I think it would be good for you to look at his talk page here. Note the numerous sections where he makes personal attacks against other users ([1], [2], [3], [4], [5], [6]), including the one which he made today ([7]) which I blocked him for. Calling unwitting users “semi-intelligent aliens”, albeit amusing, is not acceptable, and is likely to drive contributors away given how frequently he does this sort of thing. Some of the other personal attacks, however, are considerably more serious, and display a level of hostility that is fundamentally at odds with a collaborative project.
Please also note his extensive block log, which includes a (later reduced) indefinite block which refers to this discussion, during which Dan receives the block and which gives a little background. The discussion was continued here. The numerous attacks on his talk page have all been made since he returned, and I have given him blocks of doubling length (1 week, 2 weeks etc) on the basis that he knows very well what the problem is, and simply refuses to acknowledge that his behaviour is a problem. He has been told hundreds (if not thousands) of times to cut it out.
If you insist, I will be happy to get you as many diffs as you want, but I’m going to tag @Chuck Entz, @Surjection, @Benwing2, @Vininn126, @-sche who can all attest to his obstructionism, rudeness, inability to participate in good faith and general detriment to the health of the project. Like me, they are all administrators (with Chuck and Surjection also being bureaucrats). Theknightwho (talk) 00:10, 2 February 2023 (UTC)Reply[reply]
IMO Dan is in a class of his own. Along with what is mentioned above, Dan refuses both to cooperate with others and acknowledge even slightly the problematic nature of his edits. He also has a lot of energy and will engage in endless arguments, swamping pages like WT:Beer Parlour. I believe blocking him is warranted and I don't see any likelihood of him improving over time, esp. as he has been a very-long-time contributor and has been problematic for the entire time. Benwing2 (talk) 00:26, 2 February 2023 (UTC)
Not much more can be added. Dan is problematic. Vininn126 (talk) 07:31, 2 February 2023 (UTC)
"including the one which he made today ([7]) which I blocked him for"
That does not contain anything that rises to a personal attack however. His sputtering on his talk page is many things, but bannable it is not. A ban of one month for that is evidently excessive. ←₰-→ Lingo Bingo Dingo (talk) 18:52, 2 February 2023 (UTC)Reply[reply]
@Lingo Bingo Dingo In the wider context of everything else, it's a continuation of exactly the same thing. I have merely been doubling the length of the block, as recommended by Equinox in the discussion on -sche's talkpage. Every single time Dan has returned, he has immediately started engaging in exactly the same behaviour as before. Since his return on 30 January, he:
  1. Made this attempt to publicly shame me (which I therefore discounted). The same comment also shows him openly making up policy in a disruptive manner.
  2. Engaged in yet more rules lawyering.
  3. Continued to push to discount people's votes when he doesn't like them: [1] [2].
It's all just a continuation of the exact same behaviour he was engaging in before. In addition to those, there are plenty of new edits which, when taken together, show that his attitude has not changed one bit. Before his return, I also came across this thread on the Czech Wiktionary which had some quite revealing comments (which I've used Google Translate on, so they're a bit squiffy in places):
  • In the English Wiktionary, there are countless liars and liars, as well as bullies; it's no honey. They are a disgrace to Anglo-Saxon culture.
  • If it seems insane, it's probably because it is insane, and the administrators of English Wiktionary appear to be a bunch of incompetent morons. In addition, they appear to be a bunch of liars, liars, people unfit to keep correct corporate or official records, fraudsters, etc. Well, again, it's human, all too human. Better than Hitler, Stalin, Mao, Hegel, and other allegorical vehicles.
  • That the number of powerful people displaying grossly objectionable behavior has increased on the English Wiktionary in recent months and years is another matter; there were always quite a few bad people, but they didn't have that much power.
Can you seriously argue that those seem like the statements of someone who is going to work positively and productively with other users here? Theknightwho (talk) 19:35, 2 February 2023 (UTC)Reply[reply]
So you are unwilling to reflect and reconsider your actions. Fine, enjoy. ←₰-→ Lingo Bingo Dingo (talk) 20:50, 2 February 2023 (UTC)

sic template

Hey, I noticed that the {{sic}} template is not displayed properly in entries (see the quote in up the creek for example). I suspect it is caused by one of your recent edits in important modules. Could you please take a look at it? (Sorry if I'm wrong and there is another reason for the error.) Thanks, Einstein2 (talk) 23:54, 2 February 2023 (UTC)Reply[reply]

@Einstein2 Should be fixed. Was a bug that wasn't throwing errors, but was giving the wrong result. Theknightwho (talk) 23:58, 2 February 2023 (UTC)

FYI

There are cases where using {{inh}} without specifying a term does not add the [Term?] and corresponding request category, e.g. {{inh|en|enm||*ar}}. I'm not sure if it's just when {{{4}}} / {{{alt}}} is given but {{{3}}} isn't, or if there are more complicated rules. If we were talking about {{der}} / {{bor}} then there would be things like {{der|ro|sla}} (specifying a language family avoids the term request), but that wouldn't apply to {{inh}} AFAICT. 23:15, 3 February 2023 (UTC)Reply[reply]

Good point. I'm currently about to publish a big memory-saving edit on dar, and this came up re the Middle Persian borrowing of Tat dar. It seems that if {{{tr}}} is given, that also changes things, as it requests the native script instead (which is something I should have double-checked). I think this should all be possible to account for with the following logic:
  1. If {{{3}}} is not given, then check if {{{4}}} has been.
    1. If yes, do nothing.
    2. If not, check if {{{tr}}} is also given.
      1. If yes, display [Script?], demand {{{parentlangname}}} and categorise as a native script request.
      2. If not, display [Term?], demand {{{parentlangname}}} and categorise as a term request.
There may be yet more rules (which I will check), but that should at least cover the current use-case and the most common one. We'll cross the {{der}} and {{bor}} issue when we get to them, I think, as dealing with language families obviously makes it a little trickier.
I'll publish dar before I add this, as I just want to get it out of the way (even if it will be wrongly categorised for a brief period). Theknightwho (talk) 23:33, 3 February 2023 (UTC)Reply[reply]
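
The decision logic listed above can be sketched as a small Python function. The name `term_request` and the return values are made up for illustration (the real templates are wikitext), and the {{{parentlangname}}} demand is left out of the sketch.

```python
def term_request(term, alt, tr):
    """Decide which request, if any, to show for an {{inh}}-style call.

    term, alt and tr stand for {{{3}}}, {{{4}}} and {{{tr}}}; empty or None
    means "not given". Returns None when nothing should be requested,
    otherwise a (displayed text, request type) pair.
    """
    if term:
        return None  # a term was given: nothing to request
    if alt:
        return None  # display text was given instead: do nothing
    if tr:
        # Only a transliteration was given: ask for the native script.
        return ("[Script?]", "native script request")
    return ("[Term?]", "term request")
```

For {{inh|en|enm||*ar}}, this gives `term_request("", "*ar", None)`, which returns `None`, matching the observed behaviour of no [Term?] being added.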
Thanks. Also compare the current display of Template:inh-lite/sandbox vs. what it would have looked like before today's edits. 00:31, 4 February 2023 (UTC)Reply[reply]
I've incorporated the new change into {{m-lite}} instead, which also covers various other edge cases. It gives the wrong output if a transliteration is given for an empty term in a language that uses the Latin script, but that's an extreme edge case. Plus, {{m}} and {{inh}} behave differently under those circumstances as well, which doesn't seem quite right.
Slightly confusingly, the argument parentlangname works differently in {{inh-lite}} and {{m-lite}}: in {{inh-lite}}, it works as mentioned above. In {{m-lite}}, langname fills that purpose already. As such, parentlangname is a boolean that changes the error message accordingly. Theknightwho (talk) 01:33, 4 February 2023 (UTC)Reply[reply]
As you’ve probably spotted, I refactored this in a way that avoids duplicating the lists of lang codes in each module. I only encountered one language family in use - sla - but I’m not satisfied with the bodge I did to get it working properly. I think I’m going to have to add the family codes as the second-last branch of the main switch, which unfortunately means duplicating those ones specifically. However, this should make the code easier to read. Plus, I’ll update the consistency checker to account for it.
On that note, I’ll also add the checker to Module:data consistency check. Theknightwho (talk) 16:11, 4 February 2023 (UTC)Reply[reply]
Oh, and another thing: I’ll see if it’s feasible to use a similar method to incorporate proto-languages. That way, we can drop the normal modules altogether from some of the large pages, which should help. Theknightwho (talk) 16:14, 4 February 2023 (UTC)Reply[reply]
Thanks for implementing that. I had no idea there was a way to check the first character of a string using ParserFunctions. 11:11, 6 February 2023 (UTC)Reply[reply]
It was a faff, but much cleverer people than me managed to work out how to do it before we had Lua, so I took some old revisions of the Wikipedia string functions and used those as a starting point. The fact that memory usage scales logarithmically with each subsequent call made me suspect that removing all calls into Lua link templates would cause a big drop in memory usage, and thankfully it seems that was correct! The really tricky bit was actually removing the asterisk for the link.
You will very likely see an error about the penultimate character of a checked string not being supported by Template:str index-lite/logic. When that happens, just add the character to the switch table. Theknightwho (talk) 11:27, 6 February 2023 (UTC)Reply[reply]
Could you update Module:data consistency check to handle the restructured format? 23:47, 19 February 2023 (UTC)Reply[reply]
I've decided against using it, as the effect on performance is intolerable for very little benefit, unfortunately. I'll delete the pages shortly. Theknightwho (talk) 23:49, 19 February 2023 (UTC)Reply[reply]

Special:Diff/71217329

I didn't check this edit thoroughly, but the Slovene link to eden should have the diacritic stripped, and if there's one error like that then there could be more. 10:47, 6 February 2023 (UTC)Reply[reply]

What about making a template that we could subst to generate the proper invocation to m-lite? I'm thinking something like {{subst:m-lite-generator|ru|ино́й}} => {{m-lite|ru|sc=Cyrl|иной|ино́й|tr=inój}}. Might not be worth it. Just a potential idea. 10:52, 6 February 2023 (UTC)Reply[reply]
Thanks. That subst idea sounds like a good time-saver.
I've been thinking of setting up templates similar to langname-lite which will let us do this stuff semi-automatically. It'll be clunky (as each term will need its own entry in the switch table), but it would mean we could take advantage of a data consistency check. That way, if the primary transliteration/entryname/sortkey function changes, any that need changing will get flagged up as well. Plus, they'll only need to be changed in one place. Theknightwho (talk) 11:00, 6 February 2023 (UTC)Reply[reply]
I've thought of the possibility of transliterations (etc.) getting out of sync too, but I assumed it would be a minor issue, assuming the old transliteration was still "good enough". For entry names that would be a much more significant problem as the link wouldn't work at all. Checking these things would be a good idea. That said, IDK about putting everything in another big lookup table. I probably would've gone for a solution like scanning every page using {{m-lite}} and checking them using an external script. The table would have the benefits of reducing wikitext clutter and duplication, however, so I'm not opposed to it. It just seems like it might increase the effort needed to convert a page to using lite templates. 11:20, 6 February 2023 (UTC)Reply[reply]
Another consideration is that so far we've been replacing things like {{m|faciō}} by {{m-lite|la|facio|faciō}}, but also replacing {{m|facio|facere}} by {{m-lite|facio|facere}}, so the entry name is not always a normalization of the display name. And similarly, some transliterations could potentially be manual overrides over the defaults, etc. 11:31, 6 February 2023 (UTC)Reply[reply]
I think with sortkeys it's probably a must, because I don't think we want weird stuff like private use characters to be in "public facing" markup. You're right about it potentially being more faff, though. I think with overrides, that could be dealt with by having some kind of override exit point that allows manual specification without causing a data consistency issue. I'll have a think about how to do it.
I have a feeling that scanning every page is likely to cause Module:data consistency check to start throwing memory errors, as apparently that's one of the main reasons the Chinese modules are such memory hogs (though I don't know quite how many pages are involved at any given time). Theknightwho (talk) 11:33, 6 February 2023 (UTC)Reply[reply]
It was trickier than I thought it would be, but I've developed template generators. e.g. {{subst:m-lite/new}} will generate the correct form of {{m-lite}} without the need to manually enter most of the info. Theknightwho (talk) 00:43, 8 February 2023 (UTC)Reply[reply]
It seems great! Thanks! 17:37, 8 February 2023 (UTC)Reply[reply]
[2]. Maybe it would help to make m-lite/new just return the original m template call when the argument contains a link. It would be possible to make a smarter implementation but not sure it's worth it. 01:57, 20 February 2023 (UTC)Reply[reply]
Thanks. It's probably possible to just subdivide the string with a gmatch or something, though the wikitext would start to get very messy. It's also "destructive", in that it's non-trivial to convert it back again if we don't need the lite templates anymore. Theknightwho (talk) 02:04, 20 February 2023 (UTC)Reply[reply]
Special:Diff/71338532: Input: "*w(a/u)". Output: "*w(a&#47;u)". It doesn't display right in the HTML output either. 03:11, 20 February 2023 (UTC)Reply[reply]
This is an unfortunate side-effect of the changes I made to Module:languages to escape formatting characters. I'll update Module:lite-new to resolve them into their normal forms. Not sure why it wasn't displaying correctly, though - possibly some kind of double-escape effect going on. Theknightwho (talk) 03:13, 20 February 2023 (UTC)Reply[reply]
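For illustration only, here is a minimal Python sketch (not the Lua in Module:lite-new, whose code is not shown here) of what resolving an escaped character back to its normal form amounts to for the example above. It relies only on the standard `html` module; the double-escape line is an assumption about one way a display glitch could arise, not a diagnosis.

```python
import html

escaped = "*w(a&#47;u)"

# html.unescape resolves numeric character references such as &#47; back to "/".
resolved = html.unescape(escaped)
print(resolved)  # *w(a/u)

# If an already-escaped string is escaped again, "&" becomes "&amp;" and a
# single unescape pass only gets back to the first escaped form.
double_escaped = html.escape(escaped)
print(html.unescape(double_escaped))  # *w(a&#47;u)
```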
On , {{subst:bor-lite/new|zh|en|bar}} should expand to {{bor-lite|zh|en|bar|sort=己01}}. 03:43, 20 February 2023 (UTC)Reply[reply]
"I have a feeling that scanning every page is likely to cause Module:data consistency check to start throwing memory errors, as apparently that's one of the main reasons the Chinese modules are such memory hogs (though I don't know quite how many pages are involved at any given time)."
My idea was more along the lines of implementing a Lua module to check one page for correctness, and then using an off-wiki Python (or $FAVORITE_LANGUAGE) script to check the output for every page that uses these templates. Or maybe the whole thing could even be done in $FAVORITE_LANGUAGE. That seems in some ways a much easier solution than giving every single term "its own entry in [a gigantic] switch table", but the switch table does have other benefits. And either way we'd have to handle cases of intentional entry name/transliteration overrides. 02:13, 20 February 2023 (UTC)Reply[reply]
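A rough sketch of what such an off-wiki check might look like in Python. Everything here is an assumption made for illustration: the regular expression only handles calls with three leading positional parameters, and `strip_macrons` is a stand-in for a language's real entry-name rule. Mismatches are only flagged for review, since some of them will be intentional overrides.

```python
import re
import unicodedata

# Hypothetical pattern for {{m-lite|lang|entry|display}}; named parameters that
# come before the positional ones are not handled by this sketch.
M_LITE = re.compile(r"\{\{m-lite\|([^|{}]+)\|([^|{}]+)\|([^|{}]+)[^{}]*\}\}")

def strip_macrons(text):
    """Assumed stand-in for an entry-name rule: drop combining macrons."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", kept)

def find_mismatches(wikitext, normalise=strip_macrons):
    """Return (entry, display, expected) for calls whose entry name differs
    from what the normalisation rule would produce."""
    flagged = []
    for _lang, entry, display in M_LITE.findall(wikitext):
        expected = normalise(display)
        if entry != expected:
            flagged.append((entry, display, expected))
    return flagged

page = "{{m-lite|la|facio|faciō}} and {{m-lite|la|facio|facere}}"
print(find_mismatches(page))  # [('facio', 'facere', 'facere')]
```

The second call is flagged even though it is a deliberate override, which is exactly the case that would need some kind of allowlist or marker.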

Removal of Unsupportedpage sorter

Hi, regarding this change.

I wonder if you accidentally made a mistake when removing :- from {{unsupportedpage}}. As a result, the page is now wrongly sorted under U, for Unsupported. For example, see Category:Swedish terms spelled with :. --Christoffre (talk) 22:38, 13 February 2023 (UTC)

@Christoffre Hiya - I amended {{unsupportedpage}} so that it should be sorting automatically, but it seems that some of the titles are still having issues. I'll do a better fix for it tomorrow. Theknightwho (talk) 22:53, 13 February 2023 (UTC)Reply[reply]
OK, didn't know you were working on the template. I was about to change it back, but I'll just leave it for now as it will eventually be correct. --Christoffre (talk) 07:50, 14 February 2023 (UTC)

"Risky characters", Template:ja-r and Module:ja-ruby edit

There are still entries flooding in to Cat:E from an absentminded error you made earlier, but there's also something else at work: entries that use {{ja-r}} either directly or through list templates are throwing an error with the message "Lua error in Module:ja-ruby at line 508: Separator "%" in the kanji and kana strings do not match" (see 々#Usage_notes_2 for a typical example). When I checked the transclusion list of one of them, I saw that a couple of your edits to the Module:links family had to do with "risky characters", and it looks like "Separator '%'" is one of those risky characters. I haven't figured out whether this is the cause, but I figure you can connect the dots a lot faster than I can, so here it is, for whatever it's worth. Chuck Entz (talk) 02:24, 18 February 2023 (UTC)

@Chuck Entz I'm dealing with the Japanese issue now. It's a simple fix. Theknightwho (talk) 02:25, 18 February 2023 (UTC)Reply[reply]
@Chuck Entz That should now be fixed. These sorts of errors are (sadly) to be expected with the changes I'm making, as they're generally caused by nonstandard/problematic uses of these core module functions. I do try to screen for them first, but it can be very difficult to know what horrors will come out of the woodwork.
In this case, % probably shouldn't have been chosen to be used like this, because it's used in URL codes (e.g. wiki/%26 will take you to the page for &). That causes problems if rubytext ever needs to be used with numbers, for instance. Theknightwho (talk) 03:10, 18 February 2023 (UTC)Reply[reply]
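A small Python illustration of the clash described above, using only the standard `urllib.parse.unquote`. The `A%20B` string is a made-up example of a separator followed by digits, not taken from any entry.

```python
from urllib.parse import unquote

# "%26" is the percent-encoded form of "&", as in the wiki/%26 example.
print(unquote("%26"))  # &

# Hypothetical input where "%" is meant as a separator and the next segment
# happens to start with two hex digits.
text = "A%20B"
print(text.split("%"))  # ['A', '20B']  (the separator reading)
print(unquote(text))    # A B           (the URL-decoding reading)
```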

Language-specific module handling

I don’t know what you are doing, but it has created bare errors: “Lua error in Module:languages at line 159: bad argument #2 to 'gsub' (string/function/table expected)” e.g. دوسر and אריסא Fay Freak (talk) 21:28, 18 February 2023 (UTC)Reply[reply]

@Fay Freak This is dealt with. Theknightwho (talk) 21:42, 18 February 2023 (UTC)Reply[reply]

Templates with URLs wrapped in Template:lang


Chuck Entz (talk) 05:20, 20 February 2023 (UTC)Reply[reply]

@Chuck Entz Thanks - will take a look. Theknightwho (talk) 05:21, 20 February 2023 (UTC)Reply[reply]
@Chuck Entz Fixed. Theknightwho (talk) 05:48, 20 February 2023 (UTC)Reply[reply]

Error on Wiktionary:List_of_languages,_csv_format

Thanks for all of your hard work recently with the language data. Do you know if the module error on Wiktionary:List_of_languages,_csv_format is a side-effect of the recent changes to Module:languages/data2? I don't know exactly when that page stopped working, but it was sometime after February 1st. It seems that Module:JSON_data is also broken, possibly from the same cause. JeffDoozan (talk) 20:49, 20 February 2023 (UTC)

@JeffDoozan Hiya. Yes it is - it's because it assumes the scripts for each language will be stored as a table, but I amended Module:languages to allow them to be stored as strings if a language only has one (for performance reasons). I judged Wiktionary:List_of_languages,_csv_format to be low priority, but I will get to it! Theknightwho (talk) 20:52, 20 February 2023 (UTC)Reply[reply]
Okay, don't rush to fix it on my account. JeffDoozan (talk) 20:55, 20 February 2023 (UTC)Reply[reply]

Something weird is happening with the translation adder

Hi. I wonder if it's anything to do with your edits on Module:scripts. If you try to add a translation to e.g. Ukrainian, the script automatically added is "C", not "Cyrl". Anatoli T. (обсудить/вклад) 05:46, 23 February 2023 (UTC)Reply[reply]

@Atitarev It's because I updated Module:languages so that we can store the list of scripts for a language as a string instead of a table if it only has one; Ukrainian only has Cyrl listed, whereas (e.g.) Russian has Cyrl and Brai, so wasn't affected. I did this because modules shouldn't be reading from Module:languages/data2 etc. directly, but instead should be getting the list from Module:languages, which will take this issue into account. The translation adder seems to be a special case, but I'm looking into it.
In terms of practical impact, though, this shouldn't really cause any problems: the script isn't actually specified in the translation template that gets added unless the language doesn't have it listed (which means it wouldn't be affected by this issue anyway). Theknightwho (talk) 16:04, 23 February 2023 (UTC)Reply[reply]
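As a sketch of one plausible mechanism for the "C" (an assumption, not a confirmed diagnosis): a consumer that expects a list and takes its first element will get a single character when handed a string. The Python below uses hypothetical data in the shape described above, and `get_scripts` is an illustrative helper rather than a function from any module.

```python
def get_scripts(raw):
    """Return a list of script codes whether `raw` is one code or a list."""
    if isinstance(raw, str):
        return [raw]
    return list(raw)

# Hypothetical data: one script stored as a string, several as a list.
data = {"uk": "Cyrl", "ru": ["Cyrl", "Brai"]}

naive_first = data["uk"][0]              # indexes into the string itself
safe_first = get_scripts(data["uk"])[0]  # normalises first

print(naive_first, safe_first)  # C Cyrl
```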
It's because Module:languages/javascript-interface uses Module:languages/alldata, right? 16:27, 23 February 2023 (UTC)Reply[reply]
I didn't realise - thanks. I had assumed everything was being routed via Module:languages, due to the big warning at the top that nothing else should be accessing the data directly. Theknightwho (talk) 16:28, 23 February 2023 (UTC)Reply[reply]
Thanks, it seems to work now. I didn't express myself clearly before. It wasn't "weird", it was sort of broken. I had to manually remove "C" from "Script code:", otherwise it was giving a "Please use a valid script code" error. Anatoli T. (обсудить/вклад) 21:50, 23 February 2023 (UTC)
Ah of course - that's true. @Erutuon fixed it, in any event. Theknightwho (talk) 21:52, 23 February 2023 (UTC)Reply[reply]

something broke in Module:User:MewBot

Hi. I use getLanguageData() in Module:User:MewBot to fetch data on all languages. Between yesterday and today this broke; now it hits the 10-second max running time after it's processed between 1100 and 1150 languages when before it got through all 8173 languages. This breaks various bot scripts such as the one I just wrote to remove redundant Chinese translations (which needs to fetch info on languages so it can convert lect translations from 'zh' to the appropriate code). Do you know why this might have happened? Did you add significant amounts of info to each language that would have resulted in this? Thanks! Benwing2 (talk) 20:02, 25 February 2023 (UTC)Reply[reply]

@Benwing2 I’ll take a look when I get home. I certainly haven’t added lots of info to each language, but I did make the change that mul and und have every script - which does get pulled through if you access Module:languages/data/all as MewBot does. Does the issue still happen if you exclude those two? If that solves the problem, we can work out a way to handle scripts with those two langcodes more efficiently (e.g. I’ve already put special logic into findBestScript to avoid this issue). Theknightwho (talk) 20:21, 25 February 2023 (UTC)Reply[reply]
Unfortunately it's not that; excluding those two doesn't help. Benwing2 (talk) 20:49, 25 February 2023 (UTC)Reply[reply]
@Benwing2 Should be fixed now. Theknightwho (talk) 21:16, 25 February 2023 (UTC)Reply[reply]

this revision

Hey. Re: "го́йда-хуёйда". I couldn't think of anything similar in English then. Now I recall. It's like someone childishly says "coffee- fuckoffee" or something, LOL. Anatoli T. (обсудить/вклад) 01:12, 1 March 2023 (UTC)Reply[reply]

Lmao okay, that makes sense. Theknightwho (talk) 16:29, 1 March 2023 (UTC)Reply[reply]

Chinese simplification conversion

Hi. Please include the missing language codes for major Chinese varieties: cjy (Jin) and cdo (Min Dong), e.g. Min Dong 國家／国家 (guók-gă). Thank you! Anatoli T. (обсудить/вклад) 06:17, 1 March 2023 (UTC)

@Atitarev I've done this. I accidentally missed out the c codes for some reason. Theknightwho (talk) 16:30, 1 March 2023 (UTC)Reply[reply]

lang-POS or POS|lang?

Hi. Does it matter if we use e.g. mul-symbol or symbol|mul? Neither is marked as deprecated. kwami (talk) 21:18, 2 March 2023 (UTC)Reply[reply]

Do you mean {{mul-symbol}} or {{head|mul|symbol}}? Either is fine, but I prefer the latter, as {{mul-symbol}} is just acting as a wrapper for the head template (but in a way that loses a lot of the functionality). Theknightwho (talk) 00:58, 3 March 2023 (UTC)Reply[reply]
Okay, I'll try to go with the more general 'head' template. kwami (talk) 06:44, 14 March 2023 (UTC)Reply[reply]

Vietnamese headword templates

Hello, it seems that Nôm characters listed within headwords do not show up anymore in the entries. I'm not familiar enough with these, so I'm not sure if it's due to your edits at {{vi-noun}} (and other Vietnamese headword templates) or something else entirely, so pardon me if my assumption missed the mark. PhanAnh123 (talk) 11:23, 4 March 2023 (UTC)

@PhanAnh123 I have a feeling that I know what the cause is. I'm just finishing up something else, and then I'll deal. Theknightwho (talk) 11:24, 4 March 2023 (UTC)Reply[reply]

There's a Wiktionary Discord??

Thanks for doing something to fix the tone markings in the Wu romanization. Hope other dialects like Wenzhounese can get added soon.

Right now I don't like how Shanghainese is the only supported dialect since it's not even considered the "purest" Wu (and that would be Suzhounese). WiktionariThrowaway (talk) 19:30, 6 March 2023 (UTC)Reply[reply]

There is! The info's at WT:Discord server.
It would be good to add more lects - I agree. Right now, I've been focusing on getting automatic transliterations working for other lects, which was the main impetus for improving the way tones are handled with Wu: e.g. (6wu). More generally, I've been trying to better integrate the modules for the Chinese lects with Wiktionary's core modules, which should make adding new lects more straightforward. Theknightwho (talk) 19:41, 6 March 2023 (UTC)Reply[reply]

ₚосія

Thanks for fixing this up. I don't know what to do with the transcription, though. Looks like we'll need to enter it manually, but I don't see an option for that. kwami (talk) 06:46, 14 March 2023 (UTC)Reply[reply]

@Kwamikagami There is no current support in the Ukrainian modules for manual translit. This could potentially be added but it might be a lot of work. Probably better is to modify the Ukrainian transliteration code to recognize the ₚ char (which just displays as a box on my Mac in Chrome; we probably need a font addition to the CSS to make sure this shows up correctly). Benwing2 (talk) 02:34, 15 March 2023 (UTC)Reply[reply]

hyphens remaining at the end of lines when displaying suffixes across lines

Hello, Theknightwho. I see you edited Module:links recently. I wonder if it would be possible to make {{l}} and {{m}} links display as nowrap, as it looks rather clumsy (in the case of suffixes) when a hyphen remains at the end of a line on its own and the rest of the suffix goes to the next line. (For an example, see near the bottom right corner.) I've been thinking about replacing hyphens with non-breaking hyphens for the display, but they cannot be copied as real hyphens. Also, I'm aware that a nowrap would affect spaces as well (and word-internal hyphens should be handled too), so it's not ideal either. Maybe an uncopiable non-breaking space could be inserted after a leading hyphen? Do you have any ideas? Adam78 (talk) 15:05, 14 March 2023 (UTC)

@Adam78: See Wiktionary:Grease pit/2022/August § Wrapping of hyphen in affixes. J3133 (talk) 16:20, 14 March 2023 (UTC)Reply[reply]

Thank you! Adam78 (talk) 16:33, 14 March 2023 (UTC)Reply[reply]

memory issues again

Hi, I see a substantial and increasing number of one-char Chinese entries in CAT:E again, just curious if you've made changes that result in this? Benwing2 (talk) 02:36, 15 March 2023 (UTC)Reply[reply]

@Benwing2: Wiktionary:Beer parlour/2023/March#Retiring derivative subpages. Chuck Entz (talk) 05:44, 15 March 2023 (UTC)Reply[reply]
The memory errors in CAT:E increased by ~ 12 in the last 30-60 minutes. I suspect this: [3] I think we need to think more carefully about how to make changes without ever-increasing memory usage. Benwing2 (talk) 08:14, 16 March 2023 (UTC)Reply[reply]
@Benwing2 I suspect you're right. I added those as a precautionary measure, as I don't know if anyone is pushing raw category/file links through any of the substitution methods (I'd hope not, but it's possible). It's not necessary if things are done via Module:links, though, as it's already set to ignore these four at a much earlier stage if it detects them as embedded links. See Module:links#L-209: they just get returned as-is. By extension, this means everything from {{also}} to {{lang}} is covered as well, as they all work via the link module at some stage. Do you think I can probably remove these? Theknightwho (talk) 08:21, 16 March 2023 (UTC)Reply[reply]
I suspect they can be just removed; but if you're adding those just as a precautionary measure, you might want to remove them but in their place add some temporary template tracking code to see if they're actually doing anything. Within a few hours the Special:WhatLinksHere tracking "category" will have entries in it if the code is actually doing something, and then you can correct the callers as appropriate. Benwing2 (talk) 08:35, 16 March 2023 (UTC)Reply[reply]
@Benwing2 Seems like something’s pushing raw category links through, though I’m unsure what as of yet: [4]. That shouldn’t be happening, really, as it’s obviously pointless. Theknightwho (talk) 16:02, 16 March 2023 (UTC)Reply[reply]
Add a call to error() to the code (without saving) and preview one of the pages. Benwing2 (talk) 16:47, 16 March 2023 (UTC)Reply[reply]
@Benwing2 Yep - found the issues:
  1. Each substitution method in Module:languages does temporary substitutions to protect formatting twice: once before preprocessing, and once straight after the substitutions/just before postprocessing, in case any formatting/weird stuff gets added by a module. Everything then gets put back at the end. A couple of transliteration modules were adding categories by concatenating them, which meant these were being picked up in the second round. I’ve adjusted those modules so their categories get dealt with properly.
  2. When there are embedded links, Module:links still runs the unlinked text through makeDisplayText, which is done by iterating over all the gaps between the links. This is to pick up things like false palochkas, character escapes and whatever. However, I was only checking for piped links, as I didn’t think there was any way for an unpiped link to reach that stage. What I’d missed was that - very occasionally - there will be unpiped category links that get fed directly into the link template, which basically just get ignored for the reason I explained earlier. However, it meant they were incorrectly being treated as unlinked text at this stage. I’ve updated Module:links accordingly. I’ve not been able to figure out a pattern that perfectly matches piped and unpiped links in all contexts, but I’m confident the two separate patterns I came up with are 100% accurate to the parser (and are an improvement over what we had before). Theknightwho (talk) 17:07, 16 March 2023 (UTC)Reply[reply]
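The two points above can be sketched roughly as follows, in Python rather than Lua. This is a toy model under stated assumptions: the two link patterns are deliberately simple and are not the ones used in Module:links, and `protect`, `restore` and `process` are hypothetical names. It shows spans being swapped for private-use placeholders before processing and put back afterwards, with piped and unpiped links matched by separate patterns.

```python
import re

# Two separate, simplified patterns (assumptions for this sketch only).
PIPED = re.compile(r"\[\[[^\[\]|]+\|[^\[\]]*\]\]")
UNPIPED = re.compile(r"\[\[[^\[\]|]+\]\]")

PUA_START = 0x100000  # first code point of Supplementary Private Use Area-B

def protect(text, patterns):
    """Replace each matched span with a single PUA placeholder character."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return chr(PUA_START + len(saved) - 1)

    for pattern in patterns:
        text = pattern.sub(stash, text)
    return text, saved

def restore(text, saved):
    """Put the protected spans back in place of their placeholders."""
    for index, original in enumerate(saved):
        text = text.replace(chr(PUA_START + index), original)
    return text

def process(text, transform):
    """Apply `transform` to everything except the protected links."""
    protected, saved = protect(text, [PIPED, UNPIPED])
    return restore(transform(protected), saved)

sample = "see [[Category:Foo]] and [[bar|baz]] here"
print(process(sample, str.upper))  # SEE [[Category:Foo]] AND [[bar|baz]] HERE
```

If only the piped pattern were checked, the unpiped category link would be treated as ordinary text and uppercased along with the rest, which mirrors the bug described in point 2.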
Great, sounds good. When matching links I've always used separate patterns for piped and unpiped links; User:Erutuon sometimes uses a single pattern maybe using %f but it may not be 100% accurate. Benwing2 (talk) 17:12, 16 March 2023 (UTC)Reply[reply]
@Benwing2 @Erutuon Given we were already being more permissive with our embedded links than standard wikitext anyway, I decided to make it as permissive as possible (since I noticed we do actually use "illegal" links like [[<]] in headword templates already). As an extreme example, it can now cope with {{l|mul|[[ [[ []] ]]}} to output [[ [ ]]. The only thing you can't do is use [[ or ]] (though you can get round the display text problem with nowiki tags). Theknightwho (talk) 20:31, 16 March 2023 (UTC)Reply[reply]
Cool. I wonder if we shouldn't fold Module:languages/data/patterns into Module:languages; small modules can cause big memory issues so we might get better luck with fewer modules. Benwing2 (talk) 20:38, 16 March 2023 (UTC)Reply[reply]
@Benwing2 I did actually do that originally, but there are massive performance issues unless it's either declared in the function itself or cloned, which is a pain as three functions use it. The reason I opted for a separate module was to save it being loaded unnecessarily. Theknightwho (talk) 20:43, 16 March 2023 (UTC)Reply[reply]
Hum. We still have a bunch of new entries in CAT:E since yesterday. Not sure what caused it. Benwing2 (talk) 20:44, 16 March 2023 (UTC)Reply[reply]
@Benwing2 In all honesty, I think we should probably look at sorting out the Chinese modules. They're seriously bloated.
I did develop a bunch of string functions which are in Module:string utilities, which are designed to use the string library wherever possible - only using ustring as a last resort. It varies, but they definitely do help, so we could start using those in more places. Theknightwho (talk) 20:49, 16 March 2023 (UTC)Reply[reply]

I agree and I think we should seriously look into my earlier suggestion of encoding the T<->S tables more efficiently as they have to be a big part of it. Benwing2 (talk) 20:51, 16 March 2023 (UTC)

@Benwing2 They are definitely a contributing factor, but there are quite literally thousands of Chinese data modules in the subpages of Module:zh. I've noticed T:zh-pron, T:zh-dial and similar using 8-15MB each in some cases, and I'm convinced it's due to this. By comparison, the traditional/simplified tables are small fry.
A major concern I have is that the Chinese modules make very little use of the main infrastructure, or at best it's piecemeal, which means there's a lot of duplication going on. I think we should probably tackle Module:zh-pron and its submodules Module:cmn-pron, Module:yue-pron etc. first, as it's the main Chinese module. I've done a few things, but they probably need a total rewrite. Theknightwho (talk) 20:58, 16 March 2023 (UTC)Reply[reply]
@Benwing2 I think one cause of the spike yesterday was this diff, which was fair enough as there was a minor bug in a local function of Module:string utilities that tries to convert ustring-only patterns into string-compatible ones: ? after a multibyte character wasn’t being converted properly, which affected a few gsubs. Since patching that and restoring it, the number of entries in CAT:E has decreased. Theknightwho (talk) 14:16, 17 March 2023 (UTC)Reply[reply]
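To illustrate the general pitfall (not the actual code in Module:string utilities), here is a Python sketch using byte-level regular expressions. When a pattern written for characters is applied to UTF-8 bytes, a `?` after a multibyte character binds only to that character's last byte. Lua patterns have no quantified groups, so the real conversion necessarily works differently; the grouping below is just the Python way to show the intended meaning.

```python
import re

with_accent = "café".encode("utf-8")  # b"caf\xc3\xa9"; "é" is two bytes
without = "caf".encode("utf-8")

# Naive byte-level reading of the pattern "café?": "?" applies to \xa9 only,
# so the lead byte \xc3 is still required.
naive = re.compile("café".encode("utf-8") + b"?")

# Grouping the character's bytes makes "?" apply to the whole character.
grouped = re.compile(b"caf(?:" + "é".encode("utf-8") + b")?")

print(naive.fullmatch(without) is None)        # True: "caf" is wrongly rejected
print(grouped.fullmatch(without) is not None)  # True
print(grouped.fullmatch(with_accent) is not None)  # True
```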
Great, thanks for fixing the underlying issue. Benwing2 (talk) 15:26, 17 March 2023 (UTC)Reply[reply]

Hebrew Entries in CAT:E

The Hebrew-specific modules haven't been changed recently enough to cause this. I suspect it may have something to do with delimiters in the parameters being messed with by another module. Chuck Entz (talk) 15:14, 15 March 2023 (UTC)Reply[reply]

@Chuck Entz Odd that it only affects these three entries. I’ll have a look. Theknightwho (talk) 16:19, 15 March 2023 (UTC)Reply[reply]
Those are fixed, but there's a separate issue that seems to be correlated somehow with |pausalwv= (though there are a couple of exceptions with the same error but without that parameter and a few with the parameter but no error). It happened after @Erutuon edited Module:he-common, but I have no idea if that caused it. Chuck Entz (talk) 15:17, 16 March 2023 (UTC)
@Chuck Entz User:Erutuon accidentally deleted a line when making that change, leading to the issue. Benwing2 (talk) 16:58, 16 March 2023 (UTC)Reply[reply]
@Benwing2: Thanks for fixing it. Embarrassing mistake. It happened because I had logged around that statement and then deleted the whole block of text. — Eru·tuon 18:33, 16 March 2023 (UTC)Reply[reply]
@Erutuon No problem at all. Benwing2 (talk) 20:39, 16 March 2023 (UTC)Reply[reply]

Your edit on module links seems to have broken stuff

See for example Special:Diff/69225452/71670296 which tries to fix it. – Wpi31 (talk) 14:47, 16 March 2023 (UTC)Reply[reply]

@Wpi31 This is fixed. It was a strange issue with certain redundant wikilinks in links. Theknightwho (talk) 14:50, 16 March 2023 (UTC)Reply[reply]

bor en cmn & zh-l

Hey: I saw this diff and I have been trying to copy it elsewhere. I wanted to say that there is some functionality of zh-l that is lost in bor en cmn: look at Diaoyutai & Fengqiu (manually adding character forms) and also the * functionality in zh-l (where no traditional or simplified are displayed when you put the other into zh-l). I don't know if you're working in this area or not, but I assume you are/were. If I make no sense, never mind. --Geographyinitiative (talk) 22:06, 17 March 2023 (UTC)

@Geographyinitiative These can both be solved by the same thing, actually. You can use // to give manual forms: {{l|cmn|釣魚臺//釣魚台//钓鱼台}} gives 釣魚臺／釣魚台／钓鱼台 (Diàoyútái). Because (1) manual forms override automatic ones and (2) empty forms aren't shown, you can put // at the end to stop automatic simplification: {{l|cmn|詞典//}} gives 詞典 (cídiǎn), whereas {{l|cmn|詞典}} gives 詞典／词典 (cídiǎn). On the other hand, * at the start is used for reconstructions, as with other languages. Theknightwho (talk) 22:14, 17 March 2023 (UTC)
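The two rules stated above (manual forms override automatic ones; empty forms are not shown) can be modelled with a short Python sketch. The function names and the one-entry conversion table are hypothetical, and this is not how the Lua modules implement it; it only reproduces the three examples.

```python
def split_forms(term):
    """Split a term on "//" into the main form and any manual forms.

    has_manual is True whenever "//" appears at all, so a trailing "//"
    is enough to suppress the automatic forms.
    """
    parts = term.split("//")
    main = parts[0]
    has_manual = len(parts) > 1
    manual_forms = [p for p in parts[1:] if p]  # empty forms are not shown
    return main, manual_forms, has_manual

def display_forms(term, auto_simplify):
    """Return the forms to display; `auto_simplify` is a stand-in converter."""
    main, manual, has_manual = split_forms(term)
    if has_manual:
        return [main] + manual  # manual forms override automatic ones
    simplified = auto_simplify(main)
    return [main] if simplified == main else [main, simplified]

toy_table = {"詞典": "词典"}  # toy conversion table for this example only

def simplify(text):
    return toy_table.get(text, text)

print(display_forms("釣魚臺//釣魚台//钓鱼台", simplify))  # ['釣魚臺', '釣魚台', '钓鱼台']
print(display_forms("詞典//", simplify))                  # ['詞典']
print(display_forms("詞典", simplify))                    # ['詞典', '词典']
```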

Module:nan-pron-Hainan

Hi. Should we use IPA for this? I'm speaking of tone specifically. There was general consensus a couple years ago that Chinese should be in IPA just like any other languages, but we had one major contributor (who was adding a lot of valuable material) who threatened to quit Wiktionary if we switched to IPA. kwami (talk) 02:31, 18 March 2023 (UTC)Reply[reply]

@Kwamikagami I have no opinion on this, to be honest. @Justinrleung? Theknightwho (talk) 02:33, 18 March 2023 (UTC)Reply[reply]
@Kwamikagami: You mean with the tone letters, like ˥ ˦ ˧ ˨ ˩ ? I think this should be discussed more widely, not just for this particular module. — justin(r)leung (t...) | c=› } 04:19, 18 March 2023 (UTC)Reply[reply]
Agreed. kwami (talk) 04:34, 18 March 2023 (UTC)Reply[reply]

{{ryu-r}}

Hi, I see that you're updating ryu-r to be in line with ja-r. Would you mind creating {{ryu-r/multi}} and {{ryu-r/args}}, similar to {{ja-r/multi}} and {{ja-r/args}}? This would be immensely helpful for some of the pages that are very close to the Lua memory limit and have quite a number of transclusions of ryu-r, such as 人 and 月. – Wpi31 (talk) 09:21, 23 March 2023 (UTC)

@Wpi31 Good idea. I wonder if it's possible to do this without forking the code even more, as they'll inevitably get out of sync again. Theknightwho (talk) 09:23, 23 March 2023 (UTC)Reply[reply]
@Wpi31 This is done. Theknightwho (talk) 10:19, 23 March 2023 (UTC)Reply[reply]

optimization and memory errors

Hi, I see CAT:E has ballooned again with memory errors. We really need to optimize Module:languages (especially) and Module:links to avoid doing unnecessary work. I think this needs to be your top priority currently. You should add checks in various places to see if there are any chars that could potentially cause issues, and do nothing if not. Currently it seems we're doing a whole lot of unnecessary splitting, processing and rejoining. I also see you're calling mw.clone() on Module:languages/data/patterns in various places; that uses extra memory, is it really necessary? Benwing2 (talk) 22:04, 23 March 2023 (UTC)Reply[reply]

@Benwing2 This happened after I implemented the fix for the Korean transliteration issue, unfortunately, but I've been too tired to deal.
I remember that the pattern cloning was necessary because there was a massive, unexplained slowdown without it, but I can't seem to replicate it now. Looking back, I have a feeling it was probably down to the mw.ustring.gmatch bug.
I'll have a look at further optimisation tomorrow, as you're right that it's not ideal at the moment. Theknightwho (talk) 22:16, 23 March 2023 (UTC)Reply[reply]
OK thanks, please do take care of yourself and get some sleep! (You mentioned working in law, and I know at least in the US some law firms are notorious for working their employees to death.) Benwing2 (talk) 22:31, 23 March 2023 (UTC)Reply[reply]
@Benwing2 Thanks - appreciate it! Theknightwho (talk) 23:04, 23 March 2023 (UTC)Reply[reply]
Hi, there are still 28 or so terms in CAT:E. We really need to bring them down. I can help you work on optimization, but I can't do it alone as I don't understand some of the specifics of the code you've written. Benwing2 (talk) 04:29, 26 March 2023 (UTC)Reply[reply]
@Benwing2 Hiya - sorry, I forgot about this yesterday. I've dealt with most of them, but the two Han characters are proving quite tricky. I'll see what I can do. Theknightwho (talk) 20:20, 26 March 2023 (UTC)Reply[reply]
Thanks. How did you bring them down? What I'm suggesting is looking for opportunities to avoid unnecessary work in Module:languages, not simply using ever more *-lite templates. IMO doing this is not hard and very important. I would do it myself but I don't know which sorts of optimizations of this nature are safe. Benwing2 (talk) 20:22, 26 March 2023 (UTC)Reply[reply]
@Benwing2 Mostly with {{tt}}, but ~5 needed lite templates. I agree that we need a longer term solution, but I was trying to get these out of the category in the short-term, as it may take a bit of work to touch up Module:languages (as there's quite a lot to go through, and I'm not happy with the overall structure right now). In the immediate term, I think it's okay to keep adding more lite templates as and where necessary, as we can always remove them once the memory issues are a bit less pressing. Theknightwho (talk) 20:27, 26 March 2023 (UTC)Reply[reply]
The optimizations I'm suggesting will not require major rewriting of Module:languages and will need to be done even once rewritten so IMO you may as well look into them now as they will save a lot of time worrying about adding lite templates and such. What I'm suggesting is similar to what I already do when parsing <...> inline modifiers; first check to see if there's a less-than sign anywhere, and if not, avoid loading Module:parse utilities and fall back to simpler code. In your case, check e.g. for brackets, apostrophes and other things that might require you to split the string into parts and process each part individually, and fall back to simple code that doesn't do any splitting. Essentially you optimize for the most common/simple case and avoid invoking (and ideally even loading) the more complex code. Benwing2 (talk) 20:33, 26 March 2023 (UTC)Reply[reply]
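A minimal Python sketch of the fast-path idea described above: check cheaply for the character that would trigger the expensive work, and skip that work when it is absent. The function names and the `<key:value>` parsing are illustrative stand-ins, not the behaviour of Module:parse utilities.

```python
def parse_inline_modifiers(text):
    """Stand-in for the expensive path: split "term<key:value>..." into parts."""
    import re  # imported lazily, mirroring "avoid loading the heavy module"

    term = re.sub(r"<[^<>]*>", "", text)
    pairs = re.findall(r"<([^<>:]+:[^<>]*)>", text)
    modifiers = dict(pair.split(":", 1) for pair in pairs)
    return term, modifiers

def parse_term(text):
    """Return (term, modifiers), taking the cheap route in the common case."""
    # Fast path: without a "<" there cannot be any inline modifiers.
    if "<" not in text:
        return text, {}
    return parse_inline_modifiers(text)

print(parse_term("word"))                 # ('word', {})
print(parse_term("word<tr:foo><t:bar>"))  # ('word', {'tr': 'foo', 't': 'bar'})
```

The same shape would apply to checking for brackets or apostrophes before splitting a string into parts.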
@Benwing2 I've made a few changes along these lines, which seems to have helped. I've had to keep one instance of mw.clone in doTempSubstitutions: depending on the params, certain additional patterns get inserted into the table of patterns that gets iterated over. If you don't clone the pattern table, these seem to get inserted into the version of the table sitting in packages.loaded, which means they're still there the next time the pattern table gets loaded. On the other hand, using mw.loadData isn't an option as we need to insert the extra patterns, plus it seems to cause a memory increase anyway. Theknightwho (talk) 00:35, 27 March 2023 (UTC)Reply[reply]
Also just as an FYI, I've used the U+100000-10FFFD range for the PUA substitution characters, because we can take advantage of that in capture patterns. e.g. "\244[\128-\191]*" will always match a character in that range, but it also means that patterns like "^[\128-\191\244]*" are usable as well (for a string of PUA chars at the start of a string). The only false positives would be non-UTF-8 compliant, and they get caught earlier in the process. Theknightwho (talk) 00:49, 27 March 2023 (UTC)
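As a quick check of the byte-level reasoning, here is a Python sketch with byte regexes that are direct analogues of the two Lua patterns quoted above (Lua's decimal escape `\244` is the byte 0xF4, and `\128-\191` is 0x80-0xBF). The sample string is made up for the example.

```python
import re

PUA_CHAR = re.compile(rb"\xf4[\x80-\xbf]*")      # analogue of "\244[\128-\191]*"
LEADING_PUA = re.compile(rb"^[\x80-\xbf\xf4]*")  # analogue of "^[\128-\191\244]*"

text = "\U00100000\U00100001abc".encode("utf-8")

# The first pattern consumes exactly one character: the lead byte 0xF4 plus
# its continuation bytes, stopping at the next lead byte.
first = PUA_CHAR.match(text).group(0).decode("utf-8")
print(first == "\U00100000")  # True

# The second consumes the whole run of such characters at the start.
leading = LEADING_PUA.match(text).group(0).decode("utf-8")
print(leading == "\U00100000\U00100001")  # True

# In UTF-8, 0xF4 is the lead byte for this plane and not for the one below it.
print(chr(0x100000).encode("utf-8")[0] == 0xF4)  # True
print(chr(0x10FFFD).encode("utf-8")[0] == 0xF4)  # True
print(chr(0xFFFFF).encode("utf-8")[0] == 0xF3)   # True
```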
Over the years, we've had several cases of people erroneously creating entries with PUA characters in the entry names. I hope that kind of thing won't lead to weird and hard-to-diagnose errors before we spot such entries and delete them. Chuck Entz (talk) 01:05, 27 March 2023 (UTC)Reply[reply]
@Chuck Entz They won't cause errors - those characters simply won't go through a lot of the text processing that other characters will. Given they're PUA characters, it doesn't really make much sense for them to be processed anyway, so it doesn't really matter. It still works, though, if you really want to do it: e.g. {{l|en|􀀀􀀀}}, {{l|en|'''􀀀􀀀'''}}, {{l|en|w:'''􀀀􀀀'''}} becomes 􀀀􀀀, 􀀀􀀀, 􀀀􀀀 and so on. I can put a pre-check on it to stop people doing this, though, as there's a chance the text might get mangled (but again, as they're PUA characters, people shouldn't be making these in the first place, so what's getting "mangled" is actually totally meaningless). Still no errors, though. Theknightwho (talk) 01:15, 27 March 2023 (UTC)Reply[reply]
Thanks. Yeah sometimes mw.loadData() increases memory usage; I ran into that with the {{place}} data, where separating out the functions and tables and using mw.loadData() on the remainder increased memory on a test page with about 60 invocations of {{place}} from I think 25 to 29M. I think the issue is the wrapping of tables, which adds a bunch of overhead, which is only made up for if you load it a whole bunch of times (apparently 60 wasn't enough). As for PUA chars in entries, I wouldn't worry too much about them esp. if the pre-check for them will add memory. Benwing2 (talk) 01:26, 27 March 2023 (UTC)Reply[reply]
I'm not sure it's the number of uses, as doTempSubstitutions gets used (tens of) thousands of times on large pages. Who knows.
Re the precheck, if text:match("\244") then error(...) end at the beginning should be enough, I think. Theknightwho (talk) 01:30, 27 March 2023 (UTC)
If it helps any, the error at is due to:
ɨ + ̱
U+0268 + U+0331 (Combining macron below)
I'm not sure how to enter that into {{str index-lite/logic}}, or I would have fixed it myself. Chuck Entz (talk) 02:09, 27 March 2023 (UTC)Reply[reply]
@Chuck Entz Thanks - that happened after I updated the lite templates to escape * at the start of translits, as it was sometimes wrongly causing a new list to start. I just got rid of the lite template here, as the page is well below the limit. Tangut translit is intensive (as it's all in a character database), but the page is still only at 46MB.
Generally, it's good to avoid adding characters to {{str index-lite/logic}} wherever possible, as the longer it gets, the more likely other pages are to hit some of the other page limits. Theknightwho (talk) 02:17, 27 March 2023 (UTC)Reply[reply]
For the record, I have been trying to bring down 一 and 茶 for the past few days, but without success – 一 is no longer in CAT:E, presumably due to the changes to the Korean templates as mentioned, but there is too much in the descendants of 茶, which I think a multidesc template would help with. – Wpi31 (talk) 05:51, 27 March 2023 (UTC)
Look ma, no memory errors! Looks like you've done a lot of work on this, thank you! Benwing2 (talk) 16:54, 29 March 2023 (UTC)

serialization errors

Hi, there are a bunch of errors now in CAT:E. Looks like they are due to a malformed regex coming from one of the serialization modules. Benwing2 (talk) 04:41, 29 March 2023 (UTC)

@Benwing2 Turns out you still need to escape [ even if it's the whole pattern. Really dumb. Theknightwho (talk) 04:47, 29 March 2023 (UTC)
I am purging CAT:E but I still see some errors related to malformed regexes, maybe other chars need escaping? See also Module:pattern utilities. Benwing2 (talk) 04:55, 29 March 2023 (UTC)
@Benwing2 Sorted (though it was a pain, for some reason). Theknightwho (talk) 06:02, 29 March 2023 (UTC)
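For reference, escaping a literal string for use as a Lua pattern means prefixing each magic character with %. The helper below is a Python sketch written for this example; it is not taken from Module:pattern utilities, whose interface is not shown in this thread.

```python
# The magic characters of Lua patterns, per the Lua reference manual.
LUA_MAGIC = set("^$()%.[]*+-?")


def escape_lua_pattern(text):
    """Prefix every Lua magic character with % so the text matches literally."""
    return "".join("%" + ch if ch in LUA_MAGIC else ch for ch in text)
```

A lone "[" becomes "%[", which is the case mentioned above: an unescaped "[" opens a character class even when it is the entire pattern, so the pattern is malformed.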

Block

Hey. You have to block me, I'm about to go fucking crazy Van Man Fan (talk) 22:36, 30 March 2023 (UTC)

something wrong with Classical Latin, Late Latin, Renaissance Latin etc. categories

I assume this is related to your etymology-language changes, but categories like CAT:German terms derived from Classical Latin are now empty (see CAT:Empty categories), and the category text for a category like Category:German terms derived from Late Latin redlinks the word 'Late Latin' to Category:Late Latin language when it should link to Category:Late Latin. Benwing2 (talk) 01:55, 1 April 2023 (UTC)Reply[reply]

@Benwing2 Good catch - fixed. Theknightwho (talk) 02:04, 1 April 2023 (UTC)

re: und

I think you need to rethink your edits relative to this code. For instance, Philistine is a real language of unknown affinities, and has 11 lemmas. Many of them are questionable, but we can't rfv them- because your edits have messed up the processing of "und-phi", which is an exception code based on "und", not "und" itself.

Also, "und" has been used as a placeholder when the correct language code is unknown. Most of the time this is a bad idea, but there are some cases where it's at least arguable. For instance, while inheritance by "und" is clearly wrong, borrowing doesn't require any relationship between the two languages, so borrowing into a language with an unknown code can certainly happen. On a practical level, these "und" descendants have been added by very knowledgeable people to deal with problems they couldn't solve. They aren't going to be easy to clean up. Chuck Entz (talk) 16:00, 1 April 2023 (UTC)

After spending a good part of my day on module errors, it looks like the problem with Philistine has nothing to do with the "und" part. Instead, it's due to it having nil instead of a family code. There are a number of languages like this, mostly in the Americas but also in Africa or the ancient Near East. What they have in common is extremely limited attestation due to having died out before a decent corpus could be collected or produced. With such a limited corpus, it's often impossible to determine whether a given language is related to others, or even whether it's an isolate. I've spot-checked a number of these, and they all throw the same error in the {{rfv}} template, due to the module's checking the family in order to decide which rfv forum to send it to. The simplest solution would be to just assume that anything with a null family code goes to RFVN, and skip the rest. We really do need to have {{rfv}} working for these, since unknowns tend to attract crackpots or overconfident idiots who are only too happy to fill in the blanks. See for instance the trouble we've been having with Illyrian (never mind that we're pretty sure it's Indo-European), though Philistine itself is a pretty good illustration, too. Most of the Philistine entries were added and/or edited either by BedrockPerson and socks or by ShlomoKatsav, who probably knows better by now. Interestingly, most of ShlomoKatsav's entries were deleted after an RFV initiated by what seems in retrospect to be a BedrockPerson IP sock. But I digress...
Also, the two Proto-Turkic entries in CAT:E seem to be the same thing going in the other direction: they're only attested in one very old manuscript covering a multitude of languages- so it's apparently impossible to tell exactly what language they are- but there's enough data to figure out which branch they belong to. Chuck Entz (talk) 02:43, 2 April 2023 (UTC)Reply[reply]
@Chuck Entz You make some very good points, which I hadn't considered - I've removed the ban.
This is something that came up due to the problem of substrates, where (at least internally) they're treated as etym-only variants of und, but with the language family of whatever their parent is; as what distinguishes them is that they have families for parents, not regular languages. Before the major update to etym-only languages that just happened, they were being caught by the ban on using family codes in descendant templates, as the template would grab the parent, see that it's a family, and throw an error. However, after the update, they'd be getting processed as und instead - and therefore working.
To be honest, I suspect the old logic was simply an oversight, as substrates are hardly used. If we allow und for descendants, then there's no reason we shouldn't allow substrate descendants either. Theknightwho (talk) 16:05, 2 April 2023 (UTC)Reply[reply]
@Chuck Entz I've fixed the family bug. The underlying issue was that the language's inFamily first checks if the language is in family X. If not, it then checks if the language's family is in family X, (and if not checks that family's family etc). It's that second step which was throwing the error, because you can't check what family nil is in. I've changed it so that it immediately returns false if a language doesn't have a family; the practical result being that Module:request-forum will see that these terms aren't Italic, so sends them to WT:RFVN. Theknightwho (talk) 23:05, 2 April 2023 (UTC)Reply[reply]
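The fix described above amounts to walking up the family chain and returning false as soon as there is nothing left to check. Below is a minimal Python sketch under assumptions: the two lookup tables are toy data, and "WT:RFVI" as the name of the Italic forum is a guess, since only WT:RFVN is named in this thread.

```python
# Toy data for illustration: language -> family, and family -> parent family.
LANGUAGE_FAMILY = {"la": "itc", "und-phi": None}
FAMILY_PARENT = {"itc": "ine", "ine": None}


def in_family(lang_code, target):
    """True if the language's family, or any ancestor of that family, is target."""
    family = LANGUAGE_FAMILY.get(lang_code)
    while family is not None:
        if family == target:
            return True
        family = FAMILY_PARENT.get(family)
    # A language with no family (None) falls straight through to here.
    return False


def request_forum(lang_code):
    """Route Italic languages to their own forum and everything else to RFVN."""
    return "WT:RFVI" if in_family(lang_code, "itc") else "WT:RFVN"
```

With this shape, `request_forum("und-phi")` returns the general forum instead of raising, because the loop never tries to look up the family of nothing.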

Incorrect Middle Chinese final for some characters

I saw you had recent edits to Module:ltc-pron/data so I wonder if you know where the following bug is coming from.

The final in the Middle Chinese table (4th row) for certain rimes loads the wrong rime character but the right number.

Examples where you can see include: (showing 眞 instead of 質) and (showing 冬 instead of 沃)

It seems the items in the "data.fin_conv" variable in the module are the ones affected.

Zywxn (talk) 16:32, 2 April 2023 (UTC)

@Zywxn: these are in fact correct and not a bug: 質 is the entering tone equivalent of 眞, likewise for 冬/沃 and the other ones in data.fin_conv. The entering tone variant is to all intents phonologically the same as the non-entering tone equivalent, except that it ends in a stop and carries the entering tone.
For the purpose of the module, saying that a character has 冬 rime and entering tone is sufficient for telling the reader that it has 沃 rime. – Wpi31 (talk) 19:25, 2 April 2023 (UTC)
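The convention can be pictured as a lookup from an entering-tone rime to its non-entering counterpart. The Python sketch below uses only the two pairs named in this thread; the real layout of data.fin_conv is not shown here, so the table shape and function name are assumptions.

```python
# Entering-tone rime -> non-entering equivalent (only the pairs mentioned above).
ENTERING_TO_BASE = {"質": "眞", "沃": "冬"}


def displayed_final(rime):
    """Return (rime shown in the table, whether the entering tone is implied)."""
    base = ENTERING_TO_BASE.get(rime)
    if base is None:
        return rime, False
    return base, True
```

So a character with 沃 rime is shown as 冬 plus entering tone, which carries the same information.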

errors with ancestral_to_parent

See CAT:E. Benwing2 (talk) 03:34, 3 April 2023 (UTC)

@Benwing2 Thanks - was a metatable issue, where the parent now caches _type during the creation of an etym-only language due to running :hasType(), but that was interfering with the etym-only language's own :hasType method by giving a false positive when it checked for a cached _type table: [5] Theknightwho (talk) 04:12, 3 April 2023 (UTC)Reply[reply]
This is definitely a non-obvious interaction; can you add a comment by your change in [6]? Otherwise it won't be at all obvious why this is being done. Benwing2 (talk) 04:17, 3 April 2023 (UTC)Reply[reply]
@Benwing2 Done. Theknightwho (talk) 04:21, 3 April 2023 (UTC)
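The interaction is easier to see in a small model. In the Python sketch below, `__getattr__` plays the role of the metatable's __index fallback to the parent, and reading the instance's own `__dict__` plays the role of rawget. The class, the attribute names and the type names are all invented for the example.

```python
class Lang:
    def __init__(self, types, parent=None):
        self._types = types
        self._parent = parent

    def __getattr__(self, name):
        # Runs only when normal lookup fails: fall back to the parent object.
        parent = self.__dict__.get("_parent")
        if parent is None:
            raise AttributeError(name)
        return getattr(parent, name)

    def has_type_buggy(self, wanted):
        try:
            cache = self._type_cache  # may silently find the PARENT's cache
        except AttributeError:
            cache = self._type_cache = set(self._types)
        return wanted in cache

    def has_type_fixed(self, wanted):
        cache = self.__dict__.get("_type_cache")  # own attributes only
        if cache is None:
            cache = self.__dict__["_type_cache"] = set(self._types)
        return wanted in cache
```

Once the parent has cached its types, the buggy lookup on the child inherits that cache and reports the parent's types as its own; the fixed version only ever consults the child's own storage.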

My bot created a bunch of these, which is an indication that the category code isn't working quite right. I've stopped it from creating more but the category code needs fixing here. Benwing2 (talk) 05:50, 4 April 2023 (UTC)Reply[reply]

BTW I can delete these once you fix the category code. Benwing2 (talk) 05:51, 4 April 2023 (UTC)
@Benwing2 Thanks - all of these have been dealt with. Theknightwho (talk) 10:54, 5 April 2023 (UTC)

Middle Iranian errors

Hi again, see CAT:E. I purged the category, which eliminated the Old Iranian errors, but the Middle Iranian ones remain. I assume these categories shouldn't exist at all; "Middle Iranian" and "Old Iranian" are not languages at all, but families (and not even clades at that). However, this is a sore subject for User:Vahagn Petrosyan, who insists that these "languages" should exist for ease in porting over Armenian etymologies, where it's apparently difficult to figure out which of several candidate Old and Middle Iranian languages are the right donors for various borrowed terms, and so the linguists working in this area lazily put "Middle Iranian" and "Old Iranian". I have difficulty understanding how this can possibly work, as the phonologies of different Old and Middle Iranian languages are radically different, but go figure. Benwing2 (talk) 06:31, 5 April 2023 (UTC)Reply[reply]

You can make both Old Iranian and Middle Iranian codes categorize into "Iranian languages" if that will make things easier. For example, the etymology of սրահ (srah) would display "From Middle Iranian *srāh" but categorize into Category:Old Armenian terms borrowed from Iranian languages. Vahag (talk) 07:18, 5 April 2023 (UTC)Reply[reply]
@Benwing2 @Vahagn Petrosyan So Old and Middle Iranian are the only two etymology-only families that we have. Unlike with substrates, where they behave like variants of und, that doesn’t seem appropriate here. I can fix them up, but the question still remains whether we should even have something like this at all. Theknightwho (talk) 10:54, 5 April 2023 (UTC)Reply[reply]
I have come to the conclusion that despite some claims, Old Iranian borrowings in Armenian probably do not exist. The main reason for having separate "Middle Iranian" and "Old Iranian" has disappeared for me. I don't mind if the codes are deleted. You can bot-replace {{der|xcl|MIr.}}, {{der|xcl|OIr.}} with "Middle {{der|xcl|ira}}", "Old {{der|xcl|ira}}". Vahag (talk) 11:58, 5 April 2023 (UTC)Reply[reply]
I find it okay too, even though it is certain that both Old and Middle Iranian borrowings exist in Aramaic. Not every language conceptualization with a name needs to have a code, if it is not even a language per se and etymological storytelling does not require it either. It is perfectly okay if Sarri.Greek has written “Hellenistic Koine Greek” on various places because that’s what she got to know while language codes may not do anything here. I used “Middle Iranian” because of analogy and Systemzwang more than inherent reason. Fay Freak (talk) 14:29, 5 April 2023 (UTC)Reply[reply]
@Vahagn Petrosyan @Fay Freak From a technical perspective this is easy to fix - I agree it isn’t a true family, but it feels a bit crap to remove it entirely. This feels like the sort of thing we could keep for the purpose of categorisation, but nothing more.
That being said, I don’t know much about Indo-Iranian languages - I’m just commenting on the fact that there’s no technical limitation. Theknightwho (talk) 14:34, 5 April 2023 (UTC)Reply[reply]
Middle Iranian and Old Iranian complicate the categorization. Now we have Category:Old Armenian terms borrowed from Middle Iranian languages whose members do not directly appear in Category:Old Armenian terms borrowed from Iranian languages, which is not optimal. Let's remove them. Vahag (talk) 16:06, 5 April 2023 (UTC)Reply[reply]

Problem with categorisation for links to entries with multiple pronunciations

See for example 南#Coordinate terms, where [[Category:|南]] appears after 北 and 南. – Wpi31 (talk) 19:48, 6 April 2023 (UTC)Reply[reply]

@Wpi31 Fixed. I forgot that Module:zh-translit needed to be tweaked after a change to Module:links: [7]. Theknightwho (talk) 20:11, 6 April 2023 (UTC)Reply[reply]

This doesn't look right. There are a lot of such categories with 'Unspecified' for the script, e.g. Category:Requests for Unspecified script for Old Uyghur terms. I suspect you should avoid generating such categories for 'Unspecified' script as well as probably for the 'Undetermined' language. Benwing2 (talk) 06:41, 7 April 2023 (UTC)Reply[reply]

@Benwing2 Thanks - I’m on mobile right now, but I suspect this is down to substrates being handled as und. Will have a look when I get home. Theknightwho (talk) 20:23, 7 April 2023 (UTC)

Georgian terms in CAT:E

If it helps any, these are completely unrelated to the Middle Persian issue. There seems to be something about Module:ka-form of in those entries that's not setting up the parameters for Module:links the way the latter is expecting since you changed the code. I checked the transclusion list for one of the entries: none of the Georgian-specific modules or Georgian-specific data in data-modules has been changed recently except for some changes to Module:Geor-translit on March 30. Chuck Entz (talk) 21:43, 7 April 2023 (UTC)Reply[reply]

@Chuck Entz I know what’s causing the problem, as it’s down to Module:links destructively modifying the input. Normally this doesn’t matter, but it does sometimes matter in loops when the same data container gets used over and over. This came up with a Pali module as well.
It’s possible to solve by doing what’s called a “shallow copy” (where you create a new container that still points at the original contents, versus a deep copy that copies everything inside as well). That way, any modifications to the container don’t affect the original one. However, this has a (very minor) memory impact, and also seems to break modules that are accidentally passing the input data as a global variable. I need to catch all of those first (such as the one I just fixed in Module:interproject). The memory impact is negligible, but will probably require some additional saving measures on extremely vulnerable pages like go. Unfortunately, there’s no real way around that impact, as the destructive modification is the most efficient way to implement multiple forms for links.
Incidentally, Module:headword has suffered from this same problem for a while, so I’ll look into fixing that, too. Theknightwho (talk) 21:54, 7 April 2023 (UTC)Reply[reply]
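The difference between the two kinds of copy, and why a shallow one is enough to protect a reused container, can be shown with a short Python sketch. `format_link` is a made-up stand-in for a function that destructively modifies its input; it is not the code in Module:links.

```python
import copy


def format_link(data):
    """Stand-in for a function that rewrites keys on the table it is given."""
    data["term"] = data["term"].lstrip("*")
    data["processed"] = True
    return data["term"]


def format_link_safe(data):
    """Work on a shallow copy so the caller's container is left untouched."""
    return format_link(copy.copy(data))


shared = {"term": "*foo", "langs": ["ka"]}
shallow = copy.copy(shared)
deep = copy.deepcopy(shared)
```

Here `shallow["langs"]` is the very same list object as `shared["langs"]`, whereas `deep["langs"]` is a separate copy; only the outer container is duplicated in the shallow case, which is why the memory cost stays small.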

Script request cats with module errors

I'm not sure exactly how to fix these, but I think the problem stems from assuming that a |tr= parameter is always an implicit request for a native-script equivalent. After weeding out a couple of incorrect language codes and fixing cases where the |tr= was an improvised substitute for using an empty first parameter with a display text in the second parameter to display without linking (I haven't figured out yet how to fix some of these in {{desc}}), there are a few cases where the |tr= is some kind of phonetic transcription meant to display alongside the term specified by the usual parameters. There might be a need for a new parameter for this function, but I'm not sure what to call it and it should be discussed first. At the very least I think you should avoid generating a native-script request category for Latin-script languages, since {{auto cat}} will throw a module error if you create it. It may also be desirable to display Latin-script |tr= text for Latin-script terms as if it were a transliteration, even though it technically can't be one and there's no translit module for it. Chuck Entz (talk) 21:01, 8 April 2023 (UTC)Reply[reply]

@Chuck Entz I think this is what ts= is for, though using that still produces the request for native script terms. To be honest, I actually think it's worth lifting the ban on having that for the Latin script, as it's still an accurate description for something like {{cog|lus||ts=puŋᴴ}}, which produces Mizo [Term?] (/⁠puŋᴴ⁠/). Theknightwho (talk) 01:17, 9 April 2023 (UTC)Reply[reply]

Template:trans-see wrong markup

I think Template:trans-see currently generates wrong markup.

See, for example, amenity, then Translations, then "convenience — see convenience".

Currently generated markup:

<div class="NavFrame"><div class="pseudo NavFrame">...</div></div>

Should be

<div class="NavFrame pseudo">...</div>

So it should be a single .NavFrame element, and I believe "NavFrame" should also come before "pseudo". Disfated (talk) 12:41, 9 April 2023 (UTC)

@Disfated Thanks - I’ll take a look. Theknightwho (talk) 13:09, 9 April 2023 (UTC)
@Disfated Are you sure that's what you're seeing? I see this:
<div class="pseudo NavFrame"><div class="NavHead" ... >...</div></div>
Which matches the pre-Lua version: [8]. Theknightwho (talk) 15:34, 9 April 2023 (UTC)
Actually, I've just spotted there's a third div outside of those two which shouldn't be there. My bad. Theknightwho (talk) 15:35, 9 April 2023 (UTC)

Module:data consistency check

We need to be more selective with where we put this. Using it on something like Module:zh/data/st is like throwing a device in a bucket of water so we can be 100% certain we know what the problem is...

It was working fine for over a month, and I spent a long time working on that check. Something new has thrown it off, but it’s important that we keep it. Theknightwho (talk) 22:08, 9 April 2023 (UTC)Reply[reply]
@Chuck Entz - ping failed the first time. Theknightwho (talk) 22:10, 9 April 2023 (UTC)
That's okay, I have you watchlisted. While you're at it, take a look at Category:Proto-Indo-European language- this is the first time we've ever had an out-of-memory error in {{auto cat}}. I suspect it has something to do with family trees, since PIE has the biggest one. Chuck Entz (talk) 22:15, 9 April 2023 (UTC)Reply[reply]
@Chuck Entz If you check Module:zh/data/ts or Module:zh/data/st now, you’ll see there are tons of results, but it runs quickly with a relatively low overhead, all things considered. Seems the serialisation checks are causing problems, which is not surprising given how massive Module:Hani-sortkey/data/serialized is; but using it means we can sort 1.5k Chinese terms with a 1.5MB overhead instead of a 15-20MB one.
The {{auto cat}} issue I’m aware of. It’s because I just consolidated Module:etymology languages into Module:languages, as the modular separation between them was becoming untenable. On pages with very few calls (e.g. 1), require is actually way more memory efficient than mw.loadData - using around 50% less in some cases. Currently, we’re in a brief transitional stage where we’re doing both, so once that’s over these issues should go away. Should hopefully be tomorrow. On a related point, I also rewrote Module:family tree/nested data to be more comprehensive and flexible, but it’s now an even bigger memory hog than it was before. It’s probably why we’re hitting the limit for the first time, as the PIE tree uses 37MB alone. That should drop down to about 25MB with the changes, bringing us well within the buffer. Theknightwho (talk) 22:33, 9 April 2023 (UTC)Reply[reply]
@Chuck Entz The memory footprint of family trees has now dropped significantly, with Category:Proto-Indo-European language now using 42MB; the tree on its own uses 27MB. Incidentally, I think Category:Proto-Niger-Congo language has the largest tree - it's nearly twice the size. Theknightwho (talk) 18:09, 10 April 2023 (UTC)Reply[reply]

Middle Iranian

Just curious what the solution is for Middle Iranian and Old Iranian errors. We also have a ton of categories in Special:WantedCategories that reference 'Middle Iranian languages' and for which {{auto cat}} doesn't work. Benwing2 (talk) 07:34, 10 April 2023 (UTC)Reply[reply]

@Benwing2 I keep meaning to get around to this. It's just me being slow, that's all. The solution is probably to just delete the codes, but I'll put a bar against the category modules creating categories like these if the object is etymology-only. Theknightwho (talk) 18:09, 10 April 2023 (UTC)Reply[reply]
There are also a lot of 'Requests for native script for ETYMLANG terms' as well as 'Requests for Unspecified script for LANG terms' in Special:WantedCategories. The latter should definitely not be generated at all; for the former, either we need to avoid generating them or modify the Requests category code to allow for etym langs (which would be easy but it's not clear to me what the right thing to do is). Benwing2 (talk) 23:06, 10 April 2023 (UTC)Reply[reply]
@Benwing2 I've dealt with the Middle and Old Iranian issue, which involved splitting them out into Module:families/data/etymology (as both families and etym-only languages need to be allowed for them to work). This means terms are now getting categorised in (e.g.) Category:Old Armenian terms derived from Middle Iranian languages. This touches on a wider point about etym-only languages/families, which is that they need to categorise terms into any parent language/family categories as well. That means all those terms would go in Category:Old Armenian terms derived from Iranian languages, as currently that doesn't happen.
This could also be how we handle request categories, but since they're a maintenance issue, we might want to limit it to the non-etym language only. Theknightwho (talk) 23:37, 10 April 2023 (UTC)Reply[reply]
Thank you for dealing with this. I'm a bit confused though about your assertion that we need to categorize e.g. all terms in Category:Old Armenian terms derived from Middle Iranian languages into Category:Old Armenian terms derived from Iranian languages as well. I don't think this is currently how these categories work; in general if something is categorized as being derived from Late Latin, it isn't also placed into the corresponding derived-from-Latin category. Instead the category itself has the derived-from-Latin category as its parent. This doesn't quite seem to be happening with Category:Old Armenian terms derived from Middle Iranian languages, which instead has Category:Old Armenian terms derived from Indo-Iranian languages as its parent; I'm not sure why. Benwing2 (talk) 00:12, 11 April 2023 (UTC)Reply[reply]
@Benwing2 It's because Vahag expressed dissatisfaction at the terms being divided up between the categories, and I'm inclined to agree; it's much more useful to have the terms in one place, in a similar way that "derived from" is a catch-all category.
The reason Category:Old Armenian terms derived from Middle Iranian languages is being put in the Indo-Iranian category is because Middle Iranian has Iranian as its parent, which means that its family is Indo-Iranian (not Iranian). Currently, the boiler is only looking for the family, but we need to make sure it handles the parent as well - which is how etym-only languages are handled. Presumably this is because etym-only families haven't really been dealt with properly until now.
I also suggest we formally disentangle substrates from families, and just make their parent und. If they need to have a specific family, we can just set it in the data. This is now easier, because the etym language data has some of its keys automatically moved around before being returned, to make sure that it's fully compatible with the normal language data.
One last thing that occurs to me is that there's actually no reason why we couldn't use class inheritance during the creation of full language objects, too. For example, it might make sense to use it for Bokmål and Nynorsk, since we also have the code no for Norwegian. Theknightwho (talk) 00:32, 11 April 2023 (UTC)Reply[reply]
Hmm. If we are to add all terms derived from an etym language to the category corresponding to the etym language's parent, I think we should bring this up in the BP first as some people could conceivably object. As for substrates, I have no strong opinions here; go ahead if you want to make changes. As for the catboiler code, the relevant code is around line 1080 of Module:category tree/poscatboiler/data/terms by etymology; I don't think it will be hard to fix it. For regular-language inheritance, sure, I guess if we are stuck with all three of 'no', 'nb' and 'nn' as regular languages, we can make the latter two inherit from the former (although in practical terms, what would this get us?). Benwing2 (talk) 01:46, 11 April 2023 (UTC)Reply[reply]
@Benwing2 I see your point - it could get out of hand. I’ve fixed the issue in Module:category tree/poscatboiler/data/terms by etymology, so (e.g.) Category:Old Armenian terms derived from Middle Iranian languages is now getting categorised correctly. On a related note, I also changed it so that anything with und as a parent gets categorised in its family instead, which catches substrates, as well as a few others like Category:Terms derived from Xiongnu. Otherwise, they’d go in Category:Terms derived from Undetermined (which I don’t think should exist).
This could also come in handy with Norwegian, too: Category:Terms derived from Norwegian Bokmål doesn’t go in Category:Terms derived from Norwegian at the moment, and it probably should. Inheritance would solve that. Plus, it would simplify the data, as it’s a simple way of coordinating languages. Perhaps this is a way to solve the problems we have with langs like Chinese or Tibetan, where multiple L2s sometimes get placed under the same header. I appreciate that this starts to blur the boundary between language and family, but in practical terms we already do that. Theknightwho (talk) 18:51, 13 April 2023 (UTC)Reply[reply]
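The category-parent rule discussed in this thread (use the parent for etymology-only objects, but fall back to the family when the parent is und) could look roughly like the Python sketch below. The record layout and the function name are assumptions for illustration, not the code in Module:category tree/poscatboiler/data/terms by etymology, and the substrate record is a placeholder.

```python
def umbrella_for(record):
    """Pick what the 'terms derived from X' category should be filed under."""
    parent = record.get("parent")
    if parent is not None and parent != "und":
        # Etymology-only languages and families sit under their parent.
        return parent
    # No usable parent (or the parent is und): use the family instead.
    return record["family"]


middle_iranian = {"parent": "Iranian", "family": "Indo-Iranian"}
substrate = {"parent": "und", "family": "Example family"}
regular_language = {"parent": None, "family": "Iranian"}
```

Under this rule Middle Iranian lands under Iranian rather than Indo-Iranian, and anything whose parent is und is filed by family rather than under Undetermined.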

Template:l-lite etc.

Hello, your edits broke stuff again. {{cog-lite|en|test}} now shows as English test, likewise for the other etymology lite templates. – Wpi31 (talk) 04:31, 15 April 2023 (UTC)

Stack Overflows

Your edits to Module:zh-translit seem to be the cause. If it helps any, the error in Appendix:Sinitic Swadesh lists goes away if you remove the language code "hsn" from the template, and the error in water/translations goes away if you remove the line with the Xiang translation. I'm not sure why this is only surfacing on these two pages and Appendix:Xiang Swadesh list. A search for Xiang translations finds 99 pages, some of which are also translation subpages that use {{multitrans}} like water/translations does. Chuck Entz (talk) 06:00, 20 April 2023 (UTC)Reply[reply]

@Chuck Entz Thanks. I introduced a change so that Module:zh-translit will check the term linked by {{zh-see}} if it can't find any instances of {{zh-pron}} on the page, which means it now works for simplified or variant terms as well. Occasionally, there'll be a chain (e.g. simplified forms of variants), so it iterates until it finds {{zh-pron}} (or neither of them). Theoretically, that means it can get into an infinite loop if you point a group of pages at each other with {{zh-see}} without ever using {{zh-pron}}, but of course that should never happen...
It's relevant to note that you will get different results with different lects of Chinese, as it depends on whether the lect is referred to in {{zh-pron}} or not. If it isn't, it moves on to {{zh-see}}. I suspect there isn't a real mistake on the entries here, but rather an instance of two pages missing a Xiang reading, which causes an infinite loop due to missing data. It's not unusual for very common characters to be variants for certain readings, so a pair of these would cause this. Theknightwho (talk) 06:08, 20 April 2023 (UTC)Reply[reply]
I was right: can be a variant of in Min Nan, which can be a variant of , which in turn can be a variant of for a different reading (ad infinitum). I'll put in something to stop that happening, as it's an edge case I hadn't considered. Theknightwho (talk) 06:23, 20 April 2023 (UTC)Reply[reply]
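One way to stop the loop is to remember which pages have already been visited while following the chain. The Python sketch below models pages as plain dictionaries with an optional "pron" table and a single optional "see" target; that data model and the function name are simplifications invented here, not the way Module:zh-translit reads pages.

```python
def find_reading(title, lect, pages):
    """Follow 'see' links until a page has a reading for lect, or give up."""
    seen = set()
    while title is not None and title not in seen:
        seen.add(title)
        page = pages.get(title, {})
        reading = page.get("pron", {}).get(lect)
        if reading is not None:
            return reading
        title = page.get("see")
    # Dead end, missing page, or a cycle of pages pointing at each other.
    return None


pages = {
    "simplified": {"see": "traditional"},
    "traditional": {"pron": {"cmn": "reading"}},
    "loop-a": {"see": "loop-b"},
    "loop-b": {"see": "loop-a"},
}
```

The `seen` set turns a would-be infinite loop into a plain "not found", which also covers the case where a lect is simply missing from every page in the chain.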

Pali/Sanskrit Thai Script PUA Characters

In Module:languages/data/2, you wrote "FIXME: Not clear what's going on with the PUA characters here" apropos U+F700 and U+F70F. The basic issue is that YO YING and THO THAN drop their bottom parts when combined with a mark below. As basic rendering of Thai is fairly old, predating OpenType, there arose the convention of using those two PUA characters for the glyphs thus modified. There's also what I think is the barbarous practice (@Octahedron80 for comment) of using those truncated glyphs for YO YING and THO THAN in all positions when writing Pali. (It's certainly not a universal convention for Pali.) Standards-oriented font creators tend to reject this glyph encoding; the glyphs are highly unlikely to be encoded in Unicode as separate characters on the basis of Pali.

My view is this glyph encoding should be expunged from Wiktionary whenever it crops up. --RichardW57m (talk) 11:27, 21 April 2023 (UTC)

An example of unmodified and modified yo ying in the same word is วิญญู (viññū); an example of the instrumental plural suffixed with the quotative particle, วิญญูหีติ (viññūhīti), can be seen on line at the bottom right of the sample of p5 at . --RichardW57m (talk) 12:03, 21 April 2023 (UTC)Reply[reply]
At thwikt, these two dated letters will only show in headwords, as Thai fonts can natively show them. Traditionally, ญ & ฐ should be "unfooted" even without marks under them. Why? Because old people didn't write the "feet". This applies not only to Pali, but also to Sanskrit. When naming page titles, we just use the normal letters. The dated letters are listed in entry_name, which changes them to the normal ones in case they are linked from somewhere.
Adding VS01 is a newer method, using variation sequences, but currently no standard for Thai has been released and no font supports it yet. (The fonts would be updated after the standard.) PS: VS01 is already used in other scripts. --Octahedron80 (talk) 14:20, 21 April 2023 (UTC)
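As a rough illustration of the entry_name behaviour described above (links containing the dated PUA letters resolve to page titles with the normal letters), here is a small sketch. The pairing of each PUA code point with its standard letter is an assumption, since the thread only names the two code points and the two letters, and `make_entry_name` is a hypothetical helper, not the module's real function.

```python
# Assumed pairing of PUA code points with standard letters; the thread names
# U+F700 and U+F70F but does not say which letter each one stands for.
PUA_TO_STANDARD = {
    "\uF700": "\u0E10",  # truncated glyph -> THAI CHARACTER THO THAN
    "\uF70F": "\u0E0D",  # truncated glyph -> THAI CHARACTER YO YING
}


def make_entry_name(text):
    """Return the page title for a link: PUA letters become normal letters."""
    return text.translate(str.maketrans(PUA_TO_STANDARD))


# viññū written with a normal yo ying followed by a PUA (truncated) yo ying.
display_form = "\u0E27\u0E34\u0E0D\uF70F\u0E39"
print(make_entry_name(display_form))
```

Text that contains no PUA characters passes through unchanged, so the same function can be applied to every link.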

Sinitic pronunciation modules

First off, thanks for your recent work on mod:cmn-pron. While it is possible to improve the other modules in a similar vein, I think ultimately mod:zh-pron will need a major revamp, for which I already have a basic idea of what to do: this involves moving most of the repetitive lect-specific stuff into a data module or the respective lect modules (plus, the mixture of the labels and the if-else conditionals is difficult to maintain), standardizing the interface, etc. This will be a time-consuming job, which I hope to start working on in late May. In the meantime, I don't think there is much point in changing the modules, since they'll probably have to be redone to suit the new interface anyway. – Wpi31 (talk) 17:09, 26 April 2023 (UTC)

@Wpi31 Yes, you're right! My main concern was to develop a more sensible baseline transcription which all the other conversions can work from, which is essentially Pinyin with some of the quirks removed (iu --> iou etc). That's now done, which should make the task of simplifying other parts of the module much simpler.
I've been putting off working on Module:zh-pron as it's so huge, but I'll follow your lead. Theknightwho (talk) 17:13, 26 April 2023 (UTC)Reply[reply]

Options for zh-forms

Hi again,

I created 韶神星 for the asteroid (6) Hebe, but am having trouble with zh-forms. The derivation is not 韶 + 神 + 星, but an abbreviation of 韶華 + 神 + 星. Is there not an option for that in zh-forms, or am I just missing it? kwami (talk) 00:05, 6 May 2023 (UTC)Reply[reply]

@Kwamikagami I don't think there's a dedicated way to do it, but I've changed the gloss for the first character in a way that hopefully makes it clear what's going on. Theknightwho (talk) 00:27, 6 May 2023 (UTC)Reply[reply]
That works! Thanks. kwami (talk) 00:52, 6 May 2023 (UTC)Reply[reply]

Multiword Thai, Khmer, etc. transliterations


Thanks for working on the Russian transliteration scraper. I am more keen about this possibility - multiword Thai, Khmer, etc. transliterations, be it with spaces (or ]][[ between words) and respellings (I am sure it can't be done otherwise - no tool will transliterate Thai with 100% accuracy). Also about using |tr= to respell words with multiple readings. Don't mind negative comments you received. Using Japanese kana to transliterate everything is also desirable. Feel free to experiment. It's up to you but the Russian transliteration is working, even if adding manual translits is a bit of a burden. Also @Benwing2. Anatoli T. (обсудить/вклад) 03:12, 8 May 2023 (UTC)Reply[reply]

Template:zh-wikipedia

Hello, hope you are well. I wanted to ask you if you would take a look at Template:zh-wikipedia. You can see it in use at 中國中国 (Zhōngguó). I believe that the order of Wikipedias on the Zhongguo entry (that I just linked) is not academically or ethically justified. I believe that Wiktionary should follow the order of Wikipedias as at [9], and that Template:zh-wikipedia should "auto-sort" (automatically sort) the words in the order of the Wikipedias that Wikipedia uses, no matter what order the Wikipedias appear in the editor's view in any given Wiktionary entry's Template:zh-wikipedia. (1) Even if you don't agree, would it be technically feasible to code in an auto-sort or something like that on Template:zh-wikipedia? Maybe for a different order? (2) Is the Wikipedia order justified within Wikipedia but yet not justified for sorting the Wikipedias when linked from Wiktionary? (3) Is there perhaps another order that you can think of? Even ones you don't think would be good? I've always felt there was something wrong with what I was doing, and I'd like to see if that resonates at all with you (or anybody reading this). Thanks for any feedback. --Geographyinitiative (talk) 12:46, 12 May 2023 (UTC) (Modified)Reply[reply]
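On question (1), an auto-sort of this kind is simple in principle. The sketch below is only an illustration in Python rather than template or module code: `PREFERRED_ORDER` is a made-up order of language codes (the order used by Wikipedia is not reproduced here), and `sort_wikipedias` is a hypothetical helper.

```python
# Hypothetical preferred order of Wikipedia language codes; the real order
# referred to in the question is not reproduced here.
PREFERRED_ORDER = ["zh", "yue", "wuu", "gan"]


def sort_wikipedias(codes, preferred=PREFERRED_ORDER):
    """Sort codes by the preferred order, whatever order the editor typed."""
    rank = {code: i for i, code in enumerate(preferred)}
    # Unknown codes get a rank past the end of the list; sorted() is stable,
    # so they keep the order in which the editor entered them.
    return sorted(codes, key=lambda code: rank.get(code, len(preferred)))


print(sort_wikipedias(["gan", "xx", "zh", "yue"]))
```

Switching to a different order would only mean changing the list, which is why the choice of order can be discussed separately from the feasibility.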

To the removal of the Proto-Northeast Caucasian and Proto-North Caucasian


  1. Proto-North Caucasian. I think we should remove everything related to Proto-North Caucasian, including everything it contains and Proto-North Caucasian itself, simply on the grounds that it is not a family but a superfamily, and moreover one that has not yet been proven.
  2. Proto-Northeast Caucasian. There are still no good reconstructions of Proto-Northeast Caucasian. Perhaps the only revision of the reconstruction by Starostin and Nikolaev (1994) is the work of Nichols (2003). However, she uses the # sign, introduced by Williams (1989), which marks pseudo-reconstructions. Here (Appendix:Proto-Nakh-Daghestanian reconstructions) everything is incorrectly marked with an asterisk, which can be misleading. This family has not been proven.
  3. Proto-Northeast Caucasian splits into two branches, Proto-Nakh and Proto-Daghestanian. Proto-Nakh is quite well reconstructed, which of course cannot be said of Proto-Daghestanian. For some reasons, Proto-Daghestanian can be kept and not deleted, although it has not been proven either. Recently, however, for example in Schrijver (2021), Proto-Nakh has been compared with Proto-Tsezian and Proto-Avaro-Andian. Proto-Avaro-Andian has not been worked out properly, though, and besides, the Proto-Tsezian forms are adjusted to the Proto-Avaro-Andian data.
  4. In accordance with point 3, the modern comparison is not between Proto-Nakh and Proto-Daghestanian, but directly with the Daghestanian groups (Proto-Tsezian and Proto-Avaro-Andian), which makes one doubt the existence of Proto-Daghestanian.
  5. Proposed trees:
  • Daghestanian ngn 🌳 ngn-pro
    • Avaro-Andian ngn-ava 🌳 ngn-ava-pro
    • Dargic ngn-drg 🌳 ngn-drg-pro
    • Khinalug kjj
    • Lak lbe
    • Lezgic ngn-lzg 🌳 ngn-lzg-pro
    • Tsezic ngn-tsz 🌳 ngn-tsz-pro
  • Nakh nkh 🌳 nkh-pro
    • Bats bbl
    • Vainakh nkh-vay 🌳 nkh-vay-pro

In a sense, such a division will be at odds with macrocomparative comparisons. Gnosandes ❀ (talk) 10:33, 13 May 2023 (UTC)

Module:headword

When you excluded those languages from diacritic categorization, did you also intend to cause upwards of 150 module errors? When I first saw the module errors, I thought they would only be there for an hour or two while you were working on the changes to switch things over, so I didn't say anything. Chuck Entz (talk) 21:37, 13 May 2023 (UTC)Reply[reply]

The problem is that char_category() is local to a portion of the main function when it needs to be pulled out and moved up. Benwing2 (talk) 21:41, 13 May 2023 (UTC)Reply[reply]
@Chuck Entz @Benwing2 Fixed. I've been a bit off my game today, and didn't check CAT:E. Theknightwho (talk) 21:41, 13 May 2023 (UTC)Reply[reply]

Those darn colons

Hi. I'm not sure if this concerns you, but I noticed it first after you added the function of automatically adding categories.

Under the entries 1:a, 1:e, and 1:o there was a red Category:Swedish terms spelled with 1. I added {autocat}, but it only shows one term, 180, and none of those above.

I assume it is those colons that mess it up; it usually is. --Christoffre (talk) 17:04, 14 May 2023 (UTC)

@Christoffre I don't think so - numbers were excluded from all automatic categorisation (which was something that's been the case for a long time), but I've now changed that as I think they're rare enough to warrant categorising: people were manually categorising them anyway, so it's better to just do it properly.
Changes to modules take time to filter through to individual entries, as (literally) millions of pages can be affected. If you purge the cache on a page it'll do an immediate update, though. If you look at the category now, you'll see all three are in it. I'm sure others will filter through in time. Theknightwho (talk) 17:08, 14 May 2023 (UTC)
Ah, it seems like I was just impatient. Thanks for the help anyway! --Christoffre (talk) 17:12, 14 May 2023 (UTC)Reply[reply]

Greek categories in CAT:E

Aside from the module errors, these look wrong. For some reason, they're displaying the vowels with iota subscripts as vowels followed by free-standing iotas. This is deceptive, since there are plenty of examples like Greek Αιγίνης (Aigínis), which aren't in Category:Greek terms spelled with ᾼ.

It must have something to do with fonts, since "Category:Greek terms spelled with ᾼ" displays with the iota subscript in the edit window for me, but not in the preview. What's more, if I use a Greek keyboard to type the Greek part in manually as separate characters, it displays as separate characters but doesn't link to the correct category: Category:Greek terms spelled with Αι even though the spelling looks identical to me in preview. For the record, I'm using a MacBook Pro with an old version of MacOS, and I get the same thing with both Firefox and Safari (not logged in on the latter). Chuck Entz (talk) 21:48, 14 May 2023 (UTC)Reply[reply]

I just noticed that there's also Category:Greek terms spelled with ◌ͅ- why both? Chuck Entz (talk) 21:54, 14 May 2023 (UTC)Reply[reply]
@Chuck Entz So I have a feeling I know what's behind this, but I need to dig through the guts of the category tree to work out exactly what's happening.
In terms of the display, it looks fine for me, but I think this is related to the fact that subscript iota (a combining character) capitalises as a non-combining iota. Something, somewhere is making the assumption that diacritics never change on capitalisation, but this is an instance where they do. However, if that's down to the font you've got, there's not a huge amount we can do.
The reason for having Category:Greek terms spelled with ◌ͅ is because I recently upgraded the standard characters function so that it's no longer restricted to atomic Unicode characters, as they're often pretty arbitrary and not linguistically interesting. Before, any terms in categories for diacritics were only there because the diacritic couldn't form an atomic character with the base character (e.g. there is no Unicode character B̊, so the function was seeing the individual character ̊ and dumping them there instead). Since I made the change, I realised that it's probably quite interesting to see the spread of diacritics, too, so basically did the inverse of the first change by making sure all the atomic characters got categorised by diacritic as well (e.g. Å). However, the diacritic categories only appear if the diacritic isn't otherwise used in the language: so French Ÿ gets the category Category:French terms spelled with Ÿ, but there is no category for French terms spelled with ◌̈, because ◌̈ is a standard diacritic of French. Theknightwho (talk) 22:30, 14 May 2023 (UTC)Reply[reply]
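The categorisation rule described in this reply can be sketched roughly as follows. This is an illustration in Python, not the module's actual code: the function names and the way the standard characters and diacritics are passed in are assumptions, and case handling is left out to keep it short.

```python
import unicodedata


def clusters(term):
    """Split a term into base characters with their combining marks (NFD)."""
    out = []
    for ch in unicodedata.normalize("NFD", term):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out


def character_categories(term, standard_chars, standard_diacritics):
    """List the characters and diacritics that would get a category."""
    cats = []
    for cluster in clusters(term):
        composed = unicodedata.normalize("NFC", cluster)
        if composed in standard_chars:
            continue  # ordinary character of the language: no category
        if composed not in cats:
            cats.append(composed)
        for mark in cluster[1:]:
            label = "\u25CC" + mark  # dotted circle + combining mark
            # A diacritic is only listed if the language does not use it.
            if mark not in standard_diacritics and label not in cats:
                cats.append(label)
    return cats


french_chars = set("abcdefghijklmnopqrstuvwxyzéèêëàâîïôùûüç")
french_marks = {"\u0300", "\u0301", "\u0302", "\u0308", "\u0327"}
print(character_categories("a\u00FF", french_chars, french_marks))
print(character_categories("b\u030A", set("ab"), set()))
```

The first call mirrors the French Ÿ example (the letter is listed, the diaeresis is not); the second mirrors the B̊ example, where both the combination and the bare ring are listed because no precomposed character exists and the ring is not standard.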
Judging by iota_subscript#Computer_encoding, the matter of uppercasing iota subscripts is a big mess. Unicode says to change them to adscripts (what I see), but a lot of usage stays with the subscript (what you see). An adscript isn't the same as an independent letter, which is why my typing the Greek ended up with a redlink. I'm guessing the module errors are due to different parts of the system (module code?) disagreeing on what exact Unicode codepoint an uppercased iota subscript is supposed to be: the part that converts the lowercase iota-subscripted letter to uppercase is using a different codepoint than the part that checks for uppercase vs. lowercase. Chuck Entz (talk) 00:01, 15 May 2023 (UTC)Reply[reply]
I'm sure you're already done for the day/evening, and I'll be gone most of the day tomorrow, but just to be thorough: the alpha with iota subscript copypasted from the displayed title of the page σοφίᾳ (sophíāi) is the precomposed character U+1FB3 (see for the unicode info that displays even without an entry), and the category page has the precomposed character U+1FBC (see ). The first one is described as "GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI", where "ypogegrammeni" is the iota subscript, and the second one is "GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI", where "prosgegrammeni" is the iota adscript. The "Composition" section, however, has "Composition: α [U+03B1] + ◌ͅ [U+0345]" for the first and "Composition: Α [U+0391] + ◌ͅ [U+0345]" for the second. I have no idea how that translates to the codepoints the module sees, but I figure some extra data won't hurt. I would also note that {{uc:ᾳ}} > ΑΙ, which is the U+1FBC codepoint, and {{lc:ᾼ}} > ᾳ, which is the U+1FB3 codepoint. Chuck Entz (talk) 02:31, 15 May 2023 (UTC)
This is getting stale. I just added the precomposed letters to the standardChars of Grek under "el" at Module:languages/data/2, but apparently with no effect (I noticed there were other precomposed letters there already). That's about all I dare to do myself, and even that may need to be reverted if there are unanticipated side-effects. We need to do something, though. Chuck Entz (talk) 01:38, 20 May 2023 (UTC)Reply[reply]
@Chuck Entz I've been putting this off, but I'll look at it now. Theknightwho (talk) 15:45, 20 May 2023 (UTC)Reply[reply]
@Chuck Entz This is now solved. The issue was that Module:category tree/poscatboiler/data/characters checks if single-character inputs are upper or lowercase, and throws an error if they're lowercase (with one edge-case exception if the uppercase form is actually standard, as sometimes happens with ı and I). It does this by running mw.ustring.upper on the input and then checking it's the same, because usually if you run it on a capital letter then nothing happens.
This caused a problem with , because the output from mw.ustring.upper is actually ΑΙ (i.e. it capitalises the subscript iota and splits it out as a separate full character). This isn't an error, as it matches the Unicode specification, but it isn't what we want in this context (because what we really want is title case, but there's no function for that). I've added special-case behaviour for subscript iota so that it remains unchanged.
I'll investigate if there are any other oddities like this: one other I've accounted for so far is ß, which capitalises as SS by default. Instead we use , which is the correct form for contexts like this. Theknightwho (talk) 14:07, 21 May 2023 (UTC)Reply[reply]
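The behaviour described above can be reproduced outside MediaWiki. The sketch below uses Python's `str.upper`, which follows the same Unicode full case mapping; it is not the module's Lua code, and the function names are hypothetical. It leaves the combining subscript iota (U+0345) untouched while uppercasing everything else, which is one possible way to implement the special case. ß is not handled here.

```python
import unicodedata

COMBINING_YPOGEGRAMMENI = "\u0345"  # combining iota subscript


def upper_keeping_iota_subscript(text):
    """Uppercase text, but leave a combining iota subscript as it is."""
    decomposed = unicodedata.normalize("NFD", text)
    pieces = [
        ch if ch == COMBINING_YPOGEGRAMMENI else ch.upper()
        for ch in decomposed
    ]
    return unicodedata.normalize("NFC", "".join(pieces))


def is_uppercase_naive(char):
    """The plain check: uppercasing the input changes nothing."""
    return char.upper() == char


def is_uppercase(char):
    """The same check with the subscript-iota special case."""
    return upper_keeping_iota_subscript(char) == char


small = "\u1FB3"    # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
capital = "\u1FBC"  # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI

print([hex(ord(c)) for c in small.upper()])  # two code points, not U+1FBC
print(is_uppercase_naive(capital), is_uppercase(capital))
```

The naive check rejects U+1FBC because uppercasing it yields a capital alpha followed by a separate capital iota, whereas the adjusted check accepts it.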

unwanted categories in Special:WantedCategories

Please take a look. There are a ton of categories here related to 'LANG terms spelled with X' that shouldn't or arguably shouldn't be there. Most blatantly, Cyrillic is entirely normal for Mongolian. I also don't think digits in terms is remarkable for English. Please review what's there and make appropriate fixes to remove these categories from being created, thanks! Benwing2 (talk) 09:04, 16 May 2023 (UTC)Reply[reply]

@Benwing2 Macedonian's been dealt with (which was an error on my part). Regarding numerical entries, they're not that remarkable in English, but if we take the most frequent number (1) and assume everything in Category:English terms spelled with 1‏ is a lemma (it's not), that only comprises 0.03% of English lemmas. I'd say that's rare enough to be notable. That being said, we may want to consolidate the categories for numerals.
Other than those, I'd say the categories are all fine: Tagalog, Danish and Norwegian don't use "C" in native words, for example. I think there needs to be a special exclusion for entries with "..." from being categorised as having ".", so I will deal with that. Theknightwho (talk) 09:58, 16 May 2023 (UTC)Reply[reply]
I think at most we want one 'terms spelled with numbers' category, although I still believe it's better not to have these; nothing terribly remarkable about a term with a number in it (probably English terms with Q and Z are fairly rare too but not remarkable). Do you mean both Mongolian and Macedonian? Benwing2 (talk) 10:27, 16 May 2023 (UTC)Reply[reply]
Also, all the "terms spelled with ." should go; these are just abbreviations. Benwing2 (talk) 10:28, 16 May 2023 (UTC)Reply[reply]
Also IMO Category:English terms spelled with &. Benwing2 (talk) 10:30, 16 May 2023 (UTC)Reply[reply]
@Benwing2 The only Mongolian categories listed relate to the traditional script, and they're filtering out. We do have a few categories for Mongolian Cyrillic letters, but only for the rare ones used in borrowings: see Category:Mongolian terms by their individual characters.
My initial inclination was to exclude numbers and characters like "." as well, but people were adding them manually anyway. I judged it was better to just handle it automatically, as if we're going to have them we may as well do it properly, and I don't want to waste time periodically deleting them. Ultimately, they're pretty harmless. Theknightwho (talk) 10:42, 16 May 2023 (UTC)
Not sure I agree on Category:English terms spelled with & - we don't usually have entries with it unless it's either obligatory or overwhelmingly common: R & B, &c., R&R and so on. Theknightwho (talk) 10:47, 16 May 2023 (UTC)Reply[reply]
The problem is that every one of these extra categories adds spam to the category list at the bottom of the page. On a page with a period or number in it, there may be several such categories if we allow them. So they're not totally harmless. I have already done runs in the past removing the manually added number categories and can do it again. Benwing2 (talk) 18:20, 16 May 2023 (UTC)Reply[reply]

Chinese links and transliterations

Hi. The punctuation marks are deleted on links but they don't transliterate then. E.g. at how much does it cost#Translations the Chinese/Wu 多少鈔票多少钞票 (1tu-sau 5tshau-phiau) works but 多少鈔票?多少钞票? doesn't. Could you please have a look? Anatoli T. (