Feel free to leave a note. -- Beland (talk) 21:34, 24 March 2018 (UTC)

Spellchecking wiktionary

Had a look at the moss project and was wondering how difficult it would be to run it on wiktionary itself? (A bit strange since it's used as a spellchecker for the project.) –Jberkel 09:40, 3 June 2022 (UTC)

@Jberkel: Turns out it only took a bit of downloading and a few lines of code to do that. An initial run showed promising results, though Wikipedia and Wiktionary have slightly different style guides and content patterns, so there may need to be some tweaking to optimize typo reports. I just added some code to account for the fact that Wiktionary allows curly quote marks - Wikipedia editors wanted to prioritize spell-checking non-quoted material, especially since sometimes typos or obsolete spellings are retained on purpose. Though if you think the quotations on Wiktionary are ready to be spell-checked, I can do that instead, and we can use {{sic}} to tag incorrect-on-purpose quotations. (Quotes over 1000 characters will be spell-checked regardless.) I also expect there will be some overlap with the built-in and your own very helpful wanted-page lists, though I do have some heuristics to try to distinguish actual misspellings from rare or non-English words that simply haven't yet been defined. I've started Wiktionary:Spell check as a collaboration point. Thanks for this excellent suggestion! -- Beland (talk) 20:42, 4 June 2022 (UTC)
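For readers following along, the quote handling described in the comment above could look roughly like the sketch below. It is not the moss code itself: the name `mask_short_quotes`, the regular expression, and the decision to count the quote marks toward the 1000-character limit are all assumptions made for illustration.

```python
import re

# Threshold mentioned in the comment above. Whether the real limit counts
# the quote marks themselves is an assumption of this sketch.
MAX_SKIPPED_QUOTE_LEN = 1000

# Straight double quotes, or curly double quotes (U+201C ... U+201D).
QUOTED_RE = re.compile(r'"[^"]*"|\u201c[^\u201d]*\u201d')


def mask_short_quotes(text):
    """Blank out quoted spans so a spell-checker skips them.

    Spans longer than MAX_SKIPPED_QUOTE_LEN are left in place, so they
    still get spell-checked.
    """
    def replace(match):
        span = match.group(0)
        return span if len(span) > MAX_SKIPPED_QUOTE_LEN else " "

    return QUOTED_RE.sub(replace, text)
```

Calling `mask_short_quotes` on a line before tokenizing it would hide short quotations written with either style of quote mark, while an over-long "quotation" passes through unchanged.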
Wow, that's great, thanks! One immediate thing: it'd be helpful to have these (alternatively) sorted by language, if possible. – Jberkel 07:23, 7 June 2022 (UTC)
@Jberkel: Hmm, interesting suggestion. I think I see how it can be done and made useful for situations where the "typo" is actually just an undefined word and the definitions that use it all belong to the same language. I will try to whip something up when I have some time. -- Beland (talk) 02:44, 26 July 2022 (UTC)
Thanks for updating/regenerating the list; it's quite useful. :) One small thing I notice: it doesn't seem to handle long s, e.g. it parses "aduerſaries" (in ridicle) as "aduer" (ideally it would recognize ſ as s, the way pages with long s automatically redirect after a few seconds). Another idea: a lot of the results are obsolete spellings found in quotations in entries; it'd be useful to separate out potential misspellings / missing entries that are found "in wikivoice", in definitions i.e. in lines starting with # (not #*, #:, etc). - -sche (discuss) 22:03, 23 December 2022 (UTC)
@-sche: Hmm, interesting points. Would forms with long s and obsolete spellings normally get their own entries, or are they not eligible? -- Beland (talk) 22:06, 23 December 2022 (UTC)
If they meet the other criteria for inclusion (mainly, having been used 3 times, not just once), most obsolete spellings that use different letters (e.g. moone as an obsolete spelling of moon, or euery as an obsolete spelling of every) have entries. Certain special codepoints for character variations are ignored, so the entry for ſit is sit (although quotations are not forbidden from being entered using ſit), the entries for ﬁn, ﬆun are fin, stun, etc. (so the script should treat ridicle's aduerſaries quote as containing aduersaries, rather than just cutting the string off at aduer like it currently does).
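A minimal sketch of that folding step, assuming the script is free to normalize text before splitting it into words: Unicode NFKC normalization maps long s (U+017F) to a plain s and expands compatibility ligatures. The helper names `fold_variants` and `words_for_lookup` are hypothetical, and NFKC also folds other compatibility characters (full-width letters, for example), which may or may not be wanted here.

```python
import re
import unicodedata

# Runs of letters only (no digits or underscores), Unicode-aware.
WORD_RE = re.compile(r"[^\W\d_]+")


def fold_variants(text):
    """Fold compatibility characters to their plain equivalents via NFKC."""
    return unicodedata.normalize("NFKC", text)


def words_for_lookup(text):
    """Normalize first, then split, so a long s never ends a word early."""
    return WORD_RE.findall(fold_variants(text))


print(words_for_lookup("aduer\u017faries"))  # ['aduersaries']
```

Normalizing before tokenizing is the design point: any tokenizer that only knows ASCII letters then sees aduersaries as one word and can look it up as usual.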
My rationale for splitting "obsolete spellings in quotes" (especially obsolete spellings in quotes that aren't even English, e.g. the 1370s Galician quote(s) using caualo) from "potential typos in wikivoice / definitions" is that the first set is often not an error, and entering them is a relatively low priority (we probably already have entries for the non-obsolete spellings, like moon), whereas the second set is higher priority: an undefined word in a definition is either a typo we should fix or a valid word we should probably define if we're using it in definitions. - -sche (discuss) 22:22, 24 December 2022 (UTC)
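The split suggested here could be sketched as a simple line classifier over an entry's wikitext. This is only an illustration: the function name `split_wikivoice` is hypothetical, and treating nested sub-senses (lines starting with ##) as definitions is an assumption on top of the "# but not #*, #:" rule described in this thread.

```python
import re

# One or more "#" that are NOT followed by "*", ":" or another "#".
# The lookahead excludes "#" so the regex cannot backtrack from "##*"
# to a shorter match that would look like a definition.
DEFINITION_RE = re.compile(r"^#+(?![#*:])")


def split_wikivoice(wikitext):
    """Return (definition_lines, quotation_or_example_lines).

    Lines that do not start with "#" are ignored in this sketch.
    """
    definitions, quoted = [], []
    for line in wikitext.splitlines():
        if not line.startswith("#"):
            continue
        if DEFINITION_RE.match(line):
            definitions.append(line)
        else:
            quoted.append(line)
    return definitions, quoted
```

Unknown words found in the first list would go on the higher-priority report, and those found only in the second list on the lower-priority one.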
@-sche: Gotcha, that makes sense. Work is about to get fairly busy, but I'll put this on my to-do list. -- Beland (talk) 16:32, 28 December 2022 (UTC)