User talk:OrphicBot/Res Agenda

@Isomorphyc: Your planned project to import stubs for Latin, Ancient Greek, and Old Norse is quite interesting to me. I don't think anyone's done that kind of thing for a while, but it's a technique well suited to languages whose best dictionaries are out of copyright. That said, how much human attention would the final products require? And what would the format of the input have to look like? —Μετάknowledge^{discuss/deeds} 20:02, 19 September 2016 (UTC)

@Metaknowledge: I was just going to use the Perseus files, which I have. This is something similar I made a while ago: User:Isomorphyc/Latin_Wiktionary_Misses. It lists all the Latin words I could find that we are missing from about a dozen famous books. Either our Latin section is good enough, or my local stemmer is bad enough, that it seems to be only 2000 words that are missing. If I expanded it to the full corpus I have locally, it would probably get into the tens of thousands, but I would have trouble curating a list like that. For Greek and Old Icelandic, I would probably start by trying to curate a list of the top 5000 words in books which I have read, then expand it to non-hapax legomena in the same works, which ought to be a dozen thousand or so. I'm probably pretty comfortable signing off on about a hundred words a day, provided the computer can generate the stub and I already know the word. I'd probably want to write a Lua stemmer for Old Icelandic before I start, which isn't a big problem. A more significant problem is parsing declension/conjugation information correctly out of LSJ, and perhaps a still bigger one is shoe-horning it into our existing templates/modules, which are a bit Byzantine and rather exigent about esoterica. In both languages I would probably want to check my stemmer's output against a corpus, and then both against Wiktionary's stemmer's output, before committing to principal parts, which isn't a big deal in Icelandic, but is more work in Greek, and always a slightly aleatory process given problems of textual corruption, homonyms, dialects and spelling variants. Another problem is being quite sure about accurately extracting Greek vowel quantities from LSJ, since LSJ both omits 'obvious' macrons according to a somewhat complicated set of rules which is never explicitly stated, and includes a number of scanning errors, whereas Wiktionary requires all ambiguous vowels in Greek to be marked with either macrons or breves.
One of the really nice things about our Greek section is that while it is not large, it is quite genteel and many of the entries are truly excellent in their precision. This is something I would not want to compromise as I add to it.

Thank you for taking an interest in this small task. I am afraid this note has become a bit gangly, for which I apologise. I did not intend to write quite so much about uncertainties. I was afraid others might think it an abomination, which is partly why I hadn't really wanted to discuss it till I had something more concrete. Isomorphyc (talk) 20:51, 19 September 2016 (UTC)
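For readers following the thread, the word-list step described above (take the most frequent missing words first, then widen to everything that is not a hapax legomenon) can be sketched roughly as below. This is only an illustration, not the code used for the Latin misses list: the function names, the `known_words` set and the sample text are all hypothetical, and it counts surface forms rather than lemmas, so a real run would need a stemmer in front of it.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Normalise to NFC, lowercase, and split into runs of letters."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)


def candidate_words(text, known_words, top_n=5000):
    """Return (top_n most frequent missing words, all missing non-hapax words).

    `known_words` is a hypothetical set of forms that already have entries.
    Ties in frequency are broken alphabetically so the output is stable.
    """
    counts = Counter(tokenize(text))
    missing = {w: c for w, c in counts.items() if w not in known_words}
    ranked = sorted(missing, key=lambda w: (-missing[w], w))
    non_hapax = [w for w in ranked if missing[w] > 1]
    return ranked[:top_n], non_hapax


if __name__ == "__main__":
    sample = "arma virumque cano arma cano troiae arma"
    top, non_hapax = candidate_words(sample, known_words={"cano"}, top_n=2)
    print(top)        # most frequent missing forms
    print(non_hapax)  # missing forms that occur more than once
```

Working from frequency first keeps the list short enough to sign off by hand, which is the constraint mentioned above; the non-hapax list is the larger second pass.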

This is excellently thought out. I was somewhat concerned about stubs being left to linger with inadequate inflectional information, etc. (as was done based on translations many years ago, leaving some that still remain), but the method you've outlined is excellent. Thoughts for further expansion with regard to Latin: our Latin section doesn't need as many entries as our Greek, but the entries that do exist generally need more senses (and ideally quotations linking to Wikisource); I wonder if those could be pulled from Perseus as well? Also, SemperBlotto's Latin inflection bot naturally doesn't create inflected forms only used in poetry, but I wonder if there's a corpus for those, or if they could be pulled from your local corpus to be created (I once found a list of syncopated forms used by Catullus and created them all (an example), but getting those by other major poets, especially Vergil, would be excellent). —Μετάknowledge^{discuss/deeds} 21:14, 19 September 2016 (UTC)

@Metaknowledge: Thank you; it is, however, easier to plan these things than to do them. Mainly my importations will be stubs in the sense that I will likely provide glosses rather than proper entries. But I think the inflectional and similar mechanical information is what computers are good for.

For Latin, quotations will be difficult to collate against specific senses, although it is not wholly impossible. Even Perseus seems only able to find its citations about 75% of the time. I suspect we could get to 90% quite easily with the Perseus corpus or the corpus at thelatinlibrary.com. Is Wikisource comparable, and is it easy to link to a specific sentence once it has been located? Are you comfortable with that direct a copy of the L&S quotations? Most likely I would put them in their own header, hoping users would move them, since word sense disambiguation would likely be unwanted.

For forms: I've noticed there are a reasonable number of Latin forms by now in lemma tables which are not links. I ought to create pages for these, just to get used to manipulating the format, assuming SemperBlotto and other users do not mind. I assume you want only the attested poetic inflections? The existing templates do have amavere-type forms, but this is perhaps the only one. Would the conjugation template/module have to be rewritten to accommodate outgoing links for these forms from the table? Would they be included with attestation parameters? One really just has to write a stemmer that can find them in the texts, which is only moderately difficult.

It is very amusing to think about these things; thank you for bringing up these topics! Isomorphyc (talk) 01:20, 20 September 2016 (UTC)

Re collating quotations: Isn't what we do in Greek entries pretty similar (except, of course, done by hand)? I also don't really know how complete Wikisource is, but it seems to have a goodly amount and there's some infrastructure here for linking to it in quotations (see MOD:Quotations/la/data).

Re forms: Yes, uncreated Latin forms in tables should be dealt with. Last I heard, Semper is happy to pass on the responsibility of dealing with that. You'll just have to make sure that your bot doesn't create any inflected forms with breves or other nonstandard characters, which some anon has occasionally been slipping into declension tables. I do think that only attested poetic inflections should be created, and I don't think they need to go on the tables. —Μετάknowledge^{discuss/deeds} 01:54, 20 September 2016 (UTC)
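As a rough illustration of the breve check mentioned above, a bot could screen each form taken from a declension table before creating a page for it. The sketch below is hypothetical and not part of any existing bot. It assumes that page titles should carry no length marks, so macrons are stripped, and that anything left over that is not a plain ASCII letter (a breve, a diaeresis, stray punctuation) should be held back for a human to look at.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON


def page_title(form):
    """Strip macrons from a table form (assumed convention: unmarked titles)."""
    decomposed = unicodedata.normalize("NFD", form)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))


def is_safe_title(form):
    """True only if the macron-stripped form consists of plain ASCII letters.

    Forms containing breves or other unexpected characters return False,
    so they can be skipped and reviewed by hand rather than created.
    """
    title = page_title(form)
    return title.isascii() and title.isalpha()


if __name__ == "__main__":
    for form in ["amāvēre", "amăvere", "aër"]:
        print(form, page_title(form), is_safe_title(form))
```

The check is deliberately strict: it will also reject legitimate spellings with a diaeresis, on the view that a skipped form costs less than a wrongly titled page.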

@Metaknowledge: For quotations: I never directly compared the LSJ entry with the Wiktionary entry. I have been favourably impressed two or three times that the delineation of senses was better in Wiktionary than I remember in LSJ, but certainly not always. I am not surprised-- but speaking personally I always read the dictionary entries but usually find my own quotes.

For Wikisource: I was about to say that Wikisource does not appear to have either Julius Caesar or Cicero. But I found De Bello Gallico and most of Cicero and Vergil, at least, at the Latin language Wikisource. For example, here: [[1]]. Is this the correct place to link? I think this can be done partially (for example, on a per work basis) very easily, and fully, with some difficulty. It's worth pointing out that we can also have more quotes than L&S, and especially for rare words we can have all of them. More attractively to me, we could quote full sentences. To my mind, the need in L&S to be so brief also makes the quotes more difficult to read than they need to be.

For the inflections-- I will go ahead and add working on the inflections to my to-do list. Isomorphyc (talk) 02:58, 20 September 2016 (UTC)

Yeah, using {{Q}} to point to la.wikisource.org. I have a lot of thoughts on other things to do with the classical languages, but this seems like more than enough to add to your plate for now. Thank you for responding so fully to everything and keep up the high quality work! —Μετάknowledge^{discuss/deeds} 03:11, 20 September 2016 (UTC)

These occasional wide-view conversations are very helpful sometimes. I had the Res Agenda file mainly for my own reference, but thank you for noticing it! Hopefully a few of these will have prototypes in the coming months. Take care! Isomorphyc (talk) 03:24, 20 September 2016 (UTC)
