Wiktionary talk:Votes/2014-10/Request for bot status: WingerBot

Irregular pronunciations and transliterations

@Benwing and @Wikitiki89 I agree to the full automation of Arabic (MSA) transliterations, now that the module does a great job. One small concern is words with irregular pronunciations, notably loanwords: long vowels pronounced as short, the vowels "o" and "e", and words where ج and غ are pronounced as "g" rather than "j" and "ḡ" (using translit. symbols for simplicity). It's minor; I would just like to know whether you wish to preserve the manual transliterations for those or replace them with automatic ones. --Anatoli T. (обсудить/вклад) 23:54, 20 October 2014 (UTC)

I think that wherever we already have a manual transliteration for these, we should leave it be, but I don't think there is a problem with loanwords without a manual override. --WikiTiki89 00:03, 21 October 2014 (UTC)
What's the plan for the bot? Replace all transliterations? If all transliterations are left as they are, a lot of them will remain non-standard. --Anatoli T. (обсудить/вклад) 01:59, 21 October 2014 (UTC)
By "for these" I meant loanwords. --WikiTiki89 02:05, 21 October 2014 (UTC)
I guess it's a question for Benwing. I mean, does the bot job include these cases? E.g. if it comes across مُودِم m (mūdim, modem), will it try to remove the transliteration now that diacritics are present, to make it مُودِم m (mūdim)? What would it do if diacritics are missing? I don't want to complicate things; it's probably easier to insert diacritics where necessary and use automatic transliterations. I'm just curious about the intended actions. The project page doesn't describe them very well, and the bot will not know whether a word is an exception or just has an erroneous transliteration. --Anatoli T. (обсудить/вклад) 02:31, 21 October 2014 (UTC)
The bot attempts to vocalize based on the transliteration. Currently it won't match up a short vowel in translit with a long vowel in Arabic; instead, it flags them so that I can review them and decide whether the translit is erroneous or it's a loan word. I may allow it in the future to match more leniently so loanwords get vocalization. As for removing transliteration, it only does that when the manual transliteration is identical to the automatic one (after canonicalizing the manual translit to handle the varying ways that manual transliteration is written). The canonicalization makes use of the Arabic, so it's able to determine e.g. when a capital letter should be converted to an emphatic or when something like 'sh' should be converted to 'š' and when it should be left alone. However, it only canonicalizes when it's safe to do so, meaning it won't replace short vowels with long ones or vice versa, and it won't replace o with u or e with i, and it won't change g to j or anything like that. Part of the bot is a big table of canonicalizations, and it's able to canonicalize both with and without the Arabic available, but in the latter case it's more conservative (in this mode it will always leave 'sh' alone, for example).
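To make the "safe canonicalization, then remove only if identical" rule above concrete, here is a minimal Python sketch. It is not the bot's actual code: the function names, the tiny rewrite table, and the "is the Arabic letter present anywhere in the word" test are all assumptions made for illustration (a real implementation would presumably align the transliteration against the Arabic letter by letter). The automatic transliteration is passed in as a plain string rather than computed, and treating capitals like 'sh' when no Arabic is available is likewise an assumption.

```python
import unicodedata

# Hypothetical table of context-dependent rewrites:
# (manual spelling, canonical form, Arabic letter that must be present).
# None of these changes vowel length or quality, and none turns g into j.
CONTEXT_REWRITES = [
    ("sh", "š", "ش"),
    ("S", "ṣ", "ص"),
    ("T", "ṭ", "ط"),
    ("D", "ḍ", "ض"),
    ("H", "ḥ", "ح"),
]


def canonicalize(manual, arabic=None):
    """Canonicalize a manual transliteration using only 'safe' rewrites.

    Without the Arabic text this sketch is conservative and applies none of
    the context-dependent rewrites, so 'sh' is always left alone.
    """
    result = manual.strip()
    if arabic is not None:
        for variant, canonical, letter in CONTEXT_REWRITES:
            # Simplification: only checks that the letter occurs in the word.
            if letter in arabic:
                result = result.replace(variant, canonical)
    return unicodedata.normalize("NFC", result)


def is_redundant(manual, auto, arabic=None):
    """True if the canonicalized manual translit equals the automatic one."""
    return canonicalize(manual, arabic) == unicodedata.normalize("NFC", auto)


if __name__ == "__main__":
    # 'sh' becomes 'š' because the Arabic contains ش, so the translit is redundant.
    print(is_redundant("shams", "šams", "شَمْس"))
    # 'modem' is never rewritten to 'mūdim', so the manual translit is kept.
    print(is_redundant("modem", "mūdim", "مُودِم"))
```

In this sketch a transliteration is dropped only when `is_redundant` returns true; anything that differs from the automatic form after canonicalization, such as a loanword spelling with "o" or "e", is left in place.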
In the case of e.g. مودم (mūdim, modem), what I will probably have the bot do is first split into something like مودم (mūdim or modem), and then attempt to vocalize the headwords (which will succeed in the first case but fail in the second unless I allow lax matching), then check for redundant transliterations, which will remove the first one but not the second. (BTW I'm trying to get @CodeCat to agree to a change that will avoid displaying the same headword or translit twice in case of redundancies.) In a case like مُودِم m (mūdim, modem), if it's not possible to split the transliteration into multiple heads, it definitely will not remove the transliteration, since it won't be the same as the auto-generated translit.
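The split-then-prune step described above could look roughly like the following Python sketch. The function names are hypothetical, the separators (a comma or the word "or") are an assumption about how such fields are written, and the vocalization step is not modeled at all; the automatic transliteration is again passed in as a plain string.

```python
import re
import unicodedata


def split_translits(manual):
    """Split a field like 'mūdim, modem' or 'mūdim or modem' into its parts."""
    parts = re.split(r"\s*,\s*|\s+or\s+", manual.strip())
    return [unicodedata.normalize("NFC", p) for p in parts if p]


def plan_heads(manual, auto):
    """Return one entry per head: None if its translit is redundant
    (identical to the automatic one), otherwise the manual translit to keep."""
    auto = unicodedata.normalize("NFC", auto)
    return [None if t == auto else t for t in split_translits(manual)]


if __name__ == "__main__":
    # The first head matches the automatic translit and is dropped;
    # the second adds information and is kept.
    print(plan_heads("mūdim, modem", "mūdim"))
```

With the example from this thread, the first head loses its manual transliteration because it matches the automatic one, while "modem" survives as a manual override on the second head.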
The upshot is that the bot only does things that are "safe", and I'm pretty careful in how I define "safe" operations. It will canonicalize transliterations to the extent it is safe to do so, to remove much of their "non-standard" nature. It won't remove manual transliterations that add something, only redundant ones.
The code is now on GitHub: [1] Benwing (talk) 11:17, 21 October 2014 (UTC)