Howdy! I have been on a bit of a transliteration binge recently, and I had a few questions I thought you might be able to answer (in part because you worked on this some).
I was looking at Module:uga-translit and wondering:
- Do we have a decision about the correct transliteration of π (ΚΎa/αΊ£), π (αΊ/αΈ«), π (ΚΎi/α»/i), π (ΚΎu/ủ), π (sβ/Ε)?
- Why was this never fully implemented?
In the same vein of deciding on specific transliterations, I wrote Module:Ital-translit based on Appendix:Old Italic script and was wondering:
- Do I need a vote to decide on particular encoding/transliteration principles for certain languages? For instance, the South Picene lemma mefiΓn (which I want to move to Ital) could be lemmatized:
- ππβπππ (me iΓn) with β and π (which looks like the form used in South Picene)
- ππ:πππ (me:iΓn) with a colon
- ππβπππ (me iΓn) with π (the Unicode character encoded for Γ, but that doesn't look like the form in South Picene)
- ππ:πππ (me:iΓn) with π and a colon
- What do I need to change to get both β & : to be transliterated as f?
- Ital-translit currently has a standard behavior for all Ital characters and then exceptions by language. This means that if a character which is not in a particular language's sub-alphabet is added, it will still be transliterated using the standard correspondence. Should I disallow this behavior and only permit transliteration of characters within a language's sub-alphabet?
Sorry for all the questions, but I thought you might have useful answers/opinions.
I can't help you at all on the first part, sorry.
For the Italic alphabets, the common set was chosen so that it could apply for all languages. If it doesn't apply to all languages equally, then it shouldn't be in the common set. Alternatively, you could transliterate the language-specific features first, and let the common set handle whatever remains after that.
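A minimal sketch of that ordering, in Python for brevity (the real modules are Lua, and every mapping and language code here is made up for illustration):

```python
# Hypothetical mappings: uppercase Latin stands in for Old Italic letters.
COMMON = {"A": "a", "B": "b", "F": "f"}   # shared transliteration table
PER_LANGUAGE = {
    "spx": {":": "f"},                    # e.g. a South Picene-only rule
}

def tr(text, lang):
    """Transliterate, applying language-specific rules before the common set."""
    # Language-specific features first...
    for src, dst in PER_LANGUAGE.get(lang, {}).items():
        text = text.replace(src, dst)
    # ...then the common set handles whatever remains.
    for src, dst in COMMON.items():
        text = text.replace(src, dst)
    return text
```

Because the language-specific pass runs first, a character like the colon can map to "f" in one language without the rule leaking into the common set.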
Something you need to be careful with is using gsub with '.' to replace multiple-character combinations: '.' matches only a single character, so that's not going to work. Sadly, extending it to '..' will not work either, in case you were thinking of that. The way I handle these situations is a bit more elaborate, but it works much better at least.
- "rest" contains characters yet to be processed, "parts" is a table containing characters or sequences that were recognised.
- Look at the "rest" string for the longest match with each one of the character search sequences.
- Once the longest match is determined, insert that into the list of parts. If no match was found at all, just insert the first character.
- Remove the processed characters from "rest".
- Repeat until "rest" is empty.