Module talk:ja


test cases

Thanks ZxxZxxZ! Wyang (talk) 04:13, 11 April 2013 (UTC)

No problem. I hope it works correctly, since I don't know Japanese and can't test it. --Z 04:19, 11 April 2013 (UTC)
I noticed this: ["ヴ"]="vye",["フ"]. It's incorrect: ヴ is "vu" but becomes "v" before any small vowel letter (ヴャ - vya); the same goes for "フ" (fu). --Anatoli (обсудить/вклад) 04:23, 11 April 2013 (UTC)
That's not the actual conversion. In the function kata_to_romaji(f) the three-character string is detected as "'([ヴフ])ィェ'"; e.g. ツュチフィェ -> Module error: The function "kana_to_romaji" does not exist. (a hypothetical sequence, of course). Wyang (talk) 04:26, 11 April 2013 (UTC)
I haven't changed it because it's not clear how the "kr3" function is used. Perhaps "ィェ" should be "ye" but "ヴ" and "フ" just "vu" and "fu". "ヴ" and "フ" become "v" and "f" in front of the small vowel letters ァ, ィ, ェ and ォ (and those with "y"). The same is true for a few other letters, like ツ, デ, ト, ド, etc., which are sometimes used in loanwords. --Anatoli (обсудить/вклад) 04:37, 11 April 2013 (UTC)
kr3 is used only if ヴ/フ is followed by ィェ. ヴ is still -> "Module error: The function "kana_to_romaji" does not exist.". Wyang (talk) 04:40, 11 April 2013 (UTC)
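Editorial sketch of the rule Anatoli describes above (ヴ and フ lose their "u" before small vowel letters, with ィェ read as "ye"). This is Python rather than the module's Lua, and the tables and the name romanize_vf are made up for illustration; it is not the module's kr3 logic.

```python
# Hypothetical tables for illustration; not the module's actual data.
BASE = {"ヴ": "vu", "フ": "fu"}
SMALL = {
    "ィェ": "ye",  # two small letters read together
    "ァ": "a", "ィ": "i", "ェ": "e", "ォ": "o",
    "ャ": "ya", "ュ": "yu", "ョ": "yo",
}

def romanize_vf(text):
    """Romanize ヴ/フ, dropping their "u" before small vowel letters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in BASE:
            # Try the longer small-letter sequence (ィェ) before single letters.
            for size in (2, 1):
                suffix = text[i + 1:i + 1 + size]
                if suffix in SMALL:
                    out.append(BASE[ch][0] + SMALL[suffix])
                    i += 1 + len(suffix)
                    break
            else:
                # No small letter follows: keep the full "vu" / "fu".
                out.append(BASE[ch])
                i += 1
        else:
            out.append(ch)  # characters outside the sketch pass through
            i += 1
    return "".join(out)
```

With these tables, a bare ヴ stays "vu", ヴャ gives "vya", and フィェ gives "fye".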

Moved from User talk:ZxxZxxZ

Hi. Thanks for your help there. There is one more debugging request: for function M.romaji_to_kata(f), I want to have the string replaced using rk4, rk3, rk2, rk1 sequentially, like the previous one. When I invoke it, however, "tsyuchifye" should generate "ツュチフィェ" but it instead now generates "ツュチフィェ". Could you please take a look and see where it went wrong? Thanks. Wyang (talk) 04:48, 11 April 2013 (UTC)

NP. I just took a look at how the Japanese writing system works to see if there are better ways to convert terms. One thing I noticed is that romaji looks to be irreversible (for example, wi may be both ヰ and ウィ), so is it really possible to convert from romaji to katakana? --Z 05:13, 11 April 2013 (UTC)
Ah, I should have removed these one-to-many ones (also zu). It is convertible. ヰ is obsolete in modern Japanese, so wi should be mapped to ウィ. Similarly zu should correspond to ズ, not ヅ. I suppose there are alternative ways of writing this, by analysing what follows the consonant. It definitely requires more work; don't know if that would work though. Wyang (talk) 05:21, 11 April 2013 (UTC)
It's irreversible, unfortunately, or at least may not be very accurate. People can usually bear with "おう" being converted to "ou" in words where it should be "ō", but the other way around is worse. We don't romanise "東京" (とうきょう) as "toukyou" but as "Tōkyō", yet "大きい" (おおきい) is also romanised with "ō", as "ōkii". The letter "ヅ" can be romanised as "dzu" to distinguish it from "ズ" (so it's used when typing), but usually it's "zu". --Anatoli (обсудить/вклад) 05:31, 11 April 2013 (UTC)
Katakana/hiragana to romaji would be useful to create romaji transliteration and romaji entries, so would katakana to hiragana (to build sorting keys in categories). Not sure about hiragana to katakana but most animals, onomatopoeia, etc. have variant spellings in katakana. --Anatoli (обсудить/вклад) 05:36, 11 April 2013 (UTC)
This shouldn't pose a difficulty, if the algorithm is: 1) de-macron, "ō" -> "ou", 2) "to" -> "と", 3) "o" -> "お". Wyang (talk) 05:40, 11 April 2013 (UTC)
You probably missed the part about "ōkii": it's おおきい (ookii), not おうきい (oukii). I don't understand what you meant by 2) "to" -> "と", 3) "o" -> "お". --Anatoli (обсудить/вклад) 05:46, 11 April 2013 (UTC)
I see what you mean. Macron 'o' is essentially a conflation of the combinations 'oo' and 'ou'. There would be no ambiguity if "ō" is disallowed in the input from the beginning (or if not disallowed, set to 'ou' by default unless specified, as 'ou' from Sino-Japanese words would greatly outnumber 'oo' which is mainly of native origin). Wyang (talk) 06:03, 11 April 2013 (UTC)
What I mean is, when this is used at romaji entries: Tōkyō may be used with {{ja-romaji}} with no specifications (as ō is by default 'ou') and this produces とうきょう, but ōkii has to have {{ja-romaji|rom=ookii}} or {{ja-romaji|hira=おおきい}} to limit it to 'oo'. Wyang (talk) 06:08, 11 April 2013 (UTC)
(before edit conflict) That's just how it is, the standard is to use "ō" here and in many publications. Notable exceptions: the particles "は" (letter "ha") "へ" (letter "he") are transliterated as "wa" and "e", letter "を" is transliterated as "o", not "wo" (in any position). The tool can still be useful if the transliteration standard is not changed but will require manual override. Archaic letters can be ignored in romaji-kana conversion.
(after edit conflict) I see what you mean. We could have additional params for back translit but it remains to be seen where and how these modules are used, so that an adjustment or a collective decision could be made. We didn't use automatic transliteration before, so... --Anatoli (обсудить/вклад) 06:17, 11 April 2013 (UTC)
BTW, I don't know if a conversion table is necessary for ACCEL creation of JA entries (as a script like Template:ja new (Japanese version of Template:cmn new) will not be using Lua). But if you do, we could agree to type romaji like "Toukyou" and "ookii", so that the conversion to hiragana happens correctly. --Anatoli (обсудить/вклад) 06:31, 11 April 2013 (UTC)
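Editorial sketch of the "ō defaults to ou, unless overridden" idea from Wyang's comments above. It is Python, not the module's Lua; the function name expand_macrons is hypothetical, and its rom parameter only imitates the rom= override mentioned for {{ja-romaji}}. Other long vowels are left out because the thread only discusses ō.

```python
MACRON_O = "\u014d"  # the letter ō

def expand_macrons(romaji, rom=None):
    """Return macron-free romaji suitable for conversion to kana.

    romaji: the headword as written, e.g. with ō.
    rom:    optional explicit spelling (hypothetical override); when given,
            it wins over the default expansion.
    """
    if rom is not None:
        return rom.lower()
    # Default assumption from the discussion: ō stands for "ou".
    return romaji.lower().replace(MACRON_O, "ou")
```

Under this default, Tōkyō expands to "toukyou", while ōkii would wrongly become "oukii" unless the caller passes rom="ookii", which is the manual override the thread says such words need.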

to do

1)

Needs to convert string-final "n" to ン in kana_to_romaji(f). I added

if mw.ustring.sub(text,mw.ustring.len(text),mw.ustring.len(text)) == 'n' then text = (mw.ustring.sub(text,1,mw.ustring.len(text)-1) .. "ン") end

but it didn't work. Wyang (talk) 21:17, 11 April 2013 (UTC)

2) hidx

3) geminate consonants (done) Wyang (talk) 04:14, 12 April 2013 (UTC)

1) You mean when it is at the end of the word it should be ン? --Z 04:52, 12 April 2013 (UTC)
Yes. See the testcases. shinkansen is not converted correctly. I think converting final 'n' to ン prior to list conversion would solve that problem. Wyang (talk) 04:55, 12 April 2013 (UTC)
I'm not sure what the exact problem is with ン, but ン is ALWAYS "n", also in front of ナ, ニ, etc. It gets an apostrophe ' in front of any (large) vowel - 遠泳 (えんえい = en'ei). Small ones are not used after ン, and we don't ever romanise it as m or ng. --Anatoli (обсудить/вклад) 05:01, 12 April 2013 (UTC)
See Module talk:ja/testcases. shinkansen -> シンカンsエン. I guess it's because 'en' is converted first and there is nothing to convert the remaining 's' to. Although converting the final 'n' to ン first doesn't seem to work either. Wyang (talk) 05:06, 12 April 2013 (UTC)
n/s case fixed. --Z 05:23, 12 April 2013 (UTC)

Thanks! Looks like everything listed has been done now (1,2,3). Wyang (talk) 05:25, 12 April 2013 (UTC)
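Editorial sketch of one way to avoid the shinkansen -> シンカンsエン problem discussed above: scan the romaji left to right and always take the longest key that matches at the current position, so that "se" is consumed before a stray "en" can be. This is a swapped-in technique written in Python, not necessarily how the n/s case was actually fixed in the Lua module; the table is a tiny made-up subset.

```python
# Hypothetical subset of a romaji-to-katakana table.
ROMAJI_TO_KATA = {
    "shi": "シ", "tsu": "ツ", "chi": "チ",
    "ka": "カ", "se": "セ", "na": "ナ",
    "n'": "ン",  # n + apostrophe before a vowel, as in en'ei
    "a": "ア", "i": "イ", "e": "エ", "n": "ン",
}
MAX_KEY = max(len(key) for key in ROMAJI_TO_KATA)

def romaji_to_kata(text):
    """Convert romaji to katakana by left-to-right longest match."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in ROMAJI_TO_KATA:
                out.append(ROMAJI_TO_KATA[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # no key matches: keep the character as-is
            i += 1
    return "".join(out)
```

Because "na" is tried before "n", a plain "n" only becomes ン when no n-row syllable starts at that position, which covers both word-final and pre-consonant ン without a special case.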

Documentation

Can somebody please write the documentation? I would, but I don't know how everything in it works. Please, we need to be careful to document our modules so others can use them more easily. —Μετάknowledgediscuss/deeds 04:51, 15 April 2013 (UTC)

Excellent. Thank you! —Μετάknowledgediscuss/deeds 20:16, 15 April 2013 (UTC)

romanization of ~っ

@TAKASUGI Shinji I'm not sure 't' is suitable either; なーんてねっ (nānte net) seems odd to me. I chose "h" so that あっ would become ah (I had totally forgotten about h as another method of romanizing long vowels).

Also, FWIW, I rewrote the romanization code recently and the old code simply didn't romanize ~っ at the end of a phrase at all, i.e. あっ (a), which I thought was somewhat problematic. What is your opinion on that behavior? —suzukaze (tc) 07:46, 22 January 2017 (UTC)

@Suzukaze-c: As you know, there is no established transcription for the final っ. I used t because it matches well at least for あっという間. Some scholars use q ([1]), which is based on the long tradition of the phonemic notation /q/ but may look too exotic. Others use an apostrophe ([2], [3]). @Atitarev, Eirikr, Haplology, Wyang, エリック・キィ: What do you think of romanization of the final っ? — TAKASUGI Shinji (talk) 09:20, 22 January 2017 (UTC)
This was discussed before: Wiktionary:Tea room/2014/August#六. Wyang (talk) 09:28, 22 January 2017 (UTC)
I’d forgotten it, thanks. We just omitted the final っ until the revision as of 2016-11-10T11:11:06, which used #. — TAKASUGI Shinji (talk) 09:57, 22 January 2017 (UTC)
"#" was a short-lived personal experiment that got published by accident; please don't mind that part.
I was unaware of both the discussion and the policy, thanks. A lot of policies here seem kind of outdated in comparison with current practice though, for example the points under 'relaxed rules' and the entry layout on Wiktionary:About Japanese. Can we use this opportunity to consider changing the policy on final っ? —suzukaze (tc) 10:05, 22 January 2017 (UTC)
  • There's policy, and then there's the technical side. I think the policy arose in part because omitting it is much easier -- if we make it "t" instead, or "h" instead, there are all kinds of odd corner cases that go funny, as explored in this current go-round.
So long as those corner cases can be properly thought through and planned for, I'm open to being convinced to change current practice. FWIW, I think omission works and is reasonably clear. ‑‑ Eiríkr Útlendi │Tala við mig 18:05, 22 January 2017 (UTC)
Maybe in the case of あっという間に where there is a と to consider it could be transliterated as 't', but in other cases it could be transliterated as something else. —suzukaze (tc) 00:16, 23 January 2017 (UTC)
How about deleting a space after っ in あっという間? That will yield atto iu ma. — TAKASUGI Shinji (talk) 13:54, 23 January 2017 (UTC)
It's totally reasonable but I also feel like morphologically it's あっ+と+いう+間+に and should maybe be romanized as such. —suzukaze (tc) 21:07, 23 January 2017 (UTC)
In this particular case, あっ and と are completely fused. There is no pause between them. — TAKASUGI Shinji (talk) 23:38, 23 January 2017 (UTC)
Which is why I proposed "where there is a と to consider it could be transliterated as 't'". I know there's no glottal stop in the pronunciation in the case of あっという間. —suzukaze (tc) 17:34, 24 January 2017 (UTC)
げっ (ge') / あっという間 (a' to iu ma) suzukaze (tc) 08:28, 25 January 2017 (UTC)
We shouldn’t use h. It is for a long vowel. — TAKASUGI Shinji (talk) 11:23, 25 January 2017 (UTC)
Hmm, but we already use Hepburn-style rōmaji anyway. I personally am all for alternatives like q and ' but I also fear that it may confuse casual users of Wiktionary. Of course we could also do the previous status quo of romanizing it as nothing but I am of the opinion that romanizing it visibly is beneficial. Would directly using ʔ be too radical? —suzukaze (tc) 12:01, 25 January 2017 (UTC)
Sorry guys. I have limited Internet access as I'm on holiday in Thailand. I think っ after vowels shouldn't be romanised at all, i.e. it should be romanised as nothing. That's the common practice out there and this is not a unique situation when a foreign letter is romanised as nothing in some situations. --Anatoli T. (обсудить/вклад) 15:42, 25 January 2017 (UTC)
 

Nan de syô, kô, pah' to akaruku nattari, pah' to kuraku nattari surun de syô?

 
suzukaze (tc) 01:54, 1 September 2017 (UTC)
It should be an apostrophe if anything. It's a fairly common way to denote the glottal stop. h is inherently ambiguous and confusing since it is often used for long vowels in non-Hepburn (or rather, non-Unicode) romanization. According to this paper, while Kenkyusha's New Japanese-English Dictionary (研究社新和英大辞典), which the ALA-LC Romanization Table refers to, employs a breve ˘ to denote the glottal stop, selected words in OCLC WorldCat records represented the glottal stop with simply no representation or with an apostrophe, but none with ˘, let alone h or t.
(The quote provided by Suzukaze above is a weird one, as it employs both h and ' in place of っ. I wonder if Ishikawa actually meant a long vowel by h, but he also uses macrons (or circumflexes) so it seems unlikely, which is why it's so strange.) Nardog (talk) 14:54, 2 September 2017 (UTC)
The apostrophe representing the glottal stop is also seen in Random House Japanese-English English-Japanese Dictionary (1997) and Pocket Kenkyusha Japanese Dictionary (2003). Nardog (talk) 18:46, 8 September 2017 (UTC)
Since you have that much evidence, apostrophes are alright by me. —suzukaze (tc) 02:31, 9 September 2017 (UTC)
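Editorial sketch of the outcome above: っ doubles a following consonant, and becomes an apostrophe when nothing romanizable as a consonant follows (end of phrase, or a space). Python rather than the module's Lua; the table is a made-up handful of kana and romanize_sokuon is a hypothetical name. Hepburn special cases such as っち -> "tchi" are deliberately left out.

```python
# Hypothetical mini-table; only the kana needed for the examples.
TABLE = {"あ": "a", "げ": "ge", "と": "to", "き": "ki", "て": "te"}
VOWELS = "aiueo"

def romanize_sokuon(text):
    """Romanize kana, writing a dangling っ as an apostrophe."""
    out = []
    pending = False  # True while a っ is waiting for the next character
    for ch in text:
        if ch == "っ":
            pending = True
            continue
        roma = TABLE.get(ch, ch)
        if pending:
            if ch in TABLE and roma[0] not in VOWELS:
                roma = roma[0] + roma  # geminate the following consonant
            else:
                out.append("'")  # nothing to double: mark the stop
            pending = False
        out.append(roma)
    if pending:
        out.append("'")  # phrase-final っ
    return "".join(out)
```

This reproduces both readings debated in the thread: あっと without a space gives "atto", while あっ と with a space gives "a' to".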

mw.loadData being used by gsub

A note in the module says that mw.ustring.gsub can't use arrays loaded by mw.loadData(). I had switched the module to directly using the various subtables of Module:ja/data before seeing the note, and it seemed to work, so I guess that note is wrong. — Eru·tuon 05:59, 17 August 2017 (UTC)

Are では and とは two words?

Currently with this module, in order to romanize the particles では or とは as dewa or towa, you have to put a space between the two kana, producing de_wa and to_wa. However, in certain contexts, では and とは both exhibit behaviors unpredictable from the way the case particles で and と are used. Simply で + は would suggest 'in, at, on', but when used at the beginning of a sentence, it could mean 'then' or, as an interjection, 'bye' (both often simplified as じゃ or じゃあ in speech); simply と + は would suggest 'with, to (sth)', but it could introduce a definition or a question asking for one, as in 「Wiktionaryとは?」("What is Wiktionary?"). Thus I think では and とは, when separated by spaces or in isolation, should be able to be recognized as one particle with irregular pronunciation, along with は. Nardog (talk) 10:33, 28 August 2017 (UTC)

I agree with you. Wyang (talk) 10:34, 28 August 2017 (UTC)
I think they should still be romanized as de wa and to wa. —suzukaze (tc) 01:08, 31 August 2017 (UTC)
A late addendum: I agree with suzukaze on this. These are still clearly the particles で or と + は. Even in their somewhat lexicalized contexts, other constructions are possible, such as というのは, といっては, とも, just plain で, でも, って, etc. Joining them in romaji renderings does nothing terribly useful, and raises the risk of confusion by obscuring that these are indeed distinct particles, just used in specific combinations. は is a very common element that can also appear after particles , , , より, から, etc. An additional technical consideration is that では could also validly be deha (出端, “moment of departure”).
Another 2p for the pot, anyway. ‑‑ Eiríkr Útlendi │Tala við mig 17:46, 2 February 2018 (UTC)
FWIW the technical aspect is not a concern: we can already do () (ha wa). If we choose to transliterate "では" as "dewa", "で.は" could be "deha". —suzukaze (tc) 22:40, 2 February 2018 (UTC)
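Editorial sketch of the behaviour described at the top of this thread: a は that stands alone between spaces is read as the particle "wa", while は inside a longer token keeps its letter value "ha". Python, not the module's Lua; the tables and romanize_phrase are hypothetical, and the へ and を particle readings are taken from the earlier remark on this page about "e" and "o".

```python
# Hypothetical mini-tables for illustration only.
TABLE = {"で": "de", "と": "to", "は": "ha", "へ": "he", "を": "wo"}
PARTICLES = {"は": "wa", "へ": "e", "を": "o"}

def romanize_phrase(text):
    """Romanize space-separated kana, treating lone は/へ/を as particles."""
    words = []
    for token in text.split(" "):
        if token in PARTICLES:
            # The whole token is a single particle kana.
            words.append(PARTICLES[token])
        else:
            words.append("".join(TABLE.get(ch, ch) for ch in token))
    return " ".join(words)
```

In this sketch で は gives "de wa" and unspaced では gives "deha"; treating では as a unit reading "dewa" would need an extra table entry plus some separator to recover "deha".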

export.romaji_to_kata()

might be redundant to Module:typing-aids/data/ja now. —suzukaze (tc) 08:50, 1 September 2017 (UTC)

😢, if the time has come to farewell one of the earliest functions of the module. (although – shouldn't the data be at Module:ja/data instead? The function is still used in many Japanese modules.) Wyang (talk) 09:43, 1 September 2017 (UTC)
It's probably not redundant as far as speed is concerned. The version in Module:typing-aids has got to be a bit slower than the original function, because it uses mw.ustring.gsub over and over. Then again, if the function isn't used heavily (that is, as heavily as kana-to-romaji would be used in a list of terms), that might not matter. — Eru·tuon 03:01, 14 November 2017 (UTC)

ruby stuff

Example from では (de wa):

* {{ja-usex|'''では''' <tt>C-v</tt> (次の画面を見る)をタイプして次の画面に進んで下さい。(さあ、やってみましょう。コントロールキーを押しながら <tt>v</tt> です)|^'''で は''' <tt>C-v</tt> (つぎ の がめん を みる)を タイプ して つぎ の がめん に すすんで ください。(^さあ、やってみましょう。^コントロールキー を おしながら <tt>v</tt> です)|'''Now''' type <tt>C-v</tt> (View next screen) to move to the next screen. (go ahead, do it by depressing the control key and <tt>v</tt> together)}}

The ruby function is trying to find kana that correspond to C-v, but there are none. Should this be fixed by creating a way to exclude text from ruby annotation, and marking C-v with whatever that code happens to be? (That can easily be done with the code that now excludes decimal character entities and HTML tags.) — Eru·tuon 00:59, 2 September 2017 (UTC)

What I ended up doing is making the ruby function ignore HTML tags (for instance, <tt>), named entities (&nbsp;), and numeric entities (&#32;), as well as anything enclosed in double ampersands, in both the annotated text and the annotation. It's a makeshift solution, so there may be problems with it later on. — Eru·tuon 18:32, 24 September 2017 (UTC)
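Editorial sketch of the kind of tokenizing this implies: split the text into spans to be matched against kana and spans to be skipped. Python, not the module's Lua patterns. The regular expression, the name split_protected, and the exact form of the double-ampersand marker (assumed here to be &&...&&) are guesses for illustration.

```python
import re

# Spans the ruby matcher should skip (assumed syntax, see the note above):
#   &&...&&   text wrapped in double ampersands
#   <...>     HTML tags such as <tt> and </tt>
#   &#32;     numeric character entities
#   &nbsp;    named character entities
PROTECTED = re.compile(r"&&.*?&&|<[^<>]+>|&#[0-9]+;|&[A-Za-z][A-Za-z0-9]*;")

def split_protected(text):
    """Split text into (is_protected, chunk) pairs, in order."""
    parts = []
    pos = 0
    for match in PROTECTED.finditer(text):
        if match.start() > pos:
            parts.append((False, text[pos:match.start()]))
        parts.append((True, match.group()))
        pos = match.end()
    if pos < len(text):
        parts.append((False, text[pos:]))
    return parts
```

Note that skipping the <tt> tags alone still leaves C-v as ordinary text, which is why the usage example would also need the double-ampersand wrapper around it.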

kanji sortkey possibility - XJIS

https://www.google.com/search?q=xjis+sorting (not sure what happens to characters outside of XJIS though) —suzukaze (tc) 09:16, 24 September 2017 (UTC)

Extending Romanization

How can we make か゚行 and カ゚行 romanize with ng-'s and ラ゚行 with l-'s? MiguelX413 (talk) 02:20, 4 October 2019 (UTC)

@MiguelX413, presumably you're talking about extending or repurposing this module for other languages than Japanese? Japanese in the context of language code ja has no such kana, nor such phonemes. ‑‑ Eiríkr Útlendi │Tala við mig 03:33, 4 October 2019 (UTC)
@Eirikr: /ŋa/ does occur among older speakers of Japanese, and I do think that ラ゚行 is used by some Japanese Catholics to represent /l/ in some hymns. I want to add these to the module. MiguelX413 (talk) 03:58, 4 October 2019 (UTC)
@MiguelX413:, one thing I had to learn a while back was the distinction between phones, as in the specific realization of how something is pronounced, and phonemes, as in the meaningful distinctions made by speakers of a given language. If I recall correctly, phones are transcribed using [square brackets], and phonemes are transcribed using /slashes/.
Looking specifically at these two sounds you mention, [ŋa] appears as a phone, but only as an allophone of [ɡa]. If two phones are allophones, that means they both represent the same phoneme. For [ŋa] and [ɡa], these are both equivalent to the phoneme /ɡa/. Put another way, if one person says [ŋa] and another person says [ɡa], a Japanese speaker would "hear" the sound が (ga). These two phones represent the same phoneme, and are not contrastive (i.e. they do not represent separate and distinct phonemes).
Similarly, while the voiced alveolar lateral approximant [l] exists as a phone among some speakers of Japanese when pronouncing らりるれろ (ra ri ru re ro), such as older speakers of Tōhoku dialect around Morioka who pronounce this much more like la li lu le lo, this [l] is an allophone of the alveolar flap [ɾ] exhibited by most speakers of mainstream, mass-media Japanese. Again, these two phones are not contrastive.
Since neither /ŋ/ nor /l/ is a phonemic feature of Japanese, I'm not sure they belong in the ja module.
That may just be me, however. I'm curious what others think. ‑‑ Eiríkr Útlendi │Tala við mig 04:32, 4 October 2019 (UTC)
@Eirikr: Some Japanese linguists use the aforementioned か゚行, カ゚行, and ラ゚行 for transcription in addition to う゚ (nasal う) among others. ホゥ is already included in the module, despite not being contrasted either. MiguelX413 (talk) 04:40, 4 October 2019 (UTC)
@MiguelX413:, yes, I've seen similar notation. These appear to be attempts at using kana for phonetic transcription. I'd always understood this module as intended for phonemic transcription, which is a different thing. ‑‑ Eiríkr Útlendi │Tala við mig 04:46, 4 October 2019 (UTC)
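Editorial sketch of the mechanics only, leaving aside whether such kana belong in the ja module: か゚ and ラ゚ are a base kana followed by the combining handakuten U+309A, so a converter has to look one character ahead. Python, not the module's Lua; the tables are a made-up fragment and romanize_ext is a hypothetical name.

```python
HANDAKUTEN = "\u309a"  # combining semi-voiced sound mark

# Hypothetical fragments of the ordinary and extended tables.
PLAIN = {"か": "ka", "カ": "ka", "ラ": "ra"}
EXTENDED = {"か": "nga", "カ": "nga", "ラ": "la"}  # base kana + handakuten

def romanize_ext(text):
    """Romanize kana, reading base + U+309A via the extended table."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in EXTENDED and text[i + 1:i + 2] == HANDAKUTEN:
            out.append(EXTENDED[ch])
            i += 2  # consume the base kana and the combining mark
        else:
            out.append(PLAIN.get(ch, ch))
            i += 1
    return "".join(out)
```

Precomposed kana such as ぱ are single code points and are not affected by this lookahead.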