Wiktionary:Beer parlour/2008/August

This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.

Another glimpse into the minds of our readers. A few things spring to mind.

  • People think we are Wikipedia.
  • People think about sex, a lot.
  • We get a disproportionate number of Arabic misses.
  • Spelling mistakes aren't as common as I thought.

Any ideas how to make ourselves more useful based on this list of links? Conrad.Irwin 16:46, 1 August 2008 (UTC)

It might be nice if that list could be sorted by script. -Atelaes λάλει ἐμοί 18:46, 1 August 2008 (UTC)
Now sorted, first by number of repeats, then alphabetically. (More or less) Enjoy! JesseW 20:38, 1 August 2008 (UTC)
Thanks a lot! Note that repeats may well come from the same person clicking more than once: while there is some kind of limiter on the number of clicks, I don't think it prohibits doubles. Conrad.Irwin 20:47, 1 August 2008 (UTC)
Note the randy teenager effect. The Arabic script misses are Persian/Farsi, not Arabic. (At least the top few are.) Interesting. Robert Ullmann 17:26, 3 August 2008 (UTC)
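
For anyone wanting to reproduce the ordering described above (first by number of repeats, then alphabetically), here is a minimal sketch. It assumes the raw data is simply a list of clicked terms; the function name and the sample terms are hypothetical, and plain string comparison is only "more or less" alphabetical for mixed scripts.

```python
from collections import Counter

def sort_missed_terms(clicked_terms):
    """Order terms by repeat count (highest first), then by plain string order."""
    counts = Counter(clicked_terms)
    # Negating the count puts larger counts first; the term itself breaks ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample data, not taken from the actual list of misses.
sample = ["wiki", "sex", "wiki", "apple", "sex", "wiki"]
for term, repeats in sort_missed_terms(sample):
    print(repeats, term)
```

Since the same person may click more than once, the counts here are clicks, not distinct readers.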

rfc-level false positives

AFAICT we allow headers like "Pronunciation 1" and indented etymology and other headers below. If we do allow such Pronunciation #-type headers to structure entries, then the structure-problem-detecting bots need to adjust. Have they already been adjusted? Am I just seeing relics of the past? Should we instead work to combine all pronunciation sections in a given Language section into one, duplicating part of the entry structure? How long, then, until long pronunciation sections need to be put under {{rel-top}}? DCDuring TALK 17:12, 3 August 2008 (UTC)

We do not "allow" Pronunciation 1 etc. They are not provided for in WT:ELE, and the several editors who have used them have never put together a coherent proposal to modify ELE. So for now, they are "illegal" and very properly tagged.
Would be really good if there was such a proposal, but the last attempt (by DAVilla) was a series of votes on separate ideas, with no coherent combination (that I could find at least). The discussion on the talk page outlined the problem.
What we've been using now is either a separate pron section for each ety, or putting it first if the same for multiple etys; if there is only one, then different pronunciations for (say) noun and verb are tagged within one section. Robert Ullmann 17:20, 3 August 2008 (UTC)
OK. Glad I asked before I messed up too many more entries. I'd just gotten started on rfc-level entries and 2 of the first 4 had such problems, but I'd also messed up a few others. Two steps forward, one step back. Sigh. DCDuring TALK 18:14, 3 August 2008 (UTC)
Robert, such a proposal was made. In fact, two votes were initiated on this subject in 2007. It is therefore incorrect to claim that no coherent proposal has ever been put together. It is also incorrect to talk about such structure as "illegal" or "what we've been doing", when you mean that you don't like them and it's not what you've been doing. Your suggested format is no more legal, since WT:ELE does not address this sort of case at all. When we put the matter to a vote, in order to clarify what we want to do, the voting ended in no consensus. This means that "we" do not agree on what we should be doing. Your recommendation is just one option open to us. --EncycloPetey 16:38, 4 August 2008 (UTC)
EP, can you point me to a workable proposal? The set of approve/disapprove votes that DAVilla put together (and see the talk page there) had no consistent combination(s). I'm not aware of any other that has a complete solution. (Just saying Pron N is allowed is not adequate; it must address N pronunciations and M etymologies. There are several possibilities, none very pretty.) Please note the very intentional scare quotes on "illegal" ... and I am not making a recommendation above, just noting what ELE says now, including how it has been stretched a bit to have one pron cover multiple etys. And it isn't that I "don't like" Pron N (although I don't ;-) it is that this has not been sorted, and it should be. I have not yet seen a single consistent proposal. We should have one. Robert Ullmann 17:09, 4 August 2008 (UTC)
Regardless of what adjectives you use to describe these types of formatting, I think Robert makes a decent point in that we haven't yet achieved a policy level consensus. I think it reasonable to tag these, so that we know where to find them when we do come to a consensus on the format. -Atelaes λάλει ἐμοί 18:28, 4 August 2008 (UTC)
I have no problem with some sort of tagging on these for later cleanup, but think it wrong to impose a particular format at the present time. The problem is much worse for Latin, where the pronunciations can be inflection-dependent, rather than part-of-speech dependent. Marking one pronunciation as "verb" and another as "noun" in English entries is bad enough, but I don't think anyone would be served by having Latin entries with a "unified" pronunciation section marking one pronunciation as (nominative feminine singular, nominative neuter plural, accusative neuter plural, vocative feminine singular, and vocative neuter plural), which is what we'd have to do for a significant fraction of Latin adjectives with this issue. --EncycloPetey 00:42, 5 August 2008 (UTC)
An implication of what you are saying for Latin (and, I assume, some other heavily inflected languages) seems to be that pronunciation belongs at the level of inflected forms, of which the lemma form is but one. Fine. The difficulty would seem to be that, for some headwords, especially in English, contributors were trying to use the pronunciation differences to structure entries. They are doing so not because they believe that it is correct, but because, A., they do not perceive the problem that you are sensitive to, B., they do not want to repeat the space-consuming current layout of the Pronunciation section in many locations in an entry, or, C., the alternative Etymology basis for structuring entries is not attractive. This, in turn, would seem to be because the Etymology, 1., has not been entered, 2., is unknown, or, 3., is highly fragmented, causing the entry to look dreadful.
Possible solutions to the difficulty would seem to be that, 1., we enter more Etymologies for the complex entries and, 2., that we reduce the amount of visible space taken by the pronunciation section to something similar to what Pronunciation takes in entries for the other on-line dictionaries that offer pronunciation.
We have evidence that users think that the definitions are hard to find. We have no evidence that a large number of users are unhappy with the lack of availability of Pronunciation (though many entries do not have it). I would be surprised to see any evidence that more than 10% of our users could read any phonetic alphabets. I would be surprised if more than 10% even knew what a schwa was. Etymologies may be useful for structuring and entertaining for word-lovers, but they too could stand to take less space by default. DCDuring TALK 02:00, 5 August 2008 (UTC)
Please try gathering the facts before making a statement about them. You state "We have no evidence that a large number of users are unhappy with the lack of availability of Pronunciation", when what you mean to say is that you haven't bothered to look at the evidence. We have an active and much-used system for requesting both symbolic and audio pronunciation. The difference with definition requests is that we don't have as clear a method for requesting them, so you see more complaints about that. --EncycloPetey 19:39, 13 August 2008 (UTC)
Query: How many of the requests are from admins attempting to construct entries? More telling would be the number of clicks on pronunciations. What portion of the est. 1MM visitors or 3-4MM page visits lead to a pronunciation hit, let alone an RFP or RFAP? DCDuring TALK 21:55, 13 August 2008 (UTC)
How could such a statistic be reliably generated? Many users who are interested in the pronunciation will also be interested in the etymology (or other information appearing above the pronunciation), and click on that and read down to the pronunciation without clicking on it in the TOC. Equally they might click on something below (e.g. the definition), read that, then scroll up. If a user knows how to read any or all of enPR, IPA and SAMPA then they will not need to click on anything to view the key. Counting the number of rfp and rfap requests is equally an invalid indicator of interest, as it ignores those who are interested in words that already have pronunciation or requests for pronunciation, and excludes those who don't know about the templates and those who don't/won't/can't edit.
Don't get me wrong, I am interested in knowing what the level of interest is, but I don't know how we can reliably gather the information. Thryduulf 00:15, 14 August 2008 (UTC)
In our current state of statistical knowledge, almost any facts would be good. The ratio of audio file hits to hits to the pages that had audio files would be a start, obviously a substantial underestimate of total interest in pronunciation. A survey of users that included questions about ability to use phonetic alphabets would be another useful tool. Such facts obviously need a lot of interpretation, but might provide some insight. That our competitors seem to emphasize audio and minimize screen space devoted to pronunciation is certainly suggestive. I'd bet that, as a group, users who seek pronunciation information are probably among those who derive a lot of value from dictionaries and deserve more of our attention than their numbers alone would suggest. My sole interest is in making sure that delivering pronunciation information in all of its richness and variety does not deter users rarely or never interested in pronunciation. DCDuring TALK 01:02, 14 August 2008 (UTC)

Okay, here's a coherent proposal for you:

The presentation of page title should be changed to use a CSS class, so that it is no longer a header-1. Instead, language should be header-1. Part of speech will always be header-3, as it most commonly is now, with subheaders at deeper levels and also consistent. Without saying which of etymology or pronunciation should or may or ought not be first, the first one regardless will be header-2, and any deeper nested will be header-3.

So, for instance, we could have:

===Pronunciation 1===
===Pronunciation 2===
===Verb===

01:31, 9 August 2008 (UTC)

How does your proposed structure cover entries where there is 1 pronunciation and multiple etymologies? Multiple etymologies and multiple pronunciations? Entries where there is 1 pronunciation in the UK and 2 in the US? What about where there are 3 etymologies (a, b and c), etymologies a and b share a pronunciation in the UK and ety c has a different one, but in the US it is etys b and c that share the pronunciation used in the UK for etys a and b, with yet another pronunciation for ety a? Then someone comes along and notes that in New Zealand they use the same pronunciation for all the etymologies, but it doesn't correspond with any of the US or UK ones, while in Australia they use the NZ pronunciation for etymologies a and c and for etymology b use the pronunciation the US uses for ety a?
I dislike entries structured by pronunciation because the pronunciation is so variable - geographically, over time, even by degree of formality (see unused for example). Thryduulf 21:31, 13 August 2008 (UTC)
The entry for unused should at least be split by etymology, and possibly split into two separate pages to match what we've done for used and used to. --EncycloPetey 02:41, 16 August 2008 (UTC)
We don't yet have separate etymologies for used, which, I think, might be a necessary precursor to having them for unused. A separate entry for unused to could be useful, especially since that usage seems to be the most common source of the "t" pronunciation, AFAICT. DCDuring TALK 03:06, 16 August 2008 (UTC)

Pronunciation competitive analysis

I have done a quick review of 18 (!!!) on-line dictionaries' handling of pronunciation. I used the word "controversy" because it was the first one that came to mind that had important pronunciation differences.

  • Four dictionaries (OPTED, Allwords, Webster1828, and Wordnet) had no pronunciation at all.
  • Four had audio, three of which worked with Firefox (Encarta, AHD, …); Wiktionary did not have audio for this word! Except for Encarta's version, little space was taken for the audio: just a small icon. The fourth dictionary uses QuickTime, which Firefox doesn't like.
  • Two (ARTFL and …) present only the stress pattern and use no phonetic alphabet. They only show the US stress pattern.
  • Five (Encarta, AHD, …, Wordsmyth, and Cambridge American) only show the US pronunciation.
  • Seven (Comp Oxfd, MW, Camb Intl, Wikt, Infoplease, …, Ultralingua) show US and UK, Camb Intl showing a third.
  • Of those seven offering both, Wiktionary took the most vertical space (3.5 lines). … and Comp Oxfd took 2 lines. Cambridge took 0 vertical lines, using a "show/hide" that took 1 horizontal inch. Expanded, Cambridge took much more space. The others took 1 or 1/2 lines (extra height for phonetic characters).

I conclude that we might well consider some effort to reduce Pronunciation space requirement especially on the initial screen, without reducing actual content. A prominent "show/hide" on or near the inflection line (or on the right if not squeezed out by ToC) would seem to fit the bill. DCDuring TALK 03:22, 5 August 2008 (UTC)

The only reason traditional dictionaries devote only the minimal space to anything they include is just that: space limitations, which Wiktionary doesn't have. There is a compelling demand from the public for dictionaries to include ever more words and meanings. This means that everything that can be dropped or otherwise reduced in space (e.g. run-in words) will be, to the maximum extent. Given that Wiktionary, unlike any usual dictionary, can and aims to include easily up to 5 or 6 different pronunciations (gen US, gen UK, SAf, CA, AUS), and include information that is not usually separate (hyphenation patterns are normally combined with the headwords, rhymes are not at the entry etc.), this concern is a bit inappropriate. The best we can do is reduce the amount of space where appropriate (possibly by putting more stuff on a single line, such as IPA and (X-)SAMPA?).
Another option is to reorganize the page to de-emphasize pronunciation (which well over half the entries may be missing anyway), which is probably not the main concern for a vast majority of visitors, by moving it more toward the end of entries, for example. Circeus 23:12, 5 August 2008 (UTC)
These are all on-line dictionaries no more or less bound by space than we are. The space constraint has changed its nature, but not disappeared. The useful limit on the amount of information that can appear on a page corresponds to the useful limit on what one can get on a screen. We can have as many entries as we can maintain, but, if we don't give the user what is wanted quickly, they will move on to one of the 18 other on-line dictionaries (not counting Urban Dictionary or specialty dictionaries or other reference sources). Remember that we (all of …) get one seventh the number of visits that MWO gets and less than half of what … gets. We are not dramatically ahead of OneLook or Bartleby. Nor is our relative position improving very much (if at all) over time. Effective search (not a strong point of the software and also not of our handling of search engines) and effective presentation of the desired information (about which I am concerned) are perhaps almost as important as whether Determiners should be a valid PoS header. DCDuring TALK 23:49, 5 August 2008 (UTC)
Also, there is nothing that says that we can't have many different pronunciations. I am only interested in minimizing the amount of space consumed on the first screen for items that are not used by many users. If we knew that 50% of our users were non-native speakers of English who could use at least one of the phonetic alphabets we offer and who could conveniently access our audio files, there would be much more justification for our use of space. Facts about our users would be nice. Learning from the judgments of our competitors about what users want is the best we can do without the underlying user facts. DCDuring TALK 23:56, 5 August 2008 (UTC)
"These are all on-line dictionaries no more or less bound by space than we are." No, these are paper dictionaries in electronic form, and as such they present all the same inherent space limitations as their paper editions because that is the only one that was taken into consideration at the time of their conception. Very few "online" dictionaries (I can't think of any offhand) are truly "online" and do not fall prey to expectations and limitations that arose strictly from those of the original paper medium. Circeus 01:49, 7 August 2008 (UTC)
Isn't it pretty to think so? It is a serious strategic mistake to underestimate one's competitors. All of us suffer technological limits because of our respective platforms, data sources, cost structure, etc. (Consider our search capability, data structure, decision making, resource constraints, inability to get promised XML dumps, etc.) CambID, COxfD, Encarta, MWOnline all look to me more "modern", less Unix-y, than Wiktionary. They haven't told me their plans and limitations so I have little more by way of facts to go on. Please recall that three of these "limited" competitors have significantly more visits than we do: …, MW Online, and …. It is difficult to say where Encarta stands. DCDuring TALK 02:26, 7 August 2008 (UTC)
Would it be worth looking at other FL wiktionaries to see if there is anything that can be leveraged? What is the argument against making the Pronunciation section show/hide? How about using templates for headers? Was this option discussed before? E.g. {pron} instead of ===Pronunciation===. A template could be adjusted as new approaches come up. --Panda10 11:50, 7 August 2008 (UTC)
Excellent ideas. "de.wikt", especially, can be interesting to look at for technical ideas. But many of the FL wikts have needed to address the position/presentation of pronunciation, facing different considerations, I'm sure. DCDuring TALK 12:36, 7 August 2008 (UTC)
There are many technical reasons not to replace section headers with templates. One of these is that it disables section editing. We've rejected that idea as a community many times, having learned of its pitfalls from the French Wiktionary. --EncycloPetey 19:15, 13 August 2008 (UTC)
Yes, the space constraint is the first screen of information. I absolutely agree. A bigger problem than the pronunciations though are the damn ToC's. They're only good for jumping to a language selection and maybe part of speech. For three etymologies, no one's going to say, "I think I need Etymology 2" 'cause you've got to look at the definitions under each one till you find what you're looking for. And you don't know which Synonyms or Translations section you need either, until you find the definition you're looking for. So not only are the ToC's wasting space by not being collapsed, they are 95% useless altogether. DAVilla 01:15, 9 August 2008 (UTC)
The way you would use the ToC is to jump to something that seems more relevant to you than what you see on screen: maybe you know what PoS you need, maybe you know you need something not connected to the part of the etymology you can see on screen, maybe you know you need a translation or a pronunciation.
Sorry for breaking this here, but how am I going to know which translation section I need if there are two noun sections, such as with multiple etymologies or pronunciations? And if there's only one noun section, then why on earth do I even need the ToC? DAVilla 20:06, 11 August 2008 (UTC)
Think anon user, coming less than once a week. All the user can do is try something that our screen makes possible and easy. Often they don't have a choice. In other cases they would just try the first translation section that their mouse gets to or start at the top translation section (or the only visible one). That is no worse than MWOnline. They use the ToC so they don't have to get their hands off their mouse (or use their mouse/pointer) to scroll down. At the slightest bit of discouragement (too long an entry, non-roman text (including lots of phonetics), words they don't understand (Determiner, Hypernyms etc., even Etymology)), we probably lose some percentage of these casual users. No one would pretend that we have any ideal solution, but it pays to think what the users' actual options are and how easily they can get discouraged and try our competitors, often finding exactly what they want and not getting confused by much more. DCDuring TALK 21:02, 11 August 2008 (UTC)
Why focus on putative problems instead of putative strengths? I long ago stopped bothering with several other dictionary sites because they consistently screwed up their IPA. One electronic dictionary I used always had the primary and secondary stress markers reversed. Most sites offer little or no navigation for long entries; I like the fact that we have a TOC that allows me to jump to a section or to scan to see what languages the entry is in. You continue to make hypothetical arguments about things that you believe might discourage some users, with no data or support for these arguments. Then, you want to reform Wiktionary structure based on your musings. Please provide something measurable, or at least play a different note. --EncycloPetey 19:22, 13 August 2008 (UTC)
Yes, I believe that listing languages in the ToC is entirely useful. That much I haven't seen anyone contradict. DAVilla 03:23, 14 August 2008 (UTC)
I have reread EP's comments and still don't get the point. I hadn't made any argument against language headers. That would be why there were no facts or reasons offered. I was explicit about characterizing how a user might use the ToC, unsystematically, sloppily, but based on an intense need not to waste time on fruitless webpages and websites. The only alternative model user that I see advanced is EP himself. I am not interested in reforming Wiktionary based on my musings; I am interested in getting us outside of our own unexplored implicit biases. I would be delighted if someone had some actual facts about users to contradict my "musings", even if they had only anecdotal facts that had the ring of reality or musings about a different type of user. If there is only dismissal of discussion without facts or discussion, my explicit musings remain better than unsubstantiated, unexplored personal preferences or worse. The issue seems to be that musings about users would lead to questions about the wisdom of some design choices. That would be true. That would be the advantage. It might make it possible for Wiktionary to survive, prosper, and have more consequence in the world at large. DCDuring TALK 03:45, 14 August 2008 (UTC)
I will play this music as long as I am here. I was considering taking a survey of Wikipedians and WMF list members about their views on Wiktionary usability, because we can do that for free and ask them questions more easily than others for whom the privacy difficulties seem overwhelming. When I talk to the few people I know who even use an on-line dictionary, they tend to prefer … or MWOnline, consistent with the statistics on the relative popularity of those sites.
There are no user-related facts to support existing practice either, unless you count our own somewhat whimsical preferences. The considerations I advance are based on the published wisdom of web-site usability professionals whose custom advice we can't afford. What is our current practice based on? It often seems to be based on sloganeering ("All words in all languages" [conspicuously excluding any mention of users]) and academic concerns. I realize that we depend on this group of volunteers to get anything done, so our own whimsical preferences necessarily have a lot of influence, but some explicit consideration of users would also help.
Our ignorance about our users is profound. How would you characterize our current users? How many are language learners? How many are writers? What is the frequency distribution by number of visits per month? How many can read IPA? SAMPA? enPr? How many are looking for non-English words?
At present, we still seem to be given the leeway to ignore user preferences. Perhaps we should just admit that we will never be user-friendly enough to be a "people's dictionary" the way Wikipedia is a "people's encyclopedia". Even then, it would be nice to have some idea of what we were trying to do, if we are not trying to figure out what users want and how to deliver it. DCDuring TALK 20:00, 13 August 2008 (UTC)
Merriam Webster online has the same problem, except worse. They have no information on the first screen if there are multiple headwords with the searched-for spelling. You have to guess which of three "nouns" might be the one you want.
No dictionary tries to get as much information into a single entry as Wiktionary. Therefore most of them have much less need for navigational aids within an entry. It is remarkable that, despite having much less to fit on the screen, they also devote so little prime space to pronunciation, alternative spellings, and etymology. They are also fairly economical with space devoted to related and derived terms, synonyms, etc. They also don't have a lot of space devoted to headers.
We have the ability to put the ToC on the right hand side. Registered users who are let in on the secret can select that option. We have had it in operation for a while and gotten rid of a couple of problems, but I don't think it's quite ready for prime time. We would love to figure out how to get some kind of minimal definitional clue (gloss) onto the headings for the Parts of Speech and have other gloss-type navigational links within the entry as well. Not so easy with an editable wiki. DCDuring TALK 02:05, 9 August 2008 (UTC)
Thanks for doing this comparison. DAVilla
Uh... if they present less information, then why is it remarkable that it takes up less space? --EncycloPetey 19:22, 13 August 2008 (UTC)
  1. Because they don't find it worthwhile to take up space with multiple phonetic renderings of the same pronunciation, sometimes only offering (smoothly functioning) audio. CambIntl showed three pronunciations, but used a show/hide scheme (as default for all users). DCDuring TALK 20:00, 13 August 2008 (UTC)
And how did you determine their motives? You are making the fundamental logical error of ascribing cause to an effect without any support. If pronunciation is so seldom desirable, then why are there multiple major pronunciation dictionaries in print? Presumably there is a demand for pronunciations or publishers would not keep printing them. --EncycloPetey 02:37, 16 August 2008 (UTC)
Please don't misunderstand me. By my own reasoning, we certainly need Pronunciation because probably a significant number of language learners and some English native speakers find it useful. Those who find this useful probably are among those who get the most out of Wiktionary (apart from us). Evidence, indeed, is to be found from the fact that the best on-line dictionaries have pronunciation, including phonetic spellings. We would be foolish to discard what we have already done and need to continue to expand it and add more regional pronunciations and keep the three phonetic representations.
My sole concern is with the space taken by it, because it appears so prominently on the first screen that users see at an entry. It is not even the prominent place that pronunciation has that bothers me, only the amount of that prominent space that Pronunciation uses. I would like it if Pronunciation could be presented more economically so that we could have Pronunciation links appear wherever a user might want one, for every inflected form that appeared on a page, if warranted. Alternative spellings in vertical layout, "see", long etymologies, large fonts in headings, and the left-hand side ToC all gobble up valuable screen real estate, the ToC being the worst offender. DCDuring TALK 02:58, 16 August 2008 (UTC)

"SoP" Mandarin, Japanese, and Korean entries

Should we keep entries like "中国人" (中国 + 人, Chinese + person) or "日本人" (日本 + 人, Japanese + person)? In the English sense, these entries are purely a "sum of the parts", and they set a bad precedent for thousands of similar entries. --TBC 02:48, 7 August 2008 (UTC)

It only matters if they are idiomatic in the language in question. As I believe they would be, it's probably best not to include them. The Japanese entries fairly consistently class them as an individual or the people, Mandarin as an individual. But my knowledge of these languages is very basic, so ask for others' opinions by placing the terms on RFD. 00:26, 9 August 2008 (UTC)

Romanization (again?)

It is my understanding that we do not allow Romanization of words in other scripts. Is this still the case? There do seem to be exemptions for Japanese forms. My own thought is that they are useful - one example would be words on a Greek or Thai menu in an English-speaking country. There have been a couple of such Sanskrit words entered recently that I have removed, so other people's thoughts would be appreciated. SemperBlotto 07:26, 7 August 2008 (UTC)

We do indeed; see Ahmed, Gopala, Govinda, lin, gong, xiexie, and nihao for examples. 07:28, 7 August 2008 (UTC)
Just a note that, with one exception (Gopala, which I have tagged for cleanup), these are all either English or East Asian words. -Atelaes λάλει ἐμοί 07:41, 7 August 2008 (UTC)
The current standard is that East Asian languages (e.g. Chinese languages, Japanese) are allowed transliterations. Let me start out by saying that I strongly disagree with this policy, as I have never completely understood the rationale behind it, but it is the current practice. However, transliterations are not allowed for other languages. Specifically, we have decided that Sanskrit transliterations are not accepted. There are a number of reasons for this. First, most languages can be transliterated in a number of different ways. Whose system do we use? Should we use multiple schemes? Only doing entries in native scripts avoids this decision, and its accompanying drama. Secondly, this simply avoids a lot of redundant clutter. Finally, concerning users who lack input methods for non-Latin scripts: Yes, this is an issue, but a fading one. Windows Vista comes preloaded with a number of non-Latin keyboards, including Devanagari. Also, there are a number of methods of finding words which one can utilize without native script input capability, such as searching categories and indices. -Atelaes λάλει ἐμοί 07:38, 7 August 2008 (UTC)
If there are two or three systems in use, we don't have to pick a system. Doing so would be rather POV in fact. Each of several transliteration systems could be incorporated if there were reason to do so. Consider that in Mandarin the mainland Chinese and local Taiwanese governments use different romanizations. (I assume we don't give the Taiwanese the shaft, do we?) The question is what constitutes good reason. Input systems are not, in my opinion, for the reasons you gave. I'd like to see it used in print, applying to proper names at least, or maybe we could be even more strict about that. 01:02, 9 August 2008 (UTC)
There are entire books that have been written in romanized versions of Chinese. The criteria for including the romanization should be the same criteria as including anything else: durably archived citations. bd2412 T 05:18, 14 August 2008 (UTC)

There's a long tradition of English-speaking Sanskritists (dating back at least 200+ years), and terms such as Govinda appear in scholarly English-language publications about Hinduism and Indian culture in general. Thus, I would support the addition of Sanskrit to the East Asian languages, in regard to the allowing of Latinized entry titles, for ease of navigation for our users who do not have Devanagari keyboards. 07:43, 7 August 2008 (UTC)

Nearly all languages have been commonly romanized in English scholarly papers. This is due to technical limitations which are not present on Wiktionary. This has been previously discussed with the decision to not allow transliterations. -Atelaes λάλει ἐμοί 07:49, 7 August 2008 (UTC)
Going through the archives, there hasn't been any major discussion on the inclusion of romanized terms since a year ago (at least, for the Beer Parlour). However, the discussion was inconclusive (there wasn't really consensus for anything), and much of the precedent for romanized entries seems to have been set by a discussion dating back to 2005 (though again, no consensus was established). There's also been some discussion regarding romanized entries on Wiktionary:About Ancient Greek, but that hardly applies to all the other languages, and the page hasn't been established as a policy yet. Our policy on romanized entries, Wiktionary:Transliteration and romanization, doesn't really help either. EcycloPetey has stated that "Our practice has been to not include romanizations that are not a natural part of the language", but I can't really find any discussion establishing the consensus to back this up. So, the rationale not to discuss this issue because it was discussed before is largely irrelevant.--TBC 07:51, 7 August 2008 (UTC)
I think Romanized entries are useful when there is a popular, standardized system, as in the case of Japanese, Korean and Mandarin; and some classical languages have a strong tradition of Romanization in the academic community (e.g., Sanskrit, Ancient Egyptian, Sumerian). Many others either don’t enjoy a popular Romanization system, or else there is little standardization and numerous systems in use (as in the case of Arabic, Armenian, Thai, Khmer, Russian, and the Dravidian languages). In the latter cases, I don’t see the value in Romanized entries, with the exception of a few well-known words (Allah, mullah, amir)...although many or most of these words will qualify as English. —Stephen 07:52, 7 August 2008 (UTC)

Sanskrit has been standardized (using a lot of macrons and "dot unders"), but, as we've found with pinyin, it just isn't easy for most users to type the tone marks in, thus we have separate entries (due to the fact that redirects are frowned upon), such as xiexie.

In the case of Govinda, a solution needs to be made. There are several possibilities: redirect to Devanagari only, separate entry with Latinized title called an "English" word, separate entry with Latinized title called a Sanskrit word (equivalent to the entries for East Asian words), and perhaps others. The paramount issue is if users come here attempting to find the meaning of Govinda, Gopala, Gopi, nath, or any other commonly found Sanskrit term, they should be able to find it quickly and easily. Our collective ingenuity, I am sure, will come up with a solution for this, as providing correct, thorough information for our users, in a manner that is easy to find, really is important. 07:54, 7 August 2008 (UTC)

Regarding TBC's points above, many Sanskrit terms have become a "natural part" of the language due to the fact that Indians have historically and commonly used English, and many of the common Sanskrit terms are part of their proper names, and have consequently gained familiarity in English-speaking nations. 07:56, 7 August 2008 (UTC)

Err... my points? I think you're referring to Stephen's...--TBC 08:00, 7 August 2008 (UTC)

No, you did mention something about terms being a "natural part" of the language. I was saying that many of us hear these Sanskrit terms on a daily basis--at least those of us who work with expatriates from South India, which is a lot of people these days. :) 08:02, 7 August 2008 (UTC)

Oh, I was quoting another user on this site. His views don't reflect mine at all.--TBC 08:06, 7 August 2008 (UTC)

Going back to the issue at hand, I believe that romanized entries of foreign scripts should be kept. It's helpful for users when it comes to navigating the site, especially if they have a keyboard with Latin letters, or if they haven't installed the foreign scripts yet. As long as the entries come from established romanization systems (with attributable sources and so on), I really don't see why we should exclude them, especially when we do include romanized entries for Mandarin and Japanese.--TBC 08:06, 7 August 2008 (UTC)

For what it's worth, my opinion is that such entries would be useful to our users - but the ==Language== heading to use is problematic (they are not Sanskrit, Greek etc). I have added a ==Translingual== entry to gai - would that be an acceptable alternative? SemperBlotto 08:17, 7 August 2008 (UTC)
It's fine by me, but should the same be done with Mandarin and Japanese entries?--TBC 08:22, 7 August 2008 (UTC)
Well, for Japanese and Mandarin, we just use the language header Japanese or Mandarin, and add it to a Romanization category. Due to the difficulty of entering letters such as Japanese long vowels (āēīōū), I recall that we had decided that it would be a good idea to drop the diacritics (e.g., sayonara instead of sayōnara, redirecting sayōnara so we don’t get duplicate entries). —Stephen 08:28, 7 August 2008 (UTC)
"Translingual" seems confusing; how would "Romanized Thai" or "Thai (romanized)" be for maximum accuracy and clarity? 08:40, 7 August 2008 (UTC)
There isn't much logic in the language headers of transliterated given names. Many are "English", others "Translingual" or the original language, e.g. Aleksej is "Russian". I think "English" is reasonable for names used in modern India (English is an official language there, and the spelling is typically English) and for Arabic names (many immigrants give them to their children). For names transliterated from all other languages, I'd vote for "Translingual" or "Romanized Russian/Whatever". --Makaokalani 12:16, 7 August 2008 (UTC)
This requires some thought. For example, Ljubomyr, Ljubomir, Lubomyr, and Lubomir are all valid romanizations of the Slavic name Любомир—the name isn't English, but it is both Ukrainian and Russian, and probably belongs to several other Slavic languages under this Cyrillic spelling or a variation. There may also be native versions written in Latin, say in Czech or Croatian. Many transliteration schemes are used translingually, but some are specific to a particular target language, for example Nikita Khrushchev is Chruschtschow to Germans, but Khrouchtchev to the French.
I assume we would include one entry for a particular romanized spelling, and a separate “language” header for each source language. If so, then we can't use “Translingual”. “Romanized (Romanised?) Russian” isn't really a language, but it may have to do. Or “Romanized from Russian.” Michael Z. 2008-08-07 23:07 z
If a transliteration scheme applies broadly to a script instead of just one language, would it make sense to use a header that reflects that?
Wouldn't the type of transliteration clue the user into which governments it would likely be used under? Again, I refer to the flavors of pinyin as example. 01:02, 9 August 2008 (UTC)
I'm not sure about the first: some transliteration systems are consistent across an entire script (e.g., w:ISO 9), while other standards actually consist of a separate transliteration table for each language (e.g., w:scientific transliteration, w:BGN/PCGN romanization, w:ALA-LC Romanization). Also note that systems like the latter two have strict variants which require special characters, and modified variations which are very widely and consistently used in publication.
The French and German transliterations I mentioned are not necessarily any kind of government standards, they just happen to be spelt according to the phonetic rules of the respective languages rather than English. Michael Z. 2008-08-09 23:19 z

User 24.*, you are confusing the addition of transliterated/transcribed Sanskrit lexemes, which is forbidden, and the English terms of Sanskrit origin (which usually correspond to some approximate romanization, or are directly used as IAST transliterations), which are allowed if they pass the usual WT:CFI, which thousands of the philosophic and religious terms you're probably interested in usually do (not to mention 330,000+ Hindu deities ^_^). --Ivan Štambuk 11:00, 7 August 2008 (UTC)

I'm not confusing anything; the lexemes (such as nath, which was deleted entirely the other day) are equally important as the names, and all should be available to users with Latin keyboards who are looking for them, just the way lin, mao, or gong are available. 22:26, 7 August 2008 (UTC)
Yes, you are confusing them. You may be fully aware of the distinction yourself, but you are muddling them together in your arguments. You put Ahmed forth as a transliteration of a foreign word, when it is an English name (and tagged as such), although obviously it's of foreign origin. Perhaps you have two arguments, one that we should make such terms available to users without non-Latin keyboards, and another that we are already keeping transliterations. But, if so, you need to draw a sharper distinction between them in order to avoid the aforementioned accusation. -Atelaes λάλει ἐμοί 22:32, 7 August 2008 (UTC)
Ah, different sense of "confusing". Confusing!
The fact remains that, as of a day or two ago, our users cannot find out what nath, a very common Sanskrit term, means, when they could before. Try putting "nath" into the search box and see what you can find. That's a problem that really does need to be solved (which is what this page is presumably for). Regarding foreign names, how about one like Jagdish--would that be considered an "English name"? To me, it doesn't matter what the heading says--that's for Wiktionary policy, but I would expect to be able to type it into the search box and get the information I needed (i.e., the etymology, Dev. spelling, etc.). 22:36, 7 August 2008 (UTC)
Ok, then put forth a comprehensive, workable proposal for doing so, which can be applied to all languages. Let's start with the issues facing Greek. First, which transliteration scheme should we use? There are the four systems listed at Wiktionary:About_Greek#Romanisation, and a few more at w:Romanization of Greek. How do we decide which one we're going to use? Also, none of those systems is typically used in its pure form; Greek words are often transliterated with a sort of mish-mash approach. And that's just modern Greek with modern pronunciation. There are a number of periods of Ancient Greek phonological development. Do we give preference to Classical, and treat φ as ph (or more accurately pʰ), or should we give Koine preference, in which case it would be f? How about diacritics? Should we include them? Some people think it's more accurate to do so, but it does make it more difficult for a lot of folks to type those extra characters. Better yet, how do we represent a tonal accent in Latin characters? Perhaps we should simply ignore that for now.....but someone's bound to bring it up. Also, what do we do about variant spellings and inflected forms? Ancient Greek had a solid half dozen dialects and three major time periods. Should we transliterate every variant? Also, Ancient Greek was a highly inflected language. Each verb has about five hundred inflected forms (not including participles, which nearly doubles that count). Do we transliterate all of them too? Now, perhaps we should give them all equal share, and flood each and every conceivable combination of Latin characters with a dozen transliterations in a thousand languages. What do you think? -Atelaes λάλει ἐμοί 23:06, 7 August 2008 (UTC)
This could be dealt with by attestation, just like any other word. If a Romanization is suitably attested three times in English texts, then it could be placed under an “English” header, with a regular POS, but a definition like “Romanization of X, from Ukrainian”. Keep in mind that the same or a different romanization of the original word may also be attested in French texts. If it shows up in three languages, perhaps it is then called “Translingual” (there are romanization schemes specific to a target language, and others which are meant to be international). This would require distinguishing transliterations of words from borrowed words, including fully naturalized borrowings and italicized foreign terms. Just putting this out for discussion—I don't know if it would be a good idea or bad. Michael Z. 2008-08-07 23:18 z
Additionally, so far we've only been talking about romanization, a specific type of transliteration (i.e., transliteration into Latin characters). What about other scripts? Do you think that English words have never been transliterated into Greek or Hanzi? If we're serious about this, we really ought to transliterate every word, in every language, into every script. -Atelaes λάλει ἐμοί 23:14, 7 August 2008 (UTC)
Sure, why not? In fact, we do have Mandarin entries transliterated from English, such as 可口可樂 (Coca-Cola), 吉他 (guitar), and 尼古丁 (nicotine).--TBC 00:32, 9 August 2008 (UTC)
I'm not sure I follow. A word is English only if it appears within English text. If it's written in Greek script then it isn't English text, unless the Greeks actually go about transliterating entire blocks of English-language writing into Greek script. Does that actually happen? Otherwise it's a Greek term, borrowed. 01:02, 9 August 2008 (UTC)
Regarding Greek (not my expertise), if there was agreement that we should do it, we could create entries for all four romanization systems, leaving out accents for ease of keying in the search box, and, to save having to make duplicate entries for all four, they could all redirect to the proper terms. We do the same for Korean, with two widely used romanizations there (Revised Romanization and McCune-Reischauer). But it wouldn't be done all at once. Thus, one would be able to find Hellas, demos, demotika, etc. by typing into the search box. However, I hadn't been discussing Greek, but instead primarily Sanskrit and East Asian languages. 23:26, 7 August 2008 (UTC)
Everything Atelaes said for Ancient Greek is applicable to Sanskrit too. There are at least 4-5 transliteration systems that are widely used on the Internet for Sanskrit (some more appropriate for machine processing, some for QWERTY keyboards), with plenty of "ad-hoc" schemes people casually use. The phonologies of Vedic and Classical Sanskrit differ as much as those of Homeric and Koine Greek. Even within the classical period there are differences; the usual pronunciation of 'ph' is [pʰ] - but there are some Sanskrit school traditions that pronounce it as [f], and all of them are equally "proper" since there is no prescribed orthoepy (compare e.g. American and British English). Should we therefore romanize as 'ph' or 'f'? Removing those very important "dots under and above" and macrons would be an absurd dumbing down of 36 distinct phonemes into the 25 that the average Joe Sixpack is familiar with. I've already explained to you elsewhere that murali and muralī are not exactly the same words - they're pronounced and accentuated differently, and follow different declension paradigms. Accent marks are equally important in Vedic Sanskrit as in Ancient Greek - both inherit the free PIE accent, which is phonologically unpredictable and which you must learn by heart. Sometimes the position of the accent distinguishes nouns from adjectives, or different meanings. A Sanskrit verbal root also has plenty of inflected forms - 300-1000 depending on conjugation class and attested (sometimes insane) combinations of moods/tenses. E.g. bibhāvayiṣati - desiderative of causative of √bhū "to wish to cause to become" ^_^. Not to mention 8-12 participles per "tense system", each of which can be inflected in 3 genders x 3 numbers x 8 cases. And are you aware of this cute little thing called sandhi? Depending on environment, underlying e.g. agnis can be spelled as agnis, agniḥ, agniṣ, agniś or agnir. Should we add all of them, for every inflected form of every declinable?
This "Sanskrit word" nath you mention means nothing in Sanskrit. Perhaps you were referring to the English word nath, which is borrowed from Sanskrit nātha? I'm sure you were.
I'm also sure that your proposals stem from an ultimately benevolent desire, but you cannot possibly expect to dumb down Wiktionary usage threshold of a (semi-)dead language 99% people on this planet have but a vague notion of what it is to a level I can appropriately describe only as "Jebediah". To whom? Why? Sorry but there is no sense to me in these "idiot-proof" proposals of yours. Maybe East Asian language contributors see sense in adding romanized entries (because people actually write languages that way), but I personally find the idea of adding illiterate Pinyin entries without tone marks very disturbing, and would rather see them pass normal CFI than to be legitimized as "people find tone marks hard to type".
Once again, mentioned Sanskrit words (in whatever transliterated/transcribed form) in English literature are not allowed. Used originally Sanskrit words borrowed into English (nath, yoga, dharma etc.) that interest you are allowed if they pass normal WT:CFI. Sometimes these are used exactly like IAST transliterations of Sanskrit words, so they become "English" (e.g. pundits that spell Krishna as Kṛṣṇa - then the latter spelling is allowed as an alternative English spelling of the former, much more common spelling), if they can pass CFI. --Ivan Štambuk 15:06, 8 August 2008 (UTC)
I have no idea how Sanskrit works (mostly work on Mandarin and English entries here), but entries like murali could have a "see also" link directing the reader to the entry with the correct accent marks, like what we currently do for Pinyin here (lin has links to līn, lín, lǐn, and lìn). Regarding the issue of having different transliteration systems, I think we should have entries for all of them, with a note on each entry mentioning which transliteration system the entry is based off of. It may seem like a daunting task, but it would make Wiktionary a lot more informative and accessible to the reader.--TBC 00:45, 9 August 2008 (UTC)
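As a rough illustration of the "see also" idea discussed here (an unaccented title such as lin pointing at līn, lín, lǐn and lìn), the following is a minimal sketch in Python. The function names `strip_diacritics` and `see_also_index` are hypothetical and are not part of any Wiktionary tooling; the sketch simply assumes that "unaccented" means removing Unicode combining marks after canonical decomposition.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(title: str) -> str:
    """Return the title with all combining marks removed (hypothetical helper)."""
    # NFD splits a letter such as "ī" into "i" plus a combining macron.
    decomposed = unicodedata.normalize("NFD", title)
    # unicodedata.combining() is non-zero for combining marks, so they are dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def see_also_index(titles):
    """Map each unaccented form to the accented titles that collapse onto it."""
    index = defaultdict(list)
    for title in titles:
        base = strip_diacritics(title)
        if base != title:  # only accented titles need a "see also" link
            index[base].append(title)
    return dict(index)


if __name__ == "__main__":
    print(see_also_index(["līn", "lín", "lǐn", "lìn", "muralī", "sayōnara"]))
```

Note that the folding is lossy: murali and muralī share the key "murali", so the sketch can only generate navigation links and says nothing about which accented form a reader meant.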
More informative and accessible to whom exactly? Learning Devanagari is lesson 1 of every single Sanskrit grammar book you'll find out there (try a dozen of them freely available on in DjVu format). Every single "Sanskrit term" IP address mentioned was Sanskritism in English. Latin-script entries with removed accents, macrons and "dots under and above" cannot pass CFI because no one writes Sanskrit that way. The practice done for East Asian languages is very wrong IMHO, and I cannot support its extension to other languages, especially when the only real argument behind it is "it's already done for language X". It's like you're legitimatizing crime with other crime. --Ivan Štambuk 09:26, 9 August 2008 (UTC)
It makes it more accessible to those who aren't familiar with Sanskrit, those who don't have a clue what Devanagari or Sanskrit-specific accents are. And calling the practice of linking transliterated entries to entries with correct accents a "crime" is a bit exaggerated, especially when it helps out readers who aren't familiar with accents (which would be common on a Wiktionary designed for the English audience).--TBC 01:00, 10 August 2008 (UTC)
Obviously I don't speak for Ivan Štambuk, but I think he intended the "crime" sentence as an analogy, not a description. I don't think he's saying that these transliteration exceptions are crimes; rather, he's saying that he doesn't even support the existing ones, so this sort of argument-by-analogy won't convince him to support the new ones, either. —RuakhTALK 01:20, 10 August 2008 (UTC)
Of course it's an analogy. But my point is, why compare it to committing a crime? Why not something less... harsh? Anyhow, this is getting a bit off topic...--TBC 02:22, 10 August 2008 (UTC)
Leave out the sandhi, says I, unless there's a really compelling reason all those forms should be included.
Maybe our opinions would be better shaped if we saw how at least a few words did in CFI. I'm not keen on overloading the burden of proof, but for at least a small sample of terms that's more objective than asking for opinions about how language is used. Let's see how it really is used. 01:02, 9 August 2008 (UTC)
So you'd like to do a CFI for "a few words", generalize the conclusion and apply it to hundreds of thousands of other lexemes? --Ivan Štambuk 09:37, 9 August 2008 (UTC)
Yes, as many words for a given transliteration scheme as are necessary to reach a conclusion. If examples that we would doubt could ever pass do, it illustrates a strong connection between the foreign script and the language, and there's not much point in putting each and every term to the test. If it's mixed results then obviously CFI would have to continue to apply until and if a rhyme or reason were supported for the types of terms that are commonly transliterated, for instance root forms but not sandhi, or mainly proper nouns but not many common nouns or other parts of speech. (There may be camps on either side that dismiss the test entirely, and say that all should or shouldn't be included regardless, but this is more of a middle ground.) The results would have to be interpreted of course, but at least it gives us a foundation on which to base our opinions. I know that I at least do not feel entirely comfortable with the current solution, which seems to brush the which-language issue a bit broadly. Because I trust the opinions of others here, I believe that it's close to the right solution, but I wonder if we're all thinking about these in the same way, or if there's some bias to include or not include transliterations in some language simply because the people who speak that language are more inclusivist or deletionist than the whole. DAVilla 19:41, 11 August 2008 (UTC)

Based on the above discussion it seems there is a weak consensus that we should include Romanised forms of other scripts, and a weaker consensus that if they are included then they should be unaccented with accented forms redirecting to them. I suggest the best format for the entries would be an L2 header of the non-English language (e.g. Russian), the L3 header "Transliteration", a bold headword (e.g. da) and the inflection line # {{romanisation of|<word>}}. For example:

# {{romanisation of|да|sc=Cyrl|lang=ru}}

I suggest the "Transliteration" header as words can have more than one part of speech with the same transliteration (да (da) has 6). If a transliteration can have different targets in different schemes, we can add an optional "scheme" parameter to the "romanisation of" template which links to an appropriate page about that scheme, be that in a non-main namespace page here, a Wikipedia article or elsewhere. Transliterations of words into scripts other than Latin could use appropriate templates, e.g. perhaps {{Cyrillicisation of}}, or we could have one standard {{transliteration of}}, perhaps using more than one script parameter. Thryduulf 01:54, 9 August 2008 (UTC)
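To make the proposed layout concrete, here is a minimal sketch in Python that assembles the wikitext for such an entry. It assumes the "romanisation of" template and its sc=, lang= and optional scheme= parameters exactly as proposed in this thread (none of them is confirmed to exist), and it writes the template with the usual double-brace syntax. The function name `romanisation_entry` is hypothetical.

```python
def romanisation_entry(language, headword, target, sc, lang, scheme=None):
    """Build wikitext for a transliteration entry in the proposed layout.

    The "romanisation of" template and its parameters are taken from the
    proposal above and are assumptions, not an existing template.
    """
    params = [target, f"sc={sc}", f"lang={lang}"]
    if scheme is not None:
        # Optional parameter suggested in the proposal for naming the scheme.
        params.append(f"scheme={scheme}")
    template = "{{romanisation of|" + "|".join(params) + "}}"
    lines = [
        f"=={language}==",         # L2 header: the source language
        "",
        "===Transliteration===",   # L3 header, as proposed
        f"'''{headword}'''",       # bold headword
        "",
        f"# {template}",           # numbered line pointing at the native-script entry
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(romanisation_entry("Russian", "da", "да", sc="Cyrl", lang="ru"))
```

Because one Latin spelling can be a transliteration from several languages, a page such as da would presumably carry one block like this per source language.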

I would just like to go on record as saying that I am completely and utterly against this. I would like to reserve the right to be smug when this whole thing goes down in flames (on the other hand, if it succeeds, it reserves everyone else the right to ridicule me). -Atelaes λάλει ἐμοί 03:53, 9 August 2008 (UTC)
It's the other way around. Transliterations should be included as accented forms with unaccented entries redirected to them. Either way, the above format seems to work out well. I'd support implementing it.--TBC 02:51, 9 August 2008 (UTC)
I would support adding the L3 transliteration header at the original script entry that could have transliterations in several popular schemes used (like for Russian, Hebrew, Greek, Sanskrit, Arabic..). "Transliterations" for lots of languages provided here in the tr= parameter for {{infl}} or {{t}} are in fact some kinds of romanizations which take into account phonetic peculiarities. That way transliterated lexemes would come up in search results, and some contributors' complaints of giving undue prominence to some ad-hoc or rarely used schemes could be resolved. --Ivan Štambuk 09:37, 9 August 2008 (UTC)
That is a wonderful idea, and by far the best I've heard yet on this topic. —RuakhTALK 17:26, 9 August 2008 (UTC)
The focus on outcome of search seems likely to give a useful answer without causing proliferation of low-quality entries.
In line with a general desire to avoid wasting space on the first screen for an entry that a user sees, perhaps this heading could appear toward the bottom of the page among Anagrams, References, Shorthand, and similar headings. There is a risk that the user will think that a mistake has happened, but the Language header should provide a clue and there will be some reassuring English to be seen for those unfamiliar with the headword's script. DCDuring TALK 17:53, 9 August 2008 (UTC)
Regarding placement on the original script page, I'd say that the L3 "Transliteration" header should be after information about the meaning of the word (e.g. after synonyms, antonyms, shorthand, related and derived terms) and before information about other words (e.g. anagrams, see also) and meta-information about our definition of the word (e.g. references, dictionary notes).
If there is more than one widely used transliteration scheme then these should, imho, all be included, listed either in strict alphabetical order (by scheme (my preference) or by transliteration) or "standard"/"official" schemes listed alphabetically first followed by other schemes, also alphabetically.
I would still like to see an entry at the transliterated spelling, as where this coincides with words in other languages that use e.g. Latin script natively there would be no indication that we have any information about the word. For example, да (da) is transliterated as "da" in at least one scheme, but as da is a word in at least 27 languages (there may be others we don't yet have entries for) and is also a translingual symbol, the transliteration would not appear to the casual user at all, and would presumably be overwhelmed in search results for those who know to "search" rather than "go". These entries would not be of any less use or any lower quality than the vast numbers of existing plural, verb form, adjective form, etc. we currently have (see for example da#Polish). Thryduulf 18:33, 9 August 2008 (UTC)
See references to other scripts may not help users much if there is more than one choice or if the transliteration entry is long (like "da"). In practice, a transliterated spelling entry might be subject to "attack" by RfV. Appearance on the "original script"/"non-Roman script" page would not have to meet attestation (unless we decide otherwise). It would seem that it should even be botable. DCDuring TALK 20:00, 9 August 2008 (UTC)
I don't understand either of your points. Taking the last one first, if a transliteration is acceptable on the original script page without attestation (i.e. да -> da), why would attestation be required for the appearance on its own page (da -> да)? If attestation of a transliteration is required then surely attestation that "да" is transliterated to "da" is also attestation that "da" is a transliteration of "да"? In the latter case, it would be silly to duplicate the attestation and so all of it would be presented on one of the pages (I have no preference which). Say the attestation is required on the original script page ("да"), any rfv on "da" would automatically be an rfv on "да" also and synchronising notices would be easily botable. If attestation is already present at "да" then an rfv launched on "da" would be an immediate pass.
In practice, we do not require attestation for much content. In principle, perhaps, we do, but we have no procedures for challenge, attestation/reference standards, etc. We have vast numbers of redlinks and other unattested or unsupported content on our pages. OTOH, as soon as something becomes an entry, we have procedures. Any departure from the procedures requires a VOTE, which is not exactly a bureaucratic formality. Anyone could challenge an entry. DCDuring TALK 21:31, 9 August 2008 (UTC)
Regarding your first point, if you are talking about noting multiple transliterations on the original script page (e.g. "да") then these would be identified by which scheme they are with links to explain the scheme. If we note only significant/official schemes then there will not be many to choose from. Most users looking up transliterations on the original script page will understand about different schemes, and using the same scheme for all words in a single translation is hardly rocket science.
This is not what concerned me. DCDuring TALK 21:31, 9 August 2008 (UTC)
If you mean confusion on the transliterated script entry (e.g. "da"), then I don't understand this at all. In everything we do we assume the user knows what language they are interested in, so in the example case the user will just click "Russian" in the table of contents, and they will be taken to the section of interest to them. This is no different to someone looking up the Polish entry. If you mean they will be confused by there being more than one transliteration scheme, then no they won't, because the others won't appear on this page. Say one transliteration is "da" and another is "dar" (I don't know if this is a transliteration or not), then the only way someone looking at "da" will know about "dar" is by clicking through to the original script entry and then looking at the transliterations section. Thryduulf 20:51, 9 August 2008 (UTC)
I was thinking that we were talking about "see" references on the top of the entry. But even if we are not, it is not always clear what language something is in. Someone may "know" something is "South American Indian" and not find anything that remotely resembles that in the ToC. If she were a linguist she might recognize "Guarani" on the list.
That said, I am not particularly opposed to such language sections in entries like "da", though I question the benefit of new headwords populated solely with romanizations. I am strongly in favor of multiple accepted romanizations on the "original script" entry. I am only taking the point of view of a relatively naive EN-5/(all else 1 or 0) user hunting for the meaning of something they just read (probably embedded in English text), not the point of view of a career linguist, language or linguistics major, or even an enthusiastic amateur linguist. If Wiktionary is basically only for the latter classes, then my comments may be safely ignored. DCDuring TALK 21:31, 9 August 2008 (UTC)

After a bit of thought, I have to oppose entries for transliterations. This would multiply the potential number of entries in Wiktionary by an order of magnitude for romanizations only, but potentially another one if we allow cyrillizations, hellenizations, etc. Huge as the volume of actual words is, it would get overwhelmed by ten to a hundred times as many bot-created trans-lingual entries.

Transliterations should be attached to native-script entries, either as a subheading or perhaps in a separate namespace. Bot-generated wordlists of romanizations with links to the source could help with the search problem. Michael Z. 2008-08-09 23:45 z

Being that this is the English Wiktionary, we would be focusing mostly on romanizations. Mandarinizations, hellenizations, and the like should only be included if the transliterated word has entered into the native vernacular, like with 吉他 (guitar) and 麻吉 (match). Size shouldn't be a problem at all, especially considering the fact that Wikimedia has around 350 high-performance servers, while each Wiktionary entry is just a couple of kb.--TBC 01:11, 10 August 2008 (UTC)
“Focussing mostly” on A, doesn't mean that Joe Schmoe and his bot won't create a zillion entries for B, C, D all the way through ZZZ. If we were to include various romanizations of, say, Chinese and Russian words, why wouldn't we include Cyrillizations of Chinese and Georgian, or hellenizations of Japanese and Thai? Why would you just assume this “native vernacular for other writing systems than Roman” rule?
I'm not concerned about server performance, but about search results. Search for капелюх, and you'll have to find the actual entry in the haystack of 50 romanizations, hellenizations, sinozations, etc.
Wiktionary's basic unit is the entry, which currently represents exactly one term. By adding entries for romanizations, cyrillizations, etc., a single term would be represented by potentially dozens of entries, all subordinate to one of them. This would be a mistake. Michael Z. 2008-08-10 01:51 z
"Why wouldn't we include Cyrillizations of Chinese and Georgian?" Because we are an English wiktionary. Our focus is on Romanizations. The Cyrillic Wiktionary would be the place for Cyrillic transliterations. By "native vernacular" I'm referring to foreign words that have been used so much in a language that they become assimilated into the vernacular of the country, like how deja vu has been incorporated into English.--TBC 02:16, 10 August 2008 (UTC)
Sounds wrong to me. We are a dictionary of all languages, in English. I don't read Greek, so I can't go to the Greek wiktionary to find out how Cyrillic words are transcribed into Greek. We would only use romanization to make foreign-script words clear to our readers, but we could have entries for all types of transliterations.
Native vernacular is a separate issue—these are attested borrowings, and should already be added to English Wiktionary. Michael Z. 2008-08-10 17:25 z

I also strongly oppose transliterated entries for foreign lemmas unless there is a generally accepted transliteration system for a particular language that is likely to be used by most of our readers for that language, and the editors of the language in question agree that it is worthwhile to have those entries. I have had this discussion with numerous people about entries in Modern Greek, and I would block a move to introduce such entries for that language. I do not want to see twenty varieties of greeklish for each real entry, or even four. We already put one transliteration into the real entry, and I don't want to see twenty varieties there either. Dodde's suggestion below seems eminently reasonable to me, with the caveat that the editors of each language determine which, if any, transliterations go into the actual lemma, if we decide to put them there (and you're not interfering at all, Dodde). -- ArielGlenn 01:01, 10 August 2008 (UTC)

Agreed, except I would also disallow soft-redirects. As Michael Z pointed out above, bot-generated word-lists (maybe in index:) could be created. --Bequw¢τ 06:32, 11 August 2008 (UTC)
I strongly disagree. I have no problem with word lists, but I fail to understand how we expect people to find them when the word they want to look up corresponds with a word in another language?
For example, I'm reading a Russian text that has been transliterated into English. I want to look up "da", I have no idea what that is in Cyrillic or how to type it if I did, so naturally I enter "da" in the search box and hit go (as previous experience on Wiktionary and Wikipedia has told me this is the quickest way to get to the entry). I'm presented with a page that looks promising as it has an entry for "translingual" and then lots of other languages. I scroll down but can't find Russian. It appears Wiktionary doesn't have an entry for "da" in Russian, I'll have to go elsewhere.
Contrast this with my finding an entry for Russian on that page, which says "this is a transliteration of "Да"." So I click on the link and get taken to the page which tells me what I want to know. I go away happy.
How is making information harder to find helping our end users? Remember that even where we don't have an entry, there is no guarantee that the word list or original script entry (or anything else useful) will be near the top of the search results. Thryduulf 11:11, 11 August 2008 (UTC)
I agree with Thryduulf about the need for including transliteration in entries that would interfere with a "go" search finding the original script entry. I would hope that we could make periodic runs of bots or analyses of our soon-to-be-regular-again XML dumps to locate the cases where this occurs rather than attempt to anticipate the problem by proliferating romanization-only entries. DCDuring TALK 11:56, 11 August 2008 (UTC)
Da may be a poor example. Judging by a quick Google Books search for da tovarish, it can be included as an English entry (or can it? It seems to appear in dialogue, but doesn't require a translation). Michael Z. 2008-08-11 20:47 z
Before we jump to another example in light of your point, a point about the English word "da" in the sense you have just mentioned. Once some abbreviations, etymologies, and other top matter are added to da#English and da#Translingual, the utility of the transliteration entry in the English section would decline, because it would be buried. If the user thought it might be Russian, then the language header in the ToC would be a useful navigation aid. IOW, both the attested English "da" and the romanization under Russian (and Ukrainian and how many other languages using Cyrillic script?) would be useful. DCDuring TALK 21:28, 11 August 2008 (UTC)
I think I see your point: easier to find because it also appears in the top level of the TOC. The counter-argument is that everything else is marginally harder to find, because there is a redundant item in the TOC: listing it twice dilutes the page, and having entries for both да and da dilutes the dictionary. The Ukrainian word happens to be так (tak), but I'm sure da would have another half-dozen or more transliteration entries from other languages, and under the Russian header it would have a dozen “senses”: linguistic transliteration of да, ALA-LC transliteration of да, BGN/PCGN transliteration of да, etc.
The theoretical example at канадієць#Romanization has about 10 spellings in 16 romanization schemes. I'd hate to have to maintain separate entries for all of these.
This is why I think romanizations shouldn't have their own top-level sections, but should appear as a subsection of the native spelling—eventually some entries or search results would surely be swamped in more transliterations than natively-spelled terms. Michael Z. 2008-08-11 22:05 z
The way I think of the L2 entries for romanisations is that they will not require any maintenance, as they will just be pointers to the information at the native script article in exactly the same way that plural entries are just pointers to the entry of the singular form.
I also completely do not understand your point about "diluting the dictionary"? How is providing useful information to our end users in a way that makes it easy for them to find diluting the dictionary? Thryduulf 23:57, 11 August 2008 (UTC)
Thryduulf's idea is excellent. We must keep our users foremost in our minds, and his/her proposal does this as regards pointing them to the information they are looking for. 00:19, 12 August 2008 (UTC)
Diluting the dictionary is putting data into the database redundantly, without adding any information. Diluting the dictionary is decreasing the signal-to-noise ratio far below one point zero.
For example, multiplying the number of L2 headings by an order of magnitude or two so that the vast majority represent transliterations which only link to terms, and only the small minority represent actual terms. Better to include all of the romanizations in a single subheading under a term—this would represent the exact same information, make it possible for the reader to take it in all at once, and avoid an endless maintenance nightmare. Michael Z. 2008-08-13 23:27 z
We are adding information, we are adding information on the romanisations. The L2 headers at transliterated script titles serve a different purpose to a subheading on the original script title. In the case of the latter, I fully agree with you that all the romanisations we choose to support should be under a sub-heading as in the example elsewhere in this discussion; they are there to benefit people who know the original script form and want to know about romanisations. The former are there so that those people who know only the romanisation can find the native spelling of the word they are interested in and from there click to the definition, etymology, pronunciation, inflection, etc of the word at the original script entry. At the transliterated script entry no information will be presented other than the language, romanisation spelling, a note that it is a romanisation and a link to the original script entry - for example:

# {{romanisation of|хлеб|lang=ru}}
which would be displayed as:



  1. Romanisation of хлеб.
Its purpose is for people who do not know what the original script entry is, so they can't look it up there. The template format will require no maintenance, except extremely occasionally, when it can all be done by bot. There is nothing to synchronise, as there is no need to mention anything about the word (POS, pronunciation, meaning, etymology, inflection, other romanisations, etc, etc) as there is no need for the person looking it up to know that as it is all at the original script entry. AIUI, as the format above means there are no links on the page (the only link is done by the template) it would not count as an entry for our statistics in the same way that misspelling entries are not counted currently.
I still dispute that this is diluting the dictionary - it is adding valuable information, and by your logic of L2 headers that only link to terms not being useful entries, then we should have to delete the vast majority of plural entries, verb form entries, adjective form entries, feminine/masculine/neuter form of entries, alternative spelling entries, etc. Anything that determines how many entries we have simply by counting L2 headers would just need to be slightly reprogrammed to exclude those that contain the romanisation template if the author desires to exclude those from the statistics. Thryduulf 00:00, 14 August 2008 (UTC)
I'll concede that this would be adding another entry point for finding this information, but I still think it's redundant and unnecessary. Searching for “xleb” gives useful results. Jumping to xleb gives “Full text search” as an option. Adding multiple romanizations to main entries will just improve the way this works.
(Wiktionary's 404 page is crappy, and would be improved with some redesign: clicking “Try exact match” is essentially a useless and confusing self-link, and the text search results should automatically be included below the 404 page. It also sucks badly that clicking a redlink like xleb gives you a completely different page than jumping to the name. But these are separate interface issues.)
Romanization wouldn't be quite as minimal as you indicate. Some romanizations would require, e.g., separate Russian, Ukrainian, Belarusian, and translingual headers. They might link to the same or different main entries.
And it's not the same as plurals and other conjugation entries. Those represent different words, not foreign respellings of words. Michael Z. 2008-08-14 06:31 z
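Thryduulf's remark above, that a counter based on L2 headers would only need slight reprogramming to skip romanisation pointers, could look something like the sketch below. It assumes the hypothetical {{romanisation of|...}} template from the example in this thread and an assumed ===Romanisation=== subheading; it is not how the actual statistics are computed.

```python
import re

# Level-2 headers look like "==Russian=="; deeper headers ("===Noun===") must not match.
L2_HEADER = re.compile(r"^==([^=].*?)==\s*$", re.MULTILINE)
# A definition line consisting only of the (hypothetical) romanisation template.
ROMANISATION_LINE = re.compile(r"^#\s*\{\{romanisation of\|[^}]*\}\}\s*$")

def split_l2_sections(wikitext):
    """Return (language, body) pairs, one per level-2 section."""
    matches = list(L2_HEADER.finditer(wikitext))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections.append((m.group(1).strip(), wikitext[m.end():end]))
    return sections

def is_romanisation_only(body):
    """True if the section has definition lines and all of them are romanisation pointers."""
    def_lines = [ln.strip() for ln in body.splitlines() if ln.strip().startswith("#")]
    return bool(def_lines) and all(ROMANISATION_LINE.match(ln) for ln in def_lines)

def count_entries(wikitext):
    """Count L2 sections, skipping those that are only romanisation pointers."""
    return sum(1 for _, body in split_l2_sections(wikitext)
               if not is_romanisation_only(body))

# Illustrative page text, loosely modelled on the leb example in this discussion.
SAMPLE_PAGE = """==Tatar==
===Noun===
'''leb'''
# [[lip]]

==Macedonian==
===Romanisation===
# {{romanisation of|леб|lang=mk}}
"""

if __name__ == "__main__":
    print(count_entries(SAMPLE_PAGE))
```

On the sample page the Tatar section is counted and the Macedonian pointer section is not, which is the behaviour described for misspelling entries.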

Partial Consensus?

I think there are some elements of consensus around Stambuk's initial proposal, amplified by Zajac above and Dodde below. These romanization/transliteration entries need to be prevented from overweighting "real" entries. So, in broad outline, the proposal would be that:
  1. original script entries should have all recognized transcriptions toward the bottom of the language section in a searchable format, preferably under a search/hide bar.
  2. a transliteration that was attested in English would appear in the English language section.
  3. an entry that prevented a "go" search from reaching the original script entry would contain a romanization that appeared under the target Language header, but was limited to a link to the original-script entry (and a minimal gloss ?).
  4. we do not have consensus on romanization-only entries. DCDuring TALK 00:15, 12 August 2008 (UTC)
#3 is a sticking point. Is there anyone here who supports romanizations but only under those conditions? To me it sounds like a good reason to include all transliterations, but I'm not sure I'm in favor of it because it doesn't restrict the type of romanizations, e.g. even the very rare ones. DAVilla 05:02, 13 August 2008 (UTC)
#3 - adding a gloss would be overkill because it would unnecessarily burden editors with checking whether any gloss pointing to the FL entry needs to be fixed every time they tweak its definition(s). Moreover, some of these FL entries have dozens of def. lines, and PoS headers with multiple etymologies. If these kinds of "soft redirects" for romanizations that are already present as some other language's entry are to be made, they must be 100% bottable and there must be a way to distinguish them from real entries (i.e. words that people actually use). Similar to what is done for misspellings (except that people write those..), but more so. Maybe even adding an L2 header would be too much - how about enhancing {{see}} to add a collapsible table at the page top that would list possible FL entries, by language name, that have {PAGENAME} as a transliteration? What constitutes a "valid romanization" should be decided on a per-language basis, and a bot could collect those from ===Romanizations=== sections and update the appropriate targets. Just some thoughts.. --Ivan Štambuk 17:48, 13 August 2008 (UTC)
I have stricken the gloss possibility. I'm not sure there is a consensus for this.
I like the idea of enhancing {{see}} so that on existing pages that are transliterations (e.g. da) it links to native-script entries (да). Since {{see}} is only used on existing entries (an entry can't just be a {{see}} line) it maps very well to the situation in #3. It helps find the native-script entry faster but also clearly shows transliterations aren't L2 level "worthy" (and therefore don't have their "own" entry). --Bequw¢τ 07:37, 16 August 2008 (UTC)
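The suggestion that a bot collect romanizations from ===Romanizations=== sections and flag the ones that collide with an existing page title could be sketched as below. The bullet format inside the section ("* scheme: spelling") is purely an assumption, since no format has been agreed, and the page texts are made-up samples rather than real entry content.

```python
def extract_romanizations(wikitext):
    """Collect unique bullet items found under any 'Romanizations' heading."""
    found, in_section = [], False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("="):
            # Any heading switches the section on or off, whatever its level.
            in_section = stripped.strip("= ").lower() == "romanizations"
        elif in_section and stripped.startswith("*"):
            # Assumed item format: "* scheme: spelling"; the scheme label is optional.
            item = stripped.lstrip("* ").split(":", 1)[-1].strip()
            if item:
                found.append(item)
    return list(dict.fromkeys(found))  # drop duplicates, keep first-seen order

def find_collisions(native_pages, existing_titles):
    """Map each existing Latin-script title to the native-script pages romanized that way."""
    collisions = {}
    for native, wikitext in native_pages.items():
        for rom in extract_romanizations(wikitext):
            if rom in existing_titles:
                collisions.setdefault(rom, []).append(native)
    return collisions

# Hypothetical page texts for illustration only.
SAMPLE_PAGES = {
    "да": "==Russian==\n===Particle===\n# [[yes]]\n===Romanizations===\n* ALA-LC: da\n* BGN/PCGN: da\n",
    "леб": "==Macedonian==\n===Noun===\n# [[bread]]\n===Romanizations===\n* leb\n",
    "хлеб": "==Russian==\n===Noun===\n# [[bread]]\n===Romanizations===\n* linguistic: xleb\n",
}

if __name__ == "__main__":
    # Only titles that already exist as pages can block a "go" search.
    print(find_collisions(SAMPLE_PAGES, {"da", "leb"}))
```

The resulting mapping is the kind of data an enhanced {{see}} line would need: for each colliding Latin-script page, the native-script entries to link to.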

Minimal Consensus

Other possible consensus points:

  1. Might there be a consensus in favor of 1 only?

  2. Might there be a consensus in favor of 1 and 2 only? DCDuring TALK 18:44, 13 August 2008 (UTC)

No. 2 is already common practice, so there's no need to confirm consensus. Practically every English borrowing from a non-Latin-alphabet language can be considered a romanization, and clearly deserves an entry in Wiktionary if it meets WT:CFI.
I didn't mean to reopen that settled matter. DCDuring TALK 03:58, 14 August 2008 (UTC)
No. 1 I am in favour of, but perhaps some of the practical details should be discussed or tested on a limited scale? I'm sure that plenty of issues will arise and be resolved, but a few questions come to mind immediately. Michael Z. 2008-08-14 03:44 z

Another direction (romanized entries)

I am sorry to interfere with the discussion here. I just don't understand why everyone seems to wish to make easy things so complicated. This system with romanized entries is just so much work, and it is difficult to maintain at good quality. Why not:

  • focus the information in the main entry (use a header or whatever to include information about romanizations/transliterations/transcriptions, with and without diacritics)

If doing this, wouldn't the appropriate entry show up in the search list (which seems to be one of the most important issues)? If not, a bot can easily and effectively create soft redirect entries including a suitable template (or add the template to existing entries), stating "Japanese: did you mean [<proposed word/term>]?" {template name|<language name>|<proposed word/term>}.
This can be used for any script, any language: Japanese, Mandarin, Greek, you name it. It also helps avoid cluttering real entries with romanizations/transliterations/transcriptions that aren't even words. It's like having separate entries for all IPA-interpretations of words in all languages. With this proposal no information will get lost, it will be easier to maintain and easier to keep high quality and it will be less time consuming for anyone involved, and it will give the persons using the dictionary an easier time to find what they are looking for (those who are looking for romanized words will find it, those who don't will not have to see and read through all the "clutter" it means to add these full-size entries suggested in the discussion above). Please, at least consider it... ~ Dodde 00:17, 10 August 2008 (UTC)

Having an entry for gao doesn't clutter anything; it brings our users to the information they need. Further, romanized entries such as nath could alternatively be simply redirects to the native script, also causing no clutter. 01:08, 12 August 2008 (UTC)
I've already told you that nath means nothing in Sanskrit - the correct transliteration is nātha. Unlike e.g. Hindi, word-final schwa is pronounced. Since there is unlikely to be a language in which nātha means something, a search on it will yield the proper Devanagari (or any other Indic script) spelling you seek, which doesn't exist yet though. --Ivan Štambuk 17:56, 13 August 2008 (UTC)
  • Comment - Point of fact: if nath doesn't mean anything and always has an "a" at the end, why do I find a definition for the word "nath" at नाथ? As most people cannot type with macrons in the search box, we are left with the situation that typing nath into the search box, since this entry was deleted entirely a week or two ago, does not bring users to the information they will come here seeking. This is simply one example, but important as an indicator of the failure of the current system to bring users with Latin keyboards quickly and easily to the information they seek. 04:16, 14 August 2008 (UTC)
Language doesn't always abide by the "correct" rules. I believe you completely, but I would also believe three citations if they were ever found. Yes, I know that nothing is forthcoming as yet, probably ever, but it's not impossible that our anonymous friend isn't the only one who's ever confused the two.
As to the correct transliteration, I prefer simple in or out criteria. I'm not a fan of the if-something-exists-then-an-entry-is-all-right principle. How can we be absolutely sure that nātha doesn't mean anything in any other language? DAVilla 03:07, 14 August 2008 (UTC)
You don't find a definition for the "word nath"; you find a definition for the word transcribed as nātha. nath means nothing in Sanskrit. Your conclusions on what people "can" or cannot type are needless generalizations. I'm sure that >95% of people who have any kind of interest in Sanskrit are familiar with IAST. If the proposal for a ===Romanizations=== header is adopted, a search on nath would yield search results with the Harvard-Kyoto trans. nAth for the Hindi word, which gets you to the proper Devanagari (or any other Indic script) spelling. The English word w:nath is something completely different OTOH. --Ivan Štambuk 10:46, 14 August 2008 (UTC)

Searching for romanizations works just fine, and it doesn't need a zillion romanization entries.

Currently, searching for nātha or natha puts नाथ at the top, and nath puts it at the sixth result (with some other pages including “nath” before it).

When the various romanizations are added to a new section in main entries, then search will work even better. Michael Z. 2008-08-14 05:08 z

Yes, some Romanisation entries will have more than one language target on the page, this is why the L2 language header is used. However the point is that each language entry will be minimal and require no maintenance. Any number of languages multiplied by no maintenance is still no maintenance.
Searching for some romanisations works, although I would not describe it as "fine", iff you know how to do it. Where the romanisation corresponds to a word in a language that uses Roman script natively it fails big time. Type "leb" into the search box and click go (as users have been conditioned is the best way to find what they are looking for on sites that use MediaWiki, not just Wikimedia sites) and you are taken to leb, a Tatar noun meaning lip, with a note at the top of the page to "See also łeb" (we don't have an entry for this yet, but I'm guessing it's a Polish word).
If you clicked "search" instead of "go" (and why would you?) you have to go to the 6th result to get to bread, work out that the hit was from a translations section, find which translations section it was in, then click леб to get to the Macedonian entry. Adding the Romanisation to "леб" might help, but only if it comes higher up the search results than the translation at bread, and even then only if the user knows not to hit go for a transliteration but to click search instead (how do we expect them to know this?)
If the word you are interested in however is transliterated as "da", well you're out of luck. After finding nothing on the page for "da", where we have 28 language sections (so how will adding a Romanisation entry devalue this?), you try clicking the search button and you have to wait for the 116th result (scrolling past hits from user and project pages, many mainspace hits which do not even have a snippet of page text and others that give no clue as to the language), and then only if you know the word you want is a transliteration of "да", which if you are looking up "da" you quite probably don't. I gave up searching for anything telling me it was a Russian word after 140 hits, as the last two pages contained partial matches of words (e.g. Danish, darling, Mandarin, etc). Most users would have given up before they got to the 40th hit.
Please can you explain in the light of this why you are against making it easy for our users to find what they want? Thryduulf
Of course the Macedonian леб is not found, because the entry doesn't exist yet, because Tatar леб doesn't include any romanization at all, and both Tatar leb and леб fail to link to each other as alternate forms. This is no argument for adopting new mechanisms, when existing ones have been utterly neglected. I'll add some of this, and let's see if things improve any.
Likewise da fails to find да, partly because the easily attestable English-language use of Russian da lacks any entry, and partly because the first headings in да lack romanizations. Having these basics in place would improve its findability. Of course, adding a number of romanization entries would, too. But again, let's fix the problems with the status quo before adopting new mechanisms.
I have very strong reservations about giving romanizations the status of native-script terms. Do you think that grouping all romanizations under a single L2 header could do the trick? This could be visually distinctive from regular headings for dictionary entries, a sort of appendix at the bottom of a page. Michael Z. 2008-08-14 20:21 z
Our users should be able to find the Russian "da" using the search box, no matter how that is done (most likely through a link to the native script entry under a "Russian" heading at da). The current situation is most unsatisfactory, as expressed by Thryduulf. 07:48, 15 August 2008 (UTC)
Our users should spend half an hour learning Russian Cyrillic, so they'd have no problems finding да (da). There's no reason to promote these transliteration pseudo-entries to proper L2 sections just to resolve deficiencies of the search mechanism (which is completely inappropriate for a dictionary and cannot "register" transliterations as search keywords). They must be distinctly degraded as opposed to real entries, not cluttering the original Latin-script page, by being grouped under ==Transliterations==, ==Translingual== or something, or handled like {{see}} (e.g. {{tranlit|ru|sr|bg|cu|..}} generating something like "This could be a transliteration for Bulgarian, Russian, Serbian, Old Church Slavonic... word", whenever a trans. collides with some {PAGENAME}). --Ivan Štambuk 08:37, 15 August 2008 (UTC)
"Our users should spend half an hour learning Russian Cyrillic," I'm sorry but this is one of the most ridiculous suggestions yet made! Why should we force users to learn another script when what they are reading isn't in that script? How and where do you propose they learn Cyrillic? We are here to make a dictionary that is useful and accessible to the largest number of people possible, not to be bigoted and judgemental about people reading or making transliterations (and yes, I really do think your suggestion is that offensive).
Having an L2 language heading with a "Romanisation of" entry is no more cluttering up the original script page than the Polish and Portuguese entries at da are cluttering up the English and Danish ones. A misspelling of entry under an English L2 header is no more degrading the real English entries or elevating them to the status of real words than a romanisation entry would if handled as I am proposing.
And there is equally no point in saying that the romanisation entries would clutter up the native script entries - just like the Russian entries do not clutter up the Serbian entries or the Pinyin entries the Traditional Chinese ones, if you are not interested in something you either don't know about it or ignore it. It works fine for every other page on Wiktionary. Thryduulf 09:57, 15 August 2008 (UTC)
Well to me it's even more ridiculous to suggest that there exists somewhere a significant population of potential Wiktionary users who are learning Russian or Sanskrit but are totally ignorant of Cyrillic or IAST/Devanagari. Even if there is (which I seriously doubt), how hard is it to google some Latin-Cyrillic/Devanagari converter, or switch to a different keyboard layout in your OS in a few mouse clicks?
Misspellings are already degraded by not being counted as real words, these transliterations should be degraded even more because unlike misspellings, no one practices writing of spoken language that way. Transliterations are primarily scholarly tool, not dictionary input method.
Pinyin entries (real ones, with tone marks) rarely coincide with other language entries, but this is completely different - there are hundreds (if not thousands) of languages written in various scripts in which something can be in some romanization system trans. as da.
Sorry if someone was offended, but I see the promotion of transliterations to L2 language sections as very detrimental to the level of professionalism of this project. Basically it boils down to using plain human ignorance as an argument. --Ivan Štambuk 12:16, 15 August 2008 (UTC)
Of course I read all these alphabets but I wish all users, with Latin keyboards, to be able to quickly and easily find the information they are looking for. Regarding "da," that would apparently take looking through up to 100 search results to find it. It is a most unsatisfactory situation and elegant solutions have been proposed. It doesn't matter which solution is adopted, simply that one or more be implemented so that our users will be able to quickly and easily find the information they are seeking. 12:26, 15 August 2008 (UTC)
At this moment, looking up da one finds this word in Japanese, Mandarin, and many other languages, but not Russian. This is a very poor situation, which needs to be rectified so that our users will be able to quickly and easily find the information they are looking for (in this case, confirmation that "da" is a Russian word, and in fact exists, just as the Mandarin or Japanese do). It doesn't matter how this is done, just that it is done. 12:33, 15 August 2008 (UTC)
But da is not a Russian word, да is. A user can always refine search results by using additional keywords, such as russian da, which would list да as the #4 result. I don't understand why everyone is so obsessed with this da - 98% of Russian words transliterated are not such trivial monosyllables that coincide with hundreds of other language entries, but reflect peculiar Slavic word structure, and would probably get immediately listed among the top 5 search results. The situation is poor for a very specific group of people - those who cannot read or type Cyrillic, don't know how to switch keyboard layout to Cyrillic in their OS or text processor, or that googling something like "latin cyrillic converter" could help them, and who are looking up a trivial word which would yield lots of search results. Are they worth all this trouble? --Ivan Štambuk 10:48, 16 August 2008 (UTC)
Since this discussion started, I have added appropriate transliterations and “alternative form” links to leb and леб, and researched some citations and added a subheading for the Russian da in English. Results:
  1. Searching for leb now shows леб as result #3. The existing mechanisms work just fine in this case, and this example shows no need for a romanization entry.
  2. Да will never be easy to find by searching for da. Try it in Google, Wikipedia (which does have Russian da but no да), or whatever. There are too many da bombs and district attorneys in English, da Villas, etc. in Latin-alphabet languages. This is not a failure of Wiktionary, just a fact of language that да is just one of a hundred da’s.
 Michael Z. 2008-08-17 04:41 z
I've also added romanization sections to the entries батяр, горілка, and канадієць. Using “go” or “search” for any romanized version of these gives the reader good results.
(Unfortunately, actually typing a URL for the transliterated name requires you to click another button to get the same results. This is a case of bad design of the 404 page, which should just include the search results at the bottom, and not any indication that romanization entries are needed.)
Centralized romanization subheadings seem to do the trick for searching, and they convey the information well. Romanization L2 entries would be a distributed mess to maintain, and offer very little additional return for the effort.
The only situation where I can see a case for romanization entries is where they correspond to a Latin-alphabet word in some language or languages. Searching for romanizations of words like Европа (Evropa), Америка (Amerika), Улуру (Uluru), абдомен (abdomen), Тетрис (Tetris), абдуктант (abduktant) shows a wide variety of results; finding the Cyrillic version gets harder when there are more different uses of the Latin version. I might consider adding original-script links to {{see}} referrals:
See also: amerika and Америка
or adding another section to the top or the bottom of Latin-alphabet entry pages, saying something like “Amerika also represents Америка in the Cyrillic alphabet”. I think these should not be full-blown L2 headers.
I see a potential problem of dozens of informal and unattestable romanizations being added to such links. I also see this multiplying with the addition of other transliterations to every Latin and non-Latin entry, overlapping with translation sections: e.g., Tetris = Тетрис = Тетріс = テトリス = טטריס. Michael Z. 2008-08-17 05:29 z

Not being able to find Russian da is a very serious problem for our project and its users. As pointed out earlier, at da we find Japanese da, Mandarin da, and da in many other languages etc., but no Russian da, giving the impression that no Russian term with this sound exists. Our users with Latin keyboards (as this is the English Wiktionary) should be able to quickly and easily find the information they are seeking. Nihon or riben are similarly not written in a Latin script, yet our users may find this information quickly and easily. This very disadvantageous situation needs to be remedied. Several solutions have been presented and one should be adopted. This is just one example, but is indicative of the difficulties our users have in finding the information they are seeking under the current situation. 05:42, 17 August 2008 (UTC)

No, not being able to conclude that in Russian there is another script other than the Latin is a serious problem about the searching person. And if he has attained this conclusion and is curious about Russian language, then he will find a table of correspondence Latin-Cyrillic and write "д" in lieu of "d" and "a" in lieu of "а". The same applies for most non-hieroglyphical languages, such as Sanskrit and others. For hieroglyphical entries Romanisation is indispensable. User:Ivan Štambuk evaluates the time necessary for making oneself familiar with the Cyrillic alphabet at 30 min., but methinks it could last even shorter. Therefore I consider Romanisation of Russian reproachable (in the titles, not in the text of the article with a Cyrillic title) and unnecessary. Bogorm 09:11, 17 August 2008 (UTC)

If you would address the issue of other languages appearing under da but not Russian, it would be very helpful. The situation remains that our users with Latin keyboards are unable to find the Russian da under the entry da, while finding Japanese, Mandarin, and many other languages, giving them the impression that no word with the sound "da" exists in the Russian language. This is a very, very serious problem and needs to be addressed. Several solutions have been proposed, and at least one of them should be adopted. Of course I can read the Cyrillic alphabet, and even type in it, but our users should be able to quickly and easily find the information they are looking for (in this case, that the Russian language has a word that has the sound "da"). 18:25, 17 August 2008 (UTC)

Russian да is now linked from the second English definition (da#Etymology 2), before Japanese and Mandarin, so the situation appears to satisfy your needs.
But the primary organization of terms in the dictionary is by their normal native spelling, not by their pronunciation. You can also find да by searching for da, but it is result number 107 (after Mandarin, but before Japanese, I think). The simple fact is that there are dozens or hundreds of terms which include the Latin fragment da in them, so similar-sounding terms in Cyrillic, Arabic, Devanagari, and other writing systems will not be as easy to search for.
This is not a problem resulting from Wiktionary's structure, but a simple matter of you get what you are searching for and not something else. The Latin alphabet is used in Japanese and Chinese, but it is not in Russian. You won't have better luck searching in Google or anywhere else, either. Michael Z. 2008-08-17 19:48 z
Thank you, but keep in mind that one hears Russian characters in English-language (i.e., Hollywood) films say "Da" quite often, so the term is well known to many filmgoers in the English-speaking world. 18:57, 20 August 2008 (UTC)
This sense is now the third sense on the page, in the first entry when you search for da. Are you saying that this is not yet addressed adequately? Michael Z. 2008-08-20 19:31 z

Romanization example

I added an example of how romanization could be added to the entry for канадієць (kanadijecʹ). Feel free to improve it or make different examples. Michael Z. 2008-08-10 01:44 z

Romanization should be a level 3 heading, subordinate to a language heading. It relates to the spelling of a term, not to a part of speech. It properly belongs next to pronunciation, but could be put after the POS to conserve space.

Some romanization schemes belong to a writing system, not to a language (like w:ISO 9), so they might belong in a level 2 heading. Might be simpler to just repeat them with each language heading on the page. Michael Z. 2008-08-10 02:00 z

Why not just create a separate entry for the romanized term that links to the main entry? Like what we do with plural entries?--TBC 02:24, 10 August 2008 (UTC)
A plural is a separate term, so it has a separate (usually minimal) entry. A romanization is a foreign expression of the same term, not even of a different spelling. I think romanizations should remain subordinate to the term, rather than multiplying the number of entries for it. Michael Z. 2008-08-10 17:59 z
A romanization entry would be minimal (in terms of content) as well, usually only providing a link to the main entry. Also, the romanized entries would only be included if there are sources verifying that the corresponding romanization system is in wide use, as to prevent an excessive amount of new entries.--TBC 01:00, 12 August 2008 (UTC)
In the romanization box, there should be more distinction between the name of the scheme and the word. Bolding the words is my initial thought of how to do it. Thryduulf 10:01, 10 August 2008 (UTC)
I agree it needs some visual help, but I'm concerned that bold makes it harder to, e.g. distinguish a háček ě from a breve ĕ. How about aligning the transliterations? Michael Z. 2008-08-10 17:59 z

Yes, I think this works better than the bold. Thryduulf 18:49, 10 August 2008 (UTC)

Beautiful. I think we should be doing this regardless of the issue of whether to include any of these transliterations as proper entries. Even where transliterations are already entered, I don't think anyone is supporting the inclusion of as many forms as you have accounted for, so this is more thorough than the way we're addressing it could ever be realized. DAVilla 19:58, 11 August 2008 (UTC)

Having these sections of bot-created transliterations at the original script entries might either, 1., facilitate creating the romanized entries themselves or, 2., obviate the need for them in many cases. In the meantime some entries will be easier to find for our benighted anonymous non-users of non-roman scripts or, as I like to call them, "anon-non-nons".
Is this one step ready for a VOTE? I haven't heard any opposing voices from non-anon-non-nons or other non-anons. DCDuring TALK 20:18, 11 August 2008 (UTC)
I'd like to see separate entries (just minimal ones, about the size content-wise of plural entries) for romanizations, that link to the main entries. I don't think this is ready for WT:VOTE yet, but it's certainly getting there. --TBC 01:05, 12 August 2008 (UTC)
We could also create redirects, but seeing that the "custom is not to create redirections to other articles" as per WT:REDIR, I don't see how that would work out.--TBC 01:11, 12 August 2008 (UTC)
If prior customs aren't bringing our users to the content they are seeking, the custom should be reevaluated (at least as regards the subject at hand). The example presented at канадієць seems quite fine. 01:13, 12 August 2008 (UTC)
Redirects are not an option because nath for instance might very likely mean something in a different language. In fact being so short it almost certainly does. But if we were to have additional transliterations, there is already a way to do them. We use "soft" redirects such as with any of the pinyin entries. DAVilla 04:49, 13 August 2008 (UTC)
Most of those pinyin entries "soft redirects" I've seen are in fact full-blown entries with etymology, definition lines...e.g. Aoyunhui. So Mandarin gets entries for both traditional and simplified spellings, pinyin with and without tone marks. Looks like maintenance hell to me. --Ivan Štambuk 10:56, 14 August 2008 (UTC)
But this is aside from the issue of separate entries. On the question of tables such as the one presented, you aren't opposed to that, are you? DAVilla 04:49, 13 August 2008 (UTC)

I'm in favour of including romanizations under a subheading of native-script entries. Practical questions arise.

  1. Choice of included romanization systems
    1. What are the criteria for accepting romanization systems? I included 16 real examples in канадієць—a few are widely used in English publications, a few are specific to a field (linguistics, geography), or an application (a passport agency), and some are obsolete or virtually unused, to my knowledge
    2. Where do we discuss the unclear cases?
    3. Where do they get listed? Some romanization systems apply to several languages. I suggest we keep an authoritative list, so different languages stand a chance of being handled consistently.
    4. Where are they documented? Point to each language's transliteration page, or refer to Wikipedia articles?
  2. “Wiki-romanizations”—A few of these are not attestable (e.g., my own pet peeve Appendix:Russian transliteration, created by Wiktionary because it is smarter than linguists and librarians). Do we include them?
  3. Infrastructure—romanization can be handled similarly to translations
    1. Templates?
      1. {{romanization-top}}, {{romanization-middle}}, {{romanization-bottom}}
      2. {{romanization|kanadijec’|scheme=uk-und_linguistic}}
    2. Categories?
    3. Romanizations to be checked?
    4. Link to documentation of the romanization scheme?
    5. Set the font-family for the sake of MSIE or other browsers which have trouble with diacritics, etc?

Maybe we should conduct a test with a couple dozen examples to catch any other issues before this is implemented full-scale. Michael Z. 2008-08-14 03:47 z
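As a rough illustration of the template idea in point 3 above, here is a minimal Python sketch that assembles the wikitext for one romanization box. The template names and the scheme parameter are the ones floated in the list; they are proposals, not existing templates, and the helper name romanization_box is hypothetical.

```python
def romanization_box(entries):
    """Build wikitext for a romanization box.

    `entries` is a list of (scheme_code, romanized_text) pairs.
    The template names are the ones proposed in this discussion
    and are assumed here, not taken from any existing template.
    """
    lines = ["{{romanization-top}}"]
    for scheme, text in entries:
        # One bullet per scheme, using the proposed {{romanization}} call.
        lines.append("* {{romanization|" + text + "|scheme=" + scheme + "}}")
    lines.append("{{romanization-bottom}}")
    return "\n".join(lines)


# The single example given in the discussion above.
box = romanization_box([("uk-und_linguistic", "kanadijec’")])
print(box)
```

Generating the box from a list of pairs would also make it easy for a bot to keep the order of schemes consistent across entries, assuming an authoritative list of scheme codes is agreed on first.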

I've added two more examples, at горілка#Romanization and батяр#Romanization. In these cases, I've linked all of the romanizations that are attested in English (15 links for 2 terms).

I don't think anything is gained by initially hiding the romanizations here. Also note that for one of the standards there are 4 different romanizations of горілка. Michael Z. 2008-08-14 07:08 z

Romanization: Next steps

I wonder what it takes for there to be progress on this front. Is it necessary for there to be a Vote to permit the structure and then specific implementation details at the language level? DCDuring TALK 01:14, 23 September 2008 (UTC)

Transliteration codes

By the way, the Unicode CLDR project has standardized identifiers for transliteration schemes, similar to the codes for language and script. This may be useful for, e.g., developing templates which consistently label a transliteration scheme and link to its documentation. For example, romanization of Ukrainian for francophones would be identified by uk-fr, and w:BGN/PCGN romanization of Ukrainian would be uk-und_Latn/BGN or Ukrainian-Latin/BGN.

Later, this info may also be useful to automatically transform original script into various transliterations, or to check transliterations. Michael Z. 2008-08-10 19:54 z
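To show how a template or bot might take such an identifier apart, here is a small Python sketch. The splitting rule (source, then a hyphen, then target, then an optional variant after a slash) is only inferred from the three examples above; it is an assumption, not the full Unicode specification, and the names TransliterationId and parse_transliteration_id are hypothetical.

```python
from typing import NamedTuple, Optional


class TransliterationId(NamedTuple):
    source: str             # e.g. "uk" or "Ukrainian"
    target: str             # e.g. "fr", "und_Latn" or "Latin"
    variant: Optional[str]  # e.g. "BGN", or None when absent


def parse_transliteration_id(code: str) -> TransliterationId:
    """Split an identifier of the assumed form source-target[/variant]."""
    pair, _, variant = code.partition("/")
    source, sep, target = pair.partition("-")
    if not sep or not source or not target:
        raise ValueError("expected source-target[/variant], got: " + repr(code))
    return TransliterationId(source, target, variant or None)


for example in ("uk-fr", "uk-und_Latn/BGN", "Ukrainian-Latin/BGN"):
    print(example, "->", parse_transliteration_id(example))
```

A template could use the parsed parts to pick a display label and a documentation link, but which label or page each code maps to would still have to be decided per scheme.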

MZ? —This comment was unsigned.

Innumerable destructive edits removing inconvenient information about etymology

I would like to complain about the supercilious misbehaving edits committed by this user in articles concerning Old Norse. In the following edits (first, second, third, fourth, fifth, but that is not all - in dread he at first inputs his obfuscating "probably"-version, which I altogether did not remove, but made the mistake of sparing, he for the second time removes the only stringent (Scandinavian) explanation of the origin (when the article had contained both versions just before that) and leaves only his unsure "probably"-version!) he is obviously trying to belittle the huge influence the English language has experienced from Scandinavia. If you observe how he writes "probably" on dread only to revert my edits elucidating the Old Norse origin and the similarities with the Danish rædde (definition and proof of the Scandinavian origin here, jf.=jævnfør means compare, id est the same origin) and Norwegian redd (definition here) and the rugged reversion of my edit when I try to write "probably" in his manner, the only conclusion for me is that he is considering himself one of the objects of "Quod licet Iovi non licet bovi", guess which one. And this contemptuous endeavour is not at all ridicule, but rootlessly disparaging and insulting for the contributors of the Wiktionary who would like to be conducive for its expansion, whereas the purpose of such edits can be only its reduction and therefore are pernicious, perilous and reproachable. I am using two academical sources for my edits and I expect a little respect and engaging in a discussion, which the user has not yet deigned to accede to. I do not know what this user has against Old Norse language, but all his edits have proved to be rootless and I beg some administrator to eradicate them all. With best regards Bogorm 18:08, 7 August 2008 (UTC)

  • "This user" is me. Bogorm must convince me (as well as the OED) how exactly Old Norse hr- became English dr-. Concerning his edits more generally I will confine myself to a couple of examples. He believes bridge to come from Old Norse brú, which is merely the Old Norse word for "bridge". The relevant cognate in ON is bryggja, which indeed left its mark in certain English dialect forms like brigg. But the normal English pronunciation /dʒ/ is all the proof you could want that ON (which had a hard G) had nothing to do with it. He also sees ON influence in such words as hart and hat, which were both being used long before the earliest Scandinavian invasions. Beliefs like this, as well as the theories expressed on his talk page that Old English is descended directly from Gothic and Sanskrit, have made me conclude that he doesn't know what he is talking about. Far from playing down the Norse influence in English, I think it is vitally important, and that is why I resent such useless and intellectually trivial contributions as Bogorm's, which seem to be based on nothing but caprice and superficial similarities. Ƿidsiþ 18:16, 7 August 2008 (UTC)
I could of course be indignant at such affronts about not seeing the blatant similarities added to the article and calling them superficial, but I am not going to accept this as flagrant as rootless distortion about me reckoning arguably the English as a descendant of Gothic which is Widsith's figment. I said that there are words descending from Sanskrit, whose influence in English is straightforward since some common Indoeuropean words stemming from Sanskrit have entered in the vast majority of Indoeuropean languages including English and contesting this means contesting the Indoeuropean nature of either Sanskrit or English which is a complete nonsense. Sanskrit is a chronological and etymological predecessor for all Indoeuropean languages exactly as Gothic is the most ancient Germanic language with a script (Ulfila) and his influence of Western and Northern Germanic languages is ineffable! Ask him whether he knows to which of the three groups of Germanic languages does the Gothic appertain without perusing the article in Wikipedia? (a hint: "W...", "E...", "N...") I have already asked him about his knowledge of Old Norse to which he repeatedly did not deign to respond. Bogorm 18:35, 7 August 2008 (UTC)
It may be useful to mention that the talk page of Bogorm holds some more information on this, including also a debate on other etymological issues questioned by Ivan Štambuk. (No comment on the linguistical aspects of this matter by me.) -- Gauss 18:31, 7 August 2008 (UTC)
Bogorm, you clearly are not aware of all the issues that are at play here. For example, when you stated that since Gothic precedes Old English, its words must be etyma because of that fact, you make one of the oldest mistakes in the etymological book. Widsith is a highly knowledgeable contributor, and has a great deal of respect in the community. Please learn from him. And also please stop harassing him and reverting his edits. While this may not be the case on Wikipedia, here on Wiktionary, wasting a useful editor's time is grounds for a block. -Atelaes λάλει ἐμοί 18:34, 7 August 2008 (UTC)
If you have deigned to behold, I have already stopped reverting unsourced and supercilious "probably"-edits and I have done it only once. Although I urge you to make yourself familiar with the "three reversions rule", I am not going to make it even a second time! I do find "Stop doing something" when the author has done it only once extremely ridiculous. Bogorm 18:39, 7 August 2008 (UTC)

First of all, the three revert rule is not really in play here on Wiktionary. But yes, you have stopped reverting Widsith's edits, and for this I thank you. But, Widsith is right and you are wrong. Earliest attestation date means nothing. Indo-Europeanists have a fairly settled genetic lineage for Indo-European languages, and it is not simply "oldest gave rise to the younger". Sanskrit did in fact give rise to a number of languages, such as Hindi, Bengali, etc, but certainly not English. Gothic may have given rise to Crimean Gothic (the jury's still out on that, if I'm not mistaken), but that's about it. Greek actually has older attestation than Sanskrit, and all it gave rise to is Greek. If you look at w:Indo-European languages, they have a fairly nice bit about how this is all arranged. -Atelaes λάλει ἐμοί 18:48, 7 August 2008 (UTC)

Just typesetting

The entry ¤ is a symbol used in typesetting. I entered the context label {{typesetting}}, but it renders as (metal typesetting), which would be incorrect in this case.

So how do I enter just "typesetting"? Why not have {{typesetting}} for typesetting, and {{metal typesetting}} for metal typesetting?

Not sure why it was set up that way. It looks like what someone had in mind was to partition into two groups, either typography or metal typesetting. Consider using {typography} instead (or alternatively to rename {typesetting} as {metal typesetting} and then redirect {typesetting} to {typography} as a more likely context). If that would be incorrect, then we may need three choices of label.
Typography and typesetting have a lot of overlap, but they are two different things—the latter is a subheading of the former, I think. Typesetting is the mechanical process of setting type (whether literally mechanically or digitally), while typography also includes designing with type, the design of type, the font design, etc.
I also think the subjects should remain pure. “Metal typesetting” brings specifics of the tools and media into it. If there's no objection, I'd like to remove this distinction from the templates. Michael Z. 2008-08-10 00:38 z

More generally, it is confusing for editors to have templates with one name and different text. Why not have them be the same when there's no reason not to? Michael Z. 2008-08-08 04:35 z

In general this is best. Redirects are used for consistency when a topic has more than one name, e.g. jocular and humorous. The correct home for (metal typesetting) is {metal typesetting} while {typesetting} is either its own separate label or a redirect to another. 00:11, 9 August 2008 (UTC)
Once we figure out what makes sense, I'll go ahead and adjust these templates. Michael Z. 2008-08-10 00:38 z

Names of symbols

Is there a good way of entering the English and foreign names of symbols? For ¤, I have entered English and German names under “Synonyms” and “Translations”, but this feels like a slightly awkward workaround, because they're not really synonyms or translations. Michael Z. 2008-08-08 04:45 z

¤ doesnt mean "The generic currency symbol, used as a placeholder when a specific local currency sign is not available in a digital font." it means "{{context|when the correct symbol is unavailable}} {{non-gloss definition|specifying units in any currency}}" (if your definition is right). in other words, it doesnt mean "a symbol' but is a symbol. so the definition needs fixing. once thats fixed, obviously the english and german names arent synonyms and translations. maybe just ===see also=== * [english name]] 05:13, 8 August 2008 (UTC)
100% true. But it doesn't explain the symbol nearly as well. Can you link to a few good examples of symbol entries which are defined in just this way? Michael Z. 2008-08-08 05:40 z
same anon, diff ip address. how about < and > and [vertical line] and ! (but that one doesnt have the non gloss definition template and needs it and some more help too) 16:54, 8 August 2008 (UTC)
Then I would say this symbol's meaning isn't specifying units in any currency. It denotes a spot in an unfinished typesetting project where the correct currency symbol is still missing. It is not meant for a reader, but for a designer or typesetter who will be completing the next phase of a project by finding the right symbol in a specialized font and replacing the louse (¤).
Maybe specifying a currency symbol which is yet to be typeset. Michael Z. 2008-08-10 00:46 z

Taos nouns

Hi. I pasted 257 Taos nouns here: User:Ishwar#Taos nouns. All nouns are listed in the singular form (unless the noun belongs to the noun class that has the constant plural form). Some of them have duoplurals, those that do not can probably have the duoplural forms deduced from w:Taos language#Nouns (assuming someone wants to do this). Some have notes about loanwords (usually from New Mexican Spanish), but the source is not indicated except for two words.

All words are from w:George L. Trager's publications, mostly from his 1946 sketch. The transcriptions of all the words have been converted to the transcription used in his later 1948 re-analysis of the sound system.

If Wiktionary is interested in this, feel free to bot it to pages. Otherwise, I'll disappear it in a half a year or so. peace Ishwar 06:02, 8 August 2008 (UTC)

Note: all nouns are enclosed with { }. Ishwar 06:03, 8 August 2008 (UTC)

This language is still spoken, correct? Are you in contact with any language teachers at the pueblo or local tribal college? I ask that just out of curiosity, because I wonder if those spellings correspond with the language as it's spoken today, 50+ years later. 04:08, 14 August 2008 (UTC)
That's a good point, but even if they don't, I think it might be nice to include the old spellings, either temporarily or permanently. (We include archaic spellings of English words; this is different, in that from Ishwar's description it sounds like the spellings were never used by actual speakers — please correct me if I'm wrong — but I think it falls into the same general category.) —RuakhTALK 22:32, 15 August 2008 (UTC)

Intransitive English Prepositions

Back in the spring, when we were discussing using determiner as a POS heading for the English members of that category, Ruakh asked--presumably rhetorically--"will we next adopt the notion of intransitive prepositions?" Well, I'm asking it now in earnest. The idea that prepositions must have an object is clearly wrong. There are many cases where everyone agrees that words like from are prepositions even when they lack objects. Why not be systematic about it?--Brett 15:44, 8 August 2008 (UTC)

I admire your spunk, but I think there's a difference between the determiner debate, where we're accurately using a scientific term that's less prevalent in traditional-grammar circles, and something like this, where we'd be accepting a modern — and AFAIK quite recent — scientific broadening of a traditional grammatical category. Granted, this is still better than Wiktionary-specific non-scientific broadening of traditional grammatical categories, as with our re-definition of "adjective" and "adverb" to include prepositional phrases with appropriate modificands, but I'd rather not tread too far down that path.
I think the best approach — one that I suspect will never happen, because only Connel and I seem to be in favor of it — would be one that doesn't use part-of-speech as a header at all. Categorization by part of speech is frequently a matter of POV, albeit sometimes a matter of scientific POV vs. unscientific POV, and unless we were to accept e.g. ==Adverb or intransitive preposition; see discussion below== as a POS header, we could never achieve NPOV with strict part-of-speech headers. (Also, most of our readers — and casual editors — seem to have a very loose grasp of even the most basic traditional parts of speech, viz "noun", "adjective", and "verb". I'm not sure our use of POS headers helps them as much as we'd like to pretend.)
RuakhTALK 19:30, 8 August 2008 (UTC)
You didn’t give any examples of what you mean, but what I think you mean is a subcategory of adverbs (to burn up, to burn down, to burn in, to burn out, to burn off). A good term for them is particle. —Stephen 19:41, 8 August 2008 (UTC)
Sorry about not being clear about what I meant. I'm not looking at particles, but at what have often been called adverbs as in up in sentence #2 below:
  1. The squirrel went up the tree.
  2. The squirrel went up.
As for this idea being new, it probably was way back in 1863. That's when Alexander Bain discussed prepositions governing clauses. In other words, he recognized back then that many of the words that have often been lumped together as subordinating conjunctions are actually prepositions. In 1924, Otto Jespersen published The Philosophy of Grammar in which he shows that words like up in 2 above differ from the up in 1 only in that they have no object. And yet, as I said before, there are lots of prepositions that everyone accepts even when they don't take an object, for example the sun came up from below the horizon. Dictionaries do not list from as an adverb, only as a preposition, and yet here it is without an object.
No other part of speech is categorised this way. We don't say verbs are nouns when they are intransitive or that adjectives that take complements are adjectives, but that those without complements are something entirely different.
There is more evidence in the modifiers that they allow: straight and right are not used to modify core adverbs (e.g., *She moved straight surely) though they typically can modify prepositions (e.g., She moved straight along the line). Notice that they can still modify them when there is no object (e.g., She moved straight along).
Adverbs also may not appear as complements to linking verbs (e.g., *The movement is quickly.) Prepositions, however, can, even when they have no object (e.g., the ball is up here; the ball is up).
Finally, everything in life is POV, but that's not a helpful position. This is no more POV than categorizing elements as metals, halogens, etc. There are good reasons for it and nothing but tradition against.--Brett 20:20, 8 August 2008 (UTC)
To say there is nothing but tradition against ignores the need to be useful to a broad user base, unless you think we can continue to eke out the opportunity to indulge our own interests without meeting the needs of other users, perhaps as a kind of service to Wikipedia. DCDuring TALK 21:25, 8 August 2008 (UTC)
Re: " [] there are lots of prepositions that everyone accepts even when they don't take an object [] ": I don't think that's true. In your example, traditional grammar holds "up" as an adverb and "below the horizon" as the object of "from". The OED Online has several quotes with "up from" in its first adverb entry for up, and none in either of its preposition entries; and its sense 15 for from is “Used in certain of the above senses (esp. 1, 2, 3, 9, 10) with an adverb or a phrase (prep. + n. or pron.) as object.” —RuakhTALK 19:56, 11 August 2008 (UTC)
Sorry, I should also have added that this change affects only a small number of words. It is unlikely that such a change would cause any problem for naive editors.--Brett 20:22, 8 August 2008 (UTC)
Please, no. Consider that we need to connect with what those of our users who read the headers (which we assume they need to navigate our long entries) remember from school. Why aren't categories sufficient for those of an academic frame of mind, who, I hope, represent the tiniest fraction of our target audience though a large portion of those who vote on policy decisions here? Perhaps we need to amend the thought-eliminating slogan to: "All words in all languages" + for all users. DCDuring TALK 20:29, 8 August 2008 (UTC)
The need to be useful to a broad base of users includes the need to be accurate, not just familiar. This proposed change would not be an indulgence but rather a means to educate people. Presumably, that is why they come here in the first place: to learn about words. How would we be failing to meet the needs of our users by simply collapsing two sections (i.e., adverb & preposition) into one (i.e., preposition), especially when many of the relevant entries are more or less duplications of each other as it now stands?--Brett 22:08, 8 August 2008 (UTC)
We have the enormous problem that we are only the fourth most popular dictionary site, behind Answers, MWOnline, and and not far ahead of a few others. (We don't know where we stand vis a vis Encarta.) Users can and do go elsewhere. Even folks on the WMF mailing lists seem to prefer as their on-line dictionary. As it is now we use the PoS names in the ToC and headers as navigational aids. Adding a PoS name unknown to our users is like having a few Cyrillic street signs mixed in at random with the others. As to "accuracy", we are looking at a label, not a thing-in-itself. The label is a call upon knowledge. If nothing answers that call (as it will not for 99% of our anons when we give them "Intransitive Prepositions"), how is that more accurate? If our typical user only visits three or four times a month, that means that we can make almost no contribution to their overall learning of language. We can only help them find what they are looking for and try to anticipate their requests. Most of the discussion of PoS and other linguistic categories makes no contribution to our serving our population of passive users. It is for our own edification. DCDuring TALK 02:25, 9 August 2008 (UTC)
As with intransitive verbs, the heading would not include the word intransitive, simply the word preposition. No Cyrillic here.--Brett 11:52, 9 August 2008 (UTC)
So the actual proposal would be to apply a new/modified "intransitive" context tag to certain senses (possibly not already present) of Prepositions and to add an associated category? DCDuring TALK 12:10, 9 August 2008 (UTC)

Thank you for forcing me to be clearer, and sorry for not being so from the beginning. As it is now, the words in question are listed as adverbs. Some of them, like backwards, have only an adverb heading. Others, like up, have headings for both adverb and preposition. The proposal is to replace the adverb heading with a preposition heading in cases like backwards and to merge the adverb senses under the preposition heading, removing the adverb heading in cases like up. If anything, this should simplify things for the general user.

It would further be possible to add a context tag indicating the valency (transitive, intransitive, or both) of the preposition, but I don't know if this is necessary.--Brett 13:32, 9 August 2008 (UTC)

Anyone else have any thoughts on this?--Brett 17:49, 12 August 2008 (UTC)

Anchors in Wiktionary

It has been suggested that I should have come to this venue first before altering the wiktionary page by putting anchors in. I did come to this venue and bring up this very thing in Subject #2.50. I'll bring it up again.

I may have found a way to reference wiktionary directly when using the tool tip mouse over. By putting a link in wiktionary at each definition line, I will be able to reference that specific definition. That, I am again assuming, will mean that the word is valid for that specific meaning and I will not have to validate those synonyms.

The nomenclature I was using was very specific to the wiktionary page. It was a two part code where the first part referred to the part of speech and the second part referred to the definition number. Thus N1 was the first definition under noun for that word. adv5 would mean the fifth definition under adverb for that word.

I'm told the placing of anchors is unacceptably altering the sleek lines of the page. Does anyone know of a way of avoiding that? Amina (sack36) 02:01, 10 August 2008 (UTC)

Firstly, sorry. I didn't understand your earlier comment here, and should have asked for clarification at the time, rather than just getting annoyed when it turned out there were problems.
Secondly, the use of <div> for this is indeed unacceptable, especially if you're wrapping just part of the definition in it, because that produces confusing line breaks. That's not a big problem, though: just use <span> instead.
Thirdly, this is a minor point, but some entries, such as [[pen]], have multiple Noun sections.
Fourthly, sense numbers are always changing, as existing senses get split or merged or removed or reordered, and new senses get added in the middle of existing ones. Nowhere else do we use sense numbers; in Synonyms sections, for example, we use the {{sense}} tag to indicate (with a short gloss) what sense we're referring to. That approach isn't perfect, either, but it's worked fairly well, and more importantly, when it fails, its failures are less of a big deal. (If a translation at [[sex]] becomes misnumbered due to shifting senses, then translators are screwed; but if its gloss just doesn't match its sense line perfectly, that's O.K.: they still have an idea what sense(s) it corresponds to.)
RuakhTALK 13:09, 10 August 2008 (UTC)
OK, this whole thing is getting way too confusing for me. I looked at pen and found that the three meanings in the first section didn't have a difference of meaning for me. Sense in the Synonym area doesn't seem to have nearly enough breakdown of meaning. It looks like this whole concept of linkage to wiktionary for definitions is just too complex for me to handle. Maybe someone else later on can come up with a solution. For now, I'll go back to defining it with hard-coding. Amina (sack36) 14:47, 10 August 2008 (UTC)
Other people have thrown up their hands at this, too. It just may not be doable for the foreseeable future in the Wiki framework we have. Section links seem to be as fine-grained as we can reliably go - and that has required a lot of work to standardize and maintain the Language, Etymology, and PoS headers. If there were a lot of people changing Etymologies and many term templates were using Etymology section links, that too would be problematic. We cannot even link to a specific PoS within an Etymology or a Language section. DCDuring TALK 16:01, 10 August 2008 (UTC)
So your principal concern is alleviated by using <span> instead of <div>?
I'm not worried if the scheme arrived at isn't perfect. We're talking about synonyms here, which are usually well known words and not the more unusual definitions subject to change. If this grows outside of that then at least we have some experience with how to handle situations such as those you name, before trying to concoct anything more complicated. DAVilla 20:14, 11 August 2008 (UTC)
It might be worth a serious experiment using entries (but not senses) likely to change over some months' time. We might learn something without wasting too much enthusiasm. DCDuring TALK 21:06, 11 August 2008 (UTC)

Category:Pronunciations wildly different across the pond

"Yikes!" was my first thought on encountering this category (at mosquito). Looking at it, it does say that it's a "Placeholder category to provoke discussion of what an appropriate category name might be...". There have been a grand total of 3 comments on the talk, one each in November 2006, December 2006 and August 2007. Reproduced here (sans comments, etc) they are:

To which I'll add my new suggestion:

Although it is currently only used on 16 entries (but clearly could be on many more (e.g. mobile)), having a placeholder category hanging around since November 2006 is too long imho. The time has come (again imho) for this to gain a more professional name.

Of the suggestions above, the second is a bit unwieldy, and the third is more restrictive - not allowing words with major differences between other varieties of English (e.g. Australian and South African). A Google search for xenophones indicates that (in addition to being a misspelling of w:Xenophon, an ancient Greek soldier and mercenary) it means "speech sounds from a foreign language", and most hits seem to be about the occurrence of English phonemes in Swedish. My suggestion isn't perfect though, and I still like EP's suggestion. Thryduulf 13:32, 12 August 2008 (UTC)

Where there are sufficient needs, adding more categories is better.--Jusjih 00:20, 13 August 2008 (UTC)
I don't fully understand your comment. Are you saying there should be categories for each pair of regions? e.g. a separate one for where the UK and North American pronunciations are very different, another for Australia/North America, a third for US/Canada, etc? If so what do you propose we name them? What should we name their parent category? Thryduulf 12:18, 13 August 2008 (UTC)
There isn't a good name I've been able to think of, or I would have suggested it by now. :P --EncycloPetey 18:44, 13 August 2008 (UTC)
Category:Evidence in favor of splitting North American and Commonwealth English. DAVilla 02:49, 14 August 2008 (UTC)

I'm going through these adding Canadian pronunciations, and in some cases they correspond to US or UK, or both. Eventually the distinct Maritime and Newfoundland Canadian pronunciations may also show up, as well as any number of other specific pronunciations from all sides of the pond.

I suggest that the category name be specific and unambiguous: perhaps Category:English Words pronounced differently in General American and RP. This would carry the implication that, e.g. Canadian English generally/often corresponds to GA, but that the details of particular accents may vary widely. Michael Z. 2008-08-17 18:21 z

But where does that leave words that are pronounced similarly in the US/UK but very differently in Australia? Thryduulf 19:13, 17 August 2008 (UTC)
I guess whoever is interested will have to make some tables, and create a category for the cross of every set of two and three English dialects. Michael Z. 2008-08-18 19:44 z

Adding pronunciations to a page for each of the numerous dialects of English spoken in each country is a bad idea. We risk ending up with very unwieldy pronunciation sections that will swamp the page and risk confusing the reader.

I have always understood it to be Wiktionary's policy that we give standard pronunciations for each country (with UK and US pronunciations currently dominating, but that is only because there have been few contributions of Canadian, Australian, etc, pronunciations up to now). For example, the standard for British English is Received Pronunciation. (I'm not sure whether there is a single American English standard - print dictionaries seem to vary. Is General American an official standard?) Variations in dialect are best kept to a table of correspondences between phonemes (similar to the one that "converts" British English to American English at Rhymes:English). We should certainly provide space for the many variations of English pronunciation around the world (British English alone probably has dozens) but putting these in the pronunciation section for every page is the wrong approach.

By the way, anyone who wants to expand this list will find this huge list in Wikipedia a useful resource. — Paul G 10:02, 19 August 2008 (UTC)

Providing pronunciations for every local accent is probably impractical. I'd say we aim for major pronunciations for UK, US, Canada, Australia, New Zealand and South Africa, plus local pronunciations for words with a clear geographic location (for example we'd want the Hong Kong English pronunciation for a term originating in or principally used in Hong Kong), and the local pronunciation for settlement names, particularly if these are different to the general pronunciation for the country (e.g. the Bostonian pronunciation of "Boston" and the Geordie pronunciation of "Newcastle"). Where more regional pronunciations within the more general ones above are significant then they should be included (e.g. moor is one syllable in southern England (homophonous with more) and two syllables in northern England (how I would expect "*mooer" (one who moos) to be pronounced)). I don't think we need to note things like the fact that in Bristol "area" is pronounced the same as "aerial". I see nothing wrong with using a collapsible (1-column) table for long pronunciation sections. Thryduulf 14:07, 19 August 2008 (UTC)

Thank you for directing my attention to this thread.

The reason the pronunciation differences are limited to these two dialects is that this is the linguistic distinction normally made in hundreds or thousands of reference books. While regional differences always exist, there are systematic differences between US and UK from various phases of reform and divergence. Other regions have their own dialects, but most of the time will explain variations as "like American English" or "like British English" by way of explanation. This is especially true for major regions (such as India, Australia and Canada) with very large numbers of English speakers.

Yes, individual entries should have as many regional pronunciations as needed. But this particular category (before it was vandalized down to a tiny handful of entries) was about major pronunciation differences typical of one or the other dialect. Since the colloquial distinction ("it's like UK Eng" or "it's like GenAm") matches the linguistic distinction, further subdivision would be original research, and misleading to readers.

We should perhaps note in the pronunciation section of those entries that the Bristol-area pronunciation of "area" is homophonic with "aerial", but that has nothing at all to do with this category. Perhaps a "micro-regional pronunciation variations" category would be better for that sort of thing. But over-emphasizing such differences can be misleading to readers and researchers.

Pronunciation section policy never reached any conclusion about what pronunciations should be listed. There are eight to a dozen different General American dialects that we should explain but currently don't. There are easily hundreds of "micro-regional" variances in American English that probably shouldn't be expounded at depth (so-called Pittsburgh pronunciations, etc.) as they can't be verified. But since that never was put into policy, we've just accepted the tiny few pronunciation distinctions that have been contributed so far.

Keep in mind that not all words are pronounced differently for a given region. Only the differences are going to be identified and categorized. For a better category name, I think Category:General American and RP pronunciation differences might work, or perhaps Category:General American vs. RP pronunciations. --Connel MacKenzie 16:20, 25 August 2008 (UTC)

Latin verb translations

Verb translations for Latin aren't consistent. For example, the verb to go lists eo (1st-sg-pres-indic-active) and vadere (infinit-pres-active). Since Latin verb entries are always on the first-person form, instead of being in the infinitive one (contrast the entries for sum, eo and amo with esse, ire and amare), shouldn't all the translations consistently list the first-person form?

Also, I noted that eo lays out the conjugation table on the entry page itself. Shouldn't it be using a template for this? The Latin version seems to use a template that's dedicated for this verb, but I'm not sure it's needed; just a generic template for irregular verbs would already be much better than this!

~> SilvioRicardoC 17:00, 12 August 2008 (UTC)

Yes, you have noticed one of the many ways that users can help clean up translations. Years ago, all Latin verb translations were put in as infinitives, and this has never been completely updated and cleaned up. Wiktionary is an on-going effort, but the good news is that you can help clean up these old translations!
As for templates, it usually isn't possible to create a template for every irregular verb, and we are still cleaning up the regular templates. Again, years ago, we had horrible blocky Latin verb conjugation templates. The regular and pattern templates have been updated, but we're still cleaning up all the calls to these templates. In some cases, the wrong conjugation pattern is used on the page, so it's not just a matter of a simple replacement; each conjugation has to be checked against sources. --EncycloPetey 18:46, 13 August 2008 (UTC)
Hmmm I just found {{la-verb}}, which may be helpful. What about a bot looking for (and listing in some page) Latin translations of verbs that don't end in -o or -or? Such a list would make it pretty easy to know what still needs to be corrected. I still need to take a look at the "bot policies" (who gets to have and use a bot, and when); maybe I can help if I'm allowed to have one... ~> SilvioRicardoC 23:21, 14 August 2008 (UTC)
The key criteria for getting bot permission here are (1) making the code and purpose plainly visible for discussion and feedback, and (2) demonstrating basic knowledge of Wiktionary formatting conventions. Then you put it up for a vote. However, it might not require a bot in this case. I imagine someone out there with access to the last XML dump could generate such a list. Essentially, you'd just need to look for Latin verb entries and be able to parse the endings for the entry name. --EncycloPetey 02:28, 16 August 2008 (UTC)
Note that automation that (1) reads the XML dump(s), (2) possibly reads some pages from the current wikt, and (3) writes lists or reports to (your own) User: space pages (within reason) is not a "bot", you don't need any special permission. The 'bot permissions and flags are for automation that changes mainspace pages or other content, (or perhaps, other user talk pages and such, but we don't have those like the 'pedia does). Robert Ullmann 18:06, 17 August 2008 (UTC)
Good! Is there any page explaining how exactly things work? I can't seem to find anything else about the XML dumps or any other API related to this... I'm gonna try searching for it again later, but, I don't know, sometimes wiktionary seems so "decentralized"... ~> SilvioRicardoC 17:07, 19 August 2008 (UTC)
XML dumps of en.wikt can be found here. Many people use pywikipedia (meta:Using_the_python_wikipediabot) to do dump analyses (and to make "bots"). If you don't like Python, there are links from there to other tools. Hope that helps. --Bequw¢τ 19:39, 19 August 2008 (UTC)
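A minimal sketch of the list suggested above (Latin translations that do not end in -o or -or), using only the Python standard library rather than pywikipedia. The sample data, the layout of the translation lines (`* Latin: [[...]]`) and the function name are assumptions made for illustration; a real dump also declares an XML namespace on its elements, which this simplified sample omits.

```python
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; the translation-line layout is assumed.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>go</title>
    <revision><text>====Translations====
* Latin: [[eo]], [[vadere]]
* French: [[aller]]
</text></revision>
  </page>
  <page>
    <title>love</title>
    <revision><text>====Translations====
* Latin: [[amo]]
</text></revision>
  </page>
</mediawiki>"""

# Lines of the assumed form "* Latin: ..." and the link targets inside them.
LATIN_LINE = re.compile(r"^\*\s*Latin:\s*(.+)$", re.MULTILINE)
LINK = re.compile(r"\[\[([^\]|#]+)")


def suspect_latin_translations(dump_xml):
    """Return (page title, Latin word) pairs where the word ends in neither -o nor -or."""
    root = ET.fromstring(dump_xml)
    suspects = []
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        for line in LATIN_LINE.findall(text):
            for word in LINK.findall(line):
                if not word.endswith(("o", "or")):
                    suspects.append((title, word))
    return suspects


for title, word in suspect_latin_translations(SAMPLE_DUMP):
    print(f"{title}: {word}")
```

The sketch does not distinguish verbs from other parts of speech, and an irregular first-person form such as sum would also be listed, so the output is a list of candidates to check by hand rather than a list of errors.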

family template


Noticed there's no family template corresponding to Category:Family which differs from Template:anatomy and Category:Anatomy. Should there be a family template? (note: it would sound more scholarly if you used kinship instead of family since that is more commonly used in anthropology and not all relatives would be considered "family" in an English sense, e.g. clan members in Western Apache, etc.) Ishwar 18:23, 12 August 2008 (UTC)

No, there is not always a template for every category. The templates are used for context situations. A word tagged with (anatomy) using a template is marked to mean that the definition is used only in anatomical jargon by anatomists. Words for family relations can appear in many different contexts, so just a category is used. In other words, the entry for phalanges would be marked (anatomy) to let the user know that it is not an everyday word, and should be considered jargon used by specialists in a particular field. By contrast, the word aunt is not jargon, but is instead an everyday word, so it needs no context marker. --EncycloPetey 18:52, 13 August 2008 (UTC)
True. On another point, would you consider Kinship to be a better name for that category? DAVilla 02:44, 14 August 2008 (UTC)
"Kinship" might be a more precise term, but I think it would be less useful. Such categories are most useful (in my estimation) for non-native speakers, so having a more familiar label (pun not intended) like "Family" for the category is more useful. --EncycloPetey 02:24, 16 August 2008 (UTC)

Alternative spellings and forms

I have noted a good number of entries that show as alternative spellings words with noticeably different pronunciations (eg, have to and hafta, airplane and aeroplane). I changed the headings in those entries to Alternative forms. Isn't that desirable? Or should they appear in {{see}} at the top of the page? Should WT:ELE not reflect our choice(s) in that regard? I took a look at Wiktionary:Alternative spellings (not yet a guideline or policy). Could that be edited into something that could be a policy or a guideline? DCDuring TALK 19:29, 13 August 2008 (UTC)

I prefer "Alternative forms" for exactly that reason: because spelling differences often reflect, either now or previously, a difference in pronunciation. Ƿidsiþ 19:35, 13 August 2008 (UTC)
But surely there are some cases where the regional pronunciation differences are not sufficient (or sufficiently related to the regional spelling differences) to require "Alternative forms" rather than "Alternative spellings". I am thinking of cases like color/colour, labelling/labeling. OTOH, I suppose that if we were to only have one heading, "Alternative forms" is more inclusive. We would still need to determine when something was so different that it didn't warrant inclusion as an "Alternative form". DCDuring TALK 20:07, 13 August 2008 (UTC)
I'm not sure I agree with the assessment of different pronunciations being associated with different spellings—in fact, I don't think these are examples of that at all. I think these are both cases of alternate spellings, used in particular contexts.
  • Hafta, gonna, d'ya, and so on are eye dialect mostly used in written dialogue, to stress its informal nature (or rather to reveal the speaker's nature). In many instances of normal speech, have to is pronounced hafta.
  • Airplane/aeroplane are arguably regional spellings which just happen to reflect regional pronunciations, but strictly speaking they are not pronounced differently. A Canadian would read them both aloud identically, as would a Londoner.
If these were really different words, then they would belong in “Synonyms.”
But if we are destined to argue about this for each and every occurrence, then I'd rather settle on the broader “Alternate forms.” Michael Z. 2008-08-13 23:15 z
I don't really see pronunciation as a very workable criterion here, since pronunciation is all over the map even for exactly the same word. For single-word terms especially, I prefer to use "Alternative spelling" whenever the alternate is more-or-less a variant of the defined word. So I would count airplane/aeroplane, color/colour, labelling/labeling all as alternative spelling situations. I tend to use "Alternative forms" for idioms or other multi-word terms where one or more words in the alternate is clearly a different word than the matching word in the defined term. So, for example, I see pocket flask/hip flask/hip-pocket flask or in two shakes/in two shakes of a lamb's tail as alternative forms. -- WikiPedant 23:34, 13 August 2008 (UTC)
"color"/"colour" seem close enough despite pondian differences. "airplane"/"aeroplane" differ more (extra syllable). "mom"/"mum" is another example. More are at w:American_and_British_English_spelling_differences#Spelling_and_pronunciation. DCDuring TALK 23:53, 13 August 2008 (UTC)
I apparently made some comment or another since I'm somehow quoted in the preceding section above. You all might peruse some of the pages and other categories listed at w:Category:American and British English differences, which seems to be the latest and greatest root category name coming out of the sometimes lengthy debates about "best" in W:WP:CFD. For the sake of interwiki connectivity and sanity, not to mention saving people time and making them more productive when visiting across sister projects... I'm firmly in favor of adopting categories in use either by the commons or en.wikipedia... the latter of which, as is implied by the mention of the CFD debates, usually has a lot of thought power behind the naming. Ill-conceived category names on en.wikipedia are fairly short lived. Best regards. // FrankB 00:29, 16 August 2008 (UTC)

I think we discussed this some time back. If I remember correctly, "Alternat(iv)e forms" was to be deprecated for the reasons given above, and also because of the UK/US variation ("Alternate forms" being the preferred form in US English but incorrect in UK English, which uses "Alternative forms"). "Forms and variants" was the preferred heading, which covers a wider range than "Alternat(iv)e forms". — Paul G 14:48, 19 August 2008 (UTC)

CheckUser vote

I've started two new CheckUser votes, as two of our three CheckUsers are inactive at the moment, and two active CU's are required to meet policy. --Versageek 13:56, 15 August 2008 (UTC)

Redirects in the Citations namespace

Wiktionary:Citations says "Unlike the main space, inflected forms and alternate spellings should be redirected to the primary entry. Variations in case should be on the same page, with the other(s) redirecting, even if the definitions are distinct." Is this really our policy? I can't remember it being discussed. (See Citations:dazzled and Citations:dazzles for a recent example) SemperBlotto 07:19, 16 August 2008 (UTC)

I don't think we have a policy as of yet; the citations namespace is simply too new. As I see it, Wiktionary:Citations is a draft proposal, meant to spur discussion, in an attempt to try and create policy (which DAVilla deserves some credit for). However, I for one am very much against such a policy. I have placed my position and arguments for such on the discussion page of said page, and so shall not reiterate it here. Quite frankly, I think it would probably be best if the discussion took place there, instead of here, so that it's more easily found for future reference. -Atelaes λάλει ἐμοί 07:32, 16 August 2008 (UTC)

Aramaic alphabet

On Wiktionary:Requested entries:Aramaic I encountered the use of an improper alphabet for the Aramaic language - the original Aramaic (not taking into consideration its numerous modern dialects) has but only two alphabets which have been used by every native Aramaic-speaking and -writing person and had been official throughout the empires where it has been one of the official languages - the one for the Imperial Aramaic, which is to be found here and is certes not digitalised (in Wikipedia itself rendered with images) and the late official Estrangelo script used from the 2nd century BC for the whole Aramaic-speaking world and for its main cultural heritage - the Christian Orthodox (later Nestorian) religious literature. The Hebrew alphabet being used only by a small community for the dialect Jewish Aramaic, I could not see any reasonable causes for its application. But after correcting the letters there, my anxiousness became dismay, when I beheld a whole category (Category:Aramaic nouns) with entries in this alphabet, in none of which there was any trace of the Estrangelo writing. Their moving needs must be onerous, therefore I did not commence it. The other reason was that I would ask about possible (not POV) objections - are there any? Please, before answering, consider that the official Wikipedia in Aramaic is (only!, unlike the Serbian) in the Estrangelo script (here showing indubitably which is the sole admissible alphabet at least in the Wikimedia projects) and that literature in Aramaic since the creation of the Talmud has been produced only in this alphabet. (I am not sure, but: is the Talmud the only Judaistic work in Aramaic? In any case, it is the last.) I am bound to be absent in the next 10 days, so I shall not be able to reply instantly on questions, thence please do not make haste to close the discussion. Bogorm 18:58, 16 August 2008 (UTC)

The "Hebrew" alphabet (or the "square" script), as it is known today, is an acceptable alphabet for writing MANY dialects of Aramaic, not simply the various Jewish dialects. The alphabet was originally developed in Hatra (now in modern-day Iraq) first for writing Aramaic, then Hebrew, and is still known as Ketav Ashuri ("Assyrian writing") by many Hebrew speakers. The Estrangela script (the "round" script) was developed later, along with its two modern variants (Eastern/Madhnhaya and Western/Serto). It's definitely not used by all Aramaic dialects, only the Eastern dialects. To my knowledge, the Imperial alphabet doesn't have a Unicode range and therefore can't be used in Wiktionary.
There are a few articles written in the Estrangela script, like ܟܠܒܐ (you can see that I've added a link to the corresponding square script spelling). These can be seen after the Hebrew script entries in categories (the last pages), I think because their Unicode range is a higher number. I've been meaning to add more Syriac script spellings after I've built up some pretty decent Hebrew script entries, working out things like definitions, pronunciation, and any templates that go with them. In either case, when I add translations to English entries I always add both scripts in the Aramaic definition (but so far I've only created the actual articles in the Hebrew script). Therefore, we don't need to move the already existing articles, just create the corresponding articles.
As for the Aramaic Wikipedia, I'm actually a sysop there (w:arc:User:3345345335534) and I can attest to the fact that we do NOT prohibit the use of the Hebrew script or only allow the Syriac script. There's no reason why the Aramaic Wikipedia can't work like the Kurdish Wikipedia, where the two main dialects are broken into different scripts. No one has created any articles in the Hebrew script so far, so although Syriac is the only script one sees, that does not mean the official policy is to write only in Syriac. --334a 20:28, 16 August 2008 (UTC)
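The observation above that Syriac-script entries come after the Hebrew-script ones in categories can be illustrated with a plain code-point sort. This is a sketch under two assumptions: that the category listing orders titles by code point (how the wiki software actually collates is not confirmed here), and that the Hebrew string below, a letter-for-letter transliteration of ܟܠܒܐ, is the matching square-script spelling.

```python
# Both strings are written with escapes so the code points are visible.
hebrew = "\u05db\u05dc\u05d1\u05d0"  # square-script spelling (assumed transliteration)
syriac = "\u071f\u0720\u0712\u0710"  # the Syriac-script entry named in the discussion


def code_point_range(word):
    """Return the lowest and highest code point used in the word, as hex strings."""
    points = [ord(ch) for ch in word]
    return f"U+{min(points):04X}", f"U+{max(points):04X}"


print("Hebrew script:", code_point_range(hebrew))
print("Syriac script:", code_point_range(syriac))

# Python compares strings code point by code point, so a plain sort puts
# the Hebrew-script title first, whatever order the titles start in.
titles = sorted([syriac, hebrew])
print(titles[0] == hebrew)
```

The Hebrew block (U+0590 to U+05FF) lies below the Syriac block (U+0700 to U+074F) in Unicode, which is consistent with the ordering described.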
Nitpick: according to the Hebrew Wikipedia, כתב אשורי(ktav ashuri, lit. Assyrian script) refers specifically to the square script used for Aramaic, and also to the square script used for Torah scrolls, m'zuzot, tfilin, and so on, but not to other square scripts used for Hebrew. Also according to the Hebrew Wikipedia, the Hebrew alphabet (including all of these square scripts) developed from Canaanite alphabets, rather than Aramaic ones. Still, I believe your overall point — that Aramaic has about as much claim to this family of scripts as Hebrew has — is quite valid. —RuakhTALK 21:24, 16 August 2008 (UTC)
And according to English Wikipedia Estrangelo developed directly from Canaanite alphabets too. Aramaic may have the claim for Estrangelo being the descendant of the Canaanite scripts, but no Suryoyo would ever consider the Hebrew script "his" unlike Kurds in Turkey where they are forced to accept the Latin script, but like Tadjiks. Bogorm 21:58, 16 August 2008 (UTC)
Of course, no one has created such articles there, there are no living Suryoyo people, who could write in Hebrew (without studying, but notice, not their Orthodox culture, but the foreign one) and as I presumed, the Talmud must be the last literary Aramaic work written in this alphabet. I am strongly disappointed by your position, especially concerning creating articles there in an extinct (in this language) script, since I do not consider either in this case or, for example, in the Tadjik Wikipedia the introduction of ancient scripts (for Tadjik - Arabic script until the 1920s) to be recommendable. Unlike Serbian, where Latin is widely understood, I doubt strongly that you could find even a dozen Suryoyo people (with no knowledge of the Hebrew language), who would be willing to change the only script in contemporary times and the only of the cultural heritage of their people (think of what script Saint Ephrem - the most illustrious Aramean - had used, and, please, deign to notice that there is not a vestige of Hebrew script in the article about him in Wikipedia) for the Hebrew used only by the Jews in and before the Middle ages. Notwithstanding, I felicitate you for your activity in the Aramaic Wikipedia, provided that it will not resemble the Serbian one and would like to ask you: which literary works in Aramaic with Hebrew letters do you know after the Talmud and some Medieval exegesis on it and which tribe exactly except Jews has ever in history used Hebrew alphabet? I know only one unique case in history when one tribe has converted to Judaism (and probably accepted the script) - Khazars, so if you show me sourced information about a second one, I would be deeply flabbergasted.
I concede that Imperial alphabet, though the most ancient one, cannot be rendered in a digitalised manner, but still hold firmly the view, that for one language, which is considered living, the only admissible script is that of its native speakers (which the modern Jews are not for Aramaic) and that in Tadjik, exempli gratia, the imposing of the abolished Arabic alphabet would be lamentable, and as I am quite sure that they will not commit such reversion, do not cease to be astounded by the different approach for Aramaic here. As I do not welcome the use of (abolished) Arabic script for entries of the Tadjik language and, e. g., the (extinct) Kharoṣṭhī next to Devanagari in entries of the Sanskrit, so here too, but if you are so insistent, I shall "die Flinte ins Korn werfen", as the Germans say. Bogorm 21:58, 16 August 2008 (UTC)
O.K., first of all: We make a point of including obsolete spellings; I don't see why you feel that spellings in obsolete scripts (or scripts that are obsolete for a given language) should be any different. Second of all: There are Jews who speak Aramaic as their first language and use the Hebrew script for it; w:Judeo-Aramaic estimates that about 26,000 Jews today speak modern forms of Aramaic (a figure which doubtless includes some second-generation Israelis who speak it only as a second language with their family and whatnot, but still — 26,000 is not bad for a language with no country). Third of all: The ancient Jewish Aramaic works, including (most of) the Talmud, are very important in terms of influence and modern familiarity. You keep describing the Talmud as the "last" work in Jewish Aramaic; even if that were true, it's completely irrelevant, since the Talmud is still widely read, and quoted from, in the original Aramaic language and square script. We might as well exclude Ancient Greek on the grounds that no one writes The Odyssey in Ancient Greek any more. —RuakhTALK 22:54, 16 August 2008 (UTC)
I have not said anything about the importance of the Talmud and I am against abolished scripts, because else one should enter articles about words in Central Asian languages in three scripts (Arabic until 1922-23, Latin 1922-1938 and Cyrillic afterwards), when contemporary people are using only one and have probably never seen their languages in Latin or Arabic letters (except centenarians). I am not disapproving of quoting in ancient languages (on the contrary), as I consider, e. g., the Mass in Latin to be indispensable for the Catholic faith and as in Zoroastrianism Avesta is always being quoted in the original language and translating it is inconceivable, so I respect much quoting in ancient languages. As for Ancient Greek, I do not find the comparison adequate (I insist on comparison with Tadjik/Arabic script and Sanskrit/Kharosthi), but it would have been appropriate, if the Greeks had changed their script and created a (considerably) larger cultural heritage in the imaginary new than in Greek, which is not the case. But it is the case with Sanskrit and since writing Sanskrit in Kharosthi is not sensible, I forsooth do not comprehend wherefore the approach towards Aramaic should be any different. Kharosthi had been in use in the past and moreover only in Northwestern India and there are no Sanskrit entries in Wiktionary in it, so the correspondence with Aramaic/Hebrew script is striking (except about the entries). Bogorm 23:25, 16 August 2008 (UTC)
I did not know about this diaspora, but with 1 500 000 Suryoyo 26 000 makes 1.7% of this quantity (of native Aramaic speakers). In Afghanistan, too, there should live several thousand Tadjiks who have never seen Cyrillic script, who know their language only with the Arabic script abolished in the motherland, so the similarities really augment. But Tadjik Wikipedia in Arabic script does not exist and must not. Bogorm 23:25, 16 August 2008 (UTC)

You're focusing heavily on the Western branch of the Syriac dialect of Aramaic. There are other dialects, living and otherwise. Like I said before, there's no reason why both scripts can't be used here. There are other (less well-known) Aramaic works written in the "Hebrew" script, like the Zohar, but that's not the point: the goal here in Wiktionary is not to erase everything old/less common and replace with new/more common, but the exact opposite. We have entries here in Old English (the old), along with spellings and pronunciations in various dialects of English (the less common). There is absolutely no reason to be disappointed or to reject having both scripts: it does not limit either script, but allows them both to be used, adding more content at no one's expense. It's win-win. Unlike Tajik, the Hebrew script was never "abolished" by anyone at any time. There's no governing body for Aramaic, there are no official rules in place. There's no need to find a dozen Suryoye who would be willing to change their script because that's nowhere near the issue here, the need is to find a dozen Aramaic speakers whose dialects are written in the Hebrew script. Suryoye may be the largest group of modern Aramaic speakers, but they aren't the only ones.

Here is an interesting article about a Western dialect of Aramaic (not to be confused with the Western dialect of Syriac) spoken in the village of Ma`loula, Syria. As you can see on the board the man is writing on in the picture, he's using the Hebrew script. This is pretty much confirming what I stated before: that Estrangela is only used by Syriac (Eastern Aramaic) speakers, and I'm fairly sure Ma`loula is a Christian (not Jewish) modern-day town. --334a 00:06, 17 August 2008 (UTC)

mislabelled audiofiles

On rare occasions, I run across audio files that are mislabeled. Most recently, I discovered that the audio file included in 准备 is incorrect. The audio file is labeled Image:zh-zhǔnbèi.ogg, which is correct. However, if you listen to the audio, you will discover that the speaker is not saying zhǔnbèi (IPA(key): [ tʂun˨˩pei˥˩ ]), but rather zhuāngbèi (IPA(key): [ tʂuaŋ˥˥pei˥˩ ]). The word zhuāngbèi (装备) means "equipment" not "to prepare." I left a note on the talk page for Image:zh-zhǔnbèi.ogg, but am not sure what else should be done. Anyway, we may need to come up with a more coordinated way to provide feedback about incorrect audio files. -- A-cai 17:54, 17 August 2008 (UTC)

I thought the MediaWiki software had been recently updated to allow image moves? Circeus 17:56, 17 August 2008 (UTC)

If they did, I'm not sure which buttons to press. -- A-cai 18:01, 17 August 2008 (UTC)

I would presume that any move needed would need to be done on Commons, where the file is located, rather than here. Unfortunately my administratorship on that project would appear to have lapsed, but I suggest you bring it to the attention of the Commons community. I can't find a specific page, but commons:Commons:Village pump would seem like a good place to start. Thryduulf 19:33, 17 August 2008 (UTC)

In a similar vein, I was alerted the other day that we have audio for Swedish nouns which include the corresponding indefinite article, though one doesn't see it from the file name, thus the bots(?) that include them don't notice, and presumably very rarely any human editor. I have never been involved in the whole audio files business, so I don't know what one should do about it. Case in point is e.g. the file sv-tack.ogg ("thanks"), which doesn't say "tack", but "ett tack". Which conceivably could make the whole file unusable for anything but the (yet to be created) noun section of the entry tack... \Mike 10:36, 20 August 2008 (UTC)

The issue is also common in French audio files (in this case, I suspect it's the practice at fr: wikt and the elision issues that are at play). Circeus 17:06, 25 August 2008 (UTC)

Did I break something?

I just did a complicated edit to {{fr-conj-é-er}} and {{fr-conj-éger}} that makes them use the code of {{fr-conj-table}} directly (instead of transcluding it) because it is the only way to display an alternative conjugation shared by all of them. Can anybody check that this does not actually break the templates? Circeus

Never mind that. The stupidity of this approach as opposed to using {{fr-conj}}, which does not automatically add brackets, has been pointed out to me. Circeus 19:27, 17 August 2008 (UTC)

Article about English research using bgc

(Forgive the poor punctuation, Im using a Japanese keyboard in an internet cafe in Tokyo)

One great skill Ive picked up as a wiktionary contributor, is the research I do now and then to check words, at RFD/RFV or before I create new entries.

The research consists of checking usual sources like bgc and ggc and applying CFI.

Even apart from Wiktionary, its a useful skill, since it lets one research words on ones own. No need to take a dictionarys word for it, and the research isnt limited to just what the prescriptivists give a thumbs up to. In a sense, its the ultimate descriptivist tool, at least it would be if publishers would open everything up to google preview.

The point is, Ive been thinking of making a small little website to teach others how to do research on the English language using BGC and GGC. The website would also talk about wiktionary of course since the ideas came from here. Itd be affiliated with my blog, Maybe if it became popular enough it could even be officially acknowledged by Google. Itd probably be published on even though it would not be a blog, just because thats easiest for me. Or maybe someone here has a better idea.

The reason Im writing here, is Id love to involve the community in the project. Share ideas and so on, maybe even post articles here first to get input from you guys. Since, afterall, collectively we are the pioneers and experts of this technique.

Let me know what you all think :) Language Lover 08:05, 18 August 2008 (UTC)

That might be the only way we're going to get any documentation of the good methods used, given the recent documentation track record. DCDuring TALK 12:08, 18 August 2008 (UTC)
What’s bgc? Google suggests that it means Billy Graham Center, Bard Graduate Center, or Berkeley Geochronology Center. —Stephen 20:14, 18 August 2008 (UTC)
See Wiktionary:Glossary#b, (ggc, is, etc. etc.) I think it'd be wonderful to have something like Help:Researching, wherever it gets posted. Conrad.Irwin 20:18, 18 August 2008 (UTC)

CFI straw poll

In the course of a conversation, it has come to my attention that we are often asking for citations spanning at least three years, but that WT:CFI says they must span one year. I'm asking for a show of hands to answer two questions: (1) Did you think CFI required 1 or 3 years before reading this posting? (2) Do you think CFI should require 1 or three years? No need to post any rationale, as I simply want to know what people think and not why. If we get an overwhelming consensus to change CFI, then it might lead to a vote, but if responses are mixed, then there wouldn't be any point in going further at this time. --EncycloPetey 18:18, 18 August 2008 (UTC)

Note by the way that Wiktionary:Votes/pl-2007-12/Attestation criteria calls for three years.—msh210 20:34, 20 August 2008 (UTC)
(1) 3 years, (2) 3 years. --EncycloPetey 18:21, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. --SemperBlotto 18:38, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. —This unsigned comment was added by Conrad.Irwin (talkcontribs). 18:57, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. -Atelaes λάλει ἐμοί 19:21, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. Too easy to manipulate entry with 1 year. DCDuring TALK 19:25, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. sewnmouthsecret 19:30, 18 August 2008 (UTC)
(1) 1 year, (2) 3 years, with the possibility of extenuating circumstances for certain unusual cases. - [The]DaveRoss 19:32, 18 August 2008 (UTC)
(1) 1 year, (2) 3 years unless it is obvious that a word will endure (e.g., the name of a new country or a new material which will be part of recorded history even if it proves shortlived). —Stephen 20:07, 18 August 2008 (UTC)
(1) 1 year, (2) 1 year. A word that sticks around for a full year, and keeps the same sense that whole time, is worth including. —RuakhTALK 22:26, 18 August 2008 (UTC)
(1) 1 year, (2) 1 year (even shorter in exceptional cases). 3 years is an incredibly long time, especially in fields like computing and the internet, and one of the advantages of our format is our ability to stay current. Imagine not having an entry for something like blog until it has been around for three years - it will be in some print dictionaries before then. Thryduulf 22:51, 18 August 2008 (UTC)
That would qualify under "Clearly widespread use", which is a separate criterion. --EncycloPetey 23:25, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. Nadando 23:28, 18 August 2008 (UTC)
(1) 3 years, (2) 3 years. --Panda10 00:57, 19 August 2008 (UTC)
(1) 1 year, and the rest of you are deeply confused on this point, if understandably so. (2) 0 years if there is exceptional evidence, 3 years with the current requirements. See this proposed vote. DAVilla 05:17, 20 August 2008 (UTC)
(1) 1 year, (2) 3 years, otherwise it is a neologism. To answer Thryduulf, 1 year entries should be allowed, but marked as neologism imho. -- ALGRIF talk 11:56, 20 August 2008 (UTC)
(1) 3 years, (2) 10 years (like a normal dictionary.) --Connel MacKenzie 22:39, 25 August 2008 (UTC)
(1) 1 year; (2) 1–15 years, depending upon the “class” of word (1 for names of historical events, like 9/11; 10+ for the most ephemeral of vulgar slang).  (u):Raifʻhār (t):Doremítzwr﴿ 23:07, 25 August 2008 (UTC)
(1) 3 years, (2) 3 years (with the widespread use and/or neologism markup exception). -- ArielGlenn 22:52, 28 August 2008 (UTC)

Manipulating groups

Groups citations are subject to manipulation. No more than one citation from usenet in any one of the last three years would address my specific concern adequately. DCDuring TALK 00:07, 19 August 2008 (UTC)

I believe that even dates of Usenet posts are subject to manipulation, and I'm pretty sure I've seen some cases where Google has clearly misjudged the date of a post. (But this might not be the case for new posts, since Google might see that they're new. I don't know how that works.) I think the best approach is simply to maintain a healthy skepticism — if a quotation doesn't pass the sniff test, don't use it. —RuakhTALK 00:30, 19 August 2008 (UTC)

Combined transitive and intransitive senses

The first sense at write gives both a transitive and intransitive sense. (In fact, it is only the intransitive sense, as, even though the transitive sense would be worded the same, it would have the object in parentheses.)

This would be fine if Wiktionary gave definitions only, but when it comes to the subsections of synonyms, antonyms and translations, we potentially run into problems.

It is far better that we have two identically worded definitions, one for the transitive sense and another for the intransitive, differing only by the bracketing of the object for the former, than to require the synonyms, antonyms and translations to be marked repeatedly with "transitive" and "intransitive" when they apply to one sense only. — Paul G 14:43, 19 August 2008 (UTC)

I don't like bracketing, I don't think it's widely used here, but if it's better as you say then I'm willing to go along. We'd just have to make sure that every conceivable object was accounted for. What does it mean to write one's heart, to write the past, to write the world? Putting meaning into words, it seems, is a lot like thinking in a little box, and brackets are a way of making that box a little tighter. Actually the transitive/intransitive distinction is already a bit of that. The last example is mislabeled: "The computer writes to the disk faster than it reads from it" is intransitive. DAVilla 05:00, 20 August 2008 (UTC)
The aim of bracketing is to show that the bracketed material is only there so that the definition makes sense but is not actually part of the definition. The object of a transitive verb is not part of the verb, so should not be part of its definition either. Recall that the way definitions work is that it should be possible to replace a word in a sentence with its definition and for the sentence still to make sense.
We don't have to include every conceivable object - we can just list a few examples followed by "etc". So "to write (a book, a letter, etc)" suggests "to write any written form of communication". This is standard practice in other dictionaries. — Paul G 10:02, 20 August 2008 (UTC)
There was a proposal a while back, from a non-Wiktionarian, suggesting that we include basic prepositional use with verbs, something like:
  • to write <something> [on <object>] [with <something>] [in <language>] [to <recipient>]
These ideas could be combined, by giving examples instead of <something>s, it becomes very convoluted very quickly, and there's almost no limit to how many clauses can be included.
  • to write [a letter/book/sign etc.] [on (a piece of paper/the wall etc.) [in (pen/ink/crayon etc.) | with (a pen/a pencil etc.) ] | on a (computer/typewriter etc.)] [ to (his mum/a friend etc.)].
The idea behind it appeals to me, but I am not optimistic about its clarity in practice, it is also hard to work out which prepositions are being modified by the verb - writing [for a child] is not a special use of for. Anyway, this would be (according to the original idea) in addition to the current definitions and examples. Conrad.Irwin 00:37, 21 August 2008 (UTC)
To me it's more aesthetically pleasing to combine them on one line as there's less repetition. If functionality is greatly enhanced though, I could be won over. For comparison, OneLook shows that only two dictionaries combine them on a definition line (MSN and Cambridge). Most split them (Wordsmyth, AHD, RandomHouse,, The Online Plain Text English Dictionary,, and about 5 didn't show in/transitivity at all. --Bequw¢τ 10:23, 22 August 2008 (UTC)

Spanish legal jargon glossary

A Wiktionary user has mailed to OTRS a Spanish-English legal jargon glossary which he has created. I am not quite sure what we would want to do with it, but he has agreed to release it under GFDL. If anyone would like to do some formatting work and figure out the best way to incorporate this information into Wiktionary I can forward the attachments to you. Let me know. - [The]DaveRoss 19:33, 19 August 2008 (UTC)

Formatting of translations of example sentences

In WT:ELE#Example sentences, it is not stated whether:

  1. Translations of example sentences should be in italics.
  2. The English translation of the word in the English translation should be in boldface.

What is the preference? Is there any consensus?

I have formatted sám as follows:

  1. alone
    Nechci být sám.
    I don't want to be alone.

Is that the preferred way? Thanks. --Dan Polansky 08:04, 20 August 2008 (UTC)

  • I think that I do it differently every time. The word being exemplified should definitely be in bold. I was once admonished for also bolding the word in translation - but it looks good to me. I'm not bothered about italics. SemperBlotto 09:39, 20 August 2008 (UTC)
  • Both in example sentences and in quotations, I italicize transliterations but not translations, and I bold both the transliteration and (where possible) the translation of the headword. For a Latin-script language, it would (in the abstract) make sense to italicize translations of example sentences, just as we italicize the sentences themselves; but for non-Latin-script languages it makes sense to distinguish transliterations from translations this way, and overall I think it makes sense to be consistent. —RuakhTALK 17:18, 20 August 2008 (UTC)
  • We had this discussion a little over a year ago. We decided in a vote that this was the correct format (although italicization of non-Latin scripts was not discussed). However, additional discussion had the consensus that for short example sentences with short translations, it looked better to put the two on the same line. The vote did not specify whether the translation should be italicized. --EncycloPetey 17:23, 20 August 2008 (UTC)
  • I used to be entering it like this:
    1. Nechci být sám. -- I don't want to be alone.
    Then I thought I should stick literally to WT:ELE. Good that I came to ask here.
    So what about this, following Ruakh in italics and boldface, and following EncycloPetey in that short sentences can be on one line:
    1. Nechci být sám. -- I don't want to be alone.
    It looks like that as it stands, it is up to me to make up my mind on the formatting of translations of example sentences, is it? --Dan Polansky 06:46, 21 August 2008 (UTC)

For what it's worth, I prefer for English translations to be enclosed in directional double quote marks:

  1. Nechci být sám. -- “I don't want to be alone.”

That's consistent with English translations that follow foreign language terms using {{term}}. I don't care much whether we also bold the English words that correspond most closely to the foreign entry, but I can imagine how that may be difficult in some situations, e.g. when the division between English words and the division between foreign language words doesn't align well. Rod (A. Smith) 18:32, 24 August 2008 (UTC)

There is no particular need to include completely regular inflections such as cameras or asked.

According to Wiktionary:Criteria for inclusion#Inflections,

Although it is not forbidden, there is no particular need to include completely regular inflections such as cameras or asked. To the extent that they are present, they should indicate what inflection is intended and link to the stem form, and should not merely redirect.
Irregular forms such as geese and were should have their own entries, because people unfamiliar with the irregularity will look for them under the inflected form. Inflected forms — whether regular or irregular — with idiomatic meanings, such as blues or smitten, should have their own entries, with the predictable meanings distinguished from the idiomatic.

I propose that this be re-written as,

The entries for such inflected forms as cameras, geese, asked, and were should indicate what form they are, and link to the main entry for the word (camera, goose, ask, or be, respectively, for the preceding examples). Except with multi-word idioms, they should not merely redirect.
At entries for inflected forms with idiomatic senses, such as blues and smitten, predictable meanings should be distinguished from idiomatic ones.

My proposal reflects what I believe has become standard practice in two regards:

  • all inflected forms get entries, provided they otherwise meet CFI.
  • inflected forms of multi-word idioms can simply redirect.

It doesn't touch the major debate of the day:

  • should entries for inflected forms have actual glosses/translations, in addition to inflection information and lemma links?

because I don't think we've reached consensus on that.

At some point I'd also like to add this:

When an inflected form has idiomatic senses, the main entry should make this clear.

but I think that's a topic for a separate, future discussion.

Any comments/objections/whatever before I bring this to a vote?

RuakhTALK 17:14, 20 August 2008 (UTC)

I'm entirely in agreement with that. It's the current practice, and not one that is that likely to change in the near future. Circeus 17:45, 20 August 2008 (UTC)
Agreed. --Bequw¢τ 20:35, 20 August 2008 (UTC)
Agreed. Thryduulf 12:54, 23 August 2008 (UTC)
Agreed.  (u):Raifʻhār (t):Doremítzwr﴿ 14:31, 23 August 2008 (UTC)

Thanks, all. I've now created the vote: Wiktionary:Votes/pl-2008-08/Inclusion of regular inflected forms, Wiktionary:Votes#Inclusion of regular inflected forms. —RuakhTALK 16:13, 25 August 2008 (UTC)

Addition to the CFI

Europanto is a constructed language that has the ISO 639-3 code eur. It is currently not mentioned in WT:CFI#Constructed_languages (probably because up until January 2008 it was incorrectly listed as a "Living" rather than "constructed" language). As this language is probably disallowed by default (and there are no current entries) it should be mentioned in the CFI where the other disallowed, ISO-coded languages are. --Bequw¢τ 20:28, 20 August 2008 (UTC)

I agree that this should be disallowed. Then again, I'm usually oddly alone on banning con-langs from Wiktionary. So, I guess we'll see what everyone else thinks. -Atelaes λάλει ἐμοί 20:30, 20 August 2008 (UTC)
Based on what I find about this "language" in its Wikipedia article, I agree that Europanto should be disallowed on Wiktionary. --EncycloPetey 20:33, 20 August 2008 (UTC)
I agree with EncycloPetey. (In general, I think that words from constructed languages should be subject to the normal CFI — if there are people actually using these words with something resembling independence, then they merit inclusion, no matter how silly or misguided such use might be — but going by the Wikipedia article and a skim of Europanto home-page, this language doesn't seem to have any words of its own. Someone coming across a word in a Europanto text might look it up, but would be fully satisfied with the source-language entry.) —RuakhTALK 20:45, 20 August 2008 (UTC)
I was going to add it to the disallowed list at some point (quite a while ago; when we were revising this a bit), but never got to it; as Ruakh notes the issue is moot, as Europanto has no vocabulary of its own. But probably worth listing. Go ahead and add it, I don't think you'll get any objections? Robert Ullmann 18:00, 23 August 2008 (UTC)
Now added to the "not yet been approved for inclusion in the English Wiktionary" list. --Bequw¢τ 09:22, 25 August 2008 (UTC)

Broken superlative entry template

In a page like this:

When you click on Superlative you get a bad template, not at all like the (correct?) Comparative one. I don't know how to change such things and probably don't have permission to do so. Could someone fix this? I've been using the Comparative button and manually editing its template to say superlative -- dougher 04:25, 21 August 2008 (UTC)

Now fixed, thanks. (I don't know why that template has "bot" at the end of its name, but whatever. I modified [[MediaWiki:Noexactmatch]] and it should work now.) —RuakhTALK 14:15, 21 August 2008 (UTC)

Duplicate Language Codes

Often there are multiple language code templates for the same language, such as both {{es}} and {{spa}} for Spanish. Usually this happens when there's a valid 2-letter ISO 639-1 code and a 3-letter ISO 639-3 code (there's almost always a 3-letter code). These duplicate codes work equivalently when templates expand them into names but differently when they are used verbatim to make topical categories. The de-facto standard (here and in many other systems) is to use the 2-letter when available, otherwise the 3-letter. We have about 100 of these 3-letter duplicate language code templates. Editors often aren't aware of the subtle differences between templates, using the duplicate code effectively with {{term|...|lang=}} but creating category-havoc with {{pejorative|lang=}} (indeed duplicate codes work fine for the 1st parameter of {{etyl}} but not the 2nd!). Editors have even thought the 3-letter dups were the right category prefix and started hand-categorizing entries using them. I think the best way to tackle the problem is to make clear the policy (somewhere) and then remove these templates. Most weren't heavily used except for eng ("English", use en) and lat ("Latin", use la). They have been mostly orphaned by now.

Note that we do have some "unavoidable" duplicate codes because the WMF created some (and their associated wikis) before ISO codes were established ({{zh-min-nan}}, {{roa-rup}}, {{zh-yue}}) and as we need to be able to link to those wikis it's probably best to leave these for now. Do people feel like removing these dups is an acceptable approach? --Bequw¢τ 06:07, 21 August 2008 (UTC)

Removing the templates isn't so simple. The reason we have them is for subst'ing in case someone includes them in a translation section. Now, if there's a way for AutoFormat to subst these without the templates existing, then I'd say let's delete them. However, I suspect that it's easier for AF to function if the templates exist. --EncycloPetey 06:13, 21 August 2008 (UTC)
Well, if we delete them, then people won't be able to use them, then there would be no need to substitute them (unless people simply put the code, without making the template call). However, I suppose one could argue that people who are used to using these templates will be thrown off a bit, as they'll no longer work. However, I do not think that justifies keeping them. -Atelaes λάλει ἐμοί 06:28, 21 August 2008 (UTC)
That's exactly what I mean. It is not uncommon for someone to add a translation using the ISO code instead of the language name. Some Wiktionaries do all their language names that way in the Traducciones section. So, people who are used to doing that come here and do the same. I agree that the current situation is problematic, but am hoping someone clever will think of an alternative solution. --EncycloPetey 06:55, 21 August 2008 (UTC)
Well, I don't think it would be that difficult for Robert to write a bit for AF which has all the defunct three letter codes, or better yet, which tells AF which two letter code the three letter code stands for (which would allow AF to keep up with changes to the templates, should they happen). I imagine he's up to it. -Atelaes λάλει ἐμοί 07:04, 21 August 2008 (UTC)
AF doesn't literally subst the template, it replaces the template with the language name from User:AutoFormat/Languages and then links it if not in WT:TOP40. This keeps it from doing horrible things if the template is not in fact a language template. The control file is built from the templates in the XML (so now fairly stale, sadly); but it can fairly easily include the 3-letter codes even if the templates don't exist (go look at it). Fixed code to produce the control file from the "live" wikt, so is now up to a few minutes ago. Robert Ullmann 05:02, 22 August 2008 (UTC)
OTOH, whether the template exists or not has no effect on cases where a code is used directly to generate a category name: {{zoology|lang=frobazz}} is going to generate Category:frobazz:Zoology anyway. So deleting them doesn't directly solve that problem; it means that editors might look a bit further to find the correct code. (Mind you, we will probably get someone creating {{lat}} with an edit summary of "why the F is this missing?!" ... ;-) Robert Ullmann 03:22, 22 August 2008 (UTC)
Though some other wikt's use lang codes instead of names, most that I've seen still use the 2-letter code when available, so most people will still get the right result. When looking at duplicates here, I'd say 2-letter references outnumbered 3-letter references at least 10 to 1 for all languages. Thankfully, since AF uses a separate control file, s/he should be unaffected:) People can still manually mis-categorize pages, but hopefully this will enlighten (train?) them better (how else will they learn). What would also help is to write the policy down somewhere easy so nobody has to look through old discussions. As for people re-creating the temps, we could lock down the "popular" ones. --Bequw¢τ 10:06, 22 August 2008 (UTC)
There will be lots of work to do when people start using {{see}}, another ISO language code for Seneca (w:Seneca language). --Jackofclubs 13:02, 23 August 2008 (UTC)
So you thought you would intentionally create the problem by adding tistis and Template:lang:see? NO. We will sort out the template use first. Robert Ullmann 10:49, 24 August 2008 (UTC)
Sure - the template used should be sorted out first. It was just a little experiment. I'm not after creating problems. --Jackofclubs 10:51, 24 August 2008 (UTC)
Yes, you are creating problems. When you are told there is a problem, LISTEN. DO NOT continue.
There are several ways of sorting this both temporarily and permanently. Creating a number of things using "see" is not helpful; particularly since you are doing it entirely to be disruptive: if it were not for this discussion, you would NEVER have entered any Seneca words, correct?
I am removing the cats, etc; if you persist in this you will be blocked. Robert Ullmann 10:59, 24 August 2008 (UTC)
I will not continue with anything to do with the Seneca language until further discussions. --Jackofclubs 11:09, 24 August 2008 (UTC)
There are several ways of sorting this both temporarily and permanently. Creating a number of things using "see" is not helpful; particularly since you are doing it entirely to be disruptive: if it were not for this discussion, you would NEVER have entered any Seneca words, correct?
You are absolutely correct in your assumption that I "would NEVER have entered any Seneca words". I hadn't even heard of the Seneca language until recently when I discovered it had an ISO code see. However, the entry is still perfectly valid for Wiktionary and it was categorised in the same way as similar entries in other languages, so I don't see that I did anything wrong in this case. It was not intentionally disruptive. --Jackofclubs 11:09, 24 August 2008 (UTC)
I am removing the cats, etc; if you persist in this you will be blocked. Robert Ullmann 10:59, 24 August 2008 (UTC)
I understand why you want to delete the cats etc., but please note that I categorised them in the same way as other entries. --Jackofclubs 11:09, 24 August 2008 (UTC)

Note that a simple solution for us would be to use an extension code (see-sen or whatever) for a while; when we have sorted out the existing {see} template we can bot-convert the temporary code to "see". But in the meantime, trying to use "see" as a language code will cause no end of problems. Robert Ullmann 11:04, 24 August 2008 (UTC)

  • Why not just create some simple conversion template with a big switch-statement inside that would default all of these problematic 3-letter codes to the respective 2-letter equivalents, which could be used by topical context labels, {etyl} and others to pre-filter 3-letter codes that would generate false categories? This way nothing should be deleted/protected, and users can use whatever they're accustomed to. --Ivan Štambuk 09:02, 28 August 2008 (UTC)
Yes, I thought of that. It's useful so have a look at {{standard-code}}. That still leaves open, though, the fact that having these duplicate codes leads users astray when they hand code their own categories. Now, we could use {{standard-code}} in our categorizing templates and have someone/thing scan and fix the hand-code mistakes OR we could delete the duplicate templates and have the users learn. That's the question. If we decide on keeping them, I'll volunteer to scan the dumps for incorrect usages (easy promise since we don't have many dumps now!) and we could fit {{standard-code}} into {{etyl}} and the others that need it. --Bequw¢τ 06:17, 6 September 2008 (UTC)
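
The pre-filtering idea discussed in this thread can be sketched roughly as follows. This is only an illustration in Python, not the actual wikitext of {{standard-code}} or of any categorizing template: the function names are hypothetical, the mapping lists only the three duplicate pairs named in this thread (spa/es, eng/en, lat/la), and the category format is taken from the Category:frobazz:Zoology example above.

```python
# Hypothetical subset of the duplicate codes; only the pairs named in the thread.
THREE_TO_TWO = {"spa": "es", "eng": "en", "lat": "la"}


def standard_code(code: str) -> str:
    """Return the preferred 2-letter code if one is known, else the code unchanged."""
    return THREE_TO_TWO.get(code, code)


def topical_category(topic: str, lang: str) -> str:
    """Build a topical category name, normalizing the language code first.

    Unknown codes pass straight through, which mirrors the point made above:
    a code used verbatim produces a category whether or not a template exists.
    """
    code = standard_code(lang)
    return f"Category:{code}:{topic[:1].upper()}{topic[1:]}"


print(topical_category("zoology", "lat"))      # normalized to the "la" prefix
print(topical_category("zoology", "frobazz"))  # unknown code is used verbatim
```

A filter like this would only cover categories produced through templates; hand-written category links with a 3-letter prefix would still need a separate scan.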

malformed pronunciation sections

Couldn't decide whether this belonged here or at the grease pit, feel free to move it if I jumped the wrong way.

There are quite a lot of pronunciation sections out there that are not formatted according to current practice. To try and sort most of these out, could someone with the appropriate technical know-how set up something similar to User:Robert Ullmann/Mismatched wikisyntax that highlights entries with pronunciation sections that match any of the following criteria:

  • is completely empty (perhaps AF could just add {{rfp}} to these?)
  • contains no templates
  • contains /.../ outside of a template
  • contains a <tt>
  • contains a link to the Rhymes: namespace outside of a template
  • contains a table (possibly search for instances of {| )
  • has an {{enPR}} template that contains one or more slashes.
  • has an {{IPA}} or {{SAMPA}} template with parameters that do not start and end with / (i.e. / should be the first and last character of all parameters)
  • contains an {{a}} template anywhere except following a bullet or indent, (i.e. not one of * {{a}}, ** {{a}}, *: {{a}})

And for pronunciation sections of non-English words

  • contains an {{enPR}} template
  • contains {{IPA}} and/or {{SAMPA}} templates without a |lang= parameter

Thanks, Thryduulf 13:15, 23 August 2008 (UTC)
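
As a rough illustration of how some of the checks listed above might be automated, here is a minimal Python sketch. It is an assumption-laden example, not the code of any existing bot: the function name and the problem messages are invented, it expects to be given the text of a single pronunciation section, it does not handle nested templates, and it leaves out the {{a}} placement check.

```python
import re

# Matches a simple, non-nested template call such as {{IPA|/kæt/}}.
TEMPLATE_RE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def check_pronunciation_section(text, english=True):
    """Return a list of problems found in the body of one pronunciation section."""
    body = text.strip()
    if not body:
        return ["empty section"]

    problems = []
    templates = TEMPLATE_RE.findall(body)
    outside = TEMPLATE_RE.sub("", body)  # text left once templates are removed

    if not templates:
        problems.append("no templates")
    if re.search(r"/[^/\s][^/]*/", outside):
        problems.append("/.../ outside a template")
    if "<tt>" in body:
        problems.append("contains <tt>")
    if "[[Rhymes:" in outside:
        problems.append("Rhymes: link outside a template")
    if "{|" in body:
        problems.append("contains a table")

    for call in templates:
        name, *params = [part.strip() for part in call.split("|")]
        positional = [p for p in params if "=" not in p]
        has_lang = any(p.startswith("lang=") for p in params)
        if name == "enPR":
            if "/" in call:
                problems.append("enPR contains a slash")
            if not english:
                problems.append("enPR in a non-English section")
        elif name in ("IPA", "SAMPA"):
            if any(len(p) < 2 or not (p.startswith("/") and p.endswith("/"))
                   for p in positional):
                problems.append(f"{name} parameter not wrapped in slashes")
            if not english and not has_lang:
                problems.append(f"{name} without lang=")
    return problems


print(check_pronunciation_section("* {{a|US}} {{IPA|/kæt/}}"))
print(check_pronunciation_section("* /kæt/"))
```

Returning a list of messages rather than a single flag would let a report page group entries by the kind of problem found.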

This sounds like more of a GP consideration to me, but what should I know?
Regarding the specifics here:
  • I’m probably guilty of the slashes-in-enPR-templates thing; what’s meant to engirdle those transcriptions?
  • I suggest changing your penultimate criterion to read “has an {{IPA}} or {{SAMPA}} template with parameters that do not start and end with </> or <[> and <]>, respectively (i.e. for all parameters, / should be the first and last character, or [ should be the first and ] should be the last character)” to account for the small number of narrow transcriptions we have (such as for many Irish language entries).
See ya!  (u):Raifʻhār (t):Doremítzwr﴿ 14:24, 23 August 2008 (UTC)
Indeed. {enPR} should not have /'s; IPA (and SAMPA) should have /phonemic/ or [phonetic] (do note that we have a lot of entries that use / with phonetic transcriptions, and I've noted a well-known editor berating a newbie for changing the contents of /'s to the correct phonemic transcription) There are more with phonetic []'s than Doremítzwr notes: most Chinese language entries have phonetic transcription correctly shown in the brackets. SAMPA doesn't have a lang= parameter (maybe it should, but it doesn't).
Note that the doc at {{enPR}} says it doesn't use slashes because enPR is phonemic, and they are therefore redundant (and supposedly thus confusing?). enPR is the general pronunciation thus "broad"/phonemic = slashes (while "narrow"/phonetic/dialect = brackets).
Good list. The technical details seem more like GP stuff. Robert Ullmann 17:15, 23 August 2008 (UTC)
Oops, yes I forgot the [...] transcriptions. A lot of French ones use this as well, and they seem to be the most likely not to have a lang= parameter too. We should also add "contains [...] outside of a template" and "has an {{enPR}} that contains one or more square brackets" to the list.
Should we move this conversation to the Grease Pit then? Thryduulf 19:25, 23 August 2008 (UTC)

User:Robert Ullmann/Pronunciation exceptions perhaps? (and see talk there) Robert Ullmann 16:28, 24 August 2008 (UTC)

Persistent blocking issues

There has been an unfortunate trend for a while now with regards to OTRS communications concerning blocks for apparently well meaning contributors. The pattern goes something like this:

  1. New User / IP adds an entry or content to an entry which they think, for one of several reasons, should be in Wiktionary.
  2. Admin / Old timer sees content which doesn't belong for one of various reasons, and removes it. The only notification of why the content was removed is the edit summary.
  3. New User / IP tries to view their recent addition, sees that it isn't there and thinks that perhaps they didn't add it correctly, and tries again.
  4. Admin blocks New User / IP for "re-adding previously deleted material". The only notification is again, in the edit summary.
  5. New User mails OTRS and says "I tried to add this and I am blocked, WTF mate?".
Your response, while very much appreciated and clearly understood, does not neccessarily erase the negative feelings experienced after my first exposure to Wiktionary. I also cannot help but feel that there is a serious "techie" aspect to this site that discourages the every-day bloke from wanting to wade through all the variations and instructions. The instructions for tuning an automobile engine are probably not as involved or detailed. My efforts were merely intended to be helpful after seeking the definition myself and not finding it. Sorry that it didn't work out that way. No further attempts will be made.

The fact of the matter is that New Users should not be assumed to know to look at edit history and recent changes. If someone adds content which doesn't belong, the person who removes it should at least let the contributor know on their talk page why the content was removed, and perhaps toss in a {{welcome}}/{{welcomeip}} so the New User can check out some helpful pages. Not all "re-adding previous..." blocks are appropriate; I suspect that most are not. This is especially true if they occur without any direct communication with the contributor being blocked prior to the block. - [The]DaveRoss 17:03, 23 August 2008 (UTC)

I agree that the edit summary explanations are often not read by new contributors. The talk page is a bit more "in your face" and may be a better means of communicating important messages. If perfectly plausible behavior, like redoing something, could lead to a block, that is certainly important to the blockee, if not to the blocker. If they know to go to OTRS, at least there is some potential for them to be saved as contributors. I would expect there to be many who go away, some of whom bad-mouth en.wikt.
The example illustrates how difficult en.wikt is for new contributors. Perhaps it is time to be more frank about the fact that en.wikt really does not want contributions from brand-new would-be contributors (because of the ratio of corrective effort to average value of the contributions). A more structured approach, possibly even as extreme as prohibiting changes to entries from those not on white lists, might be called for. Forms for new entry suggestions and for input of citations might also be useful. DCDuring TALK 17:31, 23 August 2008 (UTC)
There are plenty of brand-new users who have no trouble making decent entries from the start. (See User:Kwamfun, who started with ความฝัน). There are also "contributors" who have nothing but disruptive crap. Let's guess for a moment that TDR's message from OTRS was received in the last 48 hours or so (hence Dave is posting this now). We don't know this, but under that assumption, the user in question created one of unicorn monster, tdaddy and poofer, cablephilia (personal attack), 2000-2001 United States network television schedule(Saturday morning) (by pasting from WP) or ancient drum and fife corps. The last is just about the only one I think was written by someone who can write coherently ... and probably should not have been deleted out-of-hand as a "Fatuous Entry": it is, rather, Wikipedia material, and (whaddya know? ;-) w:Ancient Fife and Drum Corps in fact is just that. I for one am very quick to block the idiots and drunk college students—who are quite obvious—and not anyone who might be okay. Robert Ullmann 17:52, 23 August 2008 (UTC)
I have responded to two complaints of this nature in the past few days, and many, many more prior to those. I do look into each of them so I can elaborate on block reasons and what the best course of action might have been for the blocked user. The truth is that we don't generally get the best contributions from the new people, some better than others, but given a chance and some positive communication they may turn out to be solid, long term folks. I seem to remember making fun of Connel for his initial contributions, some admins at the time berated him for some if I recall, and he turned out to be a somewhat prolific editor in the end ;) Admins are busy, I know that, it isn't always easy to be fully communicative and the benefit of the doubt takes more effort, but we do need more help and something about honey and vinegar comes to mind with regards to that. - [The]DaveRoss 21:27, 23 August 2008 (UTC)
Keep in mind that when an editor goes to re-create a deleted entry, (s)he sees a warning message stating that the entry was previously deleted, and listing which administrator(s) have deleted it together with the deletion summary/ies. So while it would be nice for administrators to leave talk-page comments when they delete seemingly well-meant entries, I think what's really important is that the deletion summary give enough information that it be intelligible to a newbie. Of the examples linked to by Robert Ullmann, I think most do so, though ancient drum and fife corps does not. —RuakhTALK 17:01, 24 August 2008 (UTC)

Wiktionary:About sign languages

I'd like to submit Wiktionary:About sign languages for adoption as policy. Two major considerations are that (a) it requests a bending of WT:CFI for these typically unwritten languages and (b) it introduces a novel transcription system for entry pagenames. Feedback is hereby invited. Rod (A. Smith) 06:34, 25 August 2008 (UTC)

For examples of entries that follow the guidelines at Wiktionary:About sign languages, see OpenB@Chest-PalmBack RoundSplane (please), 1@Sfhead-PalmDown Claw5@InsideChesthigh-PalmDown-Claw5@InsideTrunkhigh-PalmUp RoundHplane-RoundHplane (confused), and B@RadialWrist-PalmForward-OpenB@CenterChesthigh-PalmDown Sidetoside (busy). Other examples can be found in Category:American Sign Language. Rod (A. Smith) 01:57, 26 August 2008 (UTC)

I understand voting on a relaxation of the CFI, but what would the implications of voting on the transcription system be? Would it mean we couldn't change the transcription system for entry pagenames later (though the word "novel" is inviting:)? By analogy we allow multiple phonetic transcription schemes for spoken words. Would we want to allow multiple transcription schemes for ASL? --Bequw¢τ 07:41, 26 August 2008 (UTC)
Well, if we're talking pagenames, we'd really have to come up with a single system and stick to it. Also, I seem to recall hearing about some sort of unicode system for representing sign language. What's the word on that? Btw Bequw, do you mind if I copy your comments over to Wiktionary talk:About sign languages? I imagine this discussion really belongs there. -Atelaes λάλει ἐμοί 07:49, 26 August 2008 (UTC)
This is the same problem we encounter with languages that traditionally had no written language. For example, w:Inuit language#Writing lists 3 very common transcription systems. Would we only allow entry names in one of these systems? We could maybe have one system be "primary", such that pagenames created with other transcription schemes are soft redirects to ones in the "primary" system, but I think we ignore history (and our audience) if we only allow one.
The unicode system you're talking about is Stokoe Notation, but it can't be used directly because some of the characters it employs aren't allowed in pagenames. And of course, publish me widely. --Bequw¢τ 05:28, 27 August 2008 (UTC)
Yes, Atelaes is right that we really need to stick to a single pagename scheme for main entries. For historical or research purposes, it seems quite reasonable to allow soft redirects from other transcription systems (although the specific Unicode character set used by Stokoe isn't actually well defined). If we specifically call out the allowance for soft redirects in Wiktionary:About sign languages, would anyone care to word that, or should I? We should be able to button this up by the time Wiktionary:Votes/pl-2008-08/Wiktionary:About sign languages opens, but if anyone thinks these issues need more time to resolve, feel free to extend the vote's start date. Rod (A. Smith) 05:43, 27 August 2008 (UTC)
What about SignWriting? The 'pedia article seems to imply that there's a font and a Mediawiki plugin coming out soon. I mean, if it's coming out next month, might we not just wait until then? If it's coming out in two years, then we can just go ahead with our makeshift system and convert when the time comes. My point is that, if there's an official system of some sort, we'll probably want to use it. -Atelaes λάλει ἐμοί 05:49, 27 August 2008 (UTC)
I wish there were an official system. There doesn't appear to be any immediate solution for us. Even if a SignWriting font and MediaWiki plugin are released under GFDL, we'd still have the challenge of creating tools that create MediaWiki source like <img src="image.php?w=90&h=80&build=.7,ff0000,10,10,256,25,25...">, not to mention the fact that that's pretty much illegible and unusable for entry pagenames until Unicode and browsers (or at least operating systems) support it. Besides, if we do adopt a consistent, phonologically sound transcription system, we should be able to bot-migrate from our system to whatever eventual standard arises. Rod (A. Smith) 06:55, 27 August 2008 (UTC)
Yeah, SignWriting symbols should be in Unicode one of these years, and we can always switch over at that time.—msh210 20:29, 27 August 2008 (UTC)

Sections laid out horizontally

I'm not sure if this has been discussed previously, but what do people think about laying out neighboring sections horizontally (e.g. "Synonyms" and "Antonyms" at User:Bequw/horizontal)? I find it nice as the left-right aspect reinforces the idea that they are contrasting concepts. Secondly, though the example I made isn't very verbose, this layout makes better use of horizontal space, shortening-up the entries. This layout could be extended past "Synonyms" & "Antonyms" as well. If other semantic relations are allowed (possibly with less wonky names), "Hypernyms" & "Hyponyms" could be contrasted as well as "Meronyms" & "Holonyms". Indeed, "Etymologies" and "Descendant Terms" are contrasting ideas (one looking backwards in time, and one forwards) and if they were allowed to be reordered to be neighboring, they could be laid out horizontally as well. I think technically this layout is allowed by WT:ELE as the "order" is still preserved (the left section being "before" the right one). Not sure if the contrasting colors look right, but it's not bad. Is this desirable for entries that have longer contrasting sections? --Bequw¢τ 11:01, 25 August 2008 (UTC)

First, note that you've mixed templates in doing this. The template {{mid}} was only ever supposed to be used in Translation sections and is now deprecated in favor of {{trans-mid}}. For this situation, you want {{mid2}}. However, this format presents a number of problems. First, it means enclosing section headers within column templates. I'm not sure how comfortable I am with that. The proposal to pair Etymology with Descendants is more problematic, and would not work in many cases. It is a problem because (1) it moves a potentially large section to the front of the language entry, and (2) there are cases where a single etymology exists for more than one part of speech, each of which has its own descendants. So, the way I see it, Synonyms/Antonyms are the only pairs likely to ever be presented with this format. Would it then be worthwhile to do this? --EncycloPetey 15:52, 25 August 2008 (UTC)

Where does the discussion belong

Where is the appropriate place these days for wheel-warring disputes? I've issued a tentative 1-day block on User:Ruakh after reading some of his slander on my talk page and getting more, even as I was responding to some of it. Welcome back indeed.

Going forward, what is the proper procedure? Start a vote to desysop Ruakh? Mound up evidence there, or here, or what? --Connel MacKenzie 17:38, 25 August 2008 (UTC)

It begins (and ends) with you following proper RfD procedure for f**k. You deleted it, Ruakh restored it; the process goes to RfD. Robert Ullmann 17:45, 25 August 2008 (UTC)
The breach of process was Ruakh's. There certainly were no citations provided for a known-bad entry from a known-bad contributor. The notion that it might be an acceptable entry is too far fetched - it would be "#$%&" or something, not the wonderfoolism. But even IF those concerns were addressed, the technical "bad entry title" reason still holds. --Connel MacKenzie 18:06, 25 August 2008 (UTC)
The technical issue was discussed in WT:GP, and the conclusion was that there is no technical problem with the page title. It is possible (as you have mentioned) for it to cause a problem with some tool that does not properly maintain the code/data distinction, failing to treat page titles as data and not code, but in that case the tool is broken a priori. There is no issue with this title in the MediaWiki software. The entry was restored after this determination was made. Any further discussion is a Request for Deletion content-process issue. Robert Ullmann 18:13, 25 August 2008 (UTC)
That is false on four levels.
  1. It is not a hypothetic problem, but an actual problem. It is ignorant to suggest that testing the search capacity proves that all tools work. Special delimiter and wildcard character combinations cause problems - the first place I noticed this was for random pages. But fixing one does not mean that others are fixed; it does suggest the problem exists elsewhere. Simply following our original page title conventions of course bypasses all that nonsense. That aside, you obviously didn't automate tests of all Mediawiki software that can encounter it, nor even all the Javascript used just on this site, let alone all the toolserver tools that encounter it.
  2. The addition of previously deleted content is subject to citations. The content (from Wonderfool) was bogus to begin with. No evidence was submitted that the entry is anything other than his imagination. Plausible, perhaps. But in English, written as "$%&*" or somesuch, not that nonsense.
  3. The entry was restored out of spite, not after some determination was made. The comments around it make that very clear.
  4. As far as content, there was a theory thrown around that it could possibly be written that way. No evidence was proffered to that effect. It does not pass any laugh test, to suggest it might even be a common way of sanitizing it. Could it conceivably exist? Yes. Does it? No.
Now, since you've again restored it out of process, I assume you started the appropriate deletion discussion? No. Ahhh. So you are just pulling my leg here because you didn't read any of what Ruakh actually wrote. Since when did you start listening to his slander campaign? I expected much better of you.
--Connel MacKenzie 21:29, 25 August 2008 (UTC)
Actually, it was I who was taken in by Ruakh's well crafted lies. I restored it because there was a discussion, which appeared to find that there were no technical problems with the entry, and Ruakh also went to the trouble of citing it. If you still have problems with it, by all means feel free to start an rfd. -Atelaes λάλει ἐμοί 21:39, 25 August 2008 (UTC)
I guess I'd better do that, then. --Connel MacKenzie 21:51, 25 August 2008 (UTC)
Interesting to see citations hidden in the subpage now. With the creation of RFV, others maintained that the best place for such citations is in the entry. Both good and bad that it is different now. Nevertheless, this highlights greater concerns. The entry as written now suggests it is the way fuck is sanitized in the English language. That is misleading at best. The Usenet citations themselves...our CFI...what is this, random policy adherence when it's Ruakh? All the while, the technical problem of the entry title itself still exists. Nice shell-game shuffle, by the way. Discuss things openly? Not if you are Ruakh. Back room slander only. Unbelievable. And why? To put forward a known bad entry as valid. Great. --Connel MacKenzie 21:49, 25 August 2008 (UTC)
Re: "With the creation of RFV, others maintained that the best place for such citations is in the entry": I agree that citations should go in the entry, or that at least some of them should, but the entry was deleted at the time, and I was concerned that if I restored it and added the citations, it might give you an aneurysm. :-P   Re: open discussion: All Wiktionary content is GFDL; if you feel that a comment of mine is "back room slander" that should be discussed openly, you need only copy it to one of the high-volume discussion pages. (Er, technically I guess e-mails aren't automatically GFDL'd, but I've never turned down a request to release an e-mail under the GFDL. Granted, I've been in e-mail threads that were partly or entirely about you that you weren't privy to, but usually I've been pretty good about CC-ing you; and anyway, I've never slandered you, since that term implies falsehood — indeed, intentional falsehood.) —RuakhTALK 23:54, 25 August 2008 (UTC)
  • Silver lining: the out-of-process restoration will test "killbot"'s handling of this too. Not that that code didn't have enough similar problems already... --Connel MacKenzie 21:54, 25 August 2008 (UTC)


(for everyone else's benefit, I don't imagine that Connel needs this lecture, since he's been coding for about as long as I have, but whatever)

Page titles are data. Metacharacters (like * and ** in a lot of contexts) are code. Software that inadvertently (or intentionally!) treats data as code is seriously broken; it is an egregious error.

Any properly functioning software treats a page title as the content of a string variable; it is not going to break.

I've read most of the MW software, (not just "tested search"), and it does not have problems like that. Too many eyes, both professional and amateur on the open-source code. Not to mention being written by people who would not make that sort of kindergarten mistake. (I don't think I'm being entirely fair to kindergarteners here.)

It is possible for data to cause faults such as a keyerror in a database, but this is a design fault in the software, not converting the data (in this case page titles) to acceptable keys. And any number of other similar things. But saying "f**k" is not a valid string data value because "**" has some language code semantics is wrong.

"*" is a valid character in MW page titles, in any combination or sequence; software reading the data, whether from the XML or the UI or API must handle that. And there is no reason why any given software will not, unless the programmer has gone out of his/her way to conflate data with code, or as mentioned, uses the title string as a DB key or something that can't handle an arbitrary UTF-8/Unicode string.

(oh and Connel, deleting an entry from the wikt because it is breaking something of yours with a valid entry title is beyond the pale. Just skip titles containing "**" or whatever when you are loading the w:MUMPS global? Eh?) Robert Ullmann 11:12, 27 August 2008 (UTC)
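The data/code distinction described above can be made concrete with a short Python sketch. The list of titles and the helper name are hypothetical, chosen only for illustration; nothing here reflects the MediaWiki code or any particular toolserver tool. It shows a title being handled safely as plain string data, and then the two kinds of failure that appear when the same title is handed to a pattern engine as if it were code.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical page titles; "f**k" is the title under discussion.
titles = ["f**k", "fork", "flask"]
target = "f**k"

# Correct: the title is data, so compare it as a plain string.
exact = [t for t in titles if t == target]

# Broken: treating the title as a wildcard pattern makes "*" act as
# code, so unrelated titles match as well.
as_glob = [t for t in titles if fnmatchcase(t, target)]


def compile_unescaped(title):
    """Compile a title as a regex without escaping; None if it fails."""
    try:
        return re.compile(title)
    except re.error:
        # "**" is not a valid regex, so the title cannot even be compiled.
        return None


# Broken in a different way: the unescaped title is rejected outright.
unescaped = compile_unescaped(target)

# Correct again: escaping turns the title back into literal data
# before it reaches the regex engine.
safe = re.compile(re.escape(target))
escaped = [t for t in titles if safe.fullmatch(t)]
```

In this sketch the plain comparison and the escaped pattern both find only the one title, while the wildcard version matches all three and the unescaped regex fails to compile, which is the kind of tool-side fault being described rather than a problem with the title itself.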

Chinese Categories

Currently there are two sets of categories for standard Chinese: one set with the prefix Category:Mandarin and another set with the prefixes Category:zh, Category:zh-cn and Category:zh-tw. An example is: Category:Mandarin nouns and Category:zh:Nouns, Category:zh-cn:Nouns, Category:zh-tw:Nouns. The difference between the two sets of categories is that in the zh categories the entries are split up depending on whether they are simplified, traditional or pinyin, and the traditional (zh-tw) is sorted by radical and the others by pinyin. In the Mandarin categories pinyin, traditional and simplified are in the same category and it is sorted by pinyin. The zh categories have the benefit that you can find things based on either radical or pinyin. The bad thing about them is that the names are clumsy. A nice category name would be Mandarin nouns by radical, Mandarin nouns by pinyin, Cantonese nouns by radical etc. We have had a little discussion about it at Wiktionary talk:About Chinese and since changing categories can be a big task I have started the discussion here. Maybe we can just do the easy thing, which is to delete the Mandarin categories? Or can someone make a bot which can give us some nice names for categories? Kinamand 11:46, 26 August 2008 (UTC)

It is a lot easier to change than you think: almost all the entries (and it should be all the entries ;-) have the categories generated by the POS templates, such as {{cmn-noun}}. I set these up (with much help from A-cai) almost two years ago, with just this in mind; the templates implemented the category names as they then existed. It is a matter of deciding on structure and names in discussion on the talk page (at WT:AC). Robert Ullmann 09:42, 27 August 2008 (UTC)
My one concern is that there are a number of categories which don't belong to a specific template. One such example is Category:zh-cn:Job titles in Romance of the Three Kingdoms. Whatever we decide, I would want to ensure that there was a consistent implementation, whether a category is included in a template or not. I think this may be an ideal subject for further debate, and then perhaps a vote. That way, we ensure that we don't replace one unsatisfactory scheme with another. -- A-cai 23:47, 27 August 2008 (UTC)
I can see on Robert's page that he has programs which can run through all Chinese entries and check whether they have format problems, so I think he can make a program which can do a search and replace on all that kind of category names. If he cannot do that then maybe it is better to just delete the Mandarin categories. Kinamand 08:10, 1 September 2008 (UTC)
Another odd consequence of the current double set of categories is that in Category:Nouns by language you can find both Category:Mandarin nouns and Category:zh:Nouns, which both cover Mandarin nouns. Category:Mandarin nouns is also in Category:zh:Nouns. I feel more and more convinced that we should delete the Mandarin categories since they are not used much and only in situations which the zh/zh-cn/zh-tw categories also cover. Then later we can decide if we should and can rename the zh/zh-cn/zh-tw categories to something better. Kinamand 14:35, 2 September 2008 (UTC)

More constructed language updates to the CFI

After futzing about Europanto in the Constructed Languages section of the CFI, I realized there's several other factual discrepancies there.

  1. There are two ISO 639-3 coded languages not mentioned on the CFI: Blissymbols (zbl code created Aug '07) and Kotava (avk created Jan '08). I'd like to add these to the "not yet been approved for inclusion" list.
  2. Several languages listed as having an ISO code actually don't (mistakes from the wikipedia page): Jakelimotu jkl, Ceqli cql, Tceqli tcj, Delason dea, Linga lnq, Orcish orq. I'd like to move them to the line of conlangs w/o ISO codes.
  3. Glosa and Interglossa are mentioned separately, implying they both have ISO codes. Interglossa has the code igs, but Glosa is really just the new, sanctioned version of Interglossa (see w:Glosa). I'd like to list them jointly on the same line.

As these seem to be factual changes rather than policy changes, I'd imagine no one would have problems with the above changes. But if so speak-up. --Bequw¢τ 06:56, 27 August 2008 (UTC)

On point (2), it wasn't a 'pedia "mistake", it is that those codes (probably all, but I've only checked orq=Orcish and cql=Ceqli) were in draft ISO/DIS 639-3.5, but not in the final standard. I don't see any problem with changing CFI to reflect updates of this kind. Robert Ullmann 09:18, 27 August 2008 (UTC)
Quite right, they all appear in that document. Thanks. --Bequw¢τ 04:31, 28 August 2008 (UTC)

Template:see to Template:also

re: discussion above about see being the language code for Seneca

I'm noting this here for those who don't read the Grease Pit, please refer to WT:GP#Template:see

After a bit of discussion, EP proposed we change the template to {{also}}, which is just as simple, and doesn't conflict. AutoFormat has been taught how to do this, and done a few (there are apparently 42,000+ ;-). It can continue to munch on them; see the migration process described there.

At some point we should get the DYM extension proposed quite a while ago, which will do what {see}/{also} do in the majority of cases, at that point most can just be removed. In the meantime, this need not make any work for anyone. Please comment there or here. (There is probably better.) Robert Ullmann 11:54, 27 August 2008 (UTC)

AF is converting these; it will take a while, but it is a bot and doesn't get bored. It is spending most of its time working on simple things in the pronunciation sections. Editors should start using {{also}} when adding things, but don't bother changing it manually. (if you make any edit, AF picks the entry up anyway ;-). It will remain a redirect to {{see}} until it is more efficient to redirect the other way. Robert Ullmann 14:38, 31 August 2008 (UTC)


I propose that we have a new group of users called 'patroller' who can patrol edits and also have their own edits autopatrolled, that admins have the ability to add and remove users to and from the group, and that our policy be that all now and future whitelisted folks be added to the group, with our current procedure for whitelisting. This would require consensus on the policy change and also to get a dev to effect it.

The minor benefit is autopatrol of own edits sans RP or JS. The major is potentially more eyes on RC.
Giving more folks more tools. But they're on the WL, & this is not really more trust than that is.

Thoughts?—msh210 06:00, 28 August 2008 (UTC)

  • I like the idea of more patrollers, I'm just trying to work out why we wouldn't just want to make them full admins..? Ƿidsiþ 08:11, 28 August 2008 (UTC)
  • I think it takes three or four days for the typical newbie to get a handle on formatting for the kinds of changes that (s)he makes — adding translations, adding inflections, adding simple pages, whatever. I think it takes upwards of a month for the typical newbie to get a handle on our norms for blocking, page deletion, warnings, and so on. —RuakhTALK 12:21, 28 August 2008 (UTC)
    In my experience it takes longer than a few days. I have seen long-term newbies who still don't alphabetize Translations. And this week I came across "patrolled" edits that weren't formatted correctly, so I don't think extending the duties to less-experienced people would be wise. There are a lot of things that a person needs to know about formatting in order to patrol successfully. And once an edit has been marked as patrolled, other patrollers generally won't look at it, so it's vital that we have experienced people patrolling. --EncycloPetey 15:58, 28 August 2008 (UTC)
    My understanding of patrolling is not that we are checking for correct formatting, or even for accuracy particularly (which is not easy to judge in many languages), but that we are quickly marking to show that an edit is not vandalism. Ƿidsiþ 13:19, 29 August 2008 (UTC)
    My understanding has always been that we are checking for correct formatting. --EncycloPetey 16:08, 29 August 2008 (UTC)
    Just for reference, the earlier discussion we all had about this issue (where I apparently felt differently..) is at Wiktionary:Grease_pit_archive/2006/November#Patrolling_edits. Ƿidsiþ 19:09, 29 August 2008 (UTC)
    ...and see also Help:Patrolled edits. --EncycloPetey 19:13, 29 August 2008 (UTC)
    (alphabetizing translations) if the entry is patrolled, AF will pick it up and do that ;-) and even very experienced people make mistakes, checking a list for alpha order takes careful attention. point taken though, I agree, although there is a huge variation in learning time Robert Ullmann 16:14, 28 August 2008 (UTC)
Ruakh has a point here. I was nearly blacklisted before I even started because I didn't know about editing a previous entry. We've already got a dearth of people volunteering to help with this. Won't patrollers keep away the people who are on the fringe of assisting? Amina (sack36) 15:15, 28 August 2008 (UTC)
I'd like to note that this has been discussed before, and (at least the first time I noted) it would have taken an extension or core software mod; this is no longer true: it is now available in the MW s/w. We can have it turned on if we like. There are two user groups defined:
  • patrol: users can patrol other users' edits
  • autopatrol: users have their own edits patrolled automatically
Usually the settings are also modded to allow sysops access to Special:UserRights, and restrict them to set/unset these two groups.
This would certainly be useful in having users with accounts now in our whitelist be in the autopatrol group. Whether we want patrollers who are not sysops is a separable issue: as was noted in prior discussions, someone who is a reliable patroller should probably be nominated for sysop. In any case, we still need RP and the JS, because none of this applies to IP-anons. Robert Ullmann 16:14, 28 August 2008 (UTC)
Well, then, can we agree on having current and future whitelisted folks autopatrollers (but not patrollers)? I assume that, much as most of our admins' edits don't wind up in the patrol log, this will clear up the patrol log, and require less RP/JS work, to boot.—msh210 21:48, 2 September 2008 (UTC)
No, they will show up in the log as autopatrolled. Robert Ullmann 17:06, 22 September 2008 (UTC)

A few questions. Will users be able to see whether they're whitelisted? Will autopatrolled patrols show up in "history->logs"? And will users have a way of opting out of being whitelisted? Language Lover 02:35, 10 September 2008 (UTC)

They'll see they're whitelisted in special:listusers. I don't know how autopatrol logging works: I notice that some of my edits (I'm an admin, so an autopatroller) are listed, while most are not. Anyone know better than I? And as to opting out, the method of being added or removed from the whitelist need not change (and, in the proposed vote, does not change); if someone can be removed by request now, then he can if this is effected also. (I'm not sure what current practice is.)—msh210 16:45, 22 September 2008 (UTC)
You see your own edits occasionally showing in the patrol log as autopatrolled because of a harmless race condition in the software database updates. You'll note that is also true for bots. Is there any reason why a user would want to opt out of being autopatrolled? Why would they care? But as noted, if so, we can just do it. Robert Ullmann 17:06, 22 September 2008 (UTC)

A couple other points.

  • It's absolutely absurd to patrol for trivial formatting mistakes that AF picks up
  • I'm afraid whitelisting might make it harder for contributors to become sysops. This is because there's less incentive for current sysops to make them so. Sysopship is already enough of an "inside club". If someone is making so many good contribs that you'd consider whitelisting them, then just sysop them already! Language Lover 02:42, 10 September 2008 (UTC)
To your second point, the horse has left the barn long ago: we've had a whitelist for a while.—msh210 16:45, 22 September 2008 (UTC)
If someone patrolling sees formatting problems that AF will fix, they can just patrol the edit and it will get fixed. But I don't see why this is the issue, the patrolling is to look for vandalism, and is sometimes useful to tag things ({wikify}) or clean them up.
As to the second point: we've been whitelisting users for a long time, and it has no discouraging effect on nominating sysops; the reason for sysops is not so they can be autopatrolled. And there is a serious difference: I have no trouble whitelisting a Wonderfool sock (while keeping a weather eye on the edits), but sysop is out of the question. And we have one user who makes lots of good edits (she works on given name entries mostly) who has declined a nomination for sysop. And there are many more (and we want many many many more) who make good edits on an occasional basis that aren't candidates for sysop. Robert Ullmann 17:06, 22 September 2008 (UTC)


A user has asked about some formatting here. We have template {{iu}}, which points to a specific Inuit language, but the word igloo is not known to come from that particular variety of the Inuit languages. So, what etymological template should be used for "Inuit" as a general group? --EncycloPetey 21:35, 29 August 2008 (UTC)

I think iu (or iku) represents “generic” Inuktitut. ike and ikt are intended to represent Eastern and Western Canadian Inuktitut, respectively. Which specific language do you need? Michael Z. 2008-08-29 23:15 z

Alphabetising order of pronunciations

I originally wrote the first part of this at User talk:Robert Ullmann/Pronunciation exceptions but then when writing the second part I realised it almost certainly needs a wider audience. Thryduulf 00:50, 30 August 2008 (UTC)

Almost all entries that have pronunciations for more than one region have them sorted alphabetically by description - e.g.


However, there are some where they are in a different order. As AF can use the parameter of {{trreq}} templates to alphabetise translation sections, it would seem a probably simple matter to do the same with the parameter of {{a}} templates for pronunciation sections. At the same time it could sort all lines starting with an {{a}} template ahead of all lines in the same list with a {{rhymes}} template. If so it would be worth setting this up to run as part of AF's usual duties, as new pronunciations get added all the time and sorting out what descriptions we use in the {{a}} templates is a job for the future. Thryduulf 00:19, 30 August 2008 (UTC)

I've just realised that there are other templates as well, principally audio. I think it is settled the order should be:

* {{temp|a|blah}} {{IPA|/fɪʃ/|lang=en}}
* {{temp|a|foo}} {{IPA|/fiːʃ/|lang=en}}
* {{temp|audio|audio file 1|Audio (blah)}}
* {{temp|audio|audio file 2|Audio (foo)}}
* {{temp|rhymes|ɪʃ}} {{qualifier|blah}}
* {{temp|rhymes|iːʃ}} {{qualifier|foo}}
* {{temp|homophones|ghoti}}

The {{a}} templates should be alphabetised by the first parameter. The {{audio}} templates should be alphabetised by the 2nd parameter. This should in theory be the same order as above, but if the IPA transcriptions are labelled GA and RP and the audio UK and US the order will be exactly the opposite. Rhymes templates are more difficult - as qualifier templates may not be present, or may not contain the same text as the {{a}} the only consistent way I can think of is to match the parameter of the rhymes template with the contents of the IPA templates. However this will (I guess) be difficult to code, and made worse by the need to ignore stress markers in the IPA not present in the rhymes template (stripping [ˈ], [ˌ] and [.] should do it I think) and (as mentioned elsewhere recently) the use of [ɹ] in the IPA and [r] in the rhymes.
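The rhyme-matching idea in the paragraph above can be sketched roughly as follows. This is only an illustration in Python, not AutoFormat's actual code; the function names normalise_ipa and rhyme_matches are made up for the example. It strips [ˈ], [ˌ] and [.] and folds [ɹ] to [r] before comparing, as suggested.

```python
# Hypothetical helper names; a sketch of the matching idea, not AutoFormat code.

# Stress marks and syllable dots that appear in IPA but not in rhymes parameters.
_STRIP = str.maketrans("", "", "ˈˌ.")


def normalise_ipa(text: str) -> str:
    """Drop enclosing slashes, stress marks and syllable dots; fold ɹ to r."""
    return text.strip("/").translate(_STRIP).replace("ɹ", "r")


def rhyme_matches(rhyme: str, ipa: str) -> bool:
    """True if the normalised transcription ends with the normalised rhyme."""
    return normalise_ipa(ipa).endswith(normalise_ipa(rhyme))
```

With the garage transcriptions quoted further down this thread, rhyme_matches("ærɪdʒ", "/ˈgæˌɹɪdʒ/") pairs the UK rhyme with the UK IPA, while the same rhyme does not match /ɡəˈɹɑːʒ/. Other mismatches between rhymes and IPA conventions would need extra folding rules.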

Obviously where there is more than one list, each list should be sorted in this way, but the lists should not be combined or items moved between them. Thryduulf 00:50, 30 August 2008 (UTC)
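For what it's worth, the ordering described above (all {{a}} lines first, sorted on the first parameter, then {{audio}} lines sorted on the second parameter, then rhymes, then homophones) could be sketched like this. It is a minimal Python illustration under assumptions: the names TYPE_ORDER, sort_key and sort_pronunciation_list are invented here, only the first template on each line is inspected, and one call handles one list so that separate lists are never merged.

```python
import re

# Order of template types within one list, following the proposal above.
TYPE_ORDER = {"a": 0, "audio": 1, "rhymes": 2, "homophones": 3}

# Matches a simple, non-nested template such as {{a|UK}} or {{audio|f.ogg|Audio (UK)}}.
TEMPLATE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def sort_key(line: str):
    """Key on (template type, label) taken from the first template on the line."""
    m = TEMPLATE.search(line)
    if m is None:
        return (len(TYPE_ORDER), "")  # unknown lines sink to the end
    name = m.group(1).strip()
    params = m.group(2).split("|")[1:]
    if name == "a" and params:
        label = params[0]          # {{a}} sorts on its 1st parameter
    elif name == "audio" and len(params) >= 2:
        label = params[1]          # {{audio}} sorts on its 2nd parameter
    else:
        label = ""                 # rhymes etc. keep their existing order
    return (TYPE_ORDER.get(name, len(TYPE_ORDER)), label.casefold())


def sort_pronunciation_list(lines):
    """Sort the lines of ONE list; sorted() is stable, so ties keep input order."""
    return sorted(lines, key=sort_key)
```

Because the sort is stable, rhymes and homophones lines stay in whatever order the editor left them, which sidesteps the harder question of matching rhymes to accents.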

How did this get "settled"? I don't recall any trace of a vote, and it conflicts with ELE. The rhymes, audio, etc used to be indented under the appropriate pronunciation (accent) line, and the majority are this way. Not doing that means that every attribute of a particular accent must be individually tagged, rather than associated by structure. Compare from garage:


* {{a|Canada}} {{IPA|/ɡəˈɹɒʒ/|/ɡəˈɹɒdʒ/|/ɡəˈɹædʒ/|/ɡəˈɹæʒ/|lang=en}}, {{X-SAMPA|/g@"rQZ/|/ge"rQdZ/|/g@"r{Z/|lang=en}}
* {{a|UK}} {{IPA|/ˈgæˌɹɪdʒ/|lang=en}}, {{X-SAMPA|/"g{%rIdZ/|lang=en}}
* {{a|US}} {{IPA|/ɡəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}
* {{audio|en-gb-garage.ogg|Audio (UK)|lang=en}}
* {{audio|en-us-garage.ogg|Audio (US)|lang=en}}
* {{rhymes|ærɪdʒ|lang=en}} {{qualifier|UK}}
* {{rhymes|ɑːʒ|lang=en}} {{qualifier|US}}

(note how each thing is tagged in a different format ...) with the convention we have/had been using:


* {{a|Canada}} {{IPA|/ɡəˈɹɒʒ/|/ɡəˈɹɒdʒ/|/ɡəˈɹædʒ/|/ɡəˈɹæʒ/|lang=en}}, {{X-SAMPA|/g@"rQZ/|/ge"rQdZ/|/g@"r{Z/|lang=en}}
* {{a|UK}} {{IPA|/ˈgæˌɹɪdʒ/|lang=en}}, {{X-SAMPA|/"g{%rIdZ/|lang=en}}
*: {{audio|en-gb-garage.ogg|Audio (UK)|lang=en}}
*: {{rhymes|ærɪdʒ|lang=en}}
* {{a|US}} {{IPA|/ɡəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}
*: {{audio|en-us-garage.ogg|Audio (US)|lang=en}}
*: {{rhymes|ɑːʒ|lang=en}}

(note ELE has ":*" which isn't nesting lists properly; should be "*:" or "**") Is not this much better? Note that the comment above about the difficulty of sorting the rhymes applies to the reader.

Oh, and I want to point out something that was mis-stated somewhere above, (under "pronunciations wildly different") the assumption that RP = UK. General UK pronunciation is not RP. Received Pronunciation is a specific dialect, and often differs from UK (indeed, that was its original purpose!). The tags are distinct. (and {{a|UK|RP}} makes sense, when RP is worth mentioning) Likewise, GenAm (Midwestern US English) is a narrower accent than US. And it matters when you specify phonemes (which would be US), with phonetics, in which we need to distinguish between GenAm, Southern, Northeast city, AAVE, etc depending on the word, while they are all "US". Robert Ullmann 14:28, 31 August 2008 (UTC)

Yes, but we are using "US" to represent "GenAm" specifically, as the link will demonstrate. I agree that the second example above makes more sense for visual layout, since the audio file is paired with the correct IPA. However, we have almost no UK or Canada audio files and I've never seen any audio files from other locations. To a first approximation, all our audio files are US. Further, all our Rhymes are keyed for the UK (or RP) pronunciation only. So, while the second example above may be a desirable long-term goal, using it to format what we currently have would make most pages' format look ridiculous to most users. That is, we would have the UK pronunciation with "rhymes" indented under it, then the US pronunciation with "audio" indented under that. Such a layout will look wrong to 90%+ of the people who visit the page:


* {{a|UK}} {{IPA|/ˈgæˌɹɪdʒ/|lang=en}}, {{X-SAMPA|/"g{%rIdZ/|lang=en}}
*: {{rhymes|ærɪdʒ|lang=en}}
* {{a|US}} {{IPA|/ɡəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}
*: {{audio|en-us-garage.ogg|Audio (US)|lang=en}}

For now, I think we should consider that, if a page has no more than one Rhymes and no more than one Audio file (or line), then the Rhymes and Audio should appear unindented following the phonetic transcriptions. Also, please note that the above examples do not account for Homophones (variously formatted) or Hyphenation (which is erroneously included in the Pronunciation section). --EncycloPetey 15:47, 31 August 2008 (UTC)
On the contrary, this example doesn't look "wrong" at all: it makes it very clear that the rhymes are for the UK, and the audio is US, while the UK audio and US rhymes are missing. And when they are added, it doesn't cause a re-structuring.
Homophones should nest the same way (they are a special case of rhymes ;-). Hyphenation is an oddity, it is related to pronunciation, and we don't have any other place for it. There will probably always be some stuff at the end, after the accent(s).
We should also allow for 1 or more graphs of prose, (without *) after the * list, for notes of various sorts. (one could use a usage notes header, but I think that is just noise) Robert Ullmann 14:26, 2 September 2008 (UTC)
Hyphenation is NOT related to pronunciation. It is independent of pronunciation and is instead an issue of spelling and orthography. I have demonstrated this many times before. The places where hyphens are used to hyphenate a word do NOT correspond with the spoken syllable breaks. The easiest way to demonstrate this is in the Hyphenation: ex‧act and IPA /ɛk.sækt/. Notice that the break is different for the pronunciation and hyphenation. Hyphenation breaks are governed by morphology / spelling, whereas pronunciation is governed by phonology / sounds.
It is unfortunate that we have found no correct place to put this information, but please, let's not make the mistake of pretending hyphenation actually has any relationship to pronunciation merely to rationalize this unfortunate placement decision. --EncycloPetey 15:35, 2 September 2008 (UTC)
Don't be silly; I'm saying they are related, not that the syllable breaks are always the same. I suspect most 'muricans use the middot markings in the dictionary for both. (I doubt most understand the funny little decorations on the vowels in the actual pronunciation given. ;-). How do you pronounce "coop"? coop or co·op? "unionize"? (my favorite) un·ion·ize or un·i·on·ize? You of course know how to pronounce Nairobi a priori of course, so how do you pronounce Raila? If I give you the hyphenation (not that proper nouns are in practice): Nai·ro·bi and Ra·i·la I'll bet you pronounce Raila correctly now, even though the syllable breaks are different from the hyphenation. And you can avoid sounding like a tourist if you see jogoo (cock) as jo-go-o. All I'm saying is that they are related, and it isn't such a bad place to put it, as people find it very useful for pronunciation. Robert Ullmann 13:39, 4 September 2008 (UTC)
Also note that the "future" referred to is arriving. I just noted contain with a Canada pronunciation file, and US enPR, IPA, SAMPA. There are a lot of rhymes that are for US, as well as audio files for various accents. (And always keep other languages in mind). On the initial point, we should figure out some bit of sorting; annoyingly, "DerbethBot" put all the added audio links before the other pronunciation information. I am thinking of a sort on * lines (with the appropriate sortkeys generated, as with translations), with *: (and corrected :*) lines sorting with the previous * accent line. (again, as is done with translations) Robert Ullmann 15:05, 2 September 2008 (UTC)
Regarding the sorting, I'm actually in agreement that the nested lists are a good idea where we have several things. However there are a large number of entries with UK IPA/SAMPA and US audio with nothing else. Another set add the UK rhyme to that. Should we indent the US audio below the UK SAMPA?
Also, do we want to sort similar but not necessarily synonymous pronunciations separately or together? For example do we want "UK" audio to be sorted with "RP" pronunciation or separate from it? Should "GenAm" and "WEAE" be sorted together or separately? How does "US" audio fit with "GenAm" IPA?
There is also comparatively little consistency in the labelling of Australian pronunciation, I've seen "Aus", "Aus E", "AusE", "Australia" and "Australian" in different entries, all referring to the same thing - do we want to standardise on one of these? If so, and it is an abbreviation of any sort then we should set up {{a}} to link to an explanation in the same way "RP" does. Thryduulf 18:39, 2 September 2008 (UTC)
We have an Australian English template already, but it hasn't been as widely applied as the others. We have one for Canadian English, and a number of others too. See Category:Accent templates. Personally, I'm against the use of WEAE, because it is purely an artifice, and not any actual or particular American pronunciation in use. The way we've been using "US" is synonymous with "GenAm". --EncycloPetey 23:45, 2 September 2008 (UTC)
We actually appear to have both "Canada" and "CA" for Canadian English, should we standardise on one?
AusE does have a template, should we get AutoFormat change other descriptions for Australian to this?
Should we deprecate one of "GenAm" and "US" then in favour of the other? Even if not I presume you'd be happy with sorting "US" audio with "GenAm" IPA/SAMPA?
Should the order of regions be strictly alphabetical by given label, or should we group RP with UK and GenAm/WEAE/AAVE/US together? I know of at least one entry that has two British pronunciations given - Cheddar has "RP" and "Somerset"; if we are doing grouping then we'd need to keep these together and so presumably somewhere there would need to be a list of what to group with what? Thryduulf 00:52, 3 September 2008 (UTC)
I think grouping them breaks down fairly rapidly; the order will look arbitrary in a lot of cases. And while we might very well do something for English, are we going to do that for (say) Chinese? By region? (;-). Alphabetical would then be better. There are several cases: we have one, and order is moot, two, and order doesn't represent any structuring, or 3+, in which case non-alpha is already getting complicated ...
On all of the Canada/CA, GenAm/US (note these are not the same case, Canada = CA, GenAm != US), AusE etc I would (and AF is) leaving the tag in the entry what it is; we can use redirects/duplicates in the Template:accent space for now. (and that makes it easier to sort them later) I don't have any problem sorting US audio with GenAm accent. I am wondering what useful sorting might be done or not done automatically. Robert Ullmann 14:00, 4 September 2008 (UTC)
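The sort suggested earlier in this thread (sort on the * accent lines, with *: lines and the mis-nested :* lines travelling with the preceding * line) might look something like the sketch below. It is a Python illustration only: group_and_sort is a hypothetical name, it is not how AutoFormat is written, and the default key simply compares the accent line's text case-insensitively, which gives alphabetical order by label when every parent line starts with {{a|...}}.

```python
def group_and_sort(lines, key=str.casefold):
    """Sort top-level '*' lines, keeping nested lines attached to their parent.

    Lines starting with '*:', ':*' or '**' are treated as children of the
    most recent top-level line and are never reordered among themselves.
    """
    groups = []
    for line in lines:
        if line.startswith(("*:", ":*", "**")) and groups:
            groups[-1].append(line)   # child travels with its accent line
        else:
            groups.append([line])     # new top-level (accent) line
    groups.sort(key=lambda g: key(g[0]))  # stable: equal keys keep input order
    return [line for g in groups for line in g]
```

A different key function could be passed in to group aliases such as "Canada" and "CA" together without changing the tag stored in the entry; what that mapping should contain is exactly the open question above.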

The “indented” format doesn't make any sense, structurally. The “indentations” are actually HTML definition lists with the defined terms missing (technically this code is legal, but semantically it doesn't represent a meaningful structure). These would be better off as nested unordered lists (which can be made to look the same using CSS). Example below. Michael Z. 2008-09-02 15:54 z


* {{a|UK}} {{IPA|/ˈgæˌɹɪdʒ/|lang=en}}, {{X-SAMPA|/"g{%rIdZ/|lang=en}}
** {{rhymes|ærɪdʒ|lang=en}}
* {{a|US}} {{IPA|/ɡəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}
** {{audio|en-us-garage.ogg|Audio (US)|lang=en}}

Wikimedia software uses dl for : lists everywhere. If you have an issue with this, it should be taken up at that level. In any case, it is structured perfectly well, the *: items after an accent are in a properly nested sub-list (just as in your example with nested ul's). Robert Ullmann 13:45, 4 September 2008 (UTC)
It's a sublist implying a term-definition relationship to... nothing. I don't get your argument. Wikimedia software also uses ul for * lists everywhere. Why should we prefer the wonky ul-dl version of this to the semantically sensible ul-ul version? Michael Z. 2008-09-04 15:53 z
My point is that if you want ":" to be a ul list in a different class, that is a WM software issue, take it up there (but they won't listen, since you will be breaking everything in sight for un-needed semantic purity; dl's are used all over the net, not just WM, for various things that are not "definitions"). Our (non-)issue is whether we want an indent list (wikitext ":") or bulleted list (wikitext "*"). And the extra bullets are just visual noise. Robert Ullmann 17:12, 4 September 2008 (UTC)
I agree about the visual noise. Select use of bullets used for a single level of indentation can visually help users to see parallel items in a structured list, but too many bullets at too many levels makes them clutter. --EncycloPetey 17:26, 4 September 2008 (UTC)

Applying {{rel-top}} to Pronunciation

The increasing length of Pronunciation section makes it clearer that we need to sometimes use show/hide for Pronunciation material especially on a first screen. My unscientific gut makes me want to hide Pronunciation when it takes more than three lines after its heading. What would be the best way to use the gloss to indicate what lies beneath? Indicating which parts of speech are included, whether audio is available, and whether there are UK, US, or other pronunciations included all seem desirable. DCDuring TALK 15:26, 2 September 2008 (UTC)

Question: What is the current average number of lines for our Pronunciation sections? And what percent of entries have them at all? I think you're really jumping the gun on this issue. --EncycloPetey 15:41, 2 September 2008 (UTC)
I wish I could answer it and I'd certainly like to know the answer. Even more salient, though requiring an additional analysis of a different kind of less available data, would be what was the length of the first-screen pronunciation sections on the entries users other than admins were visiting. I'd love being able to reason from that kind of data instead of our assumptions. DCDuring TALK 17:52, 2 September 2008 (UTC)
I wonder if it might be worth investing in a pronunciation section specific template, which condenses/hides pronunciation in a specific manner. I share DCDuring's concern about users having to read a three page treatise before they get to see the defs, but rel-top is not the way to go. Perhaps we could have a template which shows, say, just the IPA's and nothing else, all within a single line, with the option to expand of course. Ultimately, I think this gives the basic, important information, and would probably save a lot of space on some entries. -Atelaes λάλει ἐμοί 16:43, 2 September 2008 (UTC)
Any solution to the first-screen space problem would be fine with me: show/hide, horizontalization, more abbreviations, separate namespace. I wouldn't even mind being shown data demonstrating that the first-screen space problem that I perceive was not a real one for new and casual users. DCDuring TALK 17:52, 2 September 2008 (UTC)
Keep in mind that some people are looking for the pronunciations, not the definitions ... I have known a couple of non-native speakers of English with well-thumbed Websters or AHDs that have told me they never use them for definitions: they have learned the written vocabulary just fine, but are always tripping over the multiple pronunciation systems in English (19, IIRC ;-). And we do have a large user base of non-native speakers.
anyway: see User:Robert Ullmann/Pronunciation statistics. The average number of lines is < 2; only a small handful of entries have more than 5-6 lines. I don't think this is a large issue. Perhaps entries with a lot of additional information (besides IPA and audio for the standard dialect for the language) could use something at some future time.
a thought I had was to put the audio on the same line with accent/IPA, and reduce the footprint:

* {{a|UK}} <span class="unicode audiolink">[[:Media:en-gb-garage.ogg|audio]]</span>, {{IPA|/ˈgæˌɹɪdʒ/|lang=en}}, {{X-SAMPA|/"g{%rIdZ/|lang=en}}
* {{a|US}} <span class="unicode audiolink">[[:Media:en-us-garage.ogg|audio]]</span>, {{IPA|/ɡəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}

(I really don't like that icon, too big and dark ;-) something like that. Robert Ullmann 17:04, 4 September 2008 (UTC)
Unfortunately, we have many pages where that layout would be too messy. For a growing number of words, there are multiple audio and IPA/enPR/SAMPA for a single region. That is, for the US, there may be two IPA pronunciation variants given in the {{IPA}} template, and two in the {{SAMPA}} template as well, then there are two audio files to represent the two US pronunciation variants. Putting all of that on one line (along with enPR) is too much to visually parse. To do something like the above would require splitting those IPA and SAMPA where they are doubled up, but that won't reduce the visual footprint, merely reorder it. And that's not something a bot can be taught to do since it requires matching the audio with the transcriptions. --EncycloPetey 17:36, 4 September 2008 (UTC)
I agree that there are an increasing number of very long pronunciation sections, some taking two inches of initial-screen space. It hardly seems unreasonable for the pronunciation media to be arrayed in horizontal rows, one for each regional accent and variant. If we do not have the capability to do it by bot, then let it be a cleanup list. I would expect most users interested in pronunciation to go, after their first encounter with one of our pronunciation sections, to the medium of their choice: IPA, SAMPA, enPR, audio, homophones. The last thing I would expect is that a user would scan the whole list every time. If we had uniform notation and placement on the left for the name of the pronunciation and a standard sequence of presentation for the media, we would definitely be accommodating pronunciation-seeking users as best we can without driving away the benighted new users who are mere definition-seekers. I would also think that a pronunciation-seeker would need to make sure that the pronunciation that was found in fact corresponded to the meaning sought, so that even a pronunciation seeker would also have to be a definition seeker as well. DCDuring TALK 18:49, 4 September 2008 (UTC)
So how would that affect the garage example above? First, note that the IPA and audio for the UK do not match. In fact, Cambridge gives five different UK pronunciations, as well as two in the US, so the above example (using Robert's proposal) should actually be:

* {{a|UK}} {{IPA|/ˈgæɹ.ɑːʒ/|lang=en}}, {{X-SAMPA|/"g{r.A:Z/|lang=en}}
* {{a|UK}} <span class="unicode audiolink">[[:Media:en-gb-garage.ogg|audio]]</span>, {{IPA|/ˈgæɹ.ɑːdʒ/|lang=en}}, {{X-SAMPA|/"g{r.A:dZ/|lang=en}}
* {{a|UK}} {{IPA|/ˈgæɹ.ɪdʒ/|lang=en}}, {{X-SAMPA|/"g{r.IdZ/|lang=en}}
* {{a|UK}} {{IPA|/gəˈɹɑːdʒ/|lang=en}}, {{X-SAMPA|/g@"rA:dZ/|lang=en}}
* {{a|UK}} {{IPA|/gəˈɹɑːʒ/|lang=en}}, {{X-SAMPA|/g@"rA:Z/|lang=en}}
* {{a|US}} <span class="unicode audiolink">[[:Media:en-us-garage.ogg|audio]]</span>, {{IPA|/ɡəˈɹɑʒ/|lang=en}}, {{X-SAMPA|/g@"rAZ/|lang=en}}
* {{a|US}} {{IPA|/gəˈɹɑdʒ/|lang=en}}, {{X-SAMPA|/g@"rAdZ/|lang=en}}

Will this have the desired effect of reducing the footprint? No. In fact, it would make the footprint larger for this entry. The current method would be to place all UK pronunciation variants on one line. Robert's proposal requires a separate line for each variant. The current method would also allow identical phonetic transcriptions from different regions to be combined on a single line. But if we include the audio in-line with the transcription, we can no longer do that, since we would have no means of accurately pairing audio with each of the several regions so combined. It would necessitate a separate line for each region. This would again increase the footprint of many pronunciation sections. --EncycloPetey 19:17, 4 September 2008 (UTC)
I said it was a "thought" ;-) do note that the vast majority of entries (language sections) have/will have one or two pronunciations, where this would work. And even in this example, if we had audio files for all 5 UK pronunciations, it wouldn't cost space. (and then there is Canada; how does that overlap UK? Haven't looked meself ;-) Robert Ullmann 16:38, 5 September 2008 (UTC)

Robert, could you produce a list of all the entries that have pronunciation sections of 10 or more lines? Looking at actual examples should show whether there are any common factors that produce large pronunciation sections.

Done, see same report file. Robert Ullmann 16:38, 5 September 2008 (UTC)

I personally would have no issue with a "show/hide" section for pronunciations, with a single collapsible entry for the whole section (as in most cases users are going to be interested in none or all of the section); the simple gloss "pronunciation" or "<language> pronunciation" would suffice I think, although "English pronunciation" might be confused with "UK pronunciation". Ideally this would be a separate template from {{rel-top}}, with a preference for them being open or closed by default set independently of other sections. Thryduulf 21:39, 4 September 2008 (UTC)

New UK wikimedia chapter

A plan is in the works to found a new UK chapter of the Wikimedia Foundation, and we are currently gathering support from the community. If you are interested in being part of this new UK chapter as a member, a board member or as someone with a general interest in the chapter, please head over to m:Wikimedia UK v2.0 and let us know. We also welcome help in making finishing touches to the plans. An election will be held shortly for the initial board, who will oversee the process of founding the company and accepting membership applications. They will then call an AGM to formally elect a new board, which will take the chapter forward, starting to raise funds and generally supporting the Wikimedia community in the UK.

Geni 19:27, 30 August 2008 (UTC)


Have a look at die, verb. I've added sub-senses to try and address the always-tricky problem of which prepositions to use. Note that: (1) these subsenses are there to illustrate usage and it's not intended that they should correspond to new translations tables or anything; (2) some prepositions form idiomatic phrases, which is why such collocations as die out or die away are not included except in the =See also= section.

Do we like? Is it too much for the main page, should it be on the Citations space? Any other thoughts? This is one of the trickiest issues for people learning a language, and I hope it's something Wiktionary can eventually deal with well. Ƿidsiþ 06:59, 31 August 2008 (UTC)

I love it and I hate it. There's a lot of excellent material there, which I think plenty of people will find useful. However, I also think that it makes the entry even more confusing, messy looking, and utterly unapproachable to the average user. What we really need (and not just for this, but for just about everything) is the ability to present just a little information in a concise, easy to read format, but have lots more info that the user can dig up if they desire. I know DCDuring's with me on this. -Atelaes λάλει ἐμοί 08:41, 31 August 2008 (UTC)
I love the subsection structuring and the usage examples. Our competitors push phrasal verbs to the bottom of the list of definitions, perhaps reflecting a judgment that folks needing such information are willing to work harder for it. They don't make users go to other pages, but don't actually have separate entries for phrasal verbs. We do have phrasal verb pages, some of which are a bit light on content.
So there might be a case for moving the citations to the phrasal verbs (conserving space) and putting links to those phrasal verbs under the associated senses using {{also}} (requiring an extra line) or an in-line link. It would overcome the major weakness in phrasal verbs: that currently they only appear in bare lists on the verb's page, without a hint of meaning, often hidden under show/hide bars. DCDuring TALK 10:35, 31 August 2008 (UTC)
These aren't phrasal verbs here (except perhaps die for). What if the entire subsection of prepositional forms were placed inside a collapsible Ruakh-box? --EncycloPetey 16:28, 31 August 2008 (UTC)
I do like the idea of listing the verb with all the prepositions that it can use on the main page where the base form is. I agree that the definitions with examples for each makes the page a little harder to read, but the information is essential. How about using a new section (similar to Derived terms) for listing the available phrasal verbs and pointing them to their own entry page? Users can go there for definitions and examples. --Panda10 10:53, 31 August 2008 (UTC)
I like it, but it sounds like a job for User:Ruakh/quotations. :-)   Also, I don't think it's ever really made clear that "die of" and "die from" are used to indicate cause of the death. —RuakhTALK 12:37, 31 August 2008 (UTC)
  • I agree, I think hiding quotations like this allows us to add more content without adding more clutter. Ƿidsiþ 14:32, 31 August 2008 (UTC)
If new users can quickly learn to use the show/hides, it's a perfect solution. Show/hides are becoming such a widespread feature of en.wikt that perhaps we just have to assume that most users will quickly figure them out. The relationship of this to the phrasal verb entries is a bit unclear. I doubt that anyone is suggesting that the phrasal verbs be replaced by this, but some of the existing phrasal verb entries have some content and {{also}} would make that available, though on a slower-to-link separate page. DCDuring TALK 15:57, 31 August 2008 (UTC)
This ought not to be used for phrasal verbs, but it could certainly be used in those cases where a verb tends to take one of only a limited number of prepositional phrases following it. For instance listen is usually followed by "to", on occasion by "at" or "for", but rarely by any other preposition. None of these is a case of a phrasal verb, but they are common associations. Now, in the case of listen, there are only a couple of possible prepositions, but for die there are a handful of regular prepositions rather than one or two, so something like this might make sense. I do share your concerns about how this would impact phrasal verbs, though. --EncycloPetey 16:26, 31 August 2008 (UTC)
  • Absolutely, this is my reading as well. Phrasal verbs are different and often it's helpful to treat them as idiomatic (which they basically are). Ƿidsiþ 19:20, 31 August 2008 (UTC)
Certainly a good way to organise the verb entries. It 1) helps to eliminate the arguments about how to order the different meanings. 2) demonstrates correct preposition usage (something close to my heart) 3) could allow easier access to phrasal verb entries, (also close to my heart). I like the idea of a show/hide section for phrasal verb links. 4) would aid the task of deciding quickly whether a particular verb+prep is phrasal or not. -- ALGRIF talk 16:25, 31 August 2008 (UTC)
Will our non-expert users not be mislead by a major distinction on the verb page in treatment between mere verbs-with-collocated-prepositions and true phrasal verbs. I would venture to say that few of us would trust ourselves to make a determination of whether a given collocation was truly a phrasal verb. How can we realistically believe we are doing language-learning users (not majoring in language studies) any good if we expect them to know to look in different places for the two kinds of collocations? DCDuring TALK 19:46, 31 August 2008 (UTC)
The more I think about this proposal, the less happy with it I become. What we are really talking about here is adding the definitions and uses of prepositions to the entries of verbs that use them. I think the information being presented is really information about the use of the prepositions, and less about the verbs. If there are common verb/preposition combinations, those ought to be discussed in the Usage notes, and not among the definitions. In formation about the meaning and use of a preposition belongs on the entry for that preposition, not on the verb that happens to be next to it. --EncycloPetey 21:19, 31 August 2008 (UTC)
Yes and no. Of course, we are talking about uses and definitions of prepositions. But manifestly, it is not enough to define prepositions, since they are so fluid and subject to so many idiomatic rules about which verbs they work with. If this information goes in Usage notes (which I'm not dismissing at all), I would only be concerned if this meant we lost the illustrative citations, since the whole value of it for me is in demonstrating actual usage. Again, this isn't about true phrasal verbs but about collocations which are pretty sum-of-parts – i.e., about combinations whose meaning is obvious but whose constituent parts are not necessarily predictable by non-native speakers. Many usage books deal extensively with prepositions in this way – the M-W Dictionary of English Usage for one – and it's something we should be able to handle too. Ƿidsiþ 08:50, 1 September 2008 (UTC)

Just to clarify some terms, it sounds like what we're talking about here primarily is listing the prepositional complements licensed by particular verbs. Is that right? If so, one could ask why you stop there rather than listing whether the verb can be complemented by a that clause, a gerund participle, a wh-clause, a goal, a bare infinitive, etc. And if you're showing verbal complements, what about the various complements licensed by prepositions, nouns, adjectives, etc.?--Brett 00:58, 2 September 2008 (UTC)

Personally I do ask that; I think we should have all such information. [[need#Usage notes]] and [[penser#Usage notes]] are examples of my own past attempts at this sort of thing; you'll notice that they're not restricted to discussion of prepositions (and while both need and penser are verbs, the same could certainly be done for words of other parts of speech). BTW, in response to a point made above, I've also done some usage notes that include point-illustrating example sentences (such as [[apparent#Usage notes]]) and quotations (can't remember right now, will post back if I think of one); I think the result looks nice, but it's a pain in the neck to do, since wiki-style markup doesn't support multi-paragraph list items AFAICT. —RuakhTALK 01:20, 2 September 2008 (UTC)
Those are features that have impressed me a lot about Longmans DCE. I don't know which other dictionaries do that, but it looks very impressive and sometimes helps me. It's certainly the kind of dictionary I'd recommend to a language learner.
Putting the kind of information under discussion in usage notes has the disadvantage of losing the association between a specific bit of usage information and the sense to which it applies. Longmans has an economical notation for it that presupposes a lot of repeat usage. We don't seem to know whether we have that kind of sustained usage and, therefore, whether we could have such terse notation. Expressing that volume of information in our unabbreviated way seems likely to increase the visual complexity of our entries for more casual users (whom I posit as looking principally for definitions) and, indeed, for all users of any information that is keyed to or specific to a definition.
This kind of sense-specific usage information fairly begs to be included under a sense-specific show/hide bar if we cannot see our way clear to using terse notation for such usage information. DCDuring TALK 03:06, 2 September 2008 (UTC)
I disagree with the assumption that this information is sense-specific. I see that this information is comparative and synthetic, drawing from multiple senses, and therefore breaking it up and hiding it in a series of collapsible bars is likely to be more a disservice than a service to our users. --EncycloPetey 15:58, 2 September 2008 (UTC)
Yeah, good point....though sometimes it is sense-specific. Like many things, it will probably have to be assessed on a case-by-case basis. Ƿidsiþ 16:02, 2 September 2008 (UTC)
Certainly sometimes. There can't be a general solution to the problem. Sometimes a user wants to know everything about a specific sense; sometimes they want to compare. It's the same with pronunciation, sometimes I want to know how to pronounce the noun or the verb, sometimes I want to understand how they differ. DCDuring TALK 00:20, 3 September 2008 (UTC)

You know, everyone interested in this topic should (if you haven't already) check out the special issue of the International Journal of Lexicography, in particular the stuff about pattern dictionaries.--Brett 01:09, 3 September 2008 (UTC)

Would that I could. DCDuring TALK 18:35, 4 September 2008 (UTC)
The lead article is free.--Brett 00:52, 6 September 2008 (UTC)