This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.
Beer parlour archives

collective nouns - appendices in general.

From WT:TR

Below is the discussion from the Tea room relating to collective nouns. Certain policy considerations arise. I will not add to the collective nouns appendix articles until we have sorted this out.

  1. Do we want to link appendices to the article word à la Widsith and Andrew massyn?
  2. Do we want to create a new inflection line for articles usually reserved for appendices, e.g. ====Collective nouns====, à la Hippietrail and DAVilla?
There are pros and cons to both solutions. I have listed some here. The third solution is to have a combination of the two.
  1. Do we want to link appendices to the article word?
  • It's harmless and promotes use of appendices.
  • It removes a lot of dross from the article page, while still giving information.
  • Appendices are exactly that. They should be away from the main article.
  • Appendices are not reliable in terms of usage, currency and the like.
In the collective nouns appendix, there is a disclaimer regarding usage and currency.
  • Appendices often deal with things which do not form part of the main business of a dictionary. e.g. A list of Presidents of the United States.
  1. Do we want to create new inflection lines with info underneath?
  • It gives the necessary information on the page where it should be.
  • Appendices can be difficult to edit. see e.g. Appendix: Animals.
  • The information is often not reliable.
  • The information is often not part of the main business of the dictionary.


Discussion from tea room

Do we have (and do we want) an appendix for collective nouns? I looked up passel yesterday and was surprised not to see a passel of brats. This got me thinking.... Andrew massyn 05:36, 2 September 2006 (UTC)

Well we have Appendix:Animals for animal ones. These appendices are pretty good actually, and should probably be linked to from more pages than they currently are. Widsith 07:47, 2 September 2006 (UTC)
By all means start one, Andrew, if you'd like to see one! It might be a good idea to cross-reference Appendix:Animals rather than reproducing the content on that page. — Paul G 07:51, 2 September 2006 (UTC)

OK I've created one at Appendix: Collective nouns. Please add to it. Andrew massyn 09:25, 2 September 2006 (UTC)

Well, it looks like it's going to be a huge page, so I hope the format is all right. It has already turned up some interesting results. Did you know, for example, that you get a bike of ants and a kettle of hawks; or that we don't have a plural entry for porpoises? I certainly didn't. Andrew massyn 12:36, 2 September 2006 (UTC)
Would it be better not to include the terms on the animals page? DAVilla 14:39, 2 September 2006 (UTC) Strike that. DAVilla 20:14, 3 September 2006 (UTC)
Yeah. But editing the animal page is a bitch (I am talking about Appendix: Animals, not each animal page). What I am doing is putting the Appendix: Animals at each of the animals I get to as well as the Appendix: Collective nouns. (Also there are collective nouns for non-animals in the normal sense - What do you call a bunch of lawyers, by the way?) Andrew massyn 16:01, 2 September 2006 (UTC)
An expense? Or maybe a quibble? Widsith 16:21, 2 September 2006 (UTC)
  • Putting a link to the collective nouns category or appendix from the word for every animal which has a collective noun is very wrong, especially under See Also. What you should do is add the collective terms under See also and put the links to the appendix or category only on the articles for the collective terms. If editing the articles properly is a bitch it's better if you don't edit them at all than editing them badly. — Hippietrail 20:14, 2 September 2006 (UTC)
Yes and No. I understand your objection, but if someone is going to look for the collective noun for e.g. cats, they are going to look under cat or cats, and not under clowder. What is the point of making any appendix impossible to use? P.S. I meant the Appendix: Animals not each animal article page. Perhaps that clarifies? I agree that the collective noun should be added to the article page under "see also", and will do that in future.
Also, I was here for months before I even realised that there was an Appendix: Animals. If it is not going to be linked, it is not going to be used. Andrew massyn 20:23, 2 September 2006 (UTC)
It might not be good to link to collective nouns from each animal page, but certainly you could link to the animals appendix from each, and from the animal category, and to collective nouns from that. DAVilla 05:34, 3 September 2006 (UTC)
Instead of adding the whole category which doesn't really make anything particularly easier, why not add the collective term under a new heading or as part of the headword/inflection line? — Hippietrail 06:32, 3 September 2006 (UTC)
O.K. I will then make an inflection line called ====collective nouns====. That should satisfy? I'll get to the changes in due course. I still think the appendices should be added to the words. Not only for collective nouns or animals, but for all appendices. I know when I browse, I get sidetracked onto various appendices. For e.g. Say I wanted to know in what year George III reigned, I might at the nonce want to find out at the same time when William and Mary reigned. The only way to do that simply is to follow a link. Andrew massyn 07:26, 3 September 2006 (UTC)
Oh no, I hate that solution! There are a million related words that could be added under their own headings but why make things complicated? I don't see what the objection is to linking to our own Appendix, which is very good, and has plenty of useful information that is sensibly kept out of the main dictionary entry. Widsith 10:35, 3 September 2006 (UTC)
The appendix animals is not comprehensive and is a nightmare to edit. It would be better to have the info at a more accessible place. Andrew massyn 13:10, 3 September 2006 (UTC)
If an appendix for collective nouns works out well then maybe we could just hack up the animals appendix into an animal genders appendix, a young animals appendix, etc. The appendix space might prove better at handling some of the other associations that we've been wondering how best to list. DAVilla 20:14, 3 September 2006 (UTC)
I've been wondering if many of the derived terms should be moved to a new heading so that related terms under them, such as these collective nouns, become more prominent. I'm thinking something along the lines of ====Compound terms==== e.g. at time since derived terms like timely get drowned out and related terms like temporal are pushed down. DAVilla 20:14, 3 September 2006 (UTC)
  • A couple of thoughts:
    Using appendices in the manner suggested here just sounds like trying to reproduce categories but in a broken way
Not really. Categories link words, and appendices have other info attached. If collective nouns were put in a category, you would get unrelated words like bunch and banana but no linking unless we put in set phrases such as a bunch of bananas, which I don't think we want to do in a category. Similarly, any appendix contains more info than you would want in a category: Presidents of the United States would presumably have dates attached; the Animals appendix has lots of information which can't be conveniently slapped into a category. I think that both categories and appendices have merit; it's just how to deal with them on the article page that is the problem. Take for e.g. the word king. It could have a category [heads of state] with words like king, president and dictator in it, and an appendix [kings of England] with the names of the kings and when they ruled. - Andrew massyn 14:04, 4 September 2006 (UTC)
  • ====Collective nouns==== would not be an "inflection line" but a "heading", and we already have a heading where such semantically related but not etymologically related terms belong, and that is ===See also===. The terms should be added there whether appendices are linked to or not, unless ====Collective nouns==== goes ahead, since the more specific heading would overrule the catchall heading. — Hippietrail 01:58, 4 September 2006 (UTC)
No quibbles except for what to do with oddities. See below. Andrew massyn 14:04, 4 September 2006 (UTC)
I think it's fine for them to go under =See also=, I just don't think they merit a heading of their own. Widsith 08:19, 4 September 2006 (UTC)
What about the really bizarre ones like a shrewdness of apes or a superfluity of nuns? I don't think they ever had real currency, although they might have appeared in a book in the 1400s and have been on many lists of collective nouns ever since. Do they get listed, and how do we deal with them? A Usage note? Andrew massyn 10:22, 4 September 2006 (UTC)
Yes, in such cases, some comment is needed in the page. Without a comment, "a superfluity of nuns" may be wrongly considered as anti-religious vandalism or (and it's not better) it might even be used in real life, with unpredictable reactions... Lmaltier 17:05, 6 September 2006 (UTC)

OK. Here's what I propose. If there are no serious objections in the next week or so, then this is how I intend to implement the discussion above.

  1. On the entry for the object of the collective noun (e.g. cats), I will put clowder (collective noun) under ====see also====, but with no link to the appendix.
  2. At clowder, under ====Derived terms====, I will add A clowder of cats (collective noun), but again with no link to the appendix.
This should then broadly satisfy most people and can be adapted with minor modifications if necessary for any peculiar pages. Andrew massyn 15:57, 30 September 2006 (UTC)

bot policy updated

There was good support for, and no objection to, the suggestions I had made at Wiktionary talk:Bots, so I have folded them in to Wiktionary:Bots. Please give the new policy a read and make sure you agree. (If you want to see just the changes, here's the diff.)

For the most part I have expanded the policy with additional language clarifying an aspiring bot owner's responsibilities. I have relaxed, but not completely eliminated, the former language restricting a Wiktionary bot to one narrowly-defined task.

Please list your support or your (brief) reservations here. But please divert any significant discussion (pro or con) to the talk page at Wiktionary talk:Bots. —scs 15:07, 3 September 2006 (UTC)


Object / Have reservations or concerns:

  • I apologize for not submitting my rewrite (collected from last week's IRC conversations) yet.
Please post your thoughts soon! I'd rather this didn't languish for another month.
I think each revision should have a request for comments/further revisions, before a premature vote is called. --Connel MacKenzie 15:28, 3 September 2006 (UTC)
Realistically, I'm not sure how many more comments we're likely to get. After a month of inactivity at Wiktionary:Bots and three weeks at Wiktionary talk:Bots, I really didn't think this rewrite and final call for support (I didn't think of it as a vote) was "premature". —scs 14:01, 4 September 2006 (UTC)

Copyright status of etymologies

Connel MacKenzie brought up the question of etymologies being copyrightable in #wiktionary. Etymologies are often based on extensive research and sometimes creative leaps, so a certain degree of protection would naturally be expected. The question is whether the details thus uncovered— that English interdiction is from Old English enterditen (to place under a church ban)— are copyrightable (as creative concepts), as opposed to simply the presentation thereof being copyrightable (as in tables or data). The Foundation's intellectual property lawyers are offline at the moment, but I'll point them to the discussion when I see them. // Pathoschild (editor / talk) 04:50, 4 September 2006 (UTC)

I'm no lawyer, but it sure seems like a stretch to me. It reminds me of how Columbus "discovered" America. What about the fact that the vast majority of words, and their etymologies, predate modern copyright laws, and that much of the etymology research mentioned above probably involved regurgitating earlier etymologies written by people who did not have modern copyright laws available to them?
In general terms, I think knowledge should be shared free-of-charge, but application of knowledge should be protected. Lawyers, does my philosophy have legal merit or am I misinformed?

P.S. I understand about putting in the time to do research on these things (take a look at the etymology for 傾國). A-cai 08:15, 4 September 2006 (UTC)

Isn't everything here copyrighted already (by the corresponding users, who wrote the text) and then licensed under GFDL? -Yyy 06:55, 5 September 2006 (UTC)
Bare facts are not copyrightable, so I consider that etymologies, dictionary definitions and translations are copyrightable only when something creative is made. The contents of very old dictionaries no longer have copyright protection.--Jusjih 07:37, 7 September 2006 (UTC)

Categories that contain lists

Many categories contain hand-crafted lists of words that someone wants added to that category (See Category:Telugu language as an example). Is this acceptable, or is it OK to delete the lists? It seems a shame to delete people's work. SemperBlotto 07:24, 6 September 2006 (UTC)

It is a handy way of temporarily entering a list, to seed a category. --Connel MacKenzie 07:25, 6 September 2006 (UTC)
Once the entries have been added, I usually remove these lists (unless there is an important sequence to the list). However, it is important to actually check each blue link to be sure that the definition applicable to the category is on the page. --EncycloPetey 18:17, 6 September 2006 (UTC)

Random page

How does the Random page function work? What type of algorithm does it use?--User:Hurray MH 08:22, 7 September 2006 (UTC)

In a nutshell: Every page, when it's created, gets a random number assigned to it, between 0 and 1. When you ask for a random page, the system generates a random number X, then asks the database, "give me the first article whose random number field is greater than X." (This is a fine example of the grand tradition of letting the database do all the work.)
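The scheme just described can be sketched in a few lines of Python. This is only an illustration, not the actual software: the class and method names are invented, a sorted list stands in for the database index, and the wrap-around to the first page when no key exceeds X is an assumption that the description above does not mention.

```python
import bisect
import random


class RandomPageIndex:
    """Toy stand-in for the page table: titles kept sorted by random key."""

    def __init__(self, rng=None):
        self._rng = rng if rng is not None else random.Random()
        self._keys = []    # sorted random numbers, one per page
        self._titles = []  # page titles, parallel to _keys

    def add_page(self, title, key=None):
        # When a page is created it gets a random number in [0, 1).
        if key is None:
            key = self._rng.random()
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._titles.insert(pos, title)
        return key

    def pick(self, x=None):
        # "Give me the first article whose random number is greater than X."
        if not self._keys:
            raise LookupError("no pages to choose from")
        if x is None:
            x = self._rng.random()
        pos = bisect.bisect_right(self._keys, x)
        if pos == len(self._keys):
            pos = 0  # assumption: wrap around when X exceeds every key
        return self._titles[pos]
```

The lookup is a single ordered search, which is why the approach is cheap: the index does all the work, and no list of candidate pages ever has to be built.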
The advantages of this scheme are that it's simple to implement and very efficient. The disadvantages are that it isn't perfectly random (though I believe the discrepancies are slight),
You'd better believe otherwise! How informative. DAVilla 06:18, 10 September 2006 (UTC)
Hard to say if you're agreeing with me or not! If not, do tell. —scs 13:33, 10 September 2006 (UTC)
The discrepancies would be more than slight. If the article numbers are never reassigned, the probability of hitting some words could be easily several times higher than that of others, and in all likelihood are. The distribution of probabilities depends on the differences between article numbers, which is by no means uniform. There are probably words that have never been hit by random search, for instance, an article with a random number very close to zero, or where there is another article just a tiny fraction below. DAVilla 15:21, 21 September 2006 (UTC)
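The point about gaps can be made concrete with a small calculation. The sketch below assumes that X is drawn uniformly from [0, 1) and that keys are distinct; the function name and the optional wrap-around behaviour are illustrative assumptions, not a description of the real software. Under those assumptions, each page's chance of being picked equals the distance from its key down to the next lower key.

```python
def selection_probabilities(keys, wrap=True):
    """Map each page's random key to its chance of being selected.

    A page is chosen when X falls between the previous key and its own,
    so its probability is simply the size of that gap.
    """
    ordered = sorted(keys)
    probs = {}
    prev = 0.0
    for k in ordered:
        probs[k] = k - prev
        prev = k
    if wrap and ordered:
        # Assumption: draws above the largest key wrap to the first page.
        probs[ordered[0]] += 1.0 - ordered[-1]
    return probs
```

For example, with keys 0.1, 0.11 and 0.5, the page at 0.11 sits just above its neighbour and is selected only about 1% of the time, while the page at 0.5 is selected about 39% of the time.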
and that it doesn't lend itself immediately to potentially more useful features like "give me a random page that isn't a stub or redirect" or "give me a random page in English".
(Various people here on Wiktionary have experimented with alternative random-page implementations that do let you, say, restrict the choice to a certain language. I think Connel has one that basically works.)
scs 13:31, 7 September 2006 (UTC)
Please don't feed trolls. --Connel MacKenzie 19:38, 7 September 2006 (UTC)
And I was supposed to know it was a troll how? —scs 19:46, 7 September 2006 (UTC)
The obnoxious sig. --Connel MacKenzie 19:52, 7 September 2006 (UTC)
There is nothing obnoxious about "Hurray MH". And I am not going to check where every sig on every question leads before deciding whether to answer the question. (But yes, after the fact I noticed the odd link in that sig. I'm removing it now, in case it's as dangerous as it looks; interested parties will now have to check the history to see. Would merely clicking on that link have blocked you, if I were a sysop? That sounds just as bad as reading mail using Outlook!) —scs 20:02, 7 September 2006 (UTC)
No, it would take a sysop to the blocking page. If they then accidentally blocked me, I of course could undo their mistake in short order. It has no effect other than being obnoxious. --Connel MacKenzie 20:30, 7 September 2006 (UTC)
I do think that including the option to gain access to a randomly generated page by language would be an important feature. It would provide users with the pleasure of "browsing" through a dictionary without often coming across pages in (for example) Chinese. Thanks if this can be done. Syrius 12:42, 13 September 2006 (UTC)
You are welcome to help me test (set a bookmark.) Note that it is subject to my 8-yr old linux box remaining online. The electric company has scheduled maintenance for my neighborhood for tomorrow; it may be offline for eight hours or so. :-(   --Connel MacKenzie 15:19, 13 September 2006 (UTC)

bot policy in limbo

So, being bold, I made some pretty significant changes to Wiktionary:Bots and asked for comments here. Unfortunately (perhaps due to the somewhat chilling effects of the one comment posted so far), nobody else has commented either way, at all. So, unless you believe that "silence = assent", the changes don't have any support, and should arguably be rolled back until such time as they do. Naturally, I'm reluctant to do that, but I won't complain if someone else does. —scs 13:38, 7 September 2006 (UTC)

I suspect that silence means that people have (rightly or wrongly) had other priorities and have trusted the small band of bot farmers -- after all, if we trust you to run bots, we can surely trust you to edit your own policy.
However, the thought that All it takes for evil to triumph is for all good men to do nothing has prompted me to check. In case I was too subtle for some, the evil I had in mind was staying in limbo, not releasing bots. --Enginear 12:13, 9 September 2006 (UTC) I think your alterations improve the policy by indicating the sort of responsibilities a bot runner is taking on. I support them.
However, they still do not answer my niggling worry that someone may overestimate his competence and, whatever responsibilities he has signed up to, perform some wrong edits which it is impracticable to undo. I am thinking, for example, of wrongful merging of two categories [in the general sense, not just things in curly brackets] without keeping a log of which items were altered. If a log is kept, then almost anything should be reversible by bot, but if not, manual inspection may be required, and if thousands of entries were affected, that would be a considerable task.
So my advice would be that each bot task should be peer-reviewed in advance by an experienced bot runner, to confirm that there is a valid back-out strategy if the changes needed to be reversed. The test run of a bot should then include a test of the backout (unless trivial). If you are prepared to add that to the policy, then I will be much less nervous about supporting bots in the future (at present I nearly always support them, but remain quietly nervous as I do so).
Obviously, there's a risk that the peer reviewer is a sock puppet of the task proposer, or that the bot is coded to do something underhand. However, if such a person failed to arouse the suspicion of other bot runners, it's unlikely any of the rest of us would notice either.
In short, community consensus is important in agreeing what tasks we want done, and may be helpful in judging whether someone is trustworthy enough to be allowed to run bots in the first place, but I believe the details of how you do it are best checked by another bot expert without interference from the rest of us.
I suspect I am speaking for many when I say thanks to the bot runners (and template writers) for your good work. Sorry we haven't joined the discussion, but we're not always sure we know enough to make useful contributions. --Enginear 19:28, 8 September 2006 (UTC)
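The logging and back-out idea raised above can be sketched as follows. This is a minimal illustration under stated assumptions, not any existing bot framework: pages are modelled as a plain dictionary of title to text, and the names LogEntry, run_task and back_out are hypothetical. The back-out only reverts a page if its text still matches what the bot wrote, so pages edited since are left for manual inspection.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    title: str
    before: str  # page text before the bot edit
    after: str   # page text the bot saved


def run_task(pages, transform):
    """Apply transform to every page, logging each page actually changed."""
    log = []
    for title in sorted(pages):
        before = pages[title]
        after = transform(before)
        if after != before:
            pages[title] = after
            log.append(LogEntry(title, before, after))
    return log


def back_out(pages, log):
    """Undo logged edits, newest first; return titles needing manual review."""
    skipped = []
    for entry in reversed(log):
        if pages.get(entry.title) == entry.after:
            pages[entry.title] = entry.before
        else:
            # Someone edited the page after the bot did; do not clobber it.
            skipped.append(entry.title)
    return skipped
```

A test run in this style would call run_task on a handful of pages, then back_out with the resulting log, and confirm that the pages return to their original text.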

  • Sorry, the silence meant that I sent my IRC logs to Scs via e-mail. I don't think he's had time to adequately digest the points of that debate, yet. But he did acknowledge receipt.
(Indeed. I'm about halfway through them. —scs 21:03, 8 September 2006 (UTC))

  • It is my belief that users, not tasks, should be given the bot flag. Due to GPL/GFDL incompatibility, the code review requirement as stated can't work. And certainly, a provision for excepting minor tasks (say, less than a thousand edits) should require no approval at all. All 'bot edits are just edits; the only effect of the 'bot flag is to not clog Special:Recentchanges. Nothing else. Did anyone block User:Connel MacKenzieBot or User:BD2412 during the recent spate of en-noun-reg changes to en-noun? No. Did anyone discuss it? No. (Well yes, but months ago and only hypothetically, then.)
  • I agree that trusted and competent users should have permission to do all non-contentious tasks by bot without seeking further permission. I would be happy to vote all current bot runners into this category, and would trust them to seek approval before making contentious changes. --Enginear 12:13, 9 September 2006 (UTC)
  • Trust is the issue. Lack of trust has an extraordinary chilling effect. Bot operation has technical throughput/volume concerns. But WP:AGF should still apply. These concerns have been shown time and again to be completely unfounded: no bot has yet brought any WMF wiki down, anywhere.
  • Trust is what allows me to support all bot requests, albeit nervously. WP:AGF? --Enginear 12:13, 9 September 2006 (UTC)
  • You want to have all bot tasks peer reviewed? Then PAY someone to peer review them. If not, then you should consider these edits to be edits just like any user editing Wiktionary. Anyone using any tool to assist their editing implicitly takes responsibility for their edits.
  • I also (more or less) trust the current bot runners to discuss with each other when they are unsure of their proposed methodology or coding. However, I work for an organisation where some activities can cause serious consequences and we therefore have a peer-review requirement. I have to admit that from time to time the peer-review of my work has picked up problems I missed and which could have had serious results. (And I am less bold than you!) I said I advised peer-reviews, particularly of back-out strategy, and that their use would make me less nervous. In view of the next paragraph, they could be limited to the rare occasions (if any) where a test run is not practicable. --Enginear 12:13, 9 September 2006 (UTC)
  • All users' contributions are accessible from Special:Contributions, whether they be bot-assisted, bot-driven, AWB, Javascript, offline editor, external editor, or wiki-input form.
  • I had not appreciated that; it makes me a lot less nervous. --Enginear 12:13, 9 September 2006 (UTC)
  • Do 'bots make mistakes? Certainly. The interwiki bot is notorious for putting interwiki links on Main Page and redirect pages. Does RobotWiktGM get blocked? I hope not!
  • The only point in someone courteously asking for a 'bot flag is to not annoy people who patrol Special:Recentchanges. Any discussion beyond that courtesy is specious and nefarious. It is only an act of tremendous kindness for someone to request a bot flag. Having the upside-down and backwards policy based only on irrational fear has stifled the English Wiktionary project considerably.
  • It is very hard to even see from the "fear perspective" as a bot operator. Having been here a long time, I do partly understand the concerns. But the backlash from years of onerous, ridiculous 'bot policies needs to begin.
  • Fear is usually based on lack of knowledge, eg that I didn't understand about Special:Contributions. Yes the backlash should begin -- as my edit summary said: Onwards and Upwards! --Enginear 12:13, 9 September 2006 (UTC)
  • Off the top of my head:
    1. Webster 1913 should be imported
    2. GMET should be imported
    3. 1st line defs from Wikipedia should be imported
    4. iSpell dictionaries should be imported as stubs
    5. translations from other language Wiktionaries should be imported
    6. translations from here should generate entries
    7. inflections should generate (more) entries
    8. artificial-voice pronunciations should be imported
    9. TV slang should be stubbed
    10. etymologies from W1913 should be inserted
    11. reference links should be inserted
    12. synonyms and antonyms sections should be imported from various thesauri
    13. Wikisaurus and Index namespaces should be auto-generated
    14. abbreviations, acronyms, initialisms should be imported
    15. technical jargon should be imported
    16. medical references should be imported
    17. clinical lexicons should be imported
  • Off the top of my head <gulp> I agree (particularly about technical, medical, and other specialist uses). --Enginear 12:13, 9 September 2006 (UTC)
  • These things may already have been done, if we hadn't had the idiotic policy in practice for the past couple years. Would we have over three million entries already? Probably. Would the entries we have be more consistent? Certainly. Would the collections make en.wikt: a more useful resource? Of course. Is there any sane reason for continuing the witch-hunt mentality currently in place?
No reason I know of for a witch hunt, though I think you may be over-optimistic re "more consistent"! Of course, the irrational fear which leads to witch hunts is usually from lack of knowledge, so education may help -- it may be hard for you to appreciate how little most of us understand of what you do, and frustrating for you to explain in words of one syll, but it has certainly alleviated most of my nervousness. --Enginear 12:13, 9 September 2006 (UTC)
So, Connel, do you act this truculent and hyperbolic in real life, or is this just your net persona? :-)
I'm as frustrated by foot-dragging and delays as the next guy, but I would hardly characterize any negative attitude towards bots that might exist here as a "witch hunt".
That's a very interesting list of potential auto-import tasks you've got there. But the hard questions are not "How do I write a bot to reformat and import this particular content?", nor "Can I get permission to bot-import this content?". No, the hard questions are "Is this data set high-quality enough that importing it would retain or improve Wiktionary's overall quality level?", and "Is Wiktionary well-served, and are our readers well-served, by having all this information mechanically integrated within Wiktionary?"
I'm not saying that the answers to either of those questions is necessarily "no", but they're certainly not automatic, knee-jerk yesses, either. —scs 21:03, 8 September 2006 (UTC)
Prior to your ad-hominem characterization of me, I would have put you firmly in the "pro-bot policy reform" category. :-) Seriously, there is one individual that I know of, only, who is responsible for the current "policy." And that is not you.
Smiley noted, as you noted mine, but in case anyone else misses them: it would have been ad hominem if I'd said that you were truculent or hyperbolic. But all I suggested was that you tend to act those ways, here. —scs 21:23, 8 September 2006 (UTC)
Still, any negative comment towards what I've said is certain to be harped on, by the stubborn element that has fostered the current atmosphere. One needs only to look to the "English nouns" specious complaints, to see examples of that. --Connel MacKenzie 21:40, 8 September 2006 (UTC)
ALL of those imports, insertions or corrections would benefit Wiktionary tremendously. A collective resource may contain many entries that on their own would not stand. But as a collective repository they do have value...even the ones "of lower quality."
Generated translation entries had the most specious of all arguments used to stifle it: that "having a stub entry is worse than having a 'proper' entry." I have seen hundreds (perhaps thousands) of times now, where the opposite is true.
Are they all knee-jerk "yeses?" Yes, beyond any doubt.
--Connel MacKenzie 21:14, 8 September 2006 (UTC)
  • Others:
    1. Generate stub entries for idioms
    2. Generate stub entries for all items in each appendix
    3. import language definition sets
    4. generate disambiguations for tone characters
    5. generate stubs for romanizations
    6. import grammar and usage guides to appendixes
  • --Connel MacKenzie 21:40, 8 September 2006 (UTC)

There seems to be some confusion here, namely the distinction between bots and tasks. A bot is simply an account with a flag, and it should not be a big deal to get such an account (for a trusted user). That's what the WT:BOT policy should deal with. The automated tasks done with such an account are different; unless they're absolutely uncontroversial (and that's what the bot owner is trusted to distinguish) they need some sort of pre-approval by the community. All of the above-mentioned proposals (quite obviously) need community approval, but that should not interfere with the management of bot accounts and bot owners. The current idea is that each task needs a separate account; the proposed change says that many tasks can be run under one account (which saves time and trouble). Whether and how tasks are to be approved should not be dealt with by the bot policy, or if it is, it should be clearly mentioned as distinct from the actual bot account/flag/rubbish rules. — Vildricianus 22:05, 8 September 2006 (UTC)

I'm sorry, but that sounds like it would simply be adding another hoop to jump through, on one's way to getting approval (for what we, today, call a "'bot".) I can't even imagine what bizarre "tests" would be devised to show one's trustworthiness.

I read it differently: a one-off approval after which you would have freedom to do anything you believed to be non-controversial. I haven't noticed any bizarre test in the approval of administrators (or indeed of non-automated contributors, unless "give him enough rope to hang himself" counts as cruel and unusual punishment). I would expect approval of bot runners to be similarly laid back. --Enginear 12:13, 9 September 2006 (UTC)

No. Come off it. The whole setup is perversely wrong. The current concept held by the Wiktionary community (engendered by a single person's POV) needs to be swept clean. There is no reason for any of the delays.

When Primetime was uploading via a bot, did anyone demand that he get a bot flag? No, he was blocked just the same.

When the X*cnt vandal was uploading using a bot, did anyone demand that he get a bot flag? No, he (and all of AOL) was blocked just the same.

When WillyOnWheels wrote his own page-moving bot, did anyone demand that he get a bot flag? No, he was blocked just the same.

That current group of bot-regulars have proved time and again their willingness to correct any problems caused. Why then, should people who not only have never "done bad things" with the technology, but in fact have relieved tremendous amounts of tedious cleanup, be made to walk the coals for each and every petty task?

--Connel MacKenzie 06:34, 9 September 2006 (UTC)

No reason, for petty tasks -- but nor, as far as I can see, has anyone here suggested it. Certainly, Vild and I have both agreed that non-contentious tasks should NOT require approval, and that we trust you with the decision of what is contentious.
For contentious tasks though, it is surely near-essential wikiquette to discuss first, rather than release a Terminator against regular human defenders! For lesser cases, it is surely easier to agree a change quietly first, and point to the consensus later if someone objects, rather than to argue after the event, when you will always be on the back foot, and all manner of objectors will come out of the woodwork. --Enginear 12:13, 9 September 2006 (UTC)
I simply don't get what you mean to change then, or was your post not directed to me, with the weird ---- in between? Now I could only repeat what I said above, but I'll clarify that the confusion is on my part. Are you talking about bot flags or about automated tasks? Do you mean it should be possible to do the stuff of that list without prior discussion? Or what? — Vildricianus 07:53, 9 September 2006 (UTC)

I don't know much about bots, but I am extremely doubtful about importing wordlists wholesale. The material we already have from Webster, for one, is little more than a big clean-up job. Widsith 08:09, 9 September 2006 (UTC)

Good sir, the only reason no one has cleaned them up is because they aren't in the main namespace! Instead, Wiktionary languishes with frighteningly incomplete coverage of the language. The same is more exaggerated for translation. --Connel MacKenzie 08:42, 9 September 2006 (UTC)
Better incomplete than incorrect like some, but that is no reason to languish when there are good quality (if differently formatted) non-copy-vio entries which can be imported. --Enginear 12:13, 9 September 2006 (UTC)
I apologize for my sleep-deprived rants of yesterday. I'm not suggesting (or at least, should not be) that some anarchistic "revolt" of bot-operators begin. Yes, I do take concerns such as those raised by Widsith quite seriously. And yes, I do of course comprehend that controversy surrounds each of the import tasks I listed. (Even if similar or identical imports are accepted out-of-hand on other language Wiktionnaires.)
I do wish it were easier for people to understand just how arbitrary the existing complaints are though. The current community mindset places a certain bias against reasonable progress. Rather than continue in what is probably perceived as an argumentative tack, I'll rewrite WT:BOTS in a manner I think is appropriate, and send it to Scs via e-mail as a starting point. Enginear's warning about inaction is very relevant. --Connel MacKenzie 13:34, 9 September 2006 (UTC)
One big problem, of course, is that the word "reasonable" can be rather slippery! One man's "perfectly reasonable" can be another's "dangerously scary".
Remember (as we've all said in various ways above) that a big part of the bot policy is about building trust. Those who care about the project want to make sure (among other things) that some bot-assisted revolution isn't going to drag the project off violently in some other direction. So they need to be assured that bot operators aren't going to make sweeping changes that a bot operator thinks are "perfectly reasonable" but that others in the community don't. If they're unsure about that assurance, they may insist that all bot tasks be formally approved in advance. And if they're unsure about that assurance, they may drag their feet about allowing bots at all.
Therefore, those of us who want to see the bot policy liberalized have to take this notion of trust- and consensus building seriously. We can't give the impression that, once we manage to get the bot policy liberalized and our own general-purpose bots approved, we're going to race off and make a whole bunch of big, sweeping, "perfectly reasonable" changes that, in fact, not everybody might agree with.
scs 14:36, 9 September 2006 (UTC)

Classification of slang

There's been some debate over whether taboo words that are considered vulgar, but that have been around for a very long time, such as fuck, cunt or cock should automatically be described as slang. There's been some discussion of the issue at talk:cunt#Slang and User talk:Stephen G. Brown#Motivate and discuss. Personally, I don't feel that classifying any definition of a word that is considered to be somehow substandard as (slang) is informative. Especially not when the definition would be recognized by practically all speakers.

I think we need a consistent guideline as to how to comment articles. Which policy documents regulate the use of comments at the moment?

Peter Isotalo 12:15, 9 September 2006 (UTC)

Offhand, I'd say that any term that cannot be used in formal writing (i.e. a speech or presentation) should be classified as slang. Perhaps it is our definition of slang that should be refined? --Connel MacKenzie 13:22, 9 September 2006 (UTC)

It's probably another one of those taxonomy/orthogonality things. There's a formal/informal axis, and a bland/offensive axis, and also an established/cutting-edge axis. Certainly fuck is not at all formal, is fairly (if not extremely) offensive, but is also very well established; it's not some rad new koinage that popped up on urbandictionary yesterday. So is it slang? Connel's right; it depends on your definition of slang, and different definitions of slang address different aspects of the three axes I mentioned. My own definition of "slang" would probably be that it describes terms that are both informal and relatively new, but without necessarily coming down on either side of the bland/offensive line. So by my definition, no, fuck isn't really slang any more, but that's just me. I think real dictionaries tend to call it "vulgar slang", which is probably about right.

The Jargon File has a nice little essay on "Slang, Jargon, and Techspeak" (which brings in yet another axis: "everyday" versus "specialized"). That link isn't working at the time I write this, but a google search turns up several mirrors. —scs 14:58, 9 September 2006 (UTC)

See also Wiktionary:Grease pit/2-level dictionary#Two-level too arbitrary, I think. —scs 00:31, 10 September 2006 (UTC)
...where (in case I've confused people) I say I would welcome more, not fewer tags. It's just that my view (which seems to be a minority) is that, while vulgar was appropriate, slang might not be. --Enginear 12:30, 10 September 2006 (UTC)
In my perception of what slang is, "fuck" doesn't classify as it. Vulgar, yes, but slang? I used to think that slang is pretty obscure language, usually quite new or used by only a certain group of people. (That's probably too narrow, but then, I don't usually spend my waking nights trying to define "slang".) — Vildricianus 20:27, 10 September 2006 (UTC)
  • Right: slang in a specialized sense means "unconventional."[1][2] Jargon is technically a type of slang unique to a certain group of people. If you want to take a very broad definition of it, then you can say slang therefore is informal, but the reasoning behind this approach is flawed, as one can use engineer's slang at an engineering conference without being informal. Ironically, the people who use the word slang to mean "informal" tend to be speaking in an informal, off-hand manner. In the example cited above, putting "vulgar slang" while defining slang as informal would be redundant. It would be the same as tagging a word as a "vulgar vulgarity."--Frem 22:39, 10 September 2006 (UTC)

I disagree. fuck is a slang word, because it is extremely informal. That is, and has always been, the primary definition of slang. The sense of ‘jargon’ is a separate, somewhat later, sense of the word which in no way negates the original meaning. Nor is ‘vulgar slang’ redundant, since ‘vulgar’ is used to mean ‘coarse, offensive’. Widsith 05:29, 12 September 2006 (UTC)

The sense of slang you are describing is historical, and I think we should keep our usage labels as precise as possible. Among people studying language, slang almost always means "unconventional." If we broaden our definitions too much, slang will assume the same meaning as vulgar. Further, vulgar can sometimes mean "popular." On the other hand, the definition you just gave for vulgar is very accurate. Vulgarisms are by modern definition informal (i.e., out of proper form). But we could add the word informal to that entry in place of slang to avoid ambiguity. In any case, the word slang in that entry is causing too much confusion.--Frem 09:20, 12 September 2006 (UTC)
There is nothing whatever imprecise or historical about it. A word is slang if it is used only in very informal speech, and vulgar if furthermore it is likely to be seen as offensive or distasteful. fuck qualifies for both of these. Widsith 14:33, 12 September 2006 (UTC)
Is slang then another entry where we should have two definitions, one tagged (linguistics) and one tagged (historical and popular usage) [or, if we want to wind people up, (vulgar)]? (This sort of discussion re tagging was part of Wiktionary:Grease pit/2-level dictionary#Two-level too arbitrary, I think. It is less trivial when applied to, say, medical terminology.) --Enginear 14:26, 12 September 2006 (UTC)
No, That wouldn't be accurate. I agree completely with Widsith; slang meaning "not formal" is in no way archaic nor obsolete. --Connel MacKenzie 17:20, 12 September 2006 (UTC)

Oldest red links?

Is there a way to generate a list of the oldest redlinks on the site (or someplace where such a list exists)? Cheers! bd2412 T 18:28, 14 September 2006 (UTC)

Also (completely unrelated question) can we get a bot to add 'pedia and wikiquote links to articles with entries under the same name on those sites?

I've been thinking of doing that. It can't be fully automated, though (i.e., it needs some oversight), because there are too many reasons why the spelling can be identical but the underlying concept not. —scs 20:37, 14 September 2006 (UTC)

Slightly even more unrelated question, is it appropriate to add 'pedia links to related entries, e.g. to add a link to the 'pedia's Atheism article from atheist, atheists, atheistic, atheistically? bd2412 T 20:14, 14 September 2006 (UTC)

My opinion is that it's fine to have a link from our "central" entry on a term, even if Wikipedia's title is slightly different (i.e. a different form of the word). But for all our little stubby articles like "form of foo" or "one who is foo" or "state of being foo", a Wikipedia link is superfluous, I think. —scs 20:37, 14 September 2006 (UTC)
I think it'd depend on each term; I wouldn't want a blanket prohibition on pedia links from stubs. --Connel MacKenzie 20:57, 14 September 2006 (UTC)
Depends how you define a stub - an article on caterpillars should ideally never say more than that it is the plural of caterpillar, but I think a 'pedia link to caterpillar is just as useful in either article. bd2412 T 00:32, 17 September 2006 (UTC)
I very strongly disagree with your conclusion about that example. The entry caterpillars ultimately should have translations, pronunciation(s), citations, example sentences, synonyms, an image, a video of the sign language for the plural and perhaps a gloss indicating what the singular form refers to (to save our readers a 10-15 second page load, chasing a link.) --Connel MacKenzie 18:36, 17 September 2006 (UTC)
Pronunciation, absolutely. Synonyms, maybe as a Wikisaurus link. Examples and citations... would that be of any inflected form or of the plural only? An image, fine... so long as that picture actually has more than one caterpillar in it. Sign language... would that be of the American variety? British? The international? Maybe even whatever is used in Singapore or India? DAVilla 15:10, 21 September 2006 (UTC)
I do not know of any tool that lists redlinks by age. While that would be a very nice thing to have, I can't think of a decent way to generate such a list. --Connel MacKenzie 20:57, 14 September 2006 (UTC)
It'd be easy enough to start building one going forward: just fetch the redlink list periodically (once a day or once a week) and tag each new word you find on it with the date you first found it. But doing so retroactively, no, I can't think of a way to do that, either. (If you had a lot of historical database snapshots lying around and a lot of CPU and your own time to spare, maybe...) —scs 21:24, 14 September 2006 (UTC)
Well, just one enwiktionary-latest-pages-meta-history.xml.7z has all the needed history, but that is a lot of CPU crunching. One problem is that redlinks that are no longer there (e.g. vandalism) have to be filtered out from such a list. That does become rather tricky. The other problem is that most of them will be translations for the oldest entries (dictionary, free, dog, etc.). So, a third pass would be needed to give preference to English red links, somehow. (Note: it is pretty hard to guess what language a redlink is "for.") (Note: simply decompressing the 8.72GB file is a bit of a challenge itself.) --Connel MacKenzie 14:25, 18 September 2006 (UTC)
Oh! Right. I forgot about pages-meta-history. That'd do it, wouldn't it?
And filtering out the ones that are "no longer there" isn't that tricky, is it? Just use the current redlink list as a basis. (Though there's a mildly interesting epistemological question here: I'm wondering what it means for a redlink to be "no longer there", since a redlink is something that wasn't there in the first place. So a missing redlink is something that's no longer no longer there, I guess.) —scs 01:43, 20 September 2006 (UTC)
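The "going forward" approach described above (fetch the redlink list periodically, tag each new title with the date it was first seen, and use the current list as the basis for filtering) could be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, and the redlink list is assumed to arrive as a plain list of titles from whatever report is used.

```python
from datetime import date


def update_first_seen(first_seen, current_redlinks, today):
    """Return a new {title: date first seen} mapping.

    first_seen       -- mapping built on earlier runs (hypothetical storage)
    current_redlinks -- titles on the redlink list fetched today
    today            -- date of this run
    """
    current = set(current_redlinks)
    # Use the current list as the basis: titles that are no longer
    # red (created, or the link was removed) drop out of the record.
    updated = {t: d for t, d in first_seen.items() if t in current}
    # Titles seen for the first time are tagged with today's date.
    for title in current:
        updated.setdefault(title, today)
    return updated


def oldest_first(first_seen):
    """List titles from oldest to newest first-seen date."""
    return sorted(first_seen, key=lambda t: (first_seen[t], t))


if __name__ == "__main__":
    run1 = update_first_seen({}, ["foo", "bar"], date(2006, 9, 14))
    run2 = update_first_seen(run1, ["bar", "baz"], date(2006, 9, 21))
    print(oldest_first(run2))
```

A sketch like this only records age from the first run onward; dating older redlinks would still need the history dump mentioned above.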

appendix idea: signs and commands

I had an idea for another appendix: signs and other traditional commands, so that we can start collecting their canonical translations in all languages. Examples:

  • No Smoking
  • Keep Off The Grass
  • Curb Your Dog
  • No Entry
  • Authorized Personnel Only
  • etc., etc.

A reasonable idea, or too goofy for words? —scs 20:32, 14 September 2006 (UTC)

If you want to start a list of directives, wouldn't a category be a better place to build it from the ground, up? --Connel MacKenzie 20:59, 14 September 2006 (UTC)
You mean, have all the sign texts be idiomatic main namespace entries, i.e. No Smoking, Keep Off The Grass, etc.? Would those meet WT:CFI? Would people freak? I wasn't ready to take that plunge yet; that's why I was thinking appendixly. —scs 21:28, 14 September 2006 (UTC)
No, I don't think they'd meet CFI in its present state. Yes, I think they have translation value. Witness, for instance, signs like some in the collection here that could clearly have benefited from such an index. Standard menu commands and dialog box contents to aid the localization of software would likewise be well worth translating, but not meet CFI.
At the risk of fanning some flames, I'd suggest that WiktionaryZ is (or will be) the better location for such matter, since entries should eventually be able to be identified as lexemes as such versus other phrases with translation value and sorted accordingly. Dvortygirl 00:49, 15 September 2006 (UTC)

I like the idea of an appendix, although I agree with Dvortygirl that WiktionaryZ may ultimately be the better place for such a list. That said, I think such an appendix here would be nice to have, particularly with translations. The argument against having it is that it is not part of the business of a dictionary, but rather has the elements of a phrase book. Nevertheless, my view is that, as we are not a paper dictionary and space is not at a premium, the argument is not valid. I think go for it and see if it works. If it doesn't, it can be abandoned without too much fuss. If it does work, then it will be a useful appendix (or is that an oxymoron?). Andrew massyn 18:38, 15 September 2006 (UTC)

If you mean is it a contradiction in terms (not the same as an oxymoron, although the term is very commonly misused in that way), then the answer is no. I wouldn't call it a tautology either, as appendices can just as easily be useless. — Paul G 16:18, 21 September 2006 (UTC)

template capitalization

Does anyone have strong preferences as to whether template names should be initial-caps or not? I'm about to create a couple new ones (for help in maintaining the above-mentioned appendix for signs I'm about to create), and I can't decide whether to name them "Template:sign..." or "Template:Sign...". —scs 12:53, 15 September 2006 (UTC)

I personally prefer upper cases. Since upper and lower cases are no longer interchangeable here, we may need redirects.--Jusjih 14:11, 15 September 2006 (UTC)
Please do not use upper-case first character template names on en.wiktionary. --Connel MacKenzie 18:08, 15 September 2006 (UTC)
Hmm. Looks like I get to flip a coin, or go with my own personal preference, since no one has come up with any hard reasons or actual arguments one way or the other. Ah, well. (No biggie.) —scs 02:01, 20 September 2006 (UTC)
If the purpose of the template is merely to produce the text "signs" or "Signs" within the page, as a label template for instance, then upper-case is incorrect. Otherwise I would say that the lower-case is only/at least/probably/not unlikely preferred. DAVilla 14:58, 21 September 2006 (UTC)

Phrase List

Dear Wiktionary Beer Parlour,

I have developed a free, no-advertisement, noncommercial web site, which is intended to be a major language resource with many uses: academic, personal enjoyment and the workplace. To do this requires participation by a wide variety of those who speak American English. At the moment, I have provided almost all the input of 109,000+ phrases, persons, places and things. So far, the site is not easily accessed via search engines (perhaps I have learned how to avoid their spiders?). My objectives have many similarities with Wikipedia's (interactive and open to everyone without personal identifying information, no fee) and some differences (quarantine for suitability of entries and the ability to classify the registered user without their having to give personal identification: no addresses, phone numbers, ID numbers, birth dates). The value of the site is dependent on having a wide variety of active users. I am interested in knowing if and how our sites can be linked to meet our goals. Check out and comment. The site has a detailed Q & A section that should answer most of your questions.


Martin MacIntyre

Perhaps your question would be better directed to the Wikimedia Foundation, somewhere on Since your project is copyright oriented (not copyleft) I don't think it will be a very good match. The GFDL is the basis for this site existing. --Connel MacKenzie 18:05, 15 September 2006 (UTC)

Part of speech headings

Discussion copied to Wiktionary_talk:Entry layout explained/POS Headers. Continue discussion there.

In the entry layout explained (from Community portal > Entry layout) you can read:

====The part of speech or other descriptor====
This is basically a level 3 header but may be a level 4 or higher when multiple etymologies or pronunciations are a factor. This header most often shows the part of speech, but is not restricted to "parts of speech" in the traditional sense. Many other descriptors like "Proper noun", "Idiom", "Abbreviation", "Phrasal noun", "Prefix", etc.

I couldn't find a link to further details on POS headings and more specifically to questions as: which POS headings are accepted, and what do they mean? Connel MacKenzie told me that POS headings were discussed last year and again this year, and that an agreement was reached over these headings. Unfortunately I couldn't find the outcome of these discussions, though I looked for them in the Grease Pit, as Connel suggested.

A few examples:

  1. A traffic light was a Noun, till someone called it a Noun phrase. Then Rodasmith changed it back to Noun, but without explaining why or without referring to any guideline.
  2. Many verb or noun forms seem to have a Verb form or Noun form heading, but apparently these headings are deprecated. Probably only Verb or Noun can be used in these cases, though - as several people have written - the inflection templates are inappropriate for non-lemmata.
  3. What about the heading Plural noun? See
  4. Why is Romanian a Proper noun and Russian a Noun?

There seems need for an accepted and easy-to-find guideline on these POS headings. That could surely avoid discussions like the one on

—Jan, 16 September 2006

Yes, you are quite right - this is something we do not yet have written policy on. "Russian" should be listed as a proper noun, but presumably whoever created the entry just wrote "Noun" and that has remained.
It can also be argued that the header "Adjective" for "Russian" should be "Proper adjective". We had a discussion some time back about restricting POS headers as they seemed to be proliferating unnecessarily.
Perhaps we should discuss and agree on a fixed set of POS headers. Some points to consider:
  • Ancient or modern? Traditional POS's are noun, adjective, verb, adverb, preposition, pronoun, interjection and article. Some modern dictionaries use terms such as "determiner", and "modifier". For example, "my" is traditionally a pronoun, but some dictionaries describe it as a possessive adjective. Some words do not fit conveniently into the boxes used by traditional grammarians: numerals are an example. "Two" (as in "two people") can be variously described as a numeral, a number, a cardinal number or an adjective, the last of these being the traditional POS. We generally use one of the other terms here, but it is debatable whether these headings are actually parts of speech.
  • Simple or precise? "Running" is an adjective (as in "running water" and "a running sore") but a more descriptive POS is "participial adjective", as "running" is also the present participle of "to run". In the verbal sense, "running" can be described as a "verb", "verb form", "verbal noun", "gerund" or "present participle". Similarly, nouns can be proper, common, abstract or collective, although dictionaries that make any distinction at all do so for the first of these only and label the other kinds as just "noun". "The" is an article, but is also the definite article.
Paul G 15:46, 16 September 2006 (UTC)
Perhaps not a huge issue, but I cannot find any justification for declaring demonyms (e.g. "Russian") to be proper nouns. They do not identify any specific individual, but rather one of a class of individuals, i.e. they are common nouns. English just seems to have been courteous enough to extend them capitalization from their proper noun roots. Rod (A. Smith) 23:55, 16 September 2006 (UTC)

Yes, we need a policy. This should probably go in the policy discussion itself, but after calling things "noun phrase" and so on for a while, I stopped including the word "phrase". I think most people can easily see that it's a phrase, and it just clutters up the list. In general, I think transitive/intransitive should use the templates {{transitive}}, {{intransitive}} in the definition line, so there should be no need for a "Transitive verb" heading. Countable and uncountable fit neatly these days into the inflection templates or the definition line and likewise shouldn't take up heading space. Likewise, I think there should not be a heading "verb form", just "verb", for simplicity and consistency.

I can offer a couple of arguments for having such a standardized list of headers. First, if you hover the mouse cursor over a header like "Noun" you'll get a tooltip that says "part of speech" or some such. If you hover it over a heading that's not on the list, you'll get some notice about "not a standard header". The list of headers for which tooltips exist might provide an excellent starting point for this discussion, and the list should be updated when our policy is agreed upon. Secondly, a standard list of headers will allow better automated access to the data, both for bot cleanup efforts and for exporting. It'll make things look cleaner and more consistent, besides. Also, consistent formatting should help propagate consistent formatting, since anybody copying from another article will be copying the correct thing.

Incidentally, when we do standardize on preferred headings, it will be the perfect task for our team of bot-runners to go help tidy up the inconsistencies. If we settle on "Alternative forms", say, rather than "Alternative form" or "Alternative spellings", the extra variations will be quick and easy to consolidate using bots. Perhaps we could try before then to establish the bot guidelines we want. —Dvortygirl 16:10, 16 September 2006 (UTC)

And in fact this task is precisely what User:ScsHdrRewrBot is intended to do, and -- lookie that! -- it's already been approved. I haven't run it much yet, but you can look at its contributions to see examples of the relatively few header cleanups it's done so far. —scs 12:52, 17 September 2006 (UTC)
Here are the parts of speech in the tooltips list that Dvortygirl is referring to:
Adjective, Adverb, Conjunction, Interjection, Noun, Prefix, Preposition, Pronoun, Proper noun, Suffix, Verb, Verb form.
—Jan, 18 September 2006
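As a rough illustration of the kind of header consolidation discussed above, here is a minimal sketch. It is not how User:ScsHdrRewrBot actually works; the mapping, the function name and the choice of "Alternative forms" as the preferred heading are all assumptions taken from Dvortygirl's hypothetical example, and a real run would need whatever list the community agrees on.

```python
import re

# Hypothetical mapping from variant headings (lower-cased) to the
# preferred heading; the real list would come from agreed policy.
PREFERRED = {
    "alternative form": "Alternative forms",
    "alternative spellings": "Alternative forms",
}

# A wikitext heading line: the same run of 2-6 "=" on both sides.
HEADER_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def normalize_headers(wikitext, preferred=PREFERRED):
    """Rewrite known variant headings; leave every other line untouched."""
    out = []
    for line in wikitext.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level, title = match.group(1), match.group(2)
            replacement = preferred.get(title.lower())
            if replacement is not None:
                line = f"{level}{replacement}{level}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "==English==\n===Alternative spellings===\n* colour\n===Noun==="
    print(normalize_headers(sample))
```

Only headings found in the mapping are rewritten, so anything contentious can simply be left out of the mapping until it has been discussed.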

The renewed conversations from this year can be found at WT:GP#Normalization of articles / User talk:Connel MacKenzie/Normalization of articles. Comments are still welcome. --Connel MacKenzie 22:09, 16 September 2006 (UTC)

Other classics, for those who like to read a lot: (Yes, I had trouble finding it because it was moved five times): Wiktionary talk:Entry layout explained/archive 2005BP#Uniform headings, and about eight other (more?) relevant sections of that same page. --Connel MacKenzie 22:21, 16 September 2006 (UTC)
On a side note, I think we need a policy regarding archiving. Most of the 2005 archive of talk:ELE is relevant, as is the 2004 archive. Ironically, the majority of those conversations were from this Beer Parlour, but were vandalously moved without leaving links behind. I don't think any of the conversations that were removed from WT:BP were completely finished with (as is evident by the same questions resurfacing one or two years later.) --Connel MacKenzie 22:25, 16 September 2006 (UTC)
On a side note, one thing that would help to (a) make these discussions easier to find and (b) not keep having them over and over again would be if we could all try to (c) centralize them on the talk pages for the relevant policies in the first place and (d) actually update the policy pages once we reach an actual consensus! —scs 13:48, 17 September 2006 (UTC) [Memo to self: wander on by to WT:ELE sometime soon and be bold in altering it to fit reality.]

Miscellaneous notes and opinions:

  • It would be useful to keep in mind why we're tagging words with their part of speech at all. Is it
    1. For the benefit of readers who are learning English or grammar
    2. To separate the definitions for entries that have senses in multiple parts of speech, and/or
    3. To satisfy our deep inner craving to rigorously categorize things?
For my own part, I'd like to focus on 1 and 2 (although I'm the first to admit that I've got the categorization bug, too; it's just one I try to keep in remission). There shouldn't be any shame in saying, for the really weird and hard-to-categorize words, that their part of speech is "other". (Of course, there's a significant logistical difficulty here in that our entries don't say "Part of speech: _____". A hypothetical "Other" as a part-of-speech heading under our current scheme would be confusing and wouldn't really work at all.)
  • I'm probably starting to sound like a broken record on orthogonality, but part of speech is really orthogonal to qualifications like "phrase" and "abbreviation". That is, many phrases, abbreviations, and contractions have meaningful parts of speech (though of course many do not). An interesting example I came across recently is HEPA, which you see on more and more vacuum cleaners and air filters, which stands for High Efficiency Particulate Air, and which is therefore pretty much an adjective. My point here is that, strictly speaking, things like "Phrase", "Initialism", "Abbreviation", "Contraction", and "Idiom" are not parts of speech at all, and a mechanism which specifies or categorizes parts of speech should arguably not be overloaded with trying to capture these distinctions, too.
  • If the distinguishing quality of "Noun phrase" versus "Noun" is "has a space in it", that's a pretty useless distinction, because any reader can see this for themselves. If we maintain a distinction for "noun phrase", it should be for longer, true phrases, like "the weather in London". Things like "lawn mower" are, I believe, pure and simply nouns. (In this case it's easy to prove, given that the spelling "lawnmower" also exists.)
  • Personally, I agree with Dvorty and others that the transitive/intransitive distinction is of secondary interest and should appear (if it appears at all) in tags on the definition lines for individual senses, not prominently in the Verb header. Similarly for countable/uncountable (which we do tend to do that way), and for concrete/abstract nouns (which we don't tend to try to capture, which is probably a good thing, 'cos it ends up being not such a clear-cut distinction after all).
  • Yet another distinction is for proper nouns. Those I don't mind being called out in the p-o-s heading, though I could go either way.
  • A somewhat trickier case is for the several words we've currently got listed using variations on "Adjective and adverb", such as quite. I'm not sure what the best way to handle those is.
  • As came up in the "nouns used as adjectives" thread, it can be argued that parts of speech in English are not nearly as rigid as we think they are, such that their use in a dictionary like ours could profitably be abandoned or drastically reworked, although that's probably too radical a proposal for today. (But the idea, I think, would be that instead of saying "moo: noun: 1. the sound made by a cow. verb: 1. to make a mooing sound", we could instead say "moo: 1. The sound made by a cow. 1a. (noun) an instance of this sound. 1b. (verb) to make this sound.")
    Nice attempt, and could work for the nouns that are also verbs, but makes other types of semantic relations more complicated to state. Whatever we change (if anything at all), it should remain both workable for all cases (problematic), and simple as feck (w:KISS principle). — Vildricianus 08:02, 18 September 2006 (UTC)
  • Yes, I have just completely ignored the suggestion I myself just made to keep long screeds like this one centralized on the relevant policy page's talk page...

scs 14:48, 17 September 2006 (UTC)

Discussion copied to Wiktionary_talk:Entry layout explained/POS Headers. Continue discussion there.

—This comment was unsigned.

The main problem with moving conversations around is that the instant they are no longer available on WT:BP, no one notices that they exist anymore. This has now become a rather critical policy-ish issue. Snippets from the previous conversations should probably be sprinkled in here. Annihilating this page is not the correct answer. --Connel MacKenzie 23:43, 17 September 2006 (UTC)
No one is proposing that we annihilate this page. The proposal is to organize this now-critical issue in order to make progress on it. Note that the discussion here is drifting to points of order and discussions over where the discussion should happen. All that is just bureaucratic quagmire. Let's just have the discussion so we can get a working decision to begin from. --EncycloPetey 22:39, 18 September 2006 (UTC)

As Connel once suggested somewhere (or was it someone else?), the logical thing to do is abolish the POS headers and make a ===Definitions=== heading instead, moving POS mention elsewhere (God knows where). That's from a structural viewpoint; we have standardized headings that mention which information follows in that section (Etymology, Pronunciation, Synonyms, etc.). The broken link in that chain is that in each entry, one or more headings carry the information themselves and don't mention what follows (readers are supposed to know that definitions follow in the section marked by the POS header). That's illogical and a basic structural flaw in our entry layout.

Interestingly, the reason why we would mention POS at all is different for everyone. As I see it, to fully comprehend the meaning of an English word, one first needs to know which POS it has. That's because English words don't morphologically distinguish per POS (languages like Latin or Russian are the opposite), which means that speakers or learners (both native and non-native) need to 'feel' which POS a word has in order to know its meaning. I myself had trouble long ago with "cunning" (why is it a noun? and then, why is it also an adjective??), but natives, too, have this kind of problem. — Vildricianus 07:58, 18 September 2006 (UTC)

By this logic, we should abolish the language headers as well, since they communicate the information rather than telling what information follows. But I don't see this as a viable alternative. There are two very good reasons to have information contained in these particular headers themselves. First, it allows for ease of cross-linking. If I want to link clam as a Latin adverb, then I can use [[clam#Latin|clam]] to do that, because Latin is a header. Likewise, if we want to link to a particular part of speech sense between pages, we can do so because it's built in as a header. Second, the Language and POS headers allow long pages to be scanned in the Contents for a desired use. If the headers only said "Language" and "Definitions", I wouldn't find it nearly as easy to navigate some of the longer pages. I don't just need to know that definitions exist on a page, because I take it for granted that a dictionary will have those. What I need is to know where the particular definitions I'm looking for are placed. This also has a secondary benefit of allowing a quick scan of the contents to see a list of all the parts of speech that a word is used for. This information is terribly useful when learning a language. And for languages other than English, the POS is just as needed, since anyone not intimately familiar with Latin may not recognize an inflected form, and all the information relevant to the use is tied up in what POS the word is. The inflection, definitions, translations, and so forth all hinge on which part of speech is intended. --EncycloPetey 22:39, 18 September 2006 (UTC)
  • EncycloPetey, I don't understand where you're going with this. First, you attack an active conversation (long overdue, at that) by moving parts of it out while it is finally being discussed. (I'd liken that to an act of war.) Then you make convoluted arguments about abolishing language headings, which, I'm pretty certain no one has suggested. Then you attack the idea of abolishing language headings by presenting examples of why it is a bad idea? (Since you bring it up though, I think Wikipedia-style disambiguation would be more efficient for spellings shared between languages. e.g. clam (en), clam (la) or even clam (English), clam (Latin). But, even I can appreciate that a task that huge would probably not be as beneficial as it might seem at first. And no, I do not suggest we try this.)
  • First, I did not move any part of anything out of anywhere. If you think otherwise, you can check the edit history and see that I deleted nothing at all. I copied out all the relevant comments I could find to begin a discussion on an issue that needs focus and cohesion. I then inserted pointers to the new location. The result is that we now have a draft describing known existing practices, which we may now use as a point of reference for discussion. Frankly, many of the headers people have identified I didn't know existed because they're being used in languages whose pages I never investigate. Such progress has not happened in any discussion previously occurring on this topic in the BP that I have seen.
  • The language heading was an analogy. Vildricianus noted that the logical option is to replace POS headers with "Definitions" for structural reasons (whether or not he supports that view, which isn't altogether clear). I pointed out that the natural extension of that reasoning leads to infeasible and undesirable ends. The point being that logical consistency in heading structure is not the only consideration, because any considered change has to be weighed against loss of utility. Does that help to clarify?
  • The thing that Vildricianus paraphrased above was something I saw an early contributor here do. They had experimented using a ===Definitions=== header to consolidate the different parts of speech so that they would appear grouped together (like you'd see in a normal dictionary) with things like {{pos_n}} or {{pos_vti}} at the start of each line. As I recall, Eclecticology deleted the entry on sight, because of the experimental formatting. I do not know how feasible a wholesale change like that is, at this point. Certainly, the different automation technologies available have been getting more attention (from people other than me!) lately. Obviously, such a dramatic change is possible. But not without a tremendous amount of discussion, and a very clear majority of contributors understanding it, and desiring it, first.
  • I have toyed with ideas pertaining to the introduction of "Definitions" as a header, but haven't found any options that work beyond lemma forms of English entries. After all, we don't put definitions on non-English pages in most cases, and we don't put definitions on non-lemma forms of English words either, but point to the main entry. That said, I could almost see introducing the "Definitions" as a level-4 subheader between the inflection line and the start of the definitions, but only for lemma forms of English words as I said. --EncycloPetey 21:52, 19 September 2006 (UTC)
    • The secondary effect of you removing the relevant conversation here, is that it encourages people to become vigilantes, who are now making changes to WT:ELE without first gaining consensus here, to reflect their own POV. --Connel MacKenzie 14:06, 19 September 2006 (UTC)
    • Examples?? Who exactly are these vigilantes who have altered the ELE in the last 72 hours as a direct result of this discussion occurring on a draft POS header page? The whole point of having the separate POS header page and associated talk page was to discourage alterations to the ELE until the issue was hammered out in discussion. The separate page keeps any discussion and changes segregated to a new page that is not yet linked from the ELE, and so will not be interpreted as part of it at this time. --EncycloPetey 21:52, 19 September 2006 (UTC)

User:TheDaveBot - Spanish verb Conjugations

Archived at User_talk:TheDaveBot.

Scots Verb present participles and AWB

Conversation of general interest moved from User talk:BD2412. bd2412 T 18:46, 17 September 2006 (UTC)

Hello there, I can see now from the conversation above this one that you've had some problems with the Verb form heading stuff. The articles below are Scots words but I think you inadvertently fixed them with AWB to use the {{present participle of|}} template which would automatically categorise to category:English verb present participles and of course this isn't correct for a Scots verb form. I've fixed the articles in question and I'll check my other Verb form articles and put them to the Verb heading. Just a note to let you know. Regards --Williamsayers79 10:49, 16 September 2006 (UTC)

Perhaps we should have a "Template:sc.present participle of"? bd2412 T 16:32, 16 September 2006 (UTC)
Why oh why is the template assuming that the language is English? (I think I understand one of Connel's comments now.) If it is really going to do this it needs a lot more magic. (lang= and several conditionals) This is a big issue; we've never sorted out what the heading, inflection line, and definition lines should be for verb forms. (also noun forms, adjective forms ...) Robert Ullmann 19:31, 16 September 2006 (UTC)
Rather than the template assuming the language is English, I'd like to take the English category out of the template and have it separately occur in the articles. bd2412 T 19:42, 16 September 2006 (UTC)
If you look at Template talk:infl I think that Connel was suggesting that the form-of templates take a language parameter. It could then (if the parameter was present) categorise in (lang) (specific form), or in (lang) (POS) form (e.g. English verb forms) or not at all, depending on the existence of the categories for the language. This would make the form-of templates language independent. Robert Ullmann 20:07, 16 September 2006 (UTC)
To be clear, the template would default to English, but inserting |sc| would, for example, make the word categorize as Scottish? That would be quite brilliant! bd2412 T 20:10, 16 September 2006 (UTC)
Done. Robert Ullmann 20:54, 16 September 2006 (UTC)
Exactly. With the lang= parameter in those templates, the categorization becomes cleaner also, as the correct Category:fr:present participle of would be in the template (and would only need to be corrected there if the category layout changes.) --Connel MacKenzie 18:29, 17 September 2006 (UTC)
Please, can this be moved to somewhere more relevant, like the beer parlour? --Connel MacKenzie 18:29, 17 September 2006 (UTC)
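
The fallback described in this thread (categorise by specific form if that category exists, else by the language's general "(POS) forms" category, else not at all) can be sketched roughly as below. This is only an illustration in Python, not the actual template code: the function name, the language table, and the set of existing categories are all assumptions made for the example, and the default of English follows bd2412's reading of the proposal.

```python
# Hypothetical sketch of the lang= fallback discussed above.
# LANGUAGE_NAMES and pick_category are illustrative names, not real template internals.
LANGUAGE_NAMES = {"en": "English", "sco": "Scots", "fr": "French"}


def pick_category(specific_form, pos, existing_categories, lang_code="en"):
    """Return the category a form-of template might assign, or None.

    specific_form: e.g. "present participles"
    pos: e.g. "verb"
    existing_categories: set of category names assumed to exist already
    lang_code: language code; defaults to English as proposed in the thread
    """
    language = LANGUAGE_NAMES.get(lang_code)
    if language is None:
        return None  # unknown language: do not categorise

    # First choice: "(lang) (specific form)", e.g. "Scots present participles".
    specific = f"{language} {specific_form}"
    if specific in existing_categories:
        return specific

    # Second choice: "(lang) (POS) forms", e.g. "English verb forms".
    general = f"{language} {pos} forms"
    if general in existing_categories:
        return general

    # Neither category exists: avoid creating a redlinked category.
    return None
```

Checking for an existing category before assigning one is what keeps minor languages from accumulating redlinked categories, which is the concern raised later in the thread.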

Okay: now look at airtin. Both the inflection line and the definition line categorize the entry the same way. Which should we prefer? I'm inclined to think that the form-of templates shouldn't be categorizing at all. Or should they?

  • where the [expletive deleted] did "English verb present participles" come from? Not "English present participles"?
  • I like the "Verb form" POS heading, and in (e.g.) French it is used a lot! Do we really want to do away with it?
  • I just wasted 20 minutes because some ninny wikilinked "Scots" in the sco template ... (sc is Sicilian)

What do you think? Robert Ullmann 20:54, 16 September 2006 (UTC)

    • Even though I just knocked out several hundred "Verb form" headings, yes I think that should be the standard for verb forms. I can easily change the "English verb present participles" to "English present participles" as well. sco is Scottish? Will keep that in mind! bd2412 T 21:07, 16 September 2006 (UTC)

Hi, I don't know if you noticed, but I left the template and doc and the airtin example out of sync last night. I had edited one, and the network connection went away. And at 1 AM Sunday morning in Nairobi, there isn't much you can do about it ... ;-) I put the language parameter in more like you suggested. See airtin as I mentioned. I think having the cat in the template is good, as long as it has the conditionals. (lots of redlinked cats for minor languages would be no good) Robert Ullmann 11:53, 17 September 2006 (UTC)

I took the language parameter out for now, and manually inserted the category into all the actual English articles. The template is in Template:new en verb pres part, which is probably the best solution for the moment. bd2412 T 17:42, 17 September 2006 (UTC)

If I understand this correctly I should be using the heading ===Verb form=== and we have to add the category into the verb form article separately or use the infl template.--Williamsayers79 17:01, 17 September 2006 (UTC)

I don't think so - I believe that ===Verb=== is the proper header, and there is no consensus (as yet) to make ===Verb form=== a header... I've undone most of the ones I added (still hunting a few here and there). bd2412 T 17:42, 17 September 2006 (UTC)

My commentary to all of the above: as noted I have removed the category from Template:present participle of and put it in the cheat template to create new English present participle entries. I've also created Category:Scots present participles and Category:French present participles (it was my understanding from an earlier discussion that categories for parts of speech would use the full name of the language instead of the abbreviated form). Each category is in Category:Present participles and in the Category:Foo verb forms for the appropriate language. I think this is the most logical organization, but am open to any ideas. I have a question also about the entries themselves - should an entry for, e.g., lâchant say "present participle of lâcher" (which it is), or simply indicate the English equivalent, which is releasing - or should it have both? bd2412 T 18:46, 17 September 2006 (UTC)

  • I am of the opinion that the following are the only things that should be used as part-of-speech headings for English entries: Symbol, Noun, Verb, Adverb, Adjective, Pronoun, Interjection, Article, Conjunction, Abbreviation, Initialism, Acronym and the x phrase derivations (noun phrase, verb phrase etc). This excludes "Verb form" because I think that everything which is labelled as "Verb" is a "verb form" whether it is the infinitive or the second-person plural of the past participle. - TheDaveRoss 18:59, 17 September 2006 (UTC)

English POS

Continued from comment above by TheDaveRoss, repeated below

  • I am of the opinion that the following are the only things that should be used as part-of-speech headings for English entries: Symbol, Noun, Verb, Adverb, Adjective, Pronoun, Interjection, Article, Conjunction, Abbreviation, Initialism, Acronym and the x phrase derivations (noun phrase, verb phrase etc). This excludes "Verb form" because I think that everything which is labelled as "Verb" is a "verb form" whether it is the infinitive or the second-person plural of the past participle. - TheDaveRoss 18:59, 17 September 2006 (UTC)

Um, For English I would add Preposition, as well as Cardinal number, Ordinal number, Idiom and possibly Phrase (though I haven't seen a clear example of the latter yet that couldn't be classified as something else). While it is true that headings like Noun form and Verb form have little utility in English, they have tremendous utility in highly inflected languages. I use them in Latin and Spanish when I am writing an entry for a non-lemma entry so that other editors will have a cue that the information about the word is not on that entry page and should not be added there.
Hello, don't forget Prefix and Suffix! bd2412 T 22:43, 20 September 2006 (UTC)
People keep saying that "there has been discussion" but all the links that I can find to such discussion seem not to have reached a conclusion with even a partial list of acceptable POS headers. Could we create an entry layout page (and corresponding talk page) where a list of accepted, debated, and rejected options could accrue? --EncycloPetey 22:06, 17 September 2006 (UTC)
Discussion copied to Wiktionary_talk:Entry layout explained/POS Headers. Continue discussion there.
I think removing the conversation from the beer parlour would be very detrimental. Moving conversations around is exactly the approach many of your predecessors have taken, which is exactly why you cannot find the previous discussions now. --Connel MacKenzie 15:09, 18 September 2006 (UTC)
As a general principle for discussion, I agree with you fully. But for this particular issue, I think we need a page and corresponding discussion. The topic comes up intermittently, with no apparent resolution each time it is discussed. I am trying to copy all the relevant discussion to the single location, which will archive it separately from all the other discussions that have happened here. It should be very easy to find in future, since it has a shortcut of WT:POS which is quite intuitive. Once some points have been fleshed out and agreed upon, the corresponding page would be summarized on the WT:ELE, with the full page linked from there directly. All this should make the past discussion much easier to find, rather than the reverse. --EncycloPetey 21:41, 18 September 2006 (UTC)
With the only exception being the archives, I'd say that every conversation that has ever been moved out of the beer parlour was moved "to make it easier to find." I maintain, that none of those are "easier to find" as a result. Ironically, the BP archives have the only usable, searchable index of topics. (The irony is that the archives exist only in an effort to reduce the page size.) --Connel MacKenzie 02:27, 19 September 2006 (UTC)
This issue is distinct and important enough that I'm going to create a new section for it, below. —scs 00:40, 20 September 2006 (UTC)
We clearly need the page, and that has a talk page which should be used. Connel, I think the operative word here is copy. Discussion is in order in either place. EncycloPetey may have just stated that a bit imperatively ... Robert Ullmann 15:48, 19 September 2006 (UTC)
What is the benefit of fragmenting a conversation in progress? To illicitly sneak in changes elsewhere because they are out of the spotlight? So you can sneak in "form" because you don't wish to find the earlier conversations that compellingly argued against your POV? --Connel MacKenzie 06:41, 20 September 2006 (UTC)
If you can find these compelling arguments, please link them or copy them into the WT:POS discussion page/archives. I have never seen any such arguments, but would sincerely like to. --EncycloPetey 17:44, 21 September 2006 (UTC)
Excuse me? Please refer to the second paragraph of the Beer parlour. What the hell, I'll copy it here:
Sometimes discussion identifies an issue as an idea for policy development or rewriting. Such discussions may be taken out of the Beer parlour to the relevant policy page, or a brand new one may be created. See Category:Policies - Wiktionary Top Level for identified policy pages. Some of these may be inactive. Usually, the active policy pages will be listed in one of the sections below. See also the policy development page.
That is, I believe, Beer Parlour policy? And the new page is properly identified as a policy think-tank draft; following Wiktionary:Policies and guidelines. There is nothing "sneaky" going on. And certainly no "sneak[ing] in changes", WT:POS is a very early draft, just as en.wikt policy requires. Connel, what is upsetting you so much that you are conducting a personal attack? We both know that isn't you!
I have read all the previous discussion I can find. I find very little of it "compelling". Whatever was "resolved" was apparently so un-important that no-one bothered to actually write a policy document (see scs comments below). I'm fairly agnostic on "form" in the header; although I do think it is good in the categories. The WT:POS draft is reflecting what is being used, what is not controversial, and some intermediate "take" on what is being discussed. Robert Ullmann 07:07, 20 September 2006 (UTC)
It seems I have completely misunderstood the purpose of that sub-page. I thought that it was part of WT:ELE already. As for the second paragraph of this page, I do not know when that was reworded that way, nor precisely why. It does seem reasonable enough. --Connel MacKenzie 19:13, 20 September 2006 (UTC)
I reworded that a while ago. — Vildricianus 08:22, 28 September 2006 (UTC)
I would like to clarify my initial statement (which I didn't realize was going to spark a new discussion) I think there should be a finite list of things which we all use as "part of speech" headers, and that that list should be clearly spelled out somewhere. That list I wrote is clearly lacking, but we should have one common list that we all rely on. - TheDaveRoss 17:25, 21 September 2006 (UTC)
A draft list is developing at WT:POS, inspired by your statement. There have already been found a number of headers in use that most of us probably didn't know existed. Once we have the list, we can use it as a reference point for discussions, proposals, and the setting of a "standard" which can then be incorporated into the WT:ELE. Thanks for sparking the discussion! --EncycloPetey 18:09, 21 September 2006 (UTC)

on consensus, and policy, and the beer parlour

There's been frustration evidenced in several of the threads above about the proper form and venue for these policy debates. They can be overwhelming if carried out in full here in the Beer Parlour, but they can get lost, or languish unresolved, if relegated anywhere else.

To some extent these observations are symptoms of what may be a larger problem. We don't currently seem to have a fully worked out way of developing and then promulgating new policy. We're good at debating issues and bringing up good points and counterarguments, and we're almost good at reaching consensus. (Sometimes we do actually reach consensus, but sometimes we can't decide and fall back on "this will be left to each individual editor's discretion"). But we don't seem to be very good at taking the final step, when we actually do reach consensus, of actually calling our new consensus a "policy" that we can write up and point newcomers at.

With respect to the specific metaissues that have been raised above: it's clear to me that, logically at least, the Beer Parlour is not the right place for these extended policy discussions. There are two reasons: one is that, once a debate gets sufficiently abstruse, it's no longer interesting to the majority of Beer Parlour readers, and it ought to instead take place in a more focused space where those who care can concentrate on it. (Two possibilities for such a focused space are the talk page for an existing policy page, or a "working group" or "think tank" page.)

The second reason is that, even though the Beer Parlour may be well-indexed and the regulars who remember past debates may be able to find them in the Beer Parlour's archives, no newcomer is ever going to be able to do that. A responsible newcomer who wants to review the policy debate behind, say, our entry layout is naturally going to start at Wiktionary talk:Entry layout explained, so that really ought to be where most of the interesting, meaty debate about entry layout issues ends up.

Finally, though, we need to ask ourselves why debates which aren't on the Beer Parlour languish and go nowhere, and why those debates that do go somewhere and which achieve (or come tantalizingly close to achieving) consensus don't manage to turn into citable policy. It may be that we don't have a critical mass of people who care about policy. (That's not necessarily a bad thing, of course, given that time spent worrying about policy is time not spent actually writing the dictionary.) It may be that the people who do care about policy are being too deferential to their opponents and not pushing policy forward as long as there's still any opposing viewpoint or dissent. It may be that we're simply too lazy (or too otherwise occupied) to do the boring work of writing a policy page once consensus has been reached, so we go off and do more interesting things (like actually writing the dictionary) until a newcomer comes along and starts asking questions about all the things the existing policy documents don't say, and then we have to take a step back and scratch our heads and try to remember what that consensus we thought we had was, and where the debate about it might be.

I hope I'm not sounding judgemental or accusatory here. (Really, that's not my intent.) And I don't actually have any grand prescription for improvement here, either -- this is all just food (er, beer :-) ) for thought. —scs 01:04, 20 September 2006 (UTC)

I believe you are greatly overstating the problem.
Well, I didn't say it was a huge problem. But when a well-meaning newcomer tries to understand something as basic as our scheme for classifying parts of speech, and even after reading WT:ELE can't find the information and has to ask here instead, and when after considerable roundabout discussion we discover that our best approximation of our preferred header list is buried in the tooltip code in Monobook.js, then yes, I'd say we have at least a little bit of a problem. :-) —scs 17:37, 20 September 2006 (UTC)
People can ask before doing something that seems like it doesn't follow convention. To blithely ignore existing conventions is another thing entirely. The general purpose for not having policies is not, as you assert, laziness, but instead is an effort to maintain some flexibility. That doesn't mean that some things haven't been agreed on, one way or the other. Room exists for lots of experimentation. That doesn't mean we should throw away years of "heading folding" efforts, on a whim.
If you wish to start building up the Wiktionary policies, that is a Good Thing. But it is very adversarial in nature. It is a thankless task which fosters arguments over the most minute details.
I'd appreciate you joining some of the conversations in irc:// regarding this topic. There are lots of ideas being tossed about, to help the situation.
--Connel MacKenzie 06:56, 20 September 2006 (UTC)
I'm not sure it's necessary to worry about whether a particular discussion is "no longer interesting to the majority of Beer Parlour readers". We don't have to read what doesn't interest us. Do you really feel that lighter discussions and more serious discussions don't mix here and that there are not enough choices of forums? Maybe the description here should say "keep it light", and there could be another group which says "keep it serious". In my experience with discussion groups, I have observed that they eventually end up being whatever people make them. People are going to say what they want to and take the discussions wherever they go. For me another forum would mean more places I'd have to look. Abstrator 07:00, 20 October 2006 (UTC)

em dash and em-dash

We have both em dash and em-dash, with basically identical entries. How do we handle it when we have words like this? Just put a note on each page that the other is an alternative spelling? RJFJR 15:54, 20 September 2006 (UTC)

If there are two pages which have, and always will have identical content, there is the possibility of transcluding one page into the other. This is rarely used because rarely is it applicable. More often alternate spellings simply have a simpler entry pointing at a fuller one, so an indicator in each of them pointing out the other would be appropriate. - TheDaveRoss 16:09, 20 September 2006 (UTC)

WT:CU - Checkuser policy

I have drafted an initial set of policy and procedure guidelines for Wiktionary's version of CheckUser; they can be found at WT:CU or Wiktionary:Requests for checkuser. Discussion of policy and procedure should take place on the talk page there, which will make it easier in the long run to find it again I think. Have at it! - TheDaveRoss 07:53, 21 September 2006 (UTC)

rfap template in article or talk space?

Does an rfap template (request for audio pronunciation) belong in main space or talk space? (I just put one at talk:hegemony because I didn't want to clutter up hegemony.)RJFJR 16:10, 21 September 2006 (UTC)

In the pronunciation section of an article is where I see it most, I would put it there. The talk page will also get the job done, so that is fine too. - TheDaveRoss 16:13, 21 September 2006 (UTC)

Anti-consensus changes - can recent Policy movement help?

One of the longest-standing problems that en.wiktionary has had, at an organizational level, is the discussion-less changes to fundamental pages such as WT:ELE or WT:CFI. Lately, there have been more people willing to take a "policy" approach to problems.

Several things come to mind immediately. One is that changes to such pages that don't point to a relevant conversation somewhere should be reverted on sight, right? Another is that we probably need a Wiktionary:Votes/WT:VOTE page, where items can be brought to the community's attention for something more solid than "perceived consensus."

Not necessarily. Some changes are points of clarification, such as the minor changes I just added noting that "Related terms" and "Derived terms" link words in the same language, to clarify against "Descendants" which clearly notes that it's for terms in other languages. This is a case where an addition did not lead to the necessary clarification of similar text elsewhere, but should have. --EncycloPetey 17:49, 21 September 2006 (UTC)

Are we ready to take some of these steps now? --Connel MacKenzie 17:40, 21 September 2006 (UTC)

Perhaps. We could certainly try having a VOTE page. If it doesn't work, we could always vote it away later ;) --EncycloPetey 17:49, 21 September 2006 (UTC)
Support the creation of a voting venue, support the solidification of policy, support of most everything. - TheDaveRoss 18:39, 21 September 2006 (UTC)
Support the voting page; support the clarification of decision policies, support the revisiting / revision / update of the ELE as a community. --EncycloPetey 18:53, 21 September 2006 (UTC)
Support the voting page; support the clarification of decision policies, support the revisiting / revision / update of the ELE as a community. --Enginear 21:08, 21 September 2006 (UTC)
Support the voting page; support the clarification of decision policies, support the revisiting / revision / update of the ELE as a community. (what they said :-) BTW, consider that some kind of gatekeeping mechanism for vote proposals might be required, such as a vote proposal page, requiring a specified number of seconds either by other users or admins before listing on vote page proper. But I guess that's getting ahead of things... --Jeffqyzt 21:58, 21 September 2006 (UTC)
Support the clarification of decision policies, support the revisiting / revision / update of the ELE and other policy pages as a community, oppose the institution of a formal voting mechanism at this time. —scs 17:23, 22 September 2006 (UTC)
  • Well, I wasn't calling for a vote, on VOTES but apparently people like the idea. Who feels like being bold, to get the ball rolling? --Connel MacKenzie 23:00, 21 September 2006 (UTC)
Seeing as I am in the "starting things" mood, WT:VOTE and Wiktionary:Votes have begun. The header might need some revision before it is...complete. - TheDaveRoss 03:47, 22 September 2006 (UTC)
I've made some changes to what you had there. (Rebuilt it entirely, but left the purple.) I added a test vote thing; two other people have already voted on it, so I guess this was perhaps overdue. I stopped short of re-arranging the ongoing WT:C page to have a sub-page arrangement (so that the votes could appear in two places) for several reasons. Better to let that run its course, before getting too fancy. --Connel MacKenzie 09:19, 22 September 2006 (UTC)

This vote is mainly a test, to let people experiment with how voting here will work. TheDaveRoss put the "50 minimum" thing in the original WT:VOTE page, but I figured I'd start things off with an opposition vote, as I think the waivers for sister-language (and sister projects) are critical. --Connel MacKenzie 07:38, 22 September 2006 (UTC)

ELE headers - Related & Derived terms

I noticed that V-ball had reordered the sections on the current WOTD thaumaturgy, so that the Related terms were listed as a level-4 header under the POS (and before the translations), instead of as level-3 at the end. The cited reason in the edit history was the ELE. Looking at the ELE, I found no statement requiring this structure, and in fact the Related terms and Derived terms sections are discussed in sequence after the Translations. However, I did notice that the (complex) example included on the page has this format.

Personally, I think we should change this example for three reasons. (1) Lists of related words and derived words are not directly relevant to the word itself, because they often belong to other parts of speech. This makes the content of these sections very different from what is listed under synonyms, quotations, or translations. They do not pertain to the POS, and should not be listed in such a manner as to imply that they are. (2) Placing them before the Translations section creates a physical and thematic separation between the definitions/quotations and the translations. We want users adding translations to be able to look back and forth between definitions, quotations, and translations without having intrusive material in between. (3) The current example implies a particular format that is not actually described or advocated anywhere in the ELE.

My own practice has been to place the Related/Derived terms as a level-3 header following the translations. This makes more sense to me. Now, I'm not saying that this is always the preferred location, since there are cases in which we want to tie these entries to a particular part of speech or particular sense, but I think for the more general case, this is the most logical sequence. Thoughts? --EncycloPetey 18:03, 21 September 2006 (UTC)

I guess I don't have strong feelings about the level of the headings, although I think having them 4th level makes more sense as they are related to the entry, although, like you say, not always directly (I would think, though, that derived words are directly related).  However, I'm all for having these entries after the translation section as you suggest.  Most importantly, I'm for a standard, a standard described in the ELE for all to see, and a standard that is then used.  —  V-ball 21:05, 21 September 2006 (UTC)
The translations should come after all other English language relationships (synonyms, derived terms, etc.) to the term. If there are multiple etymologies for a word, but the synonyms apply to all etymological definitions (unlikely), then the synonyms should be at level three, with the etymology, while the part of speech heading would be at four and translations at five. --Connel MacKenzie 21:11, 21 September 2006 (UTC)
The relevant sections in ELE that say to me that, e.g. Derived terms, should be one level below the P.O.S. that spawns them (if known) are from WT:ELE#Additional headings, "A key principle in ordering the headings and indentation levels is nesting. The order shown above accomplishes this most of the time. A heading placed at one level includes everything that follows until an equivalent level is encountered. If a word can be a noun and a verb, everything that derives from its being the first chosen part of speech should be put before the second one is started. Nesting is a key principle to the organization of Wiktionary, but the concept suffers from being difficult to describe with verbal economy. If you have problems with this, examine existing articles, or ask questions of a more senior person.", (emphasis mine) and from WT:ELE#Derived terms, "If it is not known from which part of speech a certain derivative was formed it is necessary to have a "Derived terms" header on the same level as the part of speech headings." The example shows the breakout, and furthermore the direction to place derivations of unknown specific provenance at the same level of POS implies that the opposite is true if provenance is known.
That said, it's not quite explicit, and I don't really care all that much; knowing that a descendant derives from a word+Etymology is probably more important than knowing which particular part of speech it derived from. In any case, cross pollination is going to shade any derivative meanings. I am in support of revising (making?) policy that Derived/Related terms should be at the same level as Part of Speech, and all of those be children of Etymology. Like V-ball, though, I'm more interested in standardization than the particulars. --Jeffqyzt 21:51, 21 September 2006 (UTC)
Did I word that wrong above? There are several possible structures:
  1. All of these headings at level three (if there is only one POS, and one etym) with translations coming last.
  2. Etymology and POS at level three, everything else nested below at level four (again, with translations coming last in each section.)
  3. Etymology, POSes and derv/syns/etc at level three, all others nested below a POS.
  4. Etymology and derv/syns/etc at level three, POS at level four, all others nested below at level five.
  5. Etymology at level three, POSes and derv/syns/etc at level four (for cross-pollinated synonyms), all else nested lower.
In each case, the amount of cross-pollination going on is what should determine which/where/what level the derv/syns/etc end up at. But in all cases, the translations are supposed to come after the English language relevant parts. If the ELE doesn't say that clearly, with brevity, then perhaps we should think of rewording it (with caution!) --Connel MacKenzie 23:12, 21 September 2006 (UTC)
But current practice has Translations at level-4 nested under POS, so are you saying that POS should come last, after Related terms and the like? --EncycloPetey 23:59, 22 September 2006 (UTC)
From what I've seen (and I've seen a lot of pages briefly), there are three standard level-3 headers in widespread use for languages with a single Etymology: (1) Etymology, (2) Pronunciation, (3) POS, in that order most often. There is also a hefty percentage of pages for which (4) Related terms (and similar headers) are placed as level-3 following the POS section, but as Connel has noted (and is currently modeled in the ELE) there is also a hefty percentage for which these headers are level-4 under POS.
Now, if we put them as level-4, then I understand Connel's position completely about having all the English-specific information precede the Translations. What I can't reconcile is the logic behind listing derivatives and related terms under the POS (as opposed to Etymology). My own feeling is that these are links to less-related pages than synonyms, antonyms, or even translations, so they don't belong in the POS section at all. Putting them in the Etymology section would make more sense, but then we end up beginning each page with lists of peripherally related terms, pushing the inflection, POS, and definitions far down the page in some cases.
I think we ought to have them as level-3 after the POS, treating them almost like external links. It might even be worth creating a 4th grouping header at level-3 to include as subheaders things like Related terms, Derived terms, and Derivatives. --EncycloPetey 23:56, 22 September 2006 (UTC)

Statistics oddities

The Wiktionary Statistics WT:STATS were just updated, and I note three points of interest:

  • There are 243 Slovenian words, but 112 Slovene words. I can't remember which we decided was the correct language header, but it ought to be uniform. My two cents is that all the dictionaries and grammars I have for the language at home (about 7) all use the term "Slovene" rather than Slovenian to name the language.
  • There are 22 pages whose language is Etymology.
  • There are 42 pages with a language header of References.

If someone knows how to search for the miscreant pages and repair them, it would probably be a GOOD THING(tm). --EncycloPetey 23:46, 22 September 2006 (UTC)

Those should come up in Connel's analysis, and then in his todo pages. - TheDaveRoss 23:56, 22 September 2006 (UTC)
Both Slovene and Slovenian are correct, but some time ago I noted that most Slovene contributors used the word Slovene. I believe most of the Slovenian entries were added by User:Drago, a Hungarian. Generally it doesn’t make much difference except when someone links a word to the particular language, as in [[дом#Slovene|дом]]. If the language header on the дом page reads ==Slovenian==, the link does not work right. For this reason, whenever I encounter Slovenian, I change it to Slovene. —Stephen 02:29, 23 September 2006 (UTC)
It certainly seems easier (to me) to make the correction using $python -file:slovenian.txt "==Slovenian==" "==Slovene==". It takes me ~30 seconds to select them all (since I have to search the whole text of the wiki) and another two to ten minutes for the bot to run. (Not under a 'bot account though: User:Connel MacKenzieBot is not an official 'bot.) Shall I proceed? --Connel MacKenzie 02:35, 23 September 2006 (UTC)
Go for it. SemperBlotto 07:20, 23 September 2006 (UTC)
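A minimal sketch of the header replacement discussed above, for anyone who wants to try it on saved page text. It does not reproduce the bot command quoted earlier; the function name and sample text are hypothetical, and it assumes the language header sits on a line of its own as a level-2 heading.

```python
import re

# Matches only a level-2 "Slovenian" heading on its own line.
# A third "=" (as in ===Slovenian===) prevents a match, so deeper
# headings and ordinary prose mentioning "Slovenian" are left alone.
SLOVENIAN_HEADING = re.compile(
    r"^==[ \t]*Slovenian[ \t]*==[ \t]*$", re.MULTILINE
)


def rename_language_heading(wikitext: str) -> str:
    """Return wikitext with ==Slovenian== headings renamed to ==Slovene==."""
    return SLOVENIAN_HEADING.sub("==Slovene==", wikitext)


# Hypothetical sample entry, for illustration only.
sample = "==Slovenian==\n===Noun===\n'''dom'''\n# [[house]]\n"
print(rename_language_heading(sample))
```

Anchoring the pattern to the whole line is what keeps section links such as #Slovene working afterwards: only the heading itself changes, not any running text.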
I can find only 36 main-namespace entries with "References" as a level-2 header, but I've been working from the 9/13 dump (as opposed to the 9/22 dump which WT:STATS is currently based on), so that may account for the difference. Anyhow, here they are. Anybody who wants to help, please <strike> these out as you fix them. —scs 23:00, 23 September 2006 (UTC)



One more oddity: I was analyzing the breakdown of languages and noticed that Old English does not appear on the list. I then noticed further that Ancient Greek and Old Prussian weren't either. Now, I know that we have many entries for these languages, so I'm wondering if somewhere in the dump or statistics crunching we're losing track of languages whose name includes a space within the header. --EncycloPetey 00:32, 24 September 2006 (UTC)

Excellent catch. I shall fix that error shortly and regenerate those statistics. --Connel MacKenzie 01:46, 24 September 2006 (UTC)
Scs, if you look at these:
  1. User:Connel MacKenzie/todo possibly bogus language headings
  2. User:Connel MacKenzie/todo2 probably very bogus third level headings
  3. User:Connel MacKenzie/todo3 pages with no "#" lines
  4. User:Connel MacKenzie/todo4 no level two heading at all
  5. User:Connel MacKenzie/todo5 pages bereft of wikification
and make corrections, then I'd appreciate it if you strike out or remove them from those pages' sections. I have a plethora of other cleanup lists I compile semi-automatically after each XML dump. (You are welcome to suggest other things I should check for, and/or create these lists yourself, as well, of course.) On my main user page, I try to keep a fairly coherent list of items that need cleanup, albeit somewhat fragmented.
Also of note, is Patrick Stridvall's toolserver page, which is a little bit more of a dynamic approach to identifying similar entries. I don't know if he still needs to update it after each XML dump anymore, or not. I haven't seen him around in a while, so I don't know if he is keeping that up to date.
--Connel MacKenzie 01:46, 24 September 2006 (UTC)
Good catch. I don't know who generated the WT:STATS statistics, or how. My own crunch (again, based on the 9/13 dump, not 9/22) came up with these counts for the three you mention:
Old English 1937
Ancient Greek 581
Greek, Ancient 2
Old Prussian 166
And also these:
Technical Information 21301
Dictionary Information 21299
Chinese Hanzi 20615
Korean Hanja 8727
Japanese Kanji 1893
Old High German 823
Biblical Hebrew 325
Japanese kanji 323
Spanish (Castilian) 310
Old Norse 240
Scots Gaelic 210
(But take these numbers with a grain of salt; the script I used to generate them is still pretty rough. For comparison, it generated English 90814, Japanese 15538, French 5827, German 5541, Italian 5357, and Spanish 5087.)
See also Stridvall's header tool (though that page is currently still based on the 7/4 dump).
scs 01:40, 24 September 2006 (UTC)
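As an illustration of the kind of tally discussed above, here is a rough sketch that counts level-2 language headers and keeps names containing spaces intact. It is not the script used for WT:STATS or for the counts quoted here; the function name and sample pages are made up, and it assumes each page's wikitext is already available as a string.

```python
import re
from collections import Counter

# A level-2 heading: exactly two "=" on each side. The captured name may
# contain spaces and parentheses ("Old English", "Spanish (Castilian)"),
# but no "=" characters, so level-3 and deeper headings never match.
LEVEL2_HEADING = re.compile(
    r"^==(?!=)[ \t]*([^=\n]+?)[ \t]*==[ \t]*$", re.MULTILINE
)


def count_language_headings(pages):
    """Count how many pages carry each level-2 heading."""
    counts = Counter()
    for wikitext in pages:
        # Use a set so a heading repeated on one page is counted once.
        counts.update(set(LEVEL2_HEADING.findall(wikitext)))
    return counts


# Hypothetical page texts, for illustration only.
pages = [
    "==Old English==\n===Noun===\n",
    "==English==\n===Verb===\n\n==Old English==\n===Verb===\n",
    "== Ancient Greek ==\n===Noun===\n",
]
print(count_language_headings(pages).most_common())
```

Capturing the whole heading text, rather than a single word, is the point of the sketch: a pattern that stops at the first space would lose "Old English" and "Ancient Greek" in the way described above.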
Vild noticed that in my XML dump analysis, I was generating most of these numbers already, so he asked me to consolidate them onto WT:STATS. If you want to take that (or any part of the process) over using your tools, please do! Just let me know, so I don't duplicate the effort. --Connel MacKenzie 02:01, 24 September 2006 (UTC)
Technical information, Dictionary information, Chinese hanzi, Japanese kanji, Korean hanja are all the NanshuBot entries for single Han characters. I've been working on what to do with them; but there are some issues holding it up. Robert Ullmann 17:00, 24 September 2006 (UTC)
I've been excluding all Nanshubot entries from these "/todo" pages for a very long time. For the first year that I did them, we really didn't have anyone active that dealt with CJKV stuff, so I had removed the clutter to make the lists more usable. I'm very hesitant to turn it on again, as that means my various "/todo" lists will automatically be filled with an extra ~17,000 "bad" entries. OTOH, we have at least eight or nine regular contributors now, so maybe it is time. Yes/No? --Connel MacKenzie 20:48, 24 September 2006 (UTC)
If you do include them, I'd put them in separate lists; they will have a large number of instances of a limited set of problems, and as you say, there is a different set of contributors interested. Like maybe run your s/w twice, with that predicate reversed? Another thing I thought of is checking for "conjugation" used under noun or adjective, or (less frequently) declension occurring under verb. Robert Ullmann 14:12, 25 September 2006 (UTC)
That is a very good idea; thank you. I'll try to make a thing that generates /todo6 by the next XML dump. --Connel MacKenzie 16:51, 27 September 2006 (UTC)
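The check suggested above (Conjugation under a noun or adjective, Declension under a verb) might look something like the following sketch. It is not the tool that builds the /todo pages; the heading sets and the function name are assumptions chosen for illustration, and a real run would need a fuller list of part-of-speech headings.

```python
import re

# Any heading from level 2 to level 6; the backreference \1 requires the
# same number of "=" on both sides.
HEADING = re.compile(r"^(={2,6})[ \t]*([^=\n]+?)[ \t]*\1[ \t]*$", re.MULTILINE)

# Assumed (incomplete) set of part-of-speech headings.
POS_HEADINGS = {"Noun", "Verb", "Adjective", "Adverb", "Pronoun"}

# Inflection heading -> parts of speech under which it looks suspicious.
SUSPICIOUS = {
    "Conjugation": {"Noun", "Adjective"},
    "Declension": {"Verb"},
}


def find_mismatched_inflection(wikitext):
    """Return (part of speech, inflection heading) pairs that look wrong."""
    problems = []
    current_pos = None
    for match in HEADING.finditer(wikitext):
        level, title = len(match.group(1)), match.group(2)
        if level == 2:
            current_pos = None  # a new language section starts
        elif title in POS_HEADINGS:
            current_pos = title
        elif current_pos in SUSPICIOUS.get(title, ()):
            problems.append((current_pos, title))
    return problems


# Hypothetical entry, for illustration only.
entry = (
    "==Latin==\n"
    "===Noun===\n====Conjugation====\n"
    "===Verb===\n====Conjugation====\n====Declension====\n"
)
print(find_mismatched_inflection(entry))
```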
[see more on this thread below. scs 20:55, 27 September 2006 (UTC)]

Han character entries (Nanshubot)

Clearly, those 17,000 Nanshubot-ized Han characters don't want to be put on any lists other than the one driving the eventual bot that cleans them all up. (I haven't participated in that debate, but it seems to me that a decent approach would be to put them in a "Language" of Han or Han ideograph, a "part of speech" of character or ideograph, with all the rest of the info moved to level 3 or below. And "all the rest of the info" -- it's the stuff that's lifted straight from the Unicode unihan.txt file, right?) —scs 23:44, 25 September 2006 (UTC)

I'm not sure of that. They have sat around for ages with no one that really even knows what they are. Now, perhaps, there is enough critical mass of contributors to probably even explain to me what a unihan.txt file is. :-) But either way, I don't see any way for all of them to be "corrected" by bot. Is the problem simpler, than it appears? --Connel MacKenzie 16:51, 27 September 2006 (UTC)
I haven't studied them or even read much of the discussion about them, but my impression is that we've got boatloads of entries that were mechanically generated from the data in (warning: zip file), which is linked to from and documented in Sample data from that file (which I have reformatted slightly):
kCantonese kai2
kDefinition to unbind the collar
kHanYu 10221.030
kIRGHanyuDaZidian 10221.030
kIRGKangXi 0117.060
kIRG_GSource 5-3270
kIRG_KPSource KP1-3690
kIRG_KSource 3-2159
kIRG_TSource 4-422E
kKPS1 3690
kMandarin QI3
kRSUnicode 9.12
kSBGY 269.52
kTotalStrokes 14
(Those inscrutable tags like "kKPS1" and "kIRG_GSource" are all documented in the file, and at [3].)
It's useful data, to be sure, although there's some concern that uploading all of it here might have been a copyright violation, especially in regards to the "kDefinition" lines. But at any rate, you can totally see where, say, the bulk of the data in our entry came from. (All I'm imagining in terms of bot-aided "cleanup" is perhaps rearranging the headers. But I'm sure Robert Ullmann has more to say on this.) —scs 18:53, 27 September 2006 (UTC)
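To make the sample record above easier to work with, here is a small sketch that reads lines in the reformatted "tag value" layout shown in this thread (not the layout of the original file) into a dictionary. The function names are hypothetical. The radical/stroke helper assumes the plain "radical.strokes" form of kRSUnicode seen in the sample and does not handle other variants.

```python
def parse_unihan_fields(text):
    """Map each tag (e.g. 'kMandarin') to its value for one character."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The tag is the first word; everything after it is the value.
        tag, _, value = line.partition(" ")
        fields[tag] = value
    return fields


def radical_and_strokes(rs_value):
    """Split a plain kRSUnicode value like '9.12' into (radical, extra strokes)."""
    radical, _, strokes = rs_value.partition(".")
    return int(radical), int(strokes)


# A few lines copied from the sample above.
sample = """\
kCantonese kai2
kDefinition to unbind the collar
kMandarin QI3
kRSUnicode 9.12
kTotalStrokes 14
"""

fields = parse_unihan_fields(sample)
print(fields["kDefinition"])
print(radical_and_strokes(fields["kRSUnicode"]))
```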
Take a look at . This represents my (with some help) first pass at what the format should be if we keep all the NanshuBot information.
That looks great! I like the way you've put "Han Character" as a level-3 under "Translingual". —scs 20:57, 27 September 2006 (UTC)
It uses two templates, one of which puts the entry in Category:Han characters. There are 21,300 of these entries (+ or - a few). Yes, it is from the Unihan database. I imagined a bot could stuff all the info into the templates. This is the technical side of it. Then there is the copyright. See below. Robert Ullmann 20:37, 27 September 2006 (UTC)
If a decent "transformed" version can be knocked together by people who have some idea of what these should look like, I will attempt to write a custom bot to transform all of them into the new format, hopefully reducing the manual cleanup. - TheDaveRoss 18:57, 27 September 2006 (UTC)
If these are all copyright, or of questionable copyright, they should be deleted. We'd do better to start a fresh import for technical reasons as well. --Connel MacKenzie 18:59, 27 September 2006 (UTC)
See User:NanshuBot, the copyright is very permissive when it comes to use, Amgine and legal counsel gave us the go ahead a few weeks ago. It may be better to reimport them in a new format, but we don't have to based on copyright. - TheDaveRoss 19:02, 27 September 2006 (UTC) apparently I was mistaken, or the discussion I recall was wrong...something. - TheDaveRoss 21:26, 27 September 2006 (UTC)

unihan copyright

Actually, that copyright says explicitly "Unicode, Inc. specifically excludes the right to re-distribute this file directly to third parties or other organizations whether for profit or not." I.e. it was a pure copyright violation. The en.wikt is exactly such a redistribution. Each en.wikt entry is (was) just a reformatting of a record (all of the records!) in the Unihan database. In no way did Nanshu have permission to place the data under the GFDL! That's the bad news, the good news is that Unicode's Terms of Use are better now. But they still don't grant permission to place the entire DB under GFDL, which we absolutely require. Similar observations apply to the "Four Corner" and "Canjie Input" data: it is not sufficient to "use with permission" from a copyright holder: there must be no copyright holder.

I don't know who "Amgine and legal counsel" are; I have been talking to Brad Patrick, General Counsel of the WikiMedia Foundation, and he considers it—at this time—an open and potentially troublesome issue. Robert Ullmann 20:37, 27 September 2006 (UTC)

Without weighing in with an actual opinion ('cos IANAL, and on this issue I could be swayed either way), let me point out that Unicode's copyright is somewhat more lenient on derived works than it is on redistributing verbatim files. We may have all of the same information from their files in our Nanshu-botted entries, but we're clearly not redistributing Unicode's files as-is.
There's also the question of whether pure information (which is what many of these CJK dictionary and character set correspondences are) is copyrightable at all. (If the information's not copyrightable, then "releasing" it under the GFDL isn't an issue.) There's also the question of whether Unicode holds or deserves any sort of "compilation copyright" on it all. (And then there's the question of whether it's worth having all this information propagated into Wiktionary at all, or whether any reader who needs this particular information will always go to the official Unicode files anyway.) —scs 21:05, 27 September 2006 (UTC)
  • With deference to whatever Brad Patrick may eventually say on the topic, I think we should delete them all now. There is absolutely no reason for us to have questionable material at all. We would have sucked in all content from the OED by now, if we were trying to be something polluted. But we're not. The fact that the question can even be raised in seriousness, in my mind, is reason enough to delete these 17k+ entries. If and when, a newer version with GFDL compliant licensing is available, we can try re-importing them in a more usable format. --Connel MacKenzie 21:20, 27 September 2006 (UTC)
I will not comment about legality issues since I am not a lawyer. However, I would like to provide my thoughts with respect to the format and accuracy of the information contained in the 17,000 files. First of all, I find it rather silly that this information is copyrighted at all given the number of errors in the entries. As a speaker of Chinese, I find only two pieces of information to be consistently reliable about the files:
  1. the radical/stroke information
  2. the encoding information
The romanization information is at best incomplete, and sometimes wildly inaccurate (ex. two alternate romanizations, one valid and one either bogus or archaic, without any information about proper usage). The common meanings section is basically useless for anything other than a very superficial rough idea as to the basic meaning of a given character. It is rather moot anyway, since the common meanings section is often blank.
The current format of the information is also problematic. For example, simplified or traditional written forms are listed under a heading called alternate forms. The problem with this label is that you're not told which one is which. This may seem obvious, but it is not. To illustrate my point, I will use the following character(s): and . Technically, is the traditional form of , but is the standard form used in "traditional" as well as "simplified" Chinese. We need the entry to tell us such things.
In my opinion, the key to wiktionary's success is the accuracy and completeness of its entries, not the raw number of entries. I am unconvinced that a bot will create the type of entries that would truly be useful as a learning aid, research tool or professional translation reference work. In order to fill wiktionary with the type of entries that I would like to see, I am afraid that it will be a process of human language experts contributing and editing words over a period of many years. This may sound like I am being negative or unoptimistic, quite the contrary. I believe that Wiktionary will eventually be one of the most valuable language references ever created ... even if it takes us 50 years :)

A-cai 07:29, 28 September 2006 (UTC)

I'm surprised and sorry to hear that the data's of such poor quality, because I'd gotten the impression that the Unicode consortium and the researchers who contributed to the Han unification effort had put a huge amount of work into the unihan.txt file. But I just realized something which may explain the discrepancy: Nanshubot did its work a couple of years ago, and I'm pretty sure the unihan.txt file has evolved quite a bit since then. I'll take a look at the discrepancies A-cai has noted and see if they're reflective of errors in older versions of unihan.txt which have since been corrected.
If the Unihan data which Nanshubot imported is now obsolete, that'd be another reason to do a reimport (assuming we decide to keep the data at all) as opposed to futzing with what we have now. But on the other hand, this is also another argument not to try to replicate such data here at all, since if our copy has this tendency to become stale, our readers are better served by going to the up-to-date Unicode consortium files anyway. —scs 23:14, 28 September 2006 (UTC)
The Han Unification effort was a tour de force; extremely impressive (I say this as one who was tangentially involved); the Unihan database is—and was—solid. The problem is that Nanshu derived romanizations and readings for kanji not directly from the database that are very suspect. The raw data (e.g. what I stuffed into the templates in the example) is OK. Robert Ullmann 23:22, 28 September 2006 (UTC)
Hi there. IAAL - in fact, IAAIPL. Where can I see the original database from which this material was gathered (please email directly to me). My initial inclination is "GET RID OF 'EM". That may change once I see the source and read the TOU... but don't gamble on it! bd2412 T 07:49, 28 September 2006 (UTC)
The raw data file in question is . Its context is (i.e. the public link to it is at), and there is further documentation at —scs 22:50, 28 September 2006 (UTC)
P.S. A-cai, how is the Wiktionary:Chinese Pinyin index - accurate or far from? bd2412 T 07:49, 28 September 2006 (UTC)
I'm guessing that the Wiktionary:Chinese Pinyin index is based on the Nanshu bot files. For example, if you take a look at , it lists qiāng (which is wrong, I checked several of the largest on-line and off-line dictionaries), which also shows up in Wiktionary:Chinese Pinyin index. On the other hand, it does not list under á, ǎ, à, ā or ē (all valid romanizations for ), which also matches the individual entry for . The entry should indicate that ā is standard Mandarin, and then explain the situations in which the others are used (see: ). The romanization for this character demonstrates the type of inaccuracies that I regularly observe in Wiktionary individual Chinese character entries.

A-cai 08:26, 28 September 2006 (UTC)

We certainly don't want to delete the entries. First of all, the existence of the entries isn't a copyright problem, all they are is one entry for each consecutive code in the IS 10646 "Unified Han" code range; some 20K+ code points. (FYI: what we call "Unicode" is ISO standard 10646, "UTF-8" is an annex to that standard; Unicode was one of the inputs to the IS 10646 process.) The problem is the content loaded by NanshuBot. It is possibly/probably a copyvio, and as observed, what isn't directly from the Unihan DB is suspect or simply wrong. An indication of the quality of the derived information is the use of the name "Morobashi", a misreading of the kanji for Tetsuro Morohashi.
However, there has been lots of good information added by many people since the entries were created; we don't want to discard that. A possibility is to strip all the identifiable Nanshu information from the entries, adding a Translingual/Han character section at the top, and then continuing from there. Robert Ullmann 12:03, 28 September 2006 (UTC)
As to your last point, clearly all of those such entries are "derived works" and therefore also copyvios, no matter who made the subsequent edits. For that reason, they pretty certainly should be removed as well, IF the allegations turn out to be true (and/or no further redeeming information is provided.) But, IANALE. --Connel MacKenzie 13:40, 28 September 2006 (UTC)
Certainly the radical/stroke information could not be copyrightable! Also is anyone claiming that (not the collection but) the few isolated pages that have been edited thus far were under copyright? I say remove the bulk of the data, leaving only the altered pages and otherwise the essential components like the radical and the number of strokes. What do you legals have to say about that idea? DAVilla 13:32, 29 September 2006 (UTC)

I have requested for comment the juriwiki-l mailing list, hopefully they will have time to look into the issue and advise us in some way about what is best to do. - TheDaveRoss 16:07, 17 October 2006 (UTC)

Ok, speaking as an IP attorney, I think this is most akin to the situation in Kregos v. Associated Press, 510 U.S. 1112 (1994). There, a baseball reporter came up with a set of nine statistics that he thought were particularly important to determine which pitchers would win the day's games. The Supreme Court held that although Kregos could receive protection in the arrangement and presentation of the statistics, this protection would be very narrow. In essence, all that could be protected was the exact presentation. Any other paper that chose to publish similar statistics could do so as long as the alternate presentation differed from that created by Kregos in "more than a trivial degree", specifically finding it unlikely that the AP's form infringed where it included only 6 of Kregos' 9 stats and included 4 additional stats that Kregos did not.
In our case, we are attempting to provide as much information as possible about every character available. The actual information we seek to present is in the public domain. It is only, therefore, the particular selection and arrangement to which another party could lay claim. Hence, if we strip all non-essential information originating with the other source, include additional information (which we are going to do anyway), and change the arrangement to suit our purposes, we should be in a position to prevail over any challenge to our use of this information.
Cheers! bd2412 T 16:35, 17 October 2006 (UTC)
Having not heard anything back from the Wikimedia General Counsel, having asked repeatedly ... sigh. Given what you say, I think we should do this:
  • Format the info at the top of the entry, which Unicode can't really claim any copyright on (compilation, derivation or whatever) into Template:Han char under a Translingual header so that the format meets our standard.
  • Delete the "Dictionary information" section; it is page and line references to the unabridged versions of large dictionaries most people don't have anyway. (One of them is ~10,000 pages.) This information was developed by Unicode, they might have some claim, but so what; we strip it.
  • Delete the "Technical information" section; this is the Unicode/IS 10646 code point in hex and decimal, ditto the Big5 code point. Unicode has no claim on this, but there is no reason to have it. We don't give the JIS codes (or ASCII, or whatever).
  • We already have additional information on a large number of entries.
Comments? Objections to my doing this? (It isn't a bot run for all of them, there are so many variations people have introduced; it will take runs with a bot or AWB skipping entries with variants, then going back to collect sets of them.) Is anyone going to see this here, or does it need to be moved to the bottom? Robert Ullmann 13:43, 18 October 2006 (UTC)
I think we should await the reply from juriwiki-l before we do anything; if the verdict is delete, there is no need to format them before we nuke them. - TheDaveRoss 03:22, 19 October 2006 (UTC)

Well, phase one of the voting has ended. Apparently, the person driving the vote effort now, is not a Wiktionarian, nor ever was. I'm not sure how appropriate most of the remaining "semi-finalist" logos would be for Wiktionary. My objections to the vote timing, vote conduct, etc. have fallen on deaf ears. I suppose it is water under the bridge, for now, at least until someone tries to tell en.wiktionary that it has to use one of the new (inappropriate) logos.

I would like others from this community to review the meta: "semi-finalists" and begin discussing here, whether or not to even consider using one of them on en.wiktionary. It seems to me like a lot of people's time and effort has come to naught. Should we wait a month or two, then start our own logo contest (since trying to satisfy so many other factions has left us with such poor results)?

--Connel MacKenzie 01:31, 24 September 2006 (UTC)

I'm not sure what the problem is. I participated (lightly) in the discussion and vote over there on meta, and I saw several other names from here I recognized, so it's not like there was no representation.
Me, I rather liked several of the candidates, and the one I liked best made it into the next round, so I'm happy. :-) —scs 01:44, 24 September 2006 (UTC)
There are several there which I would actively oppose if someone tried to make them the en.wikt logo, anything in a speech bubble for instance...what does that have to do with a dictionary? The calligraphy one isn't too bad, but I don't like many of them all that much. - TheDaveRoss 01:49, 24 September 2006 (UTC)
I agree the calligraphy one seems good. But that is very different from saying that it is better than our current logo. --Connel MacKenzie 01:54, 24 September 2006 (UTC)
The problem, as I see it, is that has about 300 good, regular contributors. Over 50 people contribute more than a hundred edits each month! To have only three people from here, comment there, is a problem. --Connel MacKenzie 01:52, 24 September 2006 (UTC)
Does WMF require that we use the same logo as all of the other wikts? - TheDaveRoss 01:56, 24 September 2006 (UTC)
I have no idea. I don't want to think of how ugly it could get, if push came to shove. --Connel MacKenzie 20:50, 24 September 2006 (UTC)
If people didn't vote, that means they don't care about the logo. If you don't like the proposals, there was an option "keep current logo" (which didn't make it to the second round). So, I don't really see where the problem is with the current vote, and when I look at [4] I feel like changing the current logo... Maybe we could make some javascript so that anyone can choose to display the logo xe likes? Kipmaster 08:49, 25 September 2006 (UTC)
Excuse me, but that is a wildly different thing from meta assuming they can force-feed a logo choice. The same person that ignored my complaints about the irregularities in voting procedure is the one who unilaterally decided the cutoff for "round 2" after the fact. The fact also remains that whatever en.wikt: adopts becomes the de facto default that the other language Wiktionaries will then adapt to their needs.
The fact that no one was able to put in a javascript to rotate the proposed logos in (as suggested a while back) implies that our current javascript "resources" are better spent elsewhere, anyhow. --Connel MacKenzie 16:29, 27 September 2006 (UTC)
  • The apparent voting irregularities in "Round 3" now in progress are distressing. --Connel MacKenzie 08:01, 21 October 2006 (UTC)

Special:Policy pages to which TheDave is the sole contributor

There have been a few new things popping up around here in an effort by some of the population to firm up policy and procedure, as well as organize certain things we do. However, I (and possibly others) have created some pages which haven't been scrutinized nearly enough for my liking, and so I thought I would point out the pages I have made which are intended to establish policy for new features, but have only been edited by me. Please visit them, comment on them, change them, whatever; if someone else shows up and wonders what our checkuser policy or voting policy is, they might only have my hastily drawn up first draft to go on...and that is scary!


These are the ones that I have made, if anyone else has some "policy" pages which they feel haven't gotten the attention they need to get underway, go ahead and link them too. - TheDaveRoss 21:28, 24 September 2006 (UTC)

There's a typo on Wiktionary:Why create an account where it says At the, for example which seems to be missing a link to the German wiktionary (which I would have filled in if I knew what to put here). RJFJR 13:31, 25 September 2006 (UTC)
Thanks for pointing that out, I had linked the German wikt [[de:Main_page|German Wiktionary]] and it should have been [[:de:Main_page|German Wiktionary]]. - TheDaveRoss 16:48, 25 September 2006 (UTC)

Highlighted new entries

When I view the new entries, some of them are highlighted in yellow. Can someone tell me what this means? — Paul G 16:13, 27 September 2006 (UTC)

If you are a sysop, "not-patrolled" edits appear with a different colored background (usually yellow.) --Connel MacKenzie 16:19, 27 September 2006 (UTC)
I thought it might be something like that. Thanks. — Paul G 11:06, 29 September 2006 (UTC)

Pinyin number system transliteration entries

I've completed the entries for about 1,400 Pinyin number system transliterations, i.e. an alternative spelling of the usual Pinyin transliteration, but substituting a number (1 through 4, or rarely 0 or 5) for the diacritic. Right now, virtually all 1,400+ entries have only an unsubst'ed template. Any changes that need to be made universally should be carried out before those templates are subst'ed.
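As a rough illustration of the substitution described above, here is a minimal Python sketch that turns a number-system syllable back into its diacritic form. It is not the tooling used for these entries. The function name is hypothetical, and the placement rule (an a or e takes the mark, otherwise the o of ou, otherwise the last vowel) is the usual Pinyin convention, assumed here rather than taken from this discussion. Tones 0 and 5 are treated as the unmarked neutral tone.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number 1-4.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def numbered_to_diacritic(syllable: str) -> str:
    """Convert e.g. 'xue3' to 'xuě' (hypothetical helper, single syllable only)."""
    match = re.fullmatch(r"([a-zü:]+)([0-5])", syllable.lower())
    if match is None:
        raise ValueError(f"not a numbered Pinyin syllable: {syllable!r}")
    letters, tone = match.group(1), int(match.group(2))
    # Accept the common ASCII stand-ins for ü.
    letters = letters.replace("u:", "ü").replace("v", "ü")
    if tone in (0, 5):
        return letters  # neutral tone carries no diacritic
    if "a" in letters:
        index = letters.index("a")
    elif "e" in letters:
        index = letters.index("e")
    elif "ou" in letters:
        index = letters.index("o")
    else:
        vowels = [i for i, ch in enumerate(letters) if ch in TONE_MARKS]
        if not vowels:
            raise ValueError(f"no vowel to mark in {syllable!r}")
        index = vowels[-1]
    marked = TONE_MARKS[letters[index]][tone - 1]
    return letters[:index] + marked + letters[index + 1:]
```

For instance, `numbered_to_diacritic("xue3")` gives the diacritic spelling used in the entry discussed below.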

For example, the current content of the entry for xue3 is:


The current layout of the page appears as:



  1. Alternative spelling of xuě, a transliteration for several Chinese characters.
Category:Mandarin pinyin

The xuě in the definition is piped to xuě#Mandarin, which means nothing in this case, but is important for terms which appear in multiple languages, such as . Also, the category entry is piped so the number transliteration will generally show up in the Mandarin pinyin category right after the transliteration using the diacritic. As of now, if the entries are subst'ed, each will have the following setup:

# {{cmn-alt-pinyin|xuě}}
[[Category:Mandarin pinyin|xue3*]]
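For anyone wanting to regenerate these two lines wholesale, a minimal Python sketch follows. The function name is hypothetical; only the template name and the category sort key format are taken from the lines above, and the section headings are deliberately left out because they are still under discussion.

```python
def numbered_entry_wikitext(diacritic_form: str, numbered_form: str) -> str:
    """Build the definition and category lines for one number-system entry."""
    # Plain concatenation avoids having to escape the template braces.
    definition = "# {{cmn-alt-pinyin|" + diacritic_form + "}}"
    # The trailing asterisk in the sort key mirrors the example above.
    category = "[[Category:Mandarin pinyin|" + numbered_form + "*]]"
    return definition + "\n" + category


print(numbered_entry_wikitext("xuě", "xue3"))
```

Keeping the generation in one place like this means a change to the wording or the category only has to be made once, which is the concern raised about subst'ing too early.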

Does anyone have ideas for improvements that should be made before subst'ing complicates the process of changing the entries wholesale?

  1. Should these transliterations be in a separate category or subcategory?
  2. Should the level 3 header say something other than "Pinyin" - I was thinking of "Transliteration" actually, or something like that.
  3. Should the reference to the transliteration with the diacritic be under a separate heading for "Alternative spellings"?
  4. Should the entry be worded differently?

Cheers, bd2412 T 23:28, 27 September 2006 (UTC)

Note that if you look at the entries right now, you will see a possible version of the above (yi2) that provides for putting the characters with glosses in both the versions with tones and with diacritics. (It also uses "syllable" at this exact moment, but that is one word in a template ;-) Robert Ullmann 17:01, 29 September 2006 (UTC)

Comments on BD2412's section above

(section header changed to L3 to make it part of prev section Robert Ullmann)

Due to the included template, I can't seem to edit the above section. :-(

I do not "like" the ===Pinyin=== heading; I'd much prefer the English language part-of-speech equivalent be identified. Likewise, the "see also" or "alternative spelling" shouldn't just say the foreign language term, but its translation/gloss in English. Couldn't this be done? I understand that these can be helpful to someone who knows what they mean already. But I had hoped that they would become much more complete, somehow. Does that next step have to wait for the templates to be transcluded/subst:'ed?

Also, please refresh my memory: when did ==Chinese== (the common English term) get replaced with ==Mandarin==?

--Connel MacKenzie 07:08, 28 September 2006 (UTC)

We discussed this a few months back; Chinese is a language group, and the word "Chinese" is used loosely to mean Mandarin sometimes, and the group at other times. The languages in the group are spoken and written differently. (There is a myth that written Chinese is universal, and only the spoken languages vary; this is entirely incorrect. A native speaker of Min Nan may be able to decode written Mandarin, in the way an educated native speaker of English can decode, say, French. But that's it. Frequently speakers of the other languages will learn written Mandarin as an acquired language, with no idea how it is pronounced.) The languages are Cantonese, Hakka, Gan, Jinyu, Mandarin, Min Bei, Min Dong, Min Nan, Min Zhong, Wu, and Xiang. (And numerous dialects.) In this particular case it is important because this is Mandarin Pinyin, not (e.g.) Hakka Pinyin. Robert Ullmann 12:20, 28 September 2006 (UTC)
The template is not included above - there's an actual section header (or two) in there for realism. With respect to the part-of-speech, that doesn't work well here because each transliteration may stand for as many as dozens of Chinese characters (have a look at ), with different meanings and in all different parts of speech. Many of those individual characters have multiple meanings and can be used as different parts of speech - the language is very loose about that; some characters have no meaning at all apart from their inclusion in other words. It's almost as if we had an entry for a symbol representing a sound like "fō", which can be combined with other symbols to make fōtō or fōrest or camfōr, and so forth.
Also, these can't rightly be under the second level heading of ==Chinese== because they are specific to one dialect. Cantonese, for example, uses altogether different transliterations. Making them more complete would probably entail adding the actual characters, as in xuě. We're still not settled on how to do that in the diacritic entries, which should probably be completed before adding that level of detail to the number-system entries. Cheers! bd2412 T 07:28, 28 September 2006 (UTC)
I think (zh:yì) very clearly depicts my primary complaint. In English we have entries like un- which have many derived terms (Pages starting with "un"..) But we do explain each of them. Clearly, I'm missing a big part of the picture here. We also have entries like run where each individual meaning is spelled out. I assume that zh:'s entry for zh:run will eventually be equally detailed. But it probably won't be spelled out in English on the zh: Wiktionary. Am I out to lunch here? --Connel MacKenzie 08:11, 28 September 2006 (UTC)
Well, what I'm saying is to hash those out on the entries for the actual diacritics (e.g. as opposed to yi4, which says the same thing with numbers). Eventually we'll get them all sorted out like that, but I'd rather have the number-system entries the way they are until we have all the diacritic entries done in a manner that comports with what I think you have in mind. bd2412 T 08:16, 28 September 2006 (UTC)
Just to clarify a tiny bit: what I have in mind is that looking at an entry, no matter what language it describes, should convey something meaningful to an English reader (which is the premise of having separate language Wiktionaries, after all.) Having a tone transliteration pointing only to a foreign symbol with no description in English is something we should be moving away from, not towards. --Connel MacKenzie 15:34, 28 September 2006 (UTC)
Right now yi4 is essentially just an alternate spelling, so these arguments apply more to and/or apply with equal weight to all alternate spellings. As for foreign symbols, there is no way to distinguish homophones based on their sound, and there are no written symbols other than foreign ones that do this. DAVilla 13:04, 29 September 2006 (UTC)
Your point is well taken - I'd just prefer to have one set of pages on which those entries are created, worked over, and perfected before they are copied over wholesale to a second set of pages which are, effectively, alternate spellings. Perhaps some kind of transclusion could be worked, but I'm leery of that. bd2412 T 18:43, 28 September 2006 (UTC)
As to the L3 Pinyin heading, I don't like it much either. But we need some standard non-POS L3 heading for these, like Letter or Symbol. Maybe it should be Pinyin syllable? Ideas? Robert Ullmann 12:20, 28 September 2006 (UTC)
I was thinking maybe Transliteration or Phoneme. bd2412 T 14:11, 28 September 2006 (UTC)
Well, not phoneme, is two phonemes, one syllable. Robert Ullmann 14:42, 28 September 2006 (UTC)
That's a good idea. But are they syllables? (Perhaps therein lies most of my confusion?) Perhaps "Pinyin abbreviations" or "Pinyin notation"? Or "Language notation"? "Tone marking for foreign characters"? There really is no easy way to say this; simply calling it "Pinyin" may be accurate, but conveys no information at all to our typical "readers." (Nor me, really.) --Connel MacKenzie 15:34, 28 September 2006 (UTC)
They are almost always syllables in Mandarin Chinese, with at least ㄦ (or something related to it?) as an exception, and as I understand it with many more exceptions in Japanese. So ===Syllable=== isn't really appropriate either. ===Transliteration=== or the name of the system like ===Pinyin=== (or something containing it) are the best options so far. In the former case we would have to find a place to name the system, since there are multiple transliterations for many languages. DAVilla 12:43, 29 September 2006 (UTC)
Okay, the test cases as far as format goes are as follows: in a single language like Mandarin, a word that is a natural collision, the same transliteration of Chinese under two slightly different systems, and a word that is a coincidental collision, the transliteration of completely different Chinese words under significantly different systems. The ===Transliteration=== header would not work well for the second case because it would have to be listed twice, separating the entirely different meanings under different systems. But with ===Pinyin=== and a similar flavor as headers, there will be a lot of duplication. DAVilla 18:26, 7 October 2006 (UTC)
We don't list "==English (American)==" nor "==American English==" nor "==American==" nor "==British English==" nor "==India (country) English==" nor "==Brooklyn English==" nor "==Texas English==" nor any of the dozen other "very, very big" dialects. In English, the various dialects of Chinese are only understood as "Chinese" (the target audience again: English readers) which is why the heading we've used has been "==Chinese==". I remember seeing a scheme where the dialect was identified below the language heading - at least that way our readers would have some clue as to what they were dealing with. "Mandarin" seems like too unfamiliar a term, to our target audience. --Connel MacKenzie 15:34, 28 September 2006 (UTC)
But (e.g.) Min Nan and Mandarin are not by any stretch "dialects" of a "language" called Chinese. They are mutually incomprehensible. If they are dialects, then (as I mentioned above) English and French and Italian are "dialects" of (what? "European"?). Texas English and Brooklyn English share 98% of the vocabulary, and 95% of the pronunciation. Min Nan and Mandarin share 15-20% of the (common, written) vocabulary and none of the pronunciation. They are not dialects. We owe our readers the understanding that "Chinese" is not a language. (No matter how much they have been confused before ;-) A-cai? Robert Ullmann 16:13, 28 September 2006 (UTC)
As I'm beginning to understand it, the right way to think about this is that the relationship from Mandarin, Cantonese, and Min Nan to "Chinese" is the same as the relationship from English, German, Danish, and Swedish to "Germanic languages", and from French, Spanish, and Italian to "Romance" or "Italic" languages. (Or perhaps the same as the relationship from English, German, Danish, Swedish, French, Spanish, and Italian to "Proto-Indo-European", if you believe in such things.) —scs 03:18, 29 September 2006 (UTC)
I understand the concern, but this was already debated and decided upon, or so I thought (Wiktionary:Beer_parlour_archive/July_06#Min_Nan). Connel, I believe you originally suggested giving the idea several weeks before making changes (that was two months ago). Robert is essentially correct. As Davilla pointed out in the July discussions, the words dialect and language are imprecise terms. The ultimate question is whether two forms of communication are mutually intelligible. In July, I provided an example of how Min Nan and Mandarin are unintelligible (for more info, see w:Chinese language#Classification of variations within the Chinese language). I agree with Connel that perhaps the word Mandarin is not as well understood in English as the word Chinese. However, it is not that obscure ("beginning Chinese" google hits vs. "beginning Mandarin" google hits). Besides, there are many languages that are not well known or understood to a monolingual English speaking readership. These might include: Chuvash, Dacian, Kadiwéu (all of these and more are already language headers on Wiktionary. See: Category:All_languages).

A-cai 18:02, 28 September 2006 (UTC)

That is a very unfair comparison though. Can you name even one person who speaks English, that hasn't heard of China? --Connel MacKenzie 23:13, 28 September 2006 (UTC)
But a fair comparison is that many Americans believe that the people of Mexico speak "Mexican", which is not a language. --EncycloPetey 23:22, 28 September 2006 (UTC)
Connel, part of the confusion is that when we say "Chinese" in colloquial English, the majority of the time, we are actually referring to Standard Mandarin. The United Nations lists six official languages. One of those six languages is Standard Mandarin of the PRC (official written correspondence is in Simplified Chinese). However, if you visit the United Nations website, they frequently use the term Chinese. Cantonese is a dialect of Chinese, but if a person could only speak Cantonese and English, but not Mandarin, that person would stand no chance of gaining employment with the U.N. as a Chinese interpreter. The problem is that the association of Chinese to Standard Mandarin is not an absolute. That same Cantonese speaker may very well consider himself to be a Chinese speaker (in other words, he may regard Cantonese and Chinese as being synonymous), despite the fact that Cantonese and Mandarin are not mutually intelligible. Further complicating this is the fact that a huge number of Mandarin speakers are also fluent in at least one other Chinese dialect (see: w:diglossia, and w:code switching). In spoken and written English, we can often determine the intended form of Chinese by looking at the context. An individual entry on English wiktionary often does not provide the necessary context (especially since wiktionary is attempting to document all languages), hence the need for greater precision in the level two header.

A-cai 01:33, 29 September 2006 (UTC)

Question: would it be appropriate to use a compromise notation in the level-2 language headers, something like Chinese/Mandarin, Chinese/Cantonese, Chinese/Min Nan, etc.? The intent would be to simultaneously:
  1. convey that these are not mere dialects (precisely because there would not be any dialectical level-2 headers like "English/American" anywhere else on the wiki to falsely compare them to), but also
  2. reassure and offer some instruction to the dumb Americans like Connel and me, who have had a lifetime to learn the false fact that "Chinese" is one language, and are having trouble letting go of the "fact".
Now, I do realize that issues like this can be very, very sensitive. I realize that the speakers and partisans of some of these languages might not want to be associated with the word "Chinese" in any way, or might be afraid that the intent stated in #1 above would not be understood by readers, that readers would get the unwanted and wrong impression that the languages are mere dialects. Anyway, please, don't shoot me for suggesting this; it's just an idea. —scs 03:07, 29 September 2006 (UTC)
Alternatively, what we really need to do is figure out some good, standard way of dealing with "language groups" at all, because as Robert Ullman pointed out above, that's the right way to think about what "Chinese" is. —scs 03:10, 29 September 2006 (UTC)
The point is that our list of level-two headings aren't all languages or all dialects, rather languages or dialects, with mutual intelligibility the best yardstick we have for managing the master list of these languages/dialects. Clearly (to anyone who speaks it) some like "Chinese" must be split, whereas I question the wisdom of, on the flip side, maintaining language splits upheld by politics only. DAVilla 12:37, 29 September 2006 (UTC)
Classification of Chinese as a language or language family is problematic because whether you classify it as one or the other has more to do with your political beliefs than with linguistic precision, as has been pointed out before. The question for us is: do we at Wiktionary need to take a point of view on the issue, or can we come up with a descriptive label that both satisfies our need for precise classification of words, while still remaining neutral? I believe the term Mandarin is sufficient for this purpose, since it is a known commodity in English (albeit not as well known as Chinese). However, if Mandarin is thought of as being too obscure, I have no objection to ==Chinese Mandarin==. This matches the ISO 639-3 code of cmn. Cantonese and Min Nan can still stay as is (in other words, no need to say ==Chinese Cantonese== or ==Chinese Min Nan==, which sounds awkward in English anyway).

A-cai 04:34, 29 September 2006 (UTC)

Please note that I was referring to our target audience, moreso than myself. I will admit that I was under the impression that "Min Nan" had nothing to do with Chinese. A-cai, your proposed solution sounds elegant. But it does sidestep the thorny issue raised by Robert Ullman that does need to be addressed at some point. Thank you for pointing out that it isn't America where the language name confusion originated, but China itself (Cantonese speakers calling a "different language" Chinese.) Don't forget, there are worse places than the US that speak English, but are not likely to have any clue, which Chinese language is which.
"Mandarin Chinese" sounds more natural to my ear, than "Chinese Mandarin." I would describe the off-and-on participation in these conversations (no, not just mine) as erratic. Perhaps we should put the issue to a WT:VOTE? There are several possibilities that I see. 1) The A-cai suggested headings. 2) The historic Wiktionary "Chinese" only for all languages in the Chinese language group, to keep a standard heading, 3) Language group prefix before each (e.g. Chinese Mandarin, Chinese Cantonese, Chinese Min Nan,) 4) Use the very-taboo templates with the language headings, so that people can choose to see just "Chinese" for all three or the name with "Chinese" or just the name (or even just the language code, if they really wanted.) Are we nearing a point where we should start voting and writing policies, or do we need a few more days for brainstorming? --Connel MacKenzie 08:56, 29 September 2006 (UTC)
There is a lot of argument against #2, and for those logistical reasons it really doesn't stand a chance IMO. I doubt much need or support for #3, but I'm not opposed to it or to using ==Mandarin Chinese== for clarity, although really ==Mandarin== should suffice, as per #1. Can't comment on #4. Is anyone really pushing for it? DAVilla 12:37, 29 September 2006 (UTC)
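To make the four options easier to compare, here is a small Python sketch of how a configurable heading (roughly option 4) could render each of the other styles. This is only an illustration under stated assumptions: the names `Variety`, `VARIETIES`, and `render_header` are hypothetical, the codes cmn and nan are the ones mentioned in this discussion, and yue for Cantonese is added here as the ISO 639-3 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variety:
    code: str   # ISO 639-3 code
    name: str   # name used on its own as a heading
    group: str  # language group the variety belongs to


# cmn and nan appear in the discussion; yue is an addition for illustration.
VARIETIES = {
    "cmn": Variety("cmn", "Mandarin", "Chinese"),
    "nan": Variety("nan", "Min Nan", "Chinese"),
    "yue": Variety("yue", "Cantonese", "Chinese"),
}


def render_header(code: str, style: str = "name") -> str:
    """Render a level-two heading for one variety in the chosen style."""
    variety = VARIETIES[code]
    if style == "group":        # option 2: "Chinese" for everything
        text = variety.group
    elif style == "prefixed":   # option 3: group prefix before the name
        text = f"{variety.group} {variety.name}"
    elif style == "name":       # the bare name, e.g. Mandarin
        text = variety.name
    elif style == "code":       # just the language code
        text = variety.code
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"=={text}=="


for style in ("group", "prefixed", "name", "code"):
    print(style, render_header("cmn", style))
```

The point of the sketch is only that the stored data stays the same whichever display style is chosen; whether template-driven headings are acceptable at all is a separate policy question.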

Personally I would rather cause temporary confusion to someone who knows nothing about it, than extreme irritation to anyone that has any familiarity with Chinese languages. They are (after all) the people that will be making most use of it. You might expect to see ‘Chinese’ used in some places and on some sites, but on a language website like this one it would be pretty pathetic. Widsith 08:39, 29 September 2006 (UTC)

Widsith, Chinese speakers/readers (all flavors) are using one of the zh: Wiktionaries. The people reading this one are English speakers. So, to facilitate learning, you'd want them to recognize unfamiliar terms? And if they can't then shoo them away? Where should we send people, to go learn Chinese, before they are permitted to look up Chinese entries in en.wikt:? --Connel MacKenzie 09:02, 29 September 2006 (UTC)

No, to facilitate learning I think we should use the correct terms. Lumping Mandarin and Cantonese together as ‘Chinese’ only facilitates ignorance. Also, it is not the case that you have to speak the language to have some understanding of it. Chinese speakers may use the zh:Wiktionary, but people learning or studying Chinese languages will use this one and they require decent treatment. People who don't study it, aren't learning it, and have no interest in it could hardly care either way. Widsith 10:11, 29 September 2006 (UTC)

I created the entry for 草地 so that we can have a concrete example to look at. In particular, note the example sentences. It would be very cumbersome to combine the example sentences, because the sense meanings do not coincide in this case (see false friend). Even if the sense meanings did match, the wording in the Min Nan sentence would need to be completely reworked in order for it to make sense as a Mandarin sentence (I know this is not obvious by looking at it, you'll have to trust me on this point). This is why I ultimately turned away from a generic label of Chinese. As soon as we start to introduce Chinese dialects other than Mandarin into the equation, we run into serious problems.

A-cai 10:58, 29 September 2006 (UTC)

I think the comments earlier about places like Mexico miss the pragmatic point about the number of native "Chinese" speakers -- well over 1 billion, roughly the same size as Europe, or the Americas, or India. I don't remember hearing anyone arguing that Europeans speak European or that all people in the Americas speak American. Certainly, some people believe that Indians speak Indian, but we don't pander to them here -- we say Hindi, etc, etc. Similarly, we should speak of Mandarin, etc, etc.
I think most people in England would recognise that the word Mandarin relates to China (we know the word from history lessons, and also oranges and operettas). I think that most Londoners would also know that Mandarin was an important Chinese language (and the same is probably true in most cities with a significant Chinese population). I can't speak for other English-speaking countries. --Enginear 12:11, 29 September 2006 (UTC)
Hey, he admitted we speak "English" on this side of the pond!! ;-) DAVilla 13:04, 29 September 2006 (UTC)
No he didn't. He was talking about countries like Kenya, where we speak English ;-) Robert Ullmann 13:36, 29 September 2006 (UTC)
Resuming the color flamewar does nothing to lessen the impression that all Brits are pompous idiots. I don't see these underhanded insults as at all helpful or productive. --Connel MacKenzie 08:48, 30 September 2006 (UTC)
Underhanded insults? Did I miss something? —scs 11:10, 30 September 2006 (UTC)
Yeah, you missed the <invisible><font size="zero" color="bgcolor"><!-- ;-| --></font></invisible> DAVilla 18:33, 7 October 2006 (UTC)
I'll note some of the politics here: The United Nations tends to equate "Chinese" with "Mandarin" because the PRC is a charter member, permanent member of the security council. The PRC pushes the political POV that there is one "Standard Written Chinese", which we would call Mandarin written in Simplified Characters. (I'm not knocking the PRC here; there isn't anything wrong with a political body pushing its political POV; that's what it is there for, right? ;-). Look at WT:AC, the first part could have been written by the PRC. They've been doing this for decades; that's why you learned about the Chinese language in school.
The ISO is part of the United Nations; representatives (rapporteurs) to the technical committees represent the national members of the UN. The reason we have just one 2-letter code in IS 639-1 (zh) is that the PRC insisted there be exactly one code, for SWC. This led to awful hacks like zh-tw for Mandarin in Traditional Characters. In IS 639-3 we now have a more reasonable set of codes for the major Chinese languages, and Mandarin is cmn.
One reason that I suggested Mandarin, Min Nan, Cantonese, etc. without identifying one or any as "Chinese", besides technical accuracy, is that it side-steps the political POV(s). Robert Ullmann 13:29, 29 September 2006 (UTC)
I absolutely agree. There is no single language called "Chinese", and there should be no such header. Although the majority of "Chinese" people speak Mandarin, significant populations speak Cantonese and Min Nan. bd2412 T 18:09, 29 September 2006 (UTC)
I agree with the use of headers like Chinese Mandarin and Chinese Cantonese as a solution for this particular case. I prefer putting the word Chinese first, so that the languages and their headers will be grouped together in alphabetical order. As others have noted, many English speakers are ignorant of the different varieties implied in "Chinese", so grouping them will alert users to this difference. As to the use of "mutually intelligible" / "mutually unintelligible" as a yardstick for distinguishing languages and dialects -- this doesn't work. The languages Macedonian and Russian are mutually intelligible according to a teacher of English I know (who is a native Russian, and realized she could understand the Macedonian that her students were speaking to each other). --EncycloPetey 18:38, 29 September 2006 (UTC)
I did a quick search on google. Your suggestion is not unprecedented. It appears that if we included the word Chinese along with the dialect name, we have the following choices:
  1. ==Chinese, Mandarin== ==Chinese, Cantonese== ==Chinese, Min Nan==
  2. ==Chinese (Mandarin)== ==Chinese (Cantonese)== ==Chinese (Min Nan)==
  3. ==Chinese/Mandarin== ==Chinese/Cantonese== ==Chinese/Min Nan==
  4. ==Chinese-Mandarin== ==Chinese-Cantonese== ==Chinese-Min Nan==
  5. ==Mandarin Chinese== ==Cantonese Chinese== ==Min Nan Chinese==

The ethnologue report for nan lists it as Chinese, Min Nan. I think the "cleanest" is Chinese Min Nan. This reminds me of the saying by Confucius: 工欲善其事,必先利其器. Any opinions? (about the header, not the proverb ;) A-cai 22:13, 29 September 2006 (UTC)

Not dialects. Min Nan should be Min Nan. We can have Chinese Min Nan the day we have European French and Indian Hindi. (Not to mention American English and English English.) This is why a lot of linguists avoid the word "dialect"; preferring "group" and "language" and "variation". Robert Ullmann 22:51, 29 September 2006 (UTC)
That's what I've been doing, sharpening tools. Think about this: our objective is to have the entire written and spoken vocabulary, with regional variations, for every language in the Chinese group, and 7000+ others. Sounds like a huge job, but if a tiny percentage of the speakers of Min Nan, and Wu, and whatever, worked on it, it would be done in very little time. (But then of course, not done at all, since the languages will continue to change.) Let us sharpen our tools and be precise; and if someone comes here and learns that "Chinese" is not a language, that is a good thing. Robert Ullmann 23:22, 29 September 2006 (UTC)
I can give my own sense of what can be accomplished by one person toiling away. I have not kept close track of exactly how many words I've entered into Wiktionary thus far, but if I were to venture a guess, it would be somewhere close to 1,000 individual phrases (not counting simplified/traditional duplicates) for Mandarin alone. Not bad for six months by a single contributor. I calculate that if I were to continue at this pace, within five years, we would have over 10,000 entries for Mandarin! My personal hope is that one day, we will attract the attention of language scholars capable of doing what I do and more. We don't seem to have a shortage of programming talent at Wiktionary, but true language experts have not yet arrived in droves (at least, not for Asian languages).
As to the first point, I am trying to not take a firm position on whether or not Chinese should be in the header. Since I am the one adding almost all of the Chinese entries at this point (and thus may be too close to the problem to be objective), it makes more sense for me to provide the community with relevant information, and then let others attempt to reach a consensus (if possible).

A-cai 00:35, 30 September 2006 (UTC)

I've added 25 Chinese phrases. :) ...but then A-cai had to fix most of 'em. :( bd2412 T 03:45, 30 September 2006 (UTC)
I have just found this discussion. I do not agree that the creeping change in which "Chinese" is being removed from articles is a good thing. I support the proposal that Mandarin, Min Nan, and Cantonese remain grouped together as before. Another possibility is to use the proposal "Chinese, Cantonese"; "Chinese, Min Nan," etc. The languages are closely related and should be grouped together. Badagnani 22:27, 7 October 2006 (UTC)
Badagnani, allow me to fill in the others on the backdrop of your comments. Badagnani disagreed with my edits to the words 表演 and 木耳. I will therefore use 表演 to refute Badagnani's claim that Cantonese is closely related to Mandarin. The Cantodict website lists the following example sentence in its entry for 表演:
  • English: I was totally speechless with surprise seeing his perfomance.
  • Cantonese: 表演O哂嘴![5]
The above Cantonese sentence makes absolutely no sense in Mandarin!!! Based on the English definition (and a little internet research), the closest Mandarin equivalent would be:

Granted, the hyperlinked characters are cognates, and mean the same thing in both Mandarin and Cantonese. However, the pronunciation is considerably different.
Badagnani, according to your babel template, you speak neither Cantonese nor Mandarin (which seems strange to me, given your strong opinions over issues related to these two languages ;-), so I won't ask you to provide example sentences to contradict the ones I've provided above. However, can you cite credible evidence from a reputable academic source that corroborates your claim that Mandarin and Cantonese are closely related and thus should be lumped together?
P.S. Before you do a bunch of research, please read: Political views on the Macedonian language. Now take a look at a typical Macedonian word on Wiktionary, such as календар. This should help you to put the issue into a broader context.

A-cai 00:12, 8 October 2006 (UTC)

Course in Lexicography

Feel free to move this to a more appropriate place.

The following was posted by Adam Kilgarriff on the CORPORA list.

       A Workshop in Lexicography and Lexical Computing

Venue:     Kowloon, Hong Kong
Hosts:     Language Centre, Hong Kong Univ of Science and Technology
Dates:     December 11th-15th, 2006

Led by Adam Kilgarriff and Michael Rundell of the Lexicography MasterClass,
Lexicom is an intensive one-week workshop, with seminars on theoretical
issues alternating with practical sessions at the computer. There will be
some parallel 'lexicographic' and 'computational' sessions. Topics to be
covered include:

*          corpus creation 
*          corpus analysis:
     o        software and corpus querying
     o        discovering word senses, recording contextual information
*          writing dictionary entries
*          dictionary databases and writing systems
*          using web data

Applications are invited from people with interests and experience in any of
these areas.  

Over the last six years Lexicom has attracted 200 participants from 28 countries including lexicographers, computational linguists, professors, research students, translators, terminologists, and editors, managers and technical support staff from dictionary publishers and information management companies.

The venue, HKUST, is beautifully situated on Clearwater Bay in Kowloon, only 30 minutes from central Hong Kong.

To register for Lexicom, go to: Early registration is advised (the Workshop has been oversubscribed in previous years), and registrations received before 7th October 2006 carry a discounted fee.

Further details, including reports of past events can be found at:

Michael Rundell & Adam Kilgarriff The Lexicography MasterClass --BrettR 13:28, 28 September 2006 (UTC)

Thank you BrettR. Is this going to have on-line participation as well? --Connel MacKenzie 13:44, 28 September 2006 (UTC)
Afraid not.--BrettR 01:53, 14 October 2006 (UTC)


Category:Psychology contains a manually maintained list of phobias and words ending in "-philia". Surely these should be in subcategories so that the lists are updated automatically? — Paul G 10:57, 29 September 2006 (UTC)

Indeed, there is already a "Phobias" category - is there one for "philias" (there is no such word, by the way)? — Paul G 10:58, 29 September 2006 (UTC)
Oooh, there is indeed such a word!
Bruce A. Arrigo, Catherine E Purcell, The Psychology of Lust Murder: Paraphilia, Sexual Killing, and Serial Homicide (2006) p. 15
  • He noted that "these philias have a sexual association attached to them".
Nils K Oeijord, Why Gould Was Wrong (2003) p. 68:
  • Phobias, philias, manias, perversities, and mental disorders are abnormal instincts. ... Phobias, philias, manias, perversities, and mental disorders teach us how normal instincts work (=how the mind works).
Raymond J Corsini, The Dictionary of Psychology (1999) p. 719:
  • [Defining "Philia"] The near-opposite of phobia, except that only a few phobias have a specifically sexual context whereas most philias (called paraphilias) are erotic attachments experienced almost exclusively by men, often termed fetishes.
David E. Young, Jean-Guy (EDT) Goulet, Being Changed by Cross-Cultural Encounters: The Anthropology of Extraordinary Experience (1994) p. 262:
  • Philias and phobias can also be included in this category of behavioral traits. The correspondence of the child's philias and phobias to those of the previous personality, or which could be explained on the basis of the previous personality's mode of death, can be assessed.
Rumack H. Rumack, David G. Spoerke, Barry H. Rumack, Handbook of Mushroom Poisoning: Diagnosis and Treatment (1994) p. 11:
  • Wasson traces the movements of certain groups and identifies pockets of philias and phobias.
Gaston Bachelard, Psychoanalysis of Fire (1987) p. 6:
  • Everyone must destroy even more carefully than his phobias, his “philias,” his complacent acceptance of first intuitions.
Cheers! bd2412 T 04:04, 1 October 2006 (UTC)
Thanks for the quotations. This looks like a new coinage, and would be a back-formation from words ending in -philia. Good work. The plural is clearly "philias" (and not "philiae", as the category page gives - this is a Latin plural and "philia" derives from Greek). — Paul G 09:30, 10 October 2006 (UTC)

List of protologisms - too long

Appendix:List of protologisms is up to 182 Kbytes, and it seems to be one of the most active pages. There is currently a proposal on the talk page about separating out the large number of number definitions to a separate page. (First question: what would we call it? If we can determine that we want to do it, I'll do the splitting.) Are there any other things we want to do to try to keep the size manageable? Or do we just accept it is going to be big and be grateful it isn't scattered across the main article space? RJFJR 13:59, 29 September 2006 (UTC)

  • I think that it needs weeding. Just delete all the entries with no Google hits, and move the stupid numbers to a subpage. Some entries may now be neologisms and can have proper entries. SemperBlotto 14:06, 29 September 2006 (UTC)
  • I concur with SB's assessment, but if it does seem like it would just grow beyond manageable again quickly, I would be fine with you chunking it up...or just deleting it and redirecting to --error: link target missing--...but others would probably disagree ;) - TheDaveRoss 15:42, 29 September 2006 (UTC)
  • Something like Wiktionary:Requested_articles:English/DictList would probably be a good way to go for splitting. If any particular letter's list of entries got too big, it would be subdivided on that page. As far as reviewing for later inclusion, the problem is that there is no notation of date added by the entry, so it becomes an all or none thing (or at least an "all until whoever starts it gets tired thing") :-) --Jeffqyzt 16:47, 29 September 2006 (UTC)
If you want to chunk it, just do it by letter of the alphabet. Appendix:List of protologisms/A ... and let it grow. Robert Ullmann 16:54, 29 September 2006 (UTC)

I just split off all the numbers from the list of protologisms to a subpage. The subpage is 63Kbytes long, so it represents about 30% of the protologism list. RJFJR 16:56, 29 September 2006 (UTC)

  • Thank you. That desperately needed doing. —scs 23:02, 29 September 2006 (UTC)
UNTIL it is split, the list can be filtered for entries over a year old by clicking on the appropriate date in the History (it's about 1200 revisions ago!). About 250 are older than that, out of about 1000 total. If anyone can be bothered, those entries can then be checked to see if they are citeable. Once the list is split, then the same will be possible in a year's time for new entries.
I therefore suggest that, before the list is split, the pre-Oct 05 entries are tagged with a "pre-Oct 05" category, and perhaps the later ones tagged with "pre-Jan 06", "pre-Apr 06", "pre-Jul 06", and "pre-Oct 06". This will enable anyone interested to check for cites more efficiently. I would do this categorising myself, but I don't have the knowledge to automate it.
Apart from moving to the main dictionary any words which now satisfy CFI, we could perhaps have a rule that any protologism which does not have at least one (or two) fully independent cites within two years is deleted (or perhaps moved to a list of failed protologisms, which some might consider an intriguing historical record in itself). --Enginear 18:41, 29 September 2006 (UTC)
Um, this is a much better idea than alphabetical, as I suggested above. We could just start a new list periodically. And not remove anything; if they become blue links, fine. We don't need to move them or anything. Once a year would be good, and I think sufficient. LOP/2005, LOP/2006 etc. Robert Ullmann 22:34, 29 September 2006 (UTC)
We'll have to sweep them all occasionally to avoid repeats. bd2412 T 04:05, 30 September 2006 (UTC)

Noun or Noun Form?

Last I noted there was still dispute on whether the POS heading for things like plurals should be Noun or Noun form. (Similarly for verb/verb form). Has consensus been reached? RJFJR 13:27, 30 September 2006 (UTC)

See WT:POS and the talk page. We seem to be pretty much there; the current revision of the draft policy seems to be acceptable. Short summary is that there are/were people on both sides, but all of the really strong feelings came down against "X form". In any case a plural in a language that doesn't decline nouns other than plurals should not use "Noun form". But please go see, and comment there. Robert Ullmann 13:55, 30 September 2006 (UTC)

help with crossword

i am looking for an answer to a crossword. the clue is "it ends with chalypsography in the Oxford English Dictionary". the answer has 9 letters and i believe the 2nd letter is O and the 5th letter is M. any help would be greatly appreciated. i could not find the word "chalypsography" in the online Oxford. david

VOLUMETWO (BBC-chalypsography) Robert Ullmann 14:49, 30 September 2006 (UTC)

example sentences from wiki?

In the last day or two, it occurred to me that I could be using wiki much more for example sentences than I have in the past (see: 物换星移). I started doing this with Min Nan, because, as it turns out, Min Nan wikipedia is now one of the largest repositories of written Min Nan on the internet (Min Nan is not usually written down)! Has a policy been formalized on this? As I see it, we have many choices for example sentences (in no particular order):

  1. make something up on the fly (I personally don't like this choice)
  2. cite a printed source that is not available on the internet (books).
  3. cite a printed source that is available on the internet (preferably from wikisource)
  4. cite an internet resource (on-line magazines, chat rooms etc)
  5. cite text from a non-English wikipedia
  6. cite something from a movie or tv show (I have not yet done this, but I am thinking of doing this more often for Min Nan, which is rarely written down. There are a number of Min Nan language tv shows and movies now on dvd that I could use as material for example sentences)

I know that some of this has been outlined in WT:ELE. Two questions: first, how does everyone feel about using non-English wikipedia articles as a source for example sentences? The advantage would be: no worries about copyright issues, or it disappearing from the web ... that is unless wiki disappears. Another advantage is that I can use the version from English wikipedia as a translation (if it is close enough to the original, which is not always the case). Second, should there be some kind of pecking order for example sentences? In other words, something from wikisource is the most desired, followed by printed source that is available elsewhere on the internet, followed by ... ?

P.S. The current guidelines are at Wiktionary:Quotations#How to choose a quotation. As you may surmise from above, I think it should be more detailed. A-cai 23:41, 30 September 2006 (UTC)

Actually, it is good to have both published quotations and sentences made up for purposes of the page. The published quotations provide documentary evidence of the word used. It is therefore good to have such quotations from a variety of dates for each sense of the word, and it is good to quote from literature, journals, major newspapers, or other sources likely to be widely available or at least reliably archived in major libraries. However, it is also very good to have sentence examples made up for Wiktionary on the page. It is then possible to craft a sentence to demonstrate a particular sense of the word more carefully and in simple examples. These are usually more useful for people learning the word (or the language!). --EncycloPetey 00:38, 1 October 2006 (UTC)
Indeed, for the example sentence the only criterion is that it represents the usage clearly. For the /Citations pages we need durably archived resources. - TheDaveRoss 00:40, 1 October 2006 (UTC)

I added an example sentence to a-soaⁿ to demonstrate what I think a quote from a tv series might look like. Any opinions about the format? A-cai 01:19, 1 October 2006 (UTC)

I'm not sure why you chose the Wikipedia-footnote style, there, where you did. In general, we don't use footnotes the same way Wikipedia does; I haven't found a "good" use for ref/references yet, on Wiktionary. Is your indentation-level significant? It seems fine (other than the ref weirdness) at first glance. --Connel MacKenzie 05:02, 1 October 2006 (UTC)

The indentation levels are intended to follow the format in Wiktionary:Quotations#Between_the_definitions. In this case, I have provided the original Min Nan text in three different scripts (which are all at the same indentation). The Mandarin subtitles come from the dvd, and can be thought of as a translation (into Mandarin). This is why I put it at the same indentation as the English translation (I included it to make it easier for native Mandarin speakers who may wish to find the scene on the dvd, which lacks English subtitles). The ref/references tags are the best way that I have found, so far, to make these kinds of notes when doing translations. I find it particularly useful when translating classical texts in the etymology sections (see the etymology section of 金屋藏娇). If you can think of a better way to present this kind of information, I am open to suggestions. My goal is to provide enough information for a student of the language to comprehend the original, while still maintaining readability for an English-speaking readership. Sometimes, a notes section seems to be the best approach. A-cai 06:26, 1 October 2006 (UTC)

I think it's awesome, as usual. A-cai's entries are consistently excellent. Using example sentences from Min-Nan wikipedia is definitely a good idea. However I still believe the most important citation is one from a referenced, printed source (although that is perhaps more important for English entries than for foreign language ones). Widsith 07:37, 1 October 2006 (UTC)
I don't see any problem with making up example sentences. A poor example will be altered or replaced, and a good example will pass the test of time. The best way to come up with example sentences might be to Google the term and then mix some of the clearer or contextually more easily extracted hits. I've found this to be a great way to come up with ideas, and since you're paraphrasing it's completely legit. It's not that far off from what you're proposing, either, surprisingly enough, just a little less formal. DAVilla 13:39, 1 October 2006 (UTC)