Wiktionary:Votes/bt-2006-03/Request for bot status: TranslationBot

Discussion moved from Wiktionary:Beer parlour/2006/March#Request for bot status: TranslationBot.

Bot: User: TranslationBot
Owner/operator: User: Connel MacKenzie
Purpose: Fill in translation entries of non-English terms, from translations given in translation sections of English entries.
Generation restrictions:
1. Entry must not already exist.
2. Translation must be in an "un-ambiguous" non-numbered-translations section.
3. Language must be one of the "top 40"ish languages (whatever it is that WT:ELE currently recommends.)
No interwikis will be auto-added during this phase, unless GerardM asks me to auto-add them (on the assumption that his interwiki 'bots will remove the relatively few that don't have corresponding entries elsewhere.)
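
The three generation restrictions above can be read as a simple filter. The following is only a minimal sketch, not the bot's actual code: the `Translation` record, the `ambiguous_section` flag and the `ALLOWED_LANGUAGES` set are hypothetical names, and the language list is a placeholder for whatever WT:ELE recommends.

```python
from dataclasses import dataclass

# Placeholder for the languages WT:ELE recommends; the real list is not given here.
ALLOWED_LANGUAGES = {"Swedish", "German", "Spanish", "French", "Dutch"}


@dataclass(frozen=True)
class Translation:
    language: str
    term: str
    # True when the translation sits in a numbered or otherwise ambiguous section.
    ambiguous_section: bool


def is_eligible(translation, existing_titles, allowed=ALLOWED_LANGUAGES):
    """Apply the three stated restrictions, in order."""
    if translation.term in existing_titles:   # 1. entry must not already exist
        return False
    if translation.ambiguous_section:         # 2. un-ambiguous, non-numbered section only
        return False
    if translation.language not in allowed:   # 3. language must be on the recommended list
        return False
    return True
```

Under these assumptions, a term that already has a page is rejected by the first check before the section or language is even considered.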

Please do not add interwiki links. The bot will pick it up when it sees a corresponding entry in another language. GerardM 08:50, 17 March 2006 (UTC)

VOTE:

For:
1. --Connel MacKenzie T C 05:44, 14 March 2006 (UTC)[reply]
2. -- Tawker 06:26, 14 March 2006 (UTC)[reply]
3. --Yyy 15:25, 15 March 2006 (UTC)[reply]
4. — Vildricianus 10:29, 17 March 2006 (UTC) (see comment somewhere far below)[reply]
5. -- MGSpiller 02:21, 18 March 2006 (UTC) as per Vildricianus' comments...[reply]

Against:
1. Ncik 17:44, 14 March 2006 (UTC)[reply]
2. --Patrik Stridvall 18:41, 14 March 2006 (UTC)[reply]
3. --EncycloPetey 23:32, 14 March 2006 (UTC)[reply]
4. Would have to pull together definitions from multiple red links for the word if more than one, and in most cases there should be more than one. Davilla 03:12, 15 March 2006 (UTC)[reply]
  - Davilla, I may have worded it poorly above, but that is exactly the intent; however many English words link to a term, that term when generated will have one line for each definition line that has that as a translation. But only for well-formed translation tables. Translation tables that don't start with a description line (if more than one definition is entered) are skipped. --Connel MacKenzie T C 05:45, 15 March 2006 (UTC)
    By "pull together" I don't just mean enumerate. For a foreign language word, a human could deduce a single or possibly multiple meanings from "what links here", but it takes a little mental computation that a bot hasn't got. For instance, this bot would have made niño into something like:
    baby boy...
    
    boy...
    
    child...
    A person halfway fluent in Spanish could do a better job. Davilla 19:53, 15 March 2006 (UTC)[reply]
Comments:
I'm not voting for this one as I am not convinced that machine translation can always get the nuances right. But I won't vote against it either until I've seen it in action (by which time it will be too late!). SemperBlotto 14:31, 14 March 2006 (UTC)

I'm not convinced this is a good idea either. When I add Swedish entries, the possible translations are the minor issue. The big issue is everything else, including grouping the possible translations into different senses as well as adding a short qualification, since even words that have only one possible English word as a translation might mean that in only a small subset of the English senses. See for example the Swedish word mede that translates into English as runner or rocker. Try to guess what the word really means, then click on the link and see the qualifications. You guessed wrong, didn't you? Then look at Swedish rygg. If you just added the words, the entry wouldn't make much sense, would it?

Note that limiting yourself to "Translation must be in an "un-ambiguous" non-numbered-translations section." will in many cases mean that you pick the entries that have the lowest quality. --Patrik Stridvall 18:41, 14 March 2006 (UTC)
- My goodness, what a perfect example! This 'bot would not make an entry for mede as 1) it already exists, 2) the translation ~~is only in the "to be checked" section,~~ is not entered in either entry, not the un-ambiguous sections before it. But, assuming that the entry didn't exist, and the ~~definitions~~translations were entered and placed correctly in the un-ambiguous sections, the entry that would be created for mede would have two "definition" lines with one wikified word each, a semicolon, a space, then the meaning description from the translation sub-section, on each line. Did you completely misread what I had, or was what I wrote too confusingly worded or something? --Connel MacKenzie T C 19:29, 14 March 2006 (UTC) (corrections) --20:55, 14 March 2006 (UTC)[reply]
  - I partly misunderstood the sentence "Translation must be in an "un-ambiguous" non-numbered-translations section.", yes. Still, it only makes it partly better. The Swedish senses are usually not the same as the English senses despite being in a named translation table for that English sense. They are there because the English word can be translated to that particular Swedish word and it is the best match given the alternatives. mede was perhaps not the best example. Thinking again, the English sense they attach to is almost exactly the same as the corresponding Swedish sense. This is usually not the case, not even for cognates.
  - You will just end up creating a lot of low quality Swedish entries that will not help anybody very much. It will just be painful to sort it out. When doing the translation checks for Swedish I have often been frustrated by how badly the translation tables match. I have tried to clean it up in some cases but often I simply haven't felt I had the time.
  - I hate to put it like this, since I'm not really much into personal attacks, but do you speak any language besides English? Your user page indicates that you don't, so I wonder if you really understand how different even related languages like Swedish are? And you are talking about doing it for the top 40 languages...
  - Note that you now have two native speakers of related languages (Swedish and German) against you. I wonder how speakers of unrelated languages feel about it? --Patrik Stridvall 21:06, 14 March 2006 (UTC)

Connel and Eclecticology are the main obstacles to sensibly treating non-English entries. They oppose any change to their own, anglo-centric policies. It is not allowed to add proper definitions to non-English words. I'm not surprised Connel now wants to generate entries using a bot. It's just sad. Ncik 23:02, 14 March 2006 (UTC)[reply]

There is no need to make this personal. There are times when my urge to strangle Connel is just as strong as my urge to strangle you. The reference to "anglo-centric policies" is plain false, as is the claim that "proper definitions to non-English words" are not allowed. I do not support the Translation Bot, but neither do I doubt that it was offered in good faith. Eclecticology 20:14, 15 March 2006 (UTC)[reply]

I agree that it is a bad idea for non-English words to be limited to a "translation", particularly when none of the English definitions captures the sense of the foreign word. Consider the word pavo in Latin. Yes, it refers to the peacock, but that doesn't give the reader information that it could be a food or that it is connected to the goddess Hera, both of which could be very important in the context of a Latin document. Consider the Latin word chamaeleon. Yes, the English word chameleon is cognate with the Latin, but in Latin documents the word refers to a mythical creature that subsisted on air alone, without eating food. It is NOT the lizard of sub-Saharan Africa that is meant in Latin texts, though that is the definition one would get from the English page by "translation".

- - - As a speaker of English and a dabbler in others, I say it is better to be able to look up a word and get _something_ rather than a dead end. I like to try and follow what the German speakers in #wiktionary on IRC are saying sometimes, and I use en.wikt as my primary resource for this. All too often I have to go elsewhere because we don't even have a simple 1 word entry for what I seek. I have been adding lots of 1 worders from the Spanish language because it is a start, and if you come across abad in your daily life you probably want to know a similar English word, regardless of the fact that it might have nuances in Spanish that the word abbot doesn't have in English. - TheDaveRoss 22:15, 14 March 2006 (UTC)

Just use the Search button, DaveRoss. The bot won't add anything new. Ncik 23:02, 14 March 2006 (UTC)[reply]

The search button brings us no closer to the "every word, every language", the bot does. - TheDaveRoss 23:08, 14 March 2006 (UTC)[reply]

A low quality bot-added Swedish (or German) entry will not help you very much; in fact it might fool you into believing that a human has entered it or at least checked it, and that the qualification actually means something that it really doesn't. Even somebody with a Babel level of 2 or 3 might be fooled. In fact for uncommon words you might even fool a native speaker. I believe myself to understand subtle nuances of common English words almost as well as those of common Swedish words, but I still have problems finding good qualifications for translations. Believing that the English senses can be used as qualifications is simply wrong, even for many cognates. So adding the qualifications can actually make things worse, and not adding them doesn't really offer anything useful that you can't find using "Search". Humans, regardless of Babel level, are more likely to understand their own limitations than a bot and are less likely to add something that might fool others. --Patrik Stridvall 23:25, 14 March 2006 (UTC)

A human did enter them, Patrik; the bot will take the definitions people added to the English pages and create pages from them. An example: solar system has a Swedish translation solsystem which was added by Mike. The bot would simply create solsystem and state that it meant solar system in Swedish. Mike seemed to think this was correct, and anyone who subsequently looked at the page did also. The bot won't be making anything up; it is simply doing the redundant work that users would have to do anyway, for the simple translations. - TheDaveRoss 23:36, 14 March 2006 (UTC)

Bad example: a solar system is something concrete that is clearly defined and doesn't really have any subtle nuances. Most words do. The senses in English are a bad match for senses of foreign words, and the human that entered it perhaps should have split the senses but didn't, perhaps because he didn't have time, perhaps because it sometimes is damn hard to do, and perhaps for a number of other reasons. Most translations are only good one way, and trying to make them go the other way as well is asking for trouble. We have "Search", please use it instead. --Patrik Stridvall 00:01, 15 March 2006 (UTC)

But translations between languages are not necessarily reflexive, nor are they usually one-to-one. Suppose someone has entered for the English word woobit that the Dutch translation is waabijd. That may mean that waabijd is a strict one-to-one translation of woobit, or it may mean that waabijd is simply the best translation of a difficult to translate word. It does NOT mean that woobit is the best translation for the Dutch word waabijd. In short, just because A translates into B, does not mean that B necessarily should translate to A. There may be a much better and more accurate word in English. --EncycloPetey 23:53, 14 March 2006 (UTC)[reply]

Exactly. --Patrik Stridvall 00:01, 15 March 2006 (UTC)[reply]

I would be interested in an example where the translation listed in the English entry would not produce a reasonable new page. It is limited to only the simple ones (no multiple definition issues) and only uses already contributed translations. If woobit is translated into waabijd in Dutch, how is it possible that at least one of the senses of waabijd isn't woobit? If dog translates to perro, one of the senses of perro IS dog, even if perro can also mean spaceship. Even though the bot's additions will be incomplete, which of the human entries IS complete? The basis you list for denying this bot a run does not stand up when you look at what the bot actually will do, only if you assume the bot will do things it won't. - TheDaveRoss 01:51, 15 March 2006 (UTC)

If dog translates to wobitas, there's nothing to prevent the sense of wobitas being, for example, canine, where dog is not a "sense" of wobitas at all but a more specific word English uses that Woobitwegian doesn't, the disparity in information being normally ignored in translation but which is important in definition. This happens often with things like names of colors, for example. γλαυκός may translate blue or blue-green or gray but the real, single sense appears to be broader, and to give one would be misleading. —Muke Tever 12:12, 17 March 2006 (UTC)[reply]

Precisely. Translation in one direction either produces a word with nearly the same circumscription of meanings and connotations, or else produces a term with a broader definition. Reversing the translation is therefore inappropriate, since it applies a narrower meaning to a term than is intended. Consider that most English-Spanish dictionaries translate dog, hound, and mutt as perro in Spanish. This does not mean that Spanish perro carries all those specific senses. It merely means that Spanish does not have a word that specifically means hound with the sense of a hunting-dog while still applying in general, nor a word that means mutt in the sense of implying the mixed history of the breed. Thus, to back-translate perro as "dog, hound, mutt" is terribly inappropriate. The same problem occurs many times in translation, any time that the vocabulary of two languages is not one-to-one, which it seldom is. --EncycloPetey 21:15, 17 March 2006 (UTC)

Reading the comments here, I am beginning to think you are all insane.
1. Patrik, I do speak one other language and have dabbled in several others (prior to Wiktionary; now I've had much more exposure). Not being fluent, I am not comfortable asserting that I can reasonably contribute in any other languages authoritatively, so my babel template lists only English. There is no testing requirement for Babel templates; I feel that many who claim en-3 should really have en-1 listed. You, on the other hand, seem to be very competent.
2. EncycloPetey, go look at the explanation for mede again; translations get pulled from wherever the term is defined, NOT ONE TO ONE! (You misunderstand my point. See the passage re: perro just above the line divider preceding this passage. --EncycloPetey 01:31, 18 March 2006 (UTC))
3. Patrik, a stub is much easier for you to enhance than a blank entry. If the definitions are wrong then it is a problem (but then they are a problem already), and if they are already right then you have been saved some typing; no harm, no foul.
4. solar system is a great example; simple entries are filled in correctly. That is what this 'bot is for!
5. Ncik, this is the English Wiktionary. Your treatment of Hand is in no way oriented towards a native English speaker. Of course it makes sense in German, but what you said in English borders on gibberish. If you don't like terms being explained in English for English readers, then don't contribute here.
--Connel MacKenzie T C 01:38, 15 March 2006 (UTC)[reply]

Woah there, calm down.

This bot is specified as generating a word in a foreign language from a red link already put in as a translation of an existing English word. There are some restrictions placed to ensure that it does not attempt to create translations which are obviously more complex than a simple one-to-one, but I think a couple more checks should be implemented.

It should generate an italic footer like the Webster entries making clear that it is likely that there are other senses & possible nuances which may be missed by this simple first step.
It should generate a checklist for ease of checking by humans of what it's done, linked from the cleanup pages.
As part of the preparatory work it should check whether a redlinked translation is present more than once, i.e. from more than one English word. (It would be nice if the results of this check could be output as another cleanup page, as it should give a human editor a head start on creating pages.)

Hopefully my humble suggestions may provide if not an outright solution then some thinking points for improvements. MGSpiller 02:17, 15 March 2006 (UTC)

Forgot to say, the cleanup pages (perhaps that should be translations to be checked) should be split by language, though that is probably obvious. MGSpiller 02:24, 15 March 2006 (UTC)
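
The preparatory check suggested above (finding red-linked translations that are listed under more than one English word, split by language) could look roughly like this. It is a sketch under stated assumptions: the `(language, foreign_term, english_word)` tuples and the function name `checklist_by_language` are hypothetical, and how the red links would be extracted from the wiki is not shown.

```python
from collections import defaultdict


def checklist_by_language(red_links):
    """Group red-linked terms that appear under more than one English word.

    red_links: iterable of (language, foreign_term, english_word) tuples.
    Returns {language: [(foreign_term, [english words...]), ...]}.
    """
    sources = defaultdict(set)
    for language, term, english in red_links:
        sources[(language, term)].add(english)

    report = defaultdict(list)
    for (language, term), words in sorted(sources.items()):
        if len(words) > 1:  # only terms reached from several English entries
            report[language].append((term, sorted(words)))
    return dict(report)


sample = [
    ("Spanish", "perro", "dog"),
    ("Spanish", "perro", "hound"),
    ("Spanish", "perro", "mutt"),
    ("Swedish", "solsystem", "solar system"),
]
report = checklist_by_language(sample)
```

With this sample, only perro would land on the Spanish checklist; solsystem has a single source and would not need the extra review.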

I'm baffled at how I could have been unclear. The intent of this is to glean meanings from multiple words, such as runner or rocker to build an entry for mede.
Tagging these entries with an explanatory footer is a very good idea.
It is easier to create the list as a gigantic single page (as was done for the first few hundred Webster entries, the first time around) and let the translators take a shot at them. If that works out anything like the Webster entries have, then we'll get 5 to 10 entries actually entered from that list in the next two years.

--Connel MacKenzie T C 02:38, 15 March 2006 (UTC)[reply]
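
The gleaning step described in this thread (one line per English source word: the wikified word, a semicolon, a space, then the description taken from the translation sub-section) might be sketched as follows. This is not the bot's code; `build_stubs` is a hypothetical name, the leading `# ` wiki list marker is an assumption, and the gloss strings are invented placeholders rather than the real section labels.

```python
from collections import defaultdict


def build_stubs(translations):
    """Collect every English word that lists a foreign term as a translation.

    translations: iterable of (english_word, gloss, language, foreign_term).
    Returns {(language, foreign_term): [definition lines...]}.
    """
    grouped = defaultdict(list)
    for english, gloss, language, term in translations:
        grouped[(language, term)].append((english, gloss))

    stubs = {}
    for key, senses in grouped.items():
        # One line per source entry: wikified word, semicolon, space, gloss.
        stubs[key] = ["# [[%s]]; %s" % (english, gloss) for english, gloss in senses]
    return stubs


stubs = build_stubs([
    ("runner", "placeholder gloss A", "Swedish", "mede"),
    ("rocker", "placeholder gloss B", "Swedish", "mede"),
])
```

Note that the sketch only enumerates the source lines; it makes no attempt to merge overlapping senses, which is the limitation Davilla raises with the niño example.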

I don't think it's you that is unclear but the very nature of the beast of translation. Though you did not specify on the BP what would happen if two English words both had the same foreign word listed as a translation, this should be treated carefully. I'm naturally optimistic and still newish to wikis, and there are also more editors here than there were before, so perhaps we can hope for 20 or 40 entries a year :-).

I was going to post this but I was still typing when you posted....

Another subtle point is what Kipmaster said here about translations in other Wiktionaries with respect to what the French have already done. If a French Wiktionary writer has said that the closest translation of hrunk is weeble in English, that is not the same as an English Wiktionary writer saying that the closest translation of weeble is hrunk in French. Half the time the closest translation of weeble in English is going to be hrunkle in French, a similar but different word. Perhaps the French were right to translate words that are marked up in other Wiktionaries. The ideal would be to do a lot of number crunching & cross reference, checking both this and our sister for the same language. Then we could start with the ones where both agree and then move on cautiously with the ones where they don't agree. MGSpiller 02:50, 15 March 2006 (UTC)
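
The cross-check proposed here (start with pairs on which both Wiktionaries agree, then handle the rest cautiously) amounts to a set comparison. A minimal sketch, assuming the translation pairs have already been extracted from both projects; the function name and the pair layout are hypothetical, and the sample uses the made-up words from the comment plus one ordinary pair.

```python
def split_by_agreement(en_to_fr, fr_to_en):
    """Separate pairs confirmed in both directions from one-sided pairs.

    en_to_fr: set of (english, french) pairs from the English Wiktionary.
    fr_to_en: set of (french, english) pairs from the French Wiktionary.
    """
    # Flip the French-side pairs so both sets use (english, french) order.
    confirmed_by_fr = {(english, french) for french, english in fr_to_en}
    agreed = en_to_fr & confirmed_by_fr
    disputed = en_to_fr - confirmed_by_fr
    return agreed, disputed


agreed, disputed = split_by_agreement(
    en_to_fr={("dog", "chien"), ("weeble", "hrunkle")},
    fr_to_en={("chien", "dog"), ("hrunk", "weeble")},
)
```

Here the weeble/hrunkle pair ends up in the disputed set because the French side pairs weeble with hrunk instead, which is exactly the mismatch described above.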

Well, hehe, guess what? I'll find a French match every time, if I did that. Here-to-there or there-to-here is a straw man. Either we trust our own definitions that we have or we don't. If we don't then we should just remove all translations. --Connel MacKenzie T C 05:49, 15 March 2006 (UTC)[reply]

2 comments:
I'm, for now, opposed to running such a bot on the French Wiktionary; it looks to me just like a way to increase the number of articles without increasing the quality (the foreign words in translation tables can already be found using the Search button). When we import en->fr from the en: Wiktionary, we take of course the translation, but also the gender (well, not in en...), preterit, pronunciation, ... So, the resulting article looks ok. If we create an en: or de: article from a fr translation table, the most we can have is the gender, which is not always indicated.
On a more positive note, I think we have on the French Wiktionary a good amount of fr->en translations, and I guess that other Wiktionaries would have a lot of them too. Since the templates on the French Wiktionary are very convenient, it's easy to extract those translations from there, + pronunciations, genders and so on. I'll be glad to help do that if somebody wants to spend some time on it (I don't have enough time and have too many projects to do it myself).
PS: I've never heard of hrunk in French ;-) Kipmaster 12:51, 15 March 2006 (UTC)[reply]
- It should probably be hrunque from the French spoken in the Western US; it is derived from the sound heard when a jackalope from Boisé (pronounced /boy-zee/) gets its antlers caught in the trees on the way to the Grand Teton. There may also be a Spanish translation, jrunque, to reflect the sounds heard when a Texas Horned Frog gets stuck in a gopher hole in the course of its horny pursuits. :-) Eclecticology 20:14, 15 March 2006 (UTC)[reply]

"If you are keeping your head while others around you are losing theirs, perhaps you've misunderstood the whole situation." - Unknown

Connel, sorry for being personal; it is just that, in my experience, many Americans have a rather vague notion of the world around them, especially foreign languages. I probably went too far; I apologize.

Both sides have made mistakes by choosing bad examples. My mistake was mede (English: runner or rocker), which is something very concrete and distinct. Dave's mistake was solsystem (English: solar system), which is also concrete and distinct. Sure, it would work for such words. These are unfortunately a small minority of words. Note however that Dave hasn't voted for, despite his comments. Now let's forget those mistakes and move on.

As for "Either we trust our own definitions that we have or we don't.". Well, if that is how you wish to put it, I vote that we don't. Seriously, getting the English senses, and by extension the labels on the translation tables, right is a gradual process and is far from complete even for very common English words. Note that my point of view is that of Swedish, which shares a common ancestry with English both in language and in culture. Both languages have additionally been influenced by Latin: in the case of Swedish either through German or directly, in the case of English either through French or directly. Unfortunately the additional languages I speak are German and French, so I really can't offer any outside perspective. But even from the inside of our common conceptual heritage I see large fundamental differences in how the languages work. As for the top 40 languages of the world, I have a hard time even imagining...

Now, even with the English senses correct, you really can't normally go in the other direction except for concrete and distinct concepts. Unless you are fluent in at least one foreign language I don't think you can have any real understanding of how bad English<->(foreign language) dictionaries really are, and we are talking about dictionaries made by professionals that have been gradually evolving, perhaps over hundreds of years. The only sure way to understand a word is to read an explanation in the foreign language itself.

As for "A stub is much easier for you to enhance than a blank entry": no, not really; our labels for the translation tables are in most cases too bad to be of much use. Finding possible English translations is the easy part; the hard part is all the rest. I would much rather add entries de novo.

Please don't let any animosity toward the French lead to rash actions. Now let's try to be constructive instead of just criticizing. It is a non-trivial problem. Perhaps a website that cross-references all Wiktionaries, with automatically generated suggestions for entries to cut and paste from, would be useful. Not only translations but also synonyms and such things. Perhaps at "xref.wiktionary.org". --Patrik Stridvall 10:42, 15 March 2006 (UTC)

Which are the "top 40"ish languages? (I did not find a list in WT:ELE.) --Yyy 11:24, 15 March 2006 (UTC)

I would support this, but I do not know if Latvian is in the top 40 (I suspect not). Also, it would be good to add a category (cat Latvian nouns for Latvian nouns, verbs for verbs and so on), if the corresponding category exists and if this applies to Latvian language words. --Yyy 13:10, 15 March 2006 (UTC)

Patrik, so you vote then that we should remove all translation sections from all entries?!
Patrik, I assume TheDaveRoss made an oversight, since he did vote in favor originally.
The "Top-40 languages" I keep referring to was how dewikified languages were originally referred to in WT:ELE. The talk page of that page goes on at length about which languages to de-wikify. I shall make an effort not to call them "top 40" but rather "de-wikified by consensus" languages. I won't just take entries if the language is dewikified though; rather only ones for the official list (wherever it is.)
I personally hold no animosity towards French or the French people or the country of France. French is my favorite foreign language. The French Wiktionnaire is IMNSHO the best Wiktionary. I'd like to visit Paris someday.
Patrik, are you suggesting I post the generated list (when I get to the point of being able to generate it) as a page here, to let you comb through? At that point, you can either enter terms that are "tricky" (thereby preventing the 'bot from ever entering them) or offer logic corrections, or enhance the root entries that cause problems. But to then let that list remain would be a mistake (like the Webster entries) in my opinion.

--Connel MacKenzie T C 15:03, 15 March 2006 (UTC)[reply]

Of course we shouldn't remove the translation sections. Don't be silly. The point is that many of them barely fill the role they are meant to accomplish. Sure, they will get better over time, but you intend to use them NOW in their current state for something they are not "designed" for.

I suggest that we set up a website "xref.wiktionary.org" that cross-references all Wiktionaries and that anybody can use to save time, as described above. This will help all Wiktionaries, not just us. I see it as a continuous process, not as a "Now let's see if we, by using some dirty trick, can quickly regain the lead somehow".

Note that even if I would agree to check Swedish, you still have 39 languages to go... Furthermore, have you any idea how long it takes to check and modify a list of entries? Especially since most of them will be at least partly wrong. It must be treated as a continuous process... --Patrik Stridvall 16:20, 15 March 2006 (UTC)

While the other bots in this series have some hope of success, or are at least repairable, it's clear to me from the above discussion that this one raises more questions than it answers. It is at best premature. Ultimate Wiktionary (or WiktionaryZ) has been suggested as a solution that would appear to accomplish what is suggested by Patrik's "xref.wiktionary", but we can't just sit around waiting for them to come up with something practical.

If any kind of bot solution for translations is workable it should probably work in the other direction, taking the existing foreign word entry and matching it with the translation lists -- assuming that that is something feasible. Foreign word entries are referenceable; translation lists can't be referenced easily or practically because nearly every element in the list would need a separate reference. Eclecticology 20:14, 15 March 2006 (UTC)[reply]

Oh, my "xref.wiktionary.org" suggestion is not even remotely as ambitious as WiktionaryZ. We are talking ant compared to dinosaur. It's strictly read-only; it doesn't even modify any Wiktionary by itself, it is only an aid to help adding entries. The idea is that you take the dumps from all Wiktionaries, parse them and generate some sort of database. Then you have a website that allows you to enter a language and a word. It then shows which Wiktionaries have definitions for that word and which other languages and words have that word in their translation sections. A simple cross reference. Of course it could be made more advanced and actually suggest a possible entry that you can cut and paste. Approximately what Connel's bot is supposed to do, but not actually adding anything by itself. A human will have to decide what makes sense and what does not. --Patrik Stridvall 21:01, 15 March 2006 (UTC)
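
The lookup described here (enter a language and a word, see which Wiktionaries define it and which entries list it as a translation) could be backed by a structure like the one below. This is only an illustrative sketch: the class `XrefIndex`, its method names and the sample rows are hypothetical, and parsing the dumps is left out entirely.

```python
from collections import defaultdict


class XrefIndex:
    """Read-only cross-reference filled from parsed dumps (parsing not shown)."""

    def __init__(self):
        # (language, word) -> set of Wiktionary codes that have an entry
        self._defined_in = defaultdict(set)
        # (language, word) -> set of (wiktionary, source language, source word)
        self._listed_under = defaultdict(set)

    def add_entry(self, wiki, language, word):
        self._defined_in[(language, word)].add(wiki)

    def add_translation(self, wiki, source_language, source_word, language, word):
        self._listed_under[(language, word)].add((wiki, source_language, source_word))

    def lookup(self, language, word):
        key = (language, word)
        return {
            "defined_in": sorted(self._defined_in.get(key, ())),
            "listed_under": sorted(self._listed_under.get(key, ())),
        }


# Hypothetical sample rows, as if extracted from two dumps.
index = XrefIndex()
index.add_entry("sv", "Swedish", "solsystem")
index.add_translation("en", "English", "solar system", "Swedish", "solsystem")
result = index.lookup("Swedish", "solsystem")
```

Nothing in the sketch writes back to any wiki; as in the proposal, a human would read the lookup result and decide what to enter.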

It's the lack of ambition that makes this idea look attractive. I may perhaps sound extremely negative in this, but I see WiktionaryZ as promising everything and producing nothing despite the early special funding that the project received. At the moment I only have a very broad vision of how xref might work, but I would be prepared to support this if the idea can be fleshed out a bit and WiktionaryZ is acknowledged as going nowhere. Eclecticology 18:04, 16 March 2006 (UTC)[reply]

If WiktionaryZ is a dinosaur, then a sugar ant to your fire ant is the inclusion of other language entries on a spelling page. Davilla 14:39, 17 March 2006 (UTC)[reply]

This seems to be Wiki at its best. Unstructured comments, animosity and lots of spelling errors :-). I agree, though, that the proposed bot's operation was largely unclear at first. Although I was strongly against this when I first heard about it a week ago, I've had the time to think of a good argument and haven't found one.

It may have been suggested above (or below, as the discussion is everywhere), but reading through it once is more than enough: I recommend adding all bot-added entries (for translations) to something like Category:Bot-added entries (Dutch) (or something less verbose). Then it'll work I guess, I can't see why not, certainly if only applied to the top-40. I haven't found that many well-founded arguments in the above/below masses of text. I trust that it will be run with great care and plenty of consideration, and I guess any other issue will settle itself by time. I also trust that it'll be postponed until all other bots have run (as was planned). — Vildricianus 10:29, 17 March 2006 (UTC)[reply]

No amount of care can magically conjure up information that doesn't exist. No bot that we can program with reasonable effort is likely to have an "understanding" even remotely similar to that of a human who doesn't even speak the specific language. Translation is very hard. I can read and write in English with not much more effort than I do in my native Swedish. Still, interpreting Swedish into English or vice versa is often hard. Not only because you have to realize which sense of the word is the relevant one, but also because you often can't use it as is. Sometimes you even have to change the POS in order to make a correct sentence. Sometimes the best fit is too vague, so you have to modify it with an adjective, an adverb, a preposition or something else to adapt it to the situation. The words given as translations are often just vague hints; they are certainly not reversible.

So, how will the bot determine the POS of the translation? Yes, sure, in most cases it is the same as in English, but even in a related language like Swedish a translation doesn't necessarily have the same POS. For example, Swedish has a preference for making genitive noun + noun compounds where English uses adjective + noun. So the translation for the adjective ivory is the noun elfenben in the prefix form elfenbens-. Then we have senses of English words that are normally used in the form "of + noun", where the correct translation of the noun is an adjective in Swedish.

Note that Swedish is a related language. Most of the top 40 languages are not. There is no reason to believe that it will be better in those languages, more likely worse... --10:22, 18 March 2006 (UTC)

Suggestion

As a non-linguist here, with little knowledge on the operations of bots, I'll add my (possibly naive) two cents. What I reckon is firstly that Wiktionary is a rather slow-moving project. The French Wiktionary has overtaken us in article-count, which is great for them. If we can get a bot running that does as Connel says, let's give it a try. I'd like to see this 'bot in motion. Or maybe we could start by only doing it for nouns to start with, as nouns are less ambiguous in translations. Just an idea --Dangherous 21:12, 15 March 2006 (UTC)[reply]

Thank you. I agree that I must start with nouns only. --Connel MacKenzie T C 22:35, 15 March 2006 (UTC)[reply]

Don't be disheartened, Connel; we seem to have a consensus (more or less) on everything but the translations. Take it one step at a time, do the plurals first. By the time the plurals are done I reckon we will have no more opposing votes on any of the comparatives, superlatives, etc. (I did mention my irredeemable optimism before, didn't I? :-) ) They would all appear to be fair game at this point.

I am not aware of any form of automatic translation which is free of controversy, even if based directly upon human work (as this is), so don't feel that one bot being controversial is a failure. It is an ambitious proposal made in good faith; perhaps you did not realise how ambitious it really is, but that is no bad thing. MGSpiller 00:51, 16 March 2006 (UTC)[reply]

I am disappointed by the Luddite mentality that seems to be prevailing for translations. To say that the current translations entered are incorrect is an impeachment of all translation attempts made here on Wiktionary. To rearrange the translations entered as I have suggested accomplishes several things: 1) a framework is popularized for filling in translation entries (which so far have seen rather paltry entries) 2) a lot of typing is saved for people! 3) it makes Wiktionary the translation resource for English language speakers it was intended to be, ever since its motto became "every word in every language", 4) it spurs people into correcting the generated entries (which otherwise wouldn't exist) 5) it makes direct lookup of foreign terms possible! 6) It naturally (via interwiki links generated one day later) allows for correction and enhancement from other language wiktionaries. 7) It indirectly builds other language Wiktionaries' lists of possible/probable words that they should enter. 8) It provides a starting map-point for WiktionaryZ, especially for controversial and problematic terms.

I am disappointed that the choice of the name (chosen because of a humorous character on a cartoon website) misled people into thinking this was some kind of "pump up the count" effort. I have discussed and planned this series of 'bots for quite some time now - more than months; about a year (maybe more than a year.) Furthermore, this one does nothing to increase the English language entry count!

I am disappointed that the complaints have ranged from bizarre to absurd. The only semi-legitimate complaint is that Wiktionary is currently incomplete. Well, duh. How is preventing more entries (of exactly equal quality of our current entries) helpful?

--Connel MacKenzie T C 08:15, 16 March 2006 (UTC)[reply]

The reason adding bad entries is unhelpful is that when the entries that you borrow from improve, the bad entries that have already been added do not get any better. It is exactly the same reason why copy-and-paste programming is bad: it breaks the semantic link between various parts of a program and by doing so makes the program much harder to maintain. Doing what you can to preserve the semantic link is what separates the amateurs from the professionals.

Why do you think WiktionaryZ is so complicated? Because it tries as far as possible to preserve the semantic link between words and the concepts they represent. It is an enormously complicated problem that is not likely to have a simple solution. However, calling people Luddites just because they don't have a solution is not helpful.

I refuse to take the bait and comment individually on the 8 points above. It is not about what good comes from doing it, it is about the price you pay. But most people don't want to hear that; they vote for whatever politician promises to fix their specific problem without thinking about the consequences... ---Patrik Stridvall 20:30, 16 March 2006 (UTC)[reply]

Well, I apologize for returning mild insult for mild insult. But the comparison to copy-paste programming is invalid. We are discussing data elements, not logic elements. Yes, of course it would be really cool to have both updated automatically...but that is not how Wiktionary works. Currently, all semantic links "are broken" according to you. But when you find an error in a translation section, you follow the link and fix it wherever else it is broken. That happens occasionally just for English words here (for inflected forms.) That doesn't mean that having entries is wrong! It means the few mistakes are equally wrong wherever they appear. The chances that the error will be found and corrected in both places are better, as the inaccuracies will probably stand out more.

I'm disappointed that after such a scathing (fallacious) impeachment as yours, you arrogantly dismiss numerous benefits...saying you "won't take the bait." They are all valid, tangible benefits - that is the only reason you don't respond. Again, the "price paid" is negligible (if it is any more at all) while the benefits are significant. One way or another, Wiktionary should eventually have all these entries. --Connel MacKenzie T C 04:44, 17 March 2006 (UTC)[reply]

Yes, sure, I can be and probably was arrogant. Still, the reason I don't want to debate the details is that the existence or non-existence of useful benefits is not the real problem. Debating it would just lead the debate in the wrong direction. That is what I meant by "I refuse to take the bait".

First of all, the difference between logic and data is purely artificial, even in programming. In this context the difference is even smaller. In fact it is almost meaningless to even talk about. It is better to talk about semantic links instead.

If a page doesn't exist, the semantic link is not really broken, since it can be restored by algorithmic means without the need for human intelligence and in most cases even without knowledge of the language in question. In fact this is what the "Search" button does in a limited way. If you add the page, the semantic link is broken, because when the entries you borrowed from change you can no longer propagate the changes by algorithmic means. This makes it even more important to get it right the first time. The problems are:

1. The quality of most of our entries is not that good yet.
2. Even setting aside available editor time, there are huge theoretical problems involved in getting it right.
3. Even entries that are right in some meaning will not necessarily produce good entries, because translations are by their nature only partially reversible.

To respond to these three points: 1) It will never be perfect, 2) I can't imagine a more abstract complaint. 3) That is why this is a wiki! Humans have the opportunity to correct entries; they have added incentive, especially if tagged as suggested above. Looking at the separate-but-cut-n-paste method suggested of xref.wt.org, I must point out again that the existing Webster pages (an identical concept, without the separate server) only very very rarely get imported into "proper" entries. (I can't say never because I went through the painful process of entering a couple from the list that someone uploaded before I got here.) Why do they get imported so rarely? Because it is a horrible, cumbersome process. Whereas if non-existent entries are 'bot populated, they are much easier to update. With proper labeling (as suggested above) it becomes clear even to passing visitors that the information is suspect, and should be corrected whenever possible. Also, since this is the first pass, many things will be learned in the process...and future updates will likely build on this, to incorporate updates from the translations sections (for unmodified bot-entered entries only, of course.) But that is months away, I think. --Connel MacKenzie T C 08:26, 24 March 2006 (UTC)[reply]
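The restriction to "unmodified bot-entered entries only" could be checked mechanically. The following is only a sketch under stated assumptions: the bot is assumed to record a fingerprint of each page it generates, and the function names `fingerprint` and `can_regenerate` are hypothetical, not part of any existing bot.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable fingerprint of the page text the bot generated."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def can_regenerate(current_text: str, recorded_fingerprint: str) -> bool:
    """True only if nobody has edited the page since the bot wrote it.

    Assumption: recorded_fingerprint was stored by the bot at creation time.
    Any human edit changes the text, so the fingerprints stop matching and
    the bot leaves the page alone.
    """
    return fingerprint(current_text) == recorded_fingerprint


# Usage: the bot stores the fingerprint when it first creates the stub.
original = "bot-generated stub text"
stored = fingerprint(original)
print(can_regenerate(original, stored))               # untouched page
print(can_regenerate(original + " (fixed)", stored))  # human-edited page
```

The design choice here is deliberately conservative: any difference at all, even whitespace, is treated as a human edit.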

I'm not questioning the benefits, even though many of them are more wishful thinking than anything really useful; I questioned, and am still questioning, whether you understand the price you pay for it. Sure, Wiktionary has numerous technical limitations; that is why people are working on WiktionaryZ. But that only makes it more important not to ignore the limitations. What you are basically saying is "Oh well, we already have a lot of other problems, it doesn't matter so much if we create a few new ones".

I don't understand how you can say that such direct benefits are "more wishful thinking than anything really useful" - that strikes me as extraordinarily arrogant, especially when you haven't seen a sample of a dozen or a hundred entries. That is contempt prior to investigation. Your assumption that I'm creating a few new problems is invalid. --Connel MacKenzie T C 08:26, 24 March 2006 (UTC)[reply]

So, trying to be constructive instead, I think your time is better spent implementing something like my "xref.wiktionary.org" proposal. You will need something similar anyway if you are to have any hope of generating anything even remotely useful. Then we can discuss whether there are enough acceptable potential entries to bot add them. Even if not, humans can always copy, paste, edit and submit whatever makes sense themselves. --Patrik Stridvall 10:04, 17 March 2006 (UTC)[reply]

The major problem with that line of reasoning is that the method you suggest has been tried (with the Webster entries) before, and has failed miserably. Perhaps a compromise would be to upload the generated pages to a holding area for one week before running (say, 1,000 at a time?) so that you or any other interested translators can remove entries from the list that are off the wall? Then after a week, upload the remaining entries proper, and replace the holding area with the next batch. Even better would be for you (or other interested translators) to preclude questionable entries with proper entries, to prevent the 'bot from re-attempting those translations in the future. Would you be willing to try something like that? After the first two or three batches, I'm sure you'll agree it is easier to upload them all, and make corrections to the handful of exceptions...but the proof is in the pudding. --Connel MacKenzie T C 08:26, 24 March 2006 (UTC)[reply]
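The holding-area compromise described above can be sketched roughly as follows. This is an illustration only: the function names, the `struck` set (titles reviewers removed during the week) and the `existing` set (titles that now have proper entries) are assumptions made for the example, not a description of how the bot is actually written.

```python
def batches(titles, size=1000):
    """Split the list of candidate page titles into holding-area batches."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def reviewed(batch, struck, existing):
    """Return the titles from one batch that may be uploaded after review.

    struck:   titles that reviewers removed from the holding area
    existing: titles that already have an entry (the bot must not overwrite
              them, per the rule "entry must not already exist")
    """
    return [t for t in batch if t not in struck and t not in existing]


# Usage with made-up titles: 2,500 candidates give batches of 1000, 1000, 500.
candidates = [f"term{i}" for i in range(2500)]
first = next(batches(candidates))
to_upload = reviewed(first, struck={"term3"}, existing={"term7"})
print(len(first), len(to_upload))
```

Keeping the review step as a pure filter means that a batch nobody looks at is uploaded unchanged, which is exactly the risk raised elsewhere in this discussion.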

- Why aren't we moving entries found in the page "Webster 1913" that don't already have a definition? I understand they need formatting, but the vast majority of formatting tweaks could be accomplished using the Find and Replace box in Microsoft Word (e.g., replace <i> with ''; replace <syn> with [nothing]). We could worry about the formatting and updating each entry further later. Taken to the extreme, you could even write a bot that just copies them straight from the "Webster" pages into a new entry.--217.91.66.6 07:13, 16 March 2006 (UTC)[reply]
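The find-and-replace idea above could equally be scripted. The sketch below covers only the two substitutions mentioned in the comment; treating the closing tags `</i>` and `</syn>` the same way is an assumption, the sample line is invented, and nothing here addresses the remaining markup in the Webster pages.

```python
import re

# Each pair is (pattern, replacement). Only <i> and <syn> come from the
# suggestion above; handling of the closing tags is an assumption.
REPLACEMENTS = [
    (re.compile(r"</?i>"), "''"),   # italics become wiki italic quotes
    (re.compile(r"</?syn>"), ""),   # synonym tags are simply dropped
]


def webster_to_wikitext(text: str) -> str:
    """Apply the simple tag substitutions to one chunk of Webster text."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


# Usage on an invented sample line:
print(webster_to_wikitext("<i>n.</i> A trial. <syn>Syn. -- test</syn>"))
```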

My major concern about doing so is the copyright notice in the Project Gutenberg copy of Webster 1913. The secondary concerns are the formatting challenges which you scratched the surface of. For automation, none of the XML tags can be used (that is what the copyright seems to cover most especially.) I need a copyright-free version to start from, which I currently don't have. And a lot of time parsing the entries as they exist there. Then another fair amount of time formatting them into Wiktionary style (which changes each month.) But yes, this is another project I hope to get to, overcoming the barriers I just mentioned, as well as a few other minor ones. --Connel MacKenzie T C 08:21, 16 March 2006 (UTC)[reply]

Someone who speaks of the Luddite mentality should at least read the Wikipedia article on the subject. An appropriate modern comparison with post-Napoleonic Luddites, who saw the mechanization of the weaving industry as an attempt to concentrate wealth in the hands of the factory owners at the expense of the small weavers, might be the free software movement which seeks to prevent the concentration of knowledge in the hands of companies like Microsoft. A person who understands the early role of the Luddites in the struggle for workers' rights can only wear such an epithet as a badge of honour.

The claim that there is a copyright problem with the 1913 Webster is entirely spurious. Whatever difficulties I may have with such a massive import have nothing to do with copyright.

I don't care if the French Wiktionary has more entries than the English one. A French verb, for example, has many more inflections than an English one. That alone would suggest more articles. Quality is far more important than quantity.

Given the number of bots involved, it makes sense to work out the bugs in the nouns first. Speaking for myself, the criticisms that I have about some of the others are very similar to the criticism that I have about the template for nouns. When there is agreement about the nouns, the others should largely fall into place. Eclecticology 19:09, 16 March 2006 (UTC)[reply]

They are separate 'bots only because you unilaterally denied the request of having them all as one. But how they are arranged administratively is orthogonal; the nouns will be done before the comparatives are started; the comparatives will be done before the superlatives are started, etc.

Again, this is not about French having more entries. I've become comfortable enough with the 'bot technology and Wiktionary formats that I can do these 'bots that I've thought about, ever since discovering Wiktionary. I've talked about these for a year (give or take) long before Wiktionarre's stunt. --Connel MacKenzie T C 19:15, 16 March 2006 (UTC)[reply]

By the way, please re-read the introduction of Luddite. I'm using the "epithet" not as an epithet, but rather exactly as it is described there, with the current modern meaning...not the meaning the term might have had about 200 years ago. The analogy drawn seems very misguided, as well.

As for Eclecticology's ignorance regarding the copyright status of the Project Gutenberg copy of W 1913, please take a look at it. I do not have a copyright-free electronic copy of W 1913 to use! Do you know of one?

--Connel MacKenzie T C 08:26, 24 March 2006 (UTC)[reply]

Huh? Any copyright on republished public domain works is limited to any additional formatting that was not present in the original work. See User_talk:Eclecticology#Template:R:Century 1914. -Patrik Stridvall 13:44, 24 March 2006 (UTC)[reply]

First of all, "xref.wiktionary.org" is primarily meant to be a cross reference and only secondarily a tool to help add translations and other things. So comparing it to the Webster entries is not really that relevant. Besides, as I said, you need something similar anyway. The hard parts are writing a good parser that stores the data in a good database and then writing a good generator for the new pages. When you have done this you have done 95% of the difficult parts of "xref.wiktionary.org". Doing the rest should be easy. Note that you could have a button that says "Automatically add this" that queues it up for adding by a bot.
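To make the parser / database / generator split concrete, here is a minimal sketch of the middle and last steps. Everything in it is an assumption for illustration: the input is taken to be already parsed into Python data (the hard parsing step is skipped), the glosses are invented, and the stub layout is a placeholder, not the format WT:ELE prescribes. The only real data is the ivory / elfenben pair mentioned earlier in this discussion.

```python
from collections import defaultdict

# Hypothetical parser output: English headword -> list of senses, where each
# sense is (English POS, gloss, {language: [translations]}).
ENGLISH = {
    "ivory": [
        ("noun", "material of tusks", {"Swedish": ["elfenben"]}),
        ("adjective", "made of ivory", {"Swedish": ["elfenbens-"]}),
    ],
}


def build_xref(english):
    """Invert the translation tables: (language, term) -> English backlinks."""
    xref = defaultdict(list)
    for headword, senses in english.items():
        for pos, gloss, translations in senses:
            for language, terms in translations.items():
                for term in terms:
                    xref[(language, term)].append((headword, pos, gloss))
    return dict(xref)


def render_stub(language, term, backlinks):
    """Generate a placeholder page; one definition line per English sense.

    The POS shown is the *English* POS, flagged as unverified, because the
    translation need not have the same POS.
    """
    lines = [f"=={language}==", f"'''{term}'''", ""]
    for headword, pos, gloss in backlinks:
        lines.append(f"# [[{headword}]] ({gloss}) [English POS: {pos}, unverified]")
    return "\n".join(lines)


xref = build_xref(ENGLISH)
for (language, term), backlinks in xref.items():
    print(render_stub(language, term, backlinks))
    print()
```

Because the index is rebuilt from the English entries each time, the cross reference itself never goes stale; only pages actually written out by the generator do.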

Having a real cross reference will make it possible to discuss any shortcomings. Yes, I may be pessimistic and possibly arrogant, even though I would prefer to label myself realistic. However, perhaps I'm wrong. You admitted yourself that "the proof is in the pudding" so now it is up to proof. :-)

BTW, you still haven't answered how you will determine the POS for translations. It is not always the same, even for related languages like Swedish. In Swedish, I suppose you could take the adjective or the verb that an English noun translates to and make it an adjectival or a verbal noun, but this might not be how a sentence that contains the English noun is normally translated. Hmm, perhaps we should mark translations that change POS somehow.

As for "Humans have the opportunity to correct entries; they have added incentive, especially if tagged as suggested above.". True; however, have you any idea what effort is involved in doing good translations? I have been trying to check the Swedish translations and have managed to do a large part of them by now. The problem is that even very simple words have a lot of different meanings that translate to different things. Even entries that look good on the surface are often, on closer examination, revealed to be much more complicated. Some words have taken more than one hour to fix. In many cases I had to add better English definitions, more English senses and more English synonyms and antonyms before I could properly understand how to translate the word. My Swedish-English dictionary is quite good, but very insufficient for this task. See for example what I did on dirty, prime and variable. When I started I thought that it would be easy. It was not. And no, it will not be much easier the other way.

Your autogenerated entries are likely to not even look very good on the surface. How many of them do you think I can correct in a week? Even 100 would be very optimistic. 1000 is far out. Just saying that an entry is bad will not help very much. Then you have the other 39 languages... It must be treated as a continuous process. --Patrik Stridvall 13:44, 24 March 2006 (UTC)[reply]