English wiktionary XML dump statistics


Lines 11213241
Pages/IDs 394009
Etymology 29
Pronunciation 28209
Definitions 525988
Synonyms 12724
Groups of translations 25797
SubID 356814
Derived terms 10097
Related terms 14826
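
Counts like these could be tallied roughly as sketched below. This is a hedged illustration, not the script that produced the numbers above: it assumes each figure corresponds to a wikitext section heading (for example `===Etymology===`) on its own line of the dump text. The name `count_headings` and the sample lines are invented for the example.

```python
import re
from collections import Counter

# Assumption: sections are marked by wikitext headings such as
# "===Etymology===" or "====Synonyms====" on a line of their own.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

def count_headings(lines):
    """Count how often each heading title appears in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts

# Hypothetical sample; for a real dump, pass an open file object instead,
# so the multi-million-line file is streamed rather than loaded into memory.
sample = [
    "==English==",
    "===Etymology===",
    "===Noun===",
    "# A domesticated animal.",
    "====Synonyms====",
    "====Translations====",
    "===Etymology===",
]
counts = count_headings(sample)
print(counts["Etymology"], counts["Synonyms"])
```

A heading that shares a line with an XML tag (such as the first line of a page's text) would be missed by this pattern, so real totals would need a little more care.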

http://wiki.webz.cz/wikt.png - Statistics for entries in groups of translations

Suggested changes for English wiktionary

  • More definitions from Webster's unabridged dictionary (public domain)
  • IPA pronunciation
  • 20000 English .ogg files made with TTS
  • More images from commons
  • Entries to validate from wikipedia interlanguage links




Have a look at this page. It gives the IPA symbols for all the sounds in English. Let me know if you have any questions. — Paul G 07:39, 15 October 2006 (UTC)

IRC Chat - Arael and Connel

The following is an IRC "private" chat between IRC nick "Connel" and "Arael", reposted here with permission of both parties. WMF guidelines specify that no Wikimedia channels should be publicly logged. Reposting of excerpts is permitted if permission is granted from all parties. --Connel MacKenzie 16:56, 19 October 2006 (UTC)
  • 17:28:50: Good morning
  • 17:29:00: Good evening ;-)
  • 17:29:04: :-)
  • 17:29:27: I verified that I got them, and listened to most of the Zs
  • 17:29:34: I am quite impressed by most of them
  • 17:30:05: I can create even more in few hours if you want - but we have the most normal words covered by the 20K
  • 17:30:19: Have you uploaded the first three onto Commons: yet?
  • 17:30:44: I'd like to know precisely the format I should use before starting the mass-upload.
  • 17:31:14: Nope it would be cool to upload the zip file and let the server guys unpack it on the server.
  • 17:31:40: That is, whatever license and comments you use, I'll copy, and append "Uploaded by User:Connel MacKenzie by bot, on behalf of User:Arael."
  • 17:31:52: Well, I seem to be taking on that role.
  • 17:32:04: Yes of course - we just need to get them online ASAP so users can use the files.
  • 17:32:18: ... when we have them ready and waiting ;-)
  • 17:32:23: I don't think there is any critically urgent rush.
  • 17:32:35: You know the proverb right?
  • 17:32:48: * Connel blinks *
  • 17:33:01: No, I'm not up on Czech proverbs...
  • 17:33:06: What you can postpone until the day after tomorrow, do not postpone until tomorrow - you will gain two days of free time :-))))
  • 17:33:33: heh
  • 17:33:55: Well, I'm not procrastinating; I'm trying to allow reasonable time for review of the idea by interested parties.
  • 17:34:09: Unfortunately, right now, that seems to be limited to you and me.  :-)
  • 17:34:35: The way I see this proceeding is something like this:
  • 17:34:48: So what do you suggest? Ask wiki commons pump guys again how to upload the files easily?
  • 17:35:10: 1) You upload three to ten individual sound files and link them to entries on en.wiktionary.org
  • 17:35:31: 2) Comment on them in WT:GP and then WT:BP the next day
  • 17:35:43: I am wiki newbie so I know only few about how wiki works but in the real world it is possible to upload the zip file then launch unzip on the server and voila ;-)
  • 17:35:46: 3) Work out whatever licensing kinks arise
  • 17:36:12: 4) Let my bot start uploading the first 100
  • 17:36:22: I would like to have the .ogg files free so everyone could use them - I guess public domain is the right licence.
  • 17:36:28: 5) Wait another two to seven days for comments
  • 17:36:42: 6) I fire off the bot to upload the rest
  • 17:36:49: Oki - I am on the IPA right now.
  • 17:36:52: 7) TheDave fires off his bot to link them
  • 17:37:13: so,
  • 17:37:28: I'm not sure what your IPA task involves.
  • 17:37:43: Are you manually entering 20,000 IPA transcriptions?
  • 17:37:57: We just need to find a way that is really simple so we do not have to assign things *manually* - I have created all those .ogg files in few hours by launching a *single* command ;-)
  • 17:38:27: Nope I am creating translation table for a file which has some 120,000 IPA transcriptions - all normal English words covered.
  • 17:38:40: Well, that was the simple step. But making sure we don't piss someone off with the uploads is important too.
  • 17:38:49: whoa
  • 17:38:54: where'd you get that file?
  • 17:39:06: Is it public domain? GPL? GFDL?
  • 17:39:20: I got it from Moby. Let me find a link.
  • 17:39:56: The file is here http://www.dcs.shef.ac.uk/research/ilash/Moby/mpron.html
  • 17:40:01: Thanks.
  • 17:40:08: I'll investigate that
  • 17:40:16: Sorry I was wrong the file has 175K entries ;-)
  • 17:40:21: Do you mind if I post this conversation on my talk page?
  • 17:40:27: What is a talk page?
  • 17:40:46: http://en.wiktionary.org/wiki/User_talk:Connel_MacKenzie
  • 17:41:22: Oki some guys might get new ideas when they read our chat ;-)
  • 17:41:41: It isn't a race.
  • 17:42:03: The thing is, many others will have insightful input to offer.
  • 17:42:14: I can also add thousands of Czech entries into wiktionary if I would have wiktionary like: dog - definition of dog - Hund - chien
  • 17:42:16: The signal to noise ratio there is very high.
  • 17:42:30: Czech? ABSOLUTELY.
  • 17:42:47: I am Czech guy Czech is piece of cake for me.
  • 17:43:20: The thing is there are files on the net which will help us do it. Some guy released LARGE FREE Czech dicts ;-)
  • 17:44:09: With a special program I can create relations and suggest Czech words for wiktionary, and then check them in Excel - but how will we upload the Excel file into wiktionary so new entries are added into the DB?
  • 17:44:40: Well, remember that there is Free and there is Free
  • 17:44:56: What is the difference between Free and Free?
  • 17:45:13: Free (as in beer) vs. Free (as in speech)
  • 17:45:38: Freely available does not mean that it is covered by a GNU Free license.
  • 17:46:16: I can ask the guy who released the free dicts to help us - I think he will also like Czech entries in wiki - because he also released dictionaries for free.
  • 17:46:44: The Wiktionary project (and all WMF projects) by default use the GFDL license - so whoever contributes is implicitly releasing their work under the GFDL.
  • 17:47:02: And GFDL is good or bad licence?
  • 17:47:06: One aspect of GFDL is attribution - the other main one is that it can be freely reused.
  • 17:47:10: Very good.
  • 17:47:20: GFDL is a very good license.
  • 17:47:36: I have heard some complaints about it - but I am not a lawyer.
  • 17:47:51: Can someone sell GFDL data for money?
  • 17:48:08: I have heard some complaints too...I generally disagree with those complaints though.  :-)
  • 17:48:19: Yes, you can sell GFDL data for money.
  • 17:48:29: The key is *attribution*.
  • 17:48:52: As long as you honor the terms of the GFDL license, you are fine.
  • 17:49:07: attribution as in "the attribution of lightning to an expression of God's wrath"
  • 17:50:06: I have to look at some more examples to understand what attribution means ;-) I know retribution but not attribution :-))
  • 17:50:06: Sort of. Attribution meaning that you identify /where/ you got it from, and who worked on it.
  • 17:50:58: ok, I need to get to some other things, so,
  • 17:51:03: oki
  • 17:51:14: I'd like you to get the first couple files uploaded onto commons.
  • 17:51:25: Ask in #wiktionary for help, if you hit any snags
  • 17:51:27: When you make some progress let me know I will return to IPA file.
  • 17:51:40: and also, if you could,
  • 17:51:56: But I do not know how to upload files onto commons - why does not commons have FTP?
  • 17:52:12: repost whatever portions of this (you are comfortable with) on http://en.wiktionary.org/wiki/User_talk:Arael
  • 17:52:52: Commons allows only /single/ files to be uploaded, so they can be described, categorized, licensed, reviewed and linked separately.
  • 17:53:06: Hmm we just chat to get some work done - smalltalk ;-)
  • 17:53:24: You would be amazed how much work can be done in few seconds with the right tools ;-)
  • 17:53:41: Oh no, I am quite well aware.
  • 17:54:01: I still cannot imagine that thousands upload one file at a time - has to be exhausting and time consuming.
  • 17:54:07: That is the inherent danger; upload 20,000 files /WRONG/ and have to redo them - YUCK!
  • 17:54:26: It *is* very time consuming, for the bot.
  • 17:54:48: The bot has built-in "sleep" periods it has to honor.
  • 17:54:50: I also think that the mass bot things make server busy.
  • 17:55:22: I agree. Technically it would be much more efficient to do just the zip file, unzipped on the server.
  • 17:55:28: But that is not the Wiki way.
  • 17:55:32: Normally I would upload file via ftp unpack it with single command and add a line into the .php code which checks if file exists.
  • 17:55:56: Exactly.
  • 17:56:01: Wait a second - wiktionary has to have developer who can check if dog.ogg exists.
  • 17:56:19: Then display dog.ogg on the page. It is 5 minutes for developer.
  • 17:56:42: Well,
  • 17:56:55: In .php it looks like if(file_exists("sound/english/dog.ogg"))echo "dog.ogg";
  • 17:57:13: You are talking about customizing Wiktionary code, to check a different server (commons) for a file, living outside of mysql.
  • 17:57:31: The developers would never do that.
  • 17:57:39: We can store the urls of the files in table in wiktionary.
  • 17:57:52: The MediaWiki code is used by a lot more sites than just Wikipedia/Wiktionary
  • 17:58:39: Then it takes select command - select url from oggfiles where url like "%/dog.ogg"
  • 17:58:47: * Connel does not know how many thousands of sites have a wiki. Probably hundreds of thousands, by now. *
  • 17:59:18: Erik told me where to get dump of WZ - I will unpack it and take a look at it - does wiktionary also have dumps?
  • 17:59:20: For code to be added to Wiktionary, it has to make it into the main branch of the MediaWiki code
  • 17:59:27: Yes it does
  • 17:59:32: same place
  • 17:59:40: http://download.wikimedia.org/
  • 17:59:42: WZ dump was in a forum.
  • 18:00:02: ah, right. WZ is not part of the cluster yet.
  • 18:00:30: So you say to modify wiktionary code takes several months ;-)
  • 18:00:48: basically.
  • 18:00:58: Weeks is more typical.
  • 18:01:10: Big things require a lot of time for things to get modified.
  • 18:01:15: Something like that though, that would affect everyone, would take quite a while.
  • 18:02:00: So what do you need next in wiktionary?
  • 18:02:04: So, working /within/ the constraints of the current system, the files get uploaded by bot.
  • 18:02:38: What if the constraints limit/delay the development?
  • 18:02:41: A better spell check.
  • 18:03:16: There is aspell and ispell - I have heard they work quite well - would they help?
  • 18:03:50: Arael, you have to remember that the majority of the Wiktionary contributors are interested in building the lexical content, not with automated uploads. They are looked at with disdain.
  • 18:04:04: My current spell checker uses ispell.
  • 18:04:30: But it needs to treat wiki syntax a bit better
  • 18:04:50: But take a look at how long would it take to manually create the .ogg files and the IPA entries. It would take years.
  • 18:05:10: We have it in days.
  • 18:06:31: It is not possible to edit everything by hand - I code programs to do stuff for me - I have converted/formatted over 2000 .doc, .rtf, .txt files in just two days.
  • 18:07:38: Wait,
  • 18:07:43: That is not true.
  • 18:08:02: It certainly *is* possible to do everything by hand - if you have enough hands.
  • 18:08:09: Unfortunately,
  • 18:08:32: I have only two hands so I code programs to do stuff for me - helps a lot ;-)
  • 18:08:56: Most contributors feel that their human input is so much more valuable than automated entries, that they'd rather prohibit the automated entries and get them all right.
  • 18:09:08: rather than risk a handful of mistakes from a bot.
  • 18:10:10: Even worse, are when questions of licensing crop up.
  • 18:10:44: On WT:BP, we are currently discussing *deleting* 23,000 entries from "NanshuBot" because of license issues.
  • 18:11:19: I would like to chat with these guys - they have no idea - wiki uses mysql - that is automatic thing impossible to do searches by hand.
  • 18:11:59: How about creating very simple code allowing users to validate entries of the bot? It would go really fast.
  • 18:12:01: Their position is sometimes understandable.
  • 18:12:48: More often, I am as baffled as you are, by the anti-bot mentality that pervades the conversations.
  • 18:13:03: I think these guys understand it badly - bots are here to save us from clicking and typing too much.
  • 18:13:17: The "validation" step is what is supposed to happen *before* the uploads start.
  • 18:13:30: Arael: I very, very strongly agree.
  • 18:13:40: Can you give me some nicks of the anti-bot guys? I would love to chat with them.
  • 18:14:09: I know a lot about doing stuff automatically I would explain them why they are so wrong.
  • 18:14:17: But as I said earlier, it is more important not to piss people off, when mass-uploading. The way to do so, is to discuss it at length before starting.
  • 18:14:31: That wouldn't really help though, would it?
  • 18:14:56: Whenever you simply tell someone they are wrong, they *automatically* get defensive.
  • 18:15:04: Sometimes that defense is very aggressive.
  • 18:15:09: I think anti-bot guys do not know what they talk about - they need someone to explain them what they want to talk about
  • 18:15:29: I like aggressive guys - especially when they are wrong :-))))
  • 18:16:00: So who is the Bin Laden of anti-bot movement? Give me his nick ;-)
  • 18:16:10: You don't get it.
  • 18:16:27: The contributors that hate bots don't use IRC.
  • 18:16:36: Why not? I would really like to chat with them. You have no idea how silly guys are out there.
  • 18:17:06: Girl who got Master in IT gave me a lecture about how not useful the firewall is.
  • 18:17:11: Alas, I think you overestimate their intelligence.
  • 18:17:24: :-)
  • 18:17:27: ... that it bothers her while she does her important work with questions.
  • 18:17:42: And she is Master of information science! Imagine that!
  • 18:18:10: Then I have heard about PhD candidate girl who does not know what is .zip file and cannot unpack it - also studies PhD in IT!!!!!
  • 18:18:44: Over the years, I've become quite understanding and tolerant of people's shortcomings.
  • 18:19:04: But I hope 99% wiki users are different than this.
  • 18:19:08: People experience life in different ways - they shape their opinions based on their experiences.
  • 18:19:35: "If the only tool you have is a hammer, every problem begins to look like a nail."
  • 18:19:41: I wonder why anti-bot guys use computers at all - they are bots too.
  • 18:19:49: hehe
  • 18:20:14: Really I would LOVE to chat with these guys - they are soooo wrong :-))))
  • 18:20:46: How many anti-bot guys are there in wiktionary? Like 1% ?
  • 18:20:54: About 95%
  • 18:21:08: You almost made me cry!
  • 18:21:42: Of the ~500 "Very active" contributors, four of us run bots.
  • 18:22:07: I say bots are our friends - if they have good logic they help a lot.
  • 18:22:16: People are *very* afraid of bots running out of control.
  • 18:22:20: No need to hate them - bots do the best they can.
  • 18:22:47: So they are scared of skynet bot taking control of wiki and launching nukes against Russia? :-)))
  • 18:23:13: People's talents lie in different areas. No one wants to have their work inadvertently replaced by a bot
  • 18:23:32: I say use bots to suggest stuff and then validate it.
  • 18:24:00: And be careful with replacement bots - think twice before running them.
  • 18:24:01: Well, I hope I've described that that is an uphill battle.
  • 18:24:32: I am glad wiktionary is run by someone who understands how important it is to make things effectively.
  • 18:24:41: Oki I will think about things we have talked about.
  • 18:25:07: Can you repost part or all of this at http://en.wiktionary.org/wiki/User_talk:Arael please?
  • 18:25:15: I have wiktionary converted into .txt file.
  • 18:25:41: That is a good first step. Parsing the XML is very straight-forward.
  • 18:25:50: I can put it into a excel table and show to you what can be done with the data in excel - that would help a lot I think.
  • 18:26:11: I can start with Czech entries - I will surely not add them manually ;-)
  • 18:26:15: Well, yes. (Wouldn't help convince me, of course.)
  • 18:26:25: "Preaching to the choir" and all...
  • 18:26:43: Oki I will try to copy this chat into Arael page.
  • 18:27:02: The best approach for getting bot approval is to show *by* *example* how useful they are.
  • 18:27:05: Will it make guys angry at me?
  • 18:27:23: No, it will make them cautious.
  • 18:27:44: About what? That skynet is taking over wiki?
  • 18:27:58: hehe
  • 18:28:08: These guys have to understand that we do not take anything from them.
  • 18:28:20: Our objective is to make wiki as good as possible in short time.
  • 18:28:34: They will want to see that you know how to create entries *manually* first, so that you don't tell a bot to do something wrong.
  • 18:28:47: I agree.
  • 18:28:47: Oki
  • 18:28:50: Hrmm.
  • 18:29:12: You'd better take out my comment about getting arrested from the above. "No personal threats" might apply.
  • 18:29:31: Oki I will post it and you can censor it.
  • 18:29:31: :-)
  • 18:29:45: That's fine. Thank you.
  • 18:29:49: Bye
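
The lookup-table idea from the chat (17:57:39 to 17:58:39: store the sound file URLs in a table, then select by file name) can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than the wiki's MySQL, the table name `oggfiles` and column `url` come from the query quoted in the chat, and the `example.org` URLs and the helper `find_sound_url` are made up. It is not how MediaWiki or Commons actually resolves files.

```python
import sqlite3

def find_sound_url(conn, filename):
    """Return a stored URL whose last path component is `filename`, or None."""
    # Same shape as the query quoted in the chat:
    #   select url from oggfiles where url like "%/dog.ogg"
    # The leading "/" keeps "dog.ogg" from matching "hotdog.ogg".
    row = conn.execute(
        "SELECT url FROM oggfiles WHERE url LIKE ? ORDER BY url LIMIT 1",
        ("%/" + filename,),
    ).fetchone()
    return row[0] if row else None

# Hypothetical data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oggfiles (url TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO oggfiles (url) VALUES (?)",
    [
        ("http://example.org/sound/english/dog.ogg",),
        ("http://example.org/sound/english/hotdog.ogg",),
    ],
)

print(find_sound_url(conn, "dog.ogg"))
print(find_sound_url(conn, "cat.ogg"))
```

Note that `_` and `%` are wildcards in `LIKE`, so a file name containing an underscore could match more loosely than intended. A leading-wildcard pattern also cannot use an ordinary index, so storing the bare file name in its own indexed column would likely scale better for 20,000 files.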