User:RJFJR/WTconcord

I've been working on the program. While I've improved it, please make suggestions and comments. RJFJR 11:45, 3 June 2006 (UTC)
Looking good! The next things to work on automatically culling, I think, are derived words, Wiktionary user names (what's mine doing in there?), and Chinese transliteration particles.
I've written a script for culling derived words and applied it to WTconcord2; I'll put the details on this page's talk page. Wiktionary user names shouldn't be hard to handle. Weeding out the Chinese particles (and the English pronunciations, for that matter) should be pretty straightforward as long as they routinely sit alone under some distinct heading, and as long as there's some easy way for you to omit looking at selected headings when doing your text analysis. –scs 18:01, 3 June 2006 (UTC)
Actually, I'm making entries of the Chinese transliteration particles. Cheers! bd2412 T 21:16, 26 September 2006 (UTC)

These are the 1,000 words most used in Wiktionary that are not defined in Wiktionary. Since Wiktionary is a dictionary, it should define all the words it uses...
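As a rough illustration of the idea, here is a minimal sketch of how such a list could be computed. It is not the actual program: the function name `most_used_undefined` and the toy data are hypothetical, and it assumes the words have already been extracted from the dump and that the set of defined entry titles is available.

```python
from collections import Counter

def most_used_undefined(words, defined_titles, limit=1000):
    """Return the `limit` most frequent words that have no entry.

    `words` is every word occurrence found in the text;
    `defined_titles` is the set of words that already have an entry.
    """
    counts = Counter(w for w in words if w not in defined_titles)
    return counts.most_common(limit)

# Toy data standing in for the words extracted from the XML dump.
words = ["lemma", "gloss", "lemma", "entry", "gloss", "lemma"]
defined = {"entry"}
top = most_used_undefined(words, defined, limit=2)
```

With the toy data, `top` holds the two most frequent undefined words together with their counts.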

Note: Some of these are not valid: they are names (my concordance builder strips colons, so it loses track of which namespace a word came from), common templates or HTML tags. If you spot one of these, please mark it and I will add it to the filter list before the next run.

If the word is ever used uncapitalized then my program assumes the capitalized versions are at the beginning of sentences and lists it lower cased.
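The capitalization rule can be sketched as follows. This is only an illustration under the assumption that the words are available as a flat list of tokens; the helper name `fold_case` is hypothetical.

```python
def fold_case(tokens):
    """Lower-case a token if the same word occurs uncapitalized anywhere.

    A word that only ever appears with a leading capital (for example
    a proper noun) is left unchanged.
    """
    # Lower-cased forms of every token that starts with a lowercase letter.
    seen_lower = {t.lower() for t in tokens if t[:1].islower()}
    return [t.lower() if t.lower() in seen_lower else t for t in tokens]

tokens = ["The", "cat", "saw", "London", "the"]
folded = fold_case(tokens)
```

Here "The" is folded to "the" because "the" also occurs in the list, while "London" keeps its capital. The known weakness is that a proper noun which happens to share its spelling with a common word is folded as well.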

There are a few other oddities. For one thing, some of these words are already defined, in spite of the fact that I didn't find them in the XML dump. I will work on this for future passes.

  • It has been pointed out that one of the oddities is that my word-extraction algorithm doesn't recognize non-Latin characters (including accented characters) as parts of words; apart from plain letters it accepts only hyphen and apostrophe. So words that contain these characters are separated into two words with the unusual character missing. I'll think about this.
  • If a word is ever used without leading capitalization then it is assumed it should be all lower case. This is my way of handling capitalization for first word of a sentence. This may not be the best way to handle this.
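The first oddity above can be reproduced with a small sketch. The pattern `ASCII_WORD` is an assumption about how such an extractor might be written, not the program's actual code; `UNICODE_WORD` shows one possible alternative that treats any Unicode letter as part of a word.

```python
import re

# Assumed ASCII-only pattern: plain letters plus hyphen and apostrophe.
ASCII_WORD = re.compile(r"[A-Za-z'-]+")

# Possible alternative: runs of Unicode letters, optionally joined by a
# single hyphen or apostrophe. [^\W\d_] matches a letter in any script.
UNICODE_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def ascii_words(text):
    """Extract words the ASCII-only way; accented letters split words."""
    return ASCII_WORD.findall(text)

def unicode_words(text):
    """Extract words while keeping accented and non-Latin letters."""
    return UNICODE_WORD.findall(text)

# "naïve café", written with escapes so the precomposed letters are explicit.
sample = "na\u00efve caf\u00e9"
broken = ascii_words(sample)
whole = unicode_words(sample)
```

With the ASCII-only pattern the accented letters vanish and the words break apart at that point; the Unicode-aware pattern returns both words intact. Text stored with combining marks rather than precomposed letters would need normalizing first.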

form means that this is a form (plural, past, etc.) of a word someone had indicated we have. (I tried to preserve comments when I ran it again with a new filter list.)


This run has some changes:

  • I've ignored words of fewer than 5 characters.
  • It now can preserve comments from prior runs.
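Both changes can be sketched in a few lines. The names `keep_long_words` and `carry_over_comments` are hypothetical, and the sketch assumes the comments from a prior run are available as a mapping from word to comment.

```python
MIN_LENGTH = 5  # words shorter than this are ignored

def keep_long_words(words, min_length=MIN_LENGTH):
    """Drop words with fewer than `min_length` characters."""
    return [w for w in words if len(w) >= min_length]

def carry_over_comments(new_words, old_comments):
    """Pair each word in the new list with its comment from a prior run.

    Words without an earlier comment get an empty string; comments for
    words that dropped off the list are discarded.
    """
    return [(w, old_comments.get(w, "")) for w in new_words]

candidates = keep_long_words(["form", "forms", "a", "plural"])
annotated = carry_over_comments(candidates, {"forms": "form", "gone": "name"})
```

In this example only "forms" and "plural" survive the length filter, and "forms" keeps its earlier "form" comment.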

More changes:

  • Fixed a bug, so the program now properly removes words that are already defined and contain vowels with diacriticals.
  • This pass removes all templates before scanning.
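Removing templates before scanning could look roughly like this. It is a minimal sketch, not the program's actual code: the function name `strip_templates` is hypothetical, and it assumes templates are delimited by `{{` and `}}` and may be nested. Triple-brace parameters such as `{{{1}}}` are not handled specially.

```python
def strip_templates(wikitext):
    """Return `wikitext` with every {{...}} template call removed.

    A depth counter tracks nesting, so a template inside another
    template is removed along with its parent.
    """
    out = []
    depth = 0
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep a character only when it is outside every template.
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)

cleaned = strip_templates("A {{t|x{{y}}}} word")
```

The text around the template is kept untouched, including the spaces on either side of it, so the word extractor still sees the surrounding words as separate.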


  • Updated 08:50, 27 October 2007 (UTC) using the October 14 dump.