User talk:RJFJR/WTconcord


eliminating derived forms

Conceptually, this is simple. Suppose the list contains "clicking" but Wiktionary has no entry for it; if the root (stem) of "clicking" is "click", and we do have an entry for that, we probably don't need to list "clicking" as a word we don't have.
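
To make that concrete, here's the test in miniature (just a sketch; it assumes the sorted "allwords" file of existing entries that the full script below also uses):

# the culling rule for a single word: "clicking" can be dropped
# because its stem "click" already has an entry
word=clicking
stem=click                # what a stemmer produces for "clicking"
if grep -qx "$stem" allwords; then
    echo "$word: stem is defined, cull it"
else
    echo "$word: stem is missing, keep it"
fi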

As Connel pointed out in the Beer parlour (discussion since moved to the Grease pit), one program for extracting word roots is the "Porter Stemmer". It's not perfect, but it's a place to start.

I don't know what kind of platform you're using, RJFJR. I'm using Unix (well, actually, Mac OS X at the moment), which excels at automating tasks like this one with shell scripts. I don't know if it will be of use to you, but here's a reasonably straightforward script that uses the "PorterStemmer" program to cull derived words from your list:

ifile=$1               # input file (list of missing words)
allw=./allwords        # file of all entries in Wiktionary

# Note: this script assumes single words,
# i.e. it won't work for multi-word entries with spaces in them

tf1=/tmp/tf$$.1
tf2=/tmp/tf$$.2
tf3=/tmp/tf$$.3

# Run "PorterStemmer" program to create list of stems
./PorterStemmer $ifile > $tf1

# paste words and stems together (side-by-side) to make a 2-column file,
# sorted on the second column, the stem
# (will be used later to map missing stems back to words)
paste $ifile $tf1 | sort -k 2 > $tf2

# extract second column (sorted stems),
# use comm to select those not present in list of all words
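# (comm -23 prints only lines unique to the first input,
# i.e. stems that match no existing entry)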
awk '{print $2}' $tf2 | sort -u | comm -23 - $allw > $tf3

# having list of stems not present, go back and correlate with
# words from which those stems were derived
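# (join needs both inputs sorted on the join field: $tf3 comes out
# of comm already sorted, and $tf2 was sorted on its second column)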
join -1 1 -2 2 -o 2.1 $tf3 $tf2 | sort -u

rm $tf1 $tf2 $tf3

This presumes that:

  1. You've fetched the PorterStemmer program from http://www.tartarus.org/~martin/PorterStemmer/ and compiled it if necessary. I fetched the C version, c.txt, renamed the source code to "PorterStemmer.c", and compiled it with cc -o PorterStemmer PorterStemmer.c (a quick sanity check follows this list).
  2. You have a sorted file containing all the words already defined in Wiktionary. The script assumes this is in a file in the current directory called "allwords"; you can either rename your file or edit the script as appropriate. (And it must be sorted; use sort -o allwords allwords if not.)
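
Once it's compiled, a quick sanity check (I'm assuming here that the C version takes file names as arguments and echoes back the stemmed text; if your build reads standard input instead, adjust accordingly):

printf 'clicking\nclicked\nclicks\n' > /tmp/stemtest
./PorterStemmer /tmp/stemtest
# should print "click" three times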

If the script is in the file "RJFJRcull.sh", usage is simply

sh RJFJRcull.sh missing_words

where missing_words is (obviously enough) the file of missing words. The script spits out a culled version of the input list, which you can capture by doing

sh RJFJRcull.sh missing_words > culled_missing_words

or whatever. (Apologies if I've belabored the obvious here; I don't know whether you know anything about Unix sh programming or not.)

scs 18:26, 3 June 2006 (UTC)

I forgot to mention: One small problem with the PorterStemmer program as written is that it seems to convert everything to lower case. So if you've got a candidate undefined word Porter, it stems that to port, which we have, so Porter goes off the "missing words" list. This is obviously fixable, but I'm not too worried about it just now, because the effect is small, and it's not as if we're overpruning the "missing words" list down to nothingness such that there's nothing left to do. –scs 19:41, 3 June 2006 (UTC)
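
If it ever does become a problem, one rough workaround (a sketch only; the intermediate file names here are made up) is to set aside capitalized candidates so the stemmer's lower-casing never touches them:

# keep capitalized words out of the stemming pass entirely
grep '^[A-Z]' missing_words > capitalized_words
grep -v '^[A-Z]' missing_words > lowercase_words
sh RJFJRcull.sh lowercase_words > culled_lowercase
# recombine: capitalized candidates are kept unconditionally
sort capitalized_words culled_lowercase > culled_missing_words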

Wikipedia concord?

Can we get a similar concord for Wikipedia? Also, is it possible to get a list of all single-word entries for Wikipedia that do not have corresponding Wiktionary entries? Cheers! bd2412 T 03:25, 30 July 2006 (UTC)

I never got around to replying to this, but I started on it. The problem is that the program seems to slow down as it runs on that big database; I think it's a garbage collection problem in .NET. Then I stopped working on it for a while. I'll try to get some more work done on it. RJFJR 06:44, 3 March 2007 (UTC)
I'm thinking it may be faster to just go through Wikipedia article titles, rather than all Wikipedia content. bd2412 T 07:14, 3 March 2007 (UTC)
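
If the title route looks promising, something along these lines might work (a sketch only; the dump file name and its conventions are my assumptions, so check them against the actual download):

# multi-word titles in the dump use underscores, so dropping lines
# containing "_" leaves the single-word titles
grep -v '_' enwiki-all-titles-in-ns0 |
tr 'A-Z' 'a-z' |            # dump titles capitalize the first letter
sort -u > wp_single_titles
# keep only titles with no Wiktionary entry
# ("allwords" sorted, as in the culling script above)
comm -23 wp_single_titles allwords > wp_missing_from_wikt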

Delete blue links/junk links?

Hey, should we delete blue links? And can I knock out obvious junk like mackenziebot, Cerealkiller, male-X, male-Y, male-Z? Perhaps make a separate list for your program to review so that it can avoid picking those up in the future? bd2412 T 04:36, 3 March 2007 (UTC)

I need to copy the junk to an exclusion list so it doesn't pop up again next time I run the data. The blue links are either because they were added after the database dump was done, or because I'm making a mistake in the program. I'm running this again (I've been on other things for a while) so things should be getting updated now (I may need to wait for the next database dump to occur). Please just make a note next to junk so I know to add it to the exclusions list. RJFJR 06:40, 3 March 2007 (UTC)
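
For what it's worth, the exclusion step itself is simple once the list exists (a sketch; the file names are placeholders):

# both files must be sorted for comm to work
sort -o exclusions exclusions
sort -o missing_words missing_words
# keep only candidates not on the junk list
comm -23 missing_words exclusions > missing_words_clean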

Improving the search links

The Google search links next to each word are great (I just used them to get rid of several common misspellings). They return, however, lots of non-article pages. One easy way to get rid of many of them would be to append +-Talk%3A (the URL-encoded form of -Talk:) to the query string, so that the search skips User_talk: and Talk: pages.
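
For example (the exact shape of the generated URL is my guess; the point is just the appended exclusion term):

# before:  http://www.google.com/search?q=%22someword%22
# after:   http://www.google.com/search?q=%22someword%22+-Talk%3A
word=someword
echo "http://www.google.com/search?q=%22${word}%22+-Talk%3A"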
