User:Mutante/German nouns/source

Generating User:Mutante/German nouns

This is how I generate the lists:

  • run "check_nouns.php" like:
    • php check_nouns.php <word>
      where <word> is replaced with the word to start with

I run this on a remote server where I have the PHP CLI available in the shell. It produces lines like these:

I *[[Testament]] - <span style="background:#c0c0c0">uses 'infl|de|noun' template</span>
X *[[Testosteron]] - <span style="background:#ff0000">could not find 'de-noun' nor 'infl|de|noun' template</span>
X *[[Tetanus]] - <span style="background:#ff0000">could not find 'de-noun' nor 'infl|de|noun' template</span>
I *[[Tetraeder]] - <span style="background:#c0c0c0">uses 'infl|de|noun' template</span>
X *[[Teufelsdutzend]] - <span style="background:#ff0000">could not find 'de-noun' nor 'infl|de|noun' template</span>
I *[[Textmarker]] - <span style="background:#c0c0c0">uses 'infl|de|noun' template</span>

The first character of each line is either "B", "G", "I", or "X". The resulting files can then easily be sorted with a text editor, so that the list is grouped by "color". Afterwards the first character is discarded and the list is sorted again, so that it is in alphabetical order. Finally, the result is pasted into the A-Z pages manually.
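The sort-and-strip step described above could also be scripted instead of being done in a text editor. The following is a minimal sketch and not part of check_nouns.php: the function name group_and_sort() is hypothetical, it assumes every line starts with the marker letter followed by exactly one space (as in the sample output), and it assumes the alphabetical sort is meant within each color group.

```php
<?php
# Sketch only: group check_nouns.php output by marker letter, then sort
# each group alphabetically and drop the marker.
# group_and_sort() is a hypothetical helper, not part of check_nouns.php.

function group_and_sort(array $lines) {
        $groups = array();

        foreach ($lines as $line) {
                $line = rtrim($line, "\r\n");
                if ($line === "") {
                        continue;
                }
                # the first character is the marker ("B", "G", "I" or "X"),
                # everything after the following space is the wiki list item
                $marker = substr($line, 0, 1);
                $groups[$marker][] = substr($line, 2);
        }

        # order the groups by marker letter, then the entries inside each group
        ksort($groups);
        foreach ($groups as $marker => $entries) {
                sort($entries, SORT_STRING);
                $groups[$marker] = $entries;
        }
        return $groups;
}

# shortened sample lines, only to show the call
$sample = array(
        "X *[[Tetanus]] - red",
        "I *[[Textmarker]] - grey",
        "I *[[Testament]] - grey",
);
foreach (group_and_sort($sample) as $marker => $entries) {
        print "== $marker ==\n" . implode("\n", $entries) . "\n";
}
```

For a file written by the script, something like group_and_sort(file("Word.txt")) should work, since file() returns the lines with their line endings and the helper trims them.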

The meaning of the letters can be found on User:Mutante/German_nouns and in the source code below. The checks are based on "de"-specific templates, so the script needs adjusting for other languages.
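To make the point about other languages more concrete: the two German-specific strings that check_plural() searches for both contain the language code "de". One possible starting point is to build those strings from a parameter. This is only a sketch; the function name template_markers() is hypothetical, and it rests on the unverified assumption that templates for other languages render the same HTML with only the language code swapped.

```php
<?php
# Sketch only: build the HTML fragments that check_plural() searches for
# from a language code. template_markers() is a hypothetical helper.
# ASSUMPTION (unverified): templates for other languages render the same
# HTML as the German ones, with only the language code changed.

function template_markers($lang) {
        return array(
                # fragment found when the noun template shows a plural
                "plural" => "form-of plural-form-of lang-$lang\"><i>plural</i>",
                # fragment found when the generic 'infl|<lang>|noun' template is used
                "infl"   => "<p><b lang=\"$lang\" xml:lang=\"$lang\"",
        );
}

# with "de" this reproduces the two strings used in check_nouns.php
$markers = template_markers("de");
print $markers["plural"] . "\n";
print $markers["infl"] . "\n";
```

The strpos() calls in check_plural() would then take $markers["plural"] and $markers["infl"] instead of the hard-coded strings. The category URL in the main program is language-specific as well and would need the same treatment.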

(Hint: I actually run

php check_nouns.php Word | tee Word.txt

to watch the output and write it to a file at the same time.)

If you find this useful, help me fix some of the listed words or modify / enhance the code to make it work for other languages. Mutante 14:02, 4 April 2010 (UTC)

check_nouns.php

<?php
# check (German) nouns on en.wiktionary.org
# do they use the right template? is there a plural to be created?
# mutante - 2009/2010


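# extract the words listed on a category page URL (Category:German_nouns), see below

function extract_words($url) {

ini_set('user_agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
ini_set('default_socket_timeout', 5);

# always return an array, even if nothing could be extracted
$outarray=array();

$buffer=file_get_contents($url);
if ($buffer === false) {
        return $outarray;
}
$buffer=explode("<p>The following",$buffer);
if (!isset($buffer[1])) {
        return $outarray;
}
$word=explode("title=\"",$buffer[1]);

# 200 words per page
$limit=200;

# $word[0] is the text before the first title="..." attribute, so start at 1
# and stop early if the page lists fewer than $limit words
for ( $counter = 1; $counter <= $limit && isset($word[$counter]); $counter += 1 ) {

$outword=explode("\"",strip_tags($word[$counter]));
$outword=str_replace(" ","_",$outword);
$outarray[$counter]=$outword[0];

}
return $outarray;
}
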
# check the word for template usage
# outputs one of:
# "B" (blue) - To do: is "done", just check if the plural is correct (output of B is skipped by default, commented out below)
# "G" (green) - To do: use "Accelerated" and create the plurals (if they are correct)
# "I" (grey) - uses 'infl|de|noun' template - To do: replace with de-noun template if possible
# "X" (red) - could not find 'de-noun' nor 'infl|de|noun' template - To do: insert de-noun template if possible
#
# I use the "sort" function of my text editor to sort by the first character of each line and then discard the first character
#
# we are checking for "de" specific strings here, so this needs to be adjusted for other languages, check the "if (strpos($buffer," lines

function check_plural($word) {
        ini_set('user_agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
        ini_set('default_socket_timeout', 5);

        $buffer=file_get_contents("http://en.wiktionary.org/wiki/".$word);

        # default to no output, so that $output is always defined
        # (e.g. when a plural is shown but neither link pattern below matches)
        $output="";

        # strpos() returns false if the string is not found, so compare explicitly
        if (strpos($buffer,"form-of plural-form-of lang-de\"><i>plural</i>") !== false) {
                # the plural links to a page that does not exist yet
                if (strpos($buffer,"plural</i> <b><a href=\"/w/index.php?title=") !== false) {
                        $plural=explode("plural</i> <b><a href=\"/w/index.php?title=",$buffer);
                        $plural=explode("&amp;",$plural[1]);
                        $plural=$plural[0];
                        $output="G *[[$word]] - <span style=\"background:#00ff00\">de-noun - green link - plural: [[$plural]]</span>\n";
                }

                # the plural links to an existing page
                if (strpos($buffer,"plural</i> <b><a href=\"/wiki/") !== false) {
                        $plural=explode("plural</i> <b><a href=\"/wiki/",$buffer);
                        $plural=explode("\"",$plural[1]);
                        $plural=$plural[0];
                        # "B" lines are skipped by default, uncomment the second line to print them
                        $output="";
                        # $output="B *[[$word]] - <span style=\"background:#00ffff\">de-noun - blue link - plural: [[$plural]]</span>\n";
                }

        } else {
                if (strpos($buffer,"<p><b lang=\"de\" xml:lang=\"de\"") !== false) {
                        $output="I *[[$word]] - <span style=\"background:#c0c0c0\">uses 'infl|de|noun' template</span>\n";
                } else {
                        $output="X *[[$word]] - <span style=\"background:#ff0000\">could not find 'de-noun' nor 'infl|de|noun' template</span>\n";
                }
        }
        return $output;
}


# function to jump to the next URL in the Category word list and be able to check more than one page at a time

function nexturl($url) {
ini_set('user_agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
ini_set('default_socket_timeout', 5);

$buffer=file_get_contents($url);
# the "next 200" link is the first link after the "previous 200" link
$nexturl=explode("previous 200</a>) (<a href=\"",$buffer);
$nexturl=explode("\"",$nexturl[1]);
$nexturl=str_replace("&amp;","&",$nexturl[0]);
# the href normally starts with "/" already, so strip it to avoid a double slash
$nexturl="http://en.wiktionary.org/".ltrim($nexturl,"/");
return $nexturl;
}


# main program
# take the word to start at from the command-line argument
# e.g.  php check_nouns.php <word>

if (!isset($argv[1])) {
        fwrite(STDERR,"usage: php check_nouns.php <word>\n");
        exit(1);
}
$word=$argv[1];

# our basis is the "Category:German_nouns" page

$url="http://en.wiktionary.org/w/index.php?title=Category:German_nouns&from=$word";


# currently the loop is limited to 10 executions here

# do it !

$i = 1;
while ($i <= 10):

$words=extract_words($url);

# the words are only read, so no reference is needed
foreach ($words as $noun) {
       print check_plural($noun);
}

$i++;
$nexturl=nexturl($url);
$url=$nexturl;
endwhile;


?>