Wiktionary:Grease pit/2015/February

Help with ff-root

Need help with reviewing syntax for the new Template:ff-root. Seems to work, except that the articles are not showing up in the category. TIA.--A12n (talk) 18:45, 1 February 2015 (UTC)
Seems to be working now, but a review of syntax would still be appreciated. TIA.--A12n (talk) 19:41, 1 February 2015 (UTC)[reply]

I'm not sure what the template is supposed to do, as there is no documentation. Can you elaborate? —CodeCat 19:56, 1 February 2015 (UTC)

It's modeled after the ar-root template - but only needs to display the root on the page on which it's placed, put that page in the Fula roots category, and put itself in the Fula template category. There was a delay in populating the categories so I thought there was a problem. Will look at how to do the documentation (appears from the ar-root example to require a separate page).--A12n (talk) 20:06, 1 February 2015 (UTC)

If the purpose is only to show the page name, and add a category, then you don't need to make a new template. The standard template "head" will do: {{head|ff|root}}. —CodeCat 20:11, 1 February 2015 (UTC)

Ok, thanks. Will look into changing.--A12n (talk) 20:40, 1 February 2015 (UTC)[reply]

Option for checking past contributions...

I really think that there should be an "only show items that are not on your watchlist" option for checking one's past contributions.

Oftentimes, I remove items from my watchlist once I feel that they are no longer in any danger of being vandalised or the like. However, sometimes I wish to check on those items that I have removed from my watchlist just on the off chance that something did happen to them.

Is there any way to implement such an option for that? Tharthan (talk) 17:15, 2 February 2015 (UTC)[reply]

This script makes unwatched entries bolder on the user contributions page, but it is awfully slow.--Dixtosa (talk) 23:04, 8 February 2015 (UTC)

Bug in romanization of Arabic

The automatic romanization of Arabic has a small bug in the translations list. If you look at the English word 'wolf', the Arabic translation is given as ذِئْب (I have no idea whether that'll come out correctly here.) This is correct, but the romanization is (ḏīb). That is, it is not recognizing that the middle ya is the bearer of hamza, and is treating it as a ya of prolongation, giving a long vowel. On the actual page for the word ذِئْب, the hamza is correctly coming out in the transcription (ḏiʾb). – 194.106.220.86 16:29, 4 February 2015 (UTC)[reply]

We don't have automatic romanization of Arabic. It's all manual for that language. If someone romanized it as ḏīb when it should be ḏiʾb, then they just made a mistake. —Aɴɢʀ (talk) 17:00, 4 February 2015 (UTC)[reply]

Not quite true. It's automatic but only if the transliteration module determines that the word is fully vocalised. In any case, though, manual transliterations will override automatic ones. —CodeCat 17:03, 4 February 2015 (UTC)

Sure 'nuff. I took out the manual translit and now it automatically generates ḏiʾbun. —Aɴɢʀ (talk) 17:15, 4 February 2015 (UTC)[reply]

By convention, we don't include ʾiʿrāb in the translations to Arabic, so I changed the translation to ذِئْب (ḏiʔb). ذِئْبٌ (ḏiʔbun) is the nominative singular indefinite form in the MSA or Classical Arabic. --Anatoli T. (обсудить/вклад) 22:15, 5 February 2015 (UTC)

There's an ongoing discussion about the use of ʾiʿrāb in Wiktionary. Suffice to say that nunation is not pronounced in "pausa" (end of a clause before a pause) even in standard Arabic. No dialect preserves nunation, except for some accusative forms, especially adverbials but this also usually affects unvocalised spellings (alif is written in most cases). See also Wiktionary:About_Arabic#.CA.BEi.CA.BFr.C4.81b_.28final_short_vowels_and_nunation.29. --Anatoli T. (обсудить/вклад) 22:26, 5 February 2015 (UTC)

nive#Walloon

It's showing "uncountable, plural -", which doesn't make sense. This, that and the other (talk) 10:53, 5 February 2015 (UTC)[reply]

It's fixed now. —CodeCat 14:11, 23 February 2015 (UTC)

Template:headtempboiler:letter

As mentioned at Template:headtempboiler#Letter template there's the parameter "lower2=" in Template:headtempboiler:letter. But that doesn't work anymore and seems to have been removed here. A "lower2" is e.g. needed for σ (sigma). So the template needs to be fixed. Or should {{head|LANG|letter|lowercase|LOWER2|uppercase|UPPER}} be used like in β? -Yodonothav (talk) 21:56, 5 February 2015 (UTC)

Telugu script not showing up correctly

[[:File:Telugu-antarctica.png|right|thumb|A picture, for anyone seeking to troubleshoot this. - -sche (discuss) 08:33, 7 February 2015 (UTC)]][reply]

Hi! So I noticed that there seems to be a problem with how certain aspects of the Telugu script show up within entries (i. e., not in the titles). Consonant adjuncts don't seem to be working at all; consonant clusters appear as the two base consonants next to each other, the first with a virama (the inherent vowel deleter) and the second with the appropriate vowel adjunct. While this technically produces the same sound if read out loud, it is not generally how Telugu orthography works. Secondly, many vowel adjuncts don't seem to be working either... The adjunct simply shows up next to the base consonant it's supposed to be modifying, but just hovering next to it instead of being integrated like it should be. Below is an example of an entry which features all of these problems:

అంటార్కిటికా

The word should look like it does in the title of the entry, but nowhere else in the article does it look remotely like that. Does anyone know how I could fix or help fix this issue? It's rather widespread in Telugu articles. –Axaios Rex (అక్షయ్⁠రాజ్) 00:24, 7 February 2015 (UTC)[reply]

Fixed by removing a crappy font from Common.css (Sangam Telugu is good, though). —Μετάknowledge (discuss/deeds) 09:12, 7 February 2015 (UTC)

Space in Template:IPA

Can anyone figure out why {{IPA}} is no longer placing a space after the colon? Kc kennylau says he doesn't think it's because of his recent edits to Module:IPA, but I don't see any other recent edits to relevant templates or modules that could be causing it. —Aɴɢʀ (talk) 08:12, 7 February 2015 (UTC)[reply]

PTO translated with a combined mark into Hungarian

Hello there,

I wonder if it's possible to add this sign: ˙/. as a translation for PTO in its second meaning ('please turn over'). There seems to be an issue with this string. I wrote "fordíts!" as well, because that's how it's expanded in speech, but in terms of writing, this form is not used, only the combination of these three characters. Thanks in advance for your help. Adam78 (talk) 23:36, 7 February 2015 (UTC)[reply]

@Adam78: I replaced it with the ٪ symbol. --Panda10 (talk) 14:42, 7 November 2015 (UTC)[reply]

Thank you! I don't think it's exactly the Arabic sign that is used in Hungarian, but it may be better than nothing. :) Adam78 (talk) 22:11, 3 December 2015 (UTC)[reply]

@Adam78: Is this better: ⸓? Using an Arabic character is not a good idea. --Wiki Tiki 89 22:34, 3 December 2015 (UTC)[reply]

It might be better but the character does not show up in my browser, only the Unicode. I looked it up elsewhere to see it. --Panda10 (talk) 22:48, 3 December 2015 (UTC)[reply]

I'm sorry for replying late. I just got the notification of the replies. If it's this one ("dotted obelos"), then I hardly think it looks the same because of the position of the dots in relation to the slash, as well as the angle of the slash in relation to the base line. I think this is the closest in looks: ˙/. except that the dots should be the same size (and the same distance left and right). Adam78 (talk) 12:42, 6 June 2016 (UTC)[reply]

The Hungarian Wikipedia (hu:w:„Fordíts!” jel) uses this: ˙/ . (U+02D9, U+002F, U+200A, U+002E). The next-to-last character, U+200A, is a hair space. You have to do something like this or the computer reads it as a bad string. —Stephen (Talk) 20:56, 6 June 2016 (UTC)
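For anyone who wants to check the string character by character, here is a minimal Python sketch. It only assumes the four code points quoted in the comment above; the variable names are made up for illustration.

```python
import unicodedata

# The four code points quoted above: dot above, solidus, hair space, full stop.
CODE_POINTS = [0x02D9, 0x002F, 0x200A, 0x002E]

# Build the sign from its code points instead of pasting it,
# so the invisible hair space cannot get lost in copy and paste.
sign = "".join(chr(cp) for cp in CODE_POINTS)

# List each character with its code point and official Unicode name.
for ch in sign:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Listing the names makes it easy to see whether the hair space survived in a given entry.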

Wiktionary talk:Babel#Greenlandic_.28kl.29

It seems that (a) the Wikimedia #Babel system has a bug affecting Greenlandic, and (b) we're missing Template:User kl-0. See Wiktionary talk:Babel#Greenlandic_.28kl.29 for discussion. - -sche (discuss) 00:58, 8 February 2015 (UTC)[reply]

Automated flagging of missing Wiktionary entries

Hello! I am an information scientist and natural language complexity researcher at the University of Vermont, leading a project that predicts "missing" phrase-entries from a dictionary. This development only applies to dictionaries that include larger-than-word lexical objects (such as the Wiktionary). For example, I am able to generate shortlists of four-word phrases that are similar to those defined in the Wiktionary, which in fact are missing:

benefit of a doubt
keep an eye to
roll off the presses
one of a million
one upon a time
made up your mind
what time is new
down in the count
keep an eye for
...

These lists are ordered according to how likely they are to be meaningful (in need of definition).

Notice that some are completely absent idiomatic entries, like

roll off the presses,

which is similar to the extant, "roll off the tongue".

Many more are variants of existing metaphoric forms, like

keep an eye for,

which are still without reference or redirect.

I would like to add to the requested entries list on Wiktionary:

https://en.wiktionary.org/wiki/Wiktionary:Requested_entries

as part of this ongoing research project, mapping out and defining the greater, English lexicon of phrases.

As this could generate large lists of requested entries, I must ask, is this reasonable within the current framework of the Wiktionary system?

If not, would it be possible to create a separate access point through which I could make these shortlists public?

I am very interested in enhancing the breadth and depth of knowledge---already enormous---on the Wiktionary.

My service and interest in this is purely academic, and I offer it freely and openly.

Looking forward to this discussion :)

Sincerely, Jake Ryland Williams

---

jake[dot]williams[at]uvm[dot]edu http://www.uvm.edu/~jrwillia/

---

Hi. Please sign up with a user name, and then you can create subpages under your user page, like (for example) User:MyName/mypage1. I don't think that a new experimental project will be quite ready to post on WT:REE yet. Equinox ◑ 17:54, 8 February 2015 (UTC)[reply]

But many of your phrases are just, plain wrong :-

benefit of a doubt - benefit of the doubt

keep an eye to - keep an eye out

one of a million - one in a million

one upon a time - once upon a time

made up your mind - make up one's mind

what time is new ?

down in the count - down for the count

keep an eye for - see above

SemperBlotto (talk) 18:03, 8 February 2015 (UTC)[reply]

Some may be attestable alternative forms. More details about how the list was generated would help in knowing whether we would simply want to add alternative form entries for them en masse or whether they should be researched first using, for example, {{REEHelp}}. DCDuring TALK 20:33, 8 February 2015 (UTC)[reply]
roll off the press - OneLook - Google (Books • Groups • Scholar) - WP Library
roll off the presses - OneLook - Google (Books • Groups • Scholar) - WP Library

Hello again, and thank you all very much for your responses. Thanks Equinox---I have created a user account---and DCDuring---I have transported this conversation to my user page, enhancing it to a more full description. Please visit jakerylandwilliams and feel free to contact me with any questions or suggestions. As stated, I am very interested in working with the Wiktionary, and within whatever framework is deemed productive and acceptable. Best, Jake.

Edittools no longer working

Has anyone else found that Edittools no longer works? It appears in my UI when I'm in edit mode just as expected, and I can click on any of the items, but instead of inserting the clicked text at the location of the cursor in the textbox, the UI focus just ... vanishes. The blue outline on the textbox, indicating that the textbox is the active UI element, disappears, and nothing else is highlighted. I have to click within the textbox before I can type again.

This non-functionality first arose maybe a month ago. I had made no changes to my Edittools config, and something (I forget what) led me to think that it was a browser update issue (I had been using slightly-outdated Chrome 30-something), but updating Chrome didn't fix the issue. I decided to do some testing yesterday, and found the same problem under Chromium on Ubuntu, and on Firefox on Mac, leading me to conclude that the Edittools infrastructure must have changed somehow.

Any further information would be much appreciated. ‑‑ Eiríkr Útlendi │ Tala við mig 19:48, 9 February 2015 (UTC)

This kind of error is most likely caused either by broken/outdated personal JavaScript or one or more broken/outdated gadget(s). We've been seeing this on a number of wikis recently. I suggest you try disabling non-default gadgets and commenting out user scripts until you find that Edittools works again. This, that and the other (talk) 06:19, 10 February 2015 (UTC)[reply]

I have disabled almost all gadgets, deleted my common.js, and cleaned up most checked boxes in my per-browser preferences, but I still cannot add characters from the extended character set menus. Could this be Java-version specific, ie attributable to recent updates of these? DCDuring TALK 21:22, 10 February 2015 (UTC)[reply]

Can you check the webconsole of your browser and see if there is a javascript error? I had something like that: ReferenceError: insertTags is not defined. I think that "insertTags" may have been deprecated in the latest release, and it should normally work while showing "Use of "insertTags" is deprecated. Use mw.toolbar.insertTags instead." Maybe try to purge your cache. — Dakdada 17:14, 11 February 2015 (UTC)[reply]

Well, it did change, see phab:T85787. If purging does not solve your issue, open a bug report there. — Dakdada 17:22, 11 February 2015 (UTC)[reply]

I've purged and still get the insertTags error, but I'm not sure if the issue is with MW -- I suspect the problem is that our infrastructure here is outdated, as I dimly recall that Edittools is based on old code from Conrad Irwin. Last I mucked about with my own personal JavaScript settings for Edittools, the best practice at the time was to copy Conrad's code. Is there some MediaWiki code that we should be copying instead, or transcluding instead? Our own WT page discussing Edittools seems to be somewhat out of date, and I'm not sure where else to look. I'll poke around phab:T85787 later when I have more time. ‑‑ Eiríkr Útlendi │ Tala við mig 20:38, 11 February 2015 (UTC)

Apparently the issue can be resolved by adding a dependency on mediawiki.toolbar in the gadget definition. Cf. mw:Extension_talk:CharInsert#Character_insertion_stopped_working_after_1.25wmf14_rollout_52853 for details.

If anyone reading this knows how to do this, please implement the required change. I poked around in MediaWiki:Gadgets-definition, but I didn't see anything related to charinsert. ‑‑ Eiríkr Útlendi │ Tala við mig 08:57, 12 February 2015 (UTC)[reply]

Thanks for doing the research. I hope it gets implemented quickly. Now I can't even do a copy and paste from the Edittools character sets. I would need to use Unicode to get the characters. DCDuring TALK 14:27, 12 February 2015 (UTC)[reply]

The charinsert is implemented in MediaWiki:Edit.js, loaded by MediaWiki:Gadget-legacy.js (the first, default gadget). — Dakdada 16:20, 12 February 2015 (UTC)[reply]

Thank you for the pointers, Dakdada -- I've made the changes in the relevant files, and Edittools is now working for me. DCDuring, is it working for you now? ‑‑ Eiríkr Útlendi │ Tala við mig 21:29, 12 February 2015 (UTC)[reply]
@Eirikr, Darkdadaah: Thanks to both of you. I wish I knew enough to help. DCDuring TALK 21:53, 12 February 2015 (UTC)[reply]

Template:alternative form of

This template starts with a capital letter, whereas all other similar form-of templates appear to begin with a lowercase. Could someone please deal with this? This, that and the other (talk) 23:47, 10 February 2015 (UTC)

"all other similar form-of templates appear to begin with a lowercase" Such as...? Look at the templates in Category:Form-of templates: all of the ones I've checked so far begin with an uppercase letter. Some of them seem to have a parameter that allows you to render it in lowercase for whatever reason (using the template amid a definition instead of on its own line perhaps?). Bruto (talk) 01:58, 11 February 2015 (UTC)

Our whole set of non-gloss templates is not entirely consistent on whether to start with an uppercase or lowercase letter and end with a dot or nothing. It'd be nice to standardize. Since we generally (though a few object to this) begin English sense-lines with uppercase letters and end them with dots, while beginning other languages' sense-lines with lowercase letters and ending them without dots, perhaps the templates could even be set up to capitalize and punctuate based on the lang= parameter. - -sche (discuss) 18:51, 12 February 2015 (UTC)[reply]
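The idea of capitalizing and punctuating based on the lang= parameter could be sketched roughly as follows. This is Python rather than template or module code, and the function name and the "en" check are assumptions for illustration only, not how the existing form-of templates work.

```python
def format_form_of(text, lang):
    """Format a form-of gloss according to the convention described above.

    Assumption: English ("en") sense lines start with an uppercase letter
    and end with a dot; every other language starts lowercase with no dot.
    """
    if not text:
        return text
    if lang == "en":
        text = text[:1].upper() + text[1:]
        if not text.endswith("."):
            text += "."
        return text
    # Non-English: lowercase the first letter and drop any trailing dots.
    return text[:1].lower() + text[1:].rstrip(".")


print(format_form_of("alternative form of colour", "en"))
print(format_form_of("Alternative form of couleur.", "fr"))
```

A real implementation would also have to respect any existing override parameter for lowercase output, which this sketch ignores.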

Help with ff-noun

Need to request help to include a parameter in Template:ff-noun that would add the entry to a category for the indicated noun class. That is, with {{ff-noun|sg-nc|plural|pl-nc}}, to have this category generated: [[Category:Fula noun in class sg-nc]]. The object is to group entries for nouns by noun class. These new categories would then be subcategories of Category:Fula nouns. TIA for any help or pointers.--A12n (talk) 04:56, 11 February 2015 (UTC)[reply]

Maybe you could do it the same way as {{sw-noun}}? Are your needs any different? —Μετάknowledge (discuss/deeds) 08:28, 11 February 2015 (UTC)

Thx. Looks like that approach could be adapted. Is there a simpler way, taking the contents of the sg-nc field and putting it in the specified location in the category? (I'll need to read up on the coding, evidently.)

Well, I used a different system of categorising the noun classes, one that makes sense for Swahili but is not the numerical system standardly used by Africanist linguists. Besides other benefits, it greatly reduces what has to be typed into the template. That said, if you really want three parameters where the template itself is unable to predict anything and you must fill them all out, I can do that for you. —Μετάknowledge (discuss/deeds) 17:19, 11 February 2015 (UTC)

Thinking about this. Noun class names in Fula unlike Swahili (if I'm seeing the latter correctly) also have a function - so ki for instance is also a particle functioning as a determiner and an indicative depending on whether it is after or before the noun. So the {{ff-noun}} template is set up so you type in whichever of the 22 or so singular classes is appropriate (there are 4 plural classes but I still need to generate a template for plural Fula nouns). The other two parameters - the plural and the plural class - also need to be keyed in (no way to predict the plurals that I can see - ending can vary, and some initial consonants shift). So yes, if you could help that would be most appreciated.--A12n (talk) 04:49, 13 February 2015 (UTC)

Soundex search

This site demonstrates a Javascript function that generates a soundex code for a string. I assume that it is useful only within a given language. Couldn't we supplement our existing orthographic indexes (and our incomplete misspellings, IPA, and rhymes coverage) with a soundex index to enable search for terms (words?) the spelling of which is not correctly known? It would be nice if it were integrated into search, but it would first be nice to determine whether it would work and be useful at all.

Is it a good idea? What would be involved? DCDuring TALK 23:21, 11 February 2015 (UTC)[reply]

I see at w:Soundex that there are improvements over the original soundex system. DCDuring TALK 23:28, 11 February 2015 (UTC)[reply]
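To make the discussion concrete, here is a minimal Python sketch of the classic American Soundex rules as described at w:Soundex. It is not the JavaScript function from the linked site and not any of the improved variants; the function name is made up for illustration.

```python
# Digit groups of classic American Soundex.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(word):
    """Return the four-character Soundex code for an ASCII word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is counted again
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Words that share a code would land in the same bucket of a hypothetical soundex index, which is why the scheme only makes sense within one language and one script.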

For misspellings the w:Levenshtein_distance is actually a better approach. The search engine used by MediaWiki already supports this, you'll need to add ~ to the search term (fuzzy search). Jberkel (talk) 01:41, 19 February 2015 (UTC)[reply]

@Jberkel: Thanks a lot. It's wonderful that we have it already. Is what we have "tuned" for English? What scripts and languages does it work with? DCDuring TALK 03:28, 19 February 2015 (UTC)[reply]

The Levenshtein distance is language agnostic (in contrast to the Soundex/Metaphone group of algorithms). The implementation used in MediaWiki has full unicode support so should work with all scripts supported by that standard. – Jberkel (talk) 14:37, 19 February 2015 (UTC)[reply]
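For readers unfamiliar with the measure, here is a minimal Python sketch of the textbook dynamic-programming Levenshtein distance. It is only an illustration of the idea and is not the code the MediaWiki search engine uses.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("felid", "feline"))
```

Because it compares characters only, it works the same way for any script, but it also knows nothing about which letters sound alike.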

Well yes but Soundex is about sound, not writing or misspelling. Is it not what DCDuring asked (words for which we don't know the spelling, but an approximate pronunciation)? — Dakdada 16:17, 19 February 2015 (UTC)[reply]

True, it's not about sound. But looking at the references in the article, Soundex (and most derivatives) are optimised for English (or non-English words familiar to English speakers). It would be very hard to build a version of Soundex which works well with the majority of languages and scripts in use here. However, it would be interesting to see if the IPA data (where available) can be used to implement phonetic search. – Jberkel (talk) 17:01, 19 February 2015 (UTC)

Both sound and spelling are issues. Many misspellings, especially in English, are based on the sound. Hardly any ordinary users know IPA, so the only tool, short of asking at Info Desk or Tea Room, is to use conventional orthography as best one can. So: spelling matters, probably much more than anything else. But a Levenshtein or other distance would be more accurate if it "knew" whether the source of distance was a typo, or a scanno, or a thinko, or a pronunciation spelling (ie, a spelling intended to represent what was heard). For near-misses all of the above could be used to determine what the search engine offers the user, but a better focused list would be generated if the user could specify that pronunciation representation was the objective. A special interface to elicit better sound information from a user would be nice.

Any effort that worked for English would be a good start. For almost all searches we are likely to see the matrix language at least would be known and a secondary language could be guessed. DCDuring TALK 17:10, 19 February 2015 (UTC)[reply]

IPA can be used to search for sounds to some extent, at least as long as the user can type the sounds that he wants. I already did a tool like that for French (no fuzzy searches though), and I opted to use a virtual keyboard to type IPA symbols (see here). This approach can be found in other dictionaries like TLFi. The most difficult problem seems to be how to help the user type what he wants to find, rather than the search itself. — Dakdada 17:36, 19 February 2015 (UTC)

'#English

Neither {{head}} nor {{en-part}} works at '. Both result in this being displayed as the headword line: ’[[Category:English lemmas|]][[Category:English particles|]]. - -sche (discuss) 22:22, 12 February 2015 (UTC)[reply]

This is because apostrophes are stripped when making category sort keys. Of course in this case there is nothing left after that. I'm not sure what the best solution for this would be. The simplest, that I can think of, would be to skip creating a sort key altogether if the page name is only one character, but that would still break when someone creates something like ''. —CodeCat 22:28, 12 February 2015 (UTC)

sort=' solved it. — Ungoliant (falai) 22:30, 12 February 2015 (UTC)
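The failure mode, and one possible fallback, can be sketched in Python. Everything here is hypothetical: the function name, the set of stripped characters and the fallback rule are assumptions for illustration, not the logic of the actual headword module.

```python
def make_sort_key(page_name, explicit_sort=None):
    """Hypothetical sort-key routine that strips apostrophes.

    Assumption: if stripping leaves nothing (as for the page "'"),
    fall back to the unmodified page name instead of an empty key.
    """
    if explicit_sort is not None:
        return explicit_sort  # corresponds to passing sort= by hand
    stripped = page_name.replace("'", "").replace("\u2019", "")
    return stripped if stripped else page_name


print(repr(make_sort_key("don't")))
print(repr(make_sort_key("'")))
print(repr(make_sort_key("''")))
```

Falling back only when the stripped key is empty would also cover a page named '', which the "skip single-character names" idea would not.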

Thanks! - -sche (discuss) 22:44, 15 February 2015 (UTC)[reply]

Chinese classifier template

I'm not sure if this idea has been run by you guys before, but what do you think of the idea of having a template that generates the correct classifier(s) for each Chinese entry? (@Atitarev, CodeCat, DCDuring, Wyang Any input?) WikiWinters (talk) 11:12, 17 February 2015 (UTC)[reply]

Did I break anything?

Hi. I've been playing with some Modules recently, which is probably not healthy for Wiktionary. Anyway, I'm trying to generate categories for missing noun forms, and later will try to do the same for other parts of speech. I've fiddled with lots of modules, but the only fiddle that worked, much to my delight, was my one on Module:ca-headword. My edits to Module:en-headword , Module:pt-headword , Module:fr-headword , Module:gl-headword and Module:ru-headword did not have the desired effect, and I'm afraid I might have broken something. Modules, by the way, are really complicated things! --Type56op9 (talk) 17:46, 16 February 2015 (UTC)[reply]

It would be useful to have a page Help:Modules to explain how to write and use the damn things, you know. --Type56op9 (talk) 17:47, 16 February 2015 (UTC)[reply]

One of the lines of Help:Modules will be like "Do not touch anything that is used by thousands of entries if you do not know what you are doing", for sure... --Dixtosa (talk) 18:25, 16 February 2015 (UTC)[reply]

I'm sure CodeCat would be happy to help. DCDuring TALK 20:18, 16 February 2015 (UTC)[reply]

If I'm to help, I'm just going to revert it all. —CodeCat 20:24, 16 February 2015 (UTC)

Why's that? Can't you assist with the objective, provided it is expressed, of course? DCDuring TALK 21:20, 16 February 2015 (UTC)[reply]

The objective is for Wonderfool to continue creating form-of entries with his bot or through some other (presumably automated) means, even though there have been complaints about the mistakes he has been making. Since it doesn't seem he wants to hold himself accountable for his edits (if he did, then why does he circumvent blocks?), I've chosen to stay far away from this topic, and want to bear no responsibility if it causes more problems. Let someone else deal with it. —CodeCat 22:08, 16 February 2015 (UTC)

I appreciate the comments, CodeCat. You are right about everything - the objective is to enrich Wiktionary with form-of entries (semi-automated, using WT:ACCEL, in fact). It's a pity that modules are so complicated, because it means less of us are able to use them. I'll follow this topic closely, and play with modules some more, until I either figure them out or I give up. --Type56op9 (talk) 10:34, 17 February 2015 (UTC)[reply]

If you work on modules you unfortunately have to spend some time to learn how to program. If you're unsure what you're doing then you should try your changes with one module first (preferably sandboxed). Once everything works as expected apply the changes to the live module. As far as I can tell you just blindly copy-pasted code snippets around. Jberkel (talk) 01:13, 19 February 2015 (UTC)[reply]

If someone could tell me how to -- or where to find the docs telling me how to -- sandbox a module, or even to create a module in userspace for testing before bringing it out into mainspace, and how to invoke it either way, I would be very much obliged. --Catsidhe (verba, facta) 01:19, 19 February 2015 (UTC)

Everyone has their own sandbox module, yours is at Module:User:Catsidhe. You can create that and use as many subpages as you like. —CodeCat 01:30, 19 February 2015 (UTC)

Catsidhe and Type56op9 raise a valid point: we don't have good documentation around modules and the development approach in general. Everything feels quite ad-hoc and every module author does things a little bit differently. Wheels get reinvented. Code gets copied. It would be good to work towards a consensus on how certain things should be done. Wiktionary:Coding_conventions#Lua is not enough. – Jberkel (talk) 02:13, 19 February 2015 (UTC)[reply]

Spanish nouns without Template:es-noun

Hi there. How would one go about generating a list of Spanish nouns not including Template:es-noun? --Type56op9 (talk) 10:43, 17 February 2015 (UTC)[reply]

Well, I would take the contents of the Spanish nouns category together with "what links here" of the template and sort them together. Throw away all the entries that occur twice and Robert is your parent's brother. SemperBlotto (talk) 10:48, 17 February 2015 (UTC)[reply]
- That's exactly what I want to do! Only problem is, I don't know how to :(. Something like User:Mglovesfun/to do/English is what I had in mind. I'll ask User:Renard migrant. --Type56op9 (talk) 16:29, 17 February 2015 (UTC)[reply]
- If you can provide me with the two lists (offline) I can sort/merge them and find the missing ones for you. SemperBlotto (talk) 16:35, 17 February 2015 (UTC)[reply]

Or alternatively, you can ask the author of module:head to change it so that it categorizes just like you want. Or even better option is to change es-noun by yourself (not protected yay! :D) so that it does not categorize es-nouns and then get the list of new Spanish nouns. you may need to do massive null-edits on pages though. --Dixtosa (talk) 17:11, 17 February 2015 (UTC)[reply]

I think you can use AWB to compare lists (and possibly even to generate them from categories and whatlinkshere) even without being approved to save edits with it. - -sche (discuss) 22:08, 17 February 2015 (UTC)[reply]
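The list comparison SemperBlotto describes above amounts to a set operation. Here is a minimal Python sketch; the two input lists are made-up sample data standing in for the category listing and the "what links here" listing, which would have to be exported by some other means.

```python
def without_template(category_members, template_users):
    """Entries in the category that do not use the template."""
    return sorted(set(category_members) - set(template_users))


def occurring_once(category_members, template_users):
    """Merge both lists and throw away everything that occurs twice."""
    return sorted(set(category_members) ^ set(template_users))


# Hypothetical sample data, not real category contents.
spanish_nouns = ["casa", "gato", "libro", "perro"]
uses_es_noun = ["casa", "libro", "perro", "Template:es-noun/documentation"]

print(without_template(spanish_nouns, uses_es_noun))
print(occurring_once(spanish_nouns, uses_es_noun))
```

The plain difference gives the list asked for; the merged version also surfaces pages that use the template without being in the category, which may or may not be wanted.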

Urgent help please - boxing spammer

A very persistent spammer keeps adding "mywikibiz" rubbish to pages. He was using Talk:boxing until I protected it, and is now using other pages. He is a human, not a bot, and responds aggressively to people trying to stop him. He has many IPs. Can someone prevent "mywikibiz" being inserted into articles? -- that is the only way to stop him spamming his site. I tried adding it to a filter but I must have done it wrong. Thanks. Equinox ◑ 20:18, 17 February 2015 (UTC)[reply]

Done. --Yair rand (talk) 23:30, 17 February 2015 (UTC)[reply]

They still seem to be getting through, on kickboxing and martial art now. —CodeCat 22:19, 18 February 2015 (UTC)

The two bad edits that CodeCat fixed were both from the 208.54.32.xxx range. Equinox, could you tell us if this spammer consistently uses this range? If so, maybe we just block this range for a few days / weeks from making anon edits? ‑‑ Eiríkr Útlendi │ Tala við mig 22:38, 18 February 2015 (UTC)[reply]

IPs used by the spammer so far: 172.56.0.109 172.56.0.112 172.56.0.166 172.56.1.82 172.56.1.135 172.56.1.179 172.56.32.69 208.54.64.175 208.54.64.164 208.54.64.188 Equinox ◑ 17:05, 19 February 2015 (UTC)[reply]
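To see how the addresses listed above cluster, here is a minimal Python sketch using the standard ipaddress module. The grouping prefix lengths are an arbitrary choice for illustration, and nothing here says which range, if any, should be blocked.

```python
import ipaddress
from collections import Counter

# The addresses exactly as listed in the comment above.
ADDRESSES = """172.56.0.109 172.56.0.112 172.56.0.166 172.56.1.82
172.56.1.135 172.56.1.179 172.56.32.69 208.54.64.175 208.54.64.164
208.54.64.188""".split()


def group_by_prefix(addresses, prefix_len):
    """Count how many addresses fall into each network of the given size."""
    counts = Counter()
    for text in addresses:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{text}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


def any_in_range(addresses, cidr):
    """True if at least one address lies inside the given CIDR range."""
    network = ipaddress.ip_network(cidr)
    return any(ipaddress.ip_address(a) in network for a in addresses)


for prefix_len in (16, 24):
    print(prefix_len, dict(group_by_prefix(ADDRESSES, prefix_len)))
print(any_in_range(ADDRESSES, "208.54.32.0/24"))
```

On this list, none of the addresses fall in 208.54.32.0/24, so a block of that range alone would not cover them.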

Blank page

The page share is currently totally blank. Does anyone have an idea of the problem? — Automatik (talk) 14:39, 18 February 2015 (UTC)

It could be an ad blocker. —CodeCat 15:05, 18 February 2015 (UTC)

Exactly, thank you! AdBlock disabled for this page. — Automatik (talk) 15:21, 18 February 2015 (UTC)[reply]

Kassadbot still not running?

There are now over 12,000 entries in Category:Requests for autoformat. SemperBlotto (talk) 08:44, 21 February 2015 (UTC)[reply]

Oh is there still no replacement? Sheesh. I've been, uh... in inpatient treatment for a while (borderline personality disorder sure is fun) and I got a new PC and lost maybe half my files due to a less-than-reliable USB hard drive. I might give it a try if I can set everything up again. -- Liliana • 10:36, 21 February 2015 (UTC)[reply]

Wikisaurus change

Well, since Wikisaurus has been proposed as a tool for finding synonyms, antonyms, etcetera: instead of adding synonyms of synonyms, shouldn't all synonyms be linked together?

For example, if I add a synonym entry to cat as 'feline', then shouldn't Wikisaurus create an entry 'feline' if it doesn't exist, and add 'cat' plus all synonyms and antonyms of cat? The reason for this is that it might be easier to manage all the synonyms in a 'collective' space, and it might increase the size of Wikisaurus much faster. 181.50.196.58 18:28, 21 February 2015 (UTC)

If I understand what you're proposing, I would say it's not a good idea. A big problem with Wikisaurus is that it's not always obvious when you're creating an entry whether there's already a Wikisaurus entry that covers it. If I put felid as a synonym for cat, Wikisaurus:felid would duplicate Wikisaurus:feline. Also, WS entries are often based on subtle semantic distinctions that automated methods wouldn't be able to handle. The likely result of an automated method would be lots of single-member WS entries that would just add clutter and confusion. Chuck Entz (talk) 18:51, 21 February 2015 (UTC)[reply]

Redirects would solve the problem of people creating Wikisaurus:felid because they don't know about Wikisaurus:feline. Perhaps someone could even create a gadget similar to the one used on rhymes pages, which would create redirects automatically when a new synonym was added to a Wikisaurus page (i.e. if I add foobar to Wikisaurus:feline, the gadget would create Wikisaurus:foobar as a redirect to Wikisaurus:feline). - -sche (discuss) 19:05, 21 February 2015 (UTC)[reply]

Redirects are unnecessary since (a) the user can use the search bar present at the top of each Wikisaurus entry to find whether a WS page already contains the term, and (b) the mainspace Synonyms section for each word should eventually link to the corresponding Wikisaurus pages (I have now expanded felid to link to WS:feline). --Dan Polansky (talk) 14:56, 22 February 2015 (UTC)[reply]

I agree with Chuck Entz. I add that, generally speaking, most synonyms are not 100% equivalent, and this can be addressed in Wikisaurus, but not automatically. And Wikisaurus should not address only synonyms, antonyms... but should be a true thesaurus. @-sche: redirects are a good idea, but this cannot be automatic: many words have several meanings, and might appear in several Wikisaurus pages. Lmaltier (talk) 19:11, 21 February 2015 (UTC)[reply]

Template:sa-verb-pres

I wonder if anyone could fix Template:sa-verb-pres? It has extra "}}". (See हन्ति for example) --KoreanQuoter (talk) 12:13, 22 February 2015 (UTC)[reply]

Never mind, I think I got it. --KoreanQuoter (talk) 12:50, 22 February 2015 (UTC)[reply]

Module:en-headword

I was fiddling with a Module again. It didn't work. Could someone check it, and correct it, please? --Type56op9 (talk) 15:16, 23 February 2015 (UTC)[reply]

software database dictionary

I have invented a word game and would like a free concise dictionary, in the form of a downloadable software database file, for inclusion within it. Is there such a file which can be used commercially? The word list I am using for the game is SCOWL and I am hoping to get a dictionary which will contain all the words that are in that word list, so that when a word (in the form of a link) is pointed to and clicked, the player will be directed to a short meaning of it.

Thanks Paul

See Help:FAQ#Downloading_Wiktionary. You'll have to run a manual comparison against SCOWL; also be aware that we are fairly inclusive of unusual and offensive words: your players might object to some of them if they are not in mainstream dictionaries etc. Equinox ◑ 22:02, 23 February 2015 (UTC)[reply]
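The comparison against SCOWL could be scripted rather than done by hand. The following is a minimal Python sketch, assuming you have already reduced both sources to plain lists with one word or entry title per line; the helper names load_words and compare_word_lists are hypothetical, and extracting the titles from the dump is not shown.

```python
def load_words(lines):
    """Turn an iterable of lines into a set of non-empty, stripped words."""
    return {line.strip() for line in lines if line.strip()}


def compare_word_lists(scowl_words, entry_titles):
    """Return (covered, missing): SCOWL words with and without a matching entry title."""
    covered = scowl_words & entry_titles
    missing = scowl_words - entry_titles
    return covered, missing


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two word lists.
    scowl = load_words(["cat\n", "dog\n", "zyzzyva\n"])
    titles = load_words(["cat\n", "dog\n", "feline\n"])
    covered, missing = compare_word_lists(scowl, titles)
    print(sorted(covered), sorted(missing))
```

In practice load_words can be given an open file object for each list. The comparison is case-sensitive as written, which matters because entry titles here distinguish capitalized and lowercase forms.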

Wikidata experiment with taxon hypernyms

(Pinging people who might be interested, but might not check Grease pit very often. Sorry for ping spam.)

@DCDuring @Chuck Entz @SemperBlotto @I'm so meta even this acronym @JohnC5 @Equinox

Messed around today with Wikidata and Lua. Thought I'd share in case anyone might be able to use the output in some way, or wanted to push it along, or just wanted to see what might be possible if they ever enable Wikidata on Wiktionary.

So I was considering creating some sort of bot to generate the "hypernym" section for species and other taxon entries here (and also pondering the mess on Wikipedia which is the Taxobox template, which is a related problem), and I thought it'd be far better to have a template with a Lua script that did it all instead of running a bot. Was going to just spend an hour or two on it in the morning, but ended up spending most of the day getting it working.

Because Wiktionary is disconnected from Wikidata, the script will only run on Wikidata's own wiki right now, but some day they might connect us to Wikidata and enable "access to arbitrary items".

It outputs something you "could" paste into Wiktionary. It takes a "Q" number of a taxon's Wikidata item, and outputs the wikitext for the hypernym section.

Here's a sample of the kind of output (so far):

Octopoda (hypernyms) {{#invoke:Wiktionary-taxon|hypernym|Q40152}}

(order): Biota; Eukaryota - domain; Unikonta - subdomain; Opisthokonta; Metazoa - subkingdom; Epitheliozoa; Eumetazoa - subkingdom; Bilateria - infrakingdom; Protostomia - branch; Lophotrochozoa - superphylum; Mollusca - phylum; Cephalopoda - class; Coleoidea - subclass; Vampyropoda - superorder

The dodo: (hypernyms) {{#invoke:Wiktionary-taxon|hypernym|Q43502}}

(species): Biota; Eukaryota - domain; Unikonta - subdomain; Opisthokonta; Metazoa - subkingdom; Epitheliozoa; Eumetazoa - subkingdom; Bilateria - infrakingdom; Deuterostomia - superphylum; Chordata - phylum; Craniata; Vertebrata - subphylum; Gnathostomata - infraphylum; Teleostomi; Tetrapoda - superclass; Reptiliomorpha; Amniota; Reptilia - class; Sauropsida - class; Eureptilia; Romeriida; Diapsida; Lepidosauromorpha - infraclass; Lepidosauria - superorder; Squamata - order; Sauria - suborder; Archosauromorpha - infraclass; Archosauriformes - infraclass; Archosauria; Avemetatarsalia; Ornithodira; Dinosauromorpha; Dinosauriformes; Dinosauria - superorder; Saurischia - order; Eusaurischia; Theropoda; Neotheropoda; Averostra; Tetanurae; Orionides; Avetheropoda - order; Coelurosauria; Maniraptoriformes; Maniraptora; Pennaraptora; Paraves; Avialae; Aves - class; Columbiformes - order; Raphidae - family; Raphus - genus

(more examples)

So while the script can't be used directly on Wiktionary yet, you could copy-paste the output into Wiktionary, but you would probably want to trim it down first. Obviously it still needs some work. Mostly it needs some added heuristics to choose which ranks to ignore. But thought I'd share it so far anyway.

You can try editing/previewing this with other species/taxa here: d:User:Pengo/hypernym, or see the module here: d:Module:Wiktionary-taxon. Will be glad if it can be used in its current state.

Happy editing. Pengo (talk) 06:21, 26 February 2015 (UTC)[reply]

The proliferation of names, both ranked and unranked, for taxonomic clades and the unsettled relationship among them makes keeping track of relationships hard. It also makes keeping up sometimes counterproductive for dictionary users, who are generally not reading works that are up-to-the-minute in this regard. The "correct" placement and circumscription of a taxon is often provisional for years or decades and is sometimes controversial, with multiple hypernymic and hyponymic relationships being in use for some time. Most of the existing sources of taxonomic information have a hard time keeping track of the information for genus and species, let alone higher and lower taxa.

Even Wikispecies and English Wikipedia often disagree, sometimes without acknowledgement in Wikipedia of controversy. Wikispecies is particularly bad at recognizing multiple placements and circumscriptions, doing so only for the "highest" taxa. Commons attempts to reconcile them. The non-WMF external sites that try to have comprehensive coverage of many ranks or clades, firstly, do not actually have comprehensive coverage, secondly, rarely present controversy, and, thirdly, lag behind specialized websites, which are numerous, but often relatively short-lived (10 years being "long" and I'm not just talking about web addresses).

Thus, the grand project of presenting the apparently straightforward data structure of taxonomy requires a huge effort simply to keep track of the twists and turns of classification, and may miss the mark in presenting how the authors our readers are actually reading have used taxonomic terms.

I have no particular solutions to the problem, other than including links to as many outside sites as possible that cover this kind of thing. I wouldn't know how to usefully present multiple discrete circumscriptions (hyponyms) and placements (hypernyms) of taxa (some kind of diffs?). I don't want to discourage any work in this area, but I expect that there will be much more enthusiasm for working on the programming for the simplified snapshot of the latest taxonomy than for maintaining the data or reflecting the history and diversity of opinion. DCDuring TALK 14:32, 26 February 2015 (UTC)[reply]

Higher-level taxa tend to be less stable, since they're more abstract. Even when there's no question as to the branches, different taxonomists may represent them using different ranks: one may see a family with subfamilies, while another may see a superfamily with families, a family with tribes, or even an order with suborders. DNA and cladistic analysis don't always clear things up, since one study may focus on specific mitochondrial genes, while another may look for transposon sequences within nuclear DNA; choice, weighting and coding of features, choice of outgroup, and various other differences in methodology can lead to radically different trees from one study to the next. These will eventually get sorted out, but things are mostly in an unsettled, preliminary stage for the near future. These are exciting, but confusing times.

As for filtering algorithms: a lot of it is context within the larger structure- nodes that have sisters should be shown. Family, genus and species are always of interest, and often orders, classes, divisions/phyla and kingdoms. When there are multiple unbranched levels, omit prefixed ranks: orders, but not suborders or infraorders, families, but not superfamilies or subfamilies, etc. Subgenus is especially awkward, since it comes between the two parts of the binomial- so omit it whenever possible. I hope this helps. Chuck Entz (talk) 15:26, 26 February 2015 (UTC)[reply]
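The rank-based part of these rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ladder is taken to be a list of (name, rank) pairs with rank None for unranked clades, the names KEEP_RANKS and filter_ladder are hypothetical, and the "nodes that have sisters" rule is left out because it needs child-node data. The sample ladder is the Octopoda output quoted earlier in this thread.

```python
# Unprefixed ranks kept by this sketch (family, genus and species always,
# plus order, class, division/phylum and kingdom).
KEEP_RANKS = {"kingdom", "phylum", "division", "class", "order",
              "family", "genus", "species"}


def filter_ladder(ladder):
    """Keep only taxa whose rank is an unprefixed major rank.

    `ladder` is a list of (name, rank) pairs ordered from highest to lowest
    taxon. Prefixed ranks (sub-, super-, infra-) and unranked clades
    (rank None) are dropped.
    """
    return [(name, rank) for name, rank in ladder if rank in KEEP_RANKS]


# Hypernym ladder for Octopoda as output by the module above.
OCTOPODA = [
    ("Biota", None), ("Eukaryota", "domain"), ("Unikonta", "subdomain"),
    ("Opisthokonta", None), ("Metazoa", "subkingdom"), ("Epitheliozoa", None),
    ("Eumetazoa", "subkingdom"), ("Bilateria", "infrakingdom"),
    ("Protostomia", "branch"), ("Lophotrochozoa", "superphylum"),
    ("Mollusca", "phylum"), ("Cephalopoda", "class"),
    ("Coleoidea", "subclass"), ("Vampyropoda", "superorder"),
]

if __name__ == "__main__":
    for name, rank in filter_ladder(OCTOPODA):
        print(f"{name} - {rank}")
```

Using a whitelist of ranks rather than testing for the prefixes sub-, super- and infra- also drops odd ranks such as "branch" and "domain"; whether that is desirable is a judgment call.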

@DCDuring Ultimately the goal, if I were to spend way too much more time working on this, would be to make it resemble the existing lists, but share the maintenance with the many other WMF projects which use taxonomies.

Maintaining taxonomy data is happening separately already on Commons and Wikispecies and every wikipedia and wiktionary. I don't expect all the projects to switch to using Wikidata tomorrow or any time soon though, but ultimately it could only be less work to do so.

The conflicting taxonomies thing always comes up when talking about Wikidata and taxonomies. The idea of using Wikidata seems to quickly get shouted down because en.wiki needs to do its taxoboxes differently from fr.wiki (I have to admit, I've never worked out what the specific disagreements/differing views are actually about, but I accept they're valid).

However the problem should be solvable. Wikidata might be centralized, but it allows multiple, "conflicting" data items, which can be tagged with their source and dates, and other such things. If multiple taxonomies were imported into Wikidata, it should be possible to have one project pick one set of preferred sources, and have another pick another, but both still use the same data source, the same code, and use the data in the areas where there isn't controversy. A simplified taxonomy should also be possible, perhaps borrowing the IUCN's red list, where the focus appears to be on large familiar groupings of species more than on accurate cladistics (e.g. it doesn't place birds under reptiles). So the projects could be much more internally consistent. Another project could leverage the conflicting viewpoints and choose to present either one or both. (Yes, the job of working out how to display it best is difficult too, but at least it might become possible to find a new way to display information and actually apply it to existing data)

Hebrew Wikipedia is currently using Wikidata for its taxoboxes, but due to the current limitations on accessing Wikidata from Wikipedia, the tree can't be recursively climbed like I've done here, and instead each taxon in Wikidata needs its own links to some limited set of higher taxa. The guy who made the module presented it on a talk page to en.wiki two years ago but if anyone was enthusiastic about it, they hid it well. The responses were about it not handling weird edge cases, and about how it spelled the end for the all important English/French taxonomy divide, so the dev just went back to he.wiki and took his templates with him. No mention of Lua was made for Taxoboxes again since (in my limited search anyway).

Anyway, these experiments here are just from the data that was already in Wikidata, and as far as I can tell, no one's actually attempted to view it from bottom to top like this before, let alone make it presentable. But it seems there's a good amount of taxonomy data already imported in Wikidata.

Some day, I imagine it might be possible for the user to change the timeline on a taxobox to choose which era's taxonomy to view, or to find some way to automatically list alternate taxonomies on Wiktionary, etc. The main thing here for me is the possibility of actually separating data and presentation.

Populating the data is certainly a huge task too, as you say, especially for anything even slightly historical. But if the various projects which use taxonomy data can work together, and only have to agree on where data has come from, and can decide separately which to display, building something that surpasses the existing systems should be achievable relatively quickly. It doesn't have to have everything, it just needs to be better than what's existing.

That said, my tests here are very simple, and largely an experiment to see what's possible. The algorithm just picks the first "parent taxon" listed and repeats, without any smarts as yet. It could be interesting to find some area where taxonomies disagree and attempt to get that disagreement stored in Wikidata and add a switch to the module to allow flipping between them, but it's all academic at this stage, especially as it can't even run properly anywhere but on Wikidata's own wiki. It was really just meant to be a brief distraction to answer a "would that work?" kind of question, but is worth thinking about for some time in the distant future. Pengo (talk) 16:58, 26 February 2015 (UTC)[reply]
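The "pick the first parent taxon and repeat" approach can be illustrated outside Wikidata. The following Python sketch is not the Lua module itself: it assumes the parent links are available as a plain dictionary, and the function name climb, the max_depth limit and the toy data are all hypothetical.

```python
def climb(taxon, parents, max_depth=200):
    """Follow the first listed parent of each taxon until none is left.

    `parents` maps a taxon name to a list of its parent taxon names.
    Returns the hypernym ladder ordered from the highest taxon reached
    down to the immediate parent of `taxon`.
    """
    ladder = []
    seen = {taxon}
    current = taxon
    while len(ladder) < max_depth:
        candidates = parents.get(current, [])
        if not candidates:
            break
        current = candidates[0]  # no smarts: always take the first parent listed
        if current in seen:      # guard against cycles in the data
            break
        seen.add(current)
        ladder.append(current)
    ladder.reverse()
    return ladder


# Hand-made toy data, not a dump of Wikidata. "Aves" lists two parents to
# show that only the first one is ever followed.
TOY_PARENTS = {
    "Raphus cucullatus": ["Raphus"],
    "Raphus": ["Raphidae"],
    "Raphidae": ["Columbiformes"],
    "Columbiformes": ["Aves"],
    "Aves": ["Avialae", "Tetrapoda"],
}

if __name__ == "__main__":
    print("; ".join(climb("Raphus cucullatus", TOY_PARENTS)))
```

A switch between competing taxonomies would amount to replacing candidates[0] with a choice based on the source attached to each parent statement.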

@Chuck Entz Those rules are pretty good and do help, thanks. As far as I can tell, there's no way to easily find child nodes in Lua/Wikidata right now, so the amount of branching is impossible to tell. Another good reason to go back to just writing code on my local machine where there aren't so many arbitrary limitations. :) Pengo (talk) 16:58, 26 February 2015 (UTC)[reply]

We already have some waste of time and needless confusion for our users in presenting in an entry a simple ladder of one-child taxonomic hypernyms in the same way as a branching structure (trees). But it is not always easy to tell whether a ladder will remain a ladder or become a tree, except perhaps by the length of time that it has remained a ladder.

I am already in the process of eliminating mention of subfamilies, supertribes, tribes, and subtribes from the hypernymic portion of the "definition" (in {{taxon}}) of genera and subgeneric taxa and substituting families, which tend to be more meaningful to non-specialists and somewhat more stable, notwithstanding the all-too-frequent conversion of families to subfamilies (and vice versa) that Chuck refers to. I am also substituting family for genus in subgeneric names, especially, subgenus and species.

I had also determined to limit the display of potentially long sequences of taxonomic hypernyms to one sequence leading to some taxon that has a recognizable connection to an English common name, eg, Plantae, Aves, Tetrapoda, Mammalia, Reptilia, Insecta, Crustacea, Mollusca, which hopefully is also stable. In contrast, the dodo sequence above is an example of a sequence that is probably not particularly helpful to a typical user. It conveys merely the idea that taxonomic classification is well developed, at least at the levels above Aves. I would be perfectly happy to leave to others the question of how to present taxonomic data above the rank of order, or even family, in entries above the level of genus.

It might be useful to rely on Wikidata for the presentation of complete (ie, like that of dodo above) taxonomic hypernyms via external links and limit ourselves to taxonomic names proximate to the headword. DCDuring TALK 18:07, 26 February 2015 (UTC)[reply]

I'd be curious how the ladder would look if I limited it to taxa which have some number of common names across languages. That would be relatively easy to do. I agree that displaying everything above "Aves" like in this example has very limited usefulness. I'll have a go at incorporating some of the suggestions some time. Thanks for the feedback. At the very least it would be nice if I could make something that could stand in for a human-edited list. Pengo (talk) 09:08, 28 February 2015 (UTC)[reply]

I apologize for the less-than-clear use I made of ladder (contrasted with tree). I was referring to the cases for which a taxon, say, a species, is the sole species for a sequence of higher taxa, eg, genus, family, order. An extreme example is Ginkgo biloba, which is the sole (known extant) member of genus Ginkgo, family Ginkgoaceae, order Ginkgoales, class Ginkgoopsida, division Ginkgophyta. This sequence is an unbranching portion of the taxonomic tree of life.

Ah yep. Monotypic taxa. Gotcha. I just latched onto it because I've been looking for a term for the sequence of hypernyms (to use internally within code, which I'm now calling ladder objects :) ). Pengo (talk) 04:19, 1 March 2015 (UTC)[reply]

This implementation seems like a fairly bad idea to me, because it's a rather outdated view of taxonomy. The biosciences have long since moved into the world of cladistics, wherein taxa are simply hierarchical clades, instead of having subjective labels for each level (e.g. phylum, order, etc) as in the Linnaean system. If we're going to have taxonomic nomenclature, we may as well not categorise it in a way that is in the process of becoming obsolete. —Μετάknowledge^{discuss/deeds} 05:29, 3 March 2015 (UTC)[reply]
Sure, but we are serving a general audience. And, in any event, even most biologists, including systematic taxonomists of almost every school, tend to conserve names. Most of those names retain suffixes that are indications of rank. Take a look at, say, the Angiosperm Phylogeny Website. They have replaced superordinal names (class, etc) with clade names of their own devising (though strongly reminiscent of older-style names), but mostly retained order, family, and subfamily names, no matter how much membership and placement may be revised, and whether or not clades, often unnamed, are inserted between ranked names. Similar practices prevail at the Tree of Life web project. In any event lots of the names in Pengo's listing for Dodo appear unlabeled, ie, have no rank assigned. In my listings they appear as "clades". In principle all retained taxa will eventually be monophyletic, ie, clades. But in paleontology, for example, there is often not enough evidence to make firm assignments of that kind and there remains a need to group and classify specimens, using morphological features only. The Paleontology database tends to retain older taxonomic names to an even greater extent than APG and ToL.

Evidently the process that will lead to the obsolescence of the names that biologists have been using is not actually proceeding quickly enough to render existing names and ranks completely useless.

Finally, there will always be some use for retaining definitions based on obsolete concepts and classifications to make it possible to understand older literature and to help folks who are decades removed from their biology classes to make connections between the names they learned and current ones. DCDuring TALK 06:01, 3 March 2015 (UTC)[reply]

It depends what level you're talking about. As I said, higher levels are very abstract and prone to constant reinterpretation and rearrangement- but that's true whether you use clades or ranked hierarchies. Today's clades are tomorrow's polyphyletic relics: remember the edentates? The insectivores? The ungulates? The dicots? Who knew that whales were even-toed ungulates, or that termites were cockroaches?

The closer you get to the species level, though, the more stable taxa are, the more grounded they are in objective reality, and the easier it is to fit them into traditional named hierarchies. I doubt that the levels covered by the taxonomic codes (species-, genus- and family-group ranks) are going to disappear anytime soon. Sure, there are often unranked clades in between, but they're additions, not replacements.

The filtering criteria I suggested shouldn't be applied the same at lower and higher ranks: the preference for un-prefixed ranks only makes complete sense in the family-, genus-, and species-group levels. At the higher levels, the preference for ranks with sister nodes is more important- the first is Linnaean, the second cladistic.

Regardless of the filtering, though, there will always be hierarchies. It doesn't matter whether you call them clades or taxonomic ranks, they're still nodes in a tree structure. I don't see anything wrong with giving them rank names, as long as the actual structure is reflected. Chuck Entz (talk) 14:28, 3 March 2015 (UTC)[reply]

Relatedly and alternatively, it would be useful to be able to generate clade diagrams on demand from well-maintained, complete taxonomic data ("WMCTD"). I borrowed {{Clade}}, which does that, from WP and applied it in a single entry: Ornithodira. Populating it automagically from WMCTD would be wonderful. How far is Wikidata from being/having WMCTD? DCDuring TALK 15:49, 3 March 2015 (UTC)[reply]

@DCDuring I thought a cladogram would be excessive, but it doesn't look too bad, or take all that much space really.

I nudged along the discussion of Wikidata "arbitrary access", and there's been a trickle of discussion on the Wikimedia issue tracker for the last week or so (phabricator T49930). It's been added to the roadmap, although currently tagged as "unscheduled", so when something will actually happen, and how well maintained the taxonomic data will be once it does, is anyone's guess. Pengo (talk) 05:58, 7 March 2015 (UTC)[reply]

When one considers that a Wiktionary-resident cladogram could appear hidden until wanted, or a Wikidata-based cladogram could be produced on demand in a pop-up window, the visual space taken need not be much of an issue. The new style of systematics makes "coordinate terms" an inadequate label for useful semantic relationships. DCDuring TALK 13:10, 7 March 2015 (UTC)[reply]

Question 3.25

This question was posed a few days ago. The question and the answer to it are as follows:

Question

"I have invented a word game and would like a free concise dictionary in the form of a downloadable software database file for inclusion within it. Is there such a file which can be used commercially? The word list I am using for the game is SCOWL and I am hoping to get a dictionary which will contain all the words that are in that word list, so that when a word (in the form of a link) is pointed to and clicked, the player will be directed to a short meaning of it.

Thanks Paul"

Answer

"See Help:FAQ#Downloading_Wiktionary. You'll have to run a manual comparison against SCOWL; also be aware that we are fairly inclusive of unusual and offensive words: your players might object to some of them if they are not in mainstream dictionaries etc. Equinox ◑ 22:02, 23 February 2015 (UTC)"

I forwarded the answer to my programmer who replied as follows:

"As far as Wiktionary goes, it's not in text format. Another thing is that it's a massively big database. So, I am afraid, it's not feasible to use something like Wiktionary. We need to find someone who provides such a dictionary in text (.txt) format. It should also be concise in size: say, a maximum of 5 to 8 MB."

Can anyone help me in finding such a dictionary? Thanks for your help thus far.

Paul

multi-stream bz2 Wiktionary dumps

How am I supposed to use multi-stream bz2 Wiktionary dumps? I have downloaded the following files:

   enwiktionary-20150224-pages-articles.xml.bz2
   enwiktionary-20150224-pages-articles-multistream.xml.bz2
   enwiktionary-20150224-pages-articles-multistream-index.txt.bz2

In the dump page (http://dumps.wikimedia.org/enwiktionary/20150224/) it says that *-multistream.xml.bz2 is in multiple bz2 streams, 100 pages per stream. The *-multistream-index.txt.bz2 file contains a list of all the titles of all the pages. Each line is in the format

   num1:num2:title_string

It seems to me that num1 is the id for a segment, num2 is the id of the page with the title title_string. Since the entire *-multistream.xml.bz2 file is too big when decompressed, I want to only decompress one of the segments of the file to retrieve the page that I'm interested in. Is there a way to do that? I don't see the point of the multi-stream bz2 file if it's impossible to extract only a part of it.

Thanks GA

If you don't get an answer here, I suggest asking on Wikipedia, which has a larger base of contributors and hence a larger base of technically-adept contributors. I imagine it should be straightforward to take knowledge of how to decompress part of a Wikipedia dump and apply it to a Wiktionary dump. - -sche (discuss) 00:56, 1 March 2015 (UTC)[reply]

num1 is the byte offset into the -multistream.xml.bz2 file, num2 is the page id. It's not straightforward to do this with command line tools; one approach would be the following:

curl http://dumps.wikimedia.org/enwiktionary/20150224/enwiktionary-20150224-pages-articles-multistream.xml.bz2 -r 39191316- | bzcat | less

This would fetch the data from offset 39191316 and then decode it. For a local file you could do: dd if=enwiktionary-20150224-pages-articles-multistream.xml.bz2 bs=1 skip=39191316 | bzcat (assuming you use a Unix-based system). The problem with this approach is that bzcat will keep decoding past the end of the first stream, so you'll end up with more data than needed. I created a small Python script to show how you would just extract the data you need. Jberkel (talk) 14:10, 2 March 2015 (UTC)[reply]
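For readers without access to the linked script, decoding a single stream can be sketched with Python's standard bz2 module. This is not Jberkel's script, just a minimal illustration of the same idea; the function names read_stream and parse_index_line and the chunk size are assumptions.

```python
import bz2


def parse_index_line(line):
    """Split an index line 'offset:page_id:title'; titles may themselves contain colons."""
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path, offset, chunk_size=64 * 1024):
    """Decompress exactly one bz2 stream starting at byte `offset` of the file at `path`."""
    decomp = bz2.BZ2Decompressor()
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # BZ2Decompressor stops at the end of its stream and sets .eof,
        # so nothing from the following streams is decoded.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the stream did (truncated input)
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

With the local file from the example above, read_stream("enwiktionary-20150224-pages-articles-multistream.xml.bz2", 39191316) should return the XML for that one group of pages, which can then be searched for the title or page id taken from the index line.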