Wiktionary talk:Statistics

"good" and "bad" entries

Can we come up with different adjectives here? I think "good" and "bad" may be significantly misleading to the uninitiated. I had proposed "interesting" and "uninteresting", but Dangherous didn't like them. —scs 00:08, 30 June 2006 (UTC)

How about describing them as "entries with wikilinks" and "entries without wikilinks".
Also, how about removing the "mostly redirects" comment? It used to be true, but I'm not so sure that it is, now. --Connel MacKenzie T C 05:36, 30 June 2006 (UTC)
It's not about wikilinks... is it? They are mostly redirects I guess, but up to now I haven't found any decent description of what is considered an entry and what not. Since we do have a huge amount of redirects, I expect them to be the majority. Now, "good" and "bad" are the terms that have always been used. Never mind, though, it's just a detail. Call them "empyreal" and "purgatorial" if you like. — Vildricianus 08:46, 30 June 2006 (UTC)

scrunch up a little

For all namespaces after NS:1=Talk, why not just combine two rows into one, and add a "Talk" column for them? (I'm tempted to suggest that all except the subtotals should have Show/hide type auto-hiding.) --Connel MacKenzie T C 05:38, 30 June 2006 (UTC)

The show/hide crap will make it complicated, but feel free to play around with the table. — Vildricianus 08:46, 30 June 2006 (UTC)

Take after the French

This page would be more interesting/helpful if it contained more information in an easier to use format, such as fr:Wiktionnaire:Statistiques. Jade Knight 19:43, 23 October 2006 (UTC)

Spanish and English statistics

Curiouser and curiouser... we now have more Spanish words than English ones. Beobach972 03:15, 24 October 2006 (UTC)

Right - on this iteration I did not exclude the "form of" templates. --Connel MacKenzie 03:16, 24 October 2006 (UTC)

Detail

I'm surprised to have gotten so little feedback on the "Detail" section. Perhaps the explanation is clear enough? Honestly, I expected somebody to ask why the numbers (say, for English) don't add up to get "Total definitions." (The answer is that "real definitions" is exclusive of the others, but something can count as an "inflected form" and as "slang" while actually being only one definition line.) I also kind of expected someone to ask why "total language sections" is so much higher than "real definitions" and so very much lower than "total definitions." I guess that is self-evident? --Connel MacKenzie 20:48, 23 May 2007 (UTC)
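
The counting rule described above can be illustrated with a minimal Python sketch. This is not the script that generates the page. The sample definition lines and the tag names are made up, and it assumes that a line with no tags is a "real" definition while a tagged line is counted once under every tag it carries.

```python
from collections import Counter

def tally(definitions):
    """Count definition lines per category.

    Each definition is a set of tags. A line with no tags counts as a
    "real" definition; a tagged line counts once under every tag it has.
    """
    counts = Counter()
    for tags in definitions:
        counts["total"] += 1
        if not tags:
            counts["real"] += 1
        for tag in tags:
            counts[tag] += 1
    return counts

# Hypothetical sample: four definition lines in one language.
sample = [
    set(),                        # ordinary definition
    {"inflected form"},           # e.g. a plural
    {"inflected form", "slang"},  # one line, counted under both tags
    {"slang"},
]
counts = tally(sample)
```

Because the third line is counted under both tags, the per-category figures in this sample sum to more than the total number of definition lines, which is the effect described in the comment.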

Translingual

What does it refer to exactly? Does translingual mean words which are used in more than one language? DaGizza 23:03, 10 January 2008 (UTC)

  • It refers to two main groups of things. 1) Symbols that don't really belong to any language at all (see %). 2) Taxonomic names (some people call them New Latin) that are used across all languages (that use the Roman script) (see Homininae). SemperBlotto 23:11, 10 January 2008 (UTC)

I also used it on CCC, which is an initialism for Chaos Computer Club and works in English and German; not sure if that was right. Mutante 23:16, 10 January 2008 (UTC)

Language codes

Could we add language codes to this data? I'm going to do so manually right now but it ought to be added to the script that generates this page too. — hippietrail 05:07, 3 February 2008 (UTC)

PAGESINCATEGORY:

I've converted vi:Wiktionary:Thống kê to use {{PAGESINCATEGORY:}} for the language breakdown. It'd be a bit more difficult to do that here; for instance, Category:English language doesn't directly contain all English words, so you'd have to add up all the parts of speech. In any event, it'd be a nice extension to the automatically-updated Special:Statistics page. – Minh Nguyễn (talk, contribs) 22:08, 21 May 2008 (UTC)

Statistics update

Is this supposed to be updated so rarely? The last dump is 50 days old. --Vahagn Petrosyan 20:49, 4 March 2009 (UTC)

I could be wrong, but I think the question is one of responsibility. Connel took care of this page for a long time, but he has been mostly absent as of late, and not doing the updates. Conrad did it a few times, and certainly has access to fresh dumps. I suggest you nag him. -Atelaes λάλει ἐμοί 20:55, 4 March 2009 (UTC)

please tell me...

What are "Form-of" definitions, and why has Mandarin only got 80 of them? Can someone please leave a message for me on my talk page about it? Cheers Tooironic 13:45, 21 November 2009 (UTC)

A "form of" definition consists of an entry that is defined solely as being a "form" of another word. For example, each English noun has a plural "form", and each English verb has a past, past participle, and present participle "form". A Latin verb may have over 100 "forms" (see the links in the inflection table at amō, for example). I suspect Mandarin doesn't have very many "form-of" entries because Mandarin verbs have only a single form, which is the main entry form. "Form-of" entries exist primarily in languages that conjugate their verbs or inflect their nouns and adjectives. --EncycloPetey 16:58, 21 November 2009 (UTC)

Gloss definitions

What is meant by "gloss definitions"? - -sche (discuss) 10:21, 8 February 2013 (UTC)

I think it's a definition that is not a "form-of" definition. Maro 18:46, 15 February 2013 (UTC)
See here: gloss. It'd be good to add this link to the table header: [[gloss#Noun 2|gloss]]

Fix grammar

Template:edit protected

"requests for definitions, this may divide things incorrectly"

This is a comma splice. Please change the comma to a semicolon or add "and" before "this." 2001:18E8:2:1020:1463:E53C:61CD:5659 15:37, 13 June 2013 (UTC)

English lemmata

In June of 2012, Ruakh counted how many English lemmata Wiktionary covered in three different ways. See here. "Approach 1 gave 298,322; Approach 2 gave 299,516" and approach 3 (which lumped different parts of speech together, rather than considering them separate lemmata) gave 133,470. - -sche (discuss) 04:51, 30 August 2013 (UTC)

How does Latin have more entries and definitions than English?

How does a long-dead foreign language get more stuff here than the current, more widely used, actual language of this Wiktionary? -47.20.162.183 00:20, 17 June 2014 (UTC)

Latin words have loads of inflected forms. — Ungoliant (falai) 00:21, 17 June 2014 (UTC)
Thanks! :)-47.20.162.183 01:30, 17 June 2014 (UTC)
I prefer using the gloss definitions column as a measure of how much content we have in a given language. The entries and definitions columns are heavily biased towards languages with complex inflection. Poor English, with its 4~5 inflected verb forms, stands no chance against Latin, which has over 100. — Ungoliant (falai) 01:36, 17 June 2014 (UTC)
Maybe the gloss definitions column should be first one or should be given prominence in some other way. --Vahag (talk) 08:23, 17 June 2014 (UTC)
I support that idea. If no one objects I’ll change the format for the next dump. — Ungoliant (falai) 13:10, 17 June 2014 (UTC)
No objection, but if "gloss definitions" is moved to come after "definitions", the latter should probably be renamed "total definitions" in the interest of clarity. Actually, as long as things are being changed around, could you also put a 1 or something after gloss definitions, so it can be linked to an explanation like this? Given that even I who edit this dictionary had to ask what the term meant, the number of passersby who know what it means is probably small enough to make it worth a footnote. - -sche (discuss) 15:23, 17 June 2014 (UTC)
While we’re at it, if there is any other layout change anyone wants to propose, speak up. I’m thinking of moving the data of appendix defs/entries to the same columns as the non-appendix data, since most languages have 0 anyway. — Ungoliant (falai) 15:43, 17 June 2014 (UTC)
Now that we have categories for every language called "Foo lemmas" and "Foo non-lemma forms", maybe the number of pages in each of those categories for each language could be added to the table. —Aɴɢʀ (talk) 20:35, 21 December 2014 (UTC)

Translation statistics

I’ll be keeping translation statistics at this page. — Ungoliant (falai) 15:54, 28 July 2015 (UTC)

I'm gonna bookmark that :) —Aryamanarora (मुझसे बात करो) 22:05, 8 December 2015 (UTC)
Good stats, thanks! Russian at #2, after Finnish (60,823 translations). Not bad at all! --Anatoli T. (обсудить/вклад) 23:13, 8 December 2015 (UTC)
Finnish is a surprise to me - and then there's Hindi, somewhere in the 40's. —Aryamanarora (मुझसे बात करो) 21:39, 3 January 2016 (UTC)

Statistics on Sindhi language

The information on Sindhi language is NOT correct even as of 2-12-2015. There were more than 1000 definitions in Sindhi wiktionary on that date. Please fix the error.

Aursani (talk) 09:57, 21 December 2015 (UTC)

This information is about English Wiktionary only. — Ungoliant (falai) 13:50, 21 December 2015 (UTC)

Statistics on lemmas and non-lemmas

I think it would be useful if the statistics included measures on how many lemma and non-lemma entries have been created or removed. Right now there is only a generic "entries" column, but that includes all entries, and I don't know if it distinguishes cases where a new lemma POS section has been added to a page that already has a section for the current language. That is what I would consider an "entry": a single page can have multiple entries in one language. —CodeCat 21:35, 22 February 2016 (UTC)

Lemmas pie chart

Numbers from subcategories of Category:Lemmas by language, code copied from mw:Extension:Graph/Demo/CategoryPie:

The chart updates automatically. Would it make sense to add this to the page? --Yair rand (talk) 04:52, 24 February 2016 (UTC)

Why is it that, e.g., Spanish has 47,817 lemmas and German has 42,014, but Spanish doesn't show up on the chart? DTLHS (talk) 04:57, 24 February 2016 (UTC)
Hm. Might be an API limitation. It seems to be ignoring all languages past the first 500 in the list. I'll go ask the author of the chart template if there's any way to fix it. --Yair rand (talk) 05:07, 24 February 2016 (UTC)
Apparently it can't find more than 500 subcategories at a time, and it can't automatically just get the largest categories. I've changed it to a manual list of the largest 150. Unfortunately, this won't automatically add in new languages that enter the top 150. --Yair rand (talk) (not logged in) 14:34, 24 February 2016 (UTC)
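
For anyone who wants to refresh the manual list of the largest 150, the selection step could be sketched in Python as below. This is only an illustration and has nothing to do with how the chart template works. The function name is made up, and the sketch assumes the category sizes have already been collected into a dictionary by some other means. The Spanish and German figures are the ones quoted above; the third entry is a placeholder.

```python
import heapq

def largest_categories(sizes, n=150):
    """Return the n (name, size) pairs with the largest size, biggest first."""
    return heapq.nlargest(n, sizes.items(), key=lambda item: item[1])

# Category sizes keyed by category name (placeholder data apart from
# the Spanish and German figures quoted in the thread).
sizes = {
    "Spanish lemmas": 47817,
    "German lemmas": 42014,
    "Example lemmas": 12,
}
top = largest_categories(sizes, n=2)
```

Rerunning something like this against fresh counts would pick up any language that has newly entered the top 150, at the cost of a manual or scheduled update.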

How do these column headers correspond to "etymologies"?

gloss definitions, entries, gloss entries, form definitions, total definitions - which of them is "etymologies"? --Qdinar (talk) 12:58, 6 February 2020 (UTC)

Pageview stats

@Ungoliant MMDCCLXIV I added some links to Wikimedia's pageview stats in Special:Diff/51268749/51268782, but it looks like they got removed (by a script?) in Special:Diff/57992037/58655295. – Jberkel 18:10, 17 February 2020 (UTC)

That was my fault. I accidentally edited WT:Statistics instead of WT:Statistics/generated when I added this month’s stats. — Ungoliant (falai) 00:52, 18 February 2020 (UTC)

Amharic Wiktionary counter

Just on the first page of the list of words starting with "a" there are 345 words (look here). But the counter for Amharic still shows a content count of 384! What is this madness! Abreham97 (talk) 00:34, 16 November 2021 (UTC)

Update?

How can we update the stats on Wiktionary:Statistics/generated (currently from the 2022-01-01 dump)? A455bcd9 (talk) 12:44, 13 March 2022 (UTC)

Same issue for May :) A455bcd9 (talk) 07:22, 29 May 2022 (UTC)
poke @Ungoliant MMDCCLXIV. Would be amazing to have the code on GitHub or GitLab so that anyone can generate and update this page. A455bcd9 (talk) 07:23, 29 May 2022 (UTC)
@Ungoliant MMDCCLXIV Hi, I hope all is well. Could you please update the statistics or create a document explaining how to generate them so that anyone can run them in your absence? Thanks for any help you can provide. A455bcd9 (talk) 07:34, 14 July 2022 (UTC)

I must confess that I, too, am becoming slightly impatient. On the other hand, Ungoliant may just have quit, and we can force no user to stay active and keep things up to date. Maybe raise the issue centrally? Steinbach (talk) 14:36, 17 October 2022 (UTC)

I'm working on a replacement for Ungoliant's stats; the code will be hosted on gitlab/toolforge, to avoid this situation. However, it's not quite ready yet. – Jberkel 15:15, 17 October 2022 (UTC)
Thanks for your help @Jberkel. FYI the French Wiktionary has detailed statistics and they would be happy to help. A455bcd9 (talk) 13:54, 1 November 2022 (UTC)
How is it going, @Jberkel? Steinbach (talk) 18:03, 3 January 2023 (UTC)
First iteration is now done. Jberkel 04:23, 11 March 2023 (UTC)
Thanks! A455bcd9 (talk) 10:50, 11 March 2023 (UTC)
Thanks indeed! Btw, what explains the apparent drop in the number of languages? Steinbach (talk) 14:56, 11 March 2023 (UTC)
Oh, and can you provide the gitlab link? Steinbach (talk) 14:58, 11 March 2023 (UTC)
The repo: gitlab. It contains a lot more code than just the stats. The drop in languages is probably because reconstruction and appendix namespaces are not included. This is a limitation of the HTML dumps, see Wiktionary:Statistics#cite_note-1. Jberkel 21:57, 11 March 2023 (UTC)
Thank you. I hope someone (either you or someone else) can fix that. Appendix-only languages were already hard to find, this makes them even less visible. Steinbach (talk) 11:25, 12 March 2023 (UTC)
I have another question for you. Is it right that there are no new languages? After sorting the table for "change in number of gloss definitions" I noticed that several languages had gone up from very few (often enough one or two) to a decent number, but none were entirely new (that is, change in number of gloss definitions equals number of gloss definitions). How does your script handle any new language headers, @Jberkel? Steinbach (talk) 12:26, 16 March 2023 (UTC)
I didn't include a diff of new language headers in the output, but there were probably new headers. The stats generation tool reads all L2 headers and transforms them into language codes based on the data of Module:languages. Languages not listed in this module are ignored (usually typos or errors). For the next run I can include a diff. Jberkel 07:47, 24 March 2023 (UTC)
Regarding the missing Appendix/Reconstruction languages, it sometimes helps to signal your interest on phabricator (subscribing to the task etc) in order to get things moving a bit faster there. Jberkel 21:52, 25 March 2023 (UTC)
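
For readers wondering what the header handling described above might look like, here is a rough Python sketch. It is not the code from the repo. It assumes raw wikitext as input (the actual tool works from the HTML dumps), the small `LANGUAGE_CODES` table is a hypothetical stand-in for the data in Module:languages, and the function name is made up.

```python
import re

# Hypothetical stand-in for the data in Module:languages (name -> code).
LANGUAGE_CODES = {"English": "en", "Latin": "la", "Spanish": "es"}

# A level-2 header in wikitext looks like "==English==" on its own line.
# The [^=] guard keeps level-3 headers such as "===Noun===" from matching.
L2_HEADER = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)

def language_codes(wikitext, known=LANGUAGE_CODES):
    """Map each L2 header to a language code, skipping unknown names."""
    codes = []
    for match in L2_HEADER.finditer(wikitext):
        name = match.group(1).strip()
        code = known.get(name)
        if code is not None:  # unknown headers (e.g. typos) are ignored
            codes.append(code)
    return codes

# Made-up page text with one misspelled header ("Latn").
page = "==English==\n===Noun===\n# ...\n==Latn==\n===Verb===\n==Latin==\n"
codes = language_codes(page)
```

In this sample the misspelled "Latn" header is silently dropped, which matches the behaviour described for languages not listed in the module.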