User:Dan Polansky/Inclusion arguments

What follows is a catalog of inclusion arguments and exclusion arguments, strong and weak ones. Some may be used to supplement WT:NSE, others to inform policy discussion. It further treats classes of entries and other considerations.

Inclusion arguments

Entry hub

Avoid a hole in the network and improve navigability and discoverability. #Derived-term principle is related but weaker: it already applies when there is a single derived term, while the avoid-hole principle for navigability requires at least two items to interconnect. Entry hub is more general than WT:THUB, which is for translation only.

Application:

  • Microsoft, the proper noun, serves as a node for all the derived terms: Microsofter, Microsoftian, Microsoftie, Microsoftify, etc. Microsoft, the common noun, could be somewhat inappropriately used for the purpose, but the rationale would apply even if there were no figurative use of the name.
  • Trump, the sense for the U.S. president, serves as a node for synonyms Orange Man, Trumperor, Forrest Trump. There are also derived terms Trumpmania, Trumpness, Trumpoid, but the need for synonyms is stronger.
  • Bush, the sense for the 43rd president, for synonyms. Less strongly for the derived terms.
  • Putin, the sense for president, for synonyms. Less strongly for the derived terms.
  • Biden, the sense for president, for the synonym. Less strongly for the derived terms.
  • Musk, the entrepreneur, for the synonym. Less strongly for the derived terms.
  • Star Trek (a franchise) via Star Trekky, Star Trekker, Star Trekkish
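
The contrast between the entry hub and the derived-term principle can be made concrete with a small Python sketch. It is an illustration only: the function name and the way the thresholds are encoded are assumptions of this sketch, not policy; the terms are those listed above, plus Zeldaesque, which is treated as a lone derived term purely for illustration.

```python
def applicable_principles(derived_terms, synonyms=()):
    """Return the names of the principles that apply to a base entry.

    Hypothetical helper: a single derived term is enough for the
    derived-term principle, while the entry hub needs at least two
    items (derived terms or synonyms) to interconnect.
    """
    principles = []
    if len(derived_terms) >= 1:
        principles.append("derived-term principle")
    if len(derived_terms) + len(synonyms) >= 2:
        principles.append("entry hub")
    return principles


microsoft_derived = ["Microsofter", "Microsoftian", "Microsoftie", "Microsoftify"]
trump_synonyms = ["Orange Man", "Trumperor", "Forrest Trump"]

print(applicable_principles(microsoft_derived))
print(applicable_principles([], trump_synonyms))
# A base with a single derived term (assumed here for illustration only).
print(applicable_principles(["Zeldaesque"]))
```

In this sketch, synonyms alone can make an entry a hub, which matches the Trump, Bush and Putin items above, where the need for synonyms is said to be stronger than the need for derived terms.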

Derived-term principle

This is related to #Entry hub, which is a stronger argument and should preferably be invoked if applicable. Principle: if there is an includable term X derived from term Y, include term Y. If X is derived from a specific sense of Y, include that sense.

This can be applied quantitatively: the more numerous the derived terms, the stronger the case.

The rationale is that if a sense in an entry has produced a derived term, the sense is probably notable enough to be included in the entry. An extreme case is Hitler, whose name activates the referent in the mind of the listener or reader much sooner than a generic "someone named Hitler" sense. The rationale is not that the entry for the derived term itself needs the base term, since that is not so: Zeldaesque can mention the game in the definition and in the etymology so the reader will never need to navigate to Zelda for further information.

Application:

See also User talk:Dan Polansky § Derived-adjective principle.

Quantitative impact: Category:English terms suffixed with -ian has over 2,624 items. Category:English terms suffixed with -esque has nearly 450 items.

Dictionary-only lexical information

Keep the entry if it has lexical information not covered by Wikipedia, Wikispecies or Wikidata.

Classes of information:

  • Gender
  • Inflection
  • Hyphenation
  • Pronunciation (sometimes in Wikipedia)
  • Etymology (sometimes in Wikipedia)
  • Translation (in Wikipedia via interwiki, in Wikidata but without tracing to sources or quotations)

The principle was enunciated in the passed vote Wiktionary:Votes/pl-2010-05/Placenames with linguistic information 2. The vote was later rescinded since the requirement that the entry has to have such information from the start was deemed too stringent. After that, rather lenient inclusion criteria for place names were adopted, but they were no longer based on that sound principle.

Application:

  • Polish Muminek (Moomin) has pronunciation, inflection and etymology (including suffix -ek).
  • Lysistrata has etymology, and in Czech, inflection.
  • Czech Microsoft has gender and inflection.
  • Czech Gondor has gender and inflection.
  • English Rivendell has translations. WP has W:Rivendell, being generous with these kinds of entities.

See also User talk:Dan Polansky § Include attested proper names that are lexicographically interesting.

This principle does not protect multi-word names of biological taxa. Although they are being included in Wiktionary, they duplicate Wikispecies; about 1,000,000 entries for biological taxa can eventually be included. Nor does it protect "X County" entries.

Linguistic information

This is a more inclusive version of #Dictionary-only lexical information, which makes it a weaker argument. The principle: include an entry if it has non-compositional linguistic information even if it is covered by Wikipedia, Wikidata or Wikispecies. It is inspired by Wiktionary:Votes/pl-2010-05/Placenames with linguistic information 2.

The non-compositionality treatment is essential: any name has a compositional pronunciation and compositional etymology, and in inflected languages, compositional inflection.

Classes of information:

  • Etymology
  • Pronunciation
  • Hyphenation
  • Gender
  • Inflection
  • Translation

The referent is arguably also a linguistic class of information, having to do with extensional semantics of names. However, if it counted, we would have to include all names of scientific articles from Wikidata, which would very badly swamp Wiktionary with linguistically low-value content.

Translation may be rather controversial, but it is a linguistic concern. An encyclopedia is not a name translation dictionary. Translation is also subject to non-compositionality treatment so not any translation of a capitalized descriptive phrase serving as a name counts.

Taxonomic names are covered by the principle, even if they duplicate Wikispecies and threaten to eventually swamp the Wiktionary database.

See also #Name translation.

All words in all languages

See #Single word.

Single word

This is a broad inclusion argument based on the Wiktionary slogan "All words in all languages". Its curtailment in WT:FICTION, WT:COMPANY and WT:BRAND was never properly and credibly justified. Wiktionary has enough database space to cover all Wikipedia's single-word names, and even all multi-word names if there is lexicographical merit such as translation. See also User:Dan Polansky/All words in all languages.

Applications:

  • Nike is a word to be included, whether it meets WT:BRAND or not.
  • Verizon, a company name, is a word to be included.
  • Metroid, a game name, is a word to be included.
  • Muminek, a name of fictional entity, is a word to be included.

See also all the single-word examples in User talk:Dan Polansky § Include attested proper names that are lexicographically interesting.

The "All words in all languages" slogan is taken seriously in some ways, but not in other ways. It serves to include all 3-attested very rare words formed from easily parseable highly productive prefixes such as non- and anti-. But other items that are clearly words, with interesting morphology, etymology or formation strategy, are excluded, as per above.

Extrapolate lemmings

The principle: if multiple dictionaries include multiple items of a class, consider not outright banning the class but rather figuring out inclusion criteria for the class. That is, extrapolate from what other dictionaries are doing, erring on the side of inclusion. A related argument is #Wikipedia-style generosity. An advantage of this principle over the per-term lemming principle is that it allows rounded criteria to be developed, reducing the arbitrariness from following lemmings on a per-term basis.

Application:

  • The policy for geographic names was probably inspired by other dictionaries' coverage of them.
  • The policy for astronomical names was probably inspired by other dictionaries' coverage of them.

Extrapolate lemmings for organizations: A multitude of organization names is being included by multiple other dictionaries, so do not regulate to exclude all or nearly all of them but rather figure out which to include. See also WT:OED and WT:MWO.

  • OED has United Nations, League of Nations, W.H.O. and Interpol. Has EU but no European Union. Has Ku-Klux-Klan mentioned in Ku-Klux entry. Has Greenpeace. No Red Cross.
  • MWO has United Nations, European Union, North Atlantic Treaty Organization, Ku Klux Klan, Federal Reserve Board, Red Cross (as an emblem), Red Crescent
    • No Warsaw Pact, no Greenpeace, no Democratic Party, no Republican Party, no Federal Drug Administration, no International Monetary Fund.
    • In legal dictionary: Federal Bureau of Investigation, Federal Aviation Administration
  • Multiple political parties are being included by other dictionaries, so do not exclude political parties wholesale. Some are supported per Talk:Democratic Party and Talk:Republican Party. A list is in Thesaurus:political party.

See also User talk:Dan Polansky § Criteria for inclusion of multi-word names of organizations.

Extrapolate for consistency

The argument is that if a policy leads us to keep some items, we should extrapolate to similar items not covered by the policy. This is a relatively weak argument: if the policy keeping an item is not particularly good or keeps the item for simplicity of administration of rules, there is no reason to extend the boundary even further. The item on the boundary is not really 1.0-kept but rather, say, 0.6-kept and an item near to it can be 0.4-kept, meaning not kept.

Application:

  • North Atlantic Treaty Organization: an important international organization. This is no worse than the kept United Kingdom of Great Britain and Northern Ireland, which is actually fully transparent. Against that, the full U.K. name is not particularly dictionary-worthy and there is no need to extend that inclusion to even more such items. On the other hand, it is not just U.K. but all those "X County" entries; if such insignificant entities can have their multi-word names included, so should NATO.

Coalmine

WT:COALMINE is an actual policy that continues to be controversial. The idea behind it is that if a word exists in multiple spellings, we should include the most common spelling even if it is a sum of parts. The last vote on the subject in 2019 yielded a near-unanimous approval: Wiktionary:Votes/2019-08/Rescinding the "Coalmine" policy.

Semantic coalmine

This is an analog of #Coalmine and states: if a rare or non-neutral term for a concept is included, also include the most common and neutral term for the concept even if the term is sum of parts. The force of this argument is unclear; it is not required for decoding. An application: since Anglistics is includable, we should also include the more usual term English studies. However, English studies is protected by WT:THUB anyway.

Spelling guide

This is a reminder that dictionaries are used not only for decoding, for finding what a word means, but also as a spelling guide, hence various word lists without definitions being published. This use is recognized by the appreciation of "nonstandard" and "proscribed" labels. A vote to remove the proscribed label failed: Wiktionary:Votes/2016-10/Removing label proscribed from entries.

An application: non-French is useless for decoding, but it serves to show existence and a recommendation by a usage guide. For humans, nonadrenal is only marginally more useful for decoding.

This inclusion argument is usually covered by #Coalmine, which is a policy: non-compositional is protected. What is not protected by coalmine is non-French unless one creates nonFrench based on rare nonstandard quotations. Arguably, the preferable treatment is to delete rare nonstandard forms and keep non-French via #Affixed word.

One may object that the reader is better served by consulting a general hyphenation guide. That is true, but having an entry that points to a hyphenation guide is more convenient. One may further object that this would lead to inclusion of all attested hyphenated compound adjectives, as a convenience. That is a point to ponder.

Restricted use

A fairly weak inclusion criterion: a term that is arguably sum of parts sees restricted use that is hard to predict from the parts.

Applications:

  • political entity: from the name, one would think it can be used in reference to anything political, including legislatures, meetings, programmes, votes, campaigns, etc., but quotations only show two uses: on one hand, for entities like states, countries, municipalities, and European Union; on the other hand, it refers to political parties and their candidates.
  • constituent country: it refers to "country" in any sense that is "constituent" (member) of a larger whole. The term is predominantly used in reference to non-sovereign "countries" within the U.K., Denmark and Netherlands, and then sometimes to EU members. Both kinds of uses are sum of parts.
  • island state: it is used for the U.K., Japan and also Hawaii. All these uses match "island" (attributive) and "state" for some values of these, but one would not necessarily know that the U.K., whose territory does not consist of islands, fits within the term.
  • federal state: it is used ambiguously for sovereign states on one hand and non-sovereign states on the other. This may not be a truly restricted use but rather confusing use, one worth explicit documentation, leading to different translations for different senses. More useful than nonchocolate.

These kinds of entries help users clarify the scope of reference of these expressions, but are out of remit of traditional lexicography.

Name translation

The principle: Include an attested name, single-word or multi-word, if it has translations different from the name. This invokes lexicographical merit but is wildly inclusionist, possibly leading to inclusion of 1,000,000 names from Wikipedia or the like. On the other hand, this would be an addition to those 1,000,000 names from Wikispecies, on the same order of magnitude. The inclusion knob can be fine-tuned: do so only for (very) important entities; do so only if there are at least 10 translations; do so only if there are 10 independent attesting quotations, etc. It does not have to be all or nothing.
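
The inclusion knob can be pictured as a small set of adjustable thresholds. The following Python sketch is only an illustration under stated assumptions: the class names, field names and the sample candidate are hypothetical, attestation is assumed to have been checked already, and whether an entity is "important" is taken as an input judged elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameCandidate:
    """An attested name under consideration (hypothetical record)."""
    name: str
    distinct_translations: int    # translations that differ from the name
    independent_quotations: int   # independent attesting quotations
    important_entity: bool        # a judgment made elsewhere


@dataclass(frozen=True)
class InclusionKnob:
    """Adjustable thresholds; the defaults give the untuned principle."""
    min_translations: int = 1
    min_quotations: int = 0
    require_important: bool = False


def includable(candidate: NameCandidate, knob: InclusionKnob) -> bool:
    """Apply the name translation principle at a given knob setting."""
    return (
        candidate.distinct_translations >= knob.min_translations
        and candidate.independent_quotations >= knob.min_quotations
        and (candidate.important_entity or not knob.require_important)
    )


# Hypothetical candidate, for illustration only.
candidate = NameCandidate("Example Organization", 4, 12, False)

print(includable(candidate, InclusionKnob()))                        # True
print(includable(candidate, InclusionKnob(min_translations=10)))     # False
print(includable(candidate, InclusionKnob(min_quotations=10)))       # True
print(includable(candidate, InclusionKnob(require_important=True)))  # False
```

The settings are independent, so any combination between "include everything with a differing translation" and "include only important, well-translated, well-attested names" can be expressed.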

This kind of multi-lingual content is provided by Wikipedia via interwiki links, but these are an accident outside of the articles proper, not primary encyclopedic content. The interwikis themselves have no tracing to sources. One has to hope that Wikipedia editors for the sought language chose the most fitting title for the article. Translation data is also present in Wikidata, but without tracing to sources or attesting quotations.

Put differently, Wikipedia is no name translation dictionary. That is not its remit. It happens to serve the purpose relatively well.

Some sources indicate translation of names is a hard problem, especially between English and CJK languages. There is a potential for unique service here.

Wikipedia and Wikidata are currently probably the translator's best sources for the purpose. However, the translator has to treat them as sources of hypotheses to be verified, with no hope for proper tracing for names as names. Wikidata is extremely generous and inclusive with its 99,894,920 items. Most Wikipedia articles are covered by Wikidata and much more. Wikidata includes not only entities but also sum-of-parts topics such as "history of England" (Q11755949), providing translations for such descriptive phrases into various languages.

See also the bold User talk:Dan Polansky § Translation dictionary of proper names, Wikipedia and Wikidata.

Important-entity name

This inclusion criterion is not purely linguistic and is therefore weaker. It helps limit flooding Wiktionary with names while providing lexicographical benefits for at least some names, such as #Name translation. Something like this criterion plays a role in the current place name policy, which allows the likes of United Kingdom of Great Britain and Northern Ireland, and in the current astronomical name policy.

Applications:

See also #Capitalized descriptive phrase.

Show class sample

This argument is related to #Important-entity name. It says: if you fear a flood of items of a class, include a sample of these anyway so that anyone interested in the class can get an impression of it, of how its members are formed and how they are translated. If the sample reveals the translations are sum of parts, that is also an observation to be made available to the reader or translator.

Application:

  • Include at least the 1000 most important members of the class so that readers can investigate patterns of term formation and translation.
  • Include at least 200 ex-X entries to document productivity of the prefix. However, the productivity can be documented in the Derived terms section of ex- prefix without having actual entries. Still, this prevents collecting attestation evidence for the reader on a per term level. Having 200 mainspace entries is preferable.

This principle is useful not only as an inclusion argument but also as a practical prioritization tool. Given a class of items, it pays off to create a sample of it for the reader but not bother to create every single attested member. This may apply to nonX terms, semiX terms, inhabitant names, Czech possessive adjectives (králův), etc. Sometimes, it is the rarer term that provides a unique value for the reader, to remove all doubt of existence and "correctness". Whether this has any force for nonX terms is questionable: it is a regular and "correct" formation to add prefix "non-" to any adjective.

Translation hubs for compounds

This is a very inclusionist extension of the current WT:THUB policy. It would drop the requirement that the supporting translations are not closed compounds. At least two editors in the THUB vote support this (Wiktionary:Votes/pl-2018-03/Including translation hubs).

Application:

The rationale is the same as for THUB: improve the navigability between languages, one that many non-English Wiktionaries get for free. For instance, de:Autoschlüssel is automatically a translation hub without a special policy treatment. An alternative would be to allow non-English entries to host THUB translations. In the English Wiktionary, translation tables are disallowed from non-English entries, while German Wiktionary allows them.

A consequence would be a flood of English translation hubs. This would bring the English coverage closer to, say, German and Danish coverage. Whether it would be a bad thing is unclear. It would likely be controversial.

To reduce the flood, one could require, say, 7 translations instead of the 2 required by the current THUB. However, this would probably not reduce the flood all that much. car key is supported by 7 translations.

Free variable

The criterion is that free variable is the most natural place to define the term, and most convenient for lookup. The question that the viewer is asking is not "what does free mean" or even "what does free mean in mathematics" but rather "what does it mean for a variable to be free". It is the combination that needs a definition. The combination is not syntactically frozen: a variable can be free, and a set can be open.

Relevant tests:

  • 1) The adjective in the relevant sense modifies only a single noun or a few nouns.
  • 2) The adjective has multiple other meanings outside of the combination.

Both 1) and 2) apply to free variable and open set.

This contrasts with "green leaf", where "green" applies to all objects that can have color and not specifically to leaves.

A minor variation in the modified noun does not detract from the rationale: there can be "free variable" but also "free occurrence" of the variable.

When 2) is not met, the case becomes weaker. Such is the case with retroactive law: the question still is "what does it mean for a law to be retroactive" but looking the definition up in retroactive is convenient enough, unlike looking up a definition of "free variable" in free.

One test is whether the combination leads the collocation set in frequency. Such is the case for bulleted list per bulleted *_NOUN at Google Ngram Viewer, where "bulleted list" and "bulleted lists" lead the pack by a wide margin. By contrast, retroactive law does not lead the pack per retroactive *_NOUN at Google Ngram Viewer; "retroactive legislation" would be more of a candidate but even that is not the leader.
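
The "leads the pack" test can be sketched in Python once the frequencies of the collocation set are available. This is a minimal sketch under assumptions: the function name is hypothetical, and the numbers below are made-up placeholders rather than Google Ngram Viewer data; real values would have to be read off a wildcard query such as bulleted *_NOUN.

```python
def leads_collocation_set(frequencies: dict[str, float], candidates: set[str]) -> bool:
    """Return True if the most frequent collocation is one of the candidates.

    frequencies maps each adjective + noun collocation to its frequency;
    candidates holds the forms of the combination under test, e.g. its
    singular and plural.
    """
    if not frequencies:
        return False
    leader = max(frequencies, key=frequencies.get)
    return leader in candidates


# Placeholder numbers, for illustration only (not corpus data).
bulleted = {
    "bulleted list": 60.0,
    "bulleted lists": 30.0,
    "bulleted (other noun)": 5.0,
}
retroactive = {
    "retroactive (leading noun)": 50.0,
    "retroactive legislation": 20.0,
    "retroactive law": 10.0,
}

print(leads_collocation_set(bulleted, {"bulleted list", "bulleted lists"}))  # True
print(leads_collocation_set(retroactive, {"retroactive law"}))               # False
```

The sketch only checks which collocation is the leader; how wide the margin must be before the test counts in favor of inclusion is left open, as in the text above.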

Related RFDs resulting in keeping or undeleting include Talk:prime number, Talk:free variable, Talk:acute angle and Talk:nominative case. RFDs resulting in deletion include Talk:local variable (should have been kept) and Talk:Acadian epoch. See Special:Search/incategory:"RFD result (passed)" "free variable" and Special:Search/incategory:"RFD result (failed)" "free variable".

Some concerned entries: algebraic number, algebraic integer, bound variable, cardinal number, complex number, free variable, imaginary number, rational number, real number, transcendental number, free software, open set, closed set, complete graph, normal distribution, classical logic, and intuitionistic logic. Some are listed at Talk:free variable.

Syntactic unfrozenness or unboundedness is typical:

See also User talk:Dan Polansky/2013 § free variable.

Utility

WT:CFI says: "In rare cases, a phrase that is arguably unidiomatic may be included by the consensus of the community, based on the determination of editors that inclusion of the term is likely to be useful to readers."

To consider utility for the readers is praiseworthy and allowed by CFI. However, it is not an easily administrable criterion. We should be seeking specific administrable tests that reveal utility; invoking utility without hinting at an administrable test is less than ideal.

The cases are said to be "rare". This is not a problem: the individual cases kept as "useful" are a drop in the bucket compared to 10,000 nonX terms (already included) and a million taxa (yet to be created).

See also #Page views.

Page views

When examining utility for our readers, we may consider page views as objective evidence of what readers find worthwhile. The entry still needs to be lexicographical: it would not do to create an excellent well sourced encyclopedic article in the mainspace and then defend it by page views. For defending nominally sum of parts entries, it is relevant. It is inferior to other specific tests but is not without force as an argument.

Example applications:

One may object that this is not a fair comparison since noncholestatic is naturally a low performer. But that is the point: we do include trivially parseable low performers in great numbers and we would do well not to delete much more viewed entries as long as they only contain lexicographical information.

Unique value

This is an auxiliary inclusion rationale. It is more restrictive than #Dictionary-only lexical information. It is an antonym of #No real dictionary. Wiktionary content provides unique value if it is not found elsewhere, including other dictionaries. Thus, content better covered by other dictionaries does not have unique value. Thus, paradoxically, it is content on the margins that provides a unique edge.

Application:

  • Nicknames of individuals such as Mango Mussolini are unique.
  • Words prefixed with non- are unique in the extent to which they are covered. They are uninteresting, but the category with them provides evidence of productiveness of the prefix, available for anyone to inspect. The same is true for most words prefixed with anti-.
  • A name translation dictionary would provide unique value, complete with traces to authoritative sources for names, including style guides, and with attesting quotations. The service is there for place names and astronomical names, but is not established for other names, being exposed to deletion whims in RFD based on deletion precedent.

Phrasebook

Phrasebook is a policy in WT:CFI and is covered in Wiktionary:Phrasebook. Phrasebook entries may be useful for translation: I am hungry is mám hlad in Czech, as if I have hunger. Those not interested in these entries do not need to use them. I proposed a lemming test for phrasebook but it failed a vote. The current tests are "utility, simplicity and commonality" of the phrase; it can be sum of parts. Utility is subjective, and leads to arbitrary deletions, especially since phrasebook itself is controversial. Another test may be #Page views: if people visit the entry, let's keep it. Despite the phrasebook's being controversial, Wiktionary:Votes/2022-01/New phrasebook regulations passed unanimously. Category:English phrasebook has over 400 entries so there is no flood.

Kept and deleted phrasebook entries can be found via Special:Search/incategory:"RFD_result_(passed)" phrasebook and Special:Search/incategory:"RFD_result_(failed)" phrasebook.

I love you performs really well in terms of page views; Merry Christmas and a Happy New Year performs well only around Christmas; perfectly uncontroversial sum-of-parts-even-if-solid nonchocolate performs very poorly.[4] I'm twenty years old, nominated for deletion, performs many times better than nonchocolate. See also #Overflood perspective.

Exclusion arguments

No real dictionary

One exclusion argument is that no real dictionary contains entries similar to the one under discussion. It is a weak argument: its line of reasoning is rejected overall by policy choices made.

First, other dictionaries do not declare #All words in all languages as their motto.

Second, Wiktionary is set up to have the following entries that no "real" dictionary such as OED or M-W has:

  • All attested names of villages and "X County" names, per WT:CFI#Place names. {{place}} currently sees over 87,000 transclusions, and we are far from done. {{en-proper noun}} sees over 100,000 transclusions.
  • All taxonomic names. There are on the order of 1,000,000 of them in existence.
  • All words prefixed with non-, of which there are currently over 10,000 in the mainspace: Category:English terms prefixed with non-.
  • All attested surnames.
  • All inflected forms as separate entries. This multiplies the number of entries by a factor.

Furthermore, having content other dictionaries do not have provides a unique differentiator, in the business sense, a reason for people to visit Wiktionary sooner than any other dictionary for some needs. A person interested in usual English vocabulary can be better served by other professionally edited dictionaries, usually getting better quality.

See also #Overflood perspective.

Slippery slope

WT:CFI#Attestation vs. the slippery slope says that for attestation, entries should be considered on their merit and not with the fear that their whole group will be included. Nonetheless, a form of slippery slope argument was repeatedly invoked in RFD, e.g. in Talk:ex-Christian to delete ex-X entries as being allegedly too numerous. It was pointed out that -ness entries are as numerous or more, but to no avail; indeed, Category:English terms suffixed with -ness has 9,830 entries. There are over 10,000 nonX/non-X entries. This form of slippery slope seems fundamentally fallacious, not attempting to do any serious quantitative analysis.

In general, the slippery slope argument has some force: when considering putative inclusion criteria, we should analyze their overall impact and not focus on a single entry only. Having many entries with little lexicographical value has costs: database storage use, size of dump downloads, computing resources needed for fulltext search, monitoring changes to entries for defects or vandalism, dispute resolution via RFD or RFV, etc. It helps to analyze each slippery slope argument in perspective: if there is a risk of including, say, 1000 ex-X entries and we have 10,000 nonX entries and support 1,000,000 taxa, these 1000 entries are a drop in the bucket.
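
The "drop in the bucket" comparison is simple arithmetic and can be checked with a few lines of Python. The figures are the round numbers from the paragraph above; the function name is hypothetical, and treating the supported 1,000,000 taxa as if they were all present is an assumption of this sketch.

```python
def share_of_comparison_classes(new_entries: int, comparison: dict[str, int]) -> float:
    """Return the size of a feared class relative to classes already accepted."""
    total = sum(comparison.values())
    if total <= 0:
        raise ValueError("comparison classes must contain at least one entry")
    return new_entries / total


# Round figures from the text: 10,000 nonX entries and 1,000,000 supported taxa.
comparison = {"nonX entries": 10_000, "taxa": 1_000_000}

share = share_of_comparison_classes(1000, comparison)
print(f"{share:.2%}")  # 0.10%
```

On these figures, 1000 ex-X entries would amount to about a tenth of one percent of the entries in the two comparison classes.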

See also #Overflood perspective.

Encyclopedic content

WT:CFI contains a section on encyclopedic content, but that section provides nothing that should lead to exclusion of certain entries. "Delete as encyclopedic" is therefore not a CFI based argument. In detail, numbering mine:

(1) Care should be taken so that entries do not become encyclopedic in nature; if this happens, such content should be moved to Wikipedia, but the dictionary entry itself should be kept.
(2) Wiktionary articles are about words, not about people or places. Articles about the specific places and people belong in Wikipedia.

Item (1) says the dictionary entry itself should be kept; it only regulates the content within entries. Item (2) is confusing and should ideally be removed. (a) Wiktionary entries are not only about words but also about multi-word terms, e.g. New York; (b) one reading would be that specific people and places should have no sense lines in Wiktionary entries, but for place names this would directly contradict their long-term treatment: London is not restricted to a single sense stating "place name", "any of various municipalities" or the like. The only practically acceptable interpretation of (2) is that it reinforces the exhortation in (1) to delegate encyclopedic information about referents to an encyclopedia, while keeping short dictionary definitions covering referents, specific entities.

Thus, "delete as encyclopedic" to delete a complete entry is ad hoc lawmaking, allowed by WT:NSE. However, it is poor lawmaking: the principle is no simple test and points to no simple tests. Something being covered by Wikipedia is in itself no exclusion principle, certainly for non-name words. If this were an exclusion principle for names, there should be almost no geographic names covered since they are covered by Wikipedia.

See also #Dictionary-only lexical information and User talk:Dan Polansky § What is encyclopedic content and dictionary material.

Capitalized descriptive phrase

This is a decent exclusion criterion, although perhaps unnecessarily strict.

Application:

Subjective dislike

Some RFD nominations are in the spirit of subjective dislike. W:WP:IDONTLIKEIT is relevant. Some of the examples given there are reminiscent of what we sometimes see in RFD: "Delete: No need.", "Delete as cruft.", "Delete as trivia." The WP page W:Wikipedia:Arguments to avoid in deletion discussions is an essay, not a policy. Consider its advice, numbering mine:

  • This page in a nutshell:
    • (1) Please remember that deletion discussions are not decided through head count
    • (2) Explain why an article does or does not meet specific criteria, guidelines or policies
    • (3) Always try to make clear, solid arguments in deletion discussions
    • (4) Avoid short one-liners or simple links (including to this page)

(1) does not apply: the discussions are mostly decided through head count, and nothing else can be called "consensus", properly speaking. (4) is too stringent: "delete as SOP" is not necessarily bad; sometimes it's all that needs to be said. "Keep per WT:COALMINE" is fine as well: the link does all the arguing. "Keep as a single word", while not highlighting the policy part, correctly invokes it. (2) and (3) seem relevant and interesting, and are all too often violated in our RFD discussions. Arguments should ideally be based on policy, and when that is not possible, they should invoke a candidate deletion principle that one could wish to see adopted as a policy.

Rare misspellings

They are excluded by policy. Deleting them as useless does no harm; there is no use case for them. Common misspellings can be useful for non-native speakers, who can type e.g. concieve and get to the required entry.

An alternative position would be that there are no misspellings, only very rare attested alternative forms. The resulting markup would serve the purpose: most readers understand not to use vanishingly rare alternative forms. This position is not taken by CFI. The practice would have a poor economy, requiring us to store many vanishingly rare 3-Usenet-attested spelling variations in the database, including japanese and london in lowercase. The economy would be improved by entering them as hard redirects.

Tests for misspelling:

  • Relative frequency using the likes of conceive/concieve at Google Ngram Viewer or [anti - Japanese/antijapanese] at Google Ngram Viewer. An original analysis of frequency ratios was at User talk:Dan Polansky/2013#What is a misspelling. The policy is at WT:CFI#Spellings.
    • The idea is: let the copyeditors whose work results in copyedited corpus have a vote, as it were, and what slips through their fingers vanishingly rarely is a misspelling.
    • Requiring zero hits in copyedited corpus for a misspelling is too much to ask; concieve does slip here and there.
    • This test is evidence-based, "objective" and descriptivist; it does not depend on the whims of the analyst.
    • For hyphenated vs. solid forms, using relative frequencies is tricky: Google rather often captures hyphenated uses as solid ones. The results tend to numerically overrepresent the solid forms. If a search in Google Books for the solid form shows mostly hyphenated forms, the results can be interpreted in the light of that.
  • An estimate of what a copyeditor would do, without looking at frequency. Not as objective as the frequency test.
  • Attestation only in uncopyedited/unedited corpora such as Usenet. If a spelling is attested in edited works, some will argue it is a rare spelling, possibly nonstandard, and not a misspelling. This argument was made for unEnglish. But edited corpora do contain misspellings, e.g. concieve.
  • An authority saying it is a misspelling, e.g. an article on concieve[5]. However, this test is prescriptivist and better suited to inform a "sometimes proscribed" label.
  • Absence from dictionaries. This is a poor test for a descriptivist dictionary: copyeditor is absent from most dictionaries yet finds significant use per GNV.
  • An abstract analysis of whether it is a misspelling: whether it fits a pattern of spellings that are accepted.
    • For instance, there is a rule that prefixing capitalized words retains the capitalization and uses a hyphen. However, there are exceptions, e.g. transatlantic. Also, antichristian is relatively common per GNV evidence. One can thus argue this pattern is not necessarily "incorrect" and that therefore, antimuslim and antizionist are not misspellings but rather rare alternative forms. Still, the rule is nearly always observed and, arguably, only strong evidence for a particular case can override the rule, which is not there for antimuslim. This kind of analysis is much harder to administer than a frequency-based criterion, and is necessarily subjective.

Tests for "common" misspellings:

  • Relative frequency again. The ratio of 10 000 shows a "rare" misspelling. Some will see the ratio of 1000 as evidence of "rare", but let us note conceive/concieve at Google Ngram Viewer shows the ratio of about 2500.
  • Lack of attestation in copyedited corpus. A misspelling that is only in Usenet may be argued to be not "common".
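The relative-frequency test for "common" versus "rare" can be sketched in a few lines of Python. This is only an illustration, not an established tool: the function names are hypothetical, the counts must be supplied by the analyst (for instance, read off Google Ngram Viewer), and the cutoff of 10 000 is the ratio mentioned above, passed as a parameter so that the 1000 cutoff some prefer can be tried as well.

```python
def frequency_ratio(standard_count: int, misspelling_count: int) -> float:
    """Return how many times more frequent the standard spelling is.

    Both counts are assumed to come from the same corpus and period.
    """
    if standard_count < 0 or misspelling_count < 0:
        raise ValueError("counts must be non-negative")
    if misspelling_count == 0:
        # No hits at all for the variant: treat the ratio as unbounded.
        return float("inf")
    return standard_count / misspelling_count


def is_rare_misspelling(standard_count: int, misspelling_count: int,
                        rare_ratio: float = 10_000) -> bool:
    """Classify a misspelling as "rare" when the ratio reaches the cutoff.

    rare_ratio is a tunable assumption, not a figure fixed by policy.
    """
    return frequency_ratio(standard_count, misspelling_count) >= rare_ratio


# Hypothetical counts chosen to give the ratio of about 2500
# reported above for conceive/concieve.
print(frequency_ratio(2_500_000, 1_000))
print(is_rare_misspelling(2_500_000, 1_000))                    # cutoff 10 000
print(is_rare_misspelling(2_500_000, 1_000, rare_ratio=1_000))  # cutoff 1000
```

With these made-up counts, the pair counts as "common" under the 10 000 cutoff but as "rare" under the 1000 cutoff, which is the disagreement noted above.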

Anomalous spellings, those failing a pattern, are not misspellings, e.g. unchristian vs. un-Christian. See Wiktionary:Misspellings for more examples.

Typos are deleted as typos regardless of frequency, e.g. amgydala.

Precedent:

See also Wiktionary:Misspellings.

Sum of parts

We need the sum of parts criterion to exclude nearly all attested phrases that are transparent syntactic constructions. We cannot include "green leaf" and "history of England". But sum of parts should not be taken to be strictly sufficient for deletion with no alleviating concerns possible. WT:THUB is one such concern discovered and codified over time. There are other concerns to be discovered and articulated, and this can happen on the fly before a policy is codified. See also #Utility.

Classes of entries

Affixed word

Include attested affixed words even if hyphenated. Thus, include ex-pilot, ex-Christian, self-govern and non-French. This is in keeping with #All words in all languages. See Wiktionary:Beer parlour/2022/September § Including hyphenated prefixed words as single words.

Pronunciation spellings

Attested pronunciation spellings are being widely included as per Category:English pronunciation spellings. Wiktionary:Votes/pl-2011-01/Final sections of the CFI removed Typographic variants section of CFI, which dealt with "G-d, pr0n, i18n or veg*n". The voters did not indicate intent to remove those spellings; some voters indicated intent not to remove them. A change requires a vote.

Past discussions: Wiktionary:Beer_parlour/2008/March#-in'_forms, Talk:bein' and Talk:frontin'.

Asterisk spellings

Spellings with asterisk are being included, e.g. veg*n. Wiktionary:Votes/pl-2011-01/Final sections of the CFI removed Typographic variants section of CFI, which dealt with "G-d, pr0n, i18n or veg*n". The voters did not indicate intent to remove those spellings; some voters indicated intent not to remove them. A change requires a vote.

Included items: veg*n, f**k, f*ck, f*der, b******s, d*ck, d—n, etc.

Category: Category:English censored spellings.

Discussions: Talk:f**k yielded a near-unanimous keep while Talk:f*ck yielded 6:4 for deletion.

Leet

Leet is being included per Category:English leet. Wiktionary:Votes/pl-2011-01/Final sections of the CFI removed Typographic variants section of CFI, which dealt with "G-d, pr0n, i18n or veg*n". The voters did not indicate intent to remove those spellings; some voters indicated intent not to remove them. A change requires a vote.

Nicknames of individuals

Neither "nickname of individual" nor "multi-word nickname of individual" is sufficient grounds for exclusion. Governator is a word, and if we are to document it, then as a nickname. Other nicknames are in Category:en:Nicknames of individuals, which has only a little over 80 entries.

Some nicknames for presidents:

Multi-word nicknames are not protected by "all words in all languages".

This does not duplicate Wikipedia: W:Donald Trump does not list the above nicknames.

We are not swamped by nicknames nor are we about to become so any time soon. Rather, we have over 10,000 nonX solid-written words, trivially decipherable for humans, very uninteresting. And we are set up to duplicate on the order of 1,000,000 taxa from Wikispecies.

An early surviving nickname is Talk:Governator, 2007. Talk:Pharma Bro survived a 2019 RFD. However, Talk:Baghdad Bob was deleted in 2022, with the nomination rationale "Nickname of an individual".

See also User talk:Dan Polansky/2016 § Nicknames of specific people.

Arguments supporting particular nicknames include #Single word (Governator) and #Dictionary-only lexical information (Donald Trumpet).

However, the value is not so unique in so far as Wikipedia does cover this sort of lexicography:

Baghdad Bob is currently covered in W:Muhammad Saeed al-Sahhaf.

Names of literary works

The inclusion of books and other literary works including plays is governed by WT:NSE. We have Bible, King James Bible, Genesis, Pentateuch, Book of Mormon, Old Testament, New Testament, Tanakh, Torah, Neviim, Ketuvim, Talmud, Octapla, Qur'an, Tao Te Ching, I Ching, Veda, Bhagavad Gita, Kama Sutra, Decameron, Little Red Book, Shahnameh, Edda, Iliad, Odyssey, Aeneid, Lysistrata, Hansel and Gretel, Jabberwocky; and further dictionaries: AHD, OED, CCE, COD, DARE, DCHP, LDE, NOAD, and RHD. There is Category:en:Books.

Discussions resulting in keeping include Talk:Odyssey, Talk:Kama Sutra, Talk:Hansel and Gretel, Talk:Jabberwocky.

Discussions resulting in deletion include Talk:Ali Baba and the Forty Thieves, Talk:Pearl of Great Price, Talk:Merseburger Zaubersprüche, Talk:Urban Dictionary, and Talk:基度山恩仇記. Talk:Oxford English Dictionary and Talk:Shorter Oxford English Dictionary were deleted as failing RFV, which makes no sense to me.

The #Extrapolate lemmings argument leads us to include some names and figure out criteria for them: there is Aeneid[6] and Kama Sutra[7], but not Lysistrata[8].

#All words in all languages suggests the following: include attested single-word names not originating as a capitalization of a non-name word. Thus, include Lysistrata and Decameron but not the Clouds. In case of doubt, to limit the numbers, include only Wikipedia-notable ones. Some may be protected by #Entry hub. Many will have #Dictionary-only lexical information.

Place names

There are rather lenient criteria for place names ("geographic names") in place. However, they include many uninteresting names such as "X County" entries, while excluding many single-word names such as German street names, e.g. Hauptstraße. #All words in all languages and #Dictionary-only lexical information would lead to a different approach.

Place names are linguistically distinct from organization names:

Organization names

As pointed out in #Place names, organization names have in general fewer saving graces than place names. This class of names is supported by #Extrapolate lemmings.

Some names are protected by #All words in all languages (e.g. Greenpeace) or #Linguistic information (e.g. Ku Klux Klan).

United Nations Organization is supported by #Important-entity name.

Some names can be supported by #Name translation, but that is likely to be more controversial.

Some political parties can be supported by #Derived-term principle: Democratic Party produced Democrat; Republican Party produced Republican; Green Party produced Green.

A 2022 vote failed: Wiktionary:Votes/2022-06/Updating CFI for names of organizations. Noteworthy comments in support for names of organizations from the vote and its talk page:

  • "I plain like the translation hubs on some of the entries" --brittletheories
  • "I am particularly concerned at the glee expressed in deleting translations on the talk page, which clearly do serve a lexical purpose." --Theknightwho
  • '"WT:NOTPAPER. In addition, so long as our entries avoid falling into the encyclopedic trap -- describing the thing referred to by the term, rather than the term itself -- and we focus instead on the names as terms -- looking at the derivation, pronunciation, date of first use, sense development (if any), and other lexicographic details -- then I see no particular reason not to treat these as "dictionary material".

    As a translator, I'm often curious about how different languages construct the names for things. Some of these names are arguably idiomatic, as the choice of this word or that as a translation for part of the original name can be arbitrary.' --Eiríkr Útlendi

  • '"...we lose our legitimacy as a special project" How?? We're a dictionary. A dictionary's job is to host definitions for words and phrases, including those that happen to be names of organizations. For in-depth coverage, we direct readers to the Wikipedia article using a simple link, just like we do for any other word or phrase that has a corresponding Wikipedia article. No need to tear out our dictionary entries for perfectly-functional words and phrases just because they happen to form the names of organizations.' --Whoop whoop pull up
  • 'Agreed, I think a policy against names of organizations should be analogous to our policy on brand names.' --Imetsia

Above, we have the testimony of a translator. There is also the mention of WT:BRAND, which, exclusionist as it is, is much more inclusionist than excluding nearly all names of organizations.

Talk:Republican Party and Talk:Democratic Party failed RFD with the incorrect rationale that they fail WT:COMPANY. Political parties are not companies: not by dictionary definition and not by hyponymy in WordNet.

Surnames

Surnames are protected by policy. Their inclusion illustrates the principles applied. They are supported primarily by #All words in all languages and #Dictionary-only lexical information. One can argue they are covered by Wikidata, but they have no gender and inflection information there. They would seem excluded via #No real dictionary since OED and M-W do not have surnames, but there are specialized dictionaries of surnames.

Surnames have no definition proper; they are defined merely as "surname". They can have etymology and pronunciation. They do not meet CFI's introductory rationale, "A term should be included if it's likely that someone would run across it and want to know what it means."

forebears.io reports 5,095,698 surnames in the United States. The magnitude matches the claim made on Quora that in the last census, there were 6.3 million surnames in the United States. By contrast, The Oxford Dictionary of Family Names in Britain and Ireland reports having over 45,000 entries. According to Wikipedia, OED includes 616,500 word forms in total; 6 million surnames is about 10 times as many.

It would make sense to delegate surnames to a specialized dictionary project. But Wiktionary claims to aspire to include all words in all languages.

Taxonomic names

In so far as taxonomic names are proper nouns and names of specific entities, they are not protected by CFI, and are subject to arbitrary deletion. They have not been subjected to RFD so far. There are on the order of 1,000,000 such names in existence. They are not protected by #All words in all languages. They would seem excluded per #No real dictionary; however specialized dictionaries of taxonomic names do exist. They could be excluded as duplicating Wikispecies.

Other considerations

Dictionary-style treatment

This is a rebuttal of the notion that if something is in an encyclopedia, it thereby should not be in the dictionary. A dictionary treats the term as a dictionary entry, providing a definition and other classes of lexical information. A definition is not an encyclopedic article. This notion is in keeping with WT:CFI#Wiktionary is not an encyclopedia's "Care should be taken so that entries do not become encyclopedic in nature; if this happens, such content should be moved to Wikipedia, but the dictionary entry itself should be kept." See also User talk:Dan Polansky § What is encyclopedic content and dictionary material.

Wikipedia-style generosity

While Wikipedia is very inclusive and generous with coverage of popular culture, businesses, commercial brands, relatively minor organizations, etc., Wiktionary has adopted policies that unnecessarily curtail dictionary coverage. The curtailing policies are WT:FICTION, WT:BRAND and WT:COMPANY. That curtailment was never properly justified. It seemed to follow the exclusionist version of the lemming principle: if "real" dictionaries do not include that kind of content, nor should Wiktionary. A Wikipedia analog would be: if "real" encyclopedias do not include this kind of content, nor should Wikipedia. And yet, Wikipedia has a generous article about W:Gondor.

Overflood perspective

This is a response to various claims that we are going to be overflooded by various kinds of entries if we allow them. Often, the arguer speaks of "infinity" of entries without doing any quantitative analysis. Thus, we must allegedly exclude ex-X entries (the prefix is too productive) or most names of literary works regardless of lexicographical merit. To that, we may note we have over 10,000 entirely uninteresting nonX entries, that we are multiplying the number of entries by including inflected forms as separate entries, and that we are in the process of duplicating on the order of 1,000,000 biological taxa from Wikispecies. Wikipedia has about 6,500,000 articles so even if we created an entry for each of them, we would not run out of database space. Only some of these articles are names; "History of England" is not a name. When a class of items is considered for inclusion using purely lexicographical criteria, the fear of overflood should be put in relation to these numbers. See also the extremely bold User talk:Dan Polansky § Translation dictionary of proper names, Wikipedia and Wikidata.

Policy override

Is it acceptable to vote in RFD discussions against policy? Or is it something one should be ashamed of?

First, WT:CFI says: "In rare cases, a phrase that is arguably unidiomatic may be included by the consensus of the community, based on the determination of editors that inclusion of the term is likely to be useful to readers." Thus, invoking "utility" is not per se a policy override, strictly speaking, since it is covered by policy. See also #Utility.

Second, WT:CFI has phrasebook criteria, and one should keep that in mind when a sum of parts phrase is being requested for deletion.

Third, as a matter of fact, Wiktionary has a history of policy overrides:

  • WT:LEMMING was used for over a decade by various editors, and never made it into policy.
  • WT:COALMINE was used as an argument before it was approved as a policy.
  • WT:THUB was used for over a decade under the head "translation target" before it was approved as a policy.
  • The hot words policy was used for multiple years before it was officially approved.
  • Various editors state ad hoc deletion arguments in their RFD nominations with no basis in policy. Sometimes, the productivity of a certain word creation process is used as an argument, despite the "Slippery slope" section of WT:CFI.
  • "Set phrase" was mentioned by multiple editors in RFD discussions.
  • "Term of art" was mentioned by multiple editors in RFD discussions.

WT:EL has a flexibility section explicitly making it a guideline, not a set of rigid rules.

Policy overrides are not an unconditional good. They should be well reasoned, and not based on a whim. A hesitation to make them is advisable. On the other hand, they have done the project a lot of good, and are in the spirit of Wikipedia's W:WP:IAR: "If a rule prevents you from improving or maintaining Wikipedia, ignore it."

A vote designed to ban policy overrides failed: Wiktionary:Votes/2014-11/Entries which do not meet CFI to be deleted even if there is a consensus to keep.

Attestation from Twitter

OED uses quotations from Twitter as per WT:OED. Wiktionary editors decided to allow Internet quotations (Wiktionary:Votes/pl-2022-01/Handling of citations that do not meet our current definition of permanently archived), but a Beer parlour discussion (Wiktionary:Beer parlour/2022/September § Whether Reddit and Twitter are to be regarded as durably archived sources) yielded 11:8 for Twitter, which is not a 2/3 supermajority for allowing Twitter under the normal standard of 3 attesting quotations spanning a year. It would make sense to require the Twitter quotations to show evidence of "sustained, widespread or accumulated use", using the language of OED and WT:MWO. The quantitative requirements would be left open. See also User talk:Dan Polansky § Using Twitter for attesting quotations.
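The arithmetic behind the 11:8 result can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that only support and oppose votes count toward the total (abstentions ignored), which is an assumption on my part rather than a stated rule.

```python
from fractions import Fraction
from math import ceil

# Threshold taken from the text above; exact fractions avoid rounding issues.
TWO_THIRDS = Fraction(2, 3)


def reaches_supermajority(support: int, oppose: int,
                          threshold: Fraction = TWO_THIRDS) -> bool:
    """True if the support share is at least the threshold."""
    total = support + oppose
    if total == 0:
        return False
    return Fraction(support, total) >= threshold


def minimum_support(total: int, threshold: Fraction = TWO_THIRDS) -> int:
    """Smallest number of support votes that reaches the threshold."""
    return ceil(threshold * total)


print(reaches_supermajority(11, 8))  # 11 of 19 is about 58 %
print(minimum_support(19))           # supports needed among 19 voters
```

Under these assumptions, 11 of 19 falls short, and 13 of 19 would have been the minimum to reach 2/3.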

Attestation from Usenet

Traditionally, 3 quotations from Usenet were considered enough. There is some opposition to it: Usenet is not edited and contains much more fringe word and proto-word material than printed publications.

Discussions:

Year-spanning 3-attestation

As for attestation, Wiktionary inclusion criteria (WT:ATTEST) cannot be much more lenient than they are. The standard of 3 quotations spanning a year approaches the bare minimum needed to achieve independence and span, especially since it covers Usenet. Professional dictionaries are much stricter: see WT:Merriam-Webster and WT:Oxford English Dictionary criteria, using the language of "sustained", "widespread" or "accumulated" use. The current standard is easy to administer and reduces the workload of providers of quotations, compared e.g. to requiring 10 quotations. Reducing the standard to one use would eliminate all independence and allow nonce words invented by creative authors such as James Joyce. Admittedly, one limitation of such nonce words would still be that they need to convey meaning. See also Wiktionary:Votes/pl-2014-03/CFI: Removing usage in a well-known work 3, which mandated moving deleted nonce words to the likes of Appendix:English nonces.
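How easy the standard is to administer can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the wording of WT:ATTEST: independence is approximated as "distinct authors", "spanning a year" is read as at least 365 days between the earliest and the latest quotation, and the Quotation type and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Quotation:
    author: str      # used here as a crude proxy for independence
    published: date


def meets_attestation(quotations: list[Quotation],
                      min_quotations: int = 3,
                      min_span_days: int = 365) -> bool:
    """True if some group of quotations is independent and spans the period.

    Brute force over all groups of the required size; fine for the
    handful of quotations an entry usually has.
    """
    for group in combinations(quotations, min_quotations):
        authors = {q.author for q in group}
        if len(authors) < min_quotations:
            continue  # two quotations share an author: not independent
        dates = [q.published for q in group]
        if (max(dates) - min(dates)).days >= min_span_days:
            return True
    return False


quotes = [
    Quotation("A", date(2019, 1, 1)),
    Quotation("B", date(2019, 6, 1)),
    Quotation("C", date(2020, 3, 1)),
]
print(meets_attestation(quotes))
```

Raising min_quotations to 10 in this sketch shows the administrative cost mentioned above: the check itself stays trivial, but ten independent quotations have to be found and entered first.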

Attributive-use rule

This rule used to be in WT:CFI, and said this:

A name should be included if it is used attributively, with a widely understood meaning. For example: New York is included because “New York” is used attributively in phrases like “New York delicatessen”, to describe a particular sort of delicatessen. A person or place name that is not used attributively (and that is not a word that otherwise should be included) should not be included. Lower Hampton, Sears Tower, and George Walker Bush thus should not be included. Similarly, whilst Jefferson (an attested family name word with an etymology that Wiktionary can discuss) and Jeffersonian (an adjective) should be included, Thomas Jefferson (which isn’t used attributively) should not.

It was removed via Wiktionary:Votes/pl-2010-05/Names of specific entities. It was in the spirit of OED's inclusion criteria and would lead us to remove almost all names, including place names. One only has to look at what names OED includes (almost none, far fewer than Merriam-Webster, which has a whole section for geographic names) to see the likely impact. Compared to that, the current place name criteria are very generous. Almost no opposition showed up in the vote. The exclusionist spirit of the rule survives in WT:FICTION.

Duplication of other wikis

A lot of Wiktionary content necessarily duplicates other wikis including Wikipedia and Wikispecies, though incompletely. This is especially true of various classes of proper names. Adding more of them cannot possibly be a lexicographical priority. To wit:

  • Entries for biological taxa duplicate Wikispecies.
  • Geographic names including "X County" duplicate Wikipedia. Example: Washington County.
  • Names of laws, theorems and principles duplicate Wikipedia. Example: Pythagorean theorem.
  • Surprisingly, nicknames of individuals duplicate Wikipedia's W:Lists of nicknames and related pages.

See also #Dictionary-only lexical information.

See also