
Wiktionary Analyzer

Goal: Produce a Lucene analyzer which, based on data extracted from

  • Wiktionary projects
  • Wikipedia projects
  • a few other selected projects

provides better, faster, smarter search across the Wikimedia projects (suggestions, spelling corrections, etc.).


The analysis would be CPU intensive. To be done in sensible time, development would require:

  • an integration server (Hudson scales nicely)
  • Wikimedia project dumps (OpenZIM or XML), source and HTML
  • a Hadoop cluster running advanced Mahout algorithms (SVM, LDA, and others)
  • iterative production of stronger Lucene analyzers, bootstrapped via simple scripts followed by unsupervised learning cycles (to complete the picture)
  • since these jobs could easily become intractable (through bugs or bad algorithms):
    • running development jobs on wiki subsets
    • job length and current progress/cost estimation as design goals (see the counter sketch below)
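
As a concrete illustration of the progress/cost design goal, here is a minimal sketch of surfacing progress from a Hadoop MapReduce job; the WikiPageMapper class and counter names are illustrative, not part of any existing codebase:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper over wiki pages; counters expose progress so a
// monitor can estimate remaining job length and abort runaway runs.
public class WikiPageMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text page, Context context)
            throws IOException, InterruptedException {
        context.getCounter("analyzer", "pages.seen").increment(1);
        // ... the actual page analysis would go here ...
        context.getCounter("analyzer", "pages.done").increment(1);
    }
}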


Lexical Data

  1. Lemma extraction (a parsing sketch follows this list)
    1. Scan the English Wiktionary at the POS section level.
    2. Extract:
      1. language,
      2. POS,
      3. lemma information,
      4. glosses
  2. collocations
  3. proper names
  4. silver bullets
    1. An entropy-reducing heuristic to induce missing lemmas from free text.
    2. Inducing word senses based on Topic Map 'contexts'.
    3. Introducing a disambiguation procedure for semantic/lexical/entity data.
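
A minimal sketch of the POS-section scan over raw wikitext, assuming the English Wiktionary's ==Language== / ===Part of speech=== heading conventions; the class name and output format are illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans one Wiktionary page's wikitext and emits (language, POS) pairs,
// assuming the usual ==Language== / ===Part of speech=== headings.
public class PosSectionScanner {
    private static final Pattern HEADING =
        Pattern.compile("^(={2,3})([^=].*?)\\1\\s*$", Pattern.MULTILINE);

    public static void scan(String wikitext) {
        String lang = null;
        Matcher m = HEADING.matcher(wikitext);
        while (m.find()) {
            String title = m.group(2).trim();
            if (m.group(1).length() == 2) {
                lang = title;                            // ==English==
            } else if (lang != null) {
                System.out.println(lang + "\t" + title); // ===Noun===, etc.
            }
        }
    }
}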

Semantic Data

  1. Word sense enumeration via grep (context collection is sketched below)
    1. Word sense context collection via Mahout SVD
    2. Word sense context description (requires an algorithm)
  2. Word nets: synonyms, antonyms
    1. Entity type
    2. Categories
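
Before an SVD step (e.g. in Mahout) can run, the contexts must first be collected. A minimal sketch of gathering term-context co-occurrence counts; the window size and class name are illustrative:

import java.util.HashMap;
import java.util.Map;

// Collects co-occurrence counts in a +/-2 token window around each term;
// the resulting context vectors would be the input to an SVD step.
public class ContextCollector {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void observe(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> row =
                counts.computeIfAbsent(tokens[i], k -> new HashMap<>());
            for (int j = Math.max(0, i - 2); j <= Math.min(tokens.length - 1, i + 2); j++) {
                if (j != i) row.merge(tokens[j], 1, Integer::sum);
            }
        }
    }

    public Map<String, Integer> contextsOf(String term) {
        return counts.getOrDefault(term, Map.of());
    }
}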

Entity Data

The idea is to generate a database of entities referenced in Wikipedia, bootstrapping via article headers and categories. Once the most obvious entities are listed, one proceeds to train classifiers to find the remaining entities via NLP.

  1. Category-based classification (a bootstrap sketch follows this list) of:
    1. people, organizations, companies, bands, nations, imagined, dead, etc.
    2. animals, species, etc.
    3. places, countries, cities
    4. dates, timelines, durations
    5. events, wars, treaties, films, awards
    6. chemicals, medicines, drugs
    7. commodities, stocks, etc.
    8. publications, journals, periodicals, citations, etc.
    9. external web locations
  2. Unsupervised acquisition of more entities
    1. Train SVM classifiers (Mahout via Hadoop) by tagging/parsing low-ambiguity snippets referencing a wide selection of terms.
    2. Acquire more entities.
  3. Cross-wikify top entities.
    1. Cross wiki links.
    2. Run Mahout LDA via Hadoop on articles/sections with correlated entities.
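
A minimal sketch of the category-based bootstrap, assuming hand-written keyword rules; the rule table and type labels are illustrative:

import java.util.LinkedHashMap;
import java.util.Map;

// First-pass bootstrap: assign a coarse entity type from article
// categories by keyword match. Rules and type names are illustrative.
public class CategoryClassifier {
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("births", "PERSON");
        RULES.put("companies", "ORGANIZATION");
        RULES.put("cities", "PLACE");
        RULES.put("treaties", "EVENT");
        RULES.put("journals", "PUBLICATION");
    }

    public static String classify(Iterable<String> categories) {
        for (String cat : categories) {
            String c = cat.toLowerCase();
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (c.contains(rule.getKey())) return rule.getValue();
            }
        }
        return "UNKNOWN"; // left for the NLP-based second pass
    }
}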

Etymology

If we trust the etymologies in one language, we could suggest them for others.

  1. This would require a model (loan, analogy, language change); an edge sketch follows this list.
    1. It requires/implies a phonological distance, a semantic distance, and word senses.
    2. It requires/implies a graph of time, language, and location.
    3. Historical-linguistics rules could be used to refine such a model.
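
A minimal sketch of the graph edge such a model implies; the field names and the choice of a Java record are illustrative:

// One hypothesized etymological link in the (time, language, location)
// graph; the distances allow competing etymologies to be scored.
public record EtymologyEdge(
        String sourceWord, String sourceLang,
        String targetWord, String targetLang,
        String mechanism,            // "loan", "analogy", "language change"
        double phonologicalDistance,
        double semanticDistance,
        int approxYear) {}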

MT Translation Data

This is not a high-priority deliverable, since its utility is doubtful.

  • Offline bilingual dictionaries may be of interest, cf. [1]
  • As the wiktionaries improve, they could become a significant contribution where statistical methods fall short.
  • The Wikipedias clearly contain large volumes of text for generating statistical language models.


Filling in the gaps

During analysis it may be possible to perform some extra tasks.

  1. Multilingual, context-sensitive spell-checking of the wikis, both offline and online.
  2. Identifying "missing templates"; this requires a lemma-to-template mapping data structure and a generalization algorithm (a sketch follows this list).
  3. Identifying "missing pages/sections" in the Wiktionary.


Language Instruction

Language instruction would benefit from a database pertaining to language pairs or groups; i.e., it could help chart an optimal curriculum for teaching an nth language to a speaker of n-1 languages by producing a graph of least resistance.

  • Top Frequency Lexical Charts
  • Topical Word Lists
  • Lexical problem areas
  • Word Order (requires lemma n-gram frequencies)
  • Verbal Phrase / Verbal Complement Misalignment

Compression

  1. Frequency information together with lexical data can be used to make a text compressor optimized for a specific wiki (a sketch follows this list).
  2. This type of compressed text would be faster to search.
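
A minimal sketch of the idea: assign small integer ids in descending frequency order, so that a variable-length code (varint, Huffman) gives frequent terms the shortest encodings; the class name is illustrative:

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assigns token ids in descending frequency order; with a varint or
// Huffman coding over the ids, frequent wiki terms get the shortest
// codes, and searches can compare ids instead of strings.
public class FrequencyCoder {
    private final Map<String, Integer> ids = new HashMap<>();

    public FrequencyCoder(Map<String, Long> frequencies) {
        List<String> byFreq = frequencies.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .toList();
        for (int i = 0; i < byFreq.size(); i++) ids.put(byFreq.get(i), i);
    }

    public int idOf(String token) {
        return ids.getOrDefault(token, -1); // -1: fall back to literal encoding
    }
}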

Tagger/Parser

The lexical data, together with word and n-gram frequencies, could be used to build a parametric translingual parser.

Translator

  1. Translation matrix (a sketch follows this list)
    1. Maps a source word sense to a target word sense.
    2. Translates across languages.
    3. Simplifies a single-language text.
    4. Makes text clearer via a disambiguation operation.
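
A minimal sketch of the translation matrix as a sense-to-sense mapping; the SenseId type is illustrative:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Maps a word sense in a source language to a word sense in a target
// language; simplification is the special case where the target language
// equals the source and the value is a more common sense.
public class TranslationMatrix {
    public record SenseId(String lang, String lemma, int sense) {}

    private final Map<SenseId, SenseId> matrix = new HashMap<>();

    public void put(SenseId source, SenseId target) {
        matrix.put(source, target);
    }

    public Optional<SenseId> translate(SenseId source) {
        return Optional.ofNullable(matrix.get(source));
    }
}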

Algorithms

  1. Semi-supervised acquisition of morphology.
    1. Mine lemmas.
      1. Collect lemmas from template categories (template extraction).
      2. Map templates to morphological states via a "model".
    2. Entropy-minimizing lemma induction via heuristics.
      1. From existing lemma knowledge gathered from Wiktionary, postulate/induce/prove additional lemmas by induction.

Lemma Mining

A couple of themes recur in searching for lemmas: 1. geometry (Hamming distance for equal lengths); 2. n-grams.

  1. Bootstrap known knowledge.
    1. Using a hand-built table miner, (template-)structured text (Wiktionary), and a free-text miner (Wikipedia), collect:
      1. a lemma base,
      2. an unknown base.
  2. Induce and generalize from L = {L1 >> L2 >> ... >> LN} (where >> indicates "more frequent in the corpus").
    1. parallel multi-pattern matches [2]
      1. affix mode
        1. iteration mode (top frequency) - for the most important lemmas, assuming correlation of frequency with morph-state
Find or select two words M1 and M2. Other iteration modes may be more efficient after the top lemmas have been found, or together with a deleted-n-gram lookup table or other structures.


Other iteration modes:

  1. Levenshtein clustering using spheres around/between existing lemmas and lemma members. Say R = d(L1, M1) is the maximum distance between any two known lemma representatives; words within R/2 of both L and M are lemma candidates (see the sketch after this list).
  2. Generalizing the Levenshtein distance to a complex distance (e.g., mapping strings so that sort order becomes predictive).
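
A sketch of the sphere test from item 1, using the classic dynamic-programming Levenshtein distance; the candidate rule is one reading of "within R/2":

// Classic dynamic-programming Levenshtein distance, used to test whether
// an unknown word falls inside the R/2 sphere around known lemma members.
public class LemmaCandidates {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Candidate if the word sits within R/2 of both known representatives.
    public static boolean isCandidate(String word, String l1, String m1) {
        int r = distance(l1, m1);
        return distance(word, l1) <= r / 2 && distance(word, m1) <= r / 2;
    }
}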

Data Structures

Generalized Morphological State

An enumeration of all possible morphological states (in all languages). Each language has a sparse subset depending on its morphological parameters; an encoding sketch follows the Hungarian tables below.

E.g., Hungarian has:

Hungarian Verb

Lexical Category {CT:VB; CT:VB_INF; CT:VB_PART; CT:VM_NN} respectively: verb, infinitive, participle, verbal noun
Definiteness {DF:0; DF:1} respectively: indefinite, definite
Mood {MD:I; MD:S; MD:C} respectively: indicative, subjunctive, conditional
Number {NM:1; NM:2} respectively: singular, plural
Person {PR:1; PR:2; PR:3} respectively: 1st person, 2nd person, 3rd person
Time {TM:R; TM:S; TM:F} respectively: present, past, future
Accusative {ACC:1} respectively: true
Polite {PL:1} respectively: true

Hungarian Noun

Lexical Category {CT:NN} respectively: noun
Number {NM:1; NM:2} respectively: singular, plural
Possessor {PR:1; PR:2; PR:3} respectively: 1st person, 2nd person, 3rd person
Possessor-Number {PNM:1; PNM:2} respectively: singular, plural
Case {CS:NOM; CS:ACC; CS:DAT; CS:INS; CS:SOC; CS:FAC; CS:CAU; CS:ILL; CS:SUB; CS:ALL; CS:INE; CS:SUP; CS:ADE; CS:ELA; CS:DEL; CS:ABL; CS:TER; CS:FOR; CS:TEM} respectively: Nominative, Accusative, Dative-genitive, Instrumental, Essive-modal, Translative, Causal-final, Illative, Sublative, Allative, Inessive, Superessive, Adessive, Elative, Delative, Ablative, Terminative, Formal, Temporal

Hungarian Language

Lexical Category {CT:VB; CT:VB_INF; CT:VB_PART; CT:VM_NN; CT:NN; CT:PNN; CT:ADJ; CT:ADV; CT:ART} respectively: verb, infinitive, participle, verbal noun, noun, pronoun, adjective, adverb, article
Definiteness {DF:0; DF:1} respectively: indefinite, definite
Mood {MD:I; MD:S; MD:C} respectively: indicative, subjunctive, conditional
Number {NM:1; NM:2} respectively: singular, plural
Person {PR:1; PR:2; PR:3} respectively: 1st person, 2nd person, 3rd person
Time {TM:R; TM:S; TM:F} respectively: present, past, future
Accusative {ACC:1} respectively: true
Polite {PL:1} respectively: true
Possessor {PR:1; PR:2; PR:3} respectively: 1st person, 2nd person, 3rd person
Possessor-Number {PNM:1; PNM:2} respectively: singular, plural
Case {CS:NOM; CS:ACC; CS:DAT; CS:INS; CS:SOC; CS:FAC; CS:CAU; CS:ILL; CS:SUB; CS:ALL; CS:INE; CS:SUP; CS:ADE; CS:ELA; CS:DEL; CS:ABL; CS:TER; CS:FOR; CS:TEM} respectively: Nominative, Accusative, Dative-genitive, Instrumental, Essive-modal, Translative, Causal-final, Illative, Sublative, Allative, Inessive, Superessive, Adessive, Elative, Delative, Ablative, Terminative, Formal, Temporal
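
A minimal sketch of encoding such a state as feature-value pairs; the enum is truncated to the features in the tables above, and the class name is illustrative:

import java.util.EnumMap;
import java.util.Map;

// Encodes a (sparse) generalized morphological state as feature -> value
// pairs; only the features from the Hungarian tables are shown.
public class MorphState {
    public enum Feature { CT, DF, MD, NM, PR, PNM, TM, ACC, PL, CS }

    private final Map<Feature, String> values = new EnumMap<>(Feature.class);

    public MorphState set(Feature f, String value) {
        values.put(f, value);
        return this; // fluent, so states read like the tables above
    }

    @Override
    public String toString() {
        return values.toString(); // e.g. {CT=VB, MD=I, NM=1, PR=3, TM=R}
    }

    public static void main(String[] args) {
        // "he reads": indicative present 3rd-person singular verb
        System.out.println(new MorphState()
            .set(Feature.CT, "VB").set(Feature.MD, "I")
            .set(Feature.NM, "1").set(Feature.PR, "3").set(Feature.TM, "R"));
    }
}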

Uses

  1. Enumeration.
    Develop a language-independent view of morphology, with a chief application in MT. In this sense morphology is viewed as a semigroup generated as a Cartesian product of its feature subsets.
  2. Compression.
    Since in reality feature availability varies across languages, the matrix is not only sparse but, within a language, feature availability is also codependent (a verb has time but a noun does not). Therefore a (minimal) sparse matrix can be extracted and used to compress morphological state. But for such a compression scheme to be created, it is necessary to collect statistics showing lemma frequency and feature dependency.
  3. IR.
    By supplying a lemma-id and a morphological state, one can provide superior search capabilities in certain languages (a token filter sketch follows).
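
A minimal sketch of how a Lucene token filter could expose lemma-id plus morphological state to the index; the MorphLookup interface is hypothetical, while the attribute handling follows the standard Lucene TokenFilter pattern:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Replaces each surface form with "lemmaId|morphState" so queries can
// match on the lemma, the state, or both. MorphLookup is hypothetical.
public final class LemmaMorphFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final MorphLookup lookup;

    public LemmaMorphFilter(TokenStream input, MorphLookup lookup) {
        super(input);
        this.lookup = lookup;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        MorphLookup.Entry e = lookup.analyze(termAtt.toString());
        if (e != null) {
            termAtt.setEmpty().append(e.lemmaId()).append('|').append(e.morphState());
        }
        return true;
    }

    // Minimal interface standing in for the Wiktionary-derived lexicon.
    public interface MorphLookup {
        record Entry(String lemmaId, String morphState) {}
        Entry analyze(String surfaceForm);
    }
}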