
Ontology learning


Machine Learning Techniques for

Automatic Ontology Extraction from Domain Texts

Janardhana R. Punuru

Jianhua Chen

Computer Science Dept.

Louisiana State University, USA


Concept extraction

Taxonomical relation learning

Non-taxonomical relation learning

Conclusions and Future Works

An ontology for a domain D is a conceptualisation of D, or simply, a data model describing D. An ontology typically consists of:

• A list of concepts important for domain D

• A list of attributes describing the concepts

• A list of taxonomical (hierarchical) relationships among these concepts

• A list of (non-hierarchical) semantical relationships among these concepts

Concepts: person, voter, worker, poll watcher, location, county, precinct, vote, ballot, machine, voting machine, manufacturer, etc.

Attributes: name of person, model of machine, etc.

Taxonomical relations:

• Voter is a person; precinct is a location; voting machine is a machine, etc.

Non-hierarchical relations:

• Voter cast ballot; voter trust machine; county adopt machine; equipment miscount ballot, etc.

Knowledge representation and knowledge management systems

Intelligent query-answering systems

Information retrieval and extraction

Semantic Web

• Web pages annotated with ontologies

• User queries for Web pages analysed at the knowledge level and answered by inference on ontological knowledge


Unstructured texts

Ambiguity in English text

• Multiple senses of a word

• Multiple parts of speech; e.g., "like" can occur in 8 parts of speech:

• Verb: "Fruit flies like banana"

• Noun: "We may not see its like again"

• Adjective: "People of like tastes agree"

• Adverb: "The rate is more like 12 percent"

• Preposition: "Time flies like an arrow"


Lack of closed domain of lexical categories

Noisy texts

Requirement of very large training text sets

Lack of standards in text processing

Lack of standards in knowledge representation

Lack of fully automatic techniques for knowledge acquisition (KA)

Lack of techniques for coverage of whole texts

Existing techniques typically consider word frequencies, co-occurrence statistics, and syntactic patterns, but ignore other useful information in the texts

Full-fledged natural language understanding is still computationally infeasible for large text collections

Frequency-based methods

• Text-To-Onto [Maedche & Volz 2001]

Use syntactic patterns and extract concepts matching the patterns

• [Paice & Jones 1993]

Use WordNet

• [Gelfand et al. 2004]: start from a base word list; for each word w in the list, add the hypernyms and hyponyms of w in WordNet to the list

Parts-of-speech tagging and NP chunking

Morphological processing: word stemming, converting words to their root form

Stopword removal

Focus on the top n% most frequent NPs

Focus on NPs with fewer WordNet senses
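The concept-selection step just described (a top-frequency cut followed by a WordNet sense filter) can be sketched as below. This is a minimal illustration, not the authors' code: the function name and the `sense_count` dictionary are hypothetical stand-ins for real WordNet lookups.

```python
# Sketch: keep the top-n% most frequent noun phrases, then keep only those
# with fewer than a given number of senses (stand-in for a WordNet check).
from collections import Counter

def select_concepts(noun_phrases, sense_count, top_pct=10, max_senses=4):
    """noun_phrases: list of NP strings (with repeats, as found in the texts).
    sense_count: dict mapping an NP to its (hypothetical) number of senses."""
    freq = Counter(noun_phrases)
    ranked = [np for np, _ in freq.most_common()]
    cutoff = max(1, len(ranked) * top_pct // 100)
    top = ranked[:cutoff]
    # Keep only NPs that are unambiguous enough.
    return [np for np in top if sense_count.get(np, 0) < max_senses]

nps = ["voting machine"] * 5 + ["ballot"] * 4 + ["thing"] * 3 + ["voter"] * 2
senses = {"voting machine": 1, "ballot": 2, "thing": 12, "voter": 1}
print(select_concepts(nps, senses, top_pct=75))  # "thing" is filtered out
```

Both filters are cheap; the sense filter is what removes frequent but highly ambiguous words such as "thing".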

General lexical knowledge base

Contains ~ 150,000 words (noun, verb, adj, adv)

A word can have multiple senses: “plant” as a noun has 4 senses

Each concept (under each sense and PoS) is represented by a set of synonyms (a syn-set).

Semantic relations such as hypernym/antonym/meronym of a syn-set are represented

15 documents from New York Times


Contains more than 10,000 words

Pre-processing produced 768 distinct noun phrases (concepts)

• 329 relevant to electronic voting

• 439 irrelevant

• POS Tagging: Many/JJ local/JJ election/NN officials/NNS and/CC voting/NN machine/NN ...

• NP Chunking: [ Many/JJ local/JJ election/NN officials/NNS ] and/CC [ voting/NN machine/NN companies/NNS ] are/VBP fighting/VBG [ paper/NN trails/NN ] in/IN [ part/NN ] because/IN ...

• Stopword Elimination: local/JJ election/NN officials/NNS, voting/NN machine/NN ...

• Morphological Analysis: local election official, voting machine company, paper trail, part, work, ...

Take the top n% of NPs, and select only those with fewer than 4 senses in WordNet ==> obtain T, a set of noun phrases

Make a base list L of words from T

PE: add to T any noun phrase np from NP if the head word (ending word) of np is in L

POP: add to T any noun phrase np from NP if some word in np is in L
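The PE and POP expansion rules above can be sketched as follows. This is an illustrative toy implementation (the function name and sample data are invented for the example), not the WNSCA code itself.

```python
# Minimal sketch of the PE and POP expansion steps (names from the slides).
# T is the initial trusted set of noun phrases; NP is the full candidate set.
def expand(T, NP, mode="PE"):
    L = {w for np in T for w in np.split()}  # base word list built from T
    out = set(T)
    for np in NP:
        words = np.split()
        if mode == "PE" and words[-1] in L:                 # head word in L
            out.add(np)
        elif mode == "POP" and any(w in L for w in words):  # any word in L
            out.add(np)
    return out

T = {"voting machine", "ballot"}
NP = ["touch screen machine", "paper ballot", "machine company", "exit poll"]
print(sorted(expand(T, NP, "PE")))   # adds NPs ending in a word from L
print(sorted(expand(T, NP, "POP")))  # adds NPs containing any word from L
```

PE is the stricter rule ("machine company" is not added, since its head word "company" is unknown), which matches the slides' later observation that WNSCA gives good precision while WNSCA + POP gives good recall.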




[Figures: precision and recall of the extracted concept set T plotted against the frequency threshold]

TF*IDF: Term Frequency * Inverse Document Frequency

TF*IDF(ti, dj) = fij * log(|D| / |Di|)

|D|: total number of documents
|Di|: number of documents containing term ti
fij: frequency of term ti in document dj
TF*IDF(ti, dj): TF*IDF measure for term ti in document dj
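As a quick sketch, the standard TF*IDF measure fij * log(|D| / |Di|) can be computed as below (function name and toy documents are invented for the example):

```python
# Sketch of the TF*IDF computation: fij * log(|D| / |Di|).
import math

def tfidf(term, doc_index, docs):
    """docs: list of documents, each a list of tokens."""
    f_ij = docs[doc_index].count(term)                 # fij
    d_i = sum(1 for d in docs if term in d)            # |Di|
    if d_i == 0:
        return 0.0
    return f_ij * math.log(len(docs) / d_i)

docs = [["voter", "ballot", "ballot"], ["voter", "machine"], ["machine"]]
print(round(tfidf("ballot", 0, docs), 3))  # frequent here, rare elsewhere
print(round(tfidf("voter", 0, docs), 3))   # appears in 2 of 3 documents
```

Terms that are frequent in one document but rare across the collection score highest, which is why TF*IDF can separate domain-relevant concepts from generally common words.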








[Table: Retrieved, Retrieved & Relevant, Precision, Recall, and F-measure for each method]

TNM Corpus: 270 texts in the TIPSTER Vol. 1 data from NIST: 3 years (87, 88, 89) news articles from Wall Street Journal, in the category of “Tender offers, Mergers and Acquisitions”

30 MB in size

183,348 concepts extracted; only the top 10% most frequent ones were used in the experiments

Manually labeled these 18,334 concepts: only 3,388 concepts are relevant

Use the top 1% most frequent concepts as the initial cut







A taxonomy: an "is-a" hierarchy on concepts

Existing approaches:

• Hierarchical clustering: Text-To-Onto, but this requires users to manually label the internal nodes

• Use lexico-syntactic patterns [Hearst 1992, Iwanska 1999]: "musical instruments, such as piano and violin ..."

• Use seed concepts and semantic variants [Morin & Jacquemin 2003]: "An apple is a fruit" ==> "Apple juice is fruit juice"

3 techniques for taxonomy extraction:

• Compound term heuristic: "voting machine" is a machine

• WordNet-based method: needs word sense disambiguation (WSD)

• Supervised learning (Naive Bayes) for semantic class labeling (SCL) of concepts
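The compound term heuristic is simple enough to sketch directly: a multi-word concept is taken to be a kind of its head word, e.g. "voting machine" is-a "machine". The function name and sample concepts below are invented for the illustration.

```python
# Sketch of the compound-term heuristic: a multi-word concept is a hyponym
# of its head (last) word, when that head word is itself a known concept.
def compound_isa(concepts):
    pairs = []
    for c in concepts:
        words = c.split()
        if len(words) > 1 and words[-1] in concepts:
            pairs.append((c, words[-1]))  # (hyponym, hypernym)
    return pairs

concepts = {"voting machine", "machine", "poll worker", "worker", "ballot"}
print(sorted(compound_isa(concepts)))
# [('poll worker', 'worker'), ('voting machine', 'machine')]
```

The guard that the head word must itself be an extracted concept keeps the heuristic from inventing hypernyms that never occur in the domain texts.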

Given: semantic classes T ={T1, ..., Tk } and concepts C = { C1, ..., Cn}

Find: a labeling L: C --> T, namely, L(c) identifies the semantic class of concept c for each c in C.

For example, C = {voter, poll worker, voting machine} and T = {person, location, artifacts}

Four attributes are used to describe any concept:

1. The last 2 characters of the concept

2. The head word of the concept

3. The pronoun following the concept

4. The preposition preceding the concept

Naïve Bayes classifier: given an instance x = <a1, ..., an> and a set of classes Y = {y1, ..., yk}


NB(x) = argmax_{y ∈ Y} Pr(y) · ∏j Pr(aj | y)
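A minimal Naive Bayes classifier over the four attributes can be sketched as below. The training examples and attribute values are toy stand-ins (the slides do not show the real feature extraction), and add-one smoothing is an assumption; only the argmax-of-products structure follows the formula above.

```python
# Minimal Naive Bayes sketch for semantic class labeling (SCL).
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: list of (attribute_tuple, class_label)."""
    prior = Counter(y for _, y in examples)
    cond = defaultdict(Counter)          # (position, class) -> value counts
    for x, y in examples:
        for j, a in enumerate(x):
            cond[(j, y)][a] += 1
    return prior, cond, len(examples)

def classify(x, prior, cond, n):
    def score(y):  # log Pr(y) + sum_j log Pr(aj | y)
        s = math.log(prior[y] / n)
        for j, a in enumerate(x):
            c = cond[(j, y)]
            s += math.log((c[a] + 1) / (sum(c.values()) + 1))  # add-one smoothing
        return s
    return max(prior, key=score)

# Toy instances: (last 2 chars, head word, following pronoun, preceding prep.)
train_data = [
    (("er", "voter", "he", "by"), "person"),
    (("er", "worker", "she", "by"), "person"),
    (("ne", "machine", "it", "on"), "artifact"),
    (("ot", "ballot", "it", "on"), "artifact"),
]
model = train(train_data)
print(classify(("er", "observer", "he", "by"), *model))  # → person
```

Even with an unseen head word ("observer"), the "-er" suffix, the pronoun "he", and the preposition "by" push the classifier toward the person class, which is exactly the intuition behind the four attributes.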


• 622 instances, 6-fold cross-validation: 93.6% prediction accuracy

• Larger experiment: instances drawn from WordNet

• 2326 in the person category

• 447 in the artifacts category

• 196 in the location category

• 223 in the action category

2624 instances from the Reuters data, 6-fold cross-validation produced 91.0% accuracy

Reuters data: 21,578 Reuters newswire articles from 1987

We focus on learning non-hierarchical relations of the form <Ci, R, Cj>

Here R is a non-hierarchical relation, and Ci, Cj are concepts. Example relations:

<voter, cast, ballot>

<official, tell, voter>

<machine, record, ballot>

Non-hierarchical relation learning is relatively less tackled

Several works on this problem make restrictive assumptions:

• Define a fixed set of concepts, then look for relations among these concepts

• Define a fixed set of non-hierarchical relations, then look for concept pairs satisfying these relations

Syntactic structure of the form (subject, verb, object) is often used

• Use a pre-defined set of relations

• Extract concept pairs satisfying such a relation

• Use a chi-square test to verify the statistical significance

• Experimented with molecular biology domain texts

Schutz and Buitelaar (2004):

• Also use a pre-defined set of relations

• Build triples from concept pairs and relations

• Experimented with football domain texts

Kavalec et al. (2004):

• No pre-defined set of relations

• Use an "above expectation" (AE) measure to estimate the strength of a triple

• Experimented with tourism domain texts

We have also implemented the AE measure for the purpose of performance comparisons

The framework of our method:


• Domain concepts C are extracted using WNSCA + PE/POP

Concept pairs are obtained in two ways:

• RCL: consider pairs (Ci, Cj), both from C, occurring together in at least one sentence

• SVO: consider pairs (Ci, Cj), both from C, occurring as subject and object in a sentence

Both use the log-likelihood ratio to choose good pairs

Focus on verbs specific to the domain

Filter out overly general verbs such as "do", "is"

VF*ICF(V) = VF(V) * log(|C| / CF(V))    (2)

|C|: total number of concepts
VF(V): number of occurrences of V in all domain texts
CF(V): number of concepts occurring in the same sentence as V

Verb      VF*ICF(V)
produce   25.010
check     24.674
ensure    23.971
purge     23.863
create    23.160
include   23.160
say       23.151
restore   23.088
certify   23.047
pass      23.047
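As a rough sketch, a VF*ICF score of the TF*IDF-analogous form VF(V) * log(|C| / CF(V)) can be computed as below. The function name, sentence representation (single-token concepts only), and sample data are assumptions made for the illustration.

```python
# Sketch of a VF*ICF measure: VF(V) * log(|C| / CF(V)).
# Sentences are token lists; concepts here are single tokens for simplicity.
import math

def vf_icf(verb, sentences, concepts):
    vf = sum(s.count(verb) for s in sentences)       # VF(V): verb frequency
    cf = len({c for s in sentences if verb in s      # CF(V): distinct concepts
                for c in concepts if c in s})        # co-occurring with V
    if vf == 0 or cf == 0:
        return 0.0
    return vf * math.log(len(concepts) / cf)

sentences = [["voter", "cast", "ballot"], ["county", "cast", "vote"],
             ["machine", "record", "ballot"]]
concepts = {"voter", "ballot", "county", "vote", "machine"}
print(round(vf_icf("cast", sentences, concepts), 3))
print(round(vf_icf("record", sentences, concepts), 3))
```

The ICF factor penalizes verbs that co-occur with many different concepts, which is how overly general verbs like "say" or "do" get pushed down the ranking.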

• Candidate triples (C1, V, C2):

• (C1, C2) is a candidate concept pair (by the log-likelihood measure)

• V is a candidate verb (by the VF*ICF measure)

• The triple occurs in a sentence

• Question: is the co-occurrence of V and the pair (C1, C2) accidental?

• Consider the following two hypotheses: H1, that occurrences of V are independent of the pair (C1, C2), and H2, that they are dependent



S(C1, C2): set of sentences containing both C1 and C2
S(V): set of sentences containing V





Log-likelihood ratio: -2 log λ = -2 [log L(H1) - log L(H2)]








For each concept pair (C1, C2), select the verb V with the highest value of -2 log λ.
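One common (Dunning-style) instantiation of such a log-likelihood ratio test is sketched below; the slides do not show the exact likelihood functions, so the binomial formulation and the sample counts are assumptions made for the illustration.

```python
# Sketch of a log-likelihood ratio test for verb/concept-pair association.
# H1: V occurs at one shared rate in sentences with and without (C1, C2).
# H2: the two rates are allowed to differ.
import math

def log_binom(k, n, p):
    """Log-likelihood of k successes in n trials under rate p."""
    if p <= 0 or p >= 1:
        return 0.0 if (k == 0 or k == n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """k1/n1: V-sentences among sentences containing (C1, C2);
    k2/n2: V-sentences among the remaining sentences."""
    p = (k1 + k2) / (n1 + n2)        # H1: one shared rate
    p1, p2 = k1 / n1, k2 / n2        # H2: separate rates
    h1 = log_binom(k1, n1, p) + log_binom(k2, n2, p)
    h2 = log_binom(k1, n1, p1) + log_binom(k2, n2, p2)
    return -2 * (h1 - h2)

# "cast" appears in 8 of 10 sentences containing (voter, ballot), but in
# only 5 of the 90 other sentences: a strongly associated verb.
print(round(llr(8, 10, 5, 90), 2))
```

When V occurs at the same rate with and without the pair, the statistic is 0; the larger the value, the less plausible it is that the co-occurrence is accidental, so the verb with the highest -2 log λ is chosen as the relation label.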

Recap: E-voting domain

• 15 articles from the New York Times

• More than 10,000 distinct English words

• 164 relevant concepts were used in the experiments

For VF*ICF validation:

• First removed stop words

• Then applied the VF*ICF measure to sort the verbs

• Took the top 20% of the sorted list as relevant verbs

• Achieved 57% precision with the top 20%

Criteria for evaluating a triple (C1, V, C2):

• C1 and C2 are related non-hierarchically

• V is a semantic label for either the C1→C2 or the C2→C1 direction of the relation

• V is a semantic label for C1→C2 but not for C2→C1

Table II Example concept pairs (C1, C2)

(election, official)

(company, voting machine)

(ballot, voter)

(manufacturer, voting machine)

(polling place, worker)

(polling place, precinct)

(poll, security)

Table III RCL method example triples

Table IV SVO method example triples

Table V Accuracy comparisons

• Presented techniques for automatic ontology extraction from texts

• Combination of a knowledge base (WordNet), machine learning, information retrieval, syntactic patterns, and heuristics

• For concept extraction, WNSCA gives good precision and WNSCA + POP gives good recall

• For taxonomy extraction, SCL and the compound word heuristic are quite useful. The naïve Bayes classifier works well for SCL

• For non-taxonomy extraction, the SVO method has good accuracy, but

• it requires syntactic parsing

• its coverage (recall) is not good

• Both WNSCA and SVO are unsupervised methods whereas SCL is a supervised one; what about unsupervised SCL?

• The quality of the extracted concepts heavily influences subsequent ontology extraction tasks

• A better word sense disambiguation method would help produce better taxonomy extraction results using WordNet

• Consideration of other syntactic/semantic information may be needed to further improve non-taxonomical relation extraction:

• Prepositional phrases

• Use WordNet

• Incorporate other knowledge

• More experiments with larger text collections

I am grateful to the CSC Department of UNC Charlotte for hosting my visit.

Special thanks to Dr. Zbigniew Ras for his inspiration and continuous support over many years.
