Information Extraction on the Semantic Web

Dealing with information today requires users to cope with hundreds of thousands of documents, such as articles, emails, Web pages, and news feeds. Of all information sources, the World Wide Web presents information seekers with the greatest challenges: it offers more natural language text than anyone is able to read. The key idea of this research is to provide users with adaptable filtering techniques that help them filter out the specific information items they need. Its realization focuses on developing an Information Extraction system that adapts to a domain of concern by interpreting formalized knowledge about that domain. Using the Resource Description Framework (RDF), the Semantic Web's formal language for exchanging information, makes it possible to extend information extractors so that they incorporate the given domain knowledge. As a result, formal information items from the RDF source can be recognized in the text. The use of RDF also enables further operations on recognized information items, such as disambiguating them and rating their relevance. Switching between RDF sources changes the application scope of the Information Extraction system from one domain of concern to another. An RDF-based Information Extraction system can be directed to extract specific kinds of information entities by providing it with formal queries in the SPARQL query language. Representing extracted information in RDF extends the amount of information covered by the Semantic Web and provides a formal view of a text from the perspective of the RDF source.

In detail, this work extends existing Information Extraction approaches by incorporating the graph-based nature of RDF. Pre-processing RDF sources yields statistical information models designed to support specific information extractors. These information extractors refine standard extraction tasks, such as Named Entity Recognition, by using the information provided by the pre-processed models. Post-processing the extracted information items makes it possible to represent the results in RDF or as lists, which can then be ranked or filtered by relevance. Post-processing also includes enriching the original natural language text sources with extracted information items by means of RDFa annotations.

The results of this research extend the state of the art of the Semantic Web. This work contributes approaches for computing customizable and adaptable RDF views on the natural language content of Web pages. Finally, because RDF is a formal language, machines can interpret these views, allowing developers to process the contained information in a variety of applications.

Metadata
Author: Benjamin Adrian
URN: urn:nbn:de:hbz:386-kluedo-31763
Advisor: Andreas Dengel, Philipp Cimiano
Document Type: Doctoral Thesis
Language of publication: English
Date of Publication (online): 2012/06/21
Year of first Publication: 2012
Publishing Institution: Technische Universität Kaiserslautern
Granting Institution: Technische Universität Kaiserslautern
Acceptance Date of the Thesis: 2012/05/02
Date of the Publication (Server): 2012/06/22
Tag: Information Extraction; Natural Language Processing; Semantic Web
Page Number: 238
Faculties / Organisational entities: Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science): H. Information Systems / H.3 INFORMATION STORAGE AND RETRIEVAL / H.3.1 Content Analysis and Indexing
DDC-Classification: 0 General works, computer science, information science / 004 Computer science
Licence (German): Standard according to the KLUEDO guidelines of 15.02.2012