### Automatische Extraktion von Termhierarchien aus Dokumentenkollektionen für die semantische Strukturierung (Extraction of Term Hierarchies for the Semantic Structuring of Document Collections) (Diplomarbeit)

**Status:** completed

**Submission date:** 2007-01-11

**Abstract:**

For interactive use of large document sets, such as browsing and searching, a thematic structure within the sets can be helpful. To get an overview of a certain topic, for instance, one might use a hierarchically structured document set that gives direct access to thematically general documents located in the upper nodes. If interest in a particular subtopic develops, there is also direct access to the more specialised documents covering that subtopic.

To give a proof of concept for deriving such thematic structures from formerly unstructured document sets, algorithms for structuring documents and extracting information were evaluated. Before the question of which algorithms are appropriate could be answered, two decisions had to be made: What is the thematic structure (and how is it represented)? And which properties of the documents and the document sets are to be used?

The approach used to represent the underlying thematic structure is not a classical ontological one but a so-called term hierarchy, which is a monohierarchy (a graph-theoretic tree) of sets of words. The edges of the tree represent the relation “more general”: the terms associated with a node describe a thematic field more generally and broadly than the terms associated with a subnode of that node. Terms associated with the same node belong to the same thematic field at the same level of generality and abstraction.
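The tree-of-term-sets structure described above can be illustrated with a minimal sketch. The class name `TermNode`, the helper `path_to_term`, and the example terms are assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class TermNode:
    """One node of a term hierarchy: a set of terms plus more specific subnodes."""
    terms: frozenset
    children: list = field(default_factory=list)

    def add_child(self, child):
        # The edge parent -> child stands for "parent is more general than child".
        self.children.append(child)
        return child


def path_to_term(node, term, path=()):
    """Return the chain of term sets from the root down to the node holding `term`."""
    path = path + (node.terms,)
    if term in node.terms:
        return path
    for child in node.children:
        found = path_to_term(child, term, path)
        if found is not None:
            return found
    return None


# Hypothetical example hierarchy: general terms at the root, specific ones below.
root = TermNode(frozenset({"science"}))
physics = root.add_child(TermNode(frozenset({"physics", "mechanics"})))
physics.add_child(TermNode(frozenset({"quantum", "photon"})))
root.add_child(TermNode(frozenset({"biology", "organism"})))

photon_path = path_to_term(root, "photon")
```

Here `photon_path` lists the term sets from most general to most specific, which mirrors how a reader would descend from an overview node to a specialised subtopic.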

Statistical properties of the distribution of terms over documents are suitable for extracting these term hierarchies from the document set. The implementation works robustly and is largely independent of language-specific features such as syntax and inflection, and it does not presuppose particular markers within the text such as headings or keywords. Documents are therefore represented by term vectors in a vector space model. The statistical extraction of the terms that describe a document is based on Zipf’s law and on differential analysis of the term frequency within the document relative to the term frequency in a reference corpus.
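One simple way to picture the differential analysis is a ratio of relative frequencies: a term is descriptive if it is much more frequent in the document than in the reference corpus. The sketch below assumes this plain ratio with add-one smoothing; the exact measure used in the thesis is not stated in the abstract, and the function name, parameters, and toy counts are hypothetical.

```python
from collections import Counter


def descriptive_terms(doc_tokens, ref_counts, ref_total, top_k=3, min_count=2):
    """Rank terms by (relative frequency in document) / (relative frequency in reference)."""
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    scores = {}
    for term, count in doc_counts.items():
        if count < min_count:
            continue  # ignore terms too rare in the document to be reliable
        doc_rel = count / doc_total
        # Add-one smoothing (an assumption) avoids division by zero for unseen terms.
        ref_rel = (ref_counts.get(term, 0) + 1) / (ref_total + 1)
        scores[term] = doc_rel / ref_rel
    # Highest ratio first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical reference corpus counts: function words dominate, as Zipf's law suggests.
ref_counts = Counter({"the": 1000, "of": 600, "cluster": 2, "term": 5})
ref_total = 10000

doc = "the cluster of the term cluster the hierarchy cluster term".split()
top = descriptive_terms(doc, ref_counts, ref_total, top_k=2)
```

In this toy example the frequent but uninformative word "the" scores low because it is just as common in the reference corpus, while "cluster" and "term" stand out.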

The whole document set is structured by classifying every document into the term hierarchy. This results in a document hierarchy with the same graph-theoretic structure as the term hierarchy, but with document sets associated with the nodes.
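The classification step can be sketched as follows. The abstract does not state the classification rule, so this example assumes the simplest one: each document goes to the node whose term set shares the most terms with it. The nested-dict tree layout, the function names, and the example data are all hypothetical.

```python
def best_node(doc_terms, tree):
    """Return the label of the node whose terms overlap most with doc_terms.

    tree is a nested dict: {"label": str, "terms": set, "children": [subtrees]}.
    Nodes are visited in pre-order and only a strictly better score replaces
    the current best, so ties go to the node visited first.
    """
    best_label, best_score = tree["label"], len(doc_terms & tree["terms"])
    stack = list(reversed(tree["children"]))
    while stack:
        node = stack.pop()
        score = len(doc_terms & node["terms"])
        if score > best_score:
            best_label, best_score = node["label"], score
        stack.extend(reversed(node["children"]))
    return best_label


def document_hierarchy(docs, tree):
    """Map every node label to the list of document ids classified into it."""
    assignment = {}

    def init(node):
        assignment[node["label"]] = []
        for child in node["children"]:
            init(child)

    init(tree)
    for doc_id, terms in docs.items():
        assignment[best_node(terms, tree)].append(doc_id)
    return assignment


tree = {
    "label": "science", "terms": {"science", "research"}, "children": [
        {"label": "physics", "terms": {"physics", "energy", "particle"}, "children": [
            {"label": "quantum", "terms": {"quantum", "photon", "particle"}, "children": []},
        ]},
        {"label": "biology", "terms": {"cell", "organism"}, "children": []},
    ],
}
docs = {
    "d1": {"science", "research", "funding"},
    "d2": {"photon", "quantum", "energy"},
    "d3": {"cell", "organism", "energy"},
}
hierarchy = document_hierarchy(docs, tree)
```

The result has one entry per node of the term hierarchy, so the document hierarchy keeps the same tree shape, with some nodes possibly left empty.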

The proof-of-concept approach leads to the use of simple but representative algorithms for different kinds of processing of vector space data. On the one hand, hierarchical agglomerative clustering (HAC) was evaluated as a representative of the clustering approach. On the other hand, probabilistic latent semantic analysis (PLSA) represents the latent concept approach. Neither algorithm is designed to structure sets hierarchically (provided HAC is treated as a pure clustering algorithm whose output is a set of clusters; otherwise it would not be representative of clustering in general). To obtain a hierarchical structure, the algorithms are iterated until the all-subsuming root is reached.
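The idea of iterating a flat clustering step until a single root remains can be sketched as below. This is an illustration under stated assumptions, not the thesis implementation: it uses centroid-based cosine similarity, halves the number of clusters per level (the hypothetical `shrink` parameter), and works on toy sparse term vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts term -> weight."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {t: w / len(vectors) for t, w in total.items()}


def flat_hac(vectors, k):
    """Merge the two most similar clusters until k remain; return lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid([vectors[i] for i in clusters[a]]),
                             centroid([vectors[i] for i in clusters[b]]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def build_hierarchy(docs, shrink=2):
    """Apply flat HAC level by level; each cluster becomes a parent node."""
    nodes = [(label, docs[label]) for label in sorted(docs)]  # (subtree, vector)
    while len(nodes) > 1:
        k = max(1, len(nodes) // shrink)
        groups = flat_hac([vec for _, vec in nodes], k)
        nodes = [(tuple(nodes[i][0] for i in group),
                  centroid([nodes[i][1] for i in group])) for group in groups]
    return nodes[0][0]


docs = {
    "a": {"quantum": 2.0, "photon": 1.0},
    "b": {"quantum": 2.0, "particle": 1.0},
    "c": {"cell": 1.0, "organism": 1.0},
    "d": {"cell": 1.0, "gene": 1.0},
}
tree = build_hierarchy(docs)
```

Each pass treats the clusters of the previous level as the items to be clustered, so the loop ends when everything has been subsumed under one root. The same outer loop could wrap a latent-concept method instead of `flat_hac`.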

The resulting document hierarchies were evaluated against a previously known document hierarchy, which was deliberately “lost” to obtain the document set used as input for HAC and PLSA. The evaluation results showed that the proof of concept was successful: automatic hierarchical thematic structuring of document sets can be done without language-specific knowledge. Several parameters have some influence, but the main factor was the kind of text the documents consisted of. For more scientific documents, with a higher proportion of exact terminology that discriminates between documents of different thematic fields, the results were much better than for documents about everyday events taken from a tabloid magazine.

**Author:** Florian Holz

**Advisor:** Prof. Dr. Gerhard Heyer | Dr. Hans Friedrich Witschel

Thesis