This is the archived website of the ASV. More current information is available at temir.org and via the search function on uni-leipzig.de


Finding and Analyzing Social Networks in unstructured web log data using probabilistic topic modeling (Master's thesis)

Status: completed
Submission date: 2009-12-23


External master's thesis in cooperation with the Max Planck Institute for Evolutionary Anthropology. The supervisor at the MPI was Colin Bannard.


Web logs and other platforms used to organize a social life online have been
enormously successful over the last few years. Unlike applications designed specifically
for building and visualizing social networks, web logs consist mostly of unstructured
text data accompanied by some metadata, such as the author of the text, its
publication date, the URL it is available under, and the web log platform it originates
from. Chapter 1 covers some basics of web logs and describes this data. Chapter 2
discusses and applies a way to extract networks between authors using this metadata,
covers the required theoretical background in graph theory, and shows that the
resulting networks exhibit the small-world phenomenon.
The main question posed in this thesis, discussed in chapters 3 and 4, is whether
these networks can be inferred not from the available metadata but by pure natural
language analysis of the text content, which would allow inference of these networks
without any metadata at hand. Different techniques are used for this, namely a simple
frequentist model based on the “bag-of-words” assumption and so-called topic models,
which make use of Bayesian probability theory. The topic models used are Latent Dirichlet
Allocation and an extension of it, the Author-Topic model. All of these techniques
and their foundations are described thoroughly and applied to the available data. The
thesis then examines the possibility of predicting the distance between two authors of
web log texts in a social network by comparing term frequency vectors (bag-of-words) or
probability distributions produced by the topic models under different metrics. After
comparing these different techniques, a new model, also building on Latent Dirichlet
Allocation, is introduced in the last chapter, together with possible ways to improve
the prediction of social networks based on content analysis.
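
The comparison step described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the abstract does not name the metrics used, so cosine distance (for term frequency vectors) and Jensen-Shannon divergence (for topic distributions) are assumptions chosen for illustration, and all function names, texts, and topic distributions below are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Bag-of-words: count lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two term frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions over the same topics."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example data: two authors' texts and made-up topic distributions
# of the kind a topic model might assign to each author.
author_a_text = "topic models describe documents as mixtures of topics"
author_b_text = "graph theory describes networks of authors"
author_a_topics = [0.7, 0.2, 0.1]
author_b_topics = [0.1, 0.3, 0.6]

bow_distance = cosine_distance(term_frequencies(author_a_text),
                               term_frequencies(author_b_text))
topic_distance = jensen_shannon_divergence(author_a_topics, author_b_topics)
print(f"bag-of-words cosine distance: {bow_distance:.3f}")
print(f"topic Jensen-Shannon divergence: {topic_distance:.3f}")
```

In a setup like this, such content-based distances would be compared against the path lengths between the same authors in the metadata-derived network; how well they agree is the question the thesis investigates.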

Author: Patrick Jähnichen
Advisor: Prof. Dr. Gerhard Heyer
Thesis (PDF)