Finding and Analyzing Social Networks in unstructured web log data using probabilistic topic modeling (Master's thesis)
Status: completed
Submission date: 2009-12-23
Description:
External Master's thesis in cooperation with the Max Planck Institute for Evolutionary Anthropology. The supervisor at the MPI was Colin Bannard.
Abstract: Web logs and other platforms used to organize a social life online have become
enormously successful over the last few years. In contrast to applications designed specifically for
building and visualizing social networks, web logs consist mostly of unstructured
text data accompanied by some metadata, such as the author of the text, its
publication date, the URL under which it is available, and the web log platform it originates
from. Some basics on web logs and a description of such data are given in chapter 1. A
way to extract networks between authors using this metadata is discussed and
applied in chapter 2. The required theoretical background in graph theory is covered
in the same chapter, and it is shown that the networks exhibit the small-world phenomenon.
The main question posed in this thesis, discussed in chapters 3 and 4, is whether
these networks can be inferred not from the available metadata but purely from natural
language analysis of the text content, which would allow inference of these networks without any
metadata at hand. For this, different techniques are used, namely a simple frequentist
model based on the “bag-of-words” assumption and so-called topic models, which make
use of Bayesian probability theory. The topic models used are Latent Dirichlet
Allocation and an extension of it, the Author-Topic model. All these techniques
and their foundations are thoroughly described and applied to the available data. After
this, the possibility of predicting the distance between two authors of web log texts in
a social network is examined by comparing term frequency vectors (bag-of-words) or probability distributions
produced by the topic models under different metrics. After comparing
these techniques, a new model, also building on Latent Dirichlet Allocation,
is introduced in the last chapter, together with possible ways to improve the prediction of
social networks based on content analysis.
Advisor: Prof. Dr. Gerhard Heyer
