Publications



Despite its significant cultural, religious, and political impact, the Arabic language has received limited attention from modern computational linguistics in terms of developing quality annotated corpora, linguistic tools, and techniques. This deficiency hinders mining, comprehending, and analyzing the information repositories available in Arabic, and it poses a challenge to studying the content generated by a population of nearly 500 million people worldwide. In addition, the scarcity of linguistic resources is a fundamental obstacle to developing more advanced Natural Language Processing (NLP) and Machine Learning (ML) approaches towards more cognitive computing. The Arabic language has three primary styles: Modern Standard Arabic (MSA), Classical Arabic (CA), and Arabic dialects (AD). The existing linguistic resources, although limited, have mostly been developed for MSA and seldom for CA, whereas AD is very rarely targeted by the NLP research community. One reason for this deficiency may be that dialectal content was unavailable in the past, since the dialects were mostly spoken rather than written. However, this no longer holds, since AD content can easily be compiled from social media, Web forums, and blogs.

In this dissertation, we contribute to the state-of-the-art resources for the Arabic language in general, and for AD in particular, by developing and extending quality annotated corpora and tools. We summarize our contributions as follows (please note that the term "we" refers to the researcher, who was the leading person, and his supervisors):

- Compiling new dialectal Arabic corpora covering seven main Arabic dialects. In addition, we created a historical (concerning time and provenance) Arabic corpus and developed a classical Arabic corpus.

- Leading the development of morphologically annotated, high-quality corpora for the Sanani Yemeni, Moroccan, Ta’izzi Yemeni, Najdi, Syrian, Iraqi, Palestinian, and Jordanian dialects. These corpora were used to develop morphological analyzers.

- Developing a tool called DIWAN, which provides a platform for text annotation, speech annotation, and generating linguistic resources (dictionaries) for any dialectal corpus. It is an open source tool and can easily be configured for any particular dialect.

- Introducing Xword, an online multilingual framework for automatic word expansion. Xword relies on both pre-trained ad hoc word embedding models and n-gram models for the expansion task. It currently supports two languages, Arabic and German. Xword presents the results of each model both individually and collectively. Additionally, Xword can filter the result set based on the sentiment and part-of-speech (POS) tag of every single word.

- Developing a tool called “CLARA”, a multi-class classifier that detects the dialect label of a given sentence in AD.
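
The Xword description above (expansion from an embedding model and from an n-gram model, reported individually and collectively, with part-of-speech filtering) can be illustrated with a minimal sketch. This is not the Xword implementation: the toy vectors, bigram counts, POS tags, and all function names below are hypothetical and chosen only to show the idea, and sentiment filtering is omitted.

```python
import math
from collections import Counter

# Hypothetical toy data; a real system would load pre-trained models.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "nice": [0.7, 0.3, 0.0],
    "bad": [-0.8, 0.1, 0.2],
    "movie": [0.0, 0.9, 0.4],
}
BIGRAMS = Counter({("good", "movie"): 5, ("good", "idea"): 3, ("bad", "movie"): 2})
POS_TAGS = {"good": "ADJ", "great": "ADJ", "nice": "ADJ",
            "bad": "ADJ", "movie": "NOUN", "idea": "NOUN"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_by_embedding(word, k=3):
    """Return the k words whose vectors are closest to the query word."""
    if word not in EMBEDDINGS:
        return []
    target = EMBEDDINGS[word]
    scored = [(other, cosine(target, vec))
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]


def expand_by_ngram(word, k=3):
    """Return the k most frequent words that follow the query word."""
    followers = Counter({second: n for (first, second), n in BIGRAMS.items()
                         if first == word})
    return [w for w, _ in followers.most_common(k)]


def expand(word, k=3, pos=None):
    """Report each model's results individually and as a combined list,
    optionally keeping only words with the requested POS tag."""
    results = {
        "embedding": expand_by_embedding(word, k),
        "ngram": expand_by_ngram(word, k),
    }
    # Merge while preserving order and dropping duplicates.
    results["combined"] = list(dict.fromkeys(results["embedding"] + results["ngram"]))
    if pos is not None:
        results = {name: [w for w in words if POS_TAGS.get(w) == pos]
                   for name, words in results.items()}
    return results
```

For example, `expand("good", pos="NOUN")` keeps only the noun candidates from each model, which mirrors the POS-based filtering mentioned above.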

In addition to the main focus of the thesis (i.e., AD), we contributed to several other projects dealing with knowledge graphs, ontologies, and data mining, since they represent future perspectives for research.
We list our minor contributions as follows:

- Contributing to the proposal of metrics for evaluating the quality of embeddings for ontological concepts, and benchmarking scalable embedding models to assess their quality against those concepts.

- Developing an ontology called CEVO (i.e., a comprehensive event ontology) that categorizes verbs with shared meaning and syntactic behavior.

- Compiling a corpus of harassing content from social media and training embedding models on it, with the ultimate goal of developing a tool for detecting online harassment.

In this thesis, we developed linguistic resources (annotated corpora and tools) that were carefully evaluated against state-of-the-art baselines to ensure high quality. Overall, our results demonstrate effectiveness and high accuracy, and the resources are available as open source for reproducibility and reusability.

Type: PhD Thesis

Author: Faisal Al-shargi
Title: Exploiting Lexical Resources for Natural Language Processing: Compilation, Annotation, Classifying and Analysis of Arabic, German and English Corpora
School: informatik
Year: 2019
@PHDTHESIS{faisalth,
AUTHOR = {Faisal Al-shargi},
TITLE = {Exploiting Lexical Resources for Natural Language Processing: Compilation, Annotation, Classifying and Analysis of Arabic, German and English Corpora},
SCHOOL = {informatik},
YEAR = {2019}