SCS Undergraduate Thesis Topics

2012-2013
Student: Hanan Alshikhabobakr
Advisor: Kemal Oflazer
Thesis Topic: Unsupervised Arabic Word Segmentation and Statistical Machine Translation

Word segmentation is a necessary step in Natural Language Processing (NLP) for morphologically rich languages such as Arabic. In this thesis we experiment with unsupervised word segmentation systems proposed in the literature, apply them to Arabic, and then couple the resulting segmentation with Statistical Machine Translation (SMT). Our results indicate that unsupervised segmentation systems are inaccurate and do not improve SMT quality. Minimal human post-processing greatly improves translation accuracy, but the accuracy of the word baseline still turns out to be better. We conclude that semi-supervised systems have more potential to improve Arabic-to-English translation in SMT.

