SCS Undergraduate Thesis Topics

Desai Chen, Noah Smith
Bilingual Part of Speech Tag Induction with Markov Random Fields

This paper explores unsupervised learning with undirected graphical models. We focus on bilingual part-of-speech (POS) induction, that is, POS induction when parallel training data is available [Snyder et al., 2008]. Because we use undirected models, the graph structure is unrestricted and we can incorporate many overlapping features, such as sublexical features; previous work, by contrast, used directed, generative models. Although the undirected model makes it easy to add new features, the unsupervised learning problem turns out to be quite challenging: our analysis shows that the non-convex objective we attempt to optimize has many local optima, which cause problems for learning. We show that performance can be improved by using an alternative objective based on contrastive estimation [Smith and Eisner, 2005b].
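
As a rough illustration of a contrastive-estimation objective, the sketch below scores one observed sentence against a small "neighborhood" of perturbed sentences under a log-linear model with overlapping features. It is not the thesis's model: it is monolingual, it enumerates tag sequences by brute force, and the tag set, the feature templates (including a two-character suffix as a stand-in for sublexical features), the adjacent-word-swap neighborhood, and all function names are assumptions made for this example.

```python
import itertools
import math
from collections import Counter

TAGS = ("N", "V")  # hypothetical two-tag inventory, for illustration only


def features(words, tags):
    """Count overlapping features: word/tag, suffix/tag, and tag-bigram."""
    feats = Counter()
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("word", w, t)] += 1
        feats[("suffix", w[-2:], t)] += 1  # crude sublexical feature
        feats[("trans", prev, t)] += 1
        prev = t
    return feats


def score(theta, words, tags):
    """Unnormalized log-linear score theta . f(words, tags)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(words, tags).items())


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sum_over_tags(theta, words):
    """log of the summed exp-scores over all tag sequences (brute force)."""
    return log_sum_exp(
        [score(theta, words, tags)
         for tags in itertools.product(TAGS, repeat=len(words))]
    )


def neighborhood(words):
    """The observed sentence plus every variant with one adjacent pair swapped."""
    words = tuple(words)
    result = {words}
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        result.add(tuple(swapped))
    return sorted(result)


def contrastive_objective(theta, words):
    """log p(observed sentence | its neighborhood), with tags summed out."""
    numerator = log_sum_over_tags(theta, words)
    denominator = log_sum_exp(
        [log_sum_over_tags(theta, x) for x in neighborhood(words)]
    )
    return numerator - denominator
```

Because the observed sentence is itself a member of its neighborhood, the objective is never positive, and with all weights at zero it equals minus the log of the neighborhood size. The normalizer ranges only over the neighborhood rather than over all possible sentences, which is what keeps it computable in an undirected model.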
