SCS Undergraduate Thesis Topics

Student: William Gronim
Advisor(s): Latanya Sweeney
Thesis Topic: Methods for Extracting Names from Websites Containing Lists of People

For my thesis I developed an automated system for extracting people's names from websites containing lists of people. The contents of these websites describe attributes common to the people listed. This public information has strategic value; for example, it can reveal who tends to appear at similar events. Unlike traditional named entity recognition (NER), the system extracts names embedded in HTML without natural language context. It uses a hidden Markov model (HMM) to segment the document's HTML source in order to extract entire names. Engineering features for this segmenter yielded several general types of features useful for segmenting text in structured documents. Rosters may order first and last names in many ways. A first/last classifier determines the ordering used by each document, using dictionaries to provide partial knowledge of the distribution of names across token positions. The first/last classifier uses the two-dimensional coordinates of text as it would appear when rendered by a browser in order to abstract away the HTML. Averaged over a corpus of 37 documents containing approximately 10,000 names, the HMM segmenter achieved 95% precision and 91% recall, while the first/last classifier achieved 84% precision and 82% recall.
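As a rough illustration of how dictionaries can give partial knowledge of name ordering, the sketch below lets every name in a roster vote for an ordering according to which token positions hold known first or last names. It is a toy example, not the thesis's classifier: the dictionaries `FIRST_NAMES` and `LAST_NAMES`, the function `guess_ordering`, and the voting rule are all assumptions made for illustration, and the sketch ignores the rendered two-dimensional coordinates that the actual classifier uses.

```python
from collections import Counter

# Hypothetical toy dictionaries; a real system would need far larger name lists.
FIRST_NAMES = {"william", "latanya", "mary", "james", "john"}
LAST_NAMES = {"gronim", "sweeney", "smith", "johnson", "james"}


def guess_ordering(names):
    """Guess whether a roster lists names as 'first-last' or 'last-first'.

    Returns 'unknown' when the dictionary evidence is tied or absent.
    """
    votes = Counter()
    for name in names:
        tokens = name.replace(",", " ").lower().split()
        if len(tokens) != 2:
            continue  # this sketch only handles two-token names
        left, right = tokens
        # Each dictionary hit in a position counts as one vote for an ordering.
        votes["first-last"] += (left in FIRST_NAMES) + (right in LAST_NAMES)
        votes["last-first"] += (left in LAST_NAMES) + (right in FIRST_NAMES)
    if votes["first-last"] == votes["last-first"]:
        return "unknown"
    return max(votes, key=votes.get)
```

Pooling votes over the whole document means that an ambiguous token such as "James", which appears in both dictionaries, or a name missing from both, does not decide the result on its own; the ordering is a property of the roster rather than of any single entry.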
