Thesis Presentation
| A.Y. 2003-2004 | | |
| Student | Advisor | Thesis Topic |
| Aditya Agarwal | Scherlis | Navigating, Learning and Capturing the Latent Semantic Pathways in an Email Corpus |
E-mail, while originally designed for asynchronous communication, is now overloaded with a host of other purposes, including task management, informal rolodexing and archival storage. Many users are overwhelmed by the email they receive and attempt to alleviate the problem with a personal categorization or foldering scheme. However, given the sheer volume of email received, manual categorization is not a viable solution. Any attempt to redesign email communication to better suit its current tasks will be in tension with the legacy epistemology that a user has of her Inbox. I propose a system that will enable multi-dimensional categorization, two example dimensions being social networks and action items. The system attempts to discover latent semantic structures within a user's corpus and uses them to perform email categorization. A user's social network is one example of such an underlying semantic structure. The unsupervised message classification scheme developed here is based on discovering this social network structure. The system extracts and analyzes the email header information contained in the user's corpus and uses it to create a variety of graph-based social network models. An edge-betweenness centrality algorithm is then applied in conjunction with a ranking scheme to create a set of participant clusters and corresponding message clusters. Having an explicit mapping between each participant cluster and its message cluster allows the user to mold the system to fit the legacy epistemology and to train it for further use. In addition, the system can evolve over time and adapt to new semantic structures. Initial results for the classification scheme are highly encouraging. Novel methods of navigating through an email corpus are also explored. Latent semantic indexing and other similarity measures are used as the basis for an interactive system that will allow the user to extract underlying semantic structure from a corpus and capture it for later use.
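The header-based clustering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`participants`, `build_graph`, `participant_clusters`, `message_clusters`) are hypothetical, the graph model is a simple undirected co-participation graph over From/To/Cc addresses, and clusters are formed by repeatedly removing the edge with the highest betweenness until a requested number of clusters `k` remains (a Girvan-Newman style procedure, assumed here). The abstract does not specify its ranking scheme, so a majority-overlap rule stands in for it when assigning messages to clusters.

```python
from collections import defaultdict, deque
from email import message_from_string
from email.utils import getaddresses


def participants(raw):
    """Return the set of addresses in the From/To/Cc headers of one raw message."""
    msg = message_from_string(raw)
    fields = []
    for name in ("From", "To", "Cc"):
        fields.extend(msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(fields) if addr}


def build_graph(raw_messages):
    """Undirected graph: an edge links two people who appear on the same message."""
    graph = defaultdict(set)
    for raw in raw_messages:
        people = sorted(participants(raw))
        for i, a in enumerate(people):
            graph[a]  # touch the key so isolated participants are kept
            for b in people[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def edge_betweenness(graph):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):  # accumulate dependencies back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score


def components(graph):
    """Connected components, each returned as a set of nodes."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def participant_clusters(graph, k):
    """Remove highest-betweenness edges until at least k clusters exist (assumed stop rule)."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    while len(components(g)) < k:
        scores = edge_betweenness(g)
        if not scores:  # no edges left to remove
            break
        a, b = max(scores, key=scores.get)
        g[a].discard(b)
        g[b].discard(a)
    return components(g)


def message_clusters(raw_messages, clusters):
    """Assign each message to the cluster sharing the most participants (stand-in rule)."""
    return [
        max(range(len(clusters)), key=lambda i: len(clusters[i] & participants(raw)))
        for raw in raw_messages
    ]
```

Calling `participant_clusters(build_graph(messages), k)` on a list of raw RFC 822 message strings yields the participant clusters, and `message_clusters(messages, clusters)` returns one cluster index per message. Keeping the two mappings explicit mirrors the abstract's point that a user can inspect and adjust which people, and therefore which messages, belong together.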
Thesis Committee:
William Scherlis, Chair
James Herbsleb