SCS Undergraduate Thesis Topics

2009-2010
Neel Shah Noah Smith Predicting Risk from Financial Reports with Supervised Topic Models
     

Forecasting from text corpora is an active research area with potential applications in fields such as finance, medicine, and consumer research. We apply techniques from natural language processing (NLP) to predict real-world continuous quantities associated with the meaning of a forward-looking text. In particular, we study financial reports, because they form a large text corpus that is highly standardized and widely studied by financial analysts in industry. Our analysis uses a class of generative probabilistic models known as topic models. In such a model, each document is a mixture of topics, where a topic is defined as a probability distribution over words. These models are attractive because they provide a simple probabilistic procedure for generating documents. This procedure can be inverted using standard statistical techniques, allowing us to infer the set of topics from which a particular document was generated. The remaining problem is to associate the inferred topic distributions with real-world quantities, such as company-level financial indicators, for the prediction task.
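
The generative procedure described above can be illustrated with a minimal sketch. It is not the model used in the thesis, which the abstract does not specify in detail. The toy topics, the Dirichlet prior `alpha`, the regression weights `ETA`, and the function names are all hypothetical, and the linear link from topic frequencies to a response is an assumption in the style of supervised topic models. Inference, the inversion step, is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per token."""
    theta = sample_dirichlet(alpha, rng)  # document-level mixture over topics
    words = []
    counts = [0] * len(topics)
    for _ in range(n_words):
        # Choose a topic for this token according to the document's mixture.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from that topic's distribution over words.
        vocab = list(topics[k])
        word = rng.choices(vocab, weights=[topics[k][v] for v in vocab])[0]
        words.append(word)
        counts[k] += 1
    topic_freq = [c / n_words for c in counts]  # empirical topic frequencies
    return words, theta, topic_freq


def predict_response(topic_freq, eta):
    """Assumed linear link from topic frequencies to a continuous quantity."""
    return sum(e * f for e, f in zip(eta, topic_freq))


# Hypothetical toy topics: each is a probability distribution over words.
TOPICS = [
    {"growth": 0.5, "revenue": 0.3, "dividend": 0.2},
    {"litigation": 0.5, "debt": 0.3, "default": 0.2},
]
ETA = [0.1, 0.9]  # hypothetical weights: the second topic signals higher risk

rng = random.Random(0)
words, theta, topic_freq = generate_document(TOPICS, [1.0, 1.0], 20, rng)
risk = predict_response(topic_freq, ETA)
```

In this sketch, a document dominated by the second topic receives a predicted value closer to 0.9. In the actual prediction task, the topics and the weights would be estimated from the reports and the associated financial indicators rather than fixed by hand.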

