Predicting Risk from Financial Reports with Supervised Topic Models
Forecasting from the analysis of text corpora is an exciting research area with potential applications in a variety of fields such as finance, medicine and consumer research. We apply techniques from NLP to predict real-world continuous quantities associated with the meaning of a forward-looking text. In particular, we study financial reports because they form a large text corpus that is highly standardized and widely studied by financial analysts in industry. Our analysis uses a class of generative probabilistic models known as topic models. In such a model, each document is a mixture of topics, where a topic is defined as a probability distribution over words. These models are interesting because they provide a simple probabilistic procedure for generating documents. This procedure can be inverted using standard statistical techniques, allowing us to infer the set of topics from which a particular document was generated. The problem is then to associate the inferred topic distributions with real-world quantities, such as company-level financial indicators, for the prediction task.
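
The generative procedure described above can be illustrated with a minimal sketch. It assumes an LDA-style model in which each document's topic mixture is drawn from a symmetric Dirichlet prior; the abstract does not specify the prior, so this is an assumption, as are the function name `generate_document`, the toy vocabulary size, and the values in `topic_word`. It shows only the forward (generative) direction, not the inference step or the supervised link to financial indicators.

```python
import numpy as np


def generate_document(topic_word, alpha, doc_length, rng):
    """Sample one document from a toy LDA-style generative process.

    topic_word: array of shape (num_topics, vocab_size); each row is a
                topic, i.e. a probability distribution over words.
    alpha:      symmetric Dirichlet concentration (assumed prior).
    """
    num_topics, vocab_size = topic_word.shape
    # Draw the document's mixture over topics.
    theta = rng.dirichlet(np.full(num_topics, alpha))
    # For each word position, pick a topic according to that mixture.
    topics = rng.choice(num_topics, size=doc_length, p=theta)
    # Draw each word from the distribution of its assigned topic.
    words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
# Two hypothetical topics over a five-word vocabulary; each row sums to 1.
topic_word = np.array([
    [0.40, 0.40, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.40, 0.40],
])
theta, topics, words = generate_document(
    topic_word, alpha=0.5, doc_length=20, rng=rng
)
```

Here `theta` is the document's topic mixture and `words` holds the sampled word indices. Inverting the procedure means recovering quantities like `theta` and `topic_word` when only the words of each document are observed.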