Lexical feature selection #
Natural Language Processing

The main question is this: how can we identify terms that are overrepresented in a given set of documents?

Methods #

Log odds ratio informative Dirichlet prior #

Useful when you want to contrast two corpora (e.g. Democrats vs. Republicans). It needs a (big) background corpus, whose word counts serve as the prior. Seems to work really well in many cases.
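A minimal sketch of the scoring step, assuming the usual formulation (Monroe et al., 2008): background counts act as Dirichlet pseudo-counts, and each word's log odds ratio between the two corpora is divided by its approximate standard deviation to get a z-score. The function name `log_odds_dirichlet`, the `floor` parameter and the toy corpora are made up for illustration. The tiny background here is just the two corpora added together, whereas a real one should be much larger.

```python
import math
from collections import Counter


def log_odds_dirichlet(counts_a, counts_b, background, floor=0.01):
    """Z-scored log odds ratio of each word in corpus A vs. corpus B.

    `background` holds word counts used as the Dirichlet prior.
    `floor` is an illustrative assumption: a small pseudo-count so that
    words missing from the background never produce log(0).
    """
    words = set(counts_a) | set(counts_b) | set(background)
    alpha = {w: background.get(w, 0) + floor for w in words}
    alpha_0 = sum(alpha.values())
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())

    z_scores = {}
    for w in set(counts_a) | set(counts_b):
        # Observed count plus prior pseudo-count in each corpus.
        y_a = counts_a.get(w, 0) + alpha[w]
        y_b = counts_b.get(w, 0) + alpha[w]
        # Difference of the smoothed log odds of w in A and in B.
        delta = (math.log(y_a / (n_a + alpha_0 - y_a))
                 - math.log(y_b / (n_b + alpha_0 - y_b)))
        # Approximate variance of that difference.
        variance = 1.0 / y_a + 1.0 / y_b
        z_scores[w] = delta / math.sqrt(variance)
    return z_scores


# Toy corpora, purely illustrative.
corpus_a = Counter("tax cut tax freedom the the".split())
corpus_b = Counter("health care health rights the the".split())
background = corpus_a + corpus_b

scores = log_odds_dirichlet(corpus_a, corpus_b, background)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Positive scores mark words overrepresented in the first corpus, negative scores mark words overrepresented in the second, and words used equally often in both (like "the" here) land at zero.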

An interesting application to restaurant reviews: *Narrative framing of consumer sentiment in online restaurant reviews*


tf-idf #

Pointwise mutual information #

