Sentiment Analysis, Opinion Mining & neophyte basics
For more than a decade now, researchers in Text and Data Analytics, Computer Science, Computational Linguistics and Natural Language Processing, among others, have been working on technologies for analyzing how people feel or what they think about something. Today, a great many commercial offerings are built on what I think one should still call a Research Program. Here are some basic clues to get an idea of how this kind of content analysis technology works.
One of the major issues in dealing with huge amounts of User-Generated Content published online (also referred to as UGC) is mining opinions, which means detecting their polarity and identifying the target(s) they aim at and the arguments they rely on. Opinion Mining/Sentiment Analysis tools are, simply put, derived from Information Extraction (such as Named Entity detection) and Natural Language Processing technologies (such as syntactic parsing). In other words, they work like an enhanced search engine with complex data computation abilities and knowledge bases.
But dealing with the data makes it clear that understanding “how does sentiment analysis work” is more a linguistic modeling problem than a computational one. The “keywords” or “bag-of-words” approach is the most commonly used because it relies on a simple representation of how opinions and sentiments can be expressed. In its most simplistic form, it consists in detecting, in UGC, words from a set labeled as “positive” or “negative”: this method remains unable to solve most “simple” ambiguity problems (here is an example that illustrates this quite well, I guess).
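To make the keyword approach concrete, here is a minimal sketch in Python. The word lists and the function names (`tokenize`, `lexicon_polarity`) are made up for illustration and do not come from any particular tool; a real system would use a much larger lexicon.

```python
import re

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_polarity(text):
    """Count positive words minus negative words and map the score to a label."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_polarity("I love this phone, the screen is great"))  # positive
print(lexicon_polarity("This phone is not good"))                  # positive (wrong!)
```

The second call shows the limit described above: the word “good” is counted as positive even though it is negated, because the method looks at isolated words and ignores their context.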
Most Opinion Mining tasks focus on the local linguistic cues of opinion expression. This analysis is partly constrained by external resources and thus often runs into problems such as limited dictionary coverage and, at a higher level, domain dependence. Contextual analysis is still a challenge, as you will find in the following reference book: Bo PANG, Lillian LEE, Opinion Mining and Sentiment Analysis, Now Publishers Inc., 2008, 135 pages, ISSN 1554-0669.
As a temporary conclusion, I would say that accuracy remains the major challenge for the development of this industry. In fact, in such analysis systems, some “simple” linguistic phenomena are still hard to model and implement, for example the negation scope problem, that is, how to deal with negative turns of phrase. Another obstacle to system accuracy is the analysis methodology itself. Fully manual methods are costly, but fully automated ones are inaccurate: you need to define a methodology where the software and the analyst collaborate to get past the noise and deliver accurate analysis.
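To give an idea of why negation scope is hard, here is a rough sketch in Python of a common heuristic: flip the polarity of the few words that follow a negation word. Everything in it is an assumption made for illustration: the toy word lists, the window of three tokens, and the `route` helper that sends undecided texts to a human analyst. It is not how any specific product works.

```python
import re

# Hypothetical toy resources, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}
NEGATORS = {"not", "no", "never"}  # contractions such as "isn't" are not handled

def polarity_with_negation(text, window=3):
    """Sum word polarities, flipping those inside a fixed window after a negator."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    negate_left = 0  # how many upcoming tokens are still inside a negation scope
    for tok in tokens:
        if tok in NEGATORS:
            negate_left = window
            continue
        value = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1 or 0
        if negate_left > 0:
            value = -value
            negate_left -= 1
        score += value
    return score

def route(text):
    """Assumed policy: keep clear-cut scores, send undecided texts to an analyst."""
    return "analyst" if polarity_with_negation(text) == 0 else "automatic"

print(polarity_with_negation("This phone is not good"))   # -1
print(polarity_with_negation("Not only good but great"))  # 0
print(route("Not only good but great"))                   # analyst
```

The fixed window handles “not good” correctly, but it wrongly flips “good” in “not only good but great”, a clearly positive sentence that ends up with a score of 0. Deciding where a negation really ends requires syntactic or contextual knowledge, which is why the sketch hands such cases over to the analyst instead of guessing.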