TWITA
  • about
  • people
  • downloads
  • Sentiment Analysis
  • Sentiment Analysis

    As part of the TWITA project investigation, we developed a lexicon for Sentiment Analysis called Sentix (Sentiment Italian Lexicon). The lexicon is the result of the alignment of several resources:

    • WordNet - a large lexical database of English
    • MultiWordNet - a multilingual lexical database
    • BabelNet - A very large multilingual ontology
    • SentiWordNet - a lexical resource for opinion mining
    Sentix is available for download from the downloads page.

    An entry in our lexicon consists in an italian lemma, part-of-speech (noun, verb, adjective, adverb), WordNet synset ID, a positive and a negative score from SentiWordNet, a polarity score ranging from -1 to 1, and an intensity score ranging from 0 to 1. Here is an example entry:

    lemma POS Wordnet synset ID positive score negative score polarity intensity
    naturale A 00074346 0,75 0,125 0,789 0,76

    Polarity and Intensity

    In SentiWordNet the positive p and negative n polarity scores are assigned to synsets independently, that is, the sum of positive, negative and neutral score is 1, but no constraint is posed on the relative size of p and n. Geometrically, a synset in SentiWordNet is a point in the cartesian space where its y coordinate is the positive score and the y coordinate is the negative score. Since x+y≤1, the sentiment plane is restricted to a triangle:
    geometric sentiment space

    Intuitively, synsets that fall into the P region are "positive", synsets that fall into the N region are "negative", and synsets that fall into the U region are "neutral". However, the three regions are not clearly delimited.

    Formally, given a cartesian-to-polar coordinates transformation:
    geometric sentiment space
    We define Polarity (π) and Intensity (ι) as:
    geometric sentiment space
    Polarity (π) ranges from -1 (totally negative) to 1 (totally positive). Intensity ranges from 0 (totally neutral) to 1 (totally polarized). Polarity and Intensity are functions of the positive and negative scores found in SentiWordNet, thus they don't add any informationi to it. However, they are intuitive measures of lexical sentiment.

    Polypathy

    When it's not possible to perform Word Sense Disambiguation, then all senses of a word must be considered (or one can use heuristics such as taking the most frequent sense). Many words are polysemous, that is, they have multiple meanings. The meanings of a word can convey sentiments with opposite polarities. The polypathy of a lemma is a measure of the spread of its senses on the polarity scale.

    One way of compute the polypathy of word is the standard deviation of the polarity of its senses, but this method does not take into account the intensity of the synsets. An alternative method is to take the mean distance of all synsets from their centroid in the sentiment plane. Two lists of Italian and English words with their polypath (computed as standard deviation of polarities) are available for downloads in the downloads page.

    Statistics

    Resource Part of speech Lemmas Synsets
    Sentix noun 52,257 12,228
    adjective 3,359 1,805
    verb 2,775 1,260
    adverb 1,351 750
    all 59,742 16,043
    Sentix (only aligned with MultiWordNet) noun 10,042 4,779
    adjective 3,359 1,805
    verb 2,775 1,260
    adverb 1,351 750
    all 17,527 8,594
    Sentix (only aligned with Babelnet) noun 45,517 11,567
    adjective 0 0
    verb 0 0
    adverb 0 0
    all 45,517 11,567