New! The TWITA collection is now available for download.

About

TWITA is a collection of tweets identified as being written in the Italian language.

This collection of tweets has been harvested using a two-pass language identification procedure, aiming to capture general Italian. We used cURL to download tweets from the Twitter Streaming API, searching for a list of representative words:

vita Roma forza alla quanto amore Milano Italia fare grazie
della anche periodo bene scuola dopo tutto ancora tutti fatto

The list consists of the most frequent lemmas in the ItWaC corpus; all words that are also frequent in other languages (English, Spanish and Portuguese) are filtered out (e.g. come). As a second step, the tweets are passed to the language identification software langid.py, and only those detected as Italian are kept.
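The two passes described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used to build TWITA: the local keyword match stands in for the search performed through the Streaming API, and toy_classify is a hypothetical stand-in so the sketch runs without langid.py installed. The function names and the sample tweets are assumptions made for this example.

```python
import re

# Seed words copied from the list above (first pass).
KEYWORDS = {
    "vita", "roma", "forza", "alla", "quanto", "amore", "milano", "italia",
    "fare", "grazie", "della", "anche", "periodo", "bene", "scuola", "dopo",
    "tutto", "ancora", "tutti", "fatto",
}

WORD_RE = re.compile(r"\w+")


def matches_keyword(text, keywords=KEYWORDS):
    """First pass: keep a tweet if it contains at least one seed word."""
    return any(tok in keywords for tok in WORD_RE.findall(text.lower()))


def two_pass_filter(tweets, classify, target="it"):
    """Yield tweets that pass the keyword filter and are classified as `target`.

    `classify` is any callable returning a (language_code, score) pair,
    which is the shape returned by langid.classify.
    """
    for text in tweets:
        if not matches_keyword(text):
            continue
        lang, _score = classify(text)
        if lang == target:
            yield text


def toy_classify(text):
    """Hypothetical stand-in classifier (NOT langid.py), for illustration only."""
    italian_hints = {"che", "non", "sono", "per", "molto"}
    tokens = set(WORD_RE.findall(text.lower()))
    return ("it", 1.0) if tokens & italian_hints else ("en", 1.0)


sample = [
    "Grazie a tutti, sono molto felice",  # seed word present, Italian
    "Viva la vida, Roma is beautiful",    # seed word present, not Italian
    "Hello world",                        # no seed word
]
kept = list(two_pass_filter(sample, toy_classify))
```

With langid.py installed, langid.classify could be passed in place of toy_classify, since it also returns a (language, score) pair.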

new! Statistics

155,583,306 total tweets
from February 2012 to June 2013
3,866,662 tweets have geo-location information (coordinates)

Processing

URLs (http://xyz.net), hashtags (#xyz) and mentions (@xyz) have been replaced with URL, HASHTAG and MENTION, respectively.
The tweets are tokenized using Ucto: Unicode Tokenizer and POS-tagged using TreeTagger.
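The placeholder replacement described above can be approximated with a few regular expressions. This is only a sketch: the patterns below are assumptions, and the exact rules used for TWITA are not documented here. Tokenization with Ucto and tagging with TreeTagger are separate steps and are not shown.

```python
import re

# Assumed patterns; the rules actually used for TWITA may differ.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def normalize(text):
    """Replace URLs, hashtags and mentions with fixed placeholders."""
    # URLs go first so that a '#' fragment inside a URL is not
    # mistaken for a hashtag.
    text = URL_RE.sub("URL", text)
    text = HASHTAG_RE.sub("HASHTAG", text)
    text = MENTION_RE.sub("MENTION", text)
    return text


example = normalize("Ciao @xyz guarda http://xyz.net #xyz")
```

Note that the URL pattern is greedy up to the next whitespace, so punctuation directly attached to a URL is absorbed into the placeholder.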

Some frequency lists of hashtags found in TWITA are available for download on the downloads page.
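A hashtag frequency list of this kind could be computed along the following lines. This is a hedged sketch rather than the script used for the published lists: it assumes access to raw tweet text (before hashtags are replaced with HASHTAG), and the case folding is an assumption of this example.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_frequencies(tweets):
    """Count hashtags over an iterable of raw tweet texts."""
    counts = Counter()
    for text in tweets:
        # Lowercasing merges variants such as #Roma and #roma (assumption).
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


tweets = ["Forza #Roma", "#roma #italia", "grazie a tutti"]
freq = hashtag_frequencies(tweets)
```

Calling freq.most_common() then gives the hashtags ordered from most to least frequent.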

Publications

Valerio Basile, Malvina Nissim (2013): Sentiment analysis on Italian tweets. Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp 100–107, Atlanta, United States
