TWITA is a collection of tweets identified as being written in the Italian language.
This collection of tweets was harvested using a two-pass language identification procedure, aiming for general Italian. As a first step, we used cURL to download tweets from the Twitter Streaming API, searching for a list of representative words:
vita Roma forza alla quanto amore Milano Italia fare grazie della anche periodo bene scuola dopo tutto ancora tutti fatto
The list consists of the most frequent lemmas in the ItWaC corpus; words that are also frequent in other languages (English, Spanish and Portuguese) were filtered out (e.g. "come"). As a second step, the tweets are passed to the language identification software langid.py to detect Italian.
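The two passes described above can be sketched in Python as follows. This is only an illustration, not the pipeline actually used to build TWITA: the function names (`matches_keyword`, `langid_classify`, `two_pass_filter`) are hypothetical, the case-insensitive whole-word matching is an assumption about how the keyword search behaves, and the first pass is simulated locally rather than performed through the Streaming API.

```python
import re

# Seed words copied from the list above.
KEYWORDS = {
    "vita", "Roma", "forza", "alla", "quanto", "amore", "Milano", "Italia",
    "fare", "grazie", "della", "anche", "periodo", "bene", "scuola", "dopo",
    "tutto", "ancora", "tutti", "fatto",
}
_KEYWORDS_LOWER = {w.lower() for w in KEYWORDS}
_TOKEN = re.compile(r"\w+")


def matches_keyword(text):
    """First pass (simulated): does the tweet contain any seed word?

    Assumption: matching is case-insensitive and on whole words.
    """
    return any(tok in _KEYWORDS_LOWER for tok in _TOKEN.findall(text.lower()))


def langid_classify(text):
    """Second-pass classifier backed by langid.py (needs `pip install langid`)."""
    import langid  # imported lazily so the rest of the sketch runs without it
    # langid.classify returns a (language_code, score) pair.
    return langid.classify(text)[0]


def two_pass_filter(tweets, classify=langid_classify, target="it"):
    """Keep tweets that pass the keyword filter AND are classified as `target`."""
    return [t for t in tweets if matches_keyword(t) and classify(t) == target]
```

With langid.py installed, `two_pass_filter(list_of_tweet_texts)` would return the tweets that survive both passes. The `classify` parameter is there so that a different language identifier (or a stub, for testing) can be swapped in.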
Some frequency lists of hashtags found in TWITA are available for download on the downloads page.