Crawling del.icio.us for data

Here is the dataset I've got by crawling del.icio.us. Details follow.

Crawling process

First, I set up a script which reads del.icio.us' news feeds. The script run for about 1.5 months, gathering approx 1.3M tags. The problem with del.icio.us' news feeds is that they only contain data about the first time a user tags a document, so the majority of data is lost.

The second step was to take every single document in the downloaded data and simply download all its tags, along with the users who tagged it and the timestamps of the tagging event. This process took about a week using ten different machines.

Stats

Some raw number about the dataset:
1280686 documents
21408652 total tags (16.7 tags per document on average)
1205958 unique tags
491702 users
7034524 tagging events

A chart showing the distribution of number of tags per document:

The 30 most frequent tags:

Frequency Tag
277414design
195461blog
147274tools
135187inspiration
129119imported
124871tutorial
124798programming
122118art
119211webdesign
114229reference
113939education
113082software
103713video
94744web
94365music
92504photography
83826development
79686howto
79615resources
75723linux
74114javascript
69075free
67085recipes
66254via:packrati.us
66108shopping
63386business
62953food
62458research
62145science
60547technology

A chart showing the raw frequency of tags. For readability the chart is cut, showing only frequencies >>2000.

Format

The dataset is in XML format. Here's a sample:

<tags t="1229008773" u="gregloby" href="005cb474bfc10f41036b543f042ae791">
	<t>jquery</t>
	<t>webdesign</t>
	<t>navigation</t>
</tags>

Every tags tag represent a tagging event: "a user has tagged a document with zero or more tags"*. The attributes of tags are:
* tagging documents is not the only del.icio.us' feature, a user can ie. make a review or a comment; in these cases the number of tags in a "tagging event" can be zero.

Downloads

del.icio.us dataset (XML format, Bzipped) 182.3 MB
del.icio.us dataset corpus (for Distributional Semantics, see below, Bzipped) 53.2 MB

Distributional semantics

Word Space Models were built from the del.icio.us dataset. The idea is treating the tags associated to a document as a document itself, made by just the (randomly ordered) list of tags. In this way a corpus of "documents" is created which can tehn be used for exploring aspects of the del.icio.us folksonomy using methods from NLP such as Distributional Semantics.

So far two models were build, namely a LSA model (100 dimensions) and a Random Indexing model (4000 dimensions). Both models were made with the Semantic Space package.

Publications

If you intend to use this data for research, please cite the following article:

Valerio Basile, Silvio Peroni, Fabio Tamburini, Fabio Vitali (2015)
Topical tags vs non-topical tags: Towards a bipartite classification?
Journal of Information Science.

Bibtex:
@article{basile_topical_2015,
	title = {Topical tags vs non-topical tags: {Towards} a bipartite classification?},
	volume = {41},
	url = {http://dx.doi.org/10.1177/0165551515585283},
	doi = {10.1177/0165551515585283},
	number = {4},
	journal = {J. Information Science},
	author = {Basile, Valerio and Peroni, Silvio and Tamburini, Fabio and Vitali, Fabio},
	year = {2015},
	pages = {486--505},
	file = {55a4f9eb08ae81aec91327f8.pdf:files/253/55a4f9eb08ae81aec91327f8.pdf:application/pdf}
}

Contacts

Send me an email.

back to my homepage.