I recently uncovered a simple re-identification attack on a large Flickr dataset released by this group: http://socialnetworks.mpi-sws.org/. The dataset is from this paper: http://portal.acm.org/citation.cfm?id=1397735.1397742/ (citations: 13)

The group distributes an anonymized Flickr social network dataset. The data is anonymized by representing users, pics and groups with unique IDs. It records who uploaded, favorited and commented on each pic. Each of these records also carries a timestamp: when the picture was uploaded, when a user commented on it, and when it was marked as a favorite.

The timestamps were the source of the vulnerability in the anonymization scheme. The attack procedure is simple.

  1. Choose a certain date in 2007.
  2. Use the Flickr search feature to find the pics uploaded on that day alone.
  3. Rank the pictures by “Interestingness” (there is such an option).
  4. Filter the dataset for pics that were uploaded on the same day and have a large number of comments or were marked as favorite by many people.
  5. Match the number of favorites and comments on pictures from the dataset against those in the search results above; a few seed pics can be easily identified this way.
  6. Starting from these seed pictures, uncovering the remaining network is straightforward.
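
Steps 4 and 5 can be sketched roughly as follows. This is only an illustration under assumptions of mine: the record layout `(id, favorites, comments)`, the function name `find_seed_matches`, the `min_activity` threshold and the sample values are all hypothetical, and the sketch assumes the public counts have not drifted since the dataset was crawled. A pic is accepted as a seed only when its (favorites, comments) pair is unique on both sides for the chosen day.

```python
from collections import defaultdict


def find_seed_matches(dataset_pics, public_pics, min_activity=50):
    """Pair anonymized pics with public ones that share a unique
    (favorites, comments) signature for the same upload day."""

    def by_signature(pics):
        # Index only high-activity pics; low counts are too common to be useful.
        index = defaultdict(list)
        for pic_id, favorites, comments in pics:
            if favorites + comments >= min_activity:
                index[(favorites, comments)].append(pic_id)
        return index

    anon = by_signature(dataset_pics)
    public = by_signature(public_pics)

    seeds = {}
    for signature, anon_ids in anon.items():
        public_ids = public.get(signature, [])
        # Keep a match only when it is unambiguous on both sides.
        if len(anon_ids) == 1 and len(public_ids) == 1:
            seeds[anon_ids[0]] = public_ids[0]
    return seeds


# Hypothetical records for one upload day: (id, favorites, comments)
dataset_pics = [(101, 340, 95), (102, 12, 3), (103, 220, 61), (104, 220, 61)]
public_pics = [("public_photo_A", 340, 95), ("public_photo_B", 220, 61)]

print(find_seed_matches(dataset_pics, public_pics))
# {101: 'public_photo_A'}
```

Pics 103 and 104 share a signature, so neither is used as a seed, and pic 102 falls below the activity threshold. In practice the counts on the live site may have grown since the crawl, so a real attempt would probably need a tolerance rather than exact equality.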

I suggested a simple timestamp normalization procedure to make the attack more difficult, which they seem to have implemented.
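
The exact procedure is not spelled out here, so the following is only one plausible form of timestamp normalization, and every detail in it is my assumption: shift all timestamps by a secret random offset that is then discarded, and publish only coarse bucket indices relative to the earliest event. The function name `normalize_timestamps` and the `bucket_days` parameter are hypothetical. The ordering of events survives, but calendar dates can no longer be looked up directly.

```python
import random

SECONDS_PER_DAY = 86400


def normalize_timestamps(timestamps, bucket_days=7, rng=None):
    """Map absolute Unix timestamps (non-empty list) to coarse relative buckets.

    Assumed scheme, not the one actually deployed: add a secret shift,
    bucket into `bucket_days`-wide bins, and renumber bins from zero.
    """
    rng = rng or random.SystemRandom()
    # Secret shift of up to one year; it is never stored or published.
    shift = rng.randrange(0, 365 * SECONDS_PER_DAY)
    bucket = bucket_days * SECONDS_PER_DAY

    shifted = [t + shift for t in timestamps]
    base = min(shifted) // bucket
    # Bucket index relative to the earliest event; order is preserved.
    return [s // bucket - base for s in shifted]


# Hypothetical events: an upload, a comment a day later, a favorite a month later.
upload = 1167609600  # 2007-01-01 00:00:00 UTC
events = [upload, upload + SECONDS_PER_DAY, upload + 30 * SECONDS_PER_DAY]
print(normalize_timestamps(events))
```

Wider buckets hide more but also destroy more of the temporal signal that makes the dataset useful for research, so the bucket width is a trade-off rather than a fixed choice.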

Given the emergence of real-time search and the real-time web, I think future anonymization schemes (if they work at all, or are used at all) should anonymize timestamps in addition to other identifiers.

To quote from the paper:

“For online social networks, the data can be collected by crawling either via an API, or “screen-scraping” (e.g., Mislove et al. crawled Flickr, YouTube, LiveJournal, and Orkut [MMG+07]; anonymized graphs are available by request only). We stress that even when obtained from public websites, this kind of information—if publicly released—still presents privacy risks because it helps attackers who lack resources for massive crawls.”

If you have any comments, please mail me at: akshayubhat [at] gmail.com

————

Akshay Bhat

http://www.akshaybhat.com