See below: the methodology alone is a complete mindfuck, and should provide some sense of the nature of scientific studies in the age of big data.
Researchers say readers' identities can reveal much about content of articles
Aug 12, 2013
Articles that people share on social networks can reveal a lot about those readers, but a new study reverses the proposition: What can be learned about an article from the attributes of its readers?
To find out, researchers at Carnegie Mellon University (CMU), along with colleagues at the University of Washington, analyzed almost 3 million news articles and the public profiles of the people who shared those articles on Twitter.
This enabled them to generate a few thousand "badges" that characterized the content of the shared news articles and also could be used to analyze any subsequent article, including those that had never been shared or even read.
To train their model, the team began by looking at three months of tweets (September of 2010, 2011 and 2012) and selecting those that included links to mainstream news articles and came from a user who had filled out a Twitter profile.
[collect major news outlets' articles that have been tweeted about]
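The selection step could look roughly like the sketch below. It is only an illustration: the tweet layout (a dict with "text" and "profile" keys), the `NEWS_DOMAINS` list, and the function names are all assumptions, since the article does not say which outlets counted as mainstream or how the data was stored. It also ignores link shorteners, which a real pipeline would have to resolve first.

```python
import re
from urllib.parse import urlparse

# Hypothetical outlet list; the study's actual list is not given in the article.
NEWS_DOMAINS = {"nytimes.com", "washingtonpost.com"}

URL_PATTERN = re.compile(r"https?://\S+")


def news_links(text):
    """Return the URLs in a tweet whose host is one of NEWS_DOMAINS."""
    links = []
    for url in URL_PATTERN.findall(text):
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in NEWS_DOMAINS:
            links.append(url)
    return links


def select_tweets(tweets):
    """Keep tweets that link to a listed outlet and whose author has a non-empty profile."""
    selected = []
    for tweet in tweets:
        links = news_links(tweet["text"])
        if links and tweet["profile"].strip():
            selected.append({"profile": tweet["profile"], "links": links})
    return selected
```

A tweet survives only if both conditions hold, which is why users with blank profiles drop out of the data entirely.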
Each news article was then downloaded and the most meaningful, unique words were extracted, creating a "bag of words" for each article; similar to a visual word cloud, these bags give greater weight to more important words. Likewise, from each user's Twitter profile, a set of descriptive words, or badges, was extracted.
[each article, as well as each twitter-user-profile, gets a weighted wordcloud]
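One common way to build such a weighted bag of words is TF-IDF, which scores a word higher when it is frequent in one document but rare across the collection. The article does not say which weighting the researchers used, so the sketch below is a stand-in under that assumption; `tokenize` and `weighted_bags` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def weighted_bags(documents):
    """Return one {word: weight} dict per document, using TF-IDF weights."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Count how many documents each word appears in.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    bags = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Term count times inverse document frequency; a word that appears
        # in every document gets weight 0.
        bag = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        bags.append(bag)
    return bags
```

With this weighting, a word like "the" that shows up in every article ends up with zero weight, while a word found in a single article gets the largest boost. The same function could be applied to profile text to get weighted badge candidates.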
By comparing the bags of words with badges from the people who shared the articles, the researchers were able to create a dictionary that associated each badge with its characteristic words. For example, people who self-identify with the music badge in their profiles are likely to share articles with words such as "band," "album" and "song." Different dictionaries were created for each year to compensate for interests or topics that change over time. These dictionaries were then used to encode new articles, leading to a document representation based on attributes of potential readers.
[the article wordclouds and the user-profile wordclouds of the users who tweeted those articles are cross-correlated to create a "dictionary", or rather a "predictionary", if you will, that predicts who will share what, or what will be shared by whom]
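To make the dictionary idea concrete, here is a deliberately simplified sketch. It is not the researchers' model, whose details the article does not give: it just averages the word weights of the articles shared by users carrying each badge, then encodes a new article by its cosine similarity to each badge's averaged words. The input layout and the names `build_badge_dictionary`, `cosine`, and `encode_article` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def build_badge_dictionary(shares):
    """shares: list of (article_bag, user_badges) pairs.

    Returns {badge: {word: average weight}} over the articles shared
    by users who carry that badge.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for bag, badges in shares:
        for badge in badges:
            counts[badge] += 1
            for word, weight in bag.items():
                totals[badge][word] += weight
    return {
        badge: {word: total / counts[badge] for word, total in words.items()}
        for badge, words in totals.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_article(bag, dictionary):
    """Represent an article as {badge: similarity}, one score per badge."""
    return {badge: cosine(bag, words) for badge, words in dictionary.items()}
```

An article that was never shared can still be encoded, because only its words are needed at that point. To mirror the per-year dictionaries mentioned above, one would call `build_badge_dictionary` separately on each year's shares.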
Case Study Example:
New York Times columnist Maureen Dowd had readers who tended to be progressive. This association was notable because Dowd never explicitly uses the word "progressive" in the articles analyzed by the researchers. Rather, the algorithm detected that the words Dowd uses in these articles correspond to the type of content self-described progressives tend to share on Twitter.
-Carnegie Mellon University