I previously wrote about the novel use of Twitter to predict geographic happiness, citing a study that used a geo-tagged data set of over 80 million words posted on Twitter in 2011 to estimate the happiest and saddest states (shown in red and blue, respectively, on the map). The authors describe their methodology as follows: “To measure sentiment (hereafter happiness) in these areas from the corpus of words collected, we use the Language Assessment by Mechanical Turk (LabMT) word list…assembled by combining the 5,000 most frequently occurring words in each of four text sources: Google Books (English), music lyrics, the New York Times and Twitter. A total of roughly 10,000 of these individual words have been scored by users of Amazon’s Mechanical Turk service on a scale of 1 (sad) to 9 (happy), resulting in a measure of average happiness for each given word … For example, ‘rainbow’ is one of the happiest words in the list with a score of 8.1, while ‘earthquake’ is one of the saddest, with 1.9. Neutral words like ‘the’ or ‘thereof’ tend to score in the middle of the scale, with 4.98 and 5 respectively.”
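To make the word-scoring idea concrete, here is a minimal Python sketch of how a text's happiness could be estimated from a per-word score table. It assumes a simple frequency-weighted average over the words that have a score, which is one plausible reading of the method and not necessarily the authors' exact procedure. The `TOY_SCORES` table and the `average_happiness` function are hypothetical names, and the scores are placeholders rather than real LabMT values.

```python
from collections import Counter

# Placeholder scores on the 1 (sad) to 9 (happy) scale.
# Illustrative values only, NOT the real LabMT scores.
TOY_SCORES = {
    "rainbow": 8.0,
    "earthquake": 2.0,
    "the": 5.0,
    "love": 8.0,
    "traffic": 3.0,
}


def average_happiness(text, scores=TOY_SCORES):
    """Frequency-weighted mean score of the scored words in `text`.

    Words missing from the score table are ignored. Returns None when
    no word in the text has a score.
    """
    # Naive tokenization (lowercase, split on whitespace) is an assumption.
    counts = Counter(text.lower().split())
    scored = {word: n for word, n in counts.items() if word in scores}
    total = sum(scored.values())
    if total == 0:
        return None
    return sum(scores[word] * n for word, n in scored.items()) / total


print(average_happiness("the rainbow after the earthquake"))  # 5.0
```

In this toy example the happy word and the sad word cancel out around the neutral "the", so the text lands in the middle of the scale. Applied to millions of tweets grouped by state, the same kind of average would give one happiness number per state.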
Recently I saw an even more exciting application of this concept: the use of psychological data from Twitter to estimate heart disease risk in different geographical areas. As the study authors note, important psychological risk factors for cardiovascular disease, such as hostility and stress, are very difficult to measure at the community level. Typical measurements rely on phone surveys and/or household visits to gather psychological data, but these are costly and imprecise. So the authors turned to Twitter, using data from 1,347 U.S. counties with published heart disease and mortality statistics and at least 50,000 available tweeted words. The study area included more than 85% of the U.S. population. Analyses indeed showed that “language patterns reflecting negative social relationships, disengagement, and negative emotions, especially anger, emerged as risk factors; positive emotions and psychological engagement emerged as protective factors.”
More importantly, “a cross-sectional regression model based only on Twitter language predicted AHD [atherosclerotic heart disease] mortality significantly better than did a model that combined 10 common demographic, socioeconomic, and health risk factors, including smoking, diabetes, hypertension, and obesity.” The accompanying figure compares actual AHD mortality rates with the mortality rates predicted from Twitter language. The authors comment that since the typical Twitter user is younger than the typical person at risk of, or dying from, heart disease, it is unclear why Twitter language should track heart disease mortality. They surmise that “the tweets of younger adults may disclose characteristics of their community, reflecting a shared economic, physical, and psychological environment…the language of Twitter may be a window into the aggregated and powerful effects of the community context.”
It’s fascinating, isn’t it, that those 140-character emotive blasts, which we tweet without consideration or conscious debate, could collectively reflect our community-wide psychological health and heart disease risk?