
EPJ Data Science Highlight - Twitter’s tampered samples: Limitations of big data sampling in social media

Details: Published on 16 January 2019

Social networks are widely used as sources of data in computational social science studies, and so it is of particular importance to determine whether these datasets are bias-free. In EPJ Data Science, Jürgen Pfeffer, Katja Mayer and Fred Morstatter demonstrate how Twitter’s sampling mechanism is prone to manipulation that could influence how researchers, journalists, marketeers and policy analysts interpret their data.

(Guest post by Jürgen Pfeffer, Katja Mayer and Fred Morstatter, originally published in the SpringerOpen blog)

Despite the many scandals surrounding social media companies and their practices of data sharing, they are still central platforms of opinion formation and public discourse. Therefore, social media data is widely analyzed in academic and applied social research. Twitter has become the de facto core data supplier for computational social science as the company provides access to its data for researchers via several interfaces. One of these – the “Sample API” – is promoted by Twitter as follows:

Screenshot of the Twitter developer website, as of Jan 14, 2019

Twitter’s Sample API provides 1% of all Tweets worldwide for free, in real-time – a great data source for researchers, journalists, consultants and government analysts to study human behavior. Twitter promises “random” samples of its data. The randomness of a sample – each element has an equal probability of being chosen – is of high importance for social scientific methodological integrity, as a randomly selected sample is regarded as a valid representation of the total population. Even though Twitter shares (parts of) its data with potentially everybody (unlike other social media companies), the company does not reveal details about its data sampling mechanisms.

We set up experiments to test the sampling procedure of the Sample API by inducing Tweets into the feed in such a way that they appear in the sample with high certainty. In other words, while a Tweet should have a 1% chance of being part of Twitter’s 1% sample data, it is easily possible to increase that chance to 80%. Consequently, finding 100 Tweets related to a certain topic in the 1% sample might not result from a random sample of 10,000 Tweets, but from a manipulated sample based on just 125 Tweets.
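The arithmetic behind these two figures can be checked with a minimal sketch. The function name `implied_tweet_count` and its parameters are illustrative assumptions, not part of the study or of any Twitter API; the sketch simply divides the number of observed Tweets by the inclusion probability.

```python
def implied_tweet_count(observed_in_sample: int, inclusion_percent: float) -> float:
    """Estimate how many Tweets were sent, given how many show up in the sample.

    inclusion_percent is the chance (in percent) that a single Tweet
    ends up in the sample. Hypothetical helper for illustration only.
    """
    if not 0 < inclusion_percent <= 100:
        raise ValueError("inclusion_percent must be in (0, 100]")
    return observed_in_sample * 100 / inclusion_percent


observed = 100  # Tweets on a topic found in the 1% sample

# Fair sampling: every Tweet has a 1% chance of being included.
fair = implied_tweet_count(observed, 1)

# Manipulated sampling: induced Tweets have an 80% chance of being included.
manipulated = implied_tweet_count(observed, 80)

print(f"Random 1% sample implies about {fair:.0f} Tweets")
print(f"Manipulated sample needs only about {manipulated:.0f} Tweets")
```

The same 100 observed Tweets are therefore compatible with two very different underlying volumes, which is why the sampling probability has to be known before counts in the sample can be scaled up.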

This figure illustrates the effect of a Tweet injection experiment during the Nov 2016 US presidential election campaigns using the hashtag #trump. The gray area represents Tweets in the 1% sample from 328 million users, red represents the induced Tweets, and the black line illustrates the 1% Sample API Tweets. One hundred accounts were enough to manipulate the data stream for a globally important topic.

We also developed methods to identify over-represented user accounts in Twitter’s sample data and show that intentional tampering is not the only way Twitter’s data can get skewed. For instance, automated bots can accidentally be over-represented in the data samples or be entirely invisible. We also show evidence that corporate Twitter users seem to be allowed by Twitter to send many more Tweets than regular users, which automatically inflates their position in the data.

Our study lists potential solutions both for the architectural flaws and for regaining scientific integrity. The latter could be achieved by making sampling methods transparent and by cooperating more closely with social media researchers to create open interfaces and better ways of assessing the data at hand. At a time when decision making is increasingly based on the analysis of social data, industry, too, should do everything it can to enhance public trust in the methodologies at hand.

Even though some big data evangelists state that sampling is “an artefact of a period of information scarcity”, reality makes sampling a central necessity in times of information abundance. Researchers have to trust Twitter to supply them with methodologically sound samples while dealing with all kinds of other problems, such as bias and ethical issues.
