Bias in Twitter's Search API and Stream API during Hong Kong Protests, October 13th and 14th, 2014

Team Members

Giovanna Salazar (10848223)

Agnieszka Walewinder (10848290)

María Belén Muñoz Román (10848673)

Blanka Szamos (10706623)

Introduction

Since its founding in 2006, Twitter has positioned itself as one of the most popular microblogging tools, used daily by Internet users. By microblogging, we mean a “new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web” (Kwak et al., 2010). As of June 30, 2014, it counted around 271 million monthly active users, and about 500 million Tweets are published every day.

Throughout its existence, Twitter has transitioned from a banal, ambient friend-following platform that asked users “what are you doing?” (2006 to 2009) to an event- and news-following platform interested in knowing “what’s happening?” (2009 to 2012) (Rogers 2013). From 2012 onwards, Twitter replaced its question with a generic tagline that reads “compose new tweet”. Richard Rogers describes these periods as Twitter I, Twitter II, and Twitter III, a periodization that makes it possible to conceive of Twitter as an object of study (see Rogers 2013).

The main research project in which we participated during the Digital Methods Winter School 2015 was led by Daniela Stockmann, tenured Assistant Professor in the Department of Political Science at Leiden University. It belongs to the strand of research that may be called Twitter impact studies, meaning that it focuses on the “role” of the platform in a particular event. The main project studies the Twitter activity related to the Hong Kong protests that took place between October 1st, 2014 and October 15th, 2014, by collecting and analyzing data sets from the Twitter Streaming API, the Twitter Search API and the Firehose API related to the following hashtags: #hongkong #occupycentral #umbrellarevolution #occupyadmiralty #hk929 #hkstudentstrike. In particular, the researchers seek to identify bias in the information by analyzing the data sets retrieved from the different APIs, taking into consideration the ways in which each API works.

Twitter’s Streaming API returns “real-time” data from Twitter for given parameters, such as keywords, user IDs or geographical parameters. It provides non-historical data with a 1% rate limit, meaning that the data set must stay below 1% of the whole volume of Twitter traffic; Twitter describes the data sets returned through this API as consisting of the most relevant tweets. Twitter’s Search API, on the other hand, provides historical data sets from Twitter by querying a username or a hashtag. The Firehose API is a restricted-access feed that provides 100% of all public tweets, excluding deleted accounts and deleted tweets.

It is worth noting that the main project also retrieved data sets from Sina Weibo and Tencent Weibo, which are microblogging platforms in China similar to Twitter. Nevertheless, the present research report focuses only on the data sets from Twitter, specifically from the Streaming and Search APIs, and on the data they both registered for October 13th and 14th, 2014, because these are the data and dates that were assigned to the present sub-group.

Research Questions

In light of the above, the present sub-group focused on the following research questions: What are the main differences and matches that can be identified between the data sets of Twitter’s Streaming API and Search API for October 13 and 14, 2014? What can be inferred from the results?

Methodology

The Search API and Streaming API datasets were retrieved using the DMI Twitter Capture and Analysis Toolset (DMI-TCAT), a Digital Methods Initiative tool for capturing and analyzing Twitter data (Borra and Rieder 2014, 262).

The datasets used were “HongKongProtests”, which corresponds to the Streaming API data, and “hongkonglookups”, which contains the Search API data. In both cases the following query was entered in the Query field: #hongkong OR #occupycentral OR #umbrellarevolution OR #occupyadmiralty OR #hk929 OR #hkstudentstrike, and the date parameters were set from October 13th, 2014 to October 14th, 2014.
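The selection logic of this query can be sketched in a few lines of Python. This is only an illustration of an OR query over hashtags combined with a date range; it is not how DMI-TCAT implements its Query field. The tweet dictionaries, their `text` and `created_at` keys, the timestamp format and the case-insensitive exact hashtag matching are all assumptions made for the example.

```python
import re
from datetime import date, datetime

# Hashtags from the query used in this report (stored lower-case, without '#').
QUERY_HASHTAGS = {
    "hongkong", "occupycentral", "umbrellarevolution",
    "occupyadmiralty", "hk929", "hkstudentstrike",
}
START = date(2014, 10, 13)
END = date(2014, 10, 14)

HASHTAG_RE = re.compile(r"#(\w+)")  # \w is Unicode-aware in Python 3


def extract_hashtags(text):
    """Return the set of lower-cased hashtags found in a tweet text."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def matches_query(tweet):
    """True if the tweet falls in the date range and carries any query hashtag.

    `tweet` is a hypothetical dict with 'text' and 'created_at' keys;
    the 'YYYY-MM-DD HH:MM:SS' timestamp format is an assumption.
    """
    created = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S").date()
    in_range = START <= created <= END
    return in_range and bool(extract_hashtags(tweet["text"]) & QUERY_HASHTAGS)
```

In this sketch a tweet tagged only with #Argentina would be rejected, while a tweet combining #Argentina and #HongKong would be kept, which is consistent with how football-related tweets could enter the data sets.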

Among the many options for obtaining data, the following datasheets were extracted: (i) ‘Hashtag frequency’, which displayed the most used hashtags; (ii) ‘User visibility (mention frequency)’, which showed statistics on the users who were mentioned the most in other users’ tweets; (iii) ‘Identical tweet frequency’ (RT), which ordered the results by the content that was retweeted the most; and (iv) ‘User stats (individual)’, which showed overall information about the individual users who tweeted about the Hong Kong protests, of which only the total number of tweets per user was used.

Due to time constraints, and taking into account the fact that this research dealt with big data, the subsequent analysis focused on the top 10 results of each category (hashtags, mentions, retweets and users) within the Search and Streaming API data on the given dates (October 13th and 14th, 2014).
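A top 10 list of this kind amounts to counting occurrences and keeping the most frequent entries. The following sketch shows the idea for hashtags using Python's standard library; it is not the DMI-TCAT implementation, and both the function name `top_hashtags` and the choice to count every occurrence case-insensitively are assumptions for illustration.

```python
import re
from collections import Counter


def top_hashtags(texts, n=10):
    """Count hashtags across tweet texts and return the n most frequent.

    Hashtags are lower-cased before counting (an assumption), so that
    #HongKong and #hongkong are treated as the same hashtag.
    """
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counts.most_common(n)


# Tiny made-up sample, not taken from the report's data sets.
sample = ["#HongKong #OccupyCentral", "#hongkong", "#HK #hongkong"]
print(top_hashtags(sample, n=1))
```

The same counting approach would apply to mentions, retweeted texts and user names by changing what is extracted from each tweet.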

Findings

This section describes the findings of the analysis. The top 10 hashtags are the first category introduced: one column contains the results from the Streaming API and another the results from the Search API, along with their frequencies. The differences and similarities between the two are then described, as these findings are the main focus of the report. The same procedure is used for the following categories: mentions, retweets and users.

HASHTAGS

Comparison between top 10 hashtags on the 13th (with frequency)

[Figures: top 10 hashtags with frequency on October 13th, Search API (SEARCH 13 HASHTAGS.jpg) and Streaming API (STREAMING 13 HASHTAGS.jpg)]

In both APIs the first four places were the same. The 5th and 6th places were switched, meaning that #hk and #umbrellamovement hold different positions in the two APIs. The Streaming API list includes #HongKongProtest, which the Search API list does not; instead, the Search API list has #香港 (Hong Kong) in 10th place, which does not appear in the Streaming API list. As a result, the two lists are shifted relative to each other from the 7th place onward. In general, the hashtag frequencies are higher in the Streaming API.

Comparison between top 10 hashtags on the 14th (with frequency)

[Figures: top 10 hashtags with frequency on October 14th, Search API (SEARCH 14 HASHTAGS.jpg) and Streaming API (STREAMING 14 HASHTAGS.jpg)]


The first 7 hashtags are the same in both the Search and Streaming APIs, in terms of both order and name: HongKong, OccupyCentral, UmbrellaRevolution, OccupyHK, Argentina, HK and UmbrellaMovement. The hashtag HongKongProtests appears only in the Search API, whereas the hashtag Admiralty appears only in the Streaming API. The hashtag China is listed in both APIs, but in a different position: 8th place in the Search API and 9th place in the Streaming API.

The frequency of hashtags is higher in the Streaming API than in the Search API. For example, the hashtag hongkong appeared 13,491 times according to the Search API and 14,366 times according to the Streaming API, a difference of 875. Likewise, the hashtag occupyhk appeared 3,034 times according to the Search API and 3,229 times according to the Streaming API, a difference of 195.
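The two differences quoted above can be checked, and put in proportion, with a short calculation. The counts are taken from the paragraph above; expressing the gap relative to the Search API count is a choice made here for illustration, not a measure used in the report.

```python
def api_gap(search_count, streaming_count):
    """Return (absolute difference, difference relative to the Search API count)."""
    diff = streaming_count - search_count
    return diff, diff / search_count


# (Search API count, Streaming API count) for October 14th, from the text above.
counts = {"hongkong": (13491, 14366), "occupyhk": (3034, 3229)}

for hashtag, (search, streaming) in counts.items():
    diff, relative = api_gap(search, streaming)
    # Prints the absolute gap and the gap as a percentage of the Search count.
    print(f"{hashtag}: +{diff} ({relative:.1%})")
```

For both hashtags the Streaming API count exceeds the Search API count by roughly 6 to 7 percent, so the gap is similar in proportion even though the absolute differences (875 and 195) look very different.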

#Argentina appears on the 14th because of a soccer match that took place in Hong Kong on the same day as a protest. Since the match ended 7-0 for Argentina, several tweets appeared combining the hashtags Hong Kong and Argentina, which brought #Argentina into the top 10 hashtag list.

MENTIONS

Comparison between top 10 mention frequency within Search and Streaming APIs on the 13th

Rank | Search API | Streaming API
1 | hkdemonow (1034) | hkdemonow (1101)
2 | youngposthk (864) | youngposthk (901)
3 | fion_li (688) | fion_li (720)
4 | SCMPVideoMoJo (451) | SCMPVideoMoJo (470)
5 | BBCBreaking (439) | BBCBreaking (466)
6 | leungfaye (418) | galileo44 (448)
7 | galileo44 (414) | leungfaye (432)
8 | PenguinSix (341) | freakingcat (366)
9 | freakingcat (309) | PenguinSix (356)
10 | Zuki_Zucchini (290) | Zuki_Zucchini (301)

The top 10 lists of both Twitter APIs include exactly the same mentions, in the same order from first place to fifth. The main difference is that the 6th and 7th places, and the 8th and 9th places, are switched between the Streaming and Search APIs: the switched pairs are leungfaye and galileo44, and PenguinSix and freakingcat. The frequencies are higher in the Streaming API.
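The position switches described above can be found mechanically by comparing each account's rank in the two lists. The sketch below uses the October 13th mention lists from the table; the function name `position_changes` is hypothetical and the approach is an illustration, not the procedure the sub-group followed.

```python
# Top 10 mentioned accounts on October 13th, in rank order, from the table above.
search_13 = [
    "hkdemonow", "youngposthk", "fion_li", "SCMPVideoMoJo", "BBCBreaking",
    "leungfaye", "galileo44", "PenguinSix", "freakingcat", "Zuki_Zucchini",
]
streaming_13 = [
    "hkdemonow", "youngposthk", "fion_li", "SCMPVideoMoJo", "BBCBreaking",
    "galileo44", "leungfaye", "freakingcat", "PenguinSix", "Zuki_Zucchini",
]


def position_changes(ranking_a, ranking_b):
    """Return {item: (rank_in_a, rank_in_b)} for shared items whose rank differs."""
    rank_b = {item: rank for rank, item in enumerate(ranking_b, start=1)}
    return {
        item: (rank_a, rank_b[item])
        for rank_a, item in enumerate(ranking_a, start=1)
        if item in rank_b and rank_b[item] != rank_a
    }


changes = position_changes(search_13, streaming_13)
for account, (search_rank, streaming_rank) in changes.items():
    print(f"{account}: Search #{search_rank}, Streaming #{streaming_rank}")
```

Only the four accounts named in the paragraph above are reported as having moved, each by a single position.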

Comparison between top 10 mention frequency within Search and Streaming APIs on the 14th

Rank | Search API | Streaming API
1 | hkdemonow (1765) | hkdemonow (1866)
2 | freakingcat (840) | freakingcat (866)
3 | fion_li (778) | fion_li (812)
4 | SCMP_News (667) | SCMP_News (702)
5 | youngposthk (568) | youngposthk (592)
6 | JournoDannyAsia (398) | JournoDannyAsia (420)
7 | Zuki_Zucchini (398) | Zuki_Zucchini (362)
8 | nytchinese (301) | nytchinese (312)
9 | wilfredchan (300) | wilfredchan (311)
10 | george_chen (286) | george_chen (308)

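Both the Search and Streaming APIs include the same mentions within their top 10 lists, in exactly the same order. In this case too, the Streaming API’s frequency is higher than the Search API’s for every account except Zuki_Zucchini, for which the table lists 398 mentions in the Search API against 362 in the Streaming API.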

RETWEETS

In order to make the comparison as visible as possible, this section will include descriptive tables.

Comparison between top 10 retweet within Search and Streaming APIs on the 13th

Streaming API

Retweet | Frequency
RT @BBCBreaking: #HongKong #OccupyCentral - police say their goal is to clear road blocks to restore traffic & not to clear demonstrators | 253
RT @cnnireport: It’s been two weeks and protesters are still sleeping in the streets of #HongKong: http://t.co/QHZs6RVZyI http://t.co/yaeb | 210
RT @aguerosergiokun: #HongKong http://t.co/19Gr41GJc5 | 202
RT @BBCBreaking: #HongKong police begin removing barricades erected by pro-democracy protesters | 190
RT @BBCNewsAsia: Clashes between #HongKong pro-democracy activists and #OccupyHK opponents http://t.co/3gpMw1Q9Ly http://t.co/qhb6f9iFHs | 169
RT @gloomynews: 香港の民主占拠デモ現場にマスク姿の反占拠デモ隊が大量乱入、反占拠派タクシー集団がバリケード突破目指し突入、親北京デモ隊が大規模行進開始との情報。 RT @SCMPVideoMoJo #HongKong #OccupyCentral http:// | 166
RT @adamnajberg: When the Hong Kong police take away metal barricades #OccupyCentral protesters build their own. http://t.co/3cra1RL3AJ | 157
RT @BBCWorld: Clashes in #Hong Kong as masked men move in on #OccupyCentral protesters http://t.co/wcqtgpTtuX | 131
RT @JeromeTaylor: These are the kind of people Chinese state media have called radicals & thugs #HongKongProtests http://t.co/8yFfghAlu7 | 129
RT @SCMPChinese: #OccupyCentral 【學聯港府明對話或難實現 學聯望港府今午後決定】有港媒引述消息稱,雙方在明天下午實現對話的可能性很微,學聯常委方志信表示,希望港府在今天下午之前有所決定。http://t.co/80ieCMhzbu http:// | 122

Search API

Retweet | Frequency
RT @BBCBreaking: #HongKong #OccupyCentral - police say their goal is to clear road blocks to restore traffic & not to clear demonstrators | 241
RT @cnnireport: It’s been two weeks and protesters are still sleeping in the streets of #HongKong: http://t.co/QHZs6RVZyI http://t.co/yaeb | 185
RT @BBCBreaking: #HongKong police begin removing barricades erected by pro-democracy protesters | 177
RT @aguerosergiokun: #HongKong http://t.co/19Gr41GJc5 | 169
RT @BBCNewsAsia: Clashes between #HongKong pro-democracy activists and #OccupyHK opponents http://t.co/3gpMw1Q9Ly http://t.co/qhb6f9iFHs | 163
RT @gloomynews: 香港の民主占拠デモ現場にマスク姿の反占拠デモ隊が大量乱入、反占拠派タクシー集団がバリケード突破目指し突入、親北京デモ隊が大規模行進開始との情報。 RT @SCMPVideoMoJo #HongKong #OccupyCentral http:// | 161
RT @adamnajberg: When the Hong Kong police take away metal barricades #OccupyCentral protesters build their own. http://t.co/3cra1RL3AJ | 147
RT @SCMPChinese: #OccupyCentral 【學聯港府明對話或難實現 學聯望港府今午後決定】有港媒引述消息稱,雙方在明天下午實現對話的可能性很微,學聯常委方志信表示,希望港府在今天下午之前有所決定。http://t.co/80ieCMhzbu http:// | 123
RT @BBCWorld: Clashes in #Hong Kong as masked men move in on #OccupyCentral protesters http://t.co/wcqtgpTtuX | 120
RT @WilliamsJon: Perhaps most incredible photo of #HongKong you will ever see: protests last night via @hkdemonow http://t.co/hSuYMXHTCF | 109

The first two positions are the same in both APIs. The retweets in third and fourth place are switched between the Search and Streaming APIs. The fifth, sixth and seventh places are the same in both lists.

The eighth place in the Streaming API appears in ninth place in the Search API (@BBCWorld). The ninth place in the Streaming API is not on the Search API list (@JeromeTaylor). The tenth place in the Streaming API appears in eighth place in the Search API (@SCMPChinese). The tenth place in the Search API is not on the Streaming API list (@WilliamsJon).

Comparison between top 10 Retweets on Search and Streaming APIs from the 14th

Streaming API

Retweet | Frequency
RT @nytchinese: 周一,香港数百反“占中”人士试图拆除路障时与示威者爆发冲突,并指责美国背后指使。By @ChuBailiang @PekingMike #OccupyCentral #HongKong http://t.co/Ppc7EqqxHf | 132
RT @JohnSaeki: Barricades give the finger in #hongkong http://t.co/Q5YJ2b4PnF | 103
RT @fion_li: Police removing reinforced barricades at Queensway #OccupyCentral #occupyhk http://t.co/bUFGdQ97A3 | 84
RT @george_chen: BREAKING: Pro-Beijing anti-#OccupyCentral protesters tried to block @nytimes HK distributions http://t.co/M3mTkGWf3P http | 84
RT @VivienneChow: Together #HKers guard their city. Pic via Wan Leung #OccupyCentral #UmbrellaMovement #art #culture #hope #HongKong http: | 78
RT @tomgrundy: Barriers being reinforced in tunnel #OccupyHK #Occupycentral http://t.co/tvQXF5kQ8p | 71
RT @arabthomness: #Hongkong: scene from Hong Kong tonight after police tried to take down road blocks. #OccupyHK #UmbrellaRevolution http:/ | 70
RT @wilfredchan: just happened: protesters successfully hold off riot police in Lung Wo Road with umbrellas barricades #OccupyCentral http | 64
RT @PhelimKine: #China govt mouthpiece People's Daily gives #HongKong #OccupyCentral a #Tiananmen era warning http://t.co/SNHuI6B0Qr http:/ | 62
RT @cronicaweb: ¡Goool de #Argentina! Paliza al poderosísimo #HongKong. Gaitán con un terrible zurdazo pone el partido 3-0... #Ohhhhhhhh | 61

Search API

Retweet | Frequency
RT @nytchinese: 周一,香港数百反“占中”人士试图拆除路障时与示威者爆发冲突,并指责美国背后指使。By @ChuBailiang @PekingMike #OccupyCentral #HongKong http://t.co/Ppc7EqqxHf | 128
RT @JohnSaeki: Barricades give the finger in #hongkong http://t.co/Q5YJ2b4PnF | 98
RT @christineparis9: Это гениально ❗️Баррикады в Гонконге 😂 #Гонконг #HongKong @UmbrellaRevHK http://t.co/deuZwnBsBW | 87
RT @george_chen: BREAKING: Pro-Beijing anti-#OccupyCentral protesters tried to block @nytimes HK distributions http://t.co/M3mTkGWf3P http | 81
RT @fion_li: Police removing reinforced barricades at Queensway #OccupyCentral #occupyhk http://t.co/bUFGdQ97A3 | 78
RT @VivienneChow: Together #HKers guard their city. Pic via Wan Leung #OccupyCentral #UmbrellaMovement #art #culture #hope #HongKong http: | 76
RT @tomgrundy: Barriers being reinforced in tunnel #OccupyHK #Occupycentral http://t.co/tvQXF5kQ8p | 68
RT @arabthomness: #Hongkong: scene from Hong Kong tonight after police tried to take down road blocks. #OccupyHK #UmbrellaRevolution http:/ | 65
RT @wilfredchan: just happened: protesters successfully hold off riot police in Lung Wo Road with umbrellas barricades #OccupyCentral http | 63
RT @cronicaweb: ¡Goool de #Argentina! Paliza al poderosísimo #HongKong. Gaitán con un terrible zurdazo pone el partido 3-0... #Ohhhhhhhh | 58

The first two positions on both top 10 lists show the same retweets, namely the ones from nytchinese and JohnSaeki. The 10th position in both lists also shows the same retweet, from cronicaweb. The remaining retweets appear in both APIs, in some cases in a different order, with two exceptions:

  • The retweet from christineparis9 (3rd position) only shows up on the Search API

  • The retweet from PhelimKine (9th position) only shows up on the Streaming API

Just as in the case of hashtags, one retweet appears here because of the soccer match between Argentina and Hong Kong.

USERS

Comparison between top 10 Users on Search and Streaming APIs from the 13th

Rank | Search API: from_user_name (tweets in data set) | Streaming API: from_user_name (tweets in data set)
1 | BoomboomFengur (344) | BoomboomFengur (448)
2 | hongkongcang (263) | rightnowio_feed (267)
3 | FollowHKNews (255) | hongkongcang (263)
4 | rightnowio_feed (247) | FollowHKNews (255)
5 | hk928umbrella (241) | hk928umbrella (242)
6 | GodBlessFreedom (193) | iamthor_us (199)
7 | iamthor_us (193) | Daoish (198)
8 | askabear81 (182) | askabear81 (182)
9 | tax_free (173) | tax_free (175)
10 | kelvw (172) | kelvw (172)

As can be seen, there is a 90 percent match between the users of the Streaming and Search APIs on the 13th. Apart from one differing user in each list, the users are the same and only their order changes.

The first user is the same in both cases. The second user in the Search API (hongkongcang) appears in third place in the Streaming API, because rightnowio_feed, fourth in the Search API, is second in the Streaming API. As a result of this shift, the third place in the Search API (FollowHKNews) is in fourth place in the Streaming API. The fifth place is the same in both APIs. The user in sixth place in the Search API (GodBlessFreedom) appears only there, while the user in sixth place in the Streaming API (iamthor_us) is in seventh place in the Search API; the user in seventh place in the Streaming API (Daoish) appears only there. The 8th, 9th and 10th places are the same in both APIs.
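The 90 percent figure corresponds to the share of accounts that the two top 10 lists have in common. A minimal sketch of that calculation, using the October 13th user lists from the table above, is given below; the function name `top_n_overlap` is hypothetical and the code only illustrates the arithmetic.

```python
# Top 10 users on October 13th, in rank order, from the table above.
search_users_13 = [
    "BoomboomFengur", "hongkongcang", "FollowHKNews", "rightnowio_feed",
    "hk928umbrella", "GodBlessFreedom", "iamthor_us", "askabear81",
    "tax_free", "kelvw",
]
streaming_users_13 = [
    "BoomboomFengur", "rightnowio_feed", "hongkongcang", "FollowHKNews",
    "hk928umbrella", "iamthor_us", "Daoish", "askabear81",
    "tax_free", "kelvw",
]


def top_n_overlap(list_a, list_b):
    """Return (share of list_a also in list_b, items only in a, items only in b)."""
    set_a, set_b = set(list_a), set(list_b)
    share = len(set_a & set_b) / len(set_a)
    return share, sorted(set_a - set_b), sorted(set_b - set_a)


share, only_search, only_streaming = top_n_overlap(search_users_13, streaming_users_13)
print(f"{share:.0%} shared; only in Search: {only_search}; only in Streaming: {only_streaming}")
```

Note that this measure ignores rank order; it only captures membership in the two lists.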

Comparison between top 10 Users on Search and Streaming APIs from the 14th

Rank | Search API: from_user_name (tweets in data set) | Streaming API: from_user_name (tweets in data set)
1 | hk928umbrella (389) | hk928umbrella (393)
2 | freakingcat (338) | BoomboomFengur (339)
3 | FollowHKNews (324) | freakingcat (338)
4 | BoomboomFengur (288) | FollowHKNews (324)
5 | hongkongcang (208) | rightnowio_feed (255)
6 | tax_free (192) | hongkongcang (208)
7 | hkdemonow (189) | hkdemonow (194)
8 | kelvw (182) | tax_free (193)
9 | akiba2013 (176) | kelvw (184)
10 | Winter_IceCream (171) | akiba2013 (178)

As can be seen, 9 out of 10 users are the same in both the Search and Streaming APIs. The user hk928umbrella appears first on both lists. Only one user, Winter_IceCream, is shown by the Search API and not by the Streaming API, and only one user, rightnowio_feed, appears in the Streaming API but is not shown by the Search API.

Conclusions

The main purpose of the present report was to identify the differences between the datasets from the Streaming API and the Search API that were retrieved for the main research project on the Hong Kong protests. We were assigned the dates October 13th and 14th, 2014. Our analysis showed that the datasets from the two APIs for the 13th and 14th did not differ substantially.

In the particular case of hashtags, for October 13th the Streaming and Search APIs shared nine out of ten hashtags, although in a different order. On the 14th, the first seven hashtags were the same in both name and order, while the remaining places differed. We also noticed that the Streaming API reported higher hashtag frequencies than the Search API. With regard to mentions, both APIs returned the same ten accounts on both days; on the 13th two pairs of positions were switched, while on the 14th the order was identical. In the case of retweets, nine of the ten entries appeared in both lists on each day, and roughly half of them kept the same position. Finally, similar results appeared in the case of users: 90 percent of the users were the same in both APIs on the 13th and on the 14th, so each list contained only one user absent from the other. Apart from this, there were small changes in the order within the lists, but they do not amount to major differences between the APIs. The Streaming API’s frequencies remained higher, even though the difference was generally smaller.

Discussion

Since the datasets from the Streaming API and the Search API show mostly the same results in terms of frequency for hashtags, mentions, retweets and users, it can be inferred that the two datasets are correlated, and thus both can be considered reliable sources. In addition, although the way in which Twitter’s algorithm prioritizes the results shown in its Streaming API is unknown, future research on this topic may want to consider that hashtags appear to be the main marker driving Twitter activity, as the aforementioned results suggest a prioritization of hashtags first, then mentions, retweets, and lastly users.
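The report infers this correlation from a side-by-side reading of the top 10 lists and does not compute a coefficient. As one possible way to quantify it, the sketch below calculates Spearman's rank correlation for two rankings of the same items, using the October 13th mention lists from the Findings section. The choice of Spearman's rho and the function name `spearman_rho` are assumptions introduced here; the measure only applies when both lists contain exactly the same items without ties.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rank correlation for two orderings of the same items (no ties)."""
    if len(ranking_a) != len(ranking_b) or set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain exactly the same items")
    if len(set(ranking_a)) != len(ranking_a):
        raise ValueError("rankings must not contain duplicates")
    n = len(ranking_a)
    position_b = {item: i for i, item in enumerate(ranking_b)}
    # Sum of squared rank differences over all items.
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Top 10 mentioned accounts on October 13th, in rank order.
search_mentions = [
    "hkdemonow", "youngposthk", "fion_li", "SCMPVideoMoJo", "BBCBreaking",
    "leungfaye", "galileo44", "PenguinSix", "freakingcat", "Zuki_Zucchini",
]
streaming_mentions = [
    "hkdemonow", "youngposthk", "fion_li", "SCMPVideoMoJo", "BBCBreaking",
    "galileo44", "leungfaye", "freakingcat", "PenguinSix", "Zuki_Zucchini",
]

rho = spearman_rho(search_mentions, streaming_mentions)
print(round(rho, 3))
```

A value close to 1 indicates nearly identical orderings. For lists that do not share all their items, such as the hashtag, retweet and user lists, a different measure would be needed.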

Bibliography

Borra, E. and Rieder, B. (2014). "Programmed method: developing a toolset for capturing and analyzing tweets." Aslib Journal of Information Management, 66(2): 262-278.

Kwak, H., Lee, Ch., Park, H. and Moon, S. (2010). "What is Twitter, a social network or a news media?" Proceedings of the 19th International Conference on World Wide Web. New York: 591-600.

Rogers, R. (2013). "Debanalizing Twitter: The Transformation of an Object of Study." Proceedings of ACM Web Science 2013. Paris: May 2013.

Twitter. “Twitter Reports Second Quarter 2014 Results”. Investor Twitterinc. December 2, 2014. January 15, 2014 <https://investor.twitterinc.com/releasedetail.cfm?ReleaseID=862505>.

Topic revision: 18 Jan 2015, BM