Denis Teyssou, Kalina Bontcheva, Bertrand Goupil, Valentin Porcellini, Muneerah Patel, Yan Cong, Ziwei Zhang, Nadia Murady, Sayat Rahman Chowdhury, Ángeles Briones
According to a survey conducted within the European project vera.ai, archiving appearances of disinformation is one of the most cumbersome tasks fact-checkers are performing, due to anti-scraping measures taken by Facebook, Instagram, TikTok, and other platforms.
This may result in less well-documented fact checks (with fewer appearances and less evidence of disinformation traces). Facebook users may also remove posts that have been fact-checked or make them private.
Since Covid, there has also been a trend among some fact-checking organizations to avoid publishing links to appearances of disinformation, so as not to spread it further among the public. Fact-checkers sometimes collect links into spreadsheets but do not publish them.
As a result, many links to content disappear (erased by platforms or by end-users, or kept within private groups after debunks), which undermines fact-checking memory and makes it difficult for social scientists to evaluate the scope of disinformation on platforms.
The first goal of this project is therefore to understand the extent of the fact-checking ‘memory loss’.
What are fact-checkers’ current archiving practices to preserve disinformation traces?
How many of the fact-checked Facebook posts are now missing?
What kinds of difficulties (and error messages from platforms) are fact-checkers facing when archiving disinformation content?
By performing data analysis, we have mapped the current fact-checkers’ archiving practices on Facebook and other platforms, in order to better understand the scope of the problem, and to ideate possible solutions.
Our analysis of the “War in Ukraine” fact checks dataset published by the European Digital Media Observatory (EDMO) allowed us to better understand the scope of the problem of fact-checking ‘memory loss’, across 26 fact-checking organizations.
To analyze this dataset of 1991 fact checks, we used a variety of tools, including Minet, Hyphe, and Python with the Beautiful Soup library, to extract more than 41k links, from which we gathered a sub-corpus of 6002 unique archive links.
As error messages from the archived links were often difficult to retrieve, we also used Webrecorder archives and machine-learning-powered OCR from the GATE NLP team at Sheffield University to identify several archiving errors, such as entry barriers (for example “Log in Facebook” or “Not Logged In Facebook”) that are not exposed in HTTP headers.
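The link-extraction step can be illustrated with a minimal sketch. It is not the project's actual script: to stay dependency-free it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, and the `ARCHIVE_HOSTS` set and both function names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed, illustrative list of archive hosts (not taken from the study).
ARCHIVE_HOSTS = {"archive.today", "archive.ph", "archive.is",
                 "web.archive.org", "perma.cc"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all hyperlinks found in the HTML of one fact check."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def unique_archive_links(links):
    """Keep links hosted on a known archive service, deduplicated in order."""
    kept = {}
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        if host in ARCHIVE_HOSTS:
            kept.setdefault(link, None)
    return list(kept)
```

Running `extract_links` over every fact-check page and passing the combined list to `unique_archive_links` would yield a deduplicated sub-corpus of archive links, analogous to the one described above.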
Finally, we performed the data visualization and analysis by using RawGraphs.
At least 15% of the archived content links in EDMO’s War in Ukraine dataset are badly archived.
The “memory loss” is even greater, as a manual analysis of a sample of 100 links identified many additional errors due to unplayable videos.
Content is missing in 23% of the non-archived Facebook links in the same dataset.
This preliminary study on evaluating the ‘memory loss’ of disinformation archives links is limited so far to one dataset on the War in Ukraine, a major fact-checked event since the Russian invasion of Ukraine a year ago.
This dataset of almost 2000 fact checks at the time of undertaking the study gathers the work of 26 organizations under the curation of the EDMO partners with a wide variety of languages.
This first approach has allowed us to build several Python scripts that will let us expand this work in the coming months and carry out the same experiments with further datasets.
While bootstrapping this mapping of ‘memory loss’ of disinformation in fact checks, we faced several difficulties:
Retrieving archive content errors is difficult. First, platform links frequently return an HTTP 200 status even after the content has been removed; second, the error messages are not present in HTTP headers and must be extracted from images.
Retrieving content from archiving services programmatically is also difficult and quite slow. Platforms can detect automation and block it by adding extra layers of security, such as reCAPTCHA pages that prevent access to the content.
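Because an HTTP 200 status says nothing about whether the content survived, an error check has to look at the text visible in the capture (for instance, text recovered by OCR from a screenshot). The sketch below shows one possible way to do this. The pattern list is hypothetical: only the two Facebook login barriers are named in this report, and the other phrases are placeholders that would need to be replaced by messages actually observed.

```python
import re

# Hypothetical marker phrases. Only the login-barrier wording comes from the
# report; the other patterns are illustrative placeholders.
ERROR_PATTERNS = {
    "login_barrier": re.compile(r"\blog in\b|\bnot logged in\b", re.IGNORECASE),
    "unplayable_video": re.compile(
        r"video (cannot|can't) be (played|replayed)", re.IGNORECASE),
    "content_removed": re.compile(
        r"content isn't available|page not found", re.IGNORECASE),
}


def classify_capture(status_code, visible_text):
    """Label one archived capture using its HTTP status and its visible text."""
    if status_code != 200:
        return f"http_{status_code}"
    # A 200 response can still hide an error page, so inspect the text itself.
    for label, pattern in ERROR_PATTERNS.items():
        if pattern.search(visible_text):
            return label
    return "ok"
```

Phrase matching of this kind is brittle (a healthy page may also contain a "Log in" button), which is one reason a trained classifier is a more robust long-term option.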
To overcome those difficulties, and thanks to the involvement of UvA students, we managed to analyze two sub-corpora, one of 100 randomly selected archived links and another of 100 randomly selected fact-check links (while keeping the ratio between countries), in order to:
Retrieve error messages from this sub-corpus and take screenshots of them in order to get a first typology of errors and bootstrap an ML classifier. In this analysis, we found that up to 43% of archived links had errors, many of them stating that videos cannot be replayed through the archive.
Assess whether the archived links are appearances of disinformation (thereby preserving the suspicious content) or evidence of disinformation (archived links providing clues or proof of the falsity of the appearance links). This initial dataset will be reused in vera.ai to bootstrap an ML classifier and build a supervised detector to automate the annotation of fact checks.
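Drawing a random sample "while keeping the ratio between countries" corresponds to proportional stratified sampling. The following is a minimal sketch under that assumption, not the procedure actually used during the sprint; the function name, the `key` callback, and the fixed `seed` are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw n items at random, keeping each group's share close to its
    share in the full collection (largest-remainder allocation)."""
    if n > len(items):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    total = len(items)
    # Ideal (fractional) quota per group, rounded down first.
    quotas = {g: n * len(members) / total for g, members in groups.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    # Give the remaining slots to the groups with the largest remainders.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    sample = []
    for g, members in groups.items():
        sample.extend(rng.sample(members, alloc[g]))
    return sample
```

With 60 links from one country, 30 from a second, and 10 from a third, a sample of 10 would contain 6, 3, and 1 links respectively.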
The automated process we designed is limited because the platforms studied have highly dynamic interfaces and algorithms. While we were able to identify the current user interfaces that display information or signal missing content, our scripts will only keep working as long as the platforms do not radically change their designs or features.
Despite these limitations and technical difficulties, we managed to perform a thorough analysis of the EDMO War in Ukraine fact checks dataset by organization, by country, by scope of archived content, and by type of archive service used across the fact-checking community.
Three main services currently dominate the field: archive.today and its satellite sites (44.1% of the dataset), Archive.org (29.2%), and perma.cc (26.6%). Archive.today relies on advertising, perma.cc is a freemium, commercial service built at Harvard University, and the US Wayback Machine / Internet Archive remains the free-access web archive. Their use by fact-checking organizations often depends on their ability to archive content from platforms, especially Facebook, despite anti-bot and anti-scraping measures.
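Shares of this kind can be computed by mapping each archive link's host name to a service family and counting. The sketch below is illustrative only: the `SERVICE_BY_HOST` mapping is an assumption, and its list of archive.today satellite domains may be incomplete.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed host-to-service mapping; the archive.today mirrors listed here
# are illustrative and may not cover every satellite site.
SERVICE_BY_HOST = {
    "archive.today": "archive.today",
    "archive.ph": "archive.today",
    "archive.is": "archive.today",
    "archive.md": "archive.today",
    "archive.vn": "archive.today",
    "web.archive.org": "Archive.org",
    "archive.org": "Archive.org",
    "perma.cc": "perma.cc",
}


def service_shares(links):
    """Return each service's share (in percent) among recognised archive links."""
    counts = Counter()
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        service = SERVICE_BY_HOST.get(host)
        if service is not None:
            counts[service] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {service: round(100 * n / total, 1) for service, n in counts.items()}
```

Links whose host is not in the mapping are ignored, so the percentages always refer to recognised archive links only.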
While platform pages seem to remain accessible (HTTP 200 status), we found that a substantial number of them display archiving warnings (a sample of which is available as screenshots in the project poster), illustrating different types of archiving issues, from login barriers to unplayable video content.
This study will be completed in a few weeks, and we plan to extend it to more datasets by reusing the methodology developed during the Winter School data sprint.
Gomes, D., Miranda, J., Costa, M. (2011). A Survey on Web Archiving Initiatives. In: Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2011. Lecture Notes in Computer Science, vol 6966. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24469-8_41
Banchik, A. V. (2021). Disappearing acts: Content moderation and emergent practices to preserve at-risk human rights–related content. New Media & Society, 23(6), 1527–1544. https://doi.org/10.1177/1461444820912724
Ben-David, A. (2020). Counter-archiving Facebook. European Journal of Communication, 35(3), 249–264. https://doi.org/10.1177/0267323120922069
EDMO dataset, War in Ukraine: https://edmo.eu/war-in-ukraine-the-fact-checked-disinformation-detected-in-the-eu/#