The Social Life of a Crawler
Anat Bot-David, Anne HelBot, Jeroen Jonkbot, MarcBot Tuters, Botscar Coromina, Samuel Zwanbot, Simeona Botkova
Anat Ben-David, Anne Helmond, Jeroen Jonkbot, Marc Tuters, Oscar Coromina, Samuel Zwaan, Simeona Petkova
Is the web an increasingly closed space for crawlers?
Are there inclusion or exclusion policies for crawlers across different web spaces?
Is it possible to map the spaces that crawlers can/can't reach?
Which are the most marginalized bots?
To study whether there are exclusion or inclusion policies towards crawlers, we focused on six different spaces:
- News (50 websites listed in Google Directory)
- Dutch Blogosphere (Dejaap List)
- UN websites (list provided by UN)
- Gov websites (Wikipedia list of .gov sites from the U.S.)
- Social networking websites (most popular according to the Wikipedia list)
- Edu websites (queried Google for site:.edu and selected the top 100)
We checked whether each website had a robots.txt file and read it through a crawler's eyes.
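Reading a robots.txt file "through a crawler's eyes" can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration, not the tooling used in the project: the sample file, the bot names, the `example.org` URLs and the helper `can_crawl` are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.org/private/page.html"
    for bot in ("Googlebot", "SomeOtherBot"):
        print(bot, can_crawl(SAMPLE_ROBOTS_TXT, bot, url))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before `can_fetch()` is called.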
- Some websites don't have a robots.txt file, so all their content can be crawled without limitations.
- Some websites use robots.txt to explicitly allow all robots to crawl everything.
- Robots.txt is also used to prevent crawling of specific content.
- There are also whitelists (which grant some bots access to places that are disallowed for all other crawlers) and blacklists (bots that are not allowed to crawl at all, presumably for behaving badly).
- There is some kind of poetry in robots.txt comments.
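The categories above could be detected automatically along the following lines. This is a rough sketch under simplifying assumptions, not the method used in the project: the functions `parse_groups`, `blocks_everything` and `classify` are hypothetical, and "whitelist" and "blacklist" are approximated with a coarse heuristic based on `Disallow: /`. Comments are collected separately so they can be read by hand.

```python
from typing import Dict, List, Optional, Tuple

Rules = List[Tuple[str, str]]  # (directive, path) pairs, e.g. ("disallow", "/tmp/")


def parse_groups(robots_txt: str) -> Tuple[Dict[str, Rules], List[str]]:
    """Group Allow/Disallow rules by user agent and collect '#' comments."""
    groups: Dict[str, Rules] = {}
    comments: List[str] = []
    current: List[str] = []      # user agents of the group being read
    reading_agents = False       # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line, _, comment = raw.partition("#")
        if comment.strip():
            comments.append(comment.strip())
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not reading_agents:
                current = []     # a new group starts here
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
            reading_agents = True
        elif field in ("allow", "disallow"):
            for agent in current:
                groups[agent].append((field, value))
            reading_agents = False
    return groups, comments


def blocks_everything(rules: Rules) -> bool:
    return ("disallow", "/") in rules


def classify(robots_txt: Optional[str]) -> str:
    """Assumed heuristic; pass None when the site has no robots.txt file."""
    if robots_txt is None:
        return "no robots.txt"
    groups, _ = parse_groups(robots_txt)
    default = groups.get("*", [])
    named = [rules for agent, rules in groups.items() if agent != "*"]
    # An empty Disallow value restricts nothing.
    if not any(d == "disallow" and p for rules in groups.values() for d, p in rules):
        return "allow all"
    if blocks_everything(default) and any(not blocks_everything(r) for r in named):
        return "whitelist"
    if not blocks_everything(default) and any(blocks_everything(r) for r in named):
        return "blacklist"
    return "other restrictions"
```

For example, a file that disallows `/` for `*` but leaves one named bot unrestricted is labelled "whitelist" by this heuristic, while real files often mix several of these patterns and would need manual reading.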
A Large-Scale Study of Robots.txt. Yang Sun, Ziming Zhuang, and C. Lee Giles. 16th International World Wide Web Conference (2007). Publisher: ACM Press, Pages: 1123-1124.
Analysis of the usage statistics of robots exclusion standard. Alay, S., and J. Ekanayake. IADIS International Conference WWW/Internet 2006.