news.google scraper (BETA)
The script located at tools.issuecrawler.net/beta/googleNews queries and scrapes http://news.google.com and returns the results on screen and as a tab-separated text file. This text file can then be used for analysis, for example by importing it into MS Excel for analysis with ReseauLu.
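As an alternative to a spreadsheet, the tab-separated file can also be loaded with a few lines of Python. The sketch below is only an illustration: the column names and the sample rows are assumptions, since the real header depends on the fields selected in 'What to output'.

```python
import csv
import io

# Hypothetical sample of the tab-separated output. The real columns depend
# on the fields selected in 'What to output'.
SAMPLE = (
    "result number\ttitle\tsource\tdate\n"
    "1\tFirst headline\tExample Times\t02/11/2004\n"
    "2\tSecond headline\tExample Post\t01/11/2004\n"
)


def read_results(handle):
    """Return the rows of a tab-separated results file as dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_results(io.StringIO(SAMPLE))
for row in rows:
    print(row["result number"], row["source"], row["title"])
```

For a downloaded file, replace the in-memory sample with open(path, encoding="utf-8", newline=""), since all output is returned in UTF-8.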
Description of the input form:
- 'Number of results' specifies the maximum number of results you wish to retrieve. Google returns at most 100 results per query, so if you specify a value greater than 100, multiple queries are performed until the requested number of results is reached or until Google returns no more results.
- All text inputs can contain multiple queries, separated by a comma (,). For each query a new Google search is performed.
- All select forms can have multiple selected values. For each selected value in 'Google Version' and in 'Language', a new Google query is performed. To select multiple values, hold down 'Ctrl' on Windows and Linux, or the 'Apple' key on Mac, while clicking on the values.
- There are five fields which can hold multiple queries: 'Search for', 'Return only articles from the news source named', 'Return only articles from news sources located in', 'Google Version' and 'Language'. If multiple queries are specified, every combination of those queries is executed. E.g. if 'Search for' contains 'bush, kerry' and in 'Google Version' usa and uk are selected, there will be four Google queries: 'bush in google version usa', 'bush in google version uk', 'kerry in google version usa' and 'kerry in google version uk'.
- The script waits 3 seconds between queries, so elaborate queries will take some time.
- In the 'Search for' input you can enter boolean queries and you can group terms with quotes ("). See this page for a description of how to formulate queries correctly. Note that the script uses commas (,) to separate queries.
- Google News normally does not offer the ability to restrict search results by language. Also, not all languages offered in the script seem to be supported by Google.
- The script queries with filter=0, which means all results are returned, as if you had run the query with similar results included.
- In 'What to output' you can specify which fields are displayed as output. Take care in selecting what you wish to output. If you search with images, also include the result number, because images are referenced by result number. If you search in different languages, also select the language field, and so on.
- Next to 'What to output' you can specify whether you would like to output to the screen, to a file (as a tab-separated list), or both.
- In 'What to output' you will find a special selectable value, 'same images'. When it is selected, all the thumbnails found for your query are compared using a normal UNIX diff function. The output is true only if the files (images) are exactly the same. You may see pictures which appear to be the same but are not reported as such; this means that the files are slightly different. Next to each image you will find the result number of the article it belongs to. There will be an extra column in the result called 'same images', which gives you a list of result numbers with the same thumbnail.
- All Google results which have a date of the form 'x hours ago' or 'x minutes ago' (in any language) are calculated as the current time of the server (UTC+2) minus the time given by Google. All dates are converted to the form day/month/year.
- Every US state is mapped to USA in the output.
- If you have selected 'thumbnail' or 'same images' in 'What to output', the thumbnails are stored for future reference.
- All output is returned in UTF-8.
- Only results that have been stored to a file can be retrieved through the 'Previous results' link. For previous results which have 'same images' as output, the diff is calculated again to give you a nice overview.
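The combination rule for the multi-query fields, together with the limit of 100 results per query and the 3-second interval, can be sketched in Python as follows. This is an illustration of the behaviour described above, not the script's actual code; the function names are hypothetical, and only two of the five fields are shown.

```python
from itertools import product

MAX_PER_QUERY = 100   # Google returns at most 100 results per query (see above)
PAUSE_SECONDS = 3     # interval between queries (see above)


def split_queries(text):
    """Split a comma-separated text input into individual queries."""
    return [part.strip() for part in text.split(",") if part.strip()]


def expand(*fields):
    """Return every combination of the values given for each field."""
    return list(product(*fields))


def page_offsets(wanted):
    """Start offsets needed to collect `wanted` results in pages of 100."""
    return list(range(0, wanted, MAX_PER_QUERY))


def minimum_wait(n_requests):
    """Total pause time in seconds when waiting between consecutive requests."""
    return max(n_requests - 1, 0) * PAUSE_SECONDS


combos = expand(split_queries("bush, kerry"), ["usa", "uk"])
offsets = page_offsets(250)
for term, version in combos:
    print(f"{term} in google version {version}")
print("pauses add up to", minimum_wait(len(combos) * len(offsets)), "seconds")
```

With two search terms, two Google versions and 250 requested results, this gives four combinations of three pages each, which shows why elaborate queries take some time.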
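The handling of relative dates such as 'x hours ago' can be sketched in the same way. The example below is an assumption-laden illustration: it only recognises English phrases (the script is described as handling any language), the function name is hypothetical, and zero-padded day/month/year output is assumed.

```python
import re
from datetime import datetime, timedelta, timezone

SERVER_TZ = timezone(timedelta(hours=2))  # server time is UTC+2, per the text

# English only in this sketch; other languages would need their own patterns.
RELATIVE = re.compile(r"(\d+)\s+(hour|minute)s?\s+ago")


def to_day_month_year(text, now):
    """Convert 'x hours ago' or 'x minutes ago' into day/month/year."""
    match = RELATIVE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a relative date: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "hour":
        delta = timedelta(hours=amount)
    else:
        delta = timedelta(minutes=amount)
    return (now - delta).strftime("%d/%m/%Y")


# A fixed 'current server time' keeps the example reproducible.
now = datetime(2004, 11, 2, 1, 30, tzinfo=SERVER_TZ)
print(to_day_month_year("3 hours ago", now))
print(to_day_month_year("20 minutes ago", now))
```

Note that a result published a few hours ago can fall on the previous day in server time, as the first call shows.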
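The 'same images' column can be illustrated with a short sketch as well. The script is described as using a UNIX diff; the example below substitutes a plain byte-for-byte comparison in Python, which likewise treats two thumbnails as the same only if the files are exactly identical. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def same_images(thumbnails):
    """Map each result number to the other results with an identical thumbnail.

    `thumbnails` maps a result number to the raw bytes of its thumbnail file.
    """
    groups = defaultdict(list)
    for number, data in thumbnails.items():
        groups[data].append(number)
    return {
        number: [other for other in groups[data] if other != number]
        for number, data in thumbnails.items()
    }


# Made-up byte strings standing in for thumbnail files.
sample = {1: b"image-a", 2: b"image-b", 3: b"image-a"}
matches = same_images(sample)
for number, others in matches.items():
    print(number, others)
```

Two pictures that look alike but differ by even one byte end up in different groups, which matches the behaviour described for visually similar thumbnails.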
We recommend Firefox as your browser, but any browser should do, as the HTML is W3C compliant.
If something doesn't work as expected, please send an email and specify the exact time and date as well as your timezone.