
Issue Crawler FAQ

How many URLs can I enter to start with?

2-20 starting points suffice. The maximum is up to 300 for cartographers (most users) or up to 650 for site authors and administrators. The more starting points entered, the longer the processing time.

How does the harvester work? Is there a relationship between how the harvester works and how the crawler works?

The harvester strips out URLs from text and code. The URLs it retains are of these types: http://www.govcom.org, http://govcom.org and www.govcom.org. If the URL is just govcom.org, the harvester does not retain it. The Issue Crawler itself analyzes hyperlinks in html and other code, and in such code hyperlinks always carry http://; the Issue Crawler thus deals with all hyperlinks, not just the strings the harvester would retain. For your information, the Issue Crawler, like all crawlers we know of, doesn't handle javascript. The Issue Crawler also does not strip links from pdfs, but pdfs may be in the final results.
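As a rough illustration of this distinction, here is a minimal Python sketch of the kind of pattern matching the harvester performs; the regex and the harvest function are assumptions for the example, not the harvester's actual code.

import re

# Simplified illustration (not the actual harvester code): keep strings that
# start with http://, https:// or www., and drop bare domains like govcom.org.
URL_PATTERN = re.compile(r'(?:https?://|www\.)[^\s,]+', re.IGNORECASE)

def harvest(text):
    """Return the URL-like strings found in a blob of text or code."""
    return URL_PATTERN.findall(text)

print(harvest("See http://www.govcom.org, www.govcom.org and govcom.org."))
# ['http://www.govcom.org', 'www.govcom.org']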

What is the 'iteration' setting?

The iteration setting determines how many times a co-link analysis is done. With 1 iteration, the external urls of the pages pointed at by the starting points are fetched; after that a co-link analysis is performed, whereby pages receiving at least two links are retained. With 2 iterations, after the previous step, the external urls of the pages pointed at by the co-linkee set are fetched and the co-link analysis is performed again. The latter (fetching of urls and co-link analysis) happens one more time with 3 iterations.

What is the 'depth' setting?

The depth setting applies to following links. With depth 3, first the links are fetched from the pages pointed at by the starting points, then the links of the pages pointed at by those links (from depth 1) are fetched, and then the links of the pages pointed at by those links (from depth 2) are fetched. This happens for each iteration, except that for iterations > 1 the links in the co-linkee set 'become the starting points'. (Note, however, that links to hosts outside the starting points are not crawled until the next iteration, regardless of the depth setting -- see the 'limits' section below.)
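To make the interplay of depth and the 'local' domain limit concrete, here is a hedged Python sketch; crawl_iteration and fetch_links are hypothetical names for this illustration, not functions of the Issue Crawler.

from urllib.parse import urlparse

def crawl_iteration(starting_points, depth, fetch_links):
    """Illustrative depth-limited link following for one iteration.

    fetch_links(url) stands in for 'download the page and return the URLs it
    links to'; the real crawler also applies the limits described in the
    'limits' section below.
    """
    allowed_hosts = {urlparse(u).netloc for u in starting_points}
    seen, frontier = set(starting_points), list(starting_points)
    found_links = []
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in fetch_links(url):
                found_links.append((url, link))
                # Only links staying on the starting-point hosts are followed
                # further this iteration; external links are kept for the
                # co-link analysis instead.
                if urlparse(link).netloc in allowed_hosts and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return found_links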

What are 'starting points'? I launched a crawl, and would like to see the starting points again. Where are they? How do I retrieve them?

Starting points (SP) are the URLs you start your crawl with. To retrieve the starting points after the crawl has completed, on the network details page click "retrieve startingpoints and network urls." Alternatively, open the xml file associated with your crawl, which is located on the network details page (at the bottom, where the actor matrix list and the raw data files are, too). At the end of the xml file you will see a list of your starting points, starting point=. Copy and paste this list into the Issue Crawler harvester, harvest and save results for a cleaned up list of your starting points.

How does the co-link analysis work?

A co-link analysis is done after each iteration. The algorithm is roughly:
1. Find all external links from each site
2. Sort links alphabetically
3. Scan through all lists returning those sites which exist in the external links of two or more sites.

First a list is built of all external links (links not pointing to the host from which the links were extracted) for each site. Then each link from each site is compared to all links of the other sites. If the same link is found, it is put in a set of co-linkees with which the next iteration starts. Also see the 'limits' section in this FAQ.
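In Python-like terms, the step described above could look roughly as follows; the site names and the colink_analysis helper are hypothetical, and the real implementation differs in detail (by-page vs. by-site handling, the limits below, and so on).

from collections import defaultdict
from urllib.parse import urlparse

def colink_analysis(external_links):
    """Illustrative co-link analysis: external_links maps each crawled site
    to the URLs it links to; a URL is a co-linkee when at least two different
    sites link to it."""
    inlinking_sites = defaultdict(set)
    for site, links in external_links.items():
        for link in links:
            if urlparse(link).netloc != site:   # keep only external links
                inlinking_sites[link].add(site)
    return {link for link, sources in inlinking_sites.items() if len(sources) >= 2}

external_links = {
    "a.org": {"http://x.org/page", "http://y.org/"},
    "b.org": {"http://x.org/page", "http://z.org/"},
    "c.org": {"http://z.org/"},
}
print(colink_analysis(external_links))   # {'http://x.org/page', 'http://z.org/'}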

What is the 'privileged' setting?

With privileged starting points turned on, the starting points are added as seed URLs again (together with the co-linkees) for the second iteration, but not in any subsequent iteration of method. Thus, if privileged starting points are turned on, the starting points will be retained in crawled results after one iteration (with all other co-linkees). Normally, privileged starting points are only used when one selects one iteration of method, whereby the starting points are kept in the results if they receive co-links. The results contain no isolates.

What is the 'by page' setting?

Co-link analysis is performed on the most specific or 'deep' pages, and the Issue Network is thus comprised of pages as opposed to sites (or hosts). Generally, selecting the 'by page' setting is meant to avoid 'issue drift', or the location of multiple-issue networks.

What is the 'by site' setting?

A site is defined as the source_host. For example, http://www.example.com/ex/am/ple/index.html is a page and http://www.example.com/ is the source_host. The co-link analysis is performed on the hosts and not on the deep pages.
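A minimal Python illustration of reducing a deep page to its source_host, using the standard library's urlparse (illustrative only; the crawler's own host handling may differ in edge cases):

from urllib.parse import urlparse

page = "http://www.example.com/ex/am/ple/index.html"
parts = urlparse(page)
source_host = f"{parts.scheme}://{parts.netloc}/"
print(source_host)   # http://www.example.com/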

What are 'links from the network' and 'links to the network'?

Links from the network are links this node receives from nodes depicted in the network.

Links to the network are links from this node to other nodes depicted in the network.

What are core and periphery sites? Are there ever isolates?

The difference between core and periphery is in whether they are co-linked (receive links from at least 2 other sites) or not. The visualization of a network never includes peripheral sites. However, a core site X can be in the network (visualization) as an isolate when two distinct peripheral sites Y and Z, sites which themselves do not receive inlinks, link to site X.

How is the size of the nodes in the network measured?

The size of the nodes is determined by the indegree count, i.e. the number of links a node receives.

By what measure are the cluster maps generated?

Spring-based cluster maps are generated on the basis of indegree centrality.

What does a crawl with the following settings actually do?

  • By page, 1 iteration, depth of 1, SP not privileged
    First all the external links of the pages pointed at by the starting points are retrieved. All these links, for all these starting points are then co-link analysed. From the set of co-linkees (in this case links referenced by at least two SP) a map is drawn. The resulting nodes are webpages. Note: the privileged starting point setting does not play a role here.
  • For more scenarios of use follow this link: http://www.govcom.org/scenarios_use.htm

What is the difference between crawl, fetch, and identify?

Crawling is the process of fetching (getting / downloading) webpages. Identifying is looking for some particular content (e.g. a link - denoted by a '<a href=' html tag). You can also fetch a link from a page.
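For illustration, the following Python sketch fetches a page and then identifies the links in it by looking for '<a href=' tags; the URL and the LinkIdentifier class are just placeholders for the example, not part of the Issue Crawler.

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkIdentifier(HTMLParser):
    """Identify links, i.e. collect the href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Fetch (download) the page, then identify the links in it.
html = urlopen("http://www.govcom.org/").read().decode("utf-8", errors="replace")
parser = LinkIdentifier()
parser.feed(html)
print(parser.links)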

How do you make only one class of nodes visible in the map?

Click on a colored node in the legend.

How do you display the inlinks and outlinks of a single node?

Turn off links in the legend, and click the node for which you would like to display inlinks and outlinks. Click the node again for just inlinks, again for just outlinks, and again for no links. One final click brings back inlinks and outlinks.

What makes a node (on a circle map) a peripheral node?

Peripheral nodes are those actors that do not quite receive enough links to make it into the network. Note that this changes if you choose the advanced settings and change the node count.

What is the best way to transfer the issue maps into documents for publication?

On the map, in the lower right hand corner, you'll notice a drop down menu where you can export and save maps in a variety of formats. For print, tif is a standard. It has the highest resolution and the largest file size, but it is not 'editable' in ways layout people like. It's good for importing into Word, though. The jpg and the png are lower quality. For submitting to a journal (which will resize and place the figure), the pdf is suggested. These days pdfs may be used just like .eps, which is the standard. Maps (or graphs) may be saved with and without legend.

What is the best final format for printing?

The pdf works well. If you'd like to print a pdf, use the setting 'actual size' and set to landscape.

Is there any way to retrieve data from a crawl that has too few nodes to render a map?

Yes, on the network details page you'll see the xml file. Open that in your browser. In some browsers (like Safari on the Mac), you'll have to subsequently 'view source'. See XmlFileFormat for more info on the xml file. You can also access the data through the 'raw data' on the network details page.

What is the threshold for "Too few nodes to render"?

One node or fewer doesn't render. If you see no truncated URLs under your crawl result in the archive/network manager, it means that the results are 'empty' and that there's in fact 'no network' (which is a finding). Occasionally no network is caused by a bug, but this is rare. If the top of the xml file, under [info], indicates an error of one kind or another, then please relaunch your crawl.

Some people have reported seeing maps rendered differently from the same crawl at different times.

The absolute positioning of the nodes on the map may change, moving around the four quadrants of the map, but the relative positioning remains the same. This is to remind people that what's important is the relative positioning, not the absolute, i.e. the map could be turned on its head, or rotated 90 degrees, and mean the same. If you reload the page a couple of times you will eventually get the same rendering again.

Is there a way to access the data in binary form?

There are several options; see the data formats described in the questions below (the xml source file, the raw data, the UCINET data file and the GEXF export).

What is the format of the xml file?

The XmlFileFormat is described in a separate topic. The XML file can be loaded by clicking on 'xml source file' in the network details screen.

What is the format of the raw data?

The RawDataFormat is described in a separate topic. The raw data file can be loaded by clicking on 'raw data' in the network details screen.

UCINET

UCINET is software for the analysis of social network data.

What is the format of the UCINET data file?

The UCINET data file is in full matrix DL format. The labels are the hosts from the issuecrawler network. The numbers depict the number of (page) links between two hosts.
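As a hedged illustration (with hypothetical hosts and counts, and a header whose exact details may differ from the real export), here is a small Python snippet that writes such a full matrix DL file:

# Hypothetical example of a full matrix DL file; illustrative only.
labels = ["a.org", "b.org", "c.org"]
matrix = [
    [0, 2, 1],   # number of (page) links from a.org to a.org, b.org, c.org
    [1, 0, 0],
    [3, 1, 0],
]
lines = [f"dl n={len(labels)} format=fullmatrix", "labels:", ",".join(labels), "data:"]
lines += [" ".join(str(v) for v in row) for row in matrix]
with open("example.dl", "w") as f:
    f.write("\n".join(lines) + "\n")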

How to import the UCINET data file?

To import the UCINET data file download the file to your computer and start UCINET. Click data -> import -> dl. Select the file you just downloaded and click ok.

How do I get the ranked actor lists in UCINET?

To obtain the ranked actor list by site perform the following steps. Import the UCINET data file into UCINET. Click network -> 'Ego Networks' -> density. Select the .##h version of the file you just imported. Select 'IN-NEIGHBOURHOOD' as the 'type of ego neighbourhood'. Click ok. The 'Size' column will now have the same numbers as the ranked actor list by site.

To obtain the ranked actor list by page just sum the column of a particular site.

The information for 'links received from crawled population' is not present in the UCINET data file.

Why are there more sites in the UCINET data file than in the ranked actor list or the svg?

In the issuecrawler xml file there is a distinction between core and peripheral sites. Peripheral pages and sites only send links to the network but receive none. Core pages and sites send and receive links to/from the network. Only sites and pages which are in the core network will be displayed in the issuecrawler network and in the ranked actor list as peripheral pages have no inlinks. All extra sites in the UCINET data file will thus be peripheral sites.

What is the actor list?

The actor list shows which hosts link to which hosts. The rows are 'from', the columns 'to'. No quantity is displayed. There are two versions of the actor list. The first one, core network, only shows the interlinking you would see in a cluster map (i.e. hosts and links in the core network AND links from external pages that link to hosts in the core network). The second one, core network and periphery, also includes links from external pages - those pages that do not get links from the network but do link to the network. The non-matrix versions convey the same information in a different (more printer-friendly) format. The first column is the source host and the second column is the target host.

What is the ranked actor list?

The ranked actor lists reflect the data in the full network.

The ranked actor list ranks the actors by the number of inlinks they receive. There are three options (a small illustrative sketch follows this list):
  • by page: this will list how many deeplinks a particular host received
  • by site: this will list how many different hosts linked to this host
  • crawled population: this will list the total number of pages linking to this host, found during the crawl. Note that not all pages or hosts found during the crawl are in the final network, as each iteration and co-link analysis removes links from the total population.
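The sketch below computes the 'by page' and 'by site' counts from a hypothetical edge list of (source page, target page) links; it is illustrative only and not the Issue Crawler's own code.

from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical links found in a network: (source page, target page).
links = [
    ("http://a.org/p1", "http://x.org/doc"),
    ("http://a.org/p2", "http://x.org/doc"),
    ("http://b.org/p1", "http://x.org/doc"),
]

by_page = defaultdict(int)   # deeplinks received per target host
by_site = defaultdict(set)   # distinct linking hosts per target host
for source, target in links:
    target_host = urlparse(target).netloc
    by_page[target_host] += 1
    by_site[target_host].add(urlparse(source).netloc)

print(by_page["x.org"])        # 3: x.org received three deeplinks
print(len(by_site["x.org"]))   # 2: two different hosts link to x.org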

What is 'page list with their interlinkings (core and periphery)'

This script looks at all pages which receive links. The script counts the links (including external links, i.e. periphery) of a deep page, displays all the pages which link to it and the number of links received. This script makes it easy to see which pages get deeplinks from which other pages.

Where can I find the newest SVG plugin?

Use the Adobe SVG Viewer plug-in. Native SVG support built into the Firefox browser doesn't render the Adobe SVG very well as of yet. Windows users are advised to use Internet Explorer with the Adobe SVG Viewer plug-in.

Note: To use SVG in Firefox on Intel Macs, make sure you have installed the latest version of Firefox and that you launch Firefox in Rosetta. To do this, perform the following steps: go to Finder -> Applications, right click Firefox, go to the 'General' submenu, enable Rosetta, click ok, and launch Firefox.

Limits / Ceilings and default settings

There are certain pre-defined limits in the crawler.

Default settings:
  • Maximum number of starting points: 300
  • Obey robots.txt: yes
  • Download timeout - time in seconds after which an http request is aborted (e.g. for a host that is not reachable): 300

The following limits may be raised by the user (a consolidated sketch of these defaults follows the list below).
  • Co-link Analysis Limits: Specifies the number of sites and pages permitted to be found after co-link analysis. Each of these can be 0 which means unlimited. If unlimited is set, the crawl may run for days and days.
    • sites: 100
    • pages: 100
      The limit here means that only the first 100 distinct sites (those with the greatest number of occurrences) are taken into account for the co-link analysis. On the one hand this parameter is influential, as often more than 100 distinct sites are encountered. On the other hand it is not that influential, as the most important sites (greatest number of occurrences) are taken into account.
  • Crawl limits: Limits the number of urls crawled during each iteration. These settings can be very influential as it is a pretty arbitrary stop in the process of detecting links. The current settings will have great impact if you specify pages with a lot of links (more than 500), or if you crawl a set of pages with recurring hosts.
    • urls per host: 500
      This means that only the first 500 urls of a host (=site) are crawled and processed.
    • total urls: 40000
      This means that if 40000 urls have been crawled this iteration, the crawler stops fetching pages and performs a co-link analysis.
  • Crawl 'domain' limits (in Crawler.java), set to 'local': regardless of depth settings, links to hosts other than the starting points for this iteration are not themselves crawled. Each iteration thus provides a sphere exactly one host deeper around the starting points of the last iteration.
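For reference, the defaults above could be summarized as a configuration along these lines; this is a hypothetical Python representation for illustration, not how the crawler (e.g. Crawler.java) actually stores its settings.

# Hypothetical summary of the defaults described above; illustrative only.
CRAWL_DEFAULTS = {
    "max_starting_points": 300,
    "obey_robots_txt": True,
    "download_timeout_seconds": 300,
    "colink_limit_sites": 100,     # 0 means unlimited
    "colink_limit_pages": 100,     # 0 means unlimited
    "max_urls_per_host": 500,
    "max_urls_per_iteration": 40000,
    "crawl_domain_limit": "local",
}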

How can I exclude sites from the network?

When launching a crawl you can specify sites to be excluded from the network. By default this blacklist is a site/page exclusion list that excludes software download pages and the like.

The exclusion list works by simple substring matching. E.g. the exclusion string 'www.amazon.com' will match only URLs from the international Amazon, while the exclusion string 'www.amazon.' will also match all local-language versions such as www.amazon.fr.
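In Python terms, the matching amounts to something like the following (illustrative only; is_excluded is a hypothetical helper, not the crawler's API):

exclusions = ["www.amazon."]

def is_excluded(url):
    """Simple substring matching, as described above."""
    return any(pattern in url for pattern in exclusions)

print(is_excluded("http://www.amazon.com/dp/123"))   # True
print(is_excluded("http://www.amazon.fr/livres"))    # True
print(is_excluded("http://amazon-watch.org/"))       # False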

How can I load the GEXF export into Gephi?

See GexfExport for step by step instructions.