XmlFileFormat < Issuecrawler

The 'xml source file'

This topic explains the format of the XML file that can be downloaded from the network details view. I will describe the format by walking through an actual example of such a file.

<?xml version='1.0'?>
<!DOCTYPE IssueNetwork PUBLIC "-//OneWorld International//DTD IssueNetwork 0.1//EN" "http://www.issuecrawler.net/lib/dtd/IssueNetwork_0_1.dtd">

These lines describe the format of the XML file. The first line declares that the file is XML; the second line names the actual dialect. The file http://www.issuecrawler.net/lib/dtd/IssueNetwork_0_1.dtd contains the formal definition of the syntax of the file in the form of a document type definition (DTD).

The next block gives information about the network:

   <Title>bewaarplicht nieuw 1</Title>
   <Author name="New Media" email="rogers@hum.uva.nl" />
   <Description>
      <![CDATA[
      [INFO]Mon Jan 24 06:23:06 CET 2005: Starting crawl dbPrimaryKey: 297989 SeriesID: 297811 SeriesIndex: 5 Title: 'bewaarplicht nieuw 1'
      [INFO]Total bytes downloaded: 8410474
      [INFO]Bytes per second: 25442.879
      [INFO]Performing colink analysis. Mode: 1, priviledgeStartingPoints: true, isFinalIteration: false
      [INFO]Total bytes downloaded: 137073807
      [INFO]Bytes per second: 84090.91
      [INFO]Generating network info...
      ]]>
   </Description>

The Title and Author fields are self-explanatory, but the Description field merits more explanation. It contains the output of the crawler backend, which prints various details: when the crawler started crawling this network and what the network's unique ID in the database is. SeriesID and SeriesIndex are tied to the scheduler. The output also shows the statistics of each iteration. In this case there are two iterations: one in which 8410474 bytes (about 8 MB) were downloaded and a second in which 137073807 bytes (about 137 MB) were downloaded. If any serious errors occur during the crawl, they are also included in the Description field.
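Because the Description field is free-form log text rather than structured XML, any statistics in it have to be recovered by pattern matching. The following is a minimal sketch, assuming the log lines keep the exact wording shown in the example; the function names are hypothetical and not part of Issuecrawler.

```python
import re

# Log text copied from the Description field of the example file.
DESCRIPTION = """\
[INFO]Mon Jan 24 06:23:06 CET 2005: Starting crawl dbPrimaryKey: 297989 SeriesID: 297811 SeriesIndex: 5 Title: 'bewaarplicht nieuw 1'
[INFO]Total bytes downloaded: 8410474
[INFO]Bytes per second: 25442.879
[INFO]Performing colink analysis. Mode: 1, priviledgeStartingPoints: true, isFinalIteration: false
[INFO]Total bytes downloaded: 137073807
[INFO]Bytes per second: 84090.91
[INFO]Generating network info...
"""


def bytes_per_iteration(description):
    """Return one byte count per 'Total bytes downloaded' log line."""
    return [int(n) for n in re.findall(r"Total bytes downloaded: (\d+)", description)]


def crawl_ids(description):
    """Extract the numeric identifiers printed on the 'Starting crawl' line."""
    pairs = re.findall(r"(dbPrimaryKey|SeriesID|SeriesIndex): (\d+)", description)
    return {key: int(value) for key, value in pairs}


iterations = bytes_per_iteration(DESCRIPTION)
ids = crawl_ids(DESCRIPTION)
print(iterations)
print(ids)
```

The wording of the log lines is not defined by the DTD, so a parser like this should tolerate missing matches rather than assume every line is present.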

   <PageList>

This tag announces the start of the list of pages. It is followed by several blocks enclosed in Site tags, one for each site in the network. Note that external pages (pages that link into the network but don't receive links from the network) are listed separately later on.

      <Site URL="" host="netkwesties.nl" name="" category="NL" inlinks="65">
         <Page URL="http://www.netkwesties.nl/editie109/artikel2.html" ID="28" datestamp="2004-10-28 10:27:48">
            <Link TargetPageID="11" />
            <Link TargetPageID="12" />
         </Page>
      </Site>

A block such as this is present for each site in the network. It gives the hostname ('netkwesties.nl' in this case), the 'category' (basically, the top-level domain), and the number of inlinks (counted on a page-by-page basis). The URL attribute of the Site tag is currently unused.

For each site (read: section between Site tags) there are 0 or more pages defined with Page tags. The opening Page tag contains the URL of the actual page, the datestamp given by the web server when this page was retrieved and an ID. Each page has a unique ID.

Between the Page tags, we see Link tags. Each Link tag defines a link from the enclosing page to the page whose ID is given in the TargetPageID attribute.
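The Site, Page and Link structure described above maps naturally onto a directed edge list. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes that page IDs are always integers and parses only a PageList fragment; the function name page_edges is hypothetical.

```python
import xml.etree.ElementTree as ET

# PageList fragment taken from the example file.
PAGE_LIST = """\
<PageList>
   <Site URL="" host="netkwesties.nl" name="" category="NL" inlinks="65">
      <Page URL="http://www.netkwesties.nl/editie109/artikel2.html" ID="28" datestamp="2004-10-28 10:27:48">
         <Link TargetPageID="11" />
         <Link TargetPageID="12" />
      </Page>
   </Site>
</PageList>
"""


def page_edges(page_list):
    """Return ({page_id: (host, url)}, [(source_id, target_id), ...])."""
    pages = {}
    edges = []
    for site in page_list.iter("Site"):
        for page in site.findall("Page"):
            page_id = int(page.get("ID"))  # assumption: IDs are integers
            pages[page_id] = (site.get("host"), page.get("URL"))
            for link in page.findall("Link"):
                edges.append((page_id, int(link.get("TargetPageID"))))
    return pages, edges


pages, edges = page_edges(ET.fromstring(PAGE_LIST))
print(pages)
print(edges)
```

Using iter("Site") rather than a fixed path means the same function also works when it is handed the root of a complete file instead of the PageList element alone.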

   </PageList>

This tag marks the end of the page list.

   <InwardLinks>

With this tag, we start the list of external pages (the 'periphery' of the network). It is similar to the PageList we saw above, but it contains the sites and pages that only link into the network and do not get a link back from the network. Instead of several Site tags, it defines a list of ExternalSite tags:

      <ExternalSite URL="" host="bof.nl" name="" category="NL">
         <ExternalPage URL="http://www.bof.nl/auteursrecht.html">
            <Link TargetPageID="54" />
         </ExternalPage>
      </ExternalSite>

The only difference between ExternalSite and Site tags is that Site tags have an inlink count defined. Since external pages by definition do not receive inlinks, the attribute is omitted here. The difference between ExternalPage and Page tags is that Page tags have an ID and ExternalPage tags don't. Although this is inconvenient for reasons beyond the scope of this topic, the rationale is that since external pages don't receive links, they don't require IDs.
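Since ExternalPage tags carry no ID, a consumer has to identify external pages by some other key. The sketch below uses the URL for that purpose, which is an assumption of this example and not something the format prescribes; the function name inward_edges is hypothetical.

```python
import xml.etree.ElementTree as ET

# InwardLinks fragment taken from the example file.
INWARD_LINKS = """\
<InwardLinks>
   <ExternalSite URL="" host="bof.nl" name="" category="NL">
      <ExternalPage URL="http://www.bof.nl/auteursrecht.html">
         <Link TargetPageID="54" />
      </ExternalPage>
   </ExternalSite>
</InwardLinks>
"""


def inward_edges(inward_links):
    """Return [(external_page_url, target_page_id), ...]."""
    edges = []
    for page in inward_links.iter("ExternalPage"):
        for link in page.findall("Link"):
            # External pages have no ID, so the URL stands in as the key.
            edges.append((page.get("URL"), int(link.get("TargetPageID"))))
    return edges


inward = inward_edges(ET.fromstring(INWARD_LINKS))
print(inward)
```

The target IDs refer to Page entries in the PageList, so these edges can be joined with the page table built from that list.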

   </InwardLinks>

This signifies the end of the external sites/pages list.

Next we find the list of starting-points:

   <StartingPoints privilege="0">
      <StartingPoint URL='http://www.bewaarplicht.nl' />
      <StartingPoint URL='http://www.bigbrotherawards.nl/verkeersgegevens.html' />
      <StartingPoint URL='http://www.bof.nl' />
      <StartingPoint URL='http://www.bof.nl/nieuwsbrief/nieuwsbrief_2004_22.html' />
      <StartingPoint URL='http://www.ispo.nl' />
      <StartingPoint URL='http://www.netkwesties.nl/editie114/artikel1.html' />
      <StartingPoint URL='http://www.netkwesties.nl/editie115/artikel1.html' />
      <StartingPoint URL='http://www.netkwesties.nl/editie115/artikel3.html' />
      <StartingPoint URL='http://www.webwereld.nl/nieuws/20182.phtml' />
      <StartingPoint URL='http://www.webwereld.nl/nieuws/20196.phtml' />
      <StartingPoint URL='http://www.xs4all.nl/nieuws/bericht.php?id=588&taal=nl&msect=nieuws' />
   </StartingPoints>

This is pretty self-explanatory. The privilege attribute indicates whether the co-link analysis uses privileged starting points. If so, the attribute value is 2; otherwise it is 0.
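Reading the starting points back out is a matter of collecting the URL attributes and interpreting the privilege value. The following sketch follows the rule stated above (2 means privileged, 0 means not) and uses a shortened list of starting points for brevity; the function name read_starting_points is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened StartingPoints fragment based on the example file.
STARTING_POINTS = """\
<StartingPoints privilege="0">
   <StartingPoint URL='http://www.bewaarplicht.nl' />
   <StartingPoint URL='http://www.bof.nl' />
   <StartingPoint URL='http://www.ispo.nl' />
</StartingPoints>
"""


def read_starting_points(element):
    """Return (privileged, [url, ...]) for a StartingPoints element."""
    # Per the description above: "2" means privileged, "0" means not.
    privileged = element.get("privilege") == "2"
    urls = [sp.get("URL") for sp in element.findall("StartingPoint")]
    return privileged, urls


privileged, urls = read_starting_points(ET.fromstring(STARTING_POINTS))
print(privileged, urls)
```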

   <Statistics>
      <CrawlStart datestamp="2005-01-24 06:23:06"/>
      <CrawlEnd datestamp="2005-01-24 06:56:31"/>
      <CrawlTimeouts>227</CrawlTimeouts>
      <PagesDownloaded>5051</PagesDownloaded>
      <ExcludedPages></ExcludedPages>
   </Statistics>

This block gives us various statistics about the network and the crawl. We see when the crawl started and finished, how many page requests timed out (HTTP timeout) and how many pages were actually downloaded. The ExcludedPages element should also give the number of pages that were downloaded but matched one of the hosts in the exclude list (and were therefore excluded); in this example it is empty.
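The datestamps in the Statistics block can be turned into a crawl duration with the standard datetime module. This is a minimal sketch that assumes the datestamp format is always "YYYY-MM-DD HH:MM:SS" as in the example, and that treats an empty counter element such as ExcludedPages as unknown (None); the function name read_statistics is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Statistics fragment taken from the example file.
STATISTICS = """\
<Statistics>
   <CrawlStart datestamp="2005-01-24 06:23:06"/>
   <CrawlEnd datestamp="2005-01-24 06:56:31"/>
   <CrawlTimeouts>227</CrawlTimeouts>
   <PagesDownloaded>5051</PagesDownloaded>
   <ExcludedPages></ExcludedPages>
</Statistics>
"""

DATESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumption: matches every file


def read_statistics(stats):
    """Return the crawl duration and counters from a Statistics element."""
    start = datetime.strptime(stats.find("CrawlStart").get("datestamp"), DATESTAMP_FORMAT)
    end = datetime.strptime(stats.find("CrawlEnd").get("datestamp"), DATESTAMP_FORMAT)

    def count(tag):
        # Empty or missing elements yield None instead of raising.
        text = (stats.findtext(tag) or "").strip()
        return int(text) if text else None

    return {
        "duration": end - start,
        "timeouts": count("CrawlTimeouts"),
        "pages_downloaded": count("PagesDownloaded"),
        "excluded_pages": count("ExcludedPages"),
    }


result = read_statistics(ET.fromstring(STATISTICS))
print(result["duration"])
```

For the example file the duration works out to 33 minutes and 25 seconds.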

   <Settings>
      <Diversity>0</Diversity>
      <Authority>0</Authority>
      <Depth>2</Depth>
      <ExclusionsList>
         <ExcludeSubstring>download.cnet.com</ExcludeSubstring>
         <ExcludeSubstring>download.com</ExcludeSubstring>
         <ExcludeSubstring>download.net</ExcludeSubstring>
         <ExcludeSubstring>netscape.com</ExcludeSubstring>
         <ExcludeSubstring>www.google.</ExcludeSubstring>
         <ExcludeSubstring>microsoft.com</ExcludeSubstring>
      </ExclusionsList>
      <CoLinkAnalysisMode mode="page" />
      <CrawlIterations>1</CrawlIterations>
   </Settings>

This block of settings concludes the XML file. The diversity defines the minimum number of domain categories this network must contain; it is always 0. The authority gives the minimum number of inlinks a node must receive to be in the network; it is also always 0. Depth (0, 1, 2 or 3), CoLinkAnalysisMode (page or site) and CrawlIterations (0, 1, 2 or 3) are the settings from the crawl start view. The ExclusionsList gives the list of exclusion strings, i.e. all URLs containing those strings are not considered.
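The exclusion rule described above (a URL is dropped when it contains any of the listed strings) can be illustrated with a plain substring test. The sketch below is an assumption about how the matching behaves; the crawler's actual rules, for example regarding case sensitivity, are not documented here. The names read_settings and is_excluded are hypothetical.

```python
import xml.etree.ElementTree as ET

# Settings fragment taken from the example file.
SETTINGS = """\
<Settings>
   <Diversity>0</Diversity>
   <Authority>0</Authority>
   <Depth>2</Depth>
   <ExclusionsList>
      <ExcludeSubstring>download.cnet.com</ExcludeSubstring>
      <ExcludeSubstring>download.com</ExcludeSubstring>
      <ExcludeSubstring>download.net</ExcludeSubstring>
      <ExcludeSubstring>netscape.com</ExcludeSubstring>
      <ExcludeSubstring>www.google.</ExcludeSubstring>
      <ExcludeSubstring>microsoft.com</ExcludeSubstring>
   </ExclusionsList>
   <CoLinkAnalysisMode mode="page" />
   <CrawlIterations>1</CrawlIterations>
</Settings>
"""


def read_settings(settings):
    """Collect the crawl settings into a plain dictionary."""
    return {
        "diversity": int(settings.findtext("Diversity")),
        "authority": int(settings.findtext("Authority")),
        "depth": int(settings.findtext("Depth")),
        "colink_mode": settings.find("CoLinkAnalysisMode").get("mode"),
        "iterations": int(settings.findtext("CrawlIterations")),
        "exclusions": [e.text.strip() for e in settings.iter("ExcludeSubstring")],
    }


def is_excluded(url, exclusions):
    """Assumed rule: a URL is excluded if it contains any exclusion string."""
    return any(substring in url for substring in exclusions)


settings = read_settings(ET.fromstring(SETTINGS))
print(settings)
print(is_excluded("http://www.google.nl/search", settings["exclusions"]))
print(is_excluded("http://www.bof.nl", settings["exclusions"]))
```

Note that an entry such as 'www.google.' ends in a dot, so it matches Google hosts under any top-level domain.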

-- KoenMartens - 24 Jan 2005

Topic revision: 23 Sep 2010, CatrinSmith
