IssueCrawlerToDo < Issuecrawler


Phases of work on the Issuecrawler

done - Set up the server. Install FreeBSD with one issuecrawler jail with HTTP and SSH support. See IssueCrawlerServer
done - Dump the current system on the new server in the issuecrawler jail.
Patch bugs in the current code
Think of nice additions and improvements
done - Set up a second jail or server for development (and set one up for prioritized users)
started - Make a community
Make it open source
License it under GPL

How to go on with the project?

In short: review the bug list, think of new features, and work out how to combine the two into a non-Frankensteinian working package. Will we build a second generation (a complete rewrite), or will we try to mold this crawler into one package while making improvements and additions? The latter option is preferred.

future plans / ideas
- get out http://bugzilla.issuecrawler.net bugs, prioritize open bugs
- get out websphinx
- rewrite visualisation module + visualise from database (skipping 1, maybe 2?, XML steps) -> find out (frontend) dependencies on XML files
- define statistical measures like brokers, paths, betweenness (who are the best information brokers), overall clustering coefficient. See also IssueCrawlerSVGMapGraphTheoryEnhancements
- we are probably using degree centrality for the clustermaps at the moment; we need to know this for sure and study other possible centrality measures
- comparison of networks over time (Andrei has been working on this by means of scatter plots since the end of July):
  - Basic Comparison of Networks
    The user should be able to select a series of networks (probably on the same issue over time, but other scenarios are possible). The program should then return, for each pair of networks: who has joined; who has left; who has gained or lost links (and how many); who has become more or less knowledgeable (and by how much).
  - Visual Comparison of Networks
    The user should be able to select a series of networks, and be able to (in some way) view all of them at once / flick through them.
- textual analysis (what kind)
  - how is the network currently formatting its issue
    - is there a dogma, slogan?
    - what are the premises / argument objects: what is this network's argument and what is it based on
  - crawl / scrape PDFs
  - answer questions like: do these sites refer to the same article, author or alert
- enter a URL and find all links from this URL on some host or site
- make a snowballing option (either keep adding links and do a colink analysis at the end, or do a snowball with the results of the colink analysis of each iteration)
- advanced crawler statistics
- think about other visualisation options, e.g. geological depictions
- what about issueatlas.net?
- IssueCrawlerNetworksOverTime
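
The "Basic Comparison of Networks" idea in the list above could be sketched roughly as follows. This is only an illustration, not existing Issuecrawler code: it assumes a network is available as a set of (source, target) link pairs, and the names `degree` and `compare_networks` are hypothetical. "Knowledgeability" is left out because the page does not define how it is measured.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (source host, target host); an assumed representation


def degree(edges: Set[Edge]) -> Dict[str, int]:
    """Count links per node (in-links plus out-links)."""
    counts: Dict[str, int] = {}
    for src, dst in edges:
        counts[src] = counts.get(src, 0) + 1
        counts[dst] = counts.get(dst, 0) + 1
    return counts


def compare_networks(old: Set[Edge], new: Set[Edge]) -> dict:
    """Report who joined, who left, and whose link count changed."""
    old_deg, new_deg = degree(old), degree(new)
    old_nodes, new_nodes = set(old_deg), set(new_deg)
    # Only nodes present in both networks can gain or lose links.
    link_change = {
        node: new_deg[node] - old_deg[node]
        for node in old_nodes & new_nodes
        if new_deg[node] != old_deg[node]
    }
    return {
        "joined": sorted(new_nodes - old_nodes),
        "left": sorted(old_nodes - new_nodes),
        "link_change": link_change,
    }
```

For a series of networks, the same function could be applied to each consecutive pair. Nodes without any links do not appear in this representation, so a real implementation would probably need an explicit node list as well.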

issuecrawler / social networking community

with wiki about crawler internals (Crawlerspecs)
with wiki about methodology and case examples
explain methods used for retrieval and analysis
explain svg map options
explain what kind of investigations can be done
with forum to discuss maps

General ideas / suggestions / remarks

Politics in structure:
- Show whether the organisation's site is structured like the organisation itself
- A possible add-on could be a program that tries to find regularities in the structure of the sites (or networks) with a particular domain type (e.g. gov, com, org). This could be done by trying to learn a grammar, which describes the domain. I cannot guarantee at all that this is feasible.
Politics of exclusion: Design a smart 'dumb' crawler that crawls a site's excluded areas (excluded by robots.txt or the robots META tag) and shows what is being excluded from search robots. See http://tools.issuecrawler.net/robots/
- find out legal issues!
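
As a first step towards the "politics of exclusion" idea, one could simply list what a robots.txt file excludes, without crawling anything. The sketch below is an assumption about how that might look, not the code behind the tool linked above; the function name `excluded_paths` is hypothetical, it handles only `User-agent` and `Disallow` lines, and it ignores the robots META tag.

```python
from typing import Dict, List


def excluded_paths(robots_txt: str) -> Dict[str, List[str]]:
    """Map each user-agent to the paths its Disallow lines exclude."""
    excluded: Dict[str, List[str]] = {}
    agents: List[str] = []  # user-agents of the record being read
    in_rules = False        # True once the record has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent after rules starts a new record
                agents, in_rules = [], False
            agents.append(value)
            excluded.setdefault(value, [])
        else:
            in_rules = True
            if field == "disallow" and value:  # empty Disallow excludes nothing
                for agent in agents:
                    excluded[agent].append(value)
    return excluded
```

The text of a robots.txt file would have to be fetched separately and passed in as a string; whether the excluded paths may then actually be visited is exactly the legal question raised above.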
Do sites use a reference to the public (http://www.ensmp.fr/~latour/expositions/002_parliament.html 4 ZKM exhibition)?
- This is based on theories of 'no public' and asks which issues use the public as a resource. A possible way to do this would be contextual analysis (KWIC, KeyWord In Context). A possible question to answer this way: how do particular organisations use X across different issue spaces? This is mainly dictionary work: comparing terms across issue spaces.
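
A minimal sketch of the KWIC (KeyWord In Context) idea mentioned above, assuming plain text has already been extracted from the pages. The function name `kwic` and the default window of three words are illustrative choices, not part of any existing tool.

```python
import re
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = re.findall(r"\w+", text)  # crude word tokenizer, punctuation dropped
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits
```

Running this for the same keyword (for example "public") over the texts of different issue spaces would give the side-by-side contexts that the dictionary-style comparison needs.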

-- ErikBorra - 24 Mar 2004 -- ErikBorra - 14 Feb 2004 -- ErikBorra - 10 May 2004

Look at the crawler code to improve its performance (both processor use and, mainly, memory use), as Java programmers tend to look at programs in an object-oriented way, ignoring many computational and efficiency considerations. Judging from what I have heard so far, I think there is a lot of room for optimization. At the least, there seem to be some bugs that need to be eliminated.
A clearer division of labour and more modularization, so that the results of one stage are not lost if another stage fails (enter them into our db?).
In general, Java is really not the ideal platform for this kind of application. Something like Perl is more straightforward for processing text and looking for patterns; it has less overhead and runs more smoothly, even with large data structures. The ability to interface with a lower-level language like C in crucial subroutines is a further plus. As the current software appears to be mainly a 'proof of concept', it might be worthwhile to look into these issues to come up with a 'production' implementation of the underlying algorithm.
David Heath: re: Java vs. some scripting language: agree this would be a profitable line to pursue. I don't think there is much more mileage in optimising the current Java crawler. The websphinx library on which it is based was never designed for large-scale performance. Better to split out crawling, processing, etc. into separate Unix processes which communicate via an RDBMS.
As there is or will be a need for distributed crawling (i.e. multiple hosts that run the crawler), a more central role for the database may be envisaged.

-- KoenMartens - 14 Feb 2004
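
The suggestion above, that stages should hand results to each other through the database so that a failure in one stage does not lose the work of another, might look roughly like this. It is a sketch under assumptions: SQLite stands in for the real RDBMS so the example is self-contained, and the table and function names (`pages`, `links`, `store_page`, `analyse_pending`) are hypothetical.

```python
import sqlite3


def init(db: sqlite3.Connection) -> None:
    """Create the tables the two stages share."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, html TEXT, analysed INTEGER DEFAULT 0)"
    )
    db.execute("CREATE TABLE IF NOT EXISTS links (source TEXT, target TEXT)")


def store_page(db: sqlite3.Connection, url: str, html: str) -> None:
    """Crawl stage: persist each fetched page as soon as it arrives."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )


def analyse_pending(db: sqlite3.Connection, extract_links) -> int:
    """Analysis stage: process only pages not yet analysed; return how many succeeded."""
    rows = db.execute("SELECT url, html FROM pages WHERE analysed = 0").fetchall()
    done = 0
    for url, html in rows:
        try:
            targets = extract_links(html)
        except Exception:
            continue  # the page stays pending; the crawl result is not lost
        with db:
            db.executemany(
                "INSERT INTO links VALUES (?, ?)", [(url, t) for t in targets]
            )
            db.execute("UPDATE pages SET analysed = 1 WHERE url = ?", (url,))
        done += 1
    return done
```

Because each stage only reads and writes tables, the crawl and analysis functions could run as separate processes, or on separate hosts, against the same database.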

REMARKS from Recognos

A data abstraction layer (DAL) should be implemented as a foundation for database access. There should be no direct access to the database (because direct access is an obstacle in the way of scalability) and, as much as possible, most database queries should be performed by means of stored procedures (or functions, as defined by PostgreSQL). There is a lot of literature on the advantage of using stored procedures versus text queries, but the main advantages are: speed, encapsulation, security, well-defined interfaces.
David Heath: don't agree that stored procedures are the only good way to go about this, although they are arguably one good way. Stored procedures have their own disadvantages as well, e.g. it may be easier to implement your business logic in a language you're familiar with rather than learning plpgsql. "Good practice" would suggest using a DAL, but you could implement it perfectly well in the PHP program if you wanted.
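
To make the DAL discussion above concrete, here is a small sketch of a data access layer written in application code rather than as stored procedures. It is not the Issuecrawler schema: the class name `CrawlStore`, the `links` table and its methods are hypothetical, and SQLite is used instead of PostgreSQL only to keep the example self-contained.

```python
import sqlite3
from typing import List


class CrawlStore:
    """Hypothetical data access layer: callers never write SQL themselves."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(source TEXT, target TEXT, UNIQUE (source, target))"
        )

    def add_link(self, source: str, target: str) -> None:
        """Record a link once; duplicates are silently ignored."""
        with self._db:
            self._db.execute(
                "INSERT OR IGNORE INTO links VALUES (?, ?)", (source, target)
            )

    def inlinks(self, target: str) -> List[str]:
        """Return the sources that link to the given target, sorted."""
        rows = self._db.execute(
            "SELECT source FROM links WHERE target = ? ORDER BY source", (target,)
        )
        return [row[0] for row in rows]
```

The rest of the application would only call `add_link` and `inlinks`, so the queries behind them could later be moved into stored procedures, or the database swapped, without touching the callers.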
The current implementation is by no means fit to serve the initial purpose: to provide users with a clean and easy way to crawl networks and serve them with accurate and clear maps. The following is abstracted from IssueCrawlerCurrentArchitecture
1. While the crawler itself was written in Java (a robust technology with proper support for large amounts of data), the entire website was implemented in PHP, which, although a good technology for small and medium-sized websites, cannot handle the huge amounts of data provided by the crawler.
  David Heath: disagree with the comment about PHP; the amount of data which can be handled depends on the algorithms used, not the programming language. (Erik and Koen agree, 01-04-04)
2. Once the crawler is done and data is entered into the database, the PHP code takes over, which is the less fortunate part of the system. This is because the crawler produces a considerable amount of data, which is XML formatted. While an XML result from the parser would usually amount to about 7-10 megabytes, this data takes much more memory when unpacked (sometimes up to 100-200 megabytes). Since PHP was not built to cope with such amounts of data, it will very often crash and render the system unusable.
  David Heath: this problem has been fixed by rewriting the publishing process to use a SAX-based XML parser.
3. The application is not scalable in any way. Since PHP was never designed to run on distributed systems, support for scaling a PHP website is virtually non-existent. Considering the massive amount of data that the system handles, and assuming that the number of users will increase over time, the lack of scalability is an issue.
  David Heath: disagree, PHP was definitely designed for scalability and running on distributed systems. PHP follows the "shared nothing" approach to building scalable web applications, whereas Java follows a shared memory approach. In some ways, the "shared nothing" approach reduces programming complexity without losing scalability. In the shared nothing approach, all data that must persist across requests is stored in the database, user sessions or the filesystem, whereas in Java's shared memory approach, data can be passed between concurrently executing threads and can persist across user requests. (Erik and Koen agree, 01-04-04)
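
The memory problem in item 2 and its SAX-based fix can be illustrated with a streaming parser: instead of unpacking the whole XML result into memory, the handler sees one element at a time. The actual fix was made in the PHP publishing process; the Python sketch below only shows the technique, and the element name `link` is an assumption about the crawler's XML, not its real format.

```python
import io
import xml.sax


class LinkCounter(xml.sax.ContentHandler):
    """Counts <link> elements as they stream past; nothing else is kept."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "link":  # hypothetical element name
            self.count += 1


def count_links(xml_stream) -> int:
    """Parse a file-like object (or filename) incrementally and count links."""
    handler = LinkCounter()
    xml.sax.parse(xml_stream, handler)
    return handler.count
```

For example, `count_links(io.BytesIO(b"<network><link/><link/></network>"))` walks the document without building a tree, so memory use stays roughly constant regardless of the size of the crawl result.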

I	Attachment	Action	Size	Date	Who	Comment
xls	Domains_and_Subdomains.xls	manage	52 K	24 Mar 2004 - 14:37	ErikBorra

Topic revision: r24 - 13 Jun 2005, ErikBorra
