The Digital Methods Initiative is a contribution to doing research into the "natively digital". Consider, for example, the hyperlink, the thread and the tag. Each may 'remediate' older media forms (reference, telephone chain, book index), and genealogical histories remain useful (Bolter/Grusin, 1999; Elsaesser, 2005; Kittler, 1995). At the same time new media environments - and the software-makers - have implemented these concepts, algorithmically, in ways that may resist familiar thinking as well as methods (Manovich, 2005; Fuller, 2007). In other words, the effort is not simply to import well-known methods - be they from humanities, social science or computing. Rather, the focus is on how methods may change, however slightly or wholesale, owing to the technical specificities of new media.
The initiative is twofold. First, we wish to interrogate what scholars have called "virtual methods," ascertaining the extent to which the new methods can stake claim to taking into account the differences that new media make (Hine, 2005). Second, we desire to create a platform to display the tools and methods to perform research that can, also, take advantage of "web epistemology". The web may have distinctive ways of recommending information (Rogers, 2004; Sunstein, 2006). Which digital methods innovate with, and also critically display, the recommender culture that is at the heart of new media information environments?
Amsterdam-based new media scholars have been developing methods, techniques and tools since 1999, starting with the Net Locator and, later, the Issue Crawler, which focuses on hyperlink analysis (Govcom.org, 1999, 2001). Since then a set of allied tools and independent modules have been made to extend the research into the blogosphere, online newssphere, discussion lists and forums, folksonomies as well as search engine behavior. These tools include scripts to scrape web, blog, news, image and social bookmarking search engines, as well as simple analytical machines that output data sets as well as graphical visualizations.
How Different are Digital Methods?
The Web archiving specialist Niels Brügger has written: "[U]nlike other well-known media, the Internet does not simply exist in a form suited to being archived, but rather is first formed as an object of study in the archiving, and it is formed differently depending on who does the archiving, when, and for what purpose" (Brügger, 2005). That the object of study is co-constructed in the means by which it is 'tamed' or 'corralled' by method and technique is a classic point from the sociology and philosophy of science and elsewhere. For example, when one studies the Internet archive, what stands out is not so much that the Internet is archived, but how it is. Unlike with a Web search engine, at archive.org's Wayback Machine one queries a URL, not a keyword. Moreover, one cannot 'surf' or search the Web as it was at some given date. In other words, a series of decisions was taken on how to build the archive, and those decisions constrain the type of research one can perform. One can study the evolution of a single site (or multiple sites) over time by collecting snapshots from the dates that a page was indexed. One also can go back in time to a Website for evidentiary purposes. Is that how one may wish to study the history of the Web? What kinds of research questions may be asked fruitfully, and not asked, given the constraints? Digital methods perhaps begin with coming to grips with given forms of objects under study.
Brügger seems to go a step further, however, in arguing that methodological standardisation is unlikely if not impossible. To Brügger the form assumed by the object of study depends on its creator, in this case the particular archivist. Does such a thought imply that digital method, if method remains the right word, is more of an art than a science, where the tacit knowledge and skill are paramount? Can there be no instruments, only tools? Data are less gathered, than they are sculpted, or 'scraped', as the term is known. Perhaps 'data-mining' is appropriate in the sense that there is always some waste, or slurry, that runs off. Digital methods may have to have more patience with the lack of exhaustiveness in data sets than would be the norm in other sciences.
FAQ: "I study virtual methods. How would I relate to what you are doing with 'digital methods'?"
The origins of 'virtual methods' may lie in the U.K. Virtual Society? research program of the late 1990s (Woolgar, 2002). In particular, the question mark in 'virtual society?' was emphasized. The research challenged the then dominant division between the real and the virtual realms, empirically demonstrating instead the embeddedness of the Internet in society. The desire to innovate methodologically saw perhaps its greatest challenge in ethnography, with the desire to put forward and defend a new strain of scholarship, 'virtual ethnography', that combined the terrains of 'the ground' with the online (Hine, 2000; see also Slater/Miller, 2000). Special skills and methods were developed to gain entry to and study communities now rooted both in the offline and the online. How should the introductory email message be written, and to whom? How should the online survey be designed? Questions revolved around how to adapt standard methods from social science to the online environment.
If one were to contrast the challenges of virtual methods with those of digital methods, one could begin by thinking about the embeddedness of society in the Internet. Thus the important question mark from the earlier research program shifts: virtual? society. The methods and skills developed here strive to put society on display. How can the Internet be made to show what's happening in society?
In this respect, the Dutch newspaper, the NRC Handelsblad recently published an in-house study of home-grown right-wing Websites over the past few years (NRC Handelsblad, 2007). The remarkable line in the article, which seemed unusual for those accustomed to reading at least implicit distinctions between 'the real' and 'the virtual', read: "The Internet reflects the increasing hardening [of the right-wing] in the Netherlands."* Thus here the Web becomes the site to study social trends. The 'digital methods' question becomes, how to collect and analyze the data to distill such trends from the Web?
* "Internet lijkt (...) de weerslag van de maatschappelijke verharding in Nederland." ("The Internet seems (...) the reflection of the societal hardening in the Netherlands.") NRC Handelsblad, "Opkomst en ondergang van extreemrechtse sites," 25 August 2007, http://www.nrc.nl/binnenland/article757820.ece/Opkomst_en_ondergang_van_extreemrechtse_sites
Digital Methods by Theme
On the Web, sources compete to offer the user information. The results of this competition are seen, for instance, in the 'drama' surrounding search engine returns. The focus here is on the prominence of particular sources in different spheres (e.g. the blogosphere, the news sphere, images), according to different devices (e.g. Google, Technorati, del.icio.us). For example, how far are climate change skeptics from the top of the news? For comparison's sake, how far are they from the top of search engine returns? The answer to this and similar 'cross-spherical' inquiries goes some way towards answering the question of the quality of old versus new media.
- Rogers, Richard. Information Politics on the Web. Cambridge, MA: MIT Press (2004).
- Van Couvering, Elizabeth. "New Media? The Political Economy of Internet Search Engines". Presented at the Annual Conference of the International Association of Media & Communications Researchers, Porto Alegre, Brazil, July 25-30, 2004.
- Van Couvering, Elizabeth. "Is Relevance Relevant? Market, Science, and War: Discourses of Search Engine Quality". Journal of Computer Mediated Communication Vol. 12, Issue 3 (2007). http://jcmc.indiana.edu/vol12/issue3/vancouvering.html
The Web has an ambivalent relationship with time. At one extreme, time is flattened as older, outdated content stands side by side with the new. This is perhaps most apparent in the way some content, whether an interesting statistic or a humorous website, tends to 'resurface' periodically, experienced as new all over again. At the other extreme, there is a drive to be as up-to-date as possible, as epitomized by blogs and RSS feeds. Content producers and consumers may increasingly be said to have a perceived freshness fetish. One aim of the research has been to make temporal relationships on the Web visible.
Social networking sites such as MySpace, Facebook and, in the Netherlands, Hyves have stirred anxiety about the public display of the informal. Researchers of social software have concentrated on what a non-member in particular - a mother, a prospective boss or a teacher - can see about a person. Here the research concerns how social software has reacted to public concerns, at once allowing and cleansing. Particular attention is paid to what may still be 'scraped' and analysed after media attention has faded.
"There is growing demand for the ability to determine the geographical locations of individual Internet users, in order to enforce the laws of a particular jurisdiction, target advertising, or ensure that a website pops up in the right language. These two separate challenges have spawned the development of clever tricks to obscure the physical location of data, and to determine the physical location of users—neither of which would be needed if the Internet truly meant the end of the tyranny of geography."
Is it big in Japan? Normally such a question would have been asked in reference to fans, based on local sales data. Nowadays the quantity of fans could conceivably be measured differently. For any given song or YouTube video, one could strive to geo-locate its fan base. The recent project to ascertain the favorite brands of the some 4 million Hyves users in the Netherlands revealed the locality of the locale. Japanese brands were hardly present; Dutch and Western brands prevailed. See graphics.
Referred to as the 'global village' (a term McLuhan used to describe electronic mass media), the Web has been advertised as a space for globalization. Internet browsers (with telling names such as 'Internet Explorer' or 'Safari') offer part of the equipment one needs to navigate the unknown. With the revenge of geography, there is renewed interest in the geo-location of various online objects: the user, the network, the issue, the data. The question "Where is the user based?" could be answered by looking at the user's IP address, the registration of his or her homepage, or an online profile on a social networking site. The question "Where is the issue based?" requires a different approach. By scraping various spheres (the news, the web, the blogosphere), one may find out where the issue resonates most. This approach was applied in the Issue Animals project, which reveals which animals endangered by climate change are most often referred to (in both text and image) on the Web, in the news and in the blogosphere. Finding out where the network is calls for co-link analysis of the various actors around a certain topic.
In 1965, Ted Nelson proposed a file structure for "the complex, the changing and the indeterminate". The hyperlink was not only an elegant solution to the problem of complex organization, he argued, but would ultimately benefit creativity and promote a deeper understanding of the fluidity of human knowledge. High expectations have always accompanied the link. More recently, with the work of search engines, the link has been revealed as an indicator of reputation: researchers must now account for the reorganization of the link itself as a symbolic act with political and economic consequences.
- boyd, danah. "The biases of links," apophenia. http://www.zephoria.org/thoughts/archives/2005/08/07/the_biases_of_links.html
- Bush, Vannevar. "As we may think," The Atlantic Monthly, Vol. 176, No. 1 (1945): 101-108.
- Elmer, Greg. "Hypertext on the Web: The Beginnings and Ends of Web Path-ology." Space and Culture, 10, 1-14
- Landow, George. Hyper/Text/Theory, Baltimore, MD: Johns Hopkins University Press (1994). http://cyberartsweb.org/cpace/ht/jhup/contents.html
- Nelson, Theodor. "Complex information processing: a file structure for the complex, the changing and the indeterminate". ACM/CSC-ER Proceedings of the 1965 20th national conference, New York: ACM Press (1965), 84-100.
- Nelson, Theodor. "Xanalogical structure, needed now more than ever: parallel documents, deep links to content, deep versioning, and deep re-use." ACM Computing Surveys (CSUR). Vol. 31, Issue 4 (December 1999) http://www.cs.brown.edu/memex/ACM_HypertextTestbed/papers/60.html
- Shirky, Clay. "Ontology is Overrated: Categories, Links, and Tags" Clay Shirky's Writings about the Internet. http://www.shirky.com/writings/ontology_overrated.html
An important aspect of researching the natively digital is the gathering of data. On the web, this data is usually presented in a web language such as HTML. Understanding the underlying structure of web content makes it possible to gather the data for research and analysis. This gathering of data sets from the web is often referred to as web scraping or information retrieval. Scraping can reveal hidden data (patterns) and/or reattribute data so that it can be viewed and used in another context. Much of this data is hard or even impossible to retrieve outside the web, because it is natively digital - specific to the medium. Understanding the medium is thus an important aspect of digital methods. Skills related to the medium include: understanding the structure of a Uniform Resource Locator (URL), the underlying object structure of HTML called the Document Object Model (DOM), and the distribution and management of the Internet Protocol (IP) and its related infrastructure. An important part of digital methods is applying these skills in building tools and scripts that resolve those aspects of the web that lie behind what is presented on the screen.
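As a minimal illustration of two of these medium-specific skills, the sketch below (using only Python's standard library; the example URL and HTML fragment are invented for the purpose) dissects a URL into its structural parts and walks an HTML fragment's DOM-like tag structure to collect hyperlink targets:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# Dissect a (hypothetical) URL into its structural parts.
url = "http://example.org/search?q=climate&page=2"
parts = urlparse(url)
print(parts.netloc)           # host: example.org
print(parse_qs(parts.query))  # query: {'q': ['climate'], 'page': ['2']}

# Walk the tag structure of an HTML fragment,
# collecting every hyperlink target (href attribute).
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

html = '<p><a href="http://a.org">A</a> and <a href="http://b.org">B</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)        # ['http://a.org', 'http://b.org']
```

The same dissection logic scales up: a scraper is in essence a script that requests pages, parses their structure in this manner, and stores what it finds.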
What to scrape
This reliance on specific skills and tools does, however, have its drawbacks, as mentioned earlier. In scraping the web, specific decisions have to be made about what will be scraped, and these ultimately affect the outcome and constrain how the data can be used in further research. One such decision concerns definitions. To scrape a website or a blog, one must first determine what it is. What characterizes a blog or a website? Must it have an RSS feed or comments? Can a blog also be a website, or vice versa? And if so, how could one distinguish between the two? On another level there is the question of the data itself. In researching the natively digital, scraping demographic data might not provide the desired information. What functionality a device has built into its application, how this is presented to users, and how users actually use this functionality may say more about the digital than the age of the users or the size of the company. Does locality really provide new insight when both users and online companies are no longer bound to it? Does the IP address provide better insight, or do we need to scrape and analyse content to see which locality is talked about?
How to scrape
Scraping itself must, however, also be scrutinized, as the issue is not only what to scrape but also how. Scientists seeking exhaustive data sets are confronted by device-related issues on the web. For example, engines and other aggregators often offer APIs (Application Programming Interfaces) that promise to serve the data you seek. One issue is that the results returned are limited per day or time frame as well as per quantity. Flickr, for instance, provides an API to retrieve machine tags (currently a new development in the tagosphere) from its site, but limits the output to only sixteen results. The research question may thus differ from what is actually offered, and many projects really require the construction of a personalized scraper developed specifically for the research in question.
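One common way of working within such per-request caps is to page through the results. The sketch below is generic rather than tied to any particular API: `fetch_page` is a stand-in for a real API call, and the sixteen-item cap mirrors the kind of limit described above.

```python
# Sketch of working around per-request result limits by paging.
# fetch_page is a stand-in for a real API call that returns at
# most `limit` results per request.

def fetch_all(fetch_page, limit=16):
    """Repeatedly request pages until a short (final) page arrives."""
    results, page = [], 1
    while True:
        batch = fetch_page(page=page, per_page=limit)
        results.extend(batch)
        if len(batch) < limit:   # a short page signals the end
            break
        page += 1
    return results

# Simulated API with 40 items and a 16-per-page cap:
ITEMS = [f"tag-{i}" for i in range(40)]

def fake_api(page, per_page):
    start = (page - 1) * per_page
    return ITEMS[start:start + per_page]

print(len(fetch_all(fake_api)))  # 40
```

Note that paging only helps when the API exposes a page parameter at all; where it does not, the per-query cap is a hard limit on exhaustiveness.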
Scrape and Scrapability
Building one's own scraper, in combination with digital skills, can circumvent many of these device-related issues. An example is the functionality that lets users of the Hyves social network choose and create their preferred brand. The brands displayed are those with the most users associated with them. Looking at the URL query reveals the existence of a brand_id. Adding a new brand and looking at the newly created id gives the complete range of ids in the system. Using this information it is possible to scrape the site for all brands, displaying a huge number of brands previously hidden or concealed. The results of this project can be viewed here.
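The enumeration step can be sketched as follows. The URL pattern and parameter name below are hypothetical stand-ins for whatever the site's query string actually looks like; the point is only that, once the id range is known, one URL per id can be generated and fetched in turn.

```python
# Sketch: once a URL query parameter such as brand_id is spotted,
# the full range of ids can be enumerated. The base URL here is
# invented; a real site's parameter names may differ.

def brand_urls(base, first_id, last_id):
    """Generate one URL per id in the discovered range."""
    return [f"{base}?brand_id={i}" for i in range(first_id, last_id + 1)]

urls = brand_urls("http://example-network.nl/brand", 1, 5)
print(urls[0])    # http://example-network.nl/brand?brand_id=1
print(len(urls))  # 5
```

In practice each generated URL would then be requested and parsed, surfacing items (here, brands) that the interface itself never lists.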
Turning back to the example of the Flickr machine tag, one could decide to scrape a certain part of the site by using its DOM structure to get only the data from an exact location in the different pages. There are several options from which one could choose to build a scraper. Programming languages such as PHP are often used to build scrapers. There are also applications available on the web that enable non-programmers to build their own scrapers. Because they focus on what is referred to as the mashup of data, they are often called MashUps. Within DMI this way of scraping is explained in more depth in the project called WeScrape.
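Extracting data from one exact location in a page's DOM can be sketched as below, again using only Python's standard library. The class name `machine-tag` and the page fragment are invented for illustration; a real page would use whatever markup the site happens to emit.

```python
from html.parser import HTMLParser

# Sketch: pull data from one exact spot in a page's DOM,
# here every <span class="machine-tag"> (an invented class name).

class TagExtractor(HTMLParser):
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.inside = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", self.wanted) in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found.append(data)

page = ('<div><span class="machine-tag">geo:lat=52.37</span>'
        '<span class="other">skip</span>'
        '<span class="machine-tag">geo:lon=4.89</span></div>')
extractor = TagExtractor("machine-tag")
extractor.feed(page)
print(extractor.found)  # ['geo:lat=52.37', 'geo:lon=4.89']
```

Because the extractor targets a precise structural location rather than the API, it is not bound by the API's sixteen-result cap, though it does become dependent on the page's markup staying stable.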
Building scrapers can thus make data sets much more exhaustive. Not all data, however, is accessible to scrapers. Google, for instance, limits its search results and does not supply all the results it has found. There is thus a maximum to exhaustiveness on the web.
Another example of such limitations is Hyves. Because the site uses an AJAX call to display all the connections/friends on a profile, and shifts the order in which they are presented, some data will keep getting lost. Running the scraping process at night solves this problem to some extent. This shows that not only the what and the how, but also the when is relevant when scraping.
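One tactic against such shifting output is to repeat the scraping run and merge the results until nothing new appears. The sketch below simulates a page that returns a different random subset of items on each load; the numbers and names are invented.

```python
import random

# Sketch: when a page shows a shifting subset of items on each
# load, repeated scraping runs can be merged until the set of
# collected items stops growing.

ALL_FRIENDS = {f"friend-{i}" for i in range(30)}

def load_page():
    """Simulate an AJAX response returning a random subset."""
    return random.sample(sorted(ALL_FRIENDS), 20)

def scrape_until_stable(max_runs=50):
    seen, unchanged = set(), 0
    for _ in range(max_runs):
        before = len(seen)
        seen.update(load_page())
        unchanged = unchanged + 1 if len(seen) == before else 0
        if unchanged >= 5:       # five runs with nothing new: stop
            break
    return seen

collected = scrape_until_stable()
print(len(collected))
```

Even this offers no guarantee of completeness; it merely raises the odds of coverage, which is in keeping with the earlier point about digital methods tolerating non-exhaustive data sets.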
The web is a dynamic place, in a constant state of flux, and this poses important restrictions on the scrapability of the web. Webmasters constantly update their sites, incorporate new technologies and coding languages, and restrict content through legal use agreements. Scrapers need to be constantly adjusted to these external factors, while researchers must remain critically aware of the scraper's own choices and restrictions, such as what the input, the output or the query will be, and why.
Although scraping relies heavily on technique and skill, keeping it closely linked to digital methods and theory means that its limits and restrictions will continue to generate new and different research questions, insights and findings about the natively digital.
The projects related to this digital methods theme are meant to be examples of the possibilities and restrictions related to building, maintaining and using scrapers.
- Calishain, Tara & Kevin Hemenway. Spidering Hacks. Sebastopol, CA: O'Reilly & Associates (2004).
See DMI Tools for an overview of the tools and utilities used in the Digital Methods Initiative.