Wiki Analytics Workshop

This workshop report is a stub; please help improve it if you participated in the workshop.

Introduction by Esther Weltevrede

The Digital Methods Initiative (DMI) is a collaboration between Media Studies at the University of Amsterdam and Govcom.org, an organization that develops political tools for the web. DMI aims to develop novel tools and methods for doing research with the web. A large number of tools are built on top of devices such as Wikipedia. Erik Borra will present the DMI Wikipedia tools later today. The core research group consists of about nine people, four of whom are here today.

Why this day? The workshop is in anticipation of the CPOV Amsterdam conference, in particular the Wiki Analytics session on Saturday, March 27. With so many Wikipedia researchers in town, we seized the opportunity to send out an invitation to present and discuss methods, tools and data among fellow Wikipedia researchers. We specifically focus on analytics, as opposed to other academic approaches to Wikipedia. We are very happy to have six presentations dealing with this topic here today.

Wikimedia in figures - Erik Zachte (Wikimedia)

What can we use Wikistats for? Wikistats covers all wikis hosted by the Wikimedia Foundation (754 MediaWiki installations), the largest being the Wikipedias. The data is published in exportable, computable formats, e.g. CSV, alongside graphical data and analysis, some of it contributed. A one-stop place to see metrics.

Where does the data come from?
  1. Webalizer, an open-source tool. But by 2004 it had burned out: too much traffic to process.
  2. The dark days (2004-2007), during which Wikimedia used Alexa. "Our daily dose of Alexa": Wikimedia was addicted to stats (like all of us), but ranking alone was unsatisfactory.
  3. In 2008 comScore donated free access. (Question from the floor: can we access the comScore data? Is it being made public?) The data is used in the reports Wikimedia publishes itself and in its report cards.
  4. In 2008 the foundation built its own software to track statistics from the traffic metadata coming from its Squid caching servers. The Squid logs are reprocessed by the foundation and by volunteers. For example, we can see spikes in requests for Sarah Palin and Michael Jackson. Wikipedia as a tool for the real-time web? Another interesting finding is that the Russian Wikipedia is growing fast.
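
The traffic statistics derived from the Squid logs are published as hourly page count files in a simple space-separated format (project, title, request count, bytes transferred). A minimal sketch of finding the most-requested pages, with made-up numbers:

```python
from collections import Counter

# One hour of traffic in the space-separated page count format:
# project page_title request_count bytes_transferred. Numbers are made up.
sample = """\
en Sarah_Palin 9843 120000000
en Michael_Jackson 15021 340000000
en Main_Page 45000 900000000
ru Zaglavnaya_stranitsa 8000 50000000
"""

def top_pages(lines, project="en", n=3):
    """Return the n most-requested titles for one project."""
    counts = Counter()
    for line in lines.splitlines():
        proj, title, requests, _bytes = line.split()
        if proj == project:
            counts[title] += int(requests)
    return counts.most_common(n)

print(top_pages(sample))
```

Spike detection of the Sarah Palin kind is then a matter of comparing these hourly counts across consecutive files.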

The first statistics focused on content. This was very useful in the beginning, but there was a corrupting factor: Wikipedians competing with each other to achieve a higher ranking in the stats. The focus is now on the community: who are the readers and where are they from?

Q & A
Q: Why has the history of the English Wikipedia not been accessible since December?
A: It took several months to make a backup including every revision. The dump, in compressed format (280 GB), is now available.

Q/A: The maturity of a Wikipedia is defined by the number of bytes, a basic metric.

Q: Different usage practices: what counts as an edit?
A: Wikistats distinguishes seven different roles of editors; a new report came out two weeks ago.

Q: Quality debate. How do you determine quality?
A: Through a system of manually rating articles. On the English Wikipedia, half a million articles have been rated.

Victor Grishchenko - Accretion and page growth

The aim is to examine the generation of pages and see how they evolve. Page growth follows a log-normal distribution; the link degrees form a near-perfect power law. The majority of pages have just one or two outgoing links. As a page matures, it gathers in- and out-links (within Wikipedia). How do articles evolve over time? In 2001, was indegree larger than outdegree? It is a logarithmic scale, so linear intuition does not work here.
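
The link statistics described here can be recomputed from a page link graph. A minimal sketch over a toy edge list (data invented for illustration):

```python
from collections import Counter

# Toy link graph: (source_page, target_page) pairs, invented for illustration
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("E", "C"),
         ("F", "C"), ("G", "A"), ("H", "A"), ("I", "B")]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

def degree_distribution(degrees):
    """Map degree k -> number of pages with that degree."""
    return Counter(degrees.values())

# In this toy data, as in the talk, most pages have one or two outgoing links
print(degree_distribution(out_degree))
print(degree_distribution(in_degree))
```

Plotting these distributions on log-log axes is the usual way to eyeball the power law the talk refers to.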

Related article: Assessing the Value of Cooperation in Wikipedia by Dennis M. Wilkinson, Bernardo A. Huberman.

Current research: revision control, to make contribution exchange a simple process among peer Wikipedias. Exchanging content between Wikipedias by porting advanced source-code revision control methods to wikis.

Prof. Tanenbaum is working on a distributed wiki.

Mayo Fuster Morell - Researching Digital Commons Governance: Methodological design and lessons learned

I will present the methodological design and lessons from my Ph.D. research on the "Governance of online creation communities for the building of digital commons", which I am finishing at the European University Institute.

My methodological reflections concern case comparison; that is, research based on comparing Wikipedia to other cases.

My unit of analysis is the online creation community (OCC), which I define as "a form of collective action performed by individuals who communicate, interact and cooperate, in several forms and degrees of participation, mainly via an online platform of participation, with the common goal of knowledge-making and sharing, resulting in a digital common: an integrated resource of information and knowledge, (partly or totally) of collective property and freely accessible to third parties".

Other concepts used to describe this type of collective action are open collaboration and commons-based peer production.

Wikipedia is one example, but there are other communities built around diverse information pools, such as software packages (e.g. Debian, Plone, Drupal and the Facebook development team); guides or manuals (e.g. wikiHow or Wikitravel); and multimedia archives (e.g. video on YouTube or article libraries such as PLoS).

Within OCCs I analyse governance. While the literature on the governance of OCCs mainly focuses on the interaction among the participants, I decided to also take into consideration the role of the infrastructure provider. For example, the Wikimedia Foundation is the provider of Wikipedia, and Yahoo the provider of Flickr.

More concretely, I look to answer the question: how does the type of provider relate to the community generated, in terms of community size, type of collaboration and self-governance?

In terms of methodology:

Firstly, the empirical research was based on a multi-method approach. I combined a large-N statistical analysis of 50 cases with an in-depth case study comparison of four cases.

The combination of these two methodologies was very useful in terms of questioning and reinforcing the results of one method with those of the other.


For the large-N analysis, I adapted a political science research trend: web analysis of the democratic quality of political actors' websites.

Steps for the large-N:

1) Design of a sample of 50 cases

2) Elaborate a codebook.

3) Data collection

4) Calculate descriptive statistics and correlations between the variables.

1) Sampling:

I developed a snowball search, specifically by exhausting the search through these means:

i) Search in documentation and literature;

ii) Follow the hyper-links between the websites;

iii) Use general search engines (e.g. Google).

After a balanced sample of 50 cases was built, I designed a codebook.

2) Codebook

The codebook consisted of a set of 100 indicators related to the questions/variables I wanted to analyse.

For example, in terms of self-governance I looked at whether or not the policies are defined by the community.

Or, in regard to type of provider, I looked at the type of legal entity associated with the community, among other things.

3) Data collection

I filled in the codebook for each case by visiting and observing the website of the OCC.

The estimated time was 40 minutes to one hour per case.

Some of the main problems I encountered in the data collection derived from the plurality of the OCCs.

The same indicators were not valid for all the cases, so at some point in the coding process I had to review the indicators and define them "conceptually", not in specific forms.

The indicators for the participation mechanisms were particularly problematic because they vary greatly depending on each OCC, and particularly on the type of software used.

With a sample more homogeneous in terms of technological platform (such as a comparison between wikis), the data collection would be easier.

Some remarks:

I sent an e-mail informing the communities that I was doing the research, but I collected data generated in the daily life of the OCC, without requiring any intervention from the participants. That is, using digital threads.

Using digital threads, the data collection can be developed in two ways: through "human" identification or through a program, such as in the work of Viégas on Wikipedia.

Human identification is when a person checks whether an indicator is present on the website; program identification is when a program is designed to perform this check automatically.
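
A "program identification" pass can be sketched as a set of pattern checks run against a fetched page's source, where a human coder would eyeball the site instead. The indicator names and patterns below are invented for illustration; they are not taken from the actual codebook:

```python
import re

# Hypothetical indicators, each coded as present/absent via a pattern check.
# A real codebook would define these conceptually and per platform.
INDICATORS = {
    "copyleft_license": re.compile(r"creative\s*commons|GNU|GFDL", re.I),
    "community_policy": re.compile(r"policy|guidelines", re.I),
    "legal_entity":     re.compile(r"foundation|inc\.|gmbh", re.I),
}

def code_page(html):
    """Return {indicator: True/False} for one fetched page's source."""
    return {name: bool(rx.search(html)) for name, rx in INDICATORS.items()}

sample_html = "<p>Content is under a Creative Commons license. See our guidelines.</p>"
print(code_page(sample_html))
```

Run over 50 case websites, such a pass would produce the coded table that the statistical step below consumes.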

Initially I planned to build a program for the data collection and analysis of the indicators, which would serve to significantly reduce the time-consuming activity of web analysis. Furthermore, it would facilitate building a tool with which the actors themselves could analyze their websites.

However, programming is costly, and I could not develop the program for lack of funding in my Ph.D. program to cover the technical programming costs.

Furthermore, it would have required creating a group covering a plurality of skills and resources, requirements not easily met within the frame of a Ph.D. research project. To profit from this frontier, it is in the interest of research centers to build alliances and create the conditions for the technological support of such research.

4) Statistical analysis

SPSS. I wanted to use R, because I prioritise free software in my research; however, I couldn't find anyone at my university who could introduce me to the program.

I looked at descriptive statistics (such as the frequency or percentage of use of copyleft licences, or the frequency of types of legal entities associated with the community). Then I also looked at correlations between variables: for example, are the bigger communities the ones hosted by commercial providers? Or do non-profit providers generate more collaboration between the participants?
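
The kind of analysis described here, descriptive statistics plus correlations between coded variables, can be sketched in a few lines of Python. The coded figures below are invented for illustration and are not from the actual 50-case sample:

```python
from math import sqrt
from statistics import mean

# Made-up coded data for six cases (illustrative only):
# community size, and provider type coded 1 = non-profit, 0 = commercial
community_size = [5, 50, 400, 3000, 20000, 150000]
nonprofit = [0, 0, 1, 0, 1, 1]

def pearson(xs, ys):
    """Pearson correlation coefficient between two coded variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Descriptive statistic: share of non-profit providers in the sample
print("non-profit share:", mean(nonprofit))

# Correlation: do non-profit providers host the bigger communities here?
print("r =", round(pearson(nonprofit, community_size), 2))
```

With real data one would use SPSS or R as in the talk; the sketch only shows the shape of the computation.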

Some initial considerations on the large-N:

* The large-N was adequate due to the novelty of the OCC phenomenon. It helped me to more precisely conceptualise and describe the OCCs.

* Apart from the data collection for the statistical analysis of correlations between my variables, the exercise was very useful in terms of "online ethnography", that is, increasing understanding by observing the OCCs. Field notes were kept during the data collection.

* The large-N analysis was useful in terms of the need to go beyond the literature's focus on single case studies and to consider not only experiences of success, but also of failure.

Finally, the large-N helped me to identify cases and hypotheses for the case studies.

Case studies comparison

The case studies of OCCs are used in order to extract a more in-depth understanding.

Steps of case studies:

1) Selection of case studies:

From the large-N emerged four main models of provision, or infrastructure governance, and so I chose one case for each model for the in-depth case study comparison.

My cases were:

+ Flickr, provided by a big corporation, Yahoo;

+ wikiHow, provided by an enterprise;

+ Wikipedia, provided by a non-profit foundation;

+ The Social Forums memory project, provided by an open assembly composed of a self-selected group of participants.

2) Case study methods

I combined several methods in the case studies.

Remark: I did not follow exactly the same plan for each case. For example, before starting the research I was already familiar with the Social Forum case, but not with the other cases; in this regard, I conducted fewer interviews for the Social Forum.

The methods used were:

  1. Virtual ethnography of the online platforms

  2. Digital threads analysis of participation data: only for the Social Forums case; for the other cases, such as Wikipedia, I used participation data already available.

  3. Observation of participation in physical encounters and headquarters

  4. Review of documentation of the cases

  5. (Structured and unstructured) interviews with participants and consultation of experts >>> In total, I conducted 80 interviews.

    + To secure interviews with OCC participants, the most effective procedure was, on the one hand, to go to face-to-face meetings and, on the other, to ask the people I interviewed to put me in contact with other people I wanted to interview. // To me, the better response from informants in physical encounters is mainly related to gaining trust and attracting their attention. With other ways of gaining the informants' trust and attracting their attention, developing a case study using only online methods might also work.

    For the Wikimedia and Flickr data collection, I did a fieldwork internship in the San Francisco Bay Area and a trip to the east coast. The collection of interviews was also importantly developed at Wikimania and at meet-ups of the communities.

Another tip concerning the interviews: do not start with the people who are most difficult to get, as your interviewing matures as you conduct more and more interviews.

+ During the interviews a visualization technique was used, based on asking the person to "draft" the relationship between the provider and the community according to how he/she conceives it, and then asking him/her to comment on different drafts representing the mentioned relationship.

+ Finally, transcribing the interviews was time-consuming but essential. The level of understanding grows exponentially with the transcription.

  6. Organization of group discussions with participants and specialists

+ As part of the research, I contributed to building a collaborative space, the Networked Politics project, on the research of a larger area of topics (new forms of political organising) related to my research question.

+ This collaboration has been of great value for the research development in terms of providing feedback on the emerging research and getting to know relevant literature.

+ Furthermore, with the support of Networked Politics, I organized collective discussions (seminars) with participants and informants of my case studies and with experts in the area.

* To design and guide these group discussions, a focus group methodology was adapted.

* I consider facilitating reflexivity among actors and contributing to building relationships among them to be a resulting impact of the research. It was also useful in this regard.

Main problems of the case study comparison:

* Case comparison was not a problem for the case studies, in contrast to the limitations of equal indicators in the large-N that I commented on previously.

* The process of data collection has been characterized more by an overload of available data than by a "lack" of data.

Q: What types of software do the different types of organisations/platform providers use?

Johanna Niesyto - Experiences with tools across the EN and DE language versions

The objective of my research is to explore the interrelations between knowledge production and the political, in the context of the appropriation of the social web by peer-to-peer networks. Since Wikipedia has become 'mainstream' and is one of the major actors of knowledge production, at least on the German- and English-speaking web, the English and German language versions of Wikipedia (the two biggest versions) are taken as examples to explore both the politics of knowledge production and political knowledge production.

Politics of knowledge production: What are the rules and power games of knowledge production on Wikipedia? What norms and values are agreed upon, and what is contested, across and within the chosen language versions? Do differences in terms of policies and politics exist between the language versions, or are differences rather to be found depending on the type of conflict/controversy?

Political knowledge production: Which kind of political knowledge is produced in the English and German language versions of Wikipedia? How and in which ways is knowledge produced on the platform? What commonalities and differences can be found by comparing the two language versions? What is deemed to be 'official' or marginalized political knowledge on Wikipedia across the chosen language versions?

To approach these questions, mainly qualitative research methods are used. In order to select the article sequences to be analysed, quantitative tools are used. My research is currently at the stage of data collection.

Four levels of analysis on EN and DE:

  1. Institutional framework
  2. Fundamental principles (Five pillars & NPOV)
  3. Collective epistemic practices
  4. Outcome of epistemic practices (Genetically modified food & Muhammad)

Tools/methods/data not used:
  • History flow (IBM): difficult to use as a 'regular' end-user; does not deal well with umlauts in German; takes a long time to render.
  • Doesn't want to use database dumps; wants to work with recent revisions, which means manual analysis.

Tools/methods/data used:
  • How "important" are the chosen articles (Genetically modified food & Muhammad)? Determined by page view stats. Q: Why is the importance of an article determined only by page views? A: This assumption comes from political science, where public display is important. From a political knowledge production point of view it is still important.

  • Article/talk page history is shown by downloading the article's history, then using Editeur to mark up the data, then visualizing with pivot tables in Excel.

  • Find participants for interviews by looking at:
    Who edited the article?
    The number of edits in Wikipedia.
    Q: What is more interesting, more edits on the talk page or peaks in article edits? There is a tendency to do more on the talk page than on the article page.
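
Edit counts per participant can be derived from the revision metadata that the MediaWiki API exposes (action=query&prop=revisions&rvprop=user|timestamp). A minimal sketch over made-up revision entries:

```python
from collections import Counter

# Revision metadata of the kind returned by the MediaWiki API
# (action=query&prop=revisions&rvprop=user|timestamp); entries are made up.
revisions = [
    {"user": "Alice", "timestamp": "2010-01-03T12:00:00Z"},
    {"user": "Bob",   "timestamp": "2010-01-05T09:30:00Z"},
    {"user": "Alice", "timestamp": "2010-02-11T17:45:00Z"},
    {"user": "Carol", "timestamp": "2010-02-12T08:10:00Z"},
    {"user": "Alice", "timestamp": "2010-03-01T22:05:00Z"},
]

def edits_per_user(revs):
    """Rank editors by number of revisions, most active first."""
    return Counter(r["user"] for r in revs).most_common()

print(edits_per_user(revisions))
```

Running the same count over an article page and its talk page separately gives the talk-versus-article comparison raised in the question.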

Tools/methods that might be of help:
  • WikiDashboard: Providing social transparency to Wikipedia
  • Google search / the Google Scraper to look for the distribution of certain sources, e.g. Greenpeace
  • Looking at the history of an article, integrating Template:Merge and Template:Split in the analysis

DMI's Wikipedia tools - Erik Borra

Topic revision: r10 - 27 Mar 2010, WikiGuest