Mapping the Local Clickgraph for Fun and Profit

In a pleasant example of graph theory having applicability to the life of a webmaster, we construct the graph of the union of all clickpaths on a given webhost. Then, we note that even for a relatively piddling website we get a graph with 8591 nodes and 20920 edges. We then try and deal with the problem of presenting the data in an understandable fashion, coming to the conclusion that a hybrid layout with one layout style for local nodes and another for remote nodes is probably the best choice.

An Introduction to HTTP Server Logs and the Properties of the Webgraph Hidden Therein

When a web browser submits an HTTP request, it is standard for it to include a referrer field specifying what URL it came from. This field is not included if the user typed in the URL directly, but most HTTP requests (98% in our sample) are the result of following links and include this header. Both the requested and referring URLs are saved in the webservers log. We then mine this log by viewing each URL as a node in a graph, and a request for URL T that includes a referral from URL F as a directed edge from F to T in that graph.

We then take this unweighted multigraph and turn it into a weighted digraph by creating a new graph in which there is an edge from F to T iff in the new graph there was an edge from F to T in the old graph. The weight of this F-T edge is equal to the number of F-T edges in the old graph.

A fat tailed degree distribution
Figure 1: The distribution of degrees. Indegree, outdegree, and their sum. The data at 100 is the number of vertices with a degree of 100 or greater. Click for the full picture.

This new graph has many interesting properties. It seems to be scale-free, and the degree distribution is fat-tailed [1]. Almost every node is part of one huge weakly connected component, and it is very difficult to tell beforehand what nodes will be popular. Looking at the degree distribution the fat-tailed nature becomes clear - especially when it is pointed out that the data at 100 includes all vertices with a degree greater than 100 as well. Furthermore, the fact that the degree and the indegree are extremely close at 1 indicates that there are a lot of things that are only linked to from one location that do not link to anything else - like images. There are also other URLS with an outdegree of 1, but not nearly as many. Furthermore these URLs are mostly offsite URLs and not under the control of the webmaster.

All that statistics gathering is all well and good, and is even kind of interesting! But in order to be useful to a webmaster, we need to deal with the popularity and in-degree of specific URLs. In order to do that, we will need to find a good method of graphing the huge graph.

Graphing This Giant Data Set

In order to figure out what way of viewing this data is the best, we need to establish some goals - some will be quantitative and measurable, while others will be more qualitative and nebulous. We want the resulting picture to have the following properties:

  1. Easy to pick out the most popular links
  2. Easy to figure out which node corresponds to which URL
  3. The heirarchy of URLs is represented somehow
  4. Less popular links don't obscure more popular links
  5. Easy to distinguish between local and remote URLS

Before we break into custom stuff, let's establish that this graph is a little too big for existing tools to render as a static graph. The program dot (part of the Graphviz suite from AT&T) took almost 2 hours and then crashed. The Large Graph Layout engine got tried next. LGL output

Figure 2: The output of Large Graph Layout. Click on the image for the full-size version.
It laid out the data in a lot less time, and the resultant graph had some nice features. The resultant graph clearly showed that there were somepieces of the graph that didn't connect to anything else. The main issue with the LGL output was that the implcit heirarchy of URLs was not visible at all, and that the popular nodes were all but indistinguishable. Further issues included the fact that LGL could only handle undurected graphs. Nonetheless, it seems that it could be a valuable tool for analyzing the sparse little subgraphs that don't connect to the main body.

Various layout strategies

Circle

Figure 3: Circle Layout. Click for the large imagemap.

Spiral

Figure 4: Spiral Layout. Click for the large imagemap.

Concentric Circle by URL depth

Figure 5: Concentric circle/outline layout. Click for a large imagemap.

Implementation Details

We wanted a nice vector drawing system, and we wanted it to be clickable. So we had 4 options for output:

  1. PDF
  2. Flash
  3. SVG
  4. PostScript + client side imagemap
PDF proved to be too tricky to generate on the fly like we wanted, Flash (.SWF) is also tricky, SVG support is not there yet, leaving the PostScript solution. That way we could emit postscript, and not worry about rendering, because Ghostscript has solved that problem for us. Similarly, we wrote the Postscript so that, when rendered, it also printed out the imagemap to stdout. It felt pretty cool, and it allowed us to avoid worrying about keeping the image map and postscript in sync.

Files

If you would like to use this code to graph your own local clickgraph, then you will need to change all references to soy.dyndns.org in the code to the hostname under test. After that, once you have generated the postscript file (via a command line that looks like

( sudo cat /var/log/apache/access.log /var/log/apache/access.log.1; sudo zcat /var/log/apache/access.log.*.gz ) | ./analyze.py outputFileWithoutDotPS treelayout 1
you can generate both the raster image and the corrseponding client side image map via convert file.ps file.png > file.html. The .png will then be the image you want, and the .html will contain your image map. These instructions are admittedly pretty rudimentary - really if you want to use this, I will be more than happy to help you out. Send me email at peter@cs.uoregon.edu or an instant message on AIM to "jongleur peter" and I will be more than happy to help you out.

Conclusion

Even small websites can create complex clickgraphs, and the best way of laying things out generally pays more attention to URLs than it does to their connection. Strong links should be emphasized, possibly even to the point where little-used links get completely ignored. Any algorithm with a complexity greater than O(n lg n) just takes too long on data sets of this size, and the problem would only be compounded on websites with more URLs. Datasets this large must be handled carefully if any meaning at all is to be derived from the picture besides it looking cool.

References and Further Reading

  1. Scale free networks and fat-tailed distributions are a whole research are all their own. I learned what I know from Evolution of Networks by S.N. Dorogovtsev and J.F.F. Mendes, but there are many other good books on the subject. The most readable of which is probably Linked by Albert-Laszlo Barabasi
2 comments
James H 12:28 am 6 April 2004
This is a lame and not very helpful comment, but you should have your files available in a tarball.
~JMH
Peter 12:26 pm 6 April 2004
Tarball is now available.
Your name/email?
Your comment?
To foil some spambots that have decided to post to my website, please enter a monkey's favorite food into this box
Note - there is no HTML allowed in comments. Newlines become <br/> tags.