In a pleasant example of graph theory having applicability to the life of a webmaster, we construct the graph of the union of all clickpaths on a given webhost. Then, we note that even for a relatively piddling website we get a graph with 8591 nodes and 20920 edges. We then try and deal with the problem of presenting the data in an understandable fashion, coming to the conclusion that a hybrid layout with one layout style for local nodes and another for remote nodes is probably the best choice.
When a web browser submits an HTTP request, it is standard for it to include
a referrer
field specifying what URL it came from. This field is not
included if the user typed in the URL directly, but most HTTP requests (98% in
our sample) are the result of following links and include this header. Both
the requested and referring URLs are saved in the webservers log. We then mine
this log by viewing each URL as a node in a graph, and a request for URL T that includes a referral from URL F as a directed edge from F to
T in that graph.
We then take this unweighted multigraph and turn it into a weighted digraph by creating a new graph in which there is an edge from F to T iff in the new graph there was an edge from F to T in the old graph. The weight of this F-T edge is equal to the number of F-T edges in the old graph.
This new graph has many interesting properties. It seems to be scale-free, and the degree distribution is fat-tailed [1]. Almost every node is part of one huge weakly connected component, and it is very difficult to tell beforehand what nodes will be popular. Looking at the degree distribution the fat-tailed nature becomes clear - especially when it is pointed out that the data at 100 includes all vertices with a degree greater than 100 as well. Furthermore, the fact that the degree and the indegree are extremely close at 1 indicates that there are a lot of things that are only linked to from one location that do not link to anything else - like images. There are also other URLS with an outdegree of 1, but not nearly as many. Furthermore these URLs are mostly offsite URLs and not under the control of the webmaster.
All that statistics gathering is all well and good, and is even kind of interesting! But in order to be useful to a webmaster, we need to deal with the popularity and in-degree of specific URLs. In order to do that, we will need to find a good method of graphing the huge graph.
In order to figure out what way of viewing this data is the best, we need to establish some goals - some will be quantitative and measurable, while others will be more qualitative and nebulous. We want the resulting picture to have the following properties:
Before we break into custom stuff, let's establish that this graph is a
little too big for existing tools to render as a static graph. The program
dot (part of the Graphviz suite from
AT&T) took almost 2 hours and then crashed.
The Large Graph Layout engine
got tried next.
We wanted a nice vector drawing system, and we wanted it to be clickable. So we had 4 options for output:
analyze.pycircle.pygraphLogs.shproduce_graphviz.pyproduce_lgl.pypsMaker.pyspiral.pyspringBob.pytreelayout.pywebmap.tar.gzIf you would like to use this code to graph your own local clickgraph, then
you will need to change all references to soy.dyndns.org in the
code to the hostname under test. After that, once you have generated the
postscript file (via a command line that looks like
( sudo cat
/var/log/apache/access.log /var/log/apache/access.log.1; sudo zcat
/var/log/apache/access.log.*.gz ) | ./analyze.py outputFileWithoutDotPS
treelayout 1 you can generate both the raster image and the
corrseponding client side image map via convert file.ps file.png >
file.html. The .png will then be the image you want, and the .html will
contain your image map. These instructions are admittedly pretty rudimentary -
really if you want to use this, I will be more than happy to help you out.
Send me email at peter@cs.uoregon.edu or an instant message on AIM to "jongleur
peter" and I will be more than happy to help you out.
Even small websites can create complex clickgraphs, and the best way of laying things out generally pays more attention to URLs than it does to their connection. Strong links should be emphasized, possibly even to the point where little-used links get completely ignored. Any algorithm with a complexity greater than O(n lg n) just takes too long on data sets of this size, and the problem would only be compounded on websites with more URLs. Datasets this large must be handled carefully if any meaning at all is to be derived from the picture besides it looking cool.
~JMH