data and visualization

Prof. Kapfhammer received a grant from the Awesome Foundation, and he is running a data visualization contest called the Forty Day Visual Feast.

At the least, this lab will explore the use of regular expressions to process webserver and/or Sparkle log data. At the most, you will explore the use of Python to process and visualize data, and perhaps get a head start on producing something visual, creative, and awesome for the Feast.

[ Update 20100201 ] This was a lot for one week of exploring Python. Learning takes time. For this time around, what was an A is now an A+, the B becomes an A, and the C a B. The new due date is Monday. Note also that there is now a reduced-length Apache logfile that you can use for testing.

where all the data comes from: logfiles

As one of our members pointed out, the challenge with criterion-referenced labs is that you really need things to scale nicely. It is possible to receive an A on this laboratory by working with one of the most common data sources on the planet, one that systems administrators must always wrestle with: log file data.

If you want to tackle something straightforward, you might start here. Mostly, it just involves a little bit less creativity. Pick one of two log files to get started.

apache logs

Apache is a very popular open source webserver. This webpage is served by Apache. To get started with this lab, you can download the logfile for this particular domain, rockalypse.org. If you look, you'll find web hits from Allegheny College.

Here are the access logs from the domain rockalypse.org. This is a 60MB file. There is also a shorter, 1000-line file that may make life easier while you are developing your code.

sparkle logs

Another data source you can use is data from Sparkle. Sparkle is a library that Mac developers can freely integrate into their applications. Once included, it handles the update process in an invisible and robust way. We use this in our programming environment for the Arduino so we can get a sense for who is using our software, where, and when. (Well, that's a side-effect: we use it because it makes it easy for us to provide updates to our users.)

Here are the (limited) logs from our Sparkle framework. There are both Apache-style logs and CSV logs; I'd rather you use the Apache-style.

just the text

An average submission for this week's homework takes your logfile, uses regular expressions to pull it apart, and prints out some summary data to the console. 

For example, you might report how many times the domain rockalypse.org was hit by computers from Allegheny College. (How would you know they were being viewed from Allegheny?) You might report how many different pages were viewed. You might count how many times Google indexed the site. 

The list goes on and on. There are plenty of straightforward statistics about an Apache log file (or the Sparkle log file) that you can report using plain text.
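
To make the idea concrete, here is a minimal sketch of pulling a log line apart with a regular expression and counting a couple of things. It assumes the logfile uses Apache's "combined" log format, which you should confirm by looking at a few lines of the real file. The function names, the pattern, and the sample lines are all invented for illustration; they are not taken from the actual logs.

```python
import re

# Assumption: lines follow Apache's "combined" log format. Look at the real
# logfile and adjust this pattern if its layout differs.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# parseLine : string -> dictionary-or-None
def parseLine(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()

# summarize : list-of-strings -> dictionary
def summarize(lines):
    pages = set()
    google_hits = 0
    skipped = 0
    for line in lines:
        fields = parseLine(line)
        if fields is None:
            # Count lines the pattern cannot handle instead of crashing.
            skipped += 1
            continue
        pages.add(fields["path"])
        # The agent group is None when the line has no user-agent field.
        if "Googlebot" in (fields["agent"] or ""):
            google_hits += 1
    return {"pages": len(pages), "google_hits": google_hits, "skipped": skipped}

# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.10 - - [01/Feb/2010:10:00:00 -0500] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.11 - - [01/Feb/2010:10:00:05 -0500] "GET / HTTP/1.1" 200 1024 "-" "Googlebot/2.1"',
    '192.0.2.10 - - [01/Feb/2010:10:00:09 -0500] "GET /blog/ HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    'this line is not a log entry',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Named groups keep the later code readable, and counting the lines that fail to match is a cheap way to find out whether your pattern is missing cases in the real file.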

An average submission for this week's work includes:

  • a Python script that can be executed from the command line
  • the script should take the name of the logfile as an argument (i.e., you should be able to invoke your script as follows):

    > python myscript.py thelogfile.txt

  • Your functions should all include a comment of the form:

    # addTwoNumbers : number number -> number

    which tells me that the function addTwoNumbers takes two arguments (both numbers) and returns a number.
  • A short report (OpenOffice) that describes what your script does and the process you went through in developing it. Highlight the regular expressions you used to extract information from the logfiles as well as any challenges you faced and how you overcame those challenges.
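
The requirements above can be sketched as a small skeleton. This is only one possible shape, not a required layout: the function names countLines and main are hypothetical, and counting lines stands in for whatever summary your script actually computes.

```python
import sys

# countLines : string -> number
def countLines(filename):
    count = 0
    with open(filename) as logfile:
        for line in logfile:
            count += 1
    return count

# main : list-of-strings -> string
def main(argv):
    # argv[0] is the script name; argv[1] should be the logfile name.
    if len(argv) != 2:
        return "usage: python myscript.py thelogfile.txt"
    return "%s: %d lines" % (argv[1], countLines(argv[1]))

if __name__ == "__main__":
    print(main(sys.argv))
```

Having main build and return the text, rather than printing inside every function, makes it easier to check each piece on its own while you develop.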

a bit of visualization

To go beyond work that is average, I'd like to see you do a bit of visualization.

For example, you might count how many times each web page in the site is visited. Then, print a histogram to the console that depicts this information. It might look like this:

to_do/   xxxxxxxxxxxx
/        xxxxxxx
blog/    xxx

or similar. Put simply, instead of reporting things numerically, give us a simple visual representation using an additional loop or two. This makes it easier for your user to see what is going on, instead of having to read through a bunch of numbers and make sense of it in their head.
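
A console histogram like the one above might be produced along these lines. This is a sketch under assumptions: the counts are already in a dictionary mapping page to number of hits, the function name histogramLines is hypothetical, and the scale parameter (hits per "x") is an addition for pages with very large counts.

```python
# histogramLines : dictionary number -> list-of-strings
def histogramLines(counts, scale=1):
    if not counts:
        return []
    # Pad every label to the longest one so the bars line up.
    width = max(len(page) for page in counts)
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    lines = []
    for page, hits in ordered:
        bar = "x" * (hits // scale)
        lines.append(page.ljust(width) + "  " + bar)
    return lines

if __name__ == "__main__":
    # Invented counts, matching the example histogram.
    for line in histogramLines({"to_do/": 12, "/": 7, "blog/": 3}):
        print(line)
```

With a 60MB logfile a popular page may have thousands of hits, so a scale such as 100 keeps each bar on one line of the console.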

going further

For a strong submission, I'd like to see you actually produce graphical output from your explorations. You can do this in one of two ways.

You could print out a file that gets processed by another program. For example, your Python program could actually print a file that is intended to be consumed by GNUPlot, R, or GraphViz. After your Python program is done, you would then run one of these other programs on that output, and see the pretty picture that results.
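
As a sketch of the first approach, the following writes page counts to a plain-text data file with one quoted label and one number per row, a layout that gnuplot can read. The function names and the filename hits.dat are hypothetical, and the counts are invented.

```python
# formatRows : dictionary -> list-of-strings
def formatRows(counts):
    # Lines starting with "#" are treated as comments by gnuplot.
    rows = ["# page hits"]
    for page in sorted(counts):
        # Quote the label so a path containing spaces stays in one column.
        rows.append('"%s" %d' % (page, counts[page]))
    return rows

# writeDataFile : string dictionary -> None
def writeDataFile(filename, counts):
    with open(filename, "w") as out:
        for row in formatRows(counts):
            out.write(row + "\n")

if __name__ == "__main__":
    writeDataFile("hits.dat", {"to_do/": 12, "/": 7, "blog/": 3})
```

You would then point the plotting program at the file; in gnuplot, something like `plot "hits.dat" using 2:xtic(1) with boxes` is a reasonable starting point, though you should check the gnuplot documentation for the options you want.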

Another way would be to use a Python library for visualization. RPy is one example. (I will see about having this installed on our machines. Because David is awesome, he's adding it this afternoon.)

You will have to do some research on your own in either event. I will provide some hints to get you started along these lines, but it isn't terribly difficult: you must read documentation, patiently explore, and ask for help when you need it.

achieving awesomeness

If you want to use this lab as an opportunity to take part in the Forty Day Visual Feast, you may want to do the following:


  1. Find some interesting data. You could use the logs I provided, as they're very common. Or, you can find something more interesting to you. Infochimps has a ton of really cool, freely available datasets you can download. Additional resources can be found via ReadWriteWeb.

  2. Invent an interesting visualization. You may want to do some Googling around the terms "data visualization" and the like. Edward Tufte's work is considered authoritative in this space. The best way to be inspired in art is to view lots of art, so don't be afraid to look and explore.

  3. Find the right tools. VPython might work for you, if you want to do 3D. The Python Imaging Library might make things easier. Python(x,y) has a lot of resources. Sage, also, combines a lot of horsepower surrounding math and visualization in a tasty, open, Python-friendly package.

If you want to dive in and start working on a visualization for the Feast, talk to me first. We can negotiate deadlines or expectations regarding this homework if you're targeting the development of a visualization for the Feast. 

resources

what's the point?

The point of this lab is to continue exploring Python in an authentic context. Specifically, we're using regular expressions to pull apart textual data, and then we are representing that in some way that converts a ton of text into something digestible. 

This is a very common task in computing: the processing, conversion, and representation of information. Use this homework as an opportunity to explore these ideas. 

PS. Do not click here.

Creative Commons License This work is licensed under a CC BY-SA 3.0 License.