Sunday 25 April 2010

Using a blog as a Logbook

Last productivity post for a while I promise, then back to proper physics.

I'm trying out using a private blog (secured and unlisted etc.) as a logbook. Logbooks are so important; the first time you realise you need one, it's already too late. My reasoning for going online goes:
  1. I can access from anywhere, including emailing in posts from my phone. For example I could take a photo of a whiteboard discussion and send it to the blog so I won't forget about it.
  2. It's safely backed up on the servers of whoever's hosting it.
  3. I'm more likely to actually make entries because it's easy.
  4. A blog is much like a logbook anyway so it's naturally suited.
I can see some downsides but they're all pretty minor. Security could be an issue, but how sensitive is the information you put in your logbook really? Mine isn't very: I don't want to accidentally broadcast my latest idea, but leaking it wouldn't cause a new Climategate or anything. Besides, I think it's pretty secure.

I've chosen to use WordPress for my logbook for one simple reason: the LaTeX integration is fantastic. It's so good I'd consider moving this blog if it weren't such a pain (for the record, I otherwise like Blogger). In WordPress you do this:

I will now insert an equation here, $latex E=mc^2$, inline with the text.

which renders the equation as an image inline with the text, although the superscripts do appear to have messed up the vertical alignment slightly. Otherwise it does a brilliant job of interpreting the TeX and inserting the image. If you need a lot of LaTeX then there are programmes that convert between regular .tex files and the WordPress format.

There are similar things available for Blogger, but I think you lose your source code in a more drastic way. Anyway, I'm going to see how it goes.

Wednesday 21 April 2010

Pipes and Python

I spent ages writing a post about some tricks I use to do quick analysis of data, but it got incredibly bloated and started waffling about workflows and so on. Anyway, I woke up from that nightmare, so I thought I'd just bash out a couple of my top tips.

This is a pretty nerdy post, you may want to back away slowly.

Pipes
Pipes are, in my opinion, why the command line will reign for many years to come. Using pipes I can quickly process my data by passing it between different programmes, gradually refining it as it goes. Here's an example that makes a histogram (from a Bash terminal):

> cat myfile.data | awk 'NR>100 {print $5}' | histogram | xmgrace -pipe

The first command prints the data file. The | is the pipe: it redirects the output of one programme to the input of the next. Here Awk simply picks out the fifth column of every row past the hundredth and prints the result. Our pruned data is piped down the line to a programme I made called histogram, which computes the histogram and passes the final result to my favourite plotting programme so I can have a look at it.

So that's four programmes chained together in a single "one-liner" (some of my one-liners become ginormous). Once you start to get the hang of this sort of daisy-chaining it can speed things up incredibly. The one bit that took a while the first time was the histogram programme itself, which took an annoying amount of time to set up because I wrote it in C.

This is where Python now comes in.

Python

I won't even try to give a Python tutorial. I'm a decade late to the party and have barely scratched the surface. However, I've found that for relatively little effort you can get access to thousands of functions, libraries and even graphics. Most importantly you can quickly write a programme, pipe in some data, and do sophisticated analysis on it.
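In fact, that's worth a tiny example before going on (a minimal sketch; the script name and the choice of statistics are just for illustration). It reads whitespace-separated numbers from a pipe and prints a couple of summary statistics:

#! /usr/bin/env python
# stats.py - a throwaway pipe filter (illustrative sketch)
# Reads whitespace-separated numbers from stdin, prints mean and std.
import sys
import numpy

data = []
for line in sys.stdin:
   data.extend(map(float, line.split()))

print "mean =", numpy.mean(data), "std =", numpy.std(data)

It slots straight into a pipeline:

> cat myfile.data | awk 'NR>100 {print $5}' | python stats.py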

With the scipy and numpy libraries I've done root-finding and integration. The pylab module seems to provide many of the functions you'd get in MatLab. Python is a bit of a missing link for me: it's much lighter than huge programmes like Mathematica or MatLab, and I just get things done quickly.
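To give a flavour of the scipy side, here's the kind of two-minute job I mean (just a sketch, and the function is an arbitrary example):

#! /usr/bin/env python
# Quick root-finding and integration with scipy (an illustrative sketch)
from scipy.optimize import brentq
from scipy.integrate import quad

f = lambda x: x**3 - 2*x - 5

root = brentq(f, 1.0, 3.0)     # root of f, bracketed in [1, 3]
area, err = quad(f, 0.0, 2.0)  # integral of f from 0 to 2, with error estimate

print "root =", root
print "integral =", area, "+/-", err

Here's that histogram programme, Python style.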


#! /usr/bin/env python
import sys
import pylab

# Check the inputs from the command line
if len(sys.argv) != 3:
   print "Must provide file name and number of bins"
   sys.exit(1)

# Read in the data file, one row of floats per line
f = open(sys.argv[1], 'r')
histo = []
for line in f:
   histo.append(map(float, line.split()))
f.close()

# The number of columns tells us whether to do a 1D or 2D histogram
dimension = len(histo[0])

if dimension == 1:
   pylab.hist(histo, bins=int(sys.argv[2]))

   pylab.xlabel("x")
   pylab.ylabel("N(x)")
   pylab.show()

elif dimension == 2:
   # Need to chop up the histo list into two 1D lists
   x = []
   y = []
   for val in histo:
      x.append(val[0])
      y.append(val[1])

   # This function is apparently straight out of MatLab
   # I killed most of the options
   pylab.hexbin(x, y, gridsize=int(sys.argv[2]))

   pylab.show()

else:
   print "Can only histogram one or two columns of data"
   sys.exit(1)

It conveniently detects how many dimensions we're histogramming in, so you don't need two programmes. That's pretty short for a programme that does what it does.
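For what it's worth, since this version takes a file name rather than reading from stdin, to use it at the end of the pipeline above I dump the pruned data to a temporary file first (assuming you've saved the script as histogram.py):

> cat myfile.data | awk 'NR>100 {print $5}' > tmp.data
> python histogram.py tmp.data 50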

I hate wasting my time trying to do something that my brain imagined hours ago. I wouldn't say that these techniques are super easy, but once you've learned the tools they are quick to reuse. I'd say they're as important to my work now as knowing C. Got any good tricks? Leave a comment.

Something less nerdy next week I promise.

Wednesday 7 April 2010

Bootstrapping: errors for dummies

The trouble with science is that you need to do things properly. I'm working on a paper at the moment where we measured some phase diagrams. We've known what the results are for ages now, but because we have to do it properly we have to quantify how certain we are. Yes, that's right. ERRORS!

I've come a long way with statistics and have learned to love them, but I defy anyone to truly love errors. However, I took a step closer this month after discovering bootstrapping. It's a name that has long confused me; I seem to see it everywhere. It comes from the phrase "to pull yourself up by your bootstraps". My old friend says it's "a self-sustaining process that proceeds without external help". We'll see why that's relevant in a moment.

Doing errors "properly"
Calculating errors properly is often a daunting task. You can spend thousands on the software, and many people make careers out of it. It usually involves building a statistical model and all sorts of clever stuff. I really don't have much of a clue about this and, to be honest, I just want a reasonable error bar that doesn't undersell, or oversell, my data. Besides, in my case I have to do quite a bit of arithmetic gymnastics to convert my raw data into a final number, so knowing where to start with models is beyond me.

Bootstrapping
I think this is best introduced with an example. Suppose we have measured the heights of ten women and we want to make an estimate of the average height of the population. For the sake of argument our numbers are:

135.8 145.0 160.2 160.9 145.6
156.3 170.5 192.7 174.3 138.2
(all in cm)

The mean is 157.95cm, the standard deviation is 16.88cm. Suppose we don't have anything except these numbers. We don't necessarily want to assume a particular model (a normal distribution in this case); we just want to do the best with what we have.

The key step in bootstrapping is to make a new "fake" data set by randomly selecting from the original, allowing duplicates. If the measurements are all independent and identically distributed, then a fake data set can be thought of as an alternative version of the data: one you could have collected the first time round if you'd happened to get a different sample of people. Each fake set is considered equally likely. So let's make a fake set:

156.3 192.7 160.9 135.8 135.8
156.3 156.3 170.5 156.3 192.7
Mean = 161.36cm, standard deviation = 18.59cm

As you can see, there's quite a bit of replication of data. For larger sets it doesn't look quite so weird. On average about 63% of the original points appear at least once (the chance of a given point never being picked is (1 - 1/n)^n, which tends to 1/e), and the rest of the fake set is made up of duplicates. Now let's do this again lots and lots of times (say 10000), using a different fake data set each time and so generating different means and standard deviations. We can make a histogram of the results:

[Figure: histogram of the bootstrapped means]

From this distribution we can estimate the error on the mean at whatever confidence level we like. Taking 68% (+/- one sigma) we can say that the error on the mean is +/-5.2cm. Incidentally, that's nearly what we'd get if we'd assumed a normal distribution and computed 16.88/sqrt(10) (about 5.3cm). Strangely, the mean of the means is not 157.95, as it was for the input data, but 160.2. This is interesting because I drew the example data from a normal distribution centred at 160cm.

We can also plot the bootstrapped standard deviations.

[Figure: histogram of the bootstrapped standard deviations]

What's interesting about this is that the average is std = 15.2, whereas the standard deviation I actually used to generate the data was 19.5. I guess this is an artefact of the tiny data set. That said, 19.5 looks within "error".

So, without making any assumptions about the model, we've got a way of getting an uncertainty on our measurements when all we have is the raw data. This is where the term bootstrap comes in: the error calculation is a completely internal, self-sustaining process. If it all seems a bit too good to be true then you're not alone; it took statisticians a while to accept bootstrapping, and I'm sure it's not always appropriate. For me it's all I've got, and it's relatively easy.

To make these figures I used a Python code that you can get here. Data here.
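In case those links rot, the core of the calculation is only a few lines anyway. Here's a sketch (not the exact code behind the figures, but the same idea, using the height data from above):

#! /usr/bin/env python
# Bootstrap sketch: resample the data with replacement many times
# and look at the distribution of the recomputed statistic.
import numpy

data = numpy.array([135.8, 145.0, 160.2, 160.9, 145.6,
                    156.3, 170.5, 192.7, 174.3, 138.2])

nboot = 10000
means = numpy.zeros(nboot)
for i in range(nboot):
   # pick len(data) indices at random, duplicates allowed
   picks = numpy.random.randint(0, len(data), len(data))
   means[i] = data[picks].mean()

# the spread of the bootstrapped means estimates the error on the mean
print "mean of means =", numpy.mean(means)
print "error on the mean =", numpy.std(means)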

Update: It's been pointed out to me that working out the error on the standard deviation is a bit dodgy. I think that the distribution is interesting - "what standard deviations could I have measured in a sample of 10?" - but perhaps one should be a little careful extrapolating to the population values. Like I said, I'm not a statistician!