Python: Aug 24 2012

(probably the last post written last year which I never got around to publishing)

Well, there was a flurry of activity in processing, and then a grant proposal was written and submitted. Before and after the grant proposal, I have been learning Python from the Google Python course. This evening I worked on the exercise of taking Social Security’s baby name data (popularity ranking of boy and girl names for the US, for each year: http://www.ssa.gov/oact/babynames/) and writing a Python program that uses regular-expression magic to convert the HTML pages into cleaned-up, simple text files containing only:

2006
Aaliyah 91
Aaron 57
Abagail 895
Abbey 695
Abbie 650

Most of the work is done by the line

     tuples = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)

thus ignoring all the chaff, zeroing in on the lines of form

    <tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>

 

and converting them to ('1', 'Michael', 'Jessica') tuples and the like (findall returns every captured group as a string, so the rank arrives as '1' rather than 1).
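
As a minimal sketch of how that one line fits into the whole conversion (this is not the course’s reference solution), the snippet below runs the same pattern over a small hand-written HTML fragment and builds the sorted "Name rank" lines. The function name extract_names, the "Popularity in YYYY" year marker and the sample fragment are all assumptions made for illustration; the real pages may differ in detail.

```python
import re

# Hand-written fragment in the shape of the rows quoted above; it is an
# illustrative sample, not copied from the site. Real pages carry far
# more markup around these rows.
SAMPLE_HTML = """
<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
"""

def extract_names(text):
    """Return [year, 'Name rank', ...] with the names sorted alphabetically."""
    # Assumed year marker: "Popularity in YYYY" somewhere on the page.
    year_match = re.search(r'Popularity in (\d{4})', text)
    year = year_match.group(1) if year_match else 'unknown'

    # Each match is a tuple of strings: (rank, boy_name, girl_name).
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)

    ranks = {}
    for rank, boy, girl in rows:
        # Rows come in rank order, so the first rank seen for a name is
        # its best one if the name appears in both columns.
        ranks.setdefault(boy, rank)
        ranks.setdefault(girl, rank)

    return [year] + ['%s %s' % (name, ranks[name]) for name in sorted(ranks)]

print('\n'.join(extract_names(SAMPLE_HTML)))
```

Keeping the ranks in a dictionary keyed by name is one simple way to merge the boy and girl columns into the single alphabetical list shown at the top of the post.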

Then in the command window one can enter

  python babynames.py --summaryfile baby*.html

to convert many files all at once. The instructor (three cheers for Nick Parlante!) then suggests looking for patterns over time using shell commands:

  grep 'Juliet ' *.summary

gives an output in the command window like this:

… baby2007.html.summary:Juliet 519
baby2008.html.summary:Juliet 453
baby2009.html.summary:Juliet 318
baby2010.html.summary:Juliet 285
baby2011.html.summary:Juliet 252

showing how the rank of Juliet as a baby name changes from year to year. I was inspired to download the baby name rank files for the top 1000 names for the last 100 years (saving each year’s file one by one by hand, as I haven’t learned how to automate such a process yet; it took 20 minutes), and then to write another Python program that would, for a given name, produce a text file with the summary:


2007 519
2008 453 etc.
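
A minimal sketch of what such a second program could look like is below; it is not the program I actually wrote. It assumes the summary files have the layout shown earlier (the year on the first line, then one "Name rank" pair per line), and the function name name_history is made up for this example.

```python
def name_history(name, summary_paths):
    """Return [(year, rank), ...] for `name`, one entry per file that lists it."""
    history = []
    for path in summary_paths:
        with open(path) as f:
            lines = f.read().split('\n')
        year = lines[0].strip()  # assumed: first line of a summary is the year
        for line in lines[1:]:
            parts = line.split()
            # Compare the whole name field, so 'Juliet' never matches
            # a longer name that merely starts with the same letters.
            if len(parts) == 2 and parts[0] == name:
                history.append((year, int(parts[1])))
                break
    # Four-digit years sort correctly as strings.
    history.sort()
    return history

# Hypothetical usage, run in the folder holding the *.summary files:
#   import glob
#   for year, rank in name_history('Juliet', glob.glob('*.summary')):
#       print(year, rank)
```

Years in which the name is missing from the top 1000 simply produce no line, which is why a name like Ahmed would only start to show up partway through the century.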

I then turned to Processing, since I do not yet know a quick way to deal with graphics in Python, and similarly used Photoshop as the quickest way to add labels. I found to my surprise that Paula, which I thought was an uncommon first name, actually ranked between 38 and 100 in the years 1943-1974. It has declined greatly since then, though. Juliet (my mother’s name), which I hadn’t realized was a rare name, was much more uncommon than Paula through most of the 1900s, but pulled ahead in this century. Ahmed (my husband’s name) first appeared as a top 1000 name in the US in 1974, and also pulled ahead of Paula around 2000. Finally, for some time around 1980, Misty (our cat’s name) was the most popular of all these names. Unfortunately, neither Paolo (my father’s name) nor Tortoiseshell (their cat’s name) was ever in the top 1000 in the US. Here is the graph:

[Image: namegraph]

Another thing that would be interesting to do is to integrate this information over time, to see the popularity of names over a whole century. One couldn’t do it with the data I have used so far, since there is no way to aggregate when you know only the rank, not the number of people with each name. However, I have revisited the site and found that, in addition to the top 1000 ranking data, they also release yearly data on the NUMBER of people with each name, for all names given to at least 5 people in that year (the cutoff is for privacy reasons; apparently this accounts for about 70% of the population). It comes in one nice zip file, with all the little files from 1880 through 2011 neatly packaged together and with far cleaner formatting… oh well, nice exercise Nick, but this is much nicer data. So in that case Paolos appear (85 in 2000, for example), though still no Tortoiseshells…
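
With counts instead of ranks, the aggregation becomes a simple sum. The sketch below is only an outline of that idea: it assumes the zip unpacks into one text file per year whose lines look like Name,Sex,Count, and the function name total_counts and the folder name in the usage comment are hypothetical.

```python
import collections

def total_counts(paths):
    """Sum the yearly counts for each (name, sex) pair across the given files."""
    totals = collections.Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Assumed line layout: Name,Sex,Count
                name, sex, count = line.split(',')
                totals[(name, sex)] += int(count)
    return totals

# Hypothetical usage, assuming the zip was unpacked into ./names:
#   import glob
#   totals = total_counts(glob.glob('names/*.txt'))
#   for (name, sex), n in totals.most_common(10):
#       print(name, sex, n)
```

Because names given to fewer than 5 people in a year are left out of the files, any total computed this way is a lower bound rather than an exact count.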