Homework 1: Data Scraping
Karin Hansen <karinehansen@gmail.com>
Part 1: Regular Expressions
1. a*b*c* - this could also match an empty string
a+b*c*|a*b+c*|a*b*c+ - more complex but would not match empty string
2.
a. (0|1)*111(0|1)*
b. (0|1)*110(0|1)*
c. (0|1)*1101100(0|1)*
d. (0|10)*1* (or !~/110/ in Perl)
3. ^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
(Note: the ^ and $ anchors force the pattern to match the entire word rather than a substring of it.)
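The patterns above can be sanity-checked with Python's re module. The following is a minimal sketch, not part of the submitted scripts: the helper names (matches, binary_strings, agrees_with) are made up for illustration, and the brute-force predicates ("contains 111", "does not contain 110") are my reading of what answers 2a and 2d describe, since the questions themselves are not reproduced here.

```python
import re
import string
from itertools import product

# Patterns copied from the answers above.
ABC_ANY = r"a*b*c*"                      # item 1, also matches ""
ABC_NONEMPTY = r"a+b*c*|a*b+c*|a*b*c+"   # item 1, rejects ""
HAS_111 = r"(0|1)*111(0|1)*"             # item 2a
NO_110 = r"(0|10)*1*"                    # item 2d
# Item 3, built programmatically: ^a*b*c* ... z*$
SORTED_LETTERS = "^" + "".join(c + "*" for c in string.ascii_lowercase) + "$"


def matches(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def agrees_with(pattern, predicate, max_len=8):
    """Compare the regex against a plain-Python predicate on all short strings."""
    return all(matches(pattern, s) == predicate(s) for s in binary_strings(max_len))


if __name__ == "__main__":
    print(matches(ABC_ANY, ""), matches(ABC_NONEMPTY, ""))
    print(agrees_with(HAS_111, lambda s: "111" in s))
    print(agrees_with(NO_110, lambda s: "110" not in s))
    print(matches(SORTED_LETTERS, "almost"), matches(SORTED_LETTERS, "banana"))
```

Using re.fullmatch means the whole string has to match, which is the same effect the ^ and $ anchors have in item 3.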
Part 2: Scraping Text Data with Python
I chose to visualize the text of famous speeches by John F. Kennedy. The History Place website contains many famous speeches, including several by John F. Kennedy such as his inaugural speech and his "go to the moon" speech. One advantage of this site is that the speeches are formatted in a manner that makes them straightforward to extract.
The Python script SpeechScraper.py, adapted from LectureScraper.py, first retrieves the URLs of the JFK speeches from the index page. These URLs all start with the string "jfk-", which allows them to be picked out of the list of links. Each speech page is then read, and the speech text is extracted from the page and printed out.
To run the program, switch to the DataCollection directory and type python SpeechScraper.py > output.txt at the command line.
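The link-filtering step can be sketched as follows. This is an illustration only, not the code in SpeechScraper.py: it uses Python 3 and the standard-library html.parser, the names LinkCollector and find_speech_links are hypothetical, and SAMPLE_INDEX is an invented fragment with made-up file names standing in for the real index page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose file name starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare against the last path component so that links given
        # with a directory in front of the file name are still found.
        if href and href.rsplit("/", 1)[-1].startswith(self.prefix):
            self.links.append(href)


def find_speech_links(index_html, prefix="jfk-"):
    """Return the links in index_html whose file name starts with prefix."""
    parser = LinkCollector(prefix)
    parser.feed(index_html)
    return parser.links


# Invented stand-in for the index page; the file names are hypothetical.
SAMPLE_INDEX = """
<ul>
  <li><a href="jfk-example-one.htm">First example speech</a></li>
  <li><a href="other-speaker.htm">Someone else</a></li>
  <li><a href="speeches/jfk-example-two.htm">Second example speech</a></li>
</ul>
"""

if __name__ == "__main__":
    for link in find_speech_links(SAMPLE_INDEX):
        print(link)
```

In a full scraper, each returned link would then be fetched and the speech text pulled out of the page, which depends on the page layout and is not shown here.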
The speech text (output.txt) was then uploaded to the Many Eyes website and used to create the following visualization using the Wordle tool.

Figure 1 - Word cloud representing the most frequently used words in President John F. Kennedy's most famous speeches.
Figure 2 - Interactive word cloud representing the most frequently used words in President John F. Kennedy's most famous speeches.
Part 3: Your First Processing Sketch
Data Source
I chose as a data source the viewer ratings for episodes of the various Star Trek franchise TV shows. The source of the data is the TV.com website. I chose this data because it is interesting to see how others rated the different shows and how consistent the quality of the individual episodes was.
The ratings were scraped using the Python script TVScraper.py. To run the script, change to the data directory and type python TVScraper.py > star-trek.tsv at the command prompt.
I chose to have the program scrape ratings only for as many episodes as the show with the fewest total episodes had (the Original, I think). That way the X axes would be consistent across the various tabs.
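The truncation described above can be sketched like this. It is a minimal illustration rather than the code in TVScraper.py: the function names are hypothetical, the ratings are invented numbers, and the TSV layout (one row per episode number, one column per show) is an assumption about what star-trek.tsv looks like.

```python
def truncate_to_shortest(ratings_by_show):
    """Cut every show's rating list down to the length of the shortest one."""
    shortest = min(len(ratings) for ratings in ratings_by_show.values())
    return {show: ratings[:shortest] for show, ratings in ratings_by_show.items()}


def to_tsv(ratings_by_show):
    """Format equal-length rating lists as TSV: header row, then one row per episode."""
    shows = list(ratings_by_show)
    lines = ["\t".join(["Episode"] + shows)]
    columns = [ratings_by_show[show] for show in shows]
    for episode, row in enumerate(zip(*columns), start=1):
        lines.append("\t".join([str(episode)] + [f"{rating:.1f}" for rating in row]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented ratings for two placeholder shows of different lengths.
    scraped = {
        "Show A": [8.0, 7.5, 9.0],
        "Show B": [6.0, 7.0],
    }
    print(to_tsv(truncate_to_shortest(scraped)))
```

Truncating before writing the file means every column has the same number of rows, so the plotting code does not need to handle missing values.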
Design Decisions
- I changed the colors because grey is boring :-).
- I showed more than three sets of data.
- I changed the axis labels.
- I added space between the tabs so they stand out more.
- I changed the x and y axis label intervals to values appropriate for this data set.
- I added a line connecting the scatter points (easier to view IMHO).
- I added a function to FloatTable called getColumnAvg(int) that calculates the average rating for a given data column. I then draw a line on the graph corresponding to the average and print the average value for the selected column to the right of the graph as a label.
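The calculation behind getColumnAvg(int) is a plain mean over one column. The sketch below shows the idea in Python rather than Processing, to stay in the same language as the scraping scripts; column_average and the list-of-rows table are hypothetical stand-ins, not the FloatTable code itself, and the numbers are invented.

```python
def column_average(rows, col):
    """Mean of column `col` over a list of equal-length numeric rows."""
    if not rows:
        raise ValueError("table has no rows")
    return sum(row[col] for row in rows) / len(rows)


if __name__ == "__main__":
    # Invented table: one row per episode, one column per show.
    table = [
        [8.0, 6.0],
        [7.0, 7.0],
    ]
    print(column_average(table, 0))
    print(column_average(table, 1))
```

The resulting value gives the y position of the horizontal average line and the text of the label drawn next to the graph.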
Process Notes
- Scraping data - Finding data to scrape was the most complicated step. Modifying the Python script from lecture was straightforward. I had to stop myself from spending too much time perfecting the script to deal with all possible error conditions since this script only really needs to run once.
- Processing - Since I have programmed a lot in Java, understanding and modifying the Processing code was also fairly straightforward. I find the IDE for Processing not as useful as Eclipse, though.
Output
Figure 3 - Viewer ratings of episodes of the various Star Trek franchise series from tv.com.
Submitted February 18, 2009