Homework 1: Data Scraping

Karin Hansen <karinehansen@gmail.com>


Part 1: Regular Expressions

1. a*b*c* - this also matches the empty string
a+b*c*|a*b+c*|a*b*c+ - more complex, but does not match the empty string

2.
 a. (0|1)*111(0|1)*
 b. (0|1)*110(0|1)*
 c. (0|1)*1101100(0|1)*
 d. (0|10)*1* (or !~/110/ in Perl)
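
The pattern in (d) can be checked by brute force. The following is a minimal sketch, assuming (from the Perl alternative) that the question asks for binary strings that do not contain 110. The function names are made up for illustration.

```python
import re
from itertools import product

# Pattern from answer 2(d); fullmatch() anchors it to the whole string.
NO_110 = re.compile(r"(0|10)*1*")


def matches_no_110(s):
    """Return True if the whole string s matches (0|10)*1*."""
    return NO_110.fullmatch(s) is not None


def check_up_to(max_len):
    """Compare the regex with a plain substring test on every
    binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if matches_no_110(s) != ("110" not in s):
                return False
    return True
```

Calling check_up_to(10) compares the two tests on every binary string of up to ten characters.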

3. ^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
(Note: the ^ and $ anchors force the pattern to match the whole word.)
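
The pattern in answer 3 is tedious to type by hand, so it can be generated instead. This is a small sketch, assuming the question asks for lowercase words whose letters appear in alphabetical order; letters_in_order is a hypothetical helper name.

```python
import re
import string

# Build ^a*b*c*...z*$ from the lowercase alphabet.
PATTERN = "^" + "".join(letter + "*" for letter in string.ascii_lowercase) + "$"
ALPHA_ORDER = re.compile(PATTERN)


def letters_in_order(word):
    """Return True if the letters of word never go backwards in the alphabet."""
    return ALPHA_ORDER.match(word) is not None
```

For example, letters_in_order("almost") is True and letters_in_order("zebra") is False. As with a*b*c* in answer 1, the empty string also matches.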


Part 2: Scraping Text Data with Python

I chose to visualize the text of famous speeches by John F. Kennedy. The History Place website contains many famous speeches, including several by John F. Kennedy, such as his inaugural address and his "go to the moon" speech. One advantage of this site is that the speeches are formatted in a way that makes them straightforward to extract.

The Python script SpeechScraper.py, adapted from LectureScraper.py, first retrieves the URLs of the JFK speeches from the index page. These URLs all start with the string "jfk-", which allows them to be picked out of the full list of links. Each speech page is then read, and the speech text is extracted from the page and printed.
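
The link-filtering step could look roughly like the sketch below. It is not the code from SpeechScraper.py: it uses only the standard library's html.parser, works on an HTML string that has already been downloaded, and the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class SpeechLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix="jfk-"):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith(self.prefix):
                self.links.append(value)


def find_speech_links(index_html, prefix="jfk-"):
    """Return the matching links in the order they appear on the page."""
    parser = SpeechLinkParser(prefix)
    parser.feed(index_html)
    return parser.links
```

Given an index page containing links such as "jfk-example.htm" (a made-up file name) alongside links to other speakers, find_speech_links returns only the ones beginning with "jfk-".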

To run the program, switch to the DataCollection directory and type python SpeechScraper.py > output.txt at the command line.

The speech text (output.txt) was then uploaded to the Many Eyes website and used to create the following visualization using the Wordle tool.


Figure 1 - Word cloud representing the most frequently used words in President John F. Kennedy's most famous speeches.

Figure 2 - Interactive word cloud representing the most frequently used words in President John F. Kennedy's most famous speeches.
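
A word cloud sizes each word by how often it occurs. The counting step can be sketched as follows, assuming simple lowercase tokenization and a short hypothetical stop-word list. How the Wordle tool actually processes text is not described here, so this only illustrates the idea.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "we"}


def top_words(text, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```

Running top_words on the contents of output.txt would list the words expected to appear largest in the cloud.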


Part 3: Your First Processing Sketch

Data Source

As a data source I chose the viewer ratings for episodes of the various Star Trek franchise TV shows, taken from the TV.com website. I chose this data because it is interesting to see how others rated the different shows and how consistent the quality of the individual episodes was.

The ratings were scraped using the Python script TVScraper.py. To run the script, change to the data directory and type python TVScraper.py > star-trek.tsv at the command prompt.

I chose to have the program scrape, for every show, only as many episode ratings as the show with the fewest total episodes has (the Original Series, I think). That way the X axes are consistent across the various tabs.
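
The truncation described above can be sketched as follows. This is not the code from TVScraper.py: the function names and the column layout of the TSV output are assumptions, and the input is a plain dictionary that maps each show name to its list of ratings in episode order.

```python
def truncate_to_shortest(ratings_by_show):
    """Cut every show's ratings list to the length of the shortest one.

    Assumes the dictionary contains at least one show.
    """
    n = min(len(ratings) for ratings in ratings_by_show.values())
    return {show: ratings[:n] for show, ratings in ratings_by_show.items()}


def to_tsv(ratings_by_show):
    """Format the truncated ratings as tab-separated text, one row per episode."""
    trimmed = truncate_to_shortest(ratings_by_show)
    shows = list(trimmed)
    rows = ["\t".join(["episode"] + shows)]
    # zip(*...) walks the shows in parallel, one episode number at a time
    for i, episode_ratings in enumerate(zip(*trimmed.values()), start=1):
        rows.append("\t".join([str(i)] + [str(r) for r in episode_ratings]))
    return "\n".join(rows)
```

With every column the same length, each tab of the sketch can share one X-axis range.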

Design Decisions

Process Notes

Output



Figure 3 - Viewer ratings of episodes of the various Star Trek franchise series from tv.com.


Submitted February 18, 2009