Homework 1: Data Scraping

Due date: Thursday, February 11th at 11:59pm EST



The goal of this homework is to learn how to scrape data from the web using Python (or the tool of your choice) and to start getting comfortable with Processing. You will practice writing Python scripts, and then write a small program to scrape information from a website of your choice. Next, you will visualize your data using the online system ManyEyes. Finally, you will start to dig into Processing, and modify a Processing sketch to visualize numerical data you scrape from the web. You can find submission instructions here.

 


Grading

The homework will be graded according to the guidelines from the syllabus:


5 = Exceptional / above and beyond (we will only give out maybe 5-10 of these for each homework)
4 = Very Solid / no mistakes (or really minor)
3 = Good / some mistakes
2 = Fair / some major conceptual errors
1 = Poor / did not finish?
0 = Did not participate / did not hand in

A 4 constitutes a perfect grade, and getting all 4s is equivalent to an A. A combination of 4s and 3s ends up being an A- to B, and so on. TFs will evaluate your work holistically, beyond mechanical correctness, and focus on the overall quality of the work. In addition to the scores, the TFs will give detailed written feedback.

 


Part 1: Regular Expressions (30%)

To warm up your regex chops, answer the following questions (you do not need to include answers in your write-up). You can check your answers against a regular expression checker online. On this page you can also find a nice reference of operators by clicking the button “Regular Expression Exemples” (spelling error is not ours!).
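If you would rather check answers locally, Python's built-in re module can do the same job. The following is a minimal sketch, assuming Python 3 (re.fullmatch is not available in Python 2.6). The helper name check and the pattern and candidate strings are made up for illustration and are not taken from the questions below. It reports both whether the whole string matches and whether the pattern occurs anywhere inside it, since checkers differ on which of the two they mean by "matches".

```python
import re

def check(pattern, candidates):
    """Return {candidate: (whole_string_match, found_anywhere)}."""
    compiled = re.compile(pattern)
    results = {}
    for text in candidates:
        # fullmatch: the pattern must cover the entire string.
        whole = compiled.fullmatch(text) is not None
        # search: the pattern may match any substring.
        anywhere = compiled.search(text) is not None
        results[text] = (whole, anywhere)
    return results

if __name__ == "__main__":
    # Hypothetical pattern: an 'x', one or more 'y', then a 'z'.
    for text, (whole, anywhere) in check(r"xy+z", ["xyz", "xz", "axyyzb"]).items():
        print(text, "whole:", whole, "anywhere:", anywhere)
```

Swap in one of the patterns and its candidate strings to see which interpretation gives the answer you expected.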

1) Which of the following matches the regexp a(ac)*[d-z]*d

a. abcd
b. aac
c. ad
d. aacacacd

2) Which of the following matches the regex c(aa)+d

a. caad
b. cd
c. caaad
d. aaaad

3) Which of the following matches the regex b(\w)*st

a. beast
b. b1324@st
c. b99999st
d. boast
e. bst

4) Which of the following matches the regex [^b]at

a. at
b. cat
c. chat
d. bat

5) Which of the following matches the regex <[^>]+>[^<]*</[^>]+>

a. <b>is this a tag?</b>
b. <apple<cake>>
c. <apple>cake<pie>
d. <xml>this is xml

6) Which of the following matches the regex 123|abc

a. 123|abc
b. 12abc
c. 123bc
d. 123

7) Which of the following matches the regex ([a-z]+[!?.]\s){2,}

a. hey.
b. boom. boom. pow!
c. hello.goodbye!
d. !!!

Now that you are a pro at regular expressions, answer the following questions in your write-up.

1) Write a regular expression to match a US phone number of the format XXX-XXX-XXXX where X is a digit between 0-9 and the first digit cannot be a 0.

2) Write a regular expression for each of the following sets of binary strings (i.e., strings containing only 0s and 1s)

a) includes the substring 101
b) doesn't contain three consecutive 0's
c) alternates 0 and 1, (i.e. 010101, 1010, 10, 01, 1, 0 are all valid)

3) Write a regular expression to match words that are palindromes of length 5 (e.g., radar, civic, kayak, level)

4) Write a regular expression to match words which start and end with a vowel, but have no vowels in between (e.g., apple, asthma, urge)



Part 2: Scraping Text Data with Python (50%)


Get Started

For the following exercise, feel free to use any tool you'd like to gather data in an automated fashion, but the course staff will only support Python. If you have Unix or Mac OS X, Python should already be installed (try typing "python" into your terminal application). If you have Windows, you may need to get it here (you probably want the Python 2.6.4 Windows Installer). You can always ssh onto nice.harvard.edu and work there.

Download the following scraper.zip file from Samir's lecture on Python. This has the BeautifulSoup.py library and a handy util.py library. It also has the code written during lecture, which is a good place to get started (to remind yourself of how scraping works).


Scrape Text

Pick at least 3 websites from which you want to collect data. Using the code in the scraper.zip file you downloaded, modify or write your own script to scrape some interesting text from the websites you chose, and output the text to output.txt. Note: from the command line, you can use the ">" operator to redirect whatever your script prints to the terminal into a file (for example, "python myscript.py > output.txt").
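To give a rough idea of the overall shape of such a script, here is a minimal sketch. It deliberately uses only the Python 3 standard library (html.parser and urllib.request) rather than the BeautifulSoup.py and util.py files from scraper.zip, so treat it as an alternative illustration and not as the lecture code. The names TextExtractor, extract_text, fetch, and scrape, and the script name scrape_text.py, are ours, and it assumes the pages are encoded as UTF-8.

```python
import sys
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect text between tags, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def fetch(url):
    """Download a page and decode it (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def scrape(urls, fetch_page=fetch):
    """Return the text of every page, one page per line."""
    return "\n".join(extract_text(fetch_page(url)) for url in urls)

if __name__ == "__main__":
    # Usage: python scrape_text.py URL1 URL2 URL3 > output.txt
    urls = sys.argv[1:]
    if urls:
        print(scrape(urls))
```

This grabs all of the text on each page. For real sites you will usually want to narrow it down to the part of the page that holds the lyrics, play, or blog post you care about.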

Here are a few ideas of places you could find your text. Feel free to do something else, though, if you have other interesting ideas.

  • What words does your favorite musical artist use most often? Use azlyrics or another lyrics site to find a list of their lyrics and scrape it!
  • What words appear most often in Shakespeare's plays? Use gutenberg to find out.
  • Beatles lyrics can be scraped from here.
  • What word-usage trends do your favorite bloggers follow? Scrape their blogs!

This task is meant to be fairly open-ended --- you may use any text you'd like to make your output.txt file. You will turn in the following for this part:

  • an output.txt file,
  • a folder called DataCollection with the code that does the data scraping,
  • in your write-up, a description of how to run your code --- it's fine if this is something simple like "run python myscript.py at the command line".

There are a few rules you must follow:

  • No copy-pasting text into your output.txt file. This must be done programmatically.
  • You must gather text from at least 3 different URLs. These can be multiple pages of data from a single website or they can be from several different websites altogether.  More is encouraged, but let this be a minimum.  The point is to equip you to scrape any website for data.
  • You don't need to do this all in one script (although it is encouraged for simplicity). However, you must explain how exactly you generated output.txt in your write-up especially if there are several steps.

WRITEUP: Tell us what you liked/hated/found easy/found difficult about mining data. Outline a question that you hope to answer through this visualization.

Visualize

You are going to visualize your text data using the ManyEyes site. Upload your output.txt file to ManyEyes, and create a Tag Cloud visualization of your data. Include the visualization in your write-up (i.e. click on "share this visualization", and copy-and-paste the link into your write-up). Note: There is a 5MB data limit for uploading to ManyEyes. If you go over that limit, you are allowed to manually delete text from your output.txt file so that you can make your Tag Cloud.
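Before uploading, it can help to preview which words are most frequent and to confirm that the file fits under the limit. The sketch below is one way to do that, assuming Python 3. The helper names top_words and fits_limit are ours, the 5MB limit is assumed to mean 5 * 1024 * 1024 bytes, and the tokenization (lowercase letters with an optional internal apostrophe) is a guess that may not match how ManyEyes splits words.

```python
import os
import re
from collections import Counter

# Assumption: "5MB" means 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024

def top_words(text, n=20):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(n)

def fits_limit(path, max_bytes=MAX_BYTES):
    """True if the file at path is no larger than max_bytes."""
    return os.path.getsize(path) <= max_bytes

if __name__ == "__main__":
    path = "output.txt"
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as f:
            for word, count in top_words(f.read()):
                print(count, word)
        print("within size limit:", fits_limit(path))
```

If the most frequent words are all things like "the" and "and", that is worth mentioning when you discuss whether the Tag Cloud answers your question.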

WRITEUP: Briefly explain in your writeup what you like and dislike about the visualization you've produced through ManyEyes.  Does it answer the question you intended to answer about your data?  Embed your visualization in the writeup.  

 


Part 3: Your First Processing Sketch (20%)

The goal of this section of the assignment is to get your feet wet with Processing, particularly for those with less programming in their backgrounds.


Reading

To begin this part of the homework, read Chapter 4 of the Fry book after you download the timeseries sketch example from the chapter. In this downloaded file you will find the top level sketch file timeseries_sketch.pde, two additional classes FloatTable.pde and Integrator.pde, and the data data/milk-tea-coffee.tsv. Each of the visualization and interaction techniques discussed in Chapter 4 is included somewhere in the sketch, so make sure to look all the way through the code. Read carefully through this chapter, and follow along in the example code. Make sure you explore the different methods, as this sketch will be the framework you use to visualize data that you will scrape from the web. We've commented this sketch exhaustively for students who have less experience programming.


Poking around in Processing

Your tasks (include a screenshot of each one in your writeup):


  1. Change the color of the dots to make them red. You'll do this by changing the hex color code inside a call to stroke() in the draw() function. See a hex color code chart here.
  2. Get rid of the scattered dots and change them to a line graph with no dots.  You'll do this by commenting out the code that draws scattered data points and uncommenting the code that draws a line. 
  3. Change the plot to a bar chart.

If you are a less experienced (or totally inexperienced) programmer, you should begin working through the first four lessons of the orange Shiffman text. Alternatively, you might consider working through at least the first 4 or 5 tutorials here.

WRITEUP: In your writeup, be sure to include the screenshots noted above.  Also include a few comments about what you liked/hated/found hard/found easy about this first exposure to Processing.

 


Extra Credit: Change it up


Add New Numerical Data

Think about an area that interests you, and find a source of numerical data online that you can scrape -- sports stats, the price of shoes on Zappos, the course numbers of the classes offered in different departments at Harvard, etc. This data should contain three different sets of numbers with at least 50 numbers each. Like the milk-tea-coffee example, you'll need an additional column of numbers that will serve as your x-axis (such as year). If your data does not contain this information, you can just number the entries (1 -> number_of_entries). Note that all three of your numerical sets should contain the same number of items. Write a Python script to scrape this data, and place your script and data file in the timeseries_sketch/data directory. Include a README that explains how to run your script.  If you would like to use a tool other than Python, do so and simply explain in your README. 

The data file should mimic the data/milk-tea-coffee.tsv file format. You can see an example file here. You can find some example datasets here or here, though you should pick one that interests you personally, even if that means using a different data source!
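As a rough illustration of producing such a file, the sketch below formats three equal-length series as tab-separated text. It assumes Python 3, and it assumes the layout is a header row of column names followed by one row per entry with the x-axis value in the first column; check this against data/milk-tea-coffee.tsv before relying on it. The function name to_tsv and the sample numbers are made up, and the sample is far shorter than the 50 entries the assignment asks for.

```python
import csv
import io

def to_tsv(x_label, series, x_values=None):
    """Format columns as tab-separated text.

    series maps a column name to a list of numbers; every list must have
    the same length. If x_values is None, rows are numbered 1..n.
    """
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1:
        raise ValueError("all series must contain the same number of items")
    n = lengths.pop()
    if x_values is None:
        x_values = range(1, n + 1)
    elif len(x_values) != n:
        raise ValueError("x_values must have one entry per row")

    names = list(series)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([x_label] + names)  # header row
    for i, x in enumerate(x_values):
        writer.writerow([x] + [series[name][i] for name in names])
    return buffer.getvalue()

if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = {"Points": [10, 12, 8], "Rebounds": [3, 4, 6], "Assists": [5, 6, 2]}
    print(to_tsv("Game", sample), end="")
```

Redirect the printed output into a file in the data directory, or write the returned string to a file directly from your scraping script.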

WRITEUP: Include a link to your script and your data file in your writeup and state briefly why you chose this dataset and what question you hope to answer through this visualization. 


Visualize

You will be modifying the timeseries sketch, so be sure to save the original off somewhere if you want to refer back to it later. Modify the sketch to read your new data file. You may need to change the sizes of the tabs to accommodate larger data labels, and you may also need to change the frequency of tick marks on the axes, along with any other changes you think are necessary to effectively show this new data.

Once you get your data into the sketch and the program running smoothly, think of some way to modify the visualization to better reflect your data. For example, perhaps rendering your data on a log scale will show the trends in your data more clearly. Or, maybe your data needs to be sorted from low to high. Feel free to change the color scheme, and pick a way to render your data that makes the most sense (e.g., bars, scattered points, curved line, etc).
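One way to try out ideas like these before touching the sketch is to transform the numbers in Python and look at the result. The following is only a sketch of that kind of preprocessing, assuming Python 3; the helper names log10_values and sort_rows are ours, and doing the transformation inside the Processing sketch instead is equally valid.

```python
import math

def log10_values(values):
    """Base-10 logarithm of each value; every value must be > 0."""
    if any(v <= 0 for v in values):
        raise ValueError("a log scale needs strictly positive values")
    return [math.log10(v) for v in values]

def sort_rows(rows, key_index):
    """Sort rows (lists of numbers) from low to high by one column."""
    return sorted(rows, key=lambda row: row[key_index])

if __name__ == "__main__":
    # Made-up values spanning several orders of magnitude.
    print(log10_values([1, 10, 1000]))
    print(sort_rows([[3, 30], [1, 10], [2, 20]], 0))
```

Note that a log scale only works for positive values, so data containing zeros or negative numbers would need a different treatment.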

WRITEUP: Export your sketch as an applet, and include the applet in your write-up. Include a discussion about your data and why you chose it, and be sure to include links to the original data source. Next, discuss your design decisions for the modifications you made to the original sketch.



Submission Instructions


Click here for submission instructions. 



Many Eyes Example

If you don’t see the Many Eyes visualization here, wait a few seconds. The servers can be slow, unfortunately.