CS 171 Visualization


Homework 1: Data Scraping

Due date: February 18th at 5pm EST


The goal of this homework is to learn how to scrape data from the web using Python and to start getting comfortable with Processing. You will practice writing Python scripts, and then write a small program to scrape information from a website of your choice. Next, you will visualize your data using the online system ManyEyes. Finally, you will start to dig into Processing, and modify a Processing sketch to visualize numerical data you scrape from the web.

 


 

Part 1: Regular Expressions (20 points)

To warm up your regex chops, answer the following questions (you do not need to include the answers in your write-up). You can check your answers against the regular expression checker used in class. On this page you can also find a nice reference of operators by clicking the button "Regular Expression Exemples" (spelling error is not ours!). The following questions were taken from the University of Brighton.
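If you would rather check candidates on your own machine, here is a minimal sketch using Python's built-in re module. It assumes that "matches" means the whole string matches the pattern, not just some part of it. The helper name full_match and the pattern colou?r are illustrative only and are not taken from the questions below.

```python
import re


def full_match(pattern, text):
    """Return True if the entire string matches the pattern."""
    # Wrapping the pattern in a non-capturing group and appending \Z
    # forces the match to reach the end of the string, and keeps
    # alternations such as 'a|b' together as one unit.
    return re.match(r'(?:' + pattern + r')\Z', text) is not None


# Illustrative pattern (not one of the homework questions).
for candidate in ['color', 'colour', 'colouur', 'colors']:
    print(candidate, full_match(r'colou?r', candidate))
```

Swap in any pattern and list of candidate strings to test your own answers.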

1. Which of the following matches regexp a(ab)*a

a) abababa
b) aaba
c) aabbaa
d) aba
e) aabababa

2. Which of the following matches regexp ab+c?

a) abc
b) ac
c) abbb
d) bbc

3. Which of the following matches regexp a.[bc]+

a) abc
b) abbbbbbbb
c) azc
d) abcbcbcbc
e) ac
f) asccbbbbcbcccc

4. Which of the following matches regexp abc|xyz

a) abc
b) xyz
c) abc|xyz

5. Which of the following matches regexp [a-z]+[\.\?!]

a) battle!
b) Hot
c) green
d) swamping.
e) jump up.
f) undulate?
g) is.?

6. Which of the following matches regexp [a-z][\.\?!]\s+[A-Z]
(\s matches any whitespace character, such as a space or a tab)
a) A. B
b) c! d
c) e f
d) g. H
e) i? J
f) k L

7. Which of the following matches regexp <[^>]+>

a) <an xml tag>
b) <opentag> <closetag>
c) </closetag>
d) <>
e) <with attribute="77">

Now that you are a pro at evaluating regular expressions, answer the following questions in your write-up.

1. Write a regular expression to describe strings containing only the letters {a, b, c} that are in sorted order.

2. Write a regular expression for each of the following sets of binary strings (i.e., strings that contain only 0's and 1's).

a) contains at least three consecutive 1s
b) contains the substring 110
c) contains the substring 1101100
d) doesn't contain the substring 110

3. Write a regular expression that would return words whose letters are in alphabetical order, e.g., almost and beef.

 


 

Part 2: Scraping Text Data with Python (40 points)

Get Started

If you have Unix or Mac OSX, Python should already be installed (try typing "python" into your terminal application). If you have Windows, you may need to get it here (you probably want Python 2.6.1 Windows Installer).

Download the following zip file from Samir's lecture on Python. This has the BeautifulSoup.py library and a handy util.py library. It also has the code written during lecture, which is a good place to get started (to remind yourself of how scraping works).

Scrape Text

Pick several websites from which you want to collect data. Using the code in the scraper.zip file you downloaded, modify or write your own script to scrape some interesting text from the websites you chose, and output the text to output.txt. Note: from the command line, you can use the ">" operator to redirect whatever your script prints to the terminal into a data file.

Here are a few ideas of places you could find your text. Feel free to do something else, though, if you have other interesting ideas.

  • What words does your favorite musical artist use most often? Use azlyrics or another lyrics site to find a list of their lyrics and scrape it!
  • What words appear most often in Shakespeare's plays? Use gutenberg to find out.
  • What word-usage trends do your favorite bloggers follow? Scrape their blogs!

This task is meant to be fairly open-ended --- you may use any text you'd like to make your output.txt file. You will turn in the following for this part:

  • an output.txt file,
  • a folder called DataCollection with the code that does the data-mining,
  • in your write-up, a description of how to run your code --- it's fine if this is something simple like "run python myscript.py at the command line".

There are a few rules you must follow:

  • No copy-pasting text into your output.txt file. This must be done programmatically.
  • Updated: You must gather text from at least 5 different URLs. These can be multiple pages of data from a single website or they can be from several different websites altogether.  More is encouraged, but let this be a minimum.  The point is to equip you to scrape any website for data.
  • You don't need to do this all in one script (although it is encouraged for simplicity). However, you must explain how exactly you generated output.txt in your write-up especially if there are several steps.
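As a rough illustration of the rules above, here is a minimal sketch that gathers text from a list of URLs and writes it to output.txt. It is not the code from scraper.zip: it uses only the Python 3 standard library (html.parser and urllib.request) instead of BeautifulSoup.py and util.py, so it is an assumption about one way to do this, and the Python 2.6 installer mentioned earlier would need different module names. The names TextExtractor, html_to_text, and scrape are hypothetical, and the URLs in the comment are placeholders.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r'\s+', ' ', ' '.join(parser.chunks)).strip()


def scrape(urls, out_path='output.txt'):
    """Download each URL and write its text to out_path, one page per line."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for url in urls:
            raw = urlopen(url).read()
            html = raw.decode('utf-8', errors='replace')
            out.write(html_to_text(html) + '\n')


# Example call (placeholder URLs; replace with at least 5 real pages):
# scrape(['http://example.com/page%d.html' % i for i in range(1, 6)])
```

A real scraper would usually pull out only the part of each page you care about (lyrics, a blog post body) rather than all visible text, which is where BeautifulSoup becomes useful.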

Visualize

You are going to visualize your text data using the ManyEyes site. Upload your output.txt file to ManyEyes, and create a Tag Cloud visualization of your data. Include the visualization in your write-up (i.e., click on "share this visualization", and copy-and-paste the link into your write-up). Note: There is a 5MB data limit for uploading to ManyEyes. If you go over that limit, you are allowed to manually delete text from your output.txt file so that you can make your Tag Cloud. At the end of this homework is an example visualization of Beatles lyrics scraped from here.
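Before uploading, it can help to check the file size and preview the most frequent words. The sketch below is one hedged way to do that in Python 3. It assumes the 5MB limit means 5 * 1024 * 1024 bytes and that a tag cloud sizes words roughly by how often they occur; ManyEyes may count or filter words differently. The function names top_words and check_file are hypothetical.

```python
import os
import re
from collections import Counter

# Assumption: the 5MB upload limit is read as 5 * 1024 * 1024 bytes.
SIZE_LIMIT = 5 * 1024 * 1024


def top_words(text, n=10):
    """Return the n most common lowercase words in text as (word, count)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


def check_file(path='output.txt'):
    """Print the file size and the most frequent words in the file."""
    size = os.path.getsize(path)
    status = 'over' if size > SIZE_LIMIT else 'within'
    print('%d bytes (%s the limit)' % (size, status))
    with open(path, encoding='utf-8') as f:
        for word, count in top_words(f.read()):
            print(word, count)


# Example call, once output.txt exists:
# check_file('output.txt')
```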

 


 

Part 3: Your First Processing Sketch (40 points)

The goal of this section of the assignment is to get some more experience scraping data, and to get your feet wet with Processing.

Reading

To begin this part of the homework, download the timeseries sketch example from Chapter 4 of the Fry book, then read the chapter. In the downloaded file you will find the top-level sketch file timeseries_sketch.pde, two additional classes FloatTable.pde and Integrator.pde, and the data file data/milk-tea-coffee.tsv. Each of the visualization and interaction techniques discussed in Chapter 4 is included somewhere in the sketch, so make sure to look all the way through the code. Read carefully through this chapter and follow along in the example code. Make sure you explore the different methods, as this sketch will be the framework you use to visualize data that you will scrape from the web.

Scrape Numerical Data

Think about an area that interests you, and find a source of numerical data online that you can scrape -- sports stats, the price of shoes on Zappos, the course numbers of the classes offered in different departments at Harvard, etc. This data should contain three different sets of numbers with at least 50 numbers each. Like the milk-tea-coffee example, you'll need an additional column of numbers that will serve as your x-axis (such as year). If your data does not contain this information, you can just number the entries (1 -> number_of_entries). Note that all three of your numerical sets should contain the same number of items. Write a Python script to scrape this data, and place your script and data file in the timeseries_sketch/data directory. Include a README that explains how to run your Python script.

The data file should mimic the data/milk-tea-coffee.tsv file format. You can see an example file here.
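The sketch below shows one way to produce such a file from Python. It assumes the layout is a header row of column names followed by one row per x value, with tabs between columns; compare this against the downloaded milk-tea-coffee.tsv before relying on it. The function name to_tsv, the column names, the numbers, and the output file name are all made up for illustration.

```python
def to_tsv(x_label, x_values, series):
    """Build tab-separated text: a header row, then one row per x value.

    series is a list of (column_name, values) pairs, one per data column.
    """
    # Every data column must have exactly one value per x value.
    for name, values in series:
        if len(values) != len(x_values):
            raise ValueError('%s has %d values, expected %d'
                             % (name, len(values), len(x_values)))
    lines = ['\t'.join([x_label] + [name for name, _ in series])]
    for i, x in enumerate(x_values):
        row = [str(x)] + [str(values[i]) for _, values in series]
        lines.append('\t'.join(row))
    return '\n'.join(lines) + '\n'


# Made-up numbers, for illustration only (real data needs 50+ rows).
years = [2001, 2002, 2003]
table = to_tsv('Year', years, [('A', [1.5, 2.0, 2.5]),
                               ('B', [10, 20, 30]),
                               ('C', [0.1, 0.2, 0.3])])
print(table)

# To save it with the sketch's data (file name is hypothetical):
# with open('my-data.tsv', 'w') as f:
#     f.write(table)
```

Raising an error on mismatched lengths catches the "same number of items" requirement early, before the sketch tries to read the file.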

Visualize

You will be modifying the timeseries sketch, so be sure to save a copy of the original somewhere if you want to refer back to it later. Modify the sketch to read your new data file. You may need to change the sizes of the tabs to accommodate larger data labels, and you may also need to change the frequency of tick marks on the axes, along with any other changes you think are necessary to effectively show this new data.

Once you get your data into the sketch and the program running smoothly, think of some way to modify the visualization to better reflect your data. For example, perhaps rendering your data on a log scale will show the trends in your data more clearly. Or, maybe your data needs to be sorted from low to high. Feel free to change the color scheme, and pick a way to render your data that makes the most sense (i.e., bars, scattered points, curved line, etc.).
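To see what a log scale does to your numbers before changing the sketch, here is a small Python illustration of the arithmetic. Processing's built-in map() function performs the linear rescaling; the log version below simply takes log10 of the value and of the data range first. The function names map_linear and map_log, the data range, and the 400-pixel plot height are assumptions chosen for the example, and the same idea would need to be rewritten in Processing to use in the sketch.

```python
import math


def map_linear(value, in_lo, in_hi, out_lo, out_hi):
    """Linearly rescale value from [in_lo, in_hi] to [out_lo, out_hi]."""
    return out_lo + (out_hi - out_lo) * (value - in_lo) / float(in_hi - in_lo)


def map_log(value, in_lo, in_hi, out_lo, out_hi):
    """Rescale value on a base-10 log scale; value and range must be > 0."""
    return map_linear(math.log10(value), math.log10(in_lo),
                      math.log10(in_hi), out_lo, out_hi)


# Data from 1 to 10000 drawn on a plot area 400 pixels tall (made-up numbers).
for v in [1, 10, 100, 1000, 10000]:
    print(v, map_linear(v, 1, 10000, 0, 400), map_log(v, 1, 10000, 0, 400))
```

On the linear scale the small values crowd together near one edge of the plot, while on the log scale each factor of ten gets the same amount of space. Note that a log scale cannot represent zero or negative values.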

Export your sketch as an applet, and include the applet in your write-up. Include a discussion about your data and why you chose it, and be sure to include links to the original data source. Next, discuss your design decisions for the modifications you made to the original sketch. Also include a few comments about what you liked/hated/found hard/found easy about data scraping and Processing.

 


 

Submission Instructions

Your write-up will be submitted as a webpage -- you can use any webpage layout you wish, or cut and paste the HTML source from this page and plug in your work. To submit your homework, create a folder named lastname_firstinitial_hw1 and place your webpage write-up and your DataCollection directory in this folder, along with your modified timeseries sketch directory -- please make sure that all of the links in your write-up are relative to this folder! Compress the folder (please use .zip compression) and submit it on the course iSite page in the HW 1 dropbox.

If we cannot access your work or links because these directions are not followed correctly, we will not grade your work.

 


 

ManyEyes Example

If you don't see the ManyEyes visualization here, wait a few seconds. The servers can be slow, unfortunately.