Homework 4

Due Date: Wednesday, April 23 at noon EST

Overview

The goals of this homework are for you to:

Part 1: Collecting Data with Python (40 points)

In this part of the assignment you will write a Python program to scrape information from websites and create an output.txt file. This output.txt file will contain a body of text which you will upload to ManyEyes and visualize as a Tag Cloud.

This task is meant to be fairly open-ended --- you may use any text you'd like to make your output.txt file. You will turn in the following for this part:
There are a few rules you must follow:
Getting Started

You may use any language you'd like, but the course staff will only support Python. If you want to try something else, feel free to, but we may not be able to help. Feel free to post to the message board, though, as some of your classmates may have answers about other languages.

If you have Unix or Mac OS X, Python should already be installed (try typing "python" into your terminal application). If you have Windows, you may need to get it here (you probably want Python 2.5.2 Windows Installer).

Next, download the following zip file from Thomas's lecture on Python. This has the BeautifulSoup.py library and a handy util.py library. It also has the code written during lecture, which is a good place to get started (to remind yourself of how scraping works).

The regular expression checker used in class is here. It may be useful as you start to write your own regular expressions.
Scraping

Pick several websites from which you want to collect data. Using the code in the scraper.zip file you downloaded, modify or write your own script to scrape the interesting data from the websites you chose, and output the data to output.txt. Note: from the command line, you can use the ">" operator to redirect the data printed to the terminal into a data file.

Here are a few ideas of places you could find your text. Feel free to do something else, though, if you have other interesting ideas!
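As a rough illustration of the scraping step, the sketch below strips the markup from a page and appends the remaining text to output.txt. It is not the code from scraper.zip: it assumes Python 3 and uses only the standard library rather than BeautifulSoup.py or util.py, and the names TextExtractor, extract_text, fetch, and scrape_to_file are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_to_file(urls, path="output.txt"):
    """Write the text of each page in urls to path, one page per line."""
    with open(path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(extract_text(fetch(url)) + "\n")
```

Calling scrape_to_file with a list of page addresses of your choice writes output.txt directly, so no ">" redirection is needed. In practice you will usually want to narrow extract_text down to the part of each page that holds the interesting text, for example with a regular expression.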
Next, upload your output.txt file to the Many Eyes site, and create a Tag Cloud visualization of your data. Include the visualization in your write-up (i.e. click on "share this visualization", and copy-and-paste the link into your write-up). Note: There is a 5MB data limit for uploading to ManyEyes. If you go over that limit, you are allowed to manually delete text from your output.txt file so that you can make your Tag Cloud. Here is an example visualization of Beatles lyrics scraped from here.
A big thanks to Thomas Carriero for putting together this great Python exercise!
Part 2: Exploring Data with Tableau (20 points)

INFO 424 at University of Washington is a visualization course by Maureen Stone and Polle Zellweger. In the fall of 2007, the class filled out a survey about themselves to use as a dataset in Tableau. The survey questions and answers (raw and cleaned) can be found here. For this part of the assignment, you will load this survey data in Tableau and explore the data.

You will do the following:

For each question, include in your write-up: the question, the answer, and a visualization that supports your answer. Try to find the best (concise, clear) visualization for each question. Each visualization is worth a maximum of 5 points. The most interesting questions and visualizations will be presented in class.
Part 3: Networks, Graphs, and the T (40 points)

For this part of the assignment you will be using data about Boston's subway network system, the T, to create a graph visualization with Processing. The T can be modeled as an undirected graph (Stops are nodes/vertices, and Connections are edges).

Did you know?... the Red Line is so named because its northernmost station used to be at Harvard University, whose school color, as you all know, is crimson. To learn more about the history of the T, you can go here and here.

3.1) Getting the Tools in Hand

First, download this zip file containing the first two example sketches from Chapter 8 of Fry's book. Place the sketches in your Processing sketchbook directory.

Second, read carefully through the first part of Chapter 8 (pp. 220 - 242, Subsections: Simple Graph Demo, A More Complicated Graph and Approaching Network Problems) and work through the two example sketches. Both sketches are named according to their subsection title in the chapter.

3.2) The T

In this part of the assignment you will apply your Processing skills to visualize a real world network: the T. We'll focus not only on visualizing the network, but also on data acquisition from the web (preparation for the final projects).

3.2.1) Setting Up

Create a new folder named "HW4" in your sketchbook directory and download the following Framework sketch. Copy the unzipped sketch folder "Framework" to your newly created "HW4" directory.

Open the sketch and familiarize yourself with the code. Run the sketch. You can drag Nodes by pressing and holding down the left mouse button. If you press "p", the next call to draw() writes everything into a PDF file "output.pdf" and places it in your sketch folder.

Did you know?... The PDF file format is a vector format. This allows you to resize the image without loss of resolution.

Save a copy of the "Framework" sketch as "Ex_3_2_1" in the "HW4" directory. You will now add a field col to the Edge class, which allows us to specify its color.
Add a fourth argument col of type String to the constructor of the Edge class. Assign an edge color based on the first letter of the argument col. We only allow four different colors:
The addEdge() routine in the main tab should also take a fourth argument col. Change the code accordingly. You also have to make changes to the draw() method of the Edge class. Add color to the edges of the graph. They are also a little bit too thin -- increase their stroke weight by one.

3.2.2) Acquiring the Data

Next, you'll acquire all the needed data by parsing the following webpage: data. We collected station and connection data from the MBTA website. (Note: Data for the "Minutes" column of the "Connections" table was not available for all lines -- some values may differ from the real amount of time it would take to travel between the two corresponding stations.)

Your task is now to write two Python scripts (or scripts in whichever other language you used for the first part of this assignment): "stations.py" and "connections.py". The first extracts the list of stations from the first table of data. The second reads the connection data.

The output of "stations.py" should be a list of station names, with one station name per row. Write the list to a file "stations.csv". The file extension ".csv" stands for comma-separated values. It's similar to the ".tsv" file format we were using before. The only difference is that the tabs are replaced by commas. (We introduce it here because it's a widely used format to store data and may be useful for your final projects.)
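One possible shape for the table-to-CSV step is sketched below in Python 3 with only the standard library. The layout of the real data page is not reproduced here, so the sketch assumes plain, non-nested HTML tables with a header row; the names TableParser, parse_tables, and write_csv are hypothetical, and SAMPLE is a made-up stand-in for the page, not actual MBTA data.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Trim surrounding white space before storing the cell.
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def parse_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def write_csv(rows, out):
    """Write rows to an open text file as comma-separated values."""
    csv.writer(out, lineterminator="\n").writerows(rows)


# Made-up sample standing in for the data page (assumed structure).
SAMPLE = """
<table><tr><th>Station</th></tr>
<tr><td> Alewife </td></tr><tr><td>Davis</td></tr></table>
<table><tr><th>From</th><th>To</th><th>Color</th><th>Minutes</th></tr>
<tr><td>Alewife</td><td>Davis</td><td>Red</td><td>2</td></tr></table>
"""

tables = parse_tables(SAMPLE)
stations = [[row[0]] for row in tables[0][1:]]  # drop the assumed header row
buffer = io.StringIO()
write_csv(stations, buffer)
print(buffer.getvalue())
```

In a real script you would pass a file opened with open("stations.csv", "w", newline="") instead of the StringIO buffer. The csv module also quotes any field that itself contains a comma, which a hand-written join on "," would not do.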
The information about the connections should be stored in a file named "connections.csv". The row format there is defined as follows: "From", "To", "Color", "Minutes". Store copies of your Python scripts in a folder called "TheTDataCollection" in your "HW4" directory. Include instructions in your write-up about how to run the scripts.

Now we are ready to define the Node locations on the screen. Download the Using_Your_Own_Data sketch and place the unzipped sketch folder into your "HW4" directory. We used this sketch in HW2 to define locations on a map. We can reuse this code to define the locations of the stations now. Save a copy of "Using_Your_Own_Data" as "Ex_3_2_2". Add the following MBTA map to the sketch. We'll use it to define the 2D locations of our T stops.

Change the code so that it reads and shows station names. Use the Table class to read in the "stations.csv" file. Store the station names together with their x and y coordinates in a file "locations.csv". (Row format: "Station Name", "x", "y".)

Remark: If you change the argument "TAB" of the split command to ',' in the Table class, it is able to read comma-separated files. By adding another line of code, we make sure that the white space at the beginning and end of the strings is removed:

Whew! Take a deep breath as the painful part is done -- the data is acquired.

3.2.3) Visualization of the Network

We'll now visualize the network. Before you start, save "Ex_3_2_1" as "Ex_3_2_3". Add the Table class to the sketch (with the changes from the above remark).

Load the two files "connections.csv" and "locations.csv" into tables. Load this data by making changes to the loadData() and the addNode() routines. If you now run the sketch you will see the T network.

Next, add code so that the station name is displayed in the upper right corner of the screen when the mouse rolls over a station.

3.2.4) Shortest Path

Save the previous sketch as "Ex_3_2_4".
What is a shortest path in a network? That's pretty simple.
Let us assume the following scenario:

You do not have to code a shortest path algorithm yourself --- we've done that for you. Download the ShortestPath file and add it to your sketch. It is (hopefully!) straightforward to use:

1. Add two global arrays of type "boolean", one named activeNodes and the other named activeEdges. Initialize them by adding initializeActiveDataStructures(); to the end of the setup function. If activeNodes[nodeIndex] evaluates to true, then the node with index nodeIndex is part of the shortest path. Similarly for the edges: if activeEdges[edgeIndex] evaluates to true, then the edge with index edgeIndex is part of the shortest path.
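To give a feel for what a shortest path computation involves, here is a minimal Python sketch of Dijkstra's algorithm on an undirected graph with travel times as edge weights. It is not the code in the ShortestPath file, whose internals are not shown here; the function name shortest_path, the (from, to, minutes) edge tuples, and the node names A to D are assumptions made for illustration.

```python
import heapq


def shortest_path(edges, start, goal):
    """Return (total_minutes, nodes_on_path), or (inf, []) if goal is unreachable."""
    # Build an undirected adjacency list: each edge can be travelled both ways.
    graph = {}
    for a, b, minutes in edges:
        graph.setdefault(a, []).append((b, minutes))
        graph.setdefault(b, []).append((a, minutes))

    best = {start: 0.0}      # best known travel time to each node
    previous = {}            # predecessor of each node on its best path
    visited = set()
    queue = [(0.0, start)]   # priority queue ordered by travel time

    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue         # stale queue entry, a shorter route was found already
        visited.add(node)
        if node == goal:
            break
        for neighbor, minutes in graph.get(node, []):
            candidate = dist + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))

    if goal not in best:
        return float("inf"), []

    # Walk the predecessor links back from goal to start.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return best[goal], path


# Hypothetical four-stop network: the direct A-C link is slower than going via B.
example_edges = [("A", "B", 2), ("B", "C", 2), ("A", "C", 5), ("C", "D", 1)]
print(shortest_path(example_edges, "A", "D"))
```

The nodes in the returned path play the role of the entries set to true in activeNodes, and the consecutive pairs along the path play the role of the entries set to true in activeEdges.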
Those who are interested in the underlying algorithm may find the article about Dijkstra's Shortest Path algorithm a good starting point.

Ok. Now it's your turn. Add four global variables: A and B of type "Node", numOfNodes of type "int" and numOfMinutes of type "float". Add code to the mousePressed() method so that the following requirements are met:

1. If you click on a node with the right mouse button it should assign the corresponding node to the global variable A and increment the numOfNodes variable.

Run the sketch and try it... there is still something missing. You do not display the time it takes to travel from A to B! Add code to the draw() routine in the main tab so that it writes the shortest path information in the upper left corner:
That's it.

3.2.5) Color Effect

Save the previous sketch as "Ex_3_2_5".

Wouldn't it be nice to push the nonactive nodes and edges into the background, for example by coloring the edges gray? We could animate this change in color.
To do so, download the Integrator class and add it to the sketch. Use the Integrator class to animate the change in color. It's better to do this interpolation in the HSB color space. Leave the hue the same, but change the "target" saturation and "target" brightness for all the nonactive edges to 0 and 200, respectively. Set this new "target" whenever a new shortest path is going to be computed. If the display changes again to show all the nodes, change the target edge color back to its original color.
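The easing idea behind this kind of animation can be sketched in a few lines of Python. This is an illustration only, not the downloaded Integrator class: the class layout, the damping and attraction constants, and the helper fade_edge are assumptions, as is a 0 to 255 range for saturation and brightness.

```python
class Integrator:
    """Ease a value toward a target a little on every frame (assumed constants)."""

    def __init__(self, value, damping=0.5, attraction=0.2):
        self.value = value
        self.target = value
        self.velocity = 0.0
        self.damping = damping
        self.attraction = attraction

    def set_target(self, target):
        self.target = target

    def update(self):
        # Pull toward the target, then damp the velocity so the motion settles.
        force = self.attraction * (self.target - self.value)
        self.velocity = (self.velocity + force) * self.damping
        self.value += self.velocity


def fade_edge(saturation, brightness, frames):
    """Fade one nonactive edge toward gray; the hue is left untouched."""
    sat = Integrator(saturation)
    bri = Integrator(brightness)
    sat.set_target(0.0)      # no saturation means gray
    bri.set_target(200.0)    # target brightness from the assignment text
    for _ in range(frames):  # one update per call to draw()
        sat.update()
        bri.update()
    return sat.value, bri.value


print(fade_edge(255.0, 255.0, 1))
print(fade_edge(255.0, 255.0, 200))
```

Keeping one integrator per channel means the targets can be reset to the original saturation and brightness at any time, and the color eases back without a jump.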
Extra Credit: The New Trip Planner (20 points)

Choose your favorite city and visualize its public transportation system. Boston your favorite? Add the bus lines to the above network. You like Chicago more? Map out the L and the associated commuter rails.

You must acquire the data (stations, connections, travel time) from the corresponding public transportation websites (automatically if possible). Add at least one new feature to your visualization, such as a query to an online timetable that checks when the next train is leaving and displays this information near the station. You could add a ranked list of your favorite coffee shops and a function that determines the highest ranked one you can grab a cup of joe from before the start of class. What else would you love to coordinate? Feel free to change the design, features, and city. We will award points based on the amount of information you portray, and the helpfulness of your visualization.

Submission Instructions

To submit your homework, create a folder named lastname_firstinitial_hw4, and place your write-up, Python directory, and Processing sketches into it. Compress the folder and send it as an email attachment to miriah@seas.harvard.edu.