Taro Kuriyama
CS 171: Data Visualization
Problem Set #1



Part 1: Regular Expressions (20 points)

Answers denoted in bold.

1. Which of the following matches regexp a(ab)*a

a) abababa
b) aaba
c) aabbaa
d) aba
e) aabababa

2. Which of the following matches regexp ab+c?

a) abc
b) ac
c) abbb
d) bbc

3. Which of the following matches regexp a.[bc]+

a) abc
b) abbbbbbbb
c) azc
d) abcbcbcbc
e) ac
f) asccbbbbcbcccc

4. Which of the following matches regexp abc|xyz

a) abc
b) xyz
c) abc|xyz


5. Which of the following matches regexp [a-z]+[\.\?!]

a) battle!
b) Hot
c) green
d) swamping.
e) jump up.
f) undulate?
g) is.?


6. Which of the following matches regexp [a-z][\.\?!]\s+[A-Z]
(\s matches any whitespace character)
a) A. B
b) c! d
c) e f
d) g. H
e) i? J
f) k L

7. Which of the following matches regexp <[^>]+>

a) <an xml tag>
b) <opentag> <closetag>
c) </closetag>
d) <>
e) <with attribute="77">

Now that you are a pro at evaluating regular expressions, answer the following questions in your write-up.

1. Write a regular expression to describe strings containing only the letters {a, b, c} that are in sorted order.
      ^a*b*c*$
      (Assuming that the phrase "only the letters {a, b, c}" implies "only the letters {a, b, c} if any" rather than "only and each of the letters {a, b, c}")
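
      A pattern like this can be sanity-checked against a few hand-picked strings. The following is a minimal sketch using Python's re module; the helper name is_sorted_abc and the test strings are illustrative assumptions, not part of the assignment.

```python
import re

# Pattern from the answer above: any number of a's, then b's, then c's.
SORTED_ABC = re.compile(r"^a*b*c*$")

def is_sorted_abc(s):
    """Return True if s contains only a, b, c in non-decreasing order."""
    return SORTED_ABC.match(s) is not None

# Illustrative test strings (assumed examples, not from the assignment).
for s in ["", "abc", "aabbcc", "ac", "ba", "abca", "abd"]:
    print(repr(s), is_sorted_abc(s))
```

      Under the stated assumption, the empty string and strings that skip a letter (such as "ac") are accepted, while anything out of order or containing another letter is rejected.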

2. Write a regular expression for each of the following sets of binary strings (ie. strings that contain only 0's and 1's).

a) contains at least three consecutive 1s
      [01]*1{3,}[01]*
b) contains the substring 110
      [01]*110[01]*
c) contains the substring 1101100
      [01]*1101100[01]*
d) doesn't contain the substring 110
      ^((0*(10)*)*1*|1*)$
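
      One way to gain confidence in these four patterns, especially d), is to compare each one against a plain substring check on every short binary string. The sketch below does that in Python; the helper names (binary_strings, mismatches) and the length limit of 10 are assumptions chosen for illustration.

```python
import re
from itertools import product

# (pattern, reference predicate) pairs for parts a) through d).
# The patterns are copied from the answers above; the predicates are
# plain substring checks used as an independent reference.
CASES = {
    "a": (r"[01]*1{3,}[01]*", lambda s: "111" in s),
    "b": (r"[01]*110[01]*", lambda s: "110" in s),
    "c": (r"[01]*1101100[01]*", lambda s: "1101100" in s),
    "d": (r"^((0*(10)*)*1*|1*)$", lambda s: "110" not in s),
}

def binary_strings(max_len):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def mismatches(pattern, predicate, max_len=10):
    """Return the strings on which the regex and the predicate disagree."""
    regex = re.compile(pattern)
    return [s for s in binary_strings(max_len)
            if (regex.fullmatch(s) is not None) != predicate(s)]

for label, (pattern, predicate) in CASES.items():
    print(label, mismatches(pattern, predicate))
```

      An empty list for a part means the regex and the substring check agree on every binary string up to the chosen length. The check uses fullmatch so that the unanchored patterns in a) through c) are tested against the whole string, just as the anchors do in d).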

3. Write a regular expression that would return words whose letters are in alphabetical order, e.g., almost and beef.

      ^[aA]*[bB]*[cC]*[dD]*[eE]*[fF]*[gG]*[hH]*[iI]*[jJ]*[kK]*[lL]*[mM]*[nN]*[oO]*[pP]*[qQ]*[rR]*[sS]*[tT]*[uU]*[vV]*[wW]*[xX]*[yY]*[zZ]*$
      (Assuming that "return words" means "matches words" as the examples "almost" and "beef" suggest.)
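
      A pattern of this shape can also be generated from the alphabet rather than typed by hand, which makes it harder to drop or mistype a letter class. The following Python sketch shows one way to do so; the function names are hypothetical, and apart from "almost" and "beef" the test words are assumed examples.

```python
import re
import string

def alphabetical_pattern():
    """Build ^[aA]*[bB]*...[zZ]*$ from the 26 letters of the alphabet."""
    classes = ["[{}{}]*".format(c, c.upper()) for c in string.ascii_lowercase]
    return "^" + "".join(classes) + "$"

ALPHABETICAL = re.compile(alphabetical_pattern())

def letters_in_order(word):
    """Return True if the letters of word are in alphabetical order."""
    return ALPHABETICAL.match(word) is not None

# "almost" and "beef" come from the problem statement; the other words
# are assumed examples added for contrast.
for word in ["almost", "beef", "Billowy", "zebra", "know"]:
    print(word, letters_in_order(word))
```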



Part 2: Scraping Text Data with Python (40 points)


The script scrapes five books each from the Old and New Testaments of the King James Version, then prints them as free text to output.txt. The parameters can easily be changed to scrape the entire text of both testaments. To keep the file size below 5MB for visualization on Many Eyes, the five books from each testament were output separately and combined manually (for some reason, the file size was 8MB when all ten books were output at once). When run from the command line, the script generates a single file, output.txt.

To execute the script, run the following at the command line:
> python ScraperKuriyama.py

Data Source: King James Version E-Text. http://www.ebible.org/bible/kjv/kjv.htm

The visualization on Many Eyes:



Part 3: Your First Processing Sketch (40 points)


I chose to visualize the trends in the stock market over the past year by scraping the values of three major indexes: S&P 500, Dow Jones, and Nasdaq. Below is a list of modifications that I made to the original sketch:

(1) Rather than using a series of points, I chose a simple line graph to suggest the continuous nature of the indexes as they fluctuate over time.
(2) I modified the code to rescale the y-axis dynamically, based on the min/max values of the current index. Because the S&P 500, Nasdaq, and Dow Jones are indexed at very different levels, this rescaling allows one to perceive changes in each index more clearly.
(3) I included the option of switching between linear and logarithmic scales as a clickable area in the upper right corner. The default is set to linear. The difference is negligible in this short time frame of 52 weeks, but the code would certainly be useful for longer time frames. Note that switching to logarithmic mode also rescales the y-axis (with the minimum y-value no longer being 0).
(4) To accommodate the rescaling, the code for interpolation was commented out.
(5) Because indexes over the long term are followed for their trends rather than specific values, I used relatively minimal markings on both the x- and y-axes. However, a mouse rollover allows the exact value and date of each discrete data point to be identified.
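
The rescaling described in (2) and (3) can be illustrated outside Processing. The following is a minimal Python sketch, not the code from the submitted sketch: the function name value_to_y, its parameters, the pixel bounds, and the sample values are all hypothetical. It maps a data value to a vertical pixel position using the min/max of the current series, with an optional logarithmic mode.

```python
import math

def value_to_y(value, data_min, data_max, plot_top, plot_bottom, log_scale=False):
    """Map a data value to a vertical pixel coordinate.

    Pixel y grows downward, so data_min lands on plot_bottom and
    data_max lands on plot_top. All names here are hypothetical.
    """
    if log_scale:
        # Logarithmic mode needs strictly positive values.
        value, data_min, data_max = (math.log10(v) for v in (value, data_min, data_max))
    # Assumes data_max > data_min; a flat series would need special handling.
    fraction = (value - data_min) / (data_max - data_min)
    return plot_bottom - fraction * (plot_bottom - plot_top)

# Assumed sample values, not taken from the scraped data.
values = [800.0, 1000.0, 1250.0]
lo, hi = min(values), max(values)
for v in values:
    linear_y = value_to_y(v, lo, hi, 50, 450)
    log_y = value_to_y(v, lo, hi, 50, 450, log_scale=True)
    print(v, linear_y, log_y)
```

Because the bounds come from the series itself, the same function works for indexes at very different levels. Over a narrow range of values the linear and logarithmic positions stay close together, which is consistent with the small difference noted in (3).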

I included two versions of the Python script that scrapes the data. With the first, ScraperIndexesKuriyama.py, the output does not work well with Processing because the dates contain text and the floats contain commas. I therefore wrote a second version, ScraperIndexesKuriyama2.py. (Their respective output files are Indexes.tsv and Indexes2.tsv.)
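
The kind of cleanup the second version performs can be sketched as follows. This is an illustration under assumptions, not the contents of ScraperIndexesKuriyama2.py: the helper names, the date format, the YYYYMMDD integer encoding, and the sample row are all hypothetical.

```python
from datetime import datetime

def parse_number(text):
    """Convert a string such as "1,999.22" to a float by dropping commas."""
    return float(text.replace(",", ""))

def date_to_int(text, fmt="%b %d, %Y"):
    """Convert a date string to an integer of the form YYYYMMDD.

    The default format ("Feb 5, 2010") is an assumption about how the
    source page prints dates; adjust fmt if the page differs.
    """
    return int(datetime.strptime(text, fmt).strftime("%Y%m%d"))

# Hypothetical raw row (made-up values): a date followed by three index values.
raw_row = ["Feb 5, 2010", "10,012.23", "1,066.19", "2,141.12"]
clean_row = [date_to_int(raw_row[0])] + [parse_number(x) for x in raw_row[1:]]

# Tab-separated output with purely numeric fields.
print("\t".join(str(x) for x in clean_row))
```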

To execute the scraper script, run the following at the command line:
> python ScraperIndexesKuriyama2.py

Note: When last tested on OS X Leopard, the applet ran smoothly on Safari and Opera but not on Firefox.

Data Source: Google Finance. http://www.google.com/finance?q=INDEXDJX:.DJI,INDEXSP:.INX,INDEXNASDAQ:.IXIC



Comments

Although the scraping was not difficult (despite the mediocre Beautiful Soup documentation online), I did have some trouble with encoding. My output in Part 2 included non-ASCII characters that my terminal refused to print, so I had to change Python's default encoding. At this point, my output.txt file became suspiciously large, and I was unable to bring down the file size by re-saving it using non-Unicode encoding...
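
An alternative to changing the interpreter's default encoding is to name the encoding explicitly when writing the file. The Python sketch below shows that approach and reports how the file size varies with the encoding chosen. It does not establish why the original output.txt grew; the helper name encoded_file_size and the sample text are assumptions for illustration.

```python
import io
import os
import tempfile

def encoded_file_size(text, encoding):
    """Write text to a temporary file with the given encoding; return its size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # io.open takes an explicit encoding, so no global default is changed.
        with io.open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        return os.path.getsize(path)
    finally:
        os.remove(path)

# Assumed sample line containing one non-ASCII character (a pilcrow).
sample = u"\u00b6 In the beginning God created the heaven and the earth.\n"

for enc in ("utf-8", "utf-16"):
    print(enc, encoded_file_size(sample, enc), "bytes")
```

For mostly ASCII text such as this, UTF-16 takes roughly twice the space of UTF-8, so the encoding chosen when saving is one factor worth checking when a text file is larger than expected.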

For the Processing part, getting the output.tsv file into the correct format was a bit challenging. It took me a while to realize that the functions defined in FloatTable.pde were not recognizing floats with commas, such as "1,999.22". In addition, it was a hassle to convert the time-series dates to integers, then import the original date strings into the sketch using an array (ideally, this would have been done using Java's own calendar/date functions).

I enjoyed the facility with which Processing switches between different types of visual representation (line graph, bar graph, area, etc.).