High Throughput Screening Viewer
Robin Huang
Background
High throughput screening generates massive quantities of data that are hard to manage without proper visualization. Unfortunately, the available visualizations are often rough, dry and confusing. While the investigator may require many features, the public viewer may get lost in the clutter. My aim in this project is to promote a chic view of the screening world.
ChemBank is an excellent public database of screens conducted at the Broad Institute. I selected a screen seeking activators or inhibitors of glucose regulation. The screen uses cells that emit light in the presence of the substrate luciferin, glowing much as a glow stick does. A detector measures the emitted light and determines the average signal.
Drug designers must start with a library of core compounds and iteratively eliminate the ineffective ones. In a process of compressed chemical evolution, each round of screening reveals possible structural interactions of interest. The main thing everyone wants to know is whether a compound has activity. After normalizing the signal, the effect of each compound is graded using a z-score, the number of standard deviations its signal lies from the mean. Here, a score at least one standard deviation from the mean is treated as significant.
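The z-score grading described above can be sketched in a few lines. This is only an illustration: the readings are made up, the function and variable names are hypothetical, and normalizing against the mean and standard deviation of all wells is an assumption (ChemBank's own normalization may differ, for example by using control wells).

```python
from statistics import mean, stdev

def z_scores(signals):
    """Return each signal's distance from the mean, in standard deviations."""
    mu = mean(signals)
    sigma = stdev(signals)  # sample standard deviation
    return [(s - mu) / sigma for s in signals]

# Made-up luminescence readings; the last well is much brighter than the rest.
readings = [100.0, 102.0, 98.0, 101.0, 99.0, 130.0]
scores = z_scores(readings)

# Treat a score at least one standard deviation from the mean as significant.
hits = [i for i, z in enumerate(scores) if abs(z) >= 1.0]
```

With these readings only the last well clears the threshold, so it would be the one flagged as a possible activator.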
Design
Compounds naturally group into families due to structural similarity. Clusters are an intuitive method to display similarity since the eyes naturally group proximal objects together. Searching for an active compound is like discerning a particular star in the Milky Way. Compounds in the dataset each have several z-scores representing independent experiments. Clustering instantly links compounds to z-scores and reduces clutter.
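As a rough sketch of how replicate z-scores might be linked to their compound, the snippet below groups rows by compound ID and keeps both the individual scores and their average. The row layout, the IDs and the function name are assumptions for illustration, not the format of the actual dataset.

```python
from collections import defaultdict
from statistics import mean

def cluster_by_compound(rows):
    """Group (compound_id, z_score) rows into one cluster per compound."""
    clusters = defaultdict(list)
    for compound_id, z in rows:
        clusters[compound_id].append(z)
    return {
        cid: {"replicates": zs, "average": mean(zs)}
        for cid, zs in clusters.items()
    }

# Made-up rows: each compound appears once per independent experiment.
rows = [("C001", 1.2), ("C002", -0.3), ("C001", 0.8), ("C002", -0.5), ("C003", 2.1)]
clusters = cluster_by_compound(rows)
```

Each cluster then carries everything needed for the collapsed view (the average) and the expanded view (the replicates).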
Experimentation is always part of the design process. From the start, z-scores were the crucial data, so a scatter plot was necessary. I started out by clustering z-scores in a ring and quickly gave in to tinkering with animation effects. By chance, changing radial parameters created flower-like clusters. Expanded clusters intruded on nearby flowers; I solved this by fading out the other flowers.
User interaction was pivotal in deciding how to link data. Data can easily overwhelm the display, so the most direct approach is best. A simple, minimalist view depicts only the compound ID and average z-score, which at first glance is what most users want. Cursor changes prompt the user to click to expand clusters and reveal the linked data in detail. Besides y-axis position, color helped convey the scale of the z-scores and became a matching theme for the linked data.
Aesthetics played a big role in this visualization. Though unnecessary for the general viewer, I showed the chemical structures and funny names because they looked good, and maybe also to shroud chemical biology in a cloud of alchemy. Colors helped immensely to explain the data properties. The tint function helped color code the images and highlight the z-score plot. Without using a legend for the z-score, the viewer can instantly associate the highlighted normal distribution curve with the z-score.
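The ring layout can be illustrated with a little trigonometry. This is a minimal sketch, assuming each replicate is placed at an even angular step around its compound's center; the function name and coordinates are hypothetical, and it does not reproduce the specific radial parameters that produced the flower effect.

```python
import math

def ring_positions(cx, cy, n, radius):
    """Place n points evenly on a circle of the given radius around (cx, cy)."""
    return [
        (cx + radius * math.cos(2 * math.pi * k / n),
         cy + radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

# A collapsed cluster has radius 0; an expanded one spreads its points outward.
collapsed = ring_positions(200.0, 150.0, 4, 0.0)
expanded = ring_positions(200.0, 150.0, 4, 40.0)
```

Animating the radius from 0 up to its expanded value, frame by frame, gives the opening motion.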
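One way to turn a z-score into a color is a diverging scale that fades from white toward one hue for positive scores and another for negative ones. The palette, the clamp at three standard deviations and the function name below are all assumptions for illustration, not the colors used in the viewer.

```python
def z_to_color(z, z_max=3.0):
    """Map a z-score to an (r, g, b) tuple: white at 0, red for high, blue for low."""
    t = max(-1.0, min(1.0, z / z_max))  # clamp to [-1, 1]
    fade = int(round(255 * (1 - abs(t))))
    if t >= 0:
        return (255, fade, fade)  # white fading to red
    return (fade, fade, 255)      # white fading to blue

neutral = z_to_color(0.0)
strong_activator = z_to_color(3.0)
strong_inhibitor = z_to_color(-3.0)
```

Using the same mapping for the scatter points and the linked detail view keeps the color theme consistent between the two.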
Loose Ends
Although the chemical structure is shown, the compounds are not sorted by structural similarity, which would help to reveal patterns in activation or inhibition of glucose regulation. I had to default to sorting by chemical ID along the x-axis. I am not a medicinal chemist, so I could not come up with a pattern between z-score and structure. Also, some of the z-scores have high error margins, and a metric for variability between samples would be useful. This extra dimension would probably be linked to the radius of the expanded cluster.
I tackled this project in bits and pieces. There was a lot of trial and error and, admittedly, frustration along the way. In exchange, I happened upon happy accidents and ideas that I would have missed by just knowledgeably plowing through; call it beginner’s luck. Honestly, doing this project over would be daunting, but the second time around my coding would be more organized and efficient.
The biggest roadblock was definitely controlling the draw loop properly to avoid unwanted effects and stuttering. I had to use mouse movement to advance the loop. A side effect is that after expanding a cluster, the user must move the mouse for the cycle to complete. The user will eventually move the mouse anyway to roll over another cluster, but the slight jitter is annoying. Also, more knowledge of arrays would have helped manage the data and provide more statistical analysis and content. However, I am happy because it looks good and works decently.
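The idea of tying variability to the expanded radius could look something like this. It is a sketch of an unimplemented idea: the sample standard deviation is one possible variability metric, and the pixel values and function name are arbitrary assumptions.

```python
from statistics import stdev

def cluster_radius(replicates, min_radius=20.0, scale=15.0, max_radius=80.0):
    """Grow the expanded cluster's radius with the spread of its replicate z-scores."""
    spread = stdev(replicates) if len(replicates) > 1 else 0.0
    return min(max_radius, min_radius + scale * spread)

tight = cluster_radius([1.0, 1.1, 0.9])   # replicates agree: small flower
noisy = cluster_radius([0.2, 2.5, -1.0])  # replicates disagree: large flower
```

A compound whose experiments disagree would then open into a visibly larger flower, flagging its average z-score as less trustworthy.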
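For comparison, an animation that advances a fixed step on every frame finishes on its own, without waiting for input. The sketch below is not how the viewer was written; it only illustrates the idea with a hypothetical easing function and a plain loop standing in for the per-frame draw call.

```python
def step_toward(current, target, easing=0.2, snap=0.5):
    """Advance one frame of an ease-out animation, snapping to the target when close."""
    if abs(target - current) <= snap:
        return target
    return current + (target - current) * easing

# Simulate successive frames expanding a cluster from radius 0 to radius 60.
radius = 0.0
frames = 0
while radius != 60.0:
    radius = step_toward(radius, 60.0)
    frames += 1
```

Because each frame moves the radius a fraction of the remaining distance and then snaps, the expansion completes after a bounded number of frames.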