A few years ago, no one doing research in the life sciences would have heard of the catchphrase “Big Data”; now, in many of its fields, it is impossible to avoid. Proteomics is no exception to this trend: datasets are becoming bigger and more complex as mass spectrometers become more precise and affordable.
My main focus is how to integrate, analyze and visualize such collections of data. In the Lamond Lab we maintain one of the biggest and most highly annotated proteomics repositories in the world, and extracting useful information from it is one of our main priorities.
My area of focus is not the Laboratory Information Management side of things, as we already have a fantastic team that deals with those challenges. My emphasis lies on quantified proteomics data and the challenges brought forth by its scale and complexity. The Encyclopedia of Proteome Dynamics (peptracker.com/epd) is one of our answers to these challenges: an integrated database, built on polyglot persistence, that contains all our published proteomics data. It is our constantly evolving way of sharing proteomics data and making it easy to explore, both for our lab and for the general public. Here we have harnessed the power of popular technologies from the commercial sector, such as the NoSQL databases Cassandra and Neo4j, the Spark data-processing engine and the D3.js visualization library, to provide scalable, resilient and performant solutions.
However, on the analytics side we still have many questions to answer, and there too lies my focus. How do we calculate global false discovery rates (FDR) to deal with FDR inflation in integrated repositories that contain numerous independently quantified datasets? How do we quantify proteomics experiments that contain thousands of raw files? How do we integrate the resulting peptide-spectrum matches (PSMs) if such a quantification run had to be split into several smaller ones? How do we determine whether the output of one mass spectrometer is comparable to that of another? Do mass spectrometers of different models produce comparable results? How do we determine when a sample was affected by a mass spectrometer with degraded performance? Can we find a significant difference in performance after a mass spectrometer has been serviced? How does the variation in data produced by label-free quantification compare with that of SILAC or TMT? What patterns do we find within a large-scale PSM repository? What is the effect of the quantification software version on the PSMs and downstream elements?
It’s an exciting time in which we have far more questions than answers, but we’re already working to change this.