Double post day! I’ve been wondering whether I should do a more “professional” content chapter instead of the general personal rambles I’ve been doing so far, and I think today we’ll be doing a casual introduction to proteomics. This is going to be poorly cited, if at all, and not peer reviewed or particularly in depth; it’s just my own description of how I think about proteins. Although I’m professionally a deeply trained “proteomicist”, I still think of proteins in a kind of personified view. For whatever reason, that’s just how my brain works. There are some great academic resources on the history of proteins and the study of proteins (“proteomics”), like this one from the Human Proteome Organization (HUPO). This chapter isn’t going to be that, though.
Some people have taken a biology class in high school before and gone over the classical “central dogma” of biology, which is that DNA is transcribed into RNA, which is then translated into proteins, and then proteins are the molecular machines that do most of the “work” around a cell. Proteins are the motors that pull cargo from one side of the cell to another, they’re the catalysts that turn one molecule into another, and they’re the control switches that turn genes on or off. Each of those “jobs” corresponds to a class or family of proteins: in the previous sentence, those are kinesins, enzymes, and transcription factors, respectively. Transcription factors are my personal favorite lately, with the work I do at my startup company, but all proteins are pretty cool and it still blows my mind to watch the artistic interpretations of proteins in action like this or the paintings by David S. Goodsell.
Measuring proteins is a lot harder than measuring DNA or RNA. While genome sequencing has become pretty mainstream, with direct-to-consumer kits to sequence yourself (23andMe) or even your dog (Embark), proteomics has had a harder time getting up to the Moore’s Law trajectory. DNA is made up of four nucleotides; proteins are made up of 20 amino acids. The range of size and chemical properties across the 20 amino acids is much more varied than the sizes and chemical properties of the four nucleotides. Between that and the combinatorics, the building blocks themselves are one thing that makes measuring proteins so hard.
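To put a rough number on the combinatorics: a chain of n building blocks has (alphabet size)^n possible sequences. Here’s a quick Python sketch of that arithmetic. The function name and the chain length of 10 are just picked for illustration; real proteins and genes are much longer, which only makes the gap bigger.

```python
def possible_sequences(alphabet_size: int, length: int) -> int:
    """Count the distinct linear sequences of a given length."""
    return alphabet_size ** length


length = 10  # illustrative chain length, not a real protein or gene
dna = possible_sequences(4, length)       # 4 nucleotides
protein = possible_sequences(20, length)  # 20 amino acids

print(f"DNA 10-mers:     {dna:,}")            # 1,048,576
print(f"Protein 10-mers: {protein:,}")        # 10,240,000,000,000
print(f"Ratio:           {protein // dna:,}")  # 9,765,625
```

So even for a tiny stretch of 10 building blocks, there are nearly ten million times more possible protein sequences than DNA sequences.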
Another complication is the “dynamic range” of the molecules. Dynamic range refers to how some molecules are very common and highly abundant, while other molecules are rare and present at very low abundance. This is the “needle in a haystack” problem, where some protein molecules are very common and abundant (the hay) and others are rare (the needle): it’s hard to count how many needles might be in a haystack because there’s just an overwhelming amount of hay to deal with. It’s the same thing with proteins, where some proteins are just really abundant, like albumin, while other proteins are very low abundance, such as signaling proteins like cytokines or peptide hormones like insulin. There are tricks you can do to manipulate the metaphorical haystack to make it easier to find the needles. Enrichment techniques like immunopurification bind the protein of interest using an antibody to fish the protein out of the mix, basically like using a magnet to pull out the metal needles from the hay; depletion techniques light the hay on fire to leave behind the needles. At the end of the day, even with sample manipulations, dynamic range is still a challenge. For a more academic breakdown of the dynamic range problem, one of my PhD advisors did a theoretical breakdown of why the dynamic range of proteomics makes it orders of magnitude more difficult than other ‘omes.
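Dynamic range is usually talked about in “orders of magnitude”, which is just the base-10 log of the ratio between the most and least abundant things. A minimal sketch, with a loud caveat: the two concentrations below are ballpark values I’m assuming for illustration (albumin at tens of milligrams per milliliter of plasma, and a hypothetical rare protein at a few picograms per milliliter), not measurements.

```python
import math


def orders_of_magnitude(high: float, low: float) -> float:
    """Dynamic range between two concentrations, in powers of ten."""
    return math.log10(high / low)


# Ballpark values assumed for illustration only, both in grams per mL.
albumin_g_per_ml = 40e-3       # roughly tens of mg/mL in blood plasma
rare_protein_g_per_ml = 5e-12  # a hypothetical protein at a few pg/mL

span = orders_of_magnitude(albumin_g_per_ml, rare_protein_g_per_ml)
print(f"About {span:.1f} orders of magnitude")
```

With these assumed numbers the hay outweighs the needle by nearly ten orders of magnitude, which is the scale of the problem any single measurement has to cope with.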
A third challenge is that there is no copy-paste for proteins. DNA has an amazing trick called polymerase chain reaction (PCR), which essentially means that scientists can take one single molecule and make as many copies of it as they want (it works for RNA too, once the RNA has been converted into DNA first). Proteins, sadly, don’t have this miraculous invention. It would surely earn a Nobel Prize for anybody who figured it out though. So, whatever you want to measure, you’re not only going to have problems based on the complexity of the biochemistry, but also the analytical dynamic range, and now also you can’t make more if you need to try it again.
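To see why PCR is such a superpower, here’s the idealized math: each cycle doubles whatever is there, so n cycles turn one molecule into 2^n. This sketch assumes perfect doubling every cycle, which real reactions don’t quite reach (they eventually plateau), and the function name is just mine.

```python
def copies_after_pcr(starting_molecules: int, cycles: int) -> int:
    """Idealized PCR: every cycle doubles every molecule."""
    return starting_molecules * 2 ** cycles


for cycles in (10, 20, 30):
    print(f"{cycles} cycles: {copies_after_pcr(1, cycles):,} copies")
# 10 cycles: 1,024 copies
# 20 cycles: 1,048,576 copies
# 30 cycles: 1,073,741,824 copies
```

One molecule becomes about a billion in 30 cycles. With proteins, the number of molecules you start with is the number you get to measure.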
Nevertheless, there are indeed ways that we measure proteins. A lot of ways, really, and scientists have been doing it for a long time. This won’t be an exhaustive list by any means (again, this isn’t peer reviewed or anything, it’s just me riffing off the top of my head and the tips of my fingers) but I’ll focus here on the measurement techniques that tell you protein identity or components, rather than overall protein concentration like Bradford, BCA, or Nanodrop.
There’s old school Edman degradation, where a protein is stretched out into the linear string of its component amino acids, then each amino acid is chopped away one by one from one end to come up with the full sequence of the protein. There are antibody-based approaches, like Western blots or fluorescent assays, where you have a tool kit of antibodies which will recognize partial shapes of proteins, and you can use those antibodies to “light up” different proteins; if an antibody lights up, it means that protein is present in the sample. There’s mass spectrometry, which smashes up the protein and then reassembles the pieces using the mass (as the name implies) of each amino acid as a puzzle piece to fit the sequence of the protein back together. Finally, there are new technologies like nanopore sequencing, which threads the protein through a tiny pore and reads the changes in electrical current as each amino acid passes through to identify which amino acid it is, and pieces the protein sequence together that way.
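The mass spectrometry “puzzle piece” idea can be sketched in a few lines. This is a toy illustration, not how real analysis software works: it assumes a clean, complete ladder of fragment masses where each step adds exactly one amino acid, and it only knows a handful of amino acids. The residue masses are standard monoisotopic values in daltons; the function names and the tolerance are my own choices.

```python
# Monoisotopic residue masses in daltons (subset, for illustration).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "L": 113.08406,  # isoleucine (I) has the identical mass
    "K": 128.09496,
    "E": 129.04259,
}


def build_ladder(sequence: str) -> list[float]:
    """Cumulative masses of each prefix of the sequence, starting at 0."""
    ladder = [0.0]
    for residue in sequence:
        ladder.append(ladder[-1] + RESIDUE_MASS[residue])
    return ladder


def read_ladder(ladder: list[float], tolerance: float = 0.005) -> str:
    """Turn the gaps between neighboring ladder masses back into letters."""
    letters = []
    for lighter, heavier in zip(ladder, ladder[1:]):
        gap = heavier - lighter
        match = [aa for aa, mass in RESIDUE_MASS.items()
                 if abs(gap - mass) <= tolerance]
        letters.append(match[0] if match else "?")  # "?" = no known residue fits
    return "".join(letters)


ladder = build_ladder("PEAK")
print([round(m, 3) for m in ladder])
print(read_ladder(ladder))  # PEAK
```

Each gap between neighboring masses matches exactly one amino acid in the table, so the sequence falls out of the differences. The comment on leucine hints at a real limitation: leucine and isoleucine weigh the same, so mass alone can’t tell them apart.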
All are pretty wildly creative ways to solve the protein measurement problem. For the record, the most common way to do DNA sequencing uses the “light up” approach, where each of the four bases gets a specific color assigned to it, and then bases are read off one at a time, kind of like the Edman degradation approach but building up instead of chopping down (specifically, it’s called sequencing by synthesis). A copy of the DNA strand is built one base at a time: the newly added base will light up “red”, a tiny microscope inside the sequencer will see the “red” and assign that to its respective base, and then the colored tag gets chopped off. The next base is added and lights up “green”, the microscope records that base, and the tag gets chopped off for the next; and so on and so on until the entire piece of DNA has been sequenced, base by base. It’s pretty smart, and there are other ways to do DNA sequencing (the older Sanger sequencing, or newer long-read methods) but that’s really the gist of it.
Unfortunately it doesn’t work for proteins, at least not for proteins in a complex mixture like, for example, blood. It’s a math problem in that doing sequencing with the light-up bits and microscopes is inherently limited by the physics of light, something my PhD advisor details more in his manuscript, but the TL;DR is that the experiment (flow cell) for DNA sequencing easily fits in your hand, while, based on the maximum density of spots that the wavelength of light lets you tell apart, the experiment (flow cell) to do protein sequencing would need to be 1 m^2.
Some people are trying to get around that problem by using the enrichment/depletion tricks I mentioned before, so that the proteome is less complex and therefore doesn’t require such a big experiment (flow cell) system, but then you’re also limited to only measuring a portion of the proteome/haystack.
Probably the “next gen proteomics” approach I’m most excited about is nanopore sequencing, where they pull the protein through a literal pore (which, hilariously, is actually made up of OTHER proteins itself) and use electrical currents to determine which amino acid is going through the pore. There’s been a lot of work on this for DNA sequencing, but again, DNA is easier because there are only four bases and they’re all chemically similar, while proteins have the first challenge from above, in that there are 20 amino acids and they’re highly varied in size and chemistry. The size and chemistry variation makes it hard to fit everything through the “same” pore: if the literal hole in the pore is too big, the electrical signal won’t be good enough to sense the small amino acids, but if the hole is too small, the really big amino acids won’t be able to fit. So while nanopore sequencing is, to me, the most science-fiction/fantasy approach of them all, it’s also quite a ways from being commercialized for proteins, I think.
We’re stuck with mass spectrometry for now, for the most part. That’s not to say mass spectrometry is perfect. It’s a cool $1M to buy a mass spectrometer, so we desperately need better ways to measure proteins. I’ve written a ton about that in the past, both formally and informally. There are lots of startup companies trying to solve the protein measurement problem using the various approaches above, or at least introduce some alternative ways to measure proteins, but unfortunately nothing is really available yet. Some are close, I think, but price is still going to be a barrier because I’m sure they’ll be expensive.
In the end, I’m excited to see what might come out in the next 2-5 years. I’m not personally or professionally banking on mass spectrometry remaining the only really viable technology for proteomics, although I certainly will always have a soft spot for mass spec. It’s cool to think about what would be possible with a cheaper, faster, and/or more sensitive protein measurement system.