CHAPTER FOURTEEN: PRODUCTIVITY

One of the most common currencies for scientists is publications. There are, of course, many other measures: patents, grants and other funding, Investigational New Drug (IND) applications, (text)books, code repository forks/downloads, and many more. In general, though, every measure of productivity boils down to some condensed summary of a ton of work, crafted for consumption by others.

Across almost all fields, not just science, people use resumes or curricula vitae (a CV is basically just an extra-long resume) to quickly and clearly communicate career productivity. Of course, resumes and CVs are imperfect, as they will never fully capture the nuance of someone's life and experience, but that also highlights how important written communication is. There's a whole rant here about scientific productivity being measured most commonly through written communication, even though STEM majors never really get trained well in written communication, which is in part what launched my whole aspirational goal this month of writing 50k words. (Aside, to be clear: 50k words is not happening, as it's 7 PM on November 30 here and I have just over 15k words at the moment, so I don't think it's possible to crank out another 35k words in five hours. I'll probably write a conclusions or summary post tomorrow that continues from this one as a reflective retrospective on how I think this inaugural #NovemberWritingChallenge went. Spoiler: I'm of course disappointed that I didn't even come close to 50k, but I've also learned a lot about what holds me back from writing.)

I've written previously about the COMMUNICATION aspect of publications and other avenues for disseminating scientific work, but since the last chapter on MESS I've been thinking about it from the standpoint of repeatability and reproducibility in the scientific literature. First, to clarify: repeatability typically refers to the same person repeating the same experiment with the same system and getting the same result, while reproducibility refers to a different person attempting the same experiment with the same or a similar system and reaching the same conclusion. Anything published should really be repeated, because if you can't get the same result you're claiming in your publication, then you probably shouldn't publish it. But reproducibility can be harder. Borrowing from statistical descriptions of missingness, in my mind there's irreproducibility "at random" and irreproducibility "not at random". The former is where biology is just hard and there's some hidden aspect of the experimental system that is unknown; here the scientist is not really at fault for the irreproducibility. Irreproducibility "not at random" is where the scientist just did a terrible job of describing the methods, the system, or the analysis. I'm assuming laziness here rather than malicious omission, although there are of course examples of malicious intent through manipulated data or outright faked analyses.

Irreproducibility "not at random" is at least in part a problem of bad methods sections, the specific part of a scientific manuscript where the work gets described. Methods sections are the second easiest place for me to start writing a paper, after the results or figures, because I'm just describing what I did. Usually my methods are pretty generic and widely used in the field, so they don't need much detail, but sometimes there are specific twists that I've added. It's not unlike cooking and having recipes. Most people have some idea of what goes into a chocolate chip cookie recipe, but some people might have a specific twist based on their personal taste preferences, like using brown butter instead of regular, or based on necessary accommodations, like adjustments for baking at high elevations. So the equivalent is that maybe the scientist behind the irreproducible work just doesn't realize that their method works great for them because their experiment is being done in Denver, while a scientist at sea level in Boston needs a different recipe, i.e., protocol or method.

Not to get back into the whole artificial intelligence (AI) debate, but maybe AI would be helpful for the reproducibility of analyses. I'd be shocked if a lot of the papers coming out now aren't using analyses that were written, at least in part, by AI like ChatGPT, Claude, etc. If people are already relying on AI to write their data analyses (and therefore guide their conclusions), then it's not a huge leap to take things one step further, capture the whole "chat", and publish that as a supplemental method. At a bare minimum, people should be capturing the code and publishing those scripts or notebooks alongside their papers for repeatability, but I know a lot of people put terrible code out there that can't be rerun by anybody else due to hardcoded paths or missing dependencies, and many more people never even make their figure-producing code available.
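To make the hardcoded-path problem concrete, here's a minimal sketch of the pattern I wish more supplemental scripts followed: take the input path as an argument, fail loudly if it's missing, and pin dependencies. All the file names here are made up for illustration.

```python
# Non-rerunnable version: a hardcoded absolute path, plus a silent dependence
# on whatever pandas version happens to be installed:
#   df = pandas.read_csv("C:/Users/me/Desktop/final_FINAL_v3.csv")
#
# More repeatable version: take the input as an argument and fail loudly.
import argparse
from pathlib import Path

import pandas as pd  # pin this in requirements.txt, e.g. pandas==2.2.*


def main() -> None:
    parser = argparse.ArgumentParser(description="Reproduce the paper's summary table")
    parser.add_argument("input_csv", type=Path, help="Path to the published dataset")
    parser.add_argument("--out", type=Path, default=Path("summary.csv"))
    args = parser.parse_args()

    if not args.input_csv.exists():
        raise SystemExit(f"Input file not found: {args.input_csv}")

    df = pd.read_csv(args.input_csv)
    # ... the actual analysis steps would go here ...
    df.describe().to_csv(args.out)


if __name__ == "__main__":
    main()
```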

Maybe AI could go one step further, though. If we captured the protocols alongside the data processing and the result-producing post-processing analyses, that would be the most ideal reproducibility scenario. I don't know exactly what that might look like in practice, but it's something that would massively help me in implementing other people's methods in my own lab. Upload someone else's paper, have the AI generate a shopping list based on the methods so I can get all the supplies I need, and then have it spit out the protocols based on the methods and a bit of the results, maybe. There's probably already something like that out there, if only for cooking.

Anyway, all of this reproducibility ramble got me away from the initial thought, which was productivity as measured by writing. In management-speak, what goes into the resume or CV are the OKRs (Objectives and Key Results), and what helps get you to the OKRs are the KPIs (Key Performance Indicators). Setting goals, or even "resolutions" as we come up on the end of 2025, usually happens at the OKR level, but without some KPI to get you there, you're probably never going to make your goal or resolution. When you set out to run a marathon, you don't usually just walk up to the start line and bang out 26.2 miles. Usually you decide that you want to run the marathon (OKR) and then break it down into a training plan with gradually increasing mileage each week (KPI). As another example, this whole challenge to write 50k words this month (OKR) came with clear daily mini-goals, like shooting for 2k words/day (KPI).
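For what it's worth, the KPI math for this challenge fits in a few lines, using the numbers from this post:

```python
# Back-of-envelope pacing for the writing challenge.
goal_words = 50_000               # the OKR
days = 30
baseline_kpi = goal_words / days  # ~1,667 words/day; I rounded up to 2k/day

written_so_far = 15_000           # roughly where I actually am on November 30
remaining = goal_words - written_so_far
print(f"Daily KPI needed from day 1: {baseline_kpi:.0f} words/day")
print(f"Words left with five hours to go: {remaining:,}")  # 35,000 -- not happening
```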

One place I think people struggle, both with methods for papers and with productivity measurement in general, is figuring out whether the KPI is really what the audience needs to know, or whether the audience really just cares about the OKR. For a methods section, you need to be specific and detailed, but for reproducibility it's enough to aim at the OKR. In fact, it's probably even better for the scientific community and for furthering human knowledge if we can reproduce the idea or conclusion by orthogonal means, rather than directly reproducing exact experimental conditions that might be correlated with, but not causative of, the conclusion being drawn by the original scientist.

Similarly with personal productivity, it's easy to punch out a bunch of KPIs and make progress ticking off to-do list boxes, but if you're not keeping the OKR in mind, you may be doing a bunch of busy work without making meaningful progress on the real goal. "Not everything that can be counted counts, and not everything that counts can be counted", as William Bruce Cameron is quoted as saying. A few weeks ago, I was talking about this with some other women at a professional event, where we were discussing the challenges of accommodating alternative paths. Specifically, the conversation turned to childcare stipends for conferences, and a couple of young mothers emphatically supported the idea. To be clear, I love the concept of stipends that help people afford traveling to professional events and career advancement opportunities. That said, a couple of other women, myself included, cautioned that bringing children to a conference may prevent you from getting its full value, which usually isn't in the formal programming but rather in the informal networking that happens in the evenings, after the official programming ends. For me this is a personal observation: when I've tried to bring my child to conferences, I've ended up doing a pretty terrible job both professionally and personally, and nobody got my full attention because I couldn't engage fully in either setting. While, yes, I hit the KPIs of "attend conference, give talk" and "conduct bedtime routine", I didn't really make progress on the OKRs of "advance career" or "be a present, available parent". At the same time, I recognize that I'm lucky to have the option of leaving my child with my partner when I go to conferences now, so while I miss spending time with my family when I travel to events, it's a choice that not everyone has the luxury of making.

I certainly wish there were a better system, both for conference networking specifically and for productivity measurement more broadly, but until scientific society at large changes, I think we're stuck with the aforementioned productivity metrics and with figuring out tools to manage them, or otherwise cope.

CHAPTER TWELVE: COMMUNICATION

A brief break from the deep subject-matter-expertise posts, because they have me thinking broadly about scientific communication. Specifically, about how formal training (i.e., undergrad and grad school) never prepared me for real scientific communication.

The focus in school was always on communicating science to non-scientists, usually kids from middle school through high school. Getting kids interested and excited about STEM subjects is great, but this exclusive focus on middle/high school science communication really did me a disservice in my professional life, and probably even my personal life.

There are so many more facets of scientific communication that really don't get addressed sufficiently in higher education circles. I'm thinking about how I have, professionally, had to figure out how to communicate very technical, jargon-heavy concepts to a variety of really smart people whose expertise lies outside my hyper-focused niche. The way I might approach communicating what I do (mass spectrometry, proteomics, transcription factors, etc.) completely depends on the audience. A venture capital investor, for example, is thinking about things from a financial perspective, while a pharmaceutical scientist is thinking from a drug discovery perspective, and an oncologist is thinking about the outcome and impact on patients in a clinical trial. Nobody is "wrong", nobody is smarter than anybody else; everyone's just thinking about the same thing from a different perspective and with a different lens.

It follows, then, that communicating the same topic needs to be framed specifically for different audiences. It doesn't mean that the core concepts change, just that the language needs to differ for the most effective communication. Maybe language itself is an interesting parallel: translating the same message into different languages shouldn't change the core message, but using the right language for the right audience is going to make communication much easier than forcing everyone to do the translation themselves, or, even worse, having them zone out and not listen to the message at all.

None of my undergraduate or graduate school outreach opportunities touched on this concept of science communication to adults, really. As far as I can recall, it always focused on making fun, hands-on “labs” or demonstrations for kids to learn scientific concepts. Then I got into the “real world” and suddenly cute little demonstrations aren’t really working anymore.

An obvious example of where scientific/STEM communication can go really right or insanely wrong is policy. Policy makers might consult scientists, doctors, and other professionals, and piece together all of the best expert advice to write into laws or regulations or recommendations, but without effective communication, those policies rely solely on people following guidance out of an appeal to authority. Sometimes that works, but a lot of times it doesn't. In my own work, appeal to authority has very rarely worked out. I don't have much authority outside of my hyper-niche specialty, so the communication (good or bad) is weighed much more heavily.

I think there's a lot we could learn about communication, as a scientific community, from novelists and screen/scriptwriters. Crafting a story can help hook an audience into a message. This is another place where grad school trained me, but trained me specifically to give scientific talks to scientific audiences; I've had to relearn how to build a "story" with the sort of "plot" or pacing that communicates the message best. I think classic literary structures (e.g., the hero's journey, tragedy, comedy) could translate really well into the telling of a scientific story. There are some nonfiction biographies and histories that have done this for biotech stories, like the books Living Medicine and The Billion-Dollar Molecule, which retell the history of bone marrow transplantation and the founding of the pharmaceutical company Vertex, respectively, but with a framing that makes the historical story and scientific journey a pleasure to read.

I'm sure that with both of those books the exact history isn't perfectly captured, and in part it never can be, because the way the stories are told involves so many people's specific memories and emotions and motivations, but the general approach is something I really admire from a communication standpoint.

I think the books work so well, in part, because the reader can pattern match the general story arc to other novels. There's some backstory setting up the scene and the characters, there's some tension or suspense that puts the main characters through a challenge, and then there's a resolution by the end of the book. There are some subplots along the way, maybe some romance or comedy.

Lack of storytelling is why a lot of scientific communication falls so flat. I've sat through a lot of scientific presentations that are just a linear chronology of all the experiments the presenter has ever done. The right story, though, is almost never the chronological story. Although there are some contexts where the chronological story is motivating to the audience, I think most audiences are thinking "Why should I care?", so you need to directly say out loud why they should bother listening and paying attention to you. For startup pitch decks, the first slide usually either directly states the problem the company proposes to solve, or presents the financial opportunity (market size); both directly tell the audience why they should care, either because it's a problem they themselves recognize or because it's an opportunity to make money. For a scientific presentation, an element of teaching usually goes a long way, so that the audience cares because you've taught them something new. (Put another way, making the audience feel smart/smarter is a good motivator for a scientific audience, who are probably always looking to learn more.)

In scientific manuscripts for peer-reviewed journals, there's definitely a certain pattern that I've come to expect from papers. For a basic four-figure paper, the first figure is some kind of method or overall experimental schematic. The second figure is a high-level visualization of the data, like a heatmap or a dimensionality reduction such as principal component analysis (PCA), t-SNE, or UMAP. The third figure is a deep dive into some slice of that big dataset, just visualized differently. And then the final figure is some orthogonal experiment to prove figure three correct, and/or a schematic of some biological mechanism that the data suggests. Pattern matching that template helps me get through papers pretty quickly, because I can just flip to the figures and usually they follow some general flow like that.
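As a concrete (toy) version of that second figure, here's a minimal PCA sketch using scikit-learn on a made-up samples-by-genes matrix; everything here is synthetic, just to show the shape of the analysis.

```python
# A minimal "figure two": PCA on a synthetic expression matrix (samples x genes).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 20 samples x 500 genes, with the second group shifted on a few genes
data = rng.normal(size=(20, 500))
data[10:, :25] += 2.0  # group two has elevated "expression" in 25 genes

pca = PCA(n_components=2)
coords = pca.fit_transform(data)  # (20, 2): one point per sample to plot
print(pca.explained_variance_ratio_)  # how much variance PC1/PC2 capture
```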

Having some portion of the communication be predictable helps get the message across. If the message itself is unexpected or difficult, then having the medium or the presentation be predictable can help, I think. Predictability isn't a bad thing. There's something comfortable about knowing what to expect, and when it changes abruptly, it can be jarring. I'm thinking about things like when your favorite band has a particular style of music that they produce, but then there's that one weird random song that doesn't fit the vibe and sticks out badly. (Of course, some people are amazing across genres, and there are some scientists like that, too, who can easily hold their own across multiple fields.)

Something I use almost always in my scientific presentations is a "three-act" structure. My talks almost always start with some brief introduction to set the "scene", maybe 10-15% of the total talk. Then I set up three main take-home messages for the audience; each of the three is about 25-30% of the talk, and they somewhat build on each other. Finally, with the remaining 10-15% of the talk, I have some "cliff-hanger" future work: not the next obvious logical step based on what I've said, but some more distant future vision. It's not something that just happens; it's something I intentionally do whenever I sit down to organize a talk. I usually start by deciding on the three main messages, set each of those up, then do a little scene-setting/exposition at the beginning that gives just enough context for the three messages, and finish with a little forward-looking "cliff-hanger" at the end. I don't claim to be the best presenter or anything, but I've been invited to speak quite a bit, so I figure something must be resonating.
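If it helps, the time-budget arithmetic is trivial; here's the rough split, with fractions picked from within the ranges above (the 45-minute slot is just an example):

```python
# Rough time budget for a talk using the three-act split described above.
talk_minutes = 45
budget = {
    "intro / scene-setting": 0.10,
    "message 1": 0.26,
    "message 2": 0.26,
    "message 3": 0.26,
    "cliff-hanger future work": 0.12,
}  # fractions sum to 1.0

for part, frac in budget.items():
    print(f"{part}: ~{talk_minutes * frac:.0f} min")
```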

None of this is meant to be an immediate solution to scientific communication struggles, and again, to be clear, I don't mean to imply that I'm a significantly better communicator than anybody else. In part that's because I don't think we scientists get enough training in communication, and in part because communication is just hard anyway, even with training. I'm inspired by writers, though, because I think if we structured scientific communication to lean on common literary devices, like plot structure and story arcs, we'd probably capture a wider audience and have more buy-in and support from policy makers, funding agencies, and even the general public.

CHAPTER ELEVEN: INTEGRATION

This chapter will be a continuation of the last, "PROTEINS", staying on the central dogma measurement theme. Previously I focused primarily on how to measure proteins, but don't get me wrong, measuring the other molecules (DNA and RNA) is also valuable and important. No one molecule is going to hold all the information needed to crack biology, and that sentiment is generally what's behind the hype and excitement around using "AI" for biology.

A quick aside: I don't love the term "artificial intelligence". Back in my day (ha) it was "machine learning", but for some reason "AI" has replaced "ML". And for that matter, when "AI" is used lately, it's almost always synonymous with the large language model (LLM), i.e., ChatGPT. LLMs are one type of AI (ML), and while impressive, I don't think they're really going to be the type of ML that cracks biology.

The hype or promise of making AI really useful for biology is to improve drug discovery, or personalized medicine, or disease diagnosis, or even just to find new fundamental knowledge and connections we haven't made previously. That last part is where my doubts come in. We (speaking for the greater scientific community) haven't really had much success integrating measurements of multiple molecules together. There's a lot of work toward multi-omics, that is, collecting multiple 'omic measurements on the same or similar samples, like doing DNA sequencing, RNA sequencing, and even proteomics all on the same sample.

But I haven't seen many multi-omics analyses that truly integrate the data. Most analyses I see work through each individual data type or measurement, arrive at that data type's conclusion, and then move on to the next data type. It's more sequential 'omics than multi-omics, in my opinion.

It's interesting, too, because there are significant efforts to use one data type to predict another. For example, using RNA sequencing data (transcriptomics) to predict which proteins will be present and at what abundance. These predictions are imperfect at best and just plain wrong at worst. Obviously, if they worked better, it would be great to use a cheaper or easier measurement to predict a more expensive or laborious one.
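As a hand-wavy illustration of why those predictions are imperfect, here's a fully synthetic sketch: transcripts and proteins that only partially track each other, plus the obvious log-log linear fit. The numbers are invented; the moderate-but-imperfect correlation is the point.

```python
# Simulate mRNA levels and proteins distorted by translation/degradation "noise".
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
mrna = rng.lognormal(mean=2.0, sigma=1.0, size=n)
protein = mrna ** 0.8 * rng.lognormal(mean=0.0, sigma=0.7, size=n)

# Best-fit line in log space: log(protein) ~ a * log(mRNA) + b
a, b = np.polyfit(np.log(mrna), np.log(protein), deg=1)
r = np.corrcoef(np.log(mrna), np.log(protein))[0, 1]
print(f"slope={a:.2f}, correlation={r:.2f}")  # decent r, far from perfect
```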

Because we haven't had much success with "simple" computational approaches to integrating across multiple measurement systems, I'm a bit skeptical that we're going to be able to put together an AI that really cracks predicting biology in general. We just don't have a good enough understanding of the data yet, so I'm not sure how you can train a computer to make sense of something that we don't yet understand ourselves.

The lack of "training data" is the big hurdle, I think. LLMs like ChatGPT were enabled, in part, by the huge volume of text available to train on, but there's nowhere near the same volume of data for biology. Further, of the biological data we do have, the majority is poorly annotated, if annotated at all. In other words, there's a data file somewhere, but we don't really know where it came from, what it's supposed to be measuring, or whether its quality is any good.

(Aside: this is definitely true of proteomics data measured by mass spectrometry, although perhaps some other fields are better about data annotation and labeling. The "metadata" for files, including even simple things like what organism was measured, isn't always provided alongside the files. There are efforts to fix this, both prospectively, by requiring metadata files at the time of data submission for academic, peer-reviewed publications, and retroactively, by going back through archives of the most popular or highest quality data and backfilling metadata. Overall, it's going to be tough for mass spectrometry proteomics either way, because even if we get the data files annotated with correct metadata, there's a second problem: those raw data files need to be processed in some way to make them useful for machine learning, at least ML with the intent to infer or predict biology.)
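For illustration, here's the kind of minimal metadata record I'm talking about; the field names are my own invention, not any repository's actual schema.

```python
# The minimal metadata I wish every deposited raw file carried.
metadata = {
    "file": "run_042.raw",
    "organism": "Homo sapiens",
    "sample_type": "plasma",
    "instrument": "orbitrap",                 # vendor/model would be even better
    "acquisition": "DIA",                     # vs DDA, targeted, etc.
    "sample_prep": "trypsin digest, C18 cleanup",
    "qc_passed": True,
}

# A prospective repository check could be as simple as:
required = {"file", "organism", "sample_type", "instrument"}
missing = required - metadata.keys()
if missing:
    raise ValueError(f"Deposit rejected, missing metadata fields: {missing}")
```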

I might be totally wrong, though. There's the classic essay "The Bitter Lesson" by Rich Sutton, which suggests that we don't need more data, or better data, or the ability to integrate data together; instead, it argues that "the only thing that matters in the long run is the leveraging of computation", and therefore we shouldn't worry about current limitations or challenges with multi-omics or the amount of data. Under the Bitter Lesson philosophy, we should instead feed the computers what we have so far and see if they can train on that and discover something new.

So far, I'd say, at least from my own personal experience, the computers have only gotten good enough to predict the average, at least when it comes to predicting the DNA-binding activity of proteins in a given biological context. For any protein, the model predicts roughly the average DNA-binding activity across all the contexts it has seen, rather than anything context-specific. This is the real-life equivalent of the computer looking at a haystack and saying "it's all hay" even though we, as humans, know that there are a few needles in there; because the pile is mostly hay, the computer averages the pile out.
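A toy numerical version of that haystack, in case it helps: with 1% "needles" and no features that separate them from the hay, the error-minimizing prediction is basically the global mean, which is close to "it's all hay".

```python
# 1% of examples are "needles" (strong binding), 99% are background "hay".
import numpy as np

rng = np.random.default_rng(2)
activity = np.where(rng.random(10_000) < 0.01,      # 1% needles
                    rng.normal(10.0, 1.0, 10_000),  # strong binding signal
                    rng.normal(0.0, 1.0, 10_000))   # background noise

# With no features to tell needles from hay, the best constant prediction
# under squared error is just the mean of everything:
print(f"model's 'prediction' for every protein: {activity.mean():.2f}")
# ~0.1, nowhere near the needles' 10.0
```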

While the Bitter Lesson philosophy can be read, for biological AI/ML modeling, as "let the computers figure it out for themselves", I think biological data is still too abstract to completely leave the computers to their own devices (pun not intended) without injecting some human knowledge. Biological data (the "input") is already too abstracted away from the conclusions we try to draw from it (the "output"), so it's not really fair to expect a computer to get from one side to the other without enough examples to learn from.

There's still a requirement, then, for having enough data, and really it's not just about having enough data but about having enough signal in the data for the computers to learn from. The real-life equivalent here: if you trained a computer to label animals in photographs but only fed it photos of dogs, cats, and birds, it wouldn't be surprising for the computer not to know what a squirrel is. Similarly with biological AI/ML: if we're not feeding in the right data, we shouldn't expect the computer to predict something it's never seen before.

We're seeing similar things now with ChatGPT, like the funny observations that it can't count how many of the letter "b" there are in the word "blueberry". ChatGPT is a model trained to predict the next words based on the previous words, and the way it "thinks" about words isn't even in words but in "tokens", so it's not surprising that the model fails when it's being used in a way it wasn't trained to be used.
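You can see the mismatch in a couple of lines of Python. The character-level count is trivial, while a tokenizer view (sketched with the tiktoken package, assuming it's installed; the exact splits depend on the tokenizer) shows the multi-character chunks the model actually sees.

```python
import tiktoken

# Counting letters is trivial when you work character by character:
word = "blueberry"
print(word.count("b"))  # 2

# But an LLM never sees characters; it sees token IDs over multi-character chunks:
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(word)
print([enc.decode([t]) for t in tokens])  # chunks of the word, not letters
```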

The idea of foundation models fits in here, in my mind, because they're meant to be stand-alone models trained on some broad swath of data, with the hope that they can then be adapted to tasks beyond whatever they were originally built around. Foundation models are quite popular in biology and chemistry lately, with everyone looking to build their own foundation model based on whatever their niche expertise happens to be, and, I suppose, hoping that the "generalizability" will come as a happy surprise after the model is built. I'm guilty of this myself, as we're building some foundation models at my company; why not, if we have the data and the expertise to try?

There's plenty of criticism of foundation models as overhyped these days, since so many people are working on them. Some of these models are doomed to fail, I think, because they rely too heavily on publicly available data, which suffers from the annotation and quality challenges I mentioned above. Others will fail because of hubris: there are some pretty big egos who think the only reason biology hasn't been "solved" is that their genius input hasn't been contributed yet. Realistically, I think there's still some fundamental housekeeping required to get to useful AI/ML for biology, namely in data processing and cleanup (again, the metadata and quality issues above), but also in how we present the data to the computer. You can't just feed raw data to a computer; you need to turn the data into something the computer understands. This is something I'm not sure we've figured out yet.
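The most basic version of "turning data into something the computer understands" for proteins is one-hot encoding the sequence; here's a minimal sketch.

```python
# One-hot encoding: turn a protein sequence into a numeric matrix.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str) -> np.ndarray:
    """Encode a sequence as a (length x 20) matrix of 0s and 1s."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, aa in enumerate(sequence):
        matrix[position, INDEX[aa]] = 1.0
    return matrix


print(one_hot("MKTAYIAK").shape)  # (8, 20): one row per residue
```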

In the end, despite all the negativity I've got going on in this ramble, breakthrough AI/ML models for biology and chemistry aren't just inevitable; arguably they've already arrived. The 2024 Nobel Prize in Chemistry went to work on AI/ML models of protein structure and design (AlphaFold and Rosetta), and I think it's unlikely to be a one-hit wonder; we'll get more interesting (if imperfect) models in the coming years.

After all, “all models are wrong” (George Box) but some are useful.

CHAPTER TEN: PROTEINS

Double post day! I've been wondering whether I should do a more "professional" content chapter instead of the general personal rambles I've been doing so far, and I think today we'll do a casual introduction to proteomics. This is going to be poorly cited, if cited at all, and neither peer reviewed nor particularly in depth; it's just my own description of how I think about proteins. Although I'm professionally a deeply trained "proteomicist", I still think of proteins in a kind of personified way. For whatever reason, that's just how my brain works. There are some great academic resources on the history of proteins and the study of proteins ("proteomics"), like those from the Human Proteome Organization (HUPO). This chapter isn't going to be that, though.

Anyone who has taken a biology class in high school has probably gone over the classical "central dogma" of biology: DNA is transcribed into RNA, which is then translated into proteins, and proteins are the molecular machines that do most of the "work" around a cell. Proteins are the motors that pull cargo from one side of the cell to the other, they're the catalysts that turn one molecule into another, and they're the control switches that turn genes on or off. Each of those "jobs" makes up a class or family of proteins: in the previous sentence, those are kinesins, enzymes, and transcription factors, respectively. Transcription factors are my personal favorite lately, given the work I do at my startup company, but all proteins are pretty cool, and it still blows my mind to watch artistic interpretations of proteins in action, or to look at the paintings by David S. Goodsell.

Measuring proteins is a lot harder than measuring DNA or RNA. While genome sequencing has become pretty mainstream, with direct-to-consumer kits to sequence yourself (23andMe) or even your dog (Embark), proteomics has had a harder time getting onto a Moore's Law trajectory. DNA is built from four nucleotide bases; proteins are built from 20 amino acids. The range of sizes and chemical properties across the 20 amino acids is much more varied than across the four bases. Between that and the combinatorics, the base components of each molecule are one thing that makes measuring proteins so hard.
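The combinatorics alone are stark; compare the number of possible 10-letter "words" in each alphabet:

```python
# Possible 10-mers of DNA (4-letter alphabet) vs protein (20-letter alphabet).
dna_10mers = 4 ** 10        # 1,048,576
peptide_10mers = 20 ** 10   # 10,240,000,000,000 -- about ten million times more
print(f"{dna_10mers:,} vs {peptide_10mers:,}")
```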

Another complication is the "dynamic range" of the molecules. Dynamic range refers to how some molecules are very common and highly abundant, while others are rare and lowly abundant. This is the "needle in a haystack" problem, where some protein molecules are very common and abundant (the hay) and others are rare (the needles); it's hard to count how many needles might be in a haystack because there's just an overwhelming amount of hay to deal with. It's the same with proteins: some proteins are just really abundant, like albumin, while others are present at very low abundance, such as hormones like testosterone or estradiol. There are tricks you can play on the metaphorical haystack to make the needles easier to find. Enrichment techniques like immunopurification use an antibody to bind the protein of interest and fish it out of the mix, basically like using a magnet to pull the metal needles from the hay; depletion techniques light the hay on fire to leave the needles behind. At the end of the day, even with sample manipulations, dynamic range is still a challenge. For a more academic breakdown, one of my PhD advisors did a theoretical analysis of why the dynamic range of proteomics makes it orders of magnitude more difficult than the other 'omes.
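To put rough numbers on that (order-of-magnitude figures from memory, not a citation):

```python
# Rough plasma abundances, order-of-magnitude only.
albumin_g_per_L = 40.0        # the hay: albumin sits around 35-50 g/L in plasma
low_hormone_g_per_L = 1e-10   # a low-abundance hormone, ~100 pg/mL territory

dynamic_range = albumin_g_per_L / low_hormone_g_per_L
print(f"~{dynamic_range:.0e}")  # ~4e+11, i.e. more than 11 orders of magnitude
```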

A third challenge is that there is no copy-paste for proteins. DNA and RNA have an amazing trick called the polymerase chain reaction (PCR), which means that scientists can take one single molecule and make as many copies of it as they want. Proteins, sadly, have no such miraculous invention; it would surely mean a Nobel Prize for anybody who figured it out. So, whatever protein you want to measure, you're not only going to have problems with the complexity of the biochemistry and the analytical dynamic range, you also can't make more if you need to try again.
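The power of the PCR trick is easy to appreciate numerically, since copies double every thermal cycle:

```python
# PCR in one line: copies double every cycle.
starting_copies = 1
cycles = 30
print(f"{starting_copies * 2 ** cycles:,}")  # 1,073,741,824 -- a billion from one molecule
```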

Nevertheless, there are indeed ways that we measure proteins. A lot of ways, really, and scientists have been doing it for a long time. This won't be an exhaustive list by any means (again, this isn't peer reviewed or anything; it's just me riffing off the top of my head and the tips of my fingers), but I'll focus here on measurement techniques that tell you protein identity or composition, rather than overall protein concentration like the Bradford, BCA, or Nanodrop assays.

There's old-school Edman degradation, where a protein is stretched out into the linear string of its component amino acids, and then each amino acid is chopped away one by one to build up the full sequence of the protein. There are antibody-based approaches, like Western blots or fluorescent assays, where you have a toolkit of antibodies that recognize partial shapes of proteins, and you can use those antibodies to "light up" different proteins; if an antibody lights up, that protein is present in the sample. There's mass spectrometry, which smashes up the protein and then reassembles the pieces, using the mass (as the name implies) of each amino acid as a puzzle piece to fit the protein sequence back together. Finally, there are new technologies like nanopore sequencing, which reads the disruption of an electrical current as each amino acid of the protein passes through a pore, identifying the amino acids and piecing the sequence together that way.
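To give a tiny taste of the mass spec puzzle: a peptide's mass is just the sum of its residue masses plus one water. The monoisotopic masses below are from memory, so double-check them against a real reference before using them for anything serious; the arithmetic is the point.

```python
# Peptide mass = sum of amino acid residue masses + one water.
RESIDUE_MASS = {  # monoisotopic residue masses in Daltons (subset of the 20)
    "D": 115.02694, "E": 129.04259, "I": 113.08406,
    "P": 97.05276, "T": 101.04768,
}
WATER = 18.01056


def peptide_mass(sequence: str) -> float:
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


print(f"{peptide_mass('PEPTIDE'):.4f} Da")  # ~799.36 Da
```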

All are pretty wildly creative ways to solve the protein measurement problem. For the record, the most common way to do DNA sequencing today, sequencing-by-synthesis (the heart of "next generation sequencing"), uses the "light up" approach: each of the four bases gets a specific color assigned to it, and bases are added one at a time, sort of the reverse of the Edman chop-one-off approach. The newest base lights up "red", a tiny microscope inside the sequencer sees the "red" and records the corresponding base, and then its fluorescent blocker is clipped off so the next base can be added. That one lights up "green", the microscope records it, and so on and so on until the entire piece of DNA has been sequenced, base by base. It's pretty smart. (The older Sanger method also uses color-labeled, chain-terminating bases, but reads them out by sorting DNA fragments by size, which is part of why it doesn't scale the same way; still, that's really the gist of it.)

Unfortunately that approach doesn't work for proteins, at least not for proteins in a complex mixture like, for example, blood. It's a math problem: sequencing with light-up bits and microscopes is inherently limited by the physics of light, something my PhD advisor details more in his manuscript. The TL;DR is that the flow cell for DNA sequencing easily fits in your hand, but based on the maximum spot density allowed by the wavelength of light, a flow cell for single-molecule protein sequencing would need to be on the order of 1 m^2.
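Here's my own back-of-envelope on that 1 m^2 figure; this is not my advisor's exact math, and both numbers below are assumptions I'm making for illustration.

```python
# Assume each single molecule needs its own diffraction-limited spot ~300 nm across.
spot_m = 300e-9
spot_area_m2 = spot_m ** 2  # ~9e-14 m^2 per molecule

# Assume you need very roughly 1e13 molecules in view at once to span plasma's
# dynamic range and still catch a few copies of the rarest proteins.
molecules = 1e13
print(f"~{molecules * spot_area_m2:.1f} m^2")  # ~0.9 m^2, about a square meter
```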

Some people are trying to get around that problem by using the enrichment/depletion tricks I mentioned before, so that the proteome is less complex and doesn't require such a big flow cell, but then you're also limited to measuring only a portion of the proteome/haystack.

Probably the "next gen proteomics" approach I'm most excited about is nanopore sequencing, where they pull the protein through a literal pore (which, hilariously, is itself made up of OTHER proteins) and use electrical currents to determine which amino acid is going through the pore. There's been a lot of work on this for DNA sequencing, but again, DNA is easier because there are only four bases and they're all chemically similar, while proteins face the first two challenges above: there are 20 amino acids, and they're highly varied in size and chemistry. That variation makes it hard to fit everything through the "same" pore; if the literal hole in the pore is too big, the electrical current won't resolve the small amino acids, but if the hole is too small, the really big amino acids won't fit through. So while nanopore sequencing is, to me, the most science-fiction/fantasy approach of them all, it's also quite a ways from being commercialized for proteins, I think.

We're stuck with mass spectrometry for now, for the most part. That's not to say mass spectrometry is perfect: it's a cool $1M to buy a mass spectrometer, so we desperately need better ways to measure proteins. I've written a ton about that in the past, both formally and informally. There are lots of startup companies trying to solve the protein measurement problem using the various approaches above, or at least trying to introduce some alternative ways to measure proteins, but unfortunately nothing is really available yet. Some are close, I think, but price is still going to be a barrier, because I'm sure they'll be expensive.

In the end, I'm excited to see what might come out in the next 2-5 years. I'm not personally or professionally banking on mass spectrometry remaining the only really viable technology for proteomics, although I'll certainly always have a soft spot for mass spec. It's cool to think about what would be possible with a cheaper, faster, and/or more sensitive protein measurement system.