CHAPTER ELEVEN: INTEGRATION

This chapter is a continuation of the last, “PROTEINS”, staying with the central dogma measurement theme. Previously I focused primarily on how to measure proteins, but don’t get me wrong, measuring the other molecules (DNA and RNA) is also valuable and important. No one molecule is going to hold all the information to crack biology, and that sentiment is generally what’s behind the hype and excitement of using “AI” for biology.

A quick aside: I don’t love the “artificial intelligence” term. Back in my day (ha) it was “machine learning”, but for some reason “AI” has replaced “ML”. And for that matter, when “AI” is used lately, it’s almost always synonymous with “large language model”/LLM, i.e. ChatGPT. LLMs are one type of AI (really, ML), and while impressive, I don’t think they’re going to be the type of ML that cracks biology.

The hype or promise of making AI really useful for biology is about improving drug discovery, or personalizing medicine, or diagnosing disease, or even just finding new fundamental knowledge and connections we haven’t made previously. That last part is where my doubts come in. We (speaking for the greater scientific community) haven’t really had much success integrating measurements of multiple molecules together. There’s a lot of work toward multi-omics, that is, collecting multiple ‘omic measurements on the same or similar samples, like doing DNA sequencing and RNA sequencing and even proteomics all on the same sample.

But I haven’t seen many multi-omics analyses that truly integrate the data together. Most analyses I see work through each individual data type or measurement, arrive at that data type’s conclusion, and then move on to the next data type. It’s more sequential ‘omics than multi-omics, in my opinion.
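To make the distinction concrete, here’s a toy sketch (in Python, with simulated numbers and no real biology) of analyzing two ‘omic layers separately versus at least naively stacking them into one joint analysis. Real integration would need to be much smarter than this, but it shows what I mean:

```python
# Toy sketch: "sequential" vs minimally "integrated" analysis of two omics layers.
# All data here is simulated; in practice rna and protein would be
# samples-by-features matrices measured on the same samples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 40
rna = rng.normal(size=(n_samples, 500))      # stand-in for transcript abundances
protein = rng.normal(size=(n_samples, 200))  # stand-in for protein abundances

# Sequential: analyze each layer on its own, get two separate embeddings,
# draw two separate conclusions.
rna_pcs = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(rna))
prot_pcs = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(protein))

# "Integrated" (the simplest possible version): scale each layer, stack the
# features side by side, and embed the samples jointly so both layers
# contribute to the same picture.
joint = np.hstack([
    StandardScaler().fit_transform(rna),
    StandardScaler().fit_transform(protein),
])
joint_pcs = PCA(n_components=5).fit_transform(joint)
print(joint_pcs.shape)  # (40, 5): one embedding per sample, informed by both layers
```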

It’s interesting, too, because there are significant efforts to use one data type to predict another. For example, using RNA sequencing data (transcriptomics) to predict which proteins will be present and at what abundance. These predictions are imperfect at best and just straight wrong at worst. Obviously, if they worked better, it would be great to use a cheaper or easier measurement to predict a more expensive or laborious one.
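As a cartoon of why these predictions fall short, here’s a simulated sketch: even when transcript and protein levels are genuinely related, a simple fit leaves a lot of variance unexplained. The numbers below are made up, not real measurements:

```python
# Toy sketch of using transcript abundance to predict protein abundance.
# Simulated data: the relationship is real but noisy, standing in for all the
# biology (translation rates, degradation, etc.) that mRNA alone doesn't capture.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
log_mrna = rng.normal(loc=5.0, scale=1.5, size=300)             # log transcript abundance
log_protein = 0.6 * log_mrna + rng.normal(scale=1.0, size=300)  # log protein, noisy relation

model = LinearRegression().fit(log_mrna.reshape(-1, 1), log_protein)
r2 = model.score(log_mrna.reshape(-1, 1), log_protein)
print(f"variance explained: {r2:.2f}")  # roughly 0.4-0.5 with these made-up numbers:
                                        # useful, but far from a substitute measurement
```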

Because we haven’t had much success with “simple” computational approaches to integrating across multiple measurement systems, I’m a bit skeptical that we’re going to be able to put together an AI that can really predict biology in general. We just don’t have a good enough understanding of the data yet, and I’m not sure how you can train a computer to make sense of something that we don’t yet understand ourselves.

The lack of “training data” is the big hurdle, I think. LLMs like ChatGPT were enabled, in part, by the huge volume of text available to train on, but there’s nowhere near the same volume of data for biology. Further, of the biological data we do have, the majority is poorly annotated, if annotated at all. In other words, there’s a data file somewhere, but we don’t really know where it came from, what it’s supposed to be measuring, or whether the quality is any good.

(Aside: this is for sure true of proteomics data measured by mass spectrometry, although perhaps some other fields are better about data annotation and labeling. The “metadata” of files, including even simple things like what organism was measured, isn’t always provided with the files. There are efforts to fix this, both prospectively, by requiring metadata files at the time of data submission for academic, peer-reviewed publications, and retroactively, by going back through archives of the most popular or highest quality data and backfilling metadata. Overall, it’s going to be tough for mass spectrometry proteomics either way, because even if we get the data files annotated with correct metadata, there’s a second problem of needing those raw data files to be processed in some way to make them more useful for machine learning, at least ML with the intent to infer or predict biology.)
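For concreteness, this is roughly the kind of minimal record I mean by “metadata”. The field names here are just mine, not any official standard, but even this much isn’t reliably attached to public raw files:

```python
# A minimal sketch of metadata that could travel with a raw proteomics file.
# These fields and names are illustrative, not an established schema.
from dataclasses import dataclass

@dataclass
class RawFileMetadata:
    file_name: str
    organism: str          # what was actually measured
    sample_type: str       # e.g., cell line, tissue, plasma
    instrument: str        # which mass spectrometer
    acquisition_mode: str  # e.g., DDA or DIA
    passed_qc: bool        # did anyone check the quality?

example = RawFileMetadata(
    file_name="run_0001.raw",
    organism="Homo sapiens",
    sample_type="cell line",
    instrument="(instrument model)",
    acquisition_mode="DIA",
    passed_qc=True,
)
```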

I might be totally wrong, though. There’s the classic essay, “The Bitter Lesson”, by Rich Sutton, which suggests that we don’t need more data, or better data, or the ability to integrate data together; instead, “the only thing that matters in the long run is the leveraging of computation”, and therefore we don’t need to worry about current limitations or challenges with multi-omics or the amount of data. With the Bitter Lesson philosophy, we should instead feed the computers what we have so far and see if they can train on that and discover something new.

So far, I’d say, at least from my own personal experience, the computers have only gotten good enough to predict the average, at least when it comes to predicting the DNA-binding activity of proteins given a certain biological context. For example, the prediction for a protein in a given context ends up being roughly the average activity across all the contexts the model has seen. This is the real-life equivalent of the computer looking at a haystack and saying “it’s all hay” even though we, as humans, know that there are a few needles in there; but because the pile is mostly hay, the computer averages the pile out.
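Here’s a toy version of that averaging behavior, with simulated numbers: if the features the model is given don’t actually carry the context-specific signal, the best it can do is predict something close to the mean for everything.

```python
# Toy illustration of "predicting the average": the real activity depends on a
# hidden context the model never sees, so its predictions collapse toward the mean.
# Everything here is simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
hidden_context = rng.integers(0, 5, size=n)                    # the thing that actually matters
activity = hidden_context * 2.0 + rng.normal(scale=0.5, size=n)

# The model only sees features that carry none of that context signal.
uninformative_features = rng.normal(size=(n, 20))

model = LinearRegression().fit(uninformative_features[:400], activity[:400])
preds = model.predict(uninformative_features[400:])

print(round(preds.std(), 2), round(activity[400:].std(), 2))
# The predictions have far less spread than the real activities: every input
# gets a value near the training-set mean ("it's all hay").
```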

While the Bitter Lesson philosophy can be read, for biological AI/ML modeling, as letting the computers figure it out for themselves, I think biological data is still too abstract to completely leave the computers to their own devices (pun not intended) without injecting some human knowledge. Biological data (the “input”) is already too far abstracted from the conclusions we try to draw from it (the “output”), so it’s not really fair to expect a computer to get from one side to the other without enough examples to learn from.

There’s still a requirement, then, for having enough data, and really not just enough data, but enough signal in the data for the computers to learn from. The real-life equivalent: if you trained a computer to label animals in photographs, but only fed it photos of dogs, cats, and birds to learn from, it wouldn’t be surprising that it doesn’t know what a squirrel is. Similarly with biological AI/ML, if we’re not feeding in the right data, we shouldn’t expect the computer to predict something it’s never seen before.
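The toy version of the squirrel problem is almost tautological in code: a classifier can only ever output labels it saw during training (simulated data below, nothing real).

```python
# A classifier's output space is fixed by its training labels: no "squirrel"
# was ever seen, so no "squirrel" can ever be predicted.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 10))                    # stand-in image features
y_train = rng.choice(["dog", "cat", "bird"], size=300)  # only three labels ever shown

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.classes_)                            # ['bird' 'cat' 'dog'], squirrel isn't an option
print(clf.predict(rng.normal(size=(1, 10))))   # a "squirrel photo" still gets one of the three
```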

In current times, we’re seeing similar things with ChatGPT, like the funny observations where ChatGPT can’t count how many of the letter “b” there are in the word “blueberry”. ChatGPT is a model trained to predict the next word based on the previous words, and the way it “thinks” about words isn’t even in words but in “tokens”, so it’s not surprising that the model doesn’t work correctly when it’s being used in a way it wasn’t trained to be used.
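You can see the mismatch directly if you look at how text gets tokenized. This assumes the tiktoken package, and the exact split varies by tokenizer and model, but the point is that the model’s native unit is a token ID, not a letter:

```python
# Why letter-counting is awkward for an LLM: the model sees token IDs, not characters.
# (Assumes the tiktoken package is installed; the split shown depends on the tokenizer.)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("blueberry")
print(tokens)                              # a short list of integer IDs
print([enc.decode([t]) for t in tokens])   # the chunks the model actually "sees"
# Counting the letter "b" is a property of the characters, which the model only
# encounters indirectly through these chunks.
```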

The idea of foundation models fits into this, in my mind, because they’re meant to be stand-alone models trained for a specific use that could then go beyond their original task to other, more general tasks. Foundation models are quite popular in biology and chemistry lately, with everyone looking to build their own based on whatever their niche expertise happens to be, and I suppose hoping that the “generalizability” will come as a happy surprise after the model is built. Guilty of this myself, as we’re building some foundation models at my company; why not, if we have the data and the expertise to try it?

There’s plenty of criticism of foundation models as overhyped these days, since so many people are working on them. Some of these models are doomed to fail, I think, because they rely too much on publicly available data, which suffers from the annotation and quality challenges I mentioned above. Others will fail because of hubris, where some pretty big egos think the only reason biology hasn’t been “solved” is that their genius input hasn’t been contributed yet. Realistically, I think there’s still some fundamental housekeeping work required to get to useful AI/ML for biology, namely in data processing and cleanup (again, like the metadata and quality issues above), but also just in how we present the data to the computer. You can’t just feed data to a computer; you need to turn the data into something the computer understands. This is something I’m not sure we’ve figured out yet.
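As one small example of what “turning the data into something the computer understands” means, here’s the classic one-hot encoding of a peptide sequence. It’s just one of many possible representations, and picking a good one is exactly the part I don’t think we’ve figured out:

```python
# One simple way to represent a peptide sequence as numbers: one-hot encoding.
# This is illustrative only; other representations (embeddings, physicochemical
# features, etc.) are often more useful in practice.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_peptide(sequence: str) -> np.ndarray:
    """Encode a peptide as a (length x 20) matrix of 0s and 1s."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for i, residue in enumerate(sequence):
        matrix[i, AMINO_ACIDS.index(residue)] = 1.0
    return matrix

print(one_hot_peptide("PEPTIDE").shape)  # (7, 20): 7 residues, 20 possible amino acids
```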

In the end, despite all the negativity I’ve got going on in this ramble, breakthrough AI/ML models for biology and chemistry aren’t just inevitable; arguably the first one has already happened, so more are just a matter of time. The 2024 Nobel Prize in Chemistry went to work on AI/ML models of protein structure (AlphaFold and Rosetta), and I think it’s unlikely that will be a one-hit wonder; we’ll get more interesting (if imperfect) models in the coming years.

After all, “all models are wrong” (George Box) but some are useful.

CHAPTER NINE: LIMINAL

The purgatory of being in between naive beginner excitement and rational, experienced mastery is basically where I live for most skills, right there in the Trough of Disillusionment on the Gartner hype cycle. It feels like there’s a ton of support for “getting started”, and a ton of support for highly niche specialization, but just not a lot that helps you through the purgatory of “intermediate”. This goes for learning a new language, or how to code, or baking, or entrepreneurship, or writing, or managing, or whatever. There’s so much out there to support the zero-to-one initialization, and then there’s deep subject matter expertise, but the “messy middle” is really hard, a seemingly endless wander through a liminal space.

I don’t really feel like I’m hitting “enlightenment” in anything, just maybe finding out how much farther down the “disillusionment” goes, but I’m also not anywhere near “inflated expectations” for any of my skills, so that seems to put me pretty solidly in the “intermediate” range. I love picking up new things – I mentioned before that I challenged myself to learn how to bake macarons, and that I wanted to learn how to code so I joined a machine learning lab – and it’s so frustrating to get the basics down and then have zero resources for getting from “intermediate” to “fluent” or “advanced”. There’s the 10,000-hour rule, which says that it takes 10,000 hours to master a skill, but that seems to emphasize the point that there are always tons of resources to help you with the first 10-100 hours, and after that it’s just supposed to be grinding until you reach near-mastery and can get into the ultra-deep niche groups, I guess.

So how do you get through “intermediate” to be considered a “master”? Even granting the 10,000-hour rule, that’s realistically more like five years of fairly dedicated training, about the length of the average American science PhD program. The first year is structured classes like high school or college, with more in-depth material, but after the first year or two it’s all unstructured research, largely self-guided with some input from your PhD advisors and thesis committee members. In the end, when you defend and get the “PhD” letters after your name, society generally recognizes you as a “master” in that subject, which is itself a hilariously sub-sub-sub-field-specific niche, a tiny drop in the vast, vast ocean of human knowledge.

In the things where I’m “intermediate”, I don’t feel like I make a lot of progress after those first 1-2 years of structured learning. Maybe because the rest all needs to be self-guided? I wish there were more structure out there for intermediate anything, if only to learn more about what I still need to learn.

I probably just need to learn to embrace the journey that is being “intermediate” and find ways to enjoy the process more.