CHAPTER ELEVEN: INTEGRATION

This chapter is a continuation of the last, “PROTEINS”, staying with the central dogma measurement theme. Previously I focused primarily on how to measure proteins, but don’t get me wrong, measuring the other molecules (DNA and RNA) is also valuable and important. No one molecule is going to hold all the information to crack biology, and that sentiment is generally what’s behind the hype and excitement of using “AI” for biology.

A quick aside: I don’t love the “artificial intelligence” term. Back in my day (ha) it was “machine learning”, but for some reason “AI” has replaced “ML”. And for that matter, when “AI” is used lately, it’s almost always synonymous with “large language model” (LLM), i.e. ChatGPT. LLMs are one type of AI (ML), and while impressive, I don’t think they’re going to be the type of ML that cracks biology.

The hype, or promise, of making AI really useful for biology is to improve drug discovery, personalize medicine, diagnose disease, or even just find fundamental knowledge and connections we haven’t made previously. That last part is where my doubts come in. We (speaking for the greater scientific community) haven’t really had much success integrating measurements of multiple molecule types together. There’s a lot of work towards multi-omics, that is, collecting multiple ’omic measurements on the same or similar samples, like doing both DNA sequencing and RNA sequencing and even proteomics all on the same sample.

But I haven’t seen many multi-omics analyses that truly integrate the data together. Most analyses I see analyze each individual data type or measurement, arrive at that data type’s conclusion, and then move on to the next data type. It’s more sequential ’omics than multi-omics, in my opinion.

It’s interesting, too, because there are significant efforts to use one data type to predict another. For example, using RNA sequencing data (transcriptomics) to predict which proteins will be present and at what abundance. These predictions are imperfect at best, and just straight wrong at worst. Obviously, if they worked better, it would be great to use a cheaper or easier measurement to predict a more expensive or laborious one.

Because we haven’t had much success with “simple” computational approaches to integrating across multiple measurement systems, I’m a bit skeptical that we’re going to be able to put together an AI that really cracks predicting biology in general. We just don’t have a good enough understanding of the data yet, so I’m not sure how you can train a computer to make sense of something that we don’t yet understand ourselves.

The lack of “training data” is the big hurdle, I think. LLMs like ChatGPT were enabled, in part, by the huge volume of text available to train on, but there’s nowhere near the same volume of data for biology. Further, of the biological data we do have, the majority is annotated poorly, if at all. In other words, there’s a data file somewhere, but we don’t really know where it came from, what it’s supposed to be measuring, or whether the quality is any good.

(Aside, this is for sure true of proteomics data measured by mass spectrometry, although perhaps some other fields are better about data annotation and labeling. The “metadata” of files, including even simple things like what organism was measured, isn’t always provided with the files. There are efforts to fix this, both by prospectively requiring metadata files at the time of data submission for academic, peer-reviewed publications, and by retroactively going back through archives of the most popular or highest-quality data and backfilling metadata. Either way, it’s going to be tough for mass spectrometry proteomics, because even if we get the data files annotated with correct metadata, there’s a second problem: those raw data files need to be processed in some way to make them more useful for machine learning, at least ML with the intent to infer or predict biology.)

I might be totally wrong though. There’s the classic essay, “The Bitter Lesson”, by Rich Sutton, which suggests that we don’t need more data, better data, or the ability to integrate data together; instead, the essay argues that “the only thing that matters in the long run is the leveraging of computation”, and therefore we shouldn’t worry about current limitations or challenges with multi-omics or data volume. With the Bitter Lesson philosophy, we should instead feed the computers what we have so far and see if they can train on that and discover something new.

So far, I’d say, at least from my own personal experience, the computers have only gotten good enough to predict the average, at least when it comes to predicting the DNA-binding activity of proteins in a given biological context. For example, the model predicts roughly the same, averaged DNA-binding activity for a protein no matter the context, because that average is what it has seen across all the contexts in its training data. This is the real-life equivalent of the computer looking at a haystack and saying “it’s all hay” even though we, as humans, know that there are a few needles in there; but because the pile is mostly hay, the computer averages the pile out.
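There’s a simple statistical reason the haystack gets averaged out, and it’s worth making explicit: if a model can’t (or doesn’t) use the context, the single constant prediction that minimizes mean squared error is just the mean of the training targets. The activity numbers below are made up purely for illustration.

```python
# Toy illustration (made-up numbers) of "predicting the average": for a model
# that ignores context, the constant guess minimizing mean squared error is
# the mean of the training targets -- so the needles get averaged away.
from statistics import mean

# Hypothetical DNA-binding activities: mostly "hay" (near 0), a few "needles".
activities = [0.1, 0.0, 0.2, 0.1, 0.0, 9.0, 0.1, 8.5]

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return mean((prediction - t) ** 2 for t in targets)

avg = mean(activities)
print(f"average activity = {avg:.2f}")  # describes no individual example well

# The mean beats any other constant guess...
assert mse(avg, activities) < mse(0.0, activities)  # better than "all hay"
assert mse(avg, activities) < mse(9.0, activities)  # better than "all needles"
# ...even though it matches neither the hay nor the needles.
```

The loss function rewards exactly the “it’s all hay” answer; telling hay from needles requires features (context) that actually separate them, which is the part we haven’t given the models yet.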

While the Bitter Lesson philosophy can be applied to biological AI/ML modeling as “let the computers figure it out for themselves”, I think biological data is still too abstract to completely leave the computers to their own devices (pun not intended) without injecting some human knowledge. Biological data (the “input”) is already too abstracted away from the conclusions we try to draw from it (the “output”), so it’s not really fair to expect a computer to get from one side to the other without enough examples to learn from.

There’s still a requirement, then, for having enough data, and really it’s not just having enough data, but having enough signal in the data for the computers to learn from. The real-life equivalent here is training a computer to label animals in photographs but only feeding it photos of dogs, cats, and birds to learn from; it wouldn’t be surprising if the computer didn’t know what a squirrel is. Similarly with biological AI/ML, if we’re not feeding in the right data, we shouldn’t expect the computer to predict something it’s never seen before.

In current times, we’re seeing similar things with ChatGPT, like the funny observations that ChatGPT can’t count how many letter “b”s there are in the word “blueberry”. ChatGPT is a model trained to predict the next words based on the previous words, and the way it “thinks” about words isn’t even in words but in “tokens”, so it’s not surprising that the model doesn’t work correctly when it’s being used in a way it wasn’t trained to be used.
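The token point is easier to see with a toy example. The vocabulary, token IDs, and the greedy longest-match splitting below are all hypothetical — real tokenizers (e.g. BPE) learn their own splits — but the underlying issue is the same: once a word becomes a short list of token IDs, the character-level question is invisible at the level the model operates on.

```python
# Toy illustration of why a token-based model can miscount letters: the model
# "sees" token IDs, not characters. Vocabulary and IDs are made up.
toy_vocab = {"blue": 17, "berry": 42}  # hypothetical token IDs

def toy_tokenize(word):
    """Greedy longest-match tokenization against the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if word.startswith(piece, i):
                tokens.append(toy_vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

print(toy_tokenize("blueberry"))  # [17, 42] -- no 'b' in sight
# From [17, 42] alone, nothing says how many 'b's hide inside each token,
# while at the character level the question is trivial:
print("blueberry".count("b"))  # 2
```

A model trained on sequences of IDs like `[17, 42]` has to memorize letter counts per token rather than just reading them off, which is why letter-counting is an unnatural task for it.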

The idea of foundation models fits into this, in my mind, because they’re meant to be stand-alone models trained for a specific use that could go beyond their original task to other, more general tasks. Foundation models are quite popular in biology and chemistry lately, with everyone looking to build their own foundation model based on whatever their niche expertise happens to be, and I suppose hoping that the “generalizability” will come as a happy surprise after the model is built. I’m guilty of this myself, as we’re building some foundation models at my company; why not, if we have the data and the expertise to try it?

There’s plenty of criticism of foundation models as overhyped these days, since so many people are working on them. Some of these models are doomed to fail, I think, because they rely too much on publicly available data, which suffers from the annotation and quality challenges I mentioned above. Others will fail because of hubris: there are some pretty big egos who think that the only reason biology hasn’t been “solved” is that their genius input hasn’t been contributed yet. Realistically, I think there’s still some fundamental housekeeping work required to get to useful AI/ML for biology, namely in data processing and cleanup (again, the metadata and quality issues above), but also just in how we present the data to the computer. You can’t just feed data to a computer; you need to turn the data into something the computer understands. This is something I’m not sure we’ve figured out yet.
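As one small example of what “turning data into something the computer understands” means in practice: models don’t take letters like A, C, G, T, so a DNA sequence has to be encoded as numbers first. One-hot encoding, sketched below, is one common choice of representation — the point isn’t that it’s the right one, but that some explicit encoding step always sits between the raw data and the model, and that choice is part of the housekeeping work.

```python
# Minimal sketch of an encoding step: one-hot encoding a DNA sequence, i.e.
# each base becomes a 4-element 0/1 vector (one position per possible base).
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 vectors, one per base."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

for base, vec in zip("GATC", one_hot("GATC")):
    print(base, vec)
# G [0, 0, 1, 0]
# A [1, 0, 0, 0]
# T [0, 0, 0, 1]
# C [0, 1, 0, 0]
```

Deciding how to do this for a raw mass spectrometry file — what the “bases” even are, and what gets thrown away in the encoding — is exactly the unsolved part I mean.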

In the end, despite all the negativity I’ve got going on in this ramble, breakthrough AI/ML models for biology and chemistry aren’t just inevitable; arguably the first has already happened, so the rest are just a matter of time. The 2024 Nobel Prize in Chemistry went to work on AI/ML models of protein structure (AlphaFold and Rosetta), and I think it’s unlikely that will be a one-hit wonder; we’ll get more interesting (if imperfect) models in the coming years.

After all, “all models are wrong” (George Box) but some are useful.