First in Human Episode #32 featuring Nicolas Tilmans

For episode 32, we chat with Nicolas Tilmans, Founder & CEO at Anagenex. Stay tuned to learn how Anagenex is combining the power of machine learning and biochemistry to revolutionize the drug discovery process. First In Human is a biotech-focused podcast that interviews industry leaders and investors to learn about their journey to in-human clinical trials. Presented by Vial, a tech-enabled CRO, hosted by Simon Burns, CEO & Co-Founder, and guest host Andrew Brackin, Co-Founder. Episodes launch weekly on Tuesdays.

Simon Burns: [00:00:00] Nicolas, thank you for joining us on First In Human.

Nicolas Tilmans: Thank you so much for the opportunity.

Simon Burns: You come from an interesting mix of both bio and computer science. You’re one of this new breed of tech bio founders. Tell us about your background. How did you get into the space?

Nicolas Tilmans: I’ve been interested in genetics from a very young age. At the time, I thought that was genetics, but it turns out that was biochemistry. I also loved computers. I taught myself how to program in high school in C and C++. When I was looking for colleges, I thought, maybe I should do this biochemistry thing.

I was interning at the NIH. The post-doc I was interning with said, “Hey, you should really look at this bioinformatics stuff that’s coming online, and maybe you want to do that.” And I thought, “All right, I’ll just add computer science. How hard can it be as a major?” I did both biochemistry and computer science as an undergrad, then thought, “You can’t really answer questions as a bioinformatician. You get a lot of questions to answer, a lot of ideas, but you don’t actually know what the answer is until you do an experiment.”

So, I went to grad school for biochemistry at Stanford. I ended up getting drawn back into a technology approach. I love the idea of finding and building new tools. I worked as a graduate student with Pehr Harbury on a flavor of DNA-encoded libraries that eventually got spun out into DICE. I was at the bench doing a lot of chemistry and molecular biology in this complex DNA-encoded library scheme.

After that, I decided I didn’t want to be pipetting anymore. (laughs) I went back to computers as a data scientist. I worked at a couple of different places. I eventually ended up running a data and machine learning engineering team at a small company called Lumiata, which dealt mostly with patient data, lots of insurance and EHR data. I was trying to harmonize and build models on that before starting this company in late 2019, early 2020.

I’ve gone back and forth between the bench and the keyboard. My dream is to see projects where the computer can create new experiments that can generate novel computational solutions, so that there’s this virtuous cycle between the two. Anagenex is one of those places in particular, because the kinds of data sets that we generate allow us to ask computational questions and then to design experiments that couldn’t be done otherwise. So that’s the grand vision, how I got here, and why I do it.

Simon Burns: Let’s drill into Anagenex. Your focus is taking a machine learning approach to a lot of the work being done in the DEL, DNA-encoded library, space. Give us a sense of what opportunities there are for novel computationally driven approaches in advancing therapeutics.

Nicolas Tilmans: While we’re using DNA-encoded libraries a great deal, the goal of Anagenex isn’t to do machine learning on DNA-encoded libraries. The goal of Anagenex is machine learning for, in particular, discovering novel chemical matter in drug discovery. We’ll start at the higher level of what the real problem is that we think is important to solve in drug discovery. There are so many of them. But one of them is that if you have a new target that you think is interesting, it takes thousands, sometimes even tens of thousands, of compounds, and two, sometimes even ten, years to even get to a point where you get to test that idea, that hypothesis, with a molecule in a living organism. Finding new chemical matter that will be useful for a new target or a new biological hypothesis is a major challenge in drug discovery.

Machine learning has made a lot of promises on how it can help that process. It has failed to deliver as a general rule. It’s just very hard to get machine learning to help with that. Machine learning works in places where you have tons of data. Everybody’s looking at things like ChatGPT, Stability AI, Stable Diffusion, Midjourney. All of these beautiful models that are all over Twitter today are working because they have extraordinarily large amounts of data that power those architectures, and that data simply does not exist in chemistry.

If you look at the way a high-throughput screen or traditional screening process works, if you are at the biggest pharmaceutical companies you’re going to have access to low single-digit millions of compounds. You’re going to do a rough screen with that. You’re going to have some set of data from that. Then, you’ll probably do focused screens in the high tens, low hundreds of thousands of compounds. Then, another follow-on screen. So, essentially, by the time you’re done, you’re going to have maybe somewhere between tens of thousands and hundreds of thousands of high-quality data points to train your model. That’s nowhere near the kinds of data sets that you would want to go into true machine learning.

So why we use DELs is not because we want to use DELs with ML, it’s because we think it’s the only place to get billion-data-point-scale data sets to be able to power these novel architectures. We use a variety of tools, not just DELs, but also affinity selection mass spectrometry and a lot of other approaches to be able to create the data sets to train the models. That’s step one: can you create a big data set and train a model off of that, one that has some predictive power?

We think that the next thing in machine learning that’s underappreciated is how do you make the model better? How do you improve it after generation one? The only guarantee you have in machine learning is that your first model is probably not your best model. So how do you get it to be better?

That’s where our unique ability [00:05:00] to build libraries really fast allows us to say, “Hey, machine learning, don’t just tell me what the next hundred compounds or so are that you can buy off Enamine or whatever. Tell me what the next one million, ten million compounds we could build are, then let’s go out and build those, test them again, and now we can reinforce the model and improve its abilities to predict things.” Eventually, you’d have a model that’s good enough to be able to generate new chemical matter on its own, or to be a good arbiter of what’s a good molecule or not, so you can test more innovative and different structures than you might have otherwise.

Simon Burns: You talked about the scale of the data. Give us more color there. How much data are we talking? How do you manage that scale of data? Your data engineering process or just your data stack?

Nicolas Tilmans: At this point, the total data set within the company is in the hundreds of billions of data points. Every time we run an experiment, we’ll have something like two billion data points for a single condition. We’re usually running at least three or four, often 10 different conditions. Depending on the experiment, each experiment can be on the order of 20 billion new measurements of a compound against some response: does it actually interact with the protein?

The way we do this now is to use a lot of, obviously, AWS. In terms of the data engineering stack, you’ve got several different data sets that feed it. You have the lab work, so the smaller-scale work: assays, how you built the library, all of that stuff. Everything is stored in Benchling as our record for experimental data. Anything that involves a DNA-encoded library at the scale of those tens of billions of data points, though, is way too large to fit in something like an ELN.

So that’s stored in S3, in AWS, and we interface with it using a variety of database solutions and Spark. It’s all the same kinds of processes that you would see in a tech company. We’re using large, flat files and various tools that are pretty standard now to interface with that and make it look like a database.

All of this is hosted in a Kubernetes cluster that we run ourselves. For orchestration, I believe at this point we’re mostly on Airflow 2. The machine learning experiments used to be tracked in MLflow. I think we’ve been experimenting with a different tool to track ML experiments. It’s very important to be able to go back and say, you did this experiment, you had this result. How does it compare to your newest result, your newest model, your newest idea?

That’s the high-level overview of the tools and the data architecture we use. In short, we use the same kinds of data architectures that you would find at a more traditional tech company that has nothing to do with biotech.
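
As a rough sanity check on the data volumes quoted in this answer, the per-experiment arithmetic can be sketched in a few lines of Python. This is only an illustration: the constants are the approximate figures given in the interview, and the names (`POINTS_PER_CONDITION`, `points_per_experiment`) are hypothetical, not anything Anagenex is known to use.

```python
# Approximate figures quoted in the interview (order-of-magnitude only).
POINTS_PER_CONDITION = 2_000_000_000  # "something like two billion data points"
CONDITIONS_LOW = 3                    # "at least three or four"
CONDITIONS_HIGH = 10                  # "often 10 different conditions"


def points_per_experiment(conditions: int) -> int:
    """Total measurements in one experiment run under `conditions` conditions."""
    return POINTS_PER_CONDITION * conditions


low = points_per_experiment(CONDITIONS_LOW)
high = points_per_experiment(CONDITIONS_HIGH)
print(f"{low:,} to {high:,} measurements per experiment")
# prints "6,000,000,000 to 20,000,000,000 measurements per experiment"
```

At the upper end this matches the "on the order of 20 billion" figure, and a handful of such experiments is enough to reach a company-wide total in the hundreds of billions.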

Simon Burns: Give us a sense of some of the challenges in building the company; it can be data or non-data. What was maybe most surprising in terms of some of the challenges, if you think back to your earlier self?

Nicolas Tilmans: The hardest thing for any tech bio company, in particular a tech bio company that combines computation and lab, is that interface. We have a huge number of technical advantages that have been challenging to build. How do you build a DNA-encoded library at the million-compound scale in about two weeks, which is faster than anybody in the industry? We’ve solved that. How do you train on these data sets scalably? A lot of companies have struggled there; we’ve solved that. There are a lot of technical challenges we’ve solved that have given us an edge. The major advantage we have is our culture and our ability to mix the computer team and the lab team and have them communicate well together. We specifically hire for it.

Some of the questions that I ask are: teach me something about one of your experiments, teach me something about one of your machine learning tasks. We screen for the ability to communicate and interface across groups. That is one of the central cultural challenges. We work on it every day. I think we’re the best in the world at it. I still think there’s a lot of room for improvement.

There are two very different mindsets. A lab person gets, if they’re lucky, ballpark one experiment run a day. They’ll say, “Hey, I tested these compounds and at the end of the day it worked or it didn’t.” A computational person is compiling or running their code dozens of times an hour. They’re getting data points all the time.

The cost of executing one versus the other, the amount of thought process that is required designing a great lab experiment and designing a quick computational experiment, it’s just a very different set of thought processes. How do you merge those two worldviews? It’s something we work on every day. There’s still more to do.

I guess one of the next questions is how do we do that? I mentioned hiring. I didn’t mention this, but we are a partially distributed company where the lab is all in one place. It’s very hard to virtualize a mass spec, it turns out. Compute can be done anywhere. We bring everybody into the office twice a year to have focused discussions. We’ll have a set of conversations, which are group discussions. That’s usually in the afternoon, and the morning is more freeform, where people can get demonstrations in the lab. People interface more ad hoc.

We designed these team onsites to create more of that cohesion. We require any remote person to be in the lab one week per quarter. That actually induces somewhat of a cost on us as an organization. It is increased burn in some ways, but you’d be surprised it’s not that much. In the end, the benefit is that you can [00:10:00] hire the best talent anywhere. You’re not really restricted in where and how you hire talent. We’ve found that to be a pretty significant advantage at a number of positions. So, bringing everybody together, hiring for the right people, having clear communications day to day, those are the ways we solve the problem.

Simon Burns: Remind me, you guys found, seemingly, the Goldilocks spot: near Boston, but cost effective and scalable. Is it Waltham?

Nicolas Tilmans: We were in Woburn, which was very cost effective. We recently moved to Lexington, which is somewhat more expensive, but still pretty inexpensive relative to downtown Boston. It is a nice Goldilocks because we now have access to public transit in a way that we didn’t in Woburn. So that’s been a boost for us.

Simon Burns: Take us through the transition from early-stage discovery to approaching the clinic. What changed internally about the company culture as you do that? Give us a sense of what we should be looking forward to as you guys move into the clinic.

Nicolas Tilmans: The biggest hire we’ve made recently was our CSO, Ryan Krueger, a pretty experienced drug hunter who was formerly VP of Biology at Foghorn Therapeutics. What’s changing since we brought him on is raising the level of understanding and rigor around the drug discovery process. What are the experiments you need to run to turn a promising compound that may have some flaws into an honest-to-God drug that could go into a person? And, well, frankly, you’re hiring a lot of people. We’re hiring different people on assays. We’re hiring different people on medicinal chemistry. Those are the two places where you have to invest a lot of resources.

Eventually, we’re going to have to build out or hire a head of DMPK. So that’s another person that’s going to come in, probably in the next year or two. You start focusing more on what it takes to polish the molecules. That requires just building more compounds, testing more compounds. We are amazing at finding early hit matter and converting those hits into early leads. That’s where the machine learning and our parallel approach really shine right now.

Over time, as the models get better, we’ll be able to push where they contribute further and further down the drug discovery process, later into lead optimization. But the truth of the matter is that at some point you need a med chemist to build compounds that are a little bit more bespoke, that are maybe not quite as off the shelf as you would like, where the building blocks aren’t available, or where we’re going to have to do some sophisticated reaction. I do not see a place where that’s going to be replaced.

You need to eventually hire a bunch of med chemists. The goal for us isn’t so much to say we’re going to eliminate the medicinal chemists. That’s a red herring that gets hung around machine learning for drug discovery a little bit unfairly. But to be fair, a lot of hype has made that claim. The reality is that what machine learning for drug discovery will do is empower that med chemist to test more ideas, to be smarter about the experiments they run, and to be more effective.

The analogy I would use is, today we see the impacts of ChatGPT and other large language models. We’re starting to see those impacts especially in places like programming, where software developers find themselves much more efficient. Yes, they have to do a lot of code themselves still, but some of the boilerplate, some of the easier stuff, the quick ideas can be offloaded a little bit to ChatGPT. Then, all you’re doing is a bit of an editing job of saying, “Okay, so there’s bugs in this code,” or “I’d redo it this way.” But you’re much more efficient because you have that machine learning tool next to you.

I think that’s the future of machine learning in drug discovery. Our goal when we do our machine learning is how do we develop something that enables that for the medicinal chemist. We’re still going to have to hire a bunch of them. We’re still going to have to be able to test those compounds. That’s not going to change, but hopefully we’ll be able to get two or three programs, maybe 10 programs, out of the same effort as it would usually take to do one.

Simon Burns: On that topic, let’s zoom out five, 10 years. What impact do you think a lot of these approaches, yours and the whole field of machine learning and computational approaches for drug discovery, will have had? What does the field look like once these things are fully baked and operationalized?

Nicolas Tilmans: Everyone will do it in some ways the way we do. There are going to be other things that get bolted on, but the broad flow is: can we generate a large-scale, if a little noisy, data set up front, train a model, and get to a much better search space very fast? That will be a matter of course throughout drug discovery. What will then be happening, something like five, 10 years from now, is we’re going to evolve a little bit into a place where we’ve solved hit to lead. Where do we go from here with machine learning? Because now you’re inevitably in much smaller dataset land. Those iterations necessarily are smaller because a human has to make them. So how do you blend those two data sets? How do you bring in a lot more information around ADME-Tox, ADME-PK, which are the things that can become very challenging to optimize?

That’s where the field is going to be five years from now. Ten years from now, every med chemist will have a supercharged set of machine learning tools around them that will be predicting a compound that worked well in the last assay. They’ll have windows next to them with a hundred different ideas, maybe even a thousand different ideas, of what to make next. [00:15:00] They’ll use an interface to filter those out and visualize that very quickly. Another window will help them learn the best way to synthesize it. All of that will be super integrated into a set of easily usable tools for the med chemists.

Underappreciated in all of this, and I’ll admit we haven’t done a ton of work here ourselves yet, is something related to the culture problem of mixing compute and lab that I talked about. It also materializes in how the med chemist uses your machine learning. There’s a user interface question: How do you get buy-in from the med chemists? There’s some maligning of these characters in the computational world: you know, the crusty med chemists, they don’t want to do new things. They don’t want to try new tools. They get very stuck in their old ways.

The better way of thinking about that is people who have been building drugs have been doing this often for 20, 30 years. In that time, they’ve seen mostly failure, right? Even if they had nothing to do with computation or whatever. Most projects in drug discovery fail. Particularly, if you’re going after novel compounds, novel targets.

That means that they’re right when they say this’ll never work. They’re probably right. Most of the time they’re going to be correct. You have to find a way to say, okay, I know you’re going to be right a lot, but it’s still worth trying because I’m going to help you be better and we’re going to have more chances of success.

I don’t think it’s productive to go into these kinds of conversations saying, “Hey, we’re going to be so much better than this. These people are a little bit old school, et cetera.” We have to, as an industry and as these interlopers, if you will, find a way to communicate and say, “We’re going to be your partner. We’re going to bring you along and make it easy for you to use our tools and to understand how they help you.” That’s a cultural gap that we also need to bridge.

Simon Burns: The two of us are meeting a lot of people coming from tech trying to get into this landscape. What advice do you have for them on better understanding everything when getting involved? And also on not contributing to the divide between the lab and the technologists?

Nicolas Tilmans: I find myself occasionally suffering from this, even having had experience at both the bench and the keyboard. It all comes down to the fact that people who come from compute are used to data having a particular shape and a particular reliability. Let’s take a little bit of a stereotype. You come out of Facebook and want to change some aspect of the UI. You are going to have a huge amount of infrastructure behind you that allows you to spin up an A/B test, maybe even several A/B tests. You could spin up all that tooling and infrastructure in like a week.

And then you’re going to have your information back with quite literally hundreds of thousands of data points, beautiful plots everywhere in the next two weeks. So in a month, you get this massive data set that’s beautifully explained and that is actually fairly internally reliable.

The benefit for you there, underlying all that, is you have a very high degree of control over every aspect of your experiment. If you show a thing on a screen, the JavaScript displays exactly that button the way you thought it would. The person will click or not. It’s a binary variable. Yes/No. There are a lot of things that are very clear.

Almost nothing in biology has anywhere near that kind of control. Even with the simplest assays there’s 10 different elements to your buffer. Maybe this one has a little too much salt, and now your enzyme’s not working the way it should. Or, maybe you need to control this other variable. And, the concentration of the enzyme is a little off, or the tubes are a little sticky, and so you thought you were adding this much enzyme, but in reality you were adding a tenth that much enzyme.

There’s a thousand different ways in which experiments go wrong that are not obvious at the protocol level. In tech, you describe your experimental protocol, it’ll be executed exactly that way, and you have high confidence it happened. In biotech, basically, no. So understanding how noisy biological data is, how unreliable it is, that’s the biggest thing tech people need to understand. It’s why this whole machine learning thing is so much more challenging in that context. You need to really think about noise in a different way than anybody coming from tech is used to.

The second part is, and I struggle here because I don’t have a solution for this just yet, the intuition for what a biological system actually is. I think about this as a biochemist, so I’ve got a bunch of proteins in the cell. They’re interacting with one another. They’re sticking to one another or not. They’re all a little bit loosey-goosey. They’re very flexible. It’s about these mixtures, and having an intuition for saying: what’s an equilibrium? What’s a chemical structure in real life? A chemical structure is not just an image on a screen of all these things connected. It’s electrons that are flowing around. Some of them are in different positions. Getting that intuition for what the entities in your machine learning problem are is very challenging, and I wish there was a good way of teaching it. Sort of a Biochem, Molecular Biology 101 course for people coming in from tech would be interesting to me.

Simon Burns: I would pay for it if you wanted to make one. I would be customer number one. (both laugh)

Nicolas Tilmans: That’s something the community needs to work on. I actually would be interested in participating, trying to think about how to do that. It’s a crazy text. This is a bit much to ask somebody to read straight through, [00:20:00] but probably the best single textbook I’ve seen that covers everything is a textbook called Molecular Biology of the Cell. It’s this giant tome, if you get through…

Simon Burns: …It’s a red book, no?

Nicolas Tilmans: I have a gray version and then there was a blue version. I don’t know which edition they’re on now. It’s the bible of cell biology. Bruce Alberts is the head author. That goes through enough cell biology and biochemistry for most people to get a sense. It’s just a lot of material. So how do you condense that into maybe a couple hundred pages to make it simpler to understand?

Simon Burns: With that, I really enjoyed the conversation. Thank you so much for joining.

Nicolas Tilmans: Thank you so much, Simon. You have a good rest of your day. I hope that was helpful.